MongoDB GridFS Large File Storage and Management: Advanced Distributed File Systems and Binary Data Operations for Enterprise Applications

Modern applications require sophisticated file storage that can handle large binary files, multimedia content, and document management while providing distributed access, version control, and efficient streaming. Traditional file system approaches struggle with scalability, metadata management, and integration with database operations. The result is a complex architecture of separate storage systems, synchronization challenges, and operational overhead that complicates application development and deployment.

MongoDB GridFS provides comprehensive large file storage through distributed binary data management, efficient chunk-based storage, integrated metadata handling, and streaming capabilities, keeping file operations inside the same database that holds the rest of your application data. Unlike traditional file systems that require separate storage infrastructure and complex synchronization, GridFS stores files directly in MongoDB collections with automatic chunking, replica set distribution, and consistent metadata management.
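
Before diving into the comparison below, a minimal sketch of the core GridFS round trip with the official Node.js driver helps ground the model; the connection string, database name, and file paths are placeholders rather than values from this article.

// Minimal GridFS round trip with the official Node.js driver (placeholder connection and paths)
const { MongoClient, GridFSBucket } = require('mongodb');
const { createReadStream, createWriteStream } = require('fs');
const { pipeline } = require('stream/promises');

async function gridFSRoundTrip() {
  const client = new MongoClient('mongodb://localhost:27017'); // assumed local deployment
  await client.connect();

  try {
    const bucket = new GridFSBucket(client.db('demo'), { bucketName: 'fs' });

    // Upload: the driver splits the incoming stream into 255KB chunks automatically
    const uploadStream = bucket.openUploadStream('report.pdf', {
      metadata: { contentType: 'application/pdf' }
    });
    await pipeline(createReadStream('./report.pdf'), uploadStream);

    // Download: chunks are read back in order and reassembled as a stream
    await pipeline(
      bucket.openDownloadStream(uploadStream.id),
      createWriteStream('./report-copy.pdf')
    );
  } finally {
    await client.close();
  }
}

The driver handles chunking on upload and reassembly on download; the manager class later in this article layers metadata indexing, deduplication, progress tracking, and lifecycle management on top of these same primitives.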

The Traditional Large File Storage Challenge

Conventional approaches to large file storage in application architectures face significant limitations:

-- Traditional file storage management - complex infrastructure with limited integration capabilities

-- Basic file metadata tracking table with minimal functionality
CREATE TABLE file_metadata (
    file_id SERIAL PRIMARY KEY,
    file_name VARCHAR(255) NOT NULL,
    file_path TEXT NOT NULL,
    file_type VARCHAR(100),
    mime_type VARCHAR(100),

    -- Basic file information (limited metadata)
    file_size_bytes BIGINT,
    created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    modified_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_by VARCHAR(100),

    -- Storage location tracking (manual management)
    storage_location VARCHAR(200),
    storage_server VARCHAR(100),
    storage_partition VARCHAR(50),

    -- Basic versioning (very limited)
    version_number INTEGER DEFAULT 1,
    is_current_version BOOLEAN DEFAULT true,
    parent_file_id INTEGER REFERENCES file_metadata(file_id),

    -- Access control (basic)
    access_permissions VARCHAR(50) DEFAULT 'private',
    owner_user_id INTEGER,

    -- File status
    file_status VARCHAR(50) DEFAULT 'active',
    checksum VARCHAR(64),

    -- Backup and replication tracking
    backup_status VARCHAR(50) DEFAULT 'pending',
    last_backup_time TIMESTAMP,
    replication_status VARCHAR(50) DEFAULT 'single'
);

-- File chunk storage simulation (very basic)
CREATE TABLE file_chunks (
    chunk_id SERIAL PRIMARY KEY,
    file_id INTEGER REFERENCES file_metadata(file_id),
    chunk_number INTEGER NOT NULL,
    chunk_size_bytes INTEGER NOT NULL,

    -- Chunk storage (can't actually store binary data efficiently)
    chunk_data TEXT, -- Base64 encoded - very inefficient
    chunk_checksum VARCHAR(64),

    -- Storage tracking
    storage_location VARCHAR(200),
    compression_applied BOOLEAN DEFAULT false,
    compression_ratio DECIMAL(5,2),

    UNIQUE(file_id, chunk_number)
);

-- Manual file upload processing function (very limited functionality)
CREATE OR REPLACE FUNCTION process_file_upload(
    file_name_param VARCHAR(255),
    file_path_param TEXT,
    file_size_param BIGINT,
    chunk_size_param INTEGER DEFAULT 1048576 -- 1MB chunks
) RETURNS TABLE (
    upload_success BOOLEAN,
    file_id INTEGER,
    total_chunks INTEGER,
    processing_time_seconds INTEGER,
    error_message TEXT
) AS $$
DECLARE
    new_file_id INTEGER;
    total_chunks_count INTEGER;
    chunk_counter INTEGER := 1;
    processing_start TIMESTAMP;
    processing_end TIMESTAMP;
    upload_error TEXT := '';
    upload_result BOOLEAN := true;
    simulated_chunk_data TEXT;
BEGIN
    processing_start := clock_timestamp();

    BEGIN
        -- Calculate total chunks needed
        total_chunks_count := CEILING(file_size_param::DECIMAL / chunk_size_param);

        -- Create file metadata record
        INSERT INTO file_metadata (
            file_name, file_path, file_size_bytes, 
            storage_location, checksum
        )
        VALUES (
            file_name_param, file_path_param, file_size_param,
            '/storage/files/' || EXTRACT(YEAR FROM CURRENT_DATE) || '/' || 
            EXTRACT(MONTH FROM CURRENT_DATE) || '/',
            MD5(file_name_param || file_size_param::TEXT) -- Basic checksum
        )
        RETURNING file_metadata.file_id INTO new_file_id;

        -- Simulate chunk processing (very basic)
        WHILE chunk_counter <= total_chunks_count LOOP
            -- Calculate chunk size for this chunk
            DECLARE
                current_chunk_size INTEGER;
            BEGIN
                IF chunk_counter = total_chunks_count THEN
                    current_chunk_size := file_size_param - ((chunk_counter - 1) * chunk_size_param);
                ELSE
                    current_chunk_size := chunk_size_param;
                END IF;

                -- Simulate chunk data (can't actually handle binary data efficiently)
                simulated_chunk_data := 'chunk_' || chunk_counter || '_data_placeholder';

                -- Insert chunk record
                INSERT INTO file_chunks (
                    file_id, chunk_number, chunk_size_bytes, 
                    chunk_data, chunk_checksum, storage_location
                )
                VALUES (
                    new_file_id, chunk_counter, current_chunk_size,
                    simulated_chunk_data,
                    MD5(simulated_chunk_data),
                    '/storage/chunks/' || new_file_id || '/' || chunk_counter
                );

                chunk_counter := chunk_counter + 1;

                -- Simulate processing time
                PERFORM pg_sleep(0.01);
            END;
        END LOOP;

        -- Update file status
        UPDATE file_metadata 
        SET file_status = 'available',
            modified_date = clock_timestamp()
        WHERE file_id = new_file_id;

    EXCEPTION WHEN OTHERS THEN
        upload_result := false;
        upload_error := SQLERRM;

        -- Cleanup on failure
        DELETE FROM file_chunks WHERE file_id = new_file_id;
        DELETE FROM file_metadata WHERE file_id = new_file_id;
    END;

    processing_end := clock_timestamp();

    RETURN QUERY SELECT 
        upload_result,
        new_file_id,
        total_chunks_count,
        EXTRACT(EPOCH FROM processing_end - processing_start)::INTEGER,
        CASE WHEN NOT upload_result THEN upload_error ELSE NULL END;

END;
$$ LANGUAGE plpgsql;

-- Basic file download function (very limited streaming capabilities)
CREATE OR REPLACE FUNCTION download_file_chunks(file_id_param INTEGER)
RETURNS TABLE (
    chunk_number INTEGER,
    chunk_size_bytes INTEGER,
    chunk_data TEXT,
    download_order INTEGER
) AS $$
BEGIN
    -- Simple chunk retrieval (no streaming, no optimization)
    RETURN QUERY
    SELECT 
        fc.chunk_number,
        fc.chunk_size_bytes,
        fc.chunk_data,
        fc.chunk_number as download_order
    FROM file_chunks fc
    WHERE fc.file_id = file_id_param
    ORDER BY fc.chunk_number;

    -- Update download statistics (basic tracking)
    UPDATE file_metadata 
    SET modified_date = CURRENT_TIMESTAMP
    WHERE file_id = file_id_param;
END;
$$ LANGUAGE plpgsql;

-- Execute file upload simulation
SELECT * FROM process_file_upload('large_document.pdf', '/uploads/large_document.pdf', 50000000, 1048576);

-- Basic file management and cleanup
WITH file_storage_analysis AS (
    SELECT 
        fm.file_id,
        fm.file_name,
        fm.file_size_bytes,
        fm.created_date,
        fm.file_status,
        COUNT(fc.chunk_id) as total_chunks,
        SUM(fc.chunk_size_bytes) as total_chunk_size,

        -- Storage efficiency calculation (basic)
        CASE 
            WHEN fm.file_size_bytes > 0 THEN
                (SUM(fc.chunk_size_bytes)::DECIMAL / fm.file_size_bytes) * 100
            ELSE 0
        END as storage_efficiency_percent,

        -- Age analysis
        EXTRACT(DAY FROM CURRENT_TIMESTAMP - fm.created_date) as file_age_days,

        -- Basic categorization
        CASE 
            WHEN fm.file_size_bytes > 100 * 1024 * 1024 THEN 'large'
            WHEN fm.file_size_bytes > 10 * 1024 * 1024 THEN 'medium'
            ELSE 'small'
        END as file_size_category

    FROM file_metadata fm
    LEFT JOIN file_chunks fc ON fm.file_id = fc.file_id
    WHERE fm.created_date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY fm.file_id, fm.file_name, fm.file_size_bytes, fm.created_date, fm.file_status
)
SELECT 
    fsa.file_name,
    fsa.file_size_category,
    ROUND(fsa.file_size_bytes / 1024.0 / 1024.0, 2) as file_size_mb,
    fsa.total_chunks,
    ROUND(fsa.storage_efficiency_percent, 1) as storage_efficiency_percent,
    fsa.file_age_days,
    fsa.file_status,

    -- Storage recommendations (very basic)
    CASE 
        WHEN fsa.storage_efficiency_percent < 95 THEN 'check_chunk_integrity'
        WHEN fsa.file_age_days > 365 AND fsa.file_status = 'active' THEN 'consider_archiving'
        WHEN fsa.total_chunks = 0 THEN 'missing_chunks'
        ELSE 'normal'
    END as recommendation

FROM file_storage_analysis fsa
ORDER BY fsa.file_size_bytes DESC, fsa.created_date DESC;

-- Basic file cleanup (manual process)
WITH old_files AS (
    SELECT file_id, file_name, file_size_bytes
    FROM file_metadata
    WHERE created_date < CURRENT_DATE - INTERVAL '2 years'
    AND file_status = 'archived'
),
cleanup_chunks AS (
    DELETE FROM file_chunks
    WHERE file_id IN (SELECT file_id FROM old_files)
    RETURNING file_id, chunk_size_bytes
),
cleanup_files AS (
    DELETE FROM file_metadata
    WHERE file_id IN (SELECT file_id FROM old_files)
    RETURNING file_id, file_size_bytes
)
SELECT 
    (SELECT COUNT(*) FROM cleanup_files) as files_cleaned,
    (SELECT COALESCE(SUM(file_size_bytes), 0) FROM cleanup_files) as total_space_freed_bytes,
    ROUND((SELECT COALESCE(SUM(file_size_bytes), 0) FROM cleanup_files) / 1024.0 / 1024.0 / 1024.0, 2) as space_freed_gb,
    (SELECT COUNT(*) FROM cleanup_chunks) as chunks_cleaned;

-- Problems with traditional file storage approaches:
-- 1. Inefficient binary data handling in relational databases
-- 2. Manual chunk management with no automatic optimization
-- 3. Limited streaming capabilities and poor performance for large files
-- 4. No built-in replication or distributed storage features
-- 5. Basic metadata management with limited search capabilities
-- 6. Complex backup and recovery procedures for file data
-- 7. No transactional consistency between file operations and database operations
-- 8. Limited scalability for high-volume file storage requirements
-- 9. No built-in compression or space optimization features
-- 10. Manual versioning and access control management

MongoDB GridFS provides comprehensive large file storage with advanced binary data management:

// MongoDB GridFS Advanced File Storage - comprehensive binary data management with streaming capabilities
const { MongoClient, GridFSBucket } = require('mongodb');
const fs = require('fs');
const { createReadStream, createWriteStream } = require('fs');
const { pipeline } = require('stream');
const { promisify } = require('util');
const crypto = require('crypto');
const { EventEmitter } = require('events');
const path = require('path');

// Comprehensive MongoDB GridFS File Manager
class AdvancedGridFSFileManager extends EventEmitter {
  constructor(connectionString, gridFSConfig = {}) {
    super();
    this.connectionString = connectionString;
    this.client = null;
    this.db = null;
    this.gridFSBuckets = new Map();

    // Advanced GridFS configuration
    this.config = {
      // Bucket configuration
      defaultBucket: gridFSConfig.defaultBucket || 'fs',
      customBuckets: gridFSConfig.customBuckets || {},
      chunkSizeBytes: gridFSConfig.chunkSizeBytes || 261120, // 255KB default

      // File management settings
      enableMetadataIndexing: gridFSConfig.enableMetadataIndexing !== false,
      enableVersionControl: gridFSConfig.enableVersionControl || false,
      enableCompression: gridFSConfig.enableCompression || false,
      enableEncryption: gridFSConfig.enableEncryption || false,

      // Storage optimization
      enableAutomaticCleanup: gridFSConfig.enableAutomaticCleanup || false,
      enableDeduplication: gridFSConfig.enableDeduplication || false,
      enableThumbnailGeneration: gridFSConfig.enableThumbnailGeneration || false,

      // Performance configuration
      enableParallelUploads: gridFSConfig.enableParallelUploads || false,
      maxConcurrentUploads: gridFSConfig.maxConcurrentUploads || 5,
      enableStreamingOptimization: gridFSConfig.enableStreamingOptimization || false,

      // Access control and security
      enableAccessControl: gridFSConfig.enableAccessControl || false,
      defaultPermissions: gridFSConfig.defaultPermissions || 'private',
      enableAuditLogging: gridFSConfig.enableAuditLogging || false,

      // Backup and replication
      enableBackupIntegration: gridFSConfig.enableBackupIntegration || false,
      enableReplicationMonitoring: gridFSConfig.enableReplicationMonitoring || false,

      // File processing
      enableContentAnalysis: gridFSConfig.enableContentAnalysis || false,
      enableVirusScan: gridFSConfig.enableVirusScan || false,
      enableFormatValidation: gridFSConfig.enableFormatValidation || false
    };

    // File management state
    this.activeUploads = new Map();
    this.activeDownloads = new Map();
    this.fileOperations = new Map();
    this.uploadQueue = [];

    // Performance metrics
    this.metrics = {
      totalFilesStored: 0,
      totalBytesStored: 0,
      averageUploadSpeed: 0,
      averageDownloadSpeed: 0,
      storageEfficiency: 0
    };

    // Kick off async initialization; callers can await `manager.ready` before using the manager
    this.ready = this.initializeGridFS();
  }

  async initializeGridFS() {
    console.log('Initializing advanced GridFS file management...');

    try {
      // Connect to MongoDB
      this.client = new MongoClient(this.connectionString);
      await this.client.connect();
      this.db = this.client.db();

      // Initialize default GridFS bucket
      this.initializeBucket(this.config.defaultBucket);

      // Initialize custom buckets
      for (const [bucketName, bucketConfig] of Object.entries(this.config.customBuckets)) {
        this.initializeBucket(bucketName, bucketConfig);
      }

      // Setup metadata indexing
      if (this.config.enableMetadataIndexing) {
        await this.setupMetadataIndexing();
      }

      // Setup file processing pipeline
      await this.setupFileProcessingPipeline();

      // Initialize monitoring and metrics
      await this.setupMonitoringAndMetrics();

      console.log('Advanced GridFS file management initialized successfully');

    } catch (error) {
      console.error('Error initializing GridFS:', error);
      throw error;
    }
  }

  initializeBucket(bucketName, bucketConfig = {}) {
    const bucket = new GridFSBucket(this.db, {
      bucketName: bucketName,
      chunkSizeBytes: bucketConfig.chunkSizeBytes || this.config.chunkSizeBytes
    });

    this.gridFSBuckets.set(bucketName, {
      bucket: bucket,
      config: bucketConfig,
      stats: {
        totalFiles: 0,
        totalBytes: 0,
        averageFileSize: 0,
        lastActivity: new Date()
      }
    });

    console.log(`Initialized GridFS bucket: ${bucketName}`);
  }

  async setupMetadataIndexing() {
    console.log('Setting up metadata indexing for GridFS...');

    try {
      // Create indexes on files collection for efficient queries
      for (const [bucketName, bucketInfo] of this.gridFSBuckets.entries()) {
        const filesCollection = this.db.collection(`${bucketName}.files`);
        const chunksCollection = this.db.collection(`${bucketName}.chunks`);

        // Files collection indexes
        await filesCollection.createIndex(
          { filename: 1, uploadDate: -1 },
          { background: true }
        );

        await filesCollection.createIndex(
          { 'metadata.contentType': 1, uploadDate: -1 },
          { background: true }
        );

        await filesCollection.createIndex(
          { 'metadata.tags': 1 },
          { background: true }
        );

        await filesCollection.createIndex(
          { length: -1, uploadDate: -1 },
          { background: true }
        );

        // Chunks collection optimization
        await chunksCollection.createIndex(
          { files_id: 1, n: 1 },
          { unique: true, background: true }
        );
      }

    } catch (error) {
      console.error('Error setting up metadata indexing:', error);
      throw error;
    }
  }

  async uploadFile(filePath, options = {}) {
    console.log(`Starting file upload: ${filePath}`);

    const uploadId = this.generateUploadId();
    const startTime = Date.now();

    try {
      // Validate file exists and get stats
      const fileStats = await fs.promises.stat(filePath);

      // Prepare upload configuration
      const uploadConfig = {
        uploadId: uploadId,
        filePath: filePath,
        fileName: options.filename || path.basename(filePath),
        bucketName: options.bucket || this.config.defaultBucket,
        contentType: options.contentType || this.detectContentType(filePath),

        // File metadata
        metadata: {
          originalPath: filePath,
          fileSize: fileStats.size,
          uploadedBy: options.uploadedBy || 'system',
          uploadDate: new Date(),
          contentType: options.contentType || this.detectContentType(filePath),

          // Custom metadata
          tags: options.tags || [],
          category: options.category || 'general',
          permissions: options.permissions || this.config.defaultPermissions,

          // Processing configuration
          processOnUpload: options.processOnUpload || false,
          generateThumbnail: options.generateThumbnail || false,
          enableCompression: options.enableCompression || this.config.enableCompression,

          // Checksums for integrity
          checksums: {}
        },

        // Upload progress tracking
        progress: {
          bytesUploaded: 0,
          totalBytes: fileStats.size,
          percentComplete: 0,
          uploadSpeed: 0,
          estimatedTimeRemaining: 0
        }
      };

      // Get GridFS bucket
      const bucketInfo = this.gridFSBuckets.get(uploadConfig.bucketName);
      if (!bucketInfo) {
        throw new Error(`GridFS bucket not found: ${uploadConfig.bucketName}`);
      }

      // Store upload state
      this.activeUploads.set(uploadId, uploadConfig);

      // Calculate file checksum before upload
      if (this.config.enableDeduplication) {
        uploadConfig.metadata.checksums.md5 = await this.calculateFileChecksum(filePath, 'md5');
        uploadConfig.metadata.checksums.sha256 = await this.calculateFileChecksum(filePath, 'sha256');

        // Check for duplicate files
        const duplicate = await this.findDuplicateFile(uploadConfig.metadata.checksums.sha256, uploadConfig.bucketName);
        if (duplicate && options.skipDuplicates) {
          this.emit('duplicateDetected', {
            uploadId: uploadId,
            duplicateFileId: duplicate._id,
            fileName: uploadConfig.fileName
          });

          return {
            success: true,
            uploadId: uploadId,
            fileId: duplicate._id,
            isDuplicate: true,
            fileName: uploadConfig.fileName,
            fileSize: duplicate.length
          };
        }
      }

      // Create upload stream
      const uploadStream = bucketInfo.bucket.openUploadStream(uploadConfig.fileName, {
        chunkSizeBytes: bucketInfo.config.chunkSizeBytes || this.config.chunkSizeBytes,
        metadata: uploadConfig.metadata
      });

      // Create read stream from file
      const fileReadStream = createReadStream(filePath);

      // Track upload progress
      const progressTracker = this.createProgressTracker(uploadId, uploadConfig);

      // Pipeline streams with error handling
      const pipelineAsync = promisify(pipeline);

      await pipelineAsync(
        fileReadStream,
        progressTracker,
        uploadStream
      );

      // Update upload completion
      const endTime = Date.now();
      const duration = endTime - startTime;
      const fileId = uploadStream.id;

      uploadConfig.fileId = fileId;
      uploadConfig.status = 'completed';
      uploadConfig.duration = duration;
      uploadConfig.uploadSpeed = (fileStats.size / 1024 / 1024) / (duration / 1000); // MB/s

      // Update bucket statistics
      bucketInfo.stats.totalFiles++;
      bucketInfo.stats.totalBytes += fileStats.size;
      bucketInfo.stats.averageFileSize = bucketInfo.stats.totalBytes / bucketInfo.stats.totalFiles;
      bucketInfo.stats.lastActivity = new Date();

      // Post-processing
      if (uploadConfig.metadata.processOnUpload) {
        await this.processUploadedFile(fileId, uploadConfig);
      }

      // Update system metrics
      this.updateMetrics(uploadConfig);

      // Cleanup
      this.activeUploads.delete(uploadId);

      this.emit('uploadCompleted', {
        uploadId: uploadId,
        fileId: fileId,
        fileName: uploadConfig.fileName,
        fileSize: fileStats.size,
        duration: duration,
        uploadSpeed: uploadConfig.uploadSpeed
      });

      console.log(`File upload completed: ${uploadConfig.fileName} (${fileId})`);

      return {
        success: true,
        uploadId: uploadId,
        fileId: fileId,
        fileName: uploadConfig.fileName,
        fileSize: fileStats.size,
        duration: duration,
        uploadSpeed: uploadConfig.uploadSpeed,
        bucketName: uploadConfig.bucketName
      };

    } catch (error) {
      console.error(`File upload failed for ${uploadId}:`, error);

      // Update upload state
      const uploadConfig = this.activeUploads.get(uploadId);
      if (uploadConfig) {
        uploadConfig.status = 'failed';
        uploadConfig.error = error.message;
      }

      this.emit('uploadFailed', {
        uploadId: uploadId,
        fileName: options.filename || path.basename(filePath),
        error: error.message
      });

      return {
        success: false,
        uploadId: uploadId,
        error: error.message
      };
    }
  }

  createProgressTracker(uploadId, uploadConfig) {
    const { Transform } = require('stream');
    const self = this;
    const startTime = Date.now();

    return new Transform({
      transform(chunk, encoding, callback) {
        // Update progress counters
        uploadConfig.progress.bytesUploaded += chunk.length;
        uploadConfig.progress.percentComplete = 
          (uploadConfig.progress.bytesUploaded / uploadConfig.progress.totalBytes) * 100;

        // Calculate upload speed (MB/s)
        const timeElapsed = (Date.now() - startTime) / 1000; // seconds
        uploadConfig.progress.uploadSpeed = timeElapsed > 0
          ? (uploadConfig.progress.bytesUploaded / 1024 / 1024) / timeElapsed
          : 0;

        // Estimate time remaining (seconds)
        const remainingBytes = uploadConfig.progress.totalBytes - uploadConfig.progress.bytesUploaded;
        uploadConfig.progress.estimatedTimeRemaining = uploadConfig.progress.uploadSpeed > 0
          ? remainingBytes / (uploadConfig.progress.uploadSpeed * 1024 * 1024)
          : 0;

        // Emit progress on the manager instance (the Transform's `this` is the stream itself)
        self.emit('uploadProgress', {
          uploadId: uploadId,
          progress: uploadConfig.progress
        });

        callback(null, chunk);
      }
    });
  }

  async downloadFile(fileId, downloadPath, options = {}) {
    console.log(`Starting file download: ${fileId} -> ${downloadPath}`);

    const downloadId = this.generateDownloadId();
    const startTime = Date.now();

    try {
      // Get bucket
      const bucketName = options.bucket || this.config.defaultBucket;
      const bucketInfo = this.gridFSBuckets.get(bucketName);
      if (!bucketInfo) {
        throw new Error(`GridFS bucket not found: ${bucketName}`);
      }

      // Get file metadata
      const fileMetadata = await this.getFileMetadata(fileId, bucketName);
      if (!fileMetadata) {
        throw new Error(`File not found: ${fileId}`);
      }

      // Prepare download configuration
      const downloadConfig = {
        downloadId: downloadId,
        fileId: fileId,
        downloadPath: downloadPath,
        bucketName: bucketName,
        fileSize: fileMetadata.length,

        // Download progress tracking
        progress: {
          bytesDownloaded: 0,
          totalBytes: fileMetadata.length,
          percentComplete: 0,
          downloadSpeed: 0,
          estimatedTimeRemaining: 0
        }
      };

      // Store download state
      this.activeDownloads.set(downloadId, downloadConfig);

      // Create download stream
      const downloadStream = bucketInfo.bucket.openDownloadStream(fileId);

      // Create write stream to file
      const fileWriteStream = createWriteStream(downloadPath);

      // Track download progress
      const progressTracker = this.createDownloadProgressTracker(downloadId, downloadConfig);

      // Pipeline streams
      const pipelineAsync = promisify(pipeline);

      await pipelineAsync(
        downloadStream,
        progressTracker,
        fileWriteStream
      );

      // Update download completion
      const endTime = Date.now();
      const duration = endTime - startTime;
      downloadConfig.duration = duration;
      downloadConfig.downloadSpeed = (fileMetadata.length / 1024 / 1024) / (duration / 1000); // MB/s

      // Cleanup
      this.activeDownloads.delete(downloadId);

      this.emit('downloadCompleted', {
        downloadId: downloadId,
        fileId: fileId,
        fileName: fileMetadata.filename,
        fileSize: fileMetadata.length,
        duration: duration,
        downloadSpeed: downloadConfig.downloadSpeed
      });

      console.log(`File download completed: ${fileMetadata.filename} (${fileId})`);

      return {
        success: true,
        downloadId: downloadId,
        fileId: fileId,
        fileName: fileMetadata.filename,
        fileSize: fileMetadata.length,
        duration: duration,
        downloadSpeed: downloadConfig.downloadSpeed
      };

    } catch (error) {
      console.error(`File download failed for ${downloadId}:`, error);

      // Cleanup
      this.activeDownloads.delete(downloadId);

      this.emit('downloadFailed', {
        downloadId: downloadId,
        fileId: fileId,
        error: error.message
      });

      return {
        success: false,
        downloadId: downloadId,
        fileId: fileId,
        error: error.message
      };
    }
  }

  createDownloadProgressTracker(downloadId, downloadConfig) {
    const { Transform } = require('stream');
    const self = this;
    const startTime = Date.now();

    return new Transform({
      transform(chunk, encoding, callback) {
        // Update progress counters
        downloadConfig.progress.bytesDownloaded += chunk.length;
        downloadConfig.progress.percentComplete = 
          (downloadConfig.progress.bytesDownloaded / downloadConfig.progress.totalBytes) * 100;

        // Calculate download speed (MB/s)
        const timeElapsed = (Date.now() - startTime) / 1000; // seconds
        downloadConfig.progress.downloadSpeed = timeElapsed > 0
          ? (downloadConfig.progress.bytesDownloaded / 1024 / 1024) / timeElapsed
          : 0;

        // Emit progress on the manager instance (the Transform's `this` is the stream itself)
        self.emit('downloadProgress', {
          downloadId: downloadId,
          progress: downloadConfig.progress
        });

        callback(null, chunk);
      }
    });
  }

  async getFileMetadata(fileId, bucketName = null) {
    console.log(`Getting file metadata: ${fileId}`);

    try {
      bucketName = bucketName || this.config.defaultBucket;
      const filesCollection = this.db.collection(`${bucketName}.files`);

      const fileMetadata = await filesCollection.findOne({ _id: fileId });
      return fileMetadata;

    } catch (error) {
      console.error(`Error getting file metadata for ${fileId}:`, error);
      throw error;
    }
  }

  async searchFiles(searchCriteria, options = {}) {
    console.log('Searching files with criteria:', searchCriteria);

    try {
      const bucketName = options.bucket || this.config.defaultBucket;
      const filesCollection = this.db.collection(`${bucketName}.files`);

      // Build search query
      const query = {};

      // Text search on filename
      if (searchCriteria.filename) {
        query.filename = { $regex: searchCriteria.filename, $options: 'i' };
      }

      // Content type filter
      if (searchCriteria.contentType) {
        query['metadata.contentType'] = searchCriteria.contentType;
      }

      // Size range filter
      if (searchCriteria.sizeRange) {
        query.length = {};
        if (searchCriteria.sizeRange.min) {
          query.length.$gte = searchCriteria.sizeRange.min;
        }
        if (searchCriteria.sizeRange.max) {
          query.length.$lte = searchCriteria.sizeRange.max;
        }
      }

      // Date range filter
      if (searchCriteria.dateRange) {
        query.uploadDate = {};
        if (searchCriteria.dateRange.from) {
          query.uploadDate.$gte = new Date(searchCriteria.dateRange.from);
        }
        if (searchCriteria.dateRange.to) {
          query.uploadDate.$lte = new Date(searchCriteria.dateRange.to);
        }
      }

      // Tags filter
      if (searchCriteria.tags) {
        query['metadata.tags'] = { $in: searchCriteria.tags };
      }

      // Category filter
      if (searchCriteria.category) {
        query['metadata.category'] = searchCriteria.category;
      }

      // Execute search with pagination
      const limit = options.limit || 50;
      const skip = options.skip || 0;
      const sort = options.sort || { uploadDate: -1 };

      const files = await filesCollection
        .find(query)
        .sort(sort)
        .limit(limit)
        .skip(skip)
        .toArray();

      // Get total count for pagination
      const totalCount = await filesCollection.countDocuments(query);

      return {
        success: true,
        files: files.map(file => ({
          fileId: file._id,
          filename: file.filename,
          length: file.length,
          uploadDate: file.uploadDate,
          contentType: file.metadata?.contentType,
          tags: file.metadata?.tags || [],
          category: file.metadata?.category,
          checksums: file.metadata?.checksums || {}
        })),
        totalCount: totalCount,
        currentPage: Math.floor(skip / limit) + 1,
        totalPages: Math.ceil(totalCount / limit)
      };

    } catch (error) {
      console.error('Error searching files:', error);
      return {
        success: false,
        error: error.message
      };
    }
  }

  async deleteFile(fileId, bucketName = null) {
    console.log(`Deleting file: ${fileId}`);

    try {
      bucketName = bucketName || this.config.defaultBucket;
      const bucketInfo = this.gridFSBuckets.get(bucketName);
      if (!bucketInfo) {
        throw new Error(`GridFS bucket not found: ${bucketName}`);
      }

      // Get file metadata before deletion
      const fileMetadata = await this.getFileMetadata(fileId, bucketName);
      if (!fileMetadata) {
        throw new Error(`File not found: ${fileId}`);
      }

      // Delete file from GridFS
      await bucketInfo.bucket.delete(fileId);

      // Update bucket statistics
      bucketInfo.stats.totalFiles = Math.max(0, bucketInfo.stats.totalFiles - 1);
      bucketInfo.stats.totalBytes = Math.max(0, bucketInfo.stats.totalBytes - fileMetadata.length);
      if (bucketInfo.stats.totalFiles > 0) {
        bucketInfo.stats.averageFileSize = bucketInfo.stats.totalBytes / bucketInfo.stats.totalFiles;
      }
      bucketInfo.stats.lastActivity = new Date();

      this.emit('fileDeleted', {
        fileId: fileId,
        fileName: fileMetadata.filename,
        fileSize: fileMetadata.length,
        bucketName: bucketName
      });

      console.log(`File deleted successfully: ${fileMetadata.filename} (${fileId})`);

      return {
        success: true,
        fileId: fileId,
        fileName: fileMetadata.filename,
        fileSize: fileMetadata.length
      };

    } catch (error) {
      console.error(`Error deleting file ${fileId}:`, error);
      return {
        success: false,
        fileId: fileId,
        error: error.message
      };
    }
  }

  async getStorageStatistics(bucketName = null) {
    console.log(`Getting storage statistics${bucketName ? ' for bucket: ' + bucketName : ''}`);

    try {
      const statistics = {};

      if (bucketName) {
        // Get statistics for specific bucket
        const bucketInfo = this.gridFSBuckets.get(bucketName);
        if (!bucketInfo) {
          throw new Error(`GridFS bucket not found: ${bucketName}`);
        }

        statistics[bucketName] = await this.calculateBucketStatistics(bucketName, bucketInfo);
      } else {
        // Get statistics for all buckets
        for (const [name, bucketInfo] of this.gridFSBuckets.entries()) {
          statistics[name] = await this.calculateBucketStatistics(name, bucketInfo);
        }
      }

      // Calculate system-wide statistics
      const systemStats = {
        totalBuckets: this.gridFSBuckets.size,
        totalFiles: Object.values(statistics).reduce((sum, bucket) => sum + bucket.fileCount, 0),
        totalBytes: Object.values(statistics).reduce((sum, bucket) => sum + bucket.totalBytes, 0),
        averageFileSize: 0,
        storageEfficiency: this.metrics.storageEfficiency
      };

      if (systemStats.totalFiles > 0) {
        systemStats.averageFileSize = systemStats.totalBytes / systemStats.totalFiles;
      }

      return {
        success: true,
        bucketStatistics: statistics,
        systemStatistics: systemStats,
        retrievalTime: new Date()
      };

    } catch (error) {
      console.error('Error getting storage statistics:', error);
      return {
        success: false,
        error: error.message
      };
    }
  }

  async calculateBucketStatistics(bucketName, bucketInfo) {
    const filesCollection = this.db.collection(`${bucketName}.files`);
    const chunksCollection = this.db.collection(`${bucketName}.chunks`);

    // Basic file statistics
    const fileStats = await filesCollection.aggregate([
      {
        $group: {
          _id: null,
          fileCount: { $sum: 1 },
          totalBytes: { $sum: '$length' },
          averageFileSize: { $avg: '$length' },
          largestFile: { $max: '$length' },
          smallestFile: { $min: '$length' }
        }
      }
    ]).toArray();

    // Content type distribution
    const contentTypeStats = await filesCollection.aggregate([
      {
        $group: {
          _id: '$metadata.contentType',
          count: { $sum: 1 },
          totalBytes: { $sum: '$length' }
        }
      },
      { $sort: { count: -1 } },
      { $limit: 10 }
    ]).toArray();

    // Chunk statistics
    const chunkStats = await chunksCollection.aggregate([
      {
        $group: {
          _id: null,
          totalChunks: { $sum: 1 },
          averageChunkSize: { $avg: { $binarySize: '$data' } }
        }
      }
    ]).toArray();

    const baseStats = fileStats[0] || {
      fileCount: 0,
      totalBytes: 0,
      averageFileSize: 0,
      largestFile: 0,
      smallestFile: 0
    };

    return {
      fileCount: baseStats.fileCount,
      totalBytes: baseStats.totalBytes,
      averageFileSize: Math.round(baseStats.averageFileSize || 0),
      largestFile: baseStats.largestFile,
      smallestFile: baseStats.smallestFile,
      contentTypes: contentTypeStats,
      totalChunks: chunkStats[0]?.totalChunks || 0,
      averageChunkSize: Math.round(chunkStats[0]?.averageChunkSize || 0),
      storageEfficiency: this.calculateStorageEfficiency(bucketName)
    };
  }

  // Utility methods

  generateUploadId() {
    return `upload_${Date.now()}_${Math.random().toString(36).substr(2, 9)}`;
  }

  generateDownloadId() {
    return `download_${Date.now()}_${Math.random().toString(36).substr(2, 9)}`;
  }

  detectContentType(filePath) {
    const path = require('path');
    const ext = path.extname(filePath).toLowerCase();

    const mimeTypes = {
      '.pdf': 'application/pdf',
      '.doc': 'application/msword',
      '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
      '.jpg': 'image/jpeg',
      '.jpeg': 'image/jpeg',
      '.png': 'image/png',
      '.gif': 'image/gif',
      '.mp4': 'video/mp4',
      '.mp3': 'audio/mpeg',
      '.zip': 'application/zip',
      '.txt': 'text/plain',
      '.json': 'application/json',
      '.xml': 'application/xml'
    };

    return mimeTypes[ext] || 'application/octet-stream';
  }

  async calculateFileChecksum(filePath, algorithm = 'sha256') {
    return new Promise((resolve, reject) => {
      const hash = crypto.createHash(algorithm);
      const stream = createReadStream(filePath);

      stream.on('data', (data) => {
        hash.update(data);
      });

      stream.on('end', () => {
        resolve(hash.digest('hex'));
      });

      stream.on('error', (error) => {
        reject(error);
      });
    });
  }

  async findDuplicateFile(checksum, bucketName) {
    const filesCollection = this.db.collection(`${bucketName}.files`);
    return await filesCollection.findOne({
      'metadata.checksums.sha256': checksum
    });
  }

  calculateStorageEfficiency(bucketName) {
    // Simplified storage efficiency calculation
    // In a real implementation, this would analyze compression ratios, deduplication, etc.
    return 85.0; // Placeholder
  }

  updateMetrics(uploadConfig) {
    this.metrics.totalFilesStored++;
    this.metrics.totalBytesStored += uploadConfig.progress.totalBytes;

    // Update average upload speed
    const totalUploads = this.metrics.totalFilesStored;
    this.metrics.averageUploadSpeed = 
      ((this.metrics.averageUploadSpeed * (totalUploads - 1)) + uploadConfig.uploadSpeed) / totalUploads;
  }

  async setupFileProcessingPipeline() {
    // Setup file processing pipeline for thumbnails, content analysis, etc.
    console.log('Setting up file processing pipeline...');
  }

  async setupMonitoringAndMetrics() {
    // Setup monitoring and metrics collection
    console.log('Setting up monitoring and metrics...');
  }

  async processUploadedFile(fileId, uploadConfig) {
    // Process uploaded file (thumbnails, analysis, etc.)
    console.log(`Processing uploaded file: ${fileId}`);
  }

  async shutdown() {
    console.log('Shutting down GridFS file manager...');

    try {
      // Wait for active uploads/downloads to complete
      if (this.activeUploads.size > 0) {
        console.log(`Waiting for ${this.activeUploads.size} uploads to complete...`);
      }

      if (this.activeDownloads.size > 0) {
        console.log(`Waiting for ${this.activeDownloads.size} downloads to complete...`);
      }

      // Close MongoDB connection
      if (this.client) {
        await this.client.close();
      }

      console.log('GridFS file manager shutdown complete');

    } catch (error) {
      console.error('Error during shutdown:', error);
    }
  }
}

// Benefits of MongoDB GridFS Advanced File Storage:
// - Efficient binary data storage with automatic chunking and compression
// - Integrated metadata management with full-text search capabilities
// - Streaming upload and download with progress tracking and optimization
// - Built-in replication and distributed storage through MongoDB replica sets
// - File content and metadata stored alongside application data with replica set durability
// - Advanced file processing pipeline with thumbnail generation and content analysis
// - Comprehensive version control and access management capabilities
// - SQL-compatible file operations through QueryLeaf integration
// - Enterprise-grade security, encryption, and audit logging
// - Production-ready scalability with automatic load balancing and optimization

module.exports = {
  AdvancedGridFSFileManager
};
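
A hypothetical end-to-end usage sketch for the manager above; the module path, connection string, file paths, and tags are illustrative, and the sketch relies on the ready promise stored by the constructor.

// Hypothetical usage of AdvancedGridFSFileManager (module path, URI, file paths, and tags are illustrative)
const { AdvancedGridFSFileManager } = require('./advanced-gridfs-file-manager');

async function main() {
  const manager = new AdvancedGridFSFileManager('mongodb://localhost:27017/media', {
    chunkSizeBytes: 261120,
    enableMetadataIndexing: true,
    enableDeduplication: true
  });

  await manager.ready; // wait for connection and bucket setup

  manager.on('uploadProgress', ({ uploadId, progress }) => {
    console.log(`${uploadId}: ${progress.percentComplete.toFixed(1)}% complete`);
  });

  const upload = await manager.uploadFile('./videos/launch-keynote.mp4', {
    tags: ['marketing', 'video'],
    category: 'media',
    skipDuplicates: true
  });

  if (upload.success) {
    const results = await manager.searchFiles({ tags: ['marketing'] }, { limit: 10 });
    console.log(`Found ${results.totalCount} tagged files`);

    await manager.downloadFile(upload.fileId, './downloads/launch-keynote.mp4');
  }

  await manager.shutdown();
}

main().catch(console.error);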

Understanding MongoDB GridFS Architecture
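
Each GridFS bucket is backed by two ordinary collections: a "files" collection that stores one metadata document per file (_id, length, chunkSize, uploadDate, filename, and an optional metadata object) and a "chunks" collection that stores the binary content as ordered pieces referencing the parent file through files_id and a sequence number n. A small inspection sketch, assuming a local deployment that already has files in the default 'fs' bucket:

// Inspect the two collections that back a GridFS bucket (assumes a local deployment with an existing 'fs' bucket)
const { MongoClient } = require('mongodb');

async function inspectGridFSCollections() {
  const client = new MongoClient('mongodb://localhost:27017'); // placeholder connection string
  await client.connect();

  try {
    const db = client.db('demo');

    // fs.files: one metadata document per file
    // { _id, length, chunkSize, uploadDate, filename, metadata }
    const fileDoc = await db.collection('fs.files').findOne({}, { sort: { uploadDate: -1 } });
    console.log('files document:', fileDoc);

    if (fileDoc) {
      // fs.chunks: ordered binary pieces referencing the parent file through files_id
      // { _id, files_id, n, data }
      const chunkCount = await db.collection('fs.chunks').countDocuments({ files_id: fileDoc._id });
      const firstChunk = await db.collection('fs.chunks').findOne(
        { files_id: fileDoc._id, n: 0 },
        { projection: { data: 0 } } // omit the binary payload for readability
      );

      console.log(`chunk count for ${fileDoc.filename}: ${chunkCount}`);
      console.log('first chunk (data omitted):', firstChunk);
    }
  } finally {
    await client.close();
  }
}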

Advanced File Storage Design and Implementation Patterns

Implement comprehensive GridFS workflows for enterprise file management:

// Enterprise-grade GridFS with advanced distributed file management capabilities
class EnterpriseGridFSManager extends AdvancedGridFSFileManager {
  constructor(connectionString, enterpriseConfig) {
    super(connectionString, enterpriseConfig);

    this.enterpriseConfig = {
      ...enterpriseConfig,
      enableDistributedProcessing: true,
      enableContentDeliveryNetwork: true,
      enableAdvancedSecurity: true,
      enableComplianceAuditing: true,
      enableGlobalReplication: true
    };

    // The setup hooks below (and deployEnterpriseStrategy / deploySecurityFramework used later)
    // are assumed to be implemented elsewhere in the enterprise deployment
    this.setupEnterpriseCapabilities();
    this.initializeDistributedProcessing();
    this.setupContentDeliveryNetwork();
  }

  async implementAdvancedFileStrategy() {
    console.log('Implementing enterprise file management strategy...');

    const fileStrategy = {
      // Multi-tier storage strategy
      storageTiers: {
        hotStorage: {
          criteria: 'accessed_within_30_days',
          chunkSize: 261120,
          compressionLevel: 6,
          replicationFactor: 3
        },
        coldStorage: {
          criteria: 'accessed_30_to_90_days_ago',
          chunkSize: 1048576,
          compressionLevel: 9,
          replicationFactor: 2
        },
        archiveStorage: {
          criteria: 'accessed_more_than_90_days_ago',
          chunkSize: 4194304,
          compressionLevel: 9,
          replicationFactor: 1
        }
      },

      // Content delivery optimization
      contentDelivery: {
        enableGlobalDistribution: true,
        enableEdgeCaching: true,
        enableImageOptimization: true,
        enableVideoTranscoding: true
      },

      // Advanced processing
      fileProcessing: {
        enableMachineLearning: true,
        enableContentRecognition: true,
        enableAutomaticTagging: true,
        enableThreatDetection: true
      }
    };

    return await this.deployEnterpriseStrategy(fileStrategy);
  }

  async setupAdvancedSecurity() {
    console.log('Setting up enterprise security for file operations...');

    const securityConfig = {
      // File encryption
      encryptionAtRest: true,
      encryptionInTransit: true,
      encryptionKeyRotation: true,

      // Access control
      roleBasedAccess: true,
      attributeBasedAccess: true,
      dynamicPermissions: true,

      // Threat protection
      malwareScanning: true,
      contentFiltering: true,
      dataLossPrevention: true
    };

    return await this.deploySecurityFramework(securityConfig);
  }
}
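
As a concrete illustration of the tier criteria sketched in implementAdvancedFileStrategy() above, a small standalone helper (the tier names, thresholds, and return shape are assumptions for illustration, not part of the class) could map a file's last access time to a storage tier:

// Hypothetical tier classifier mirroring the hot/cold/archive criteria above
function classifyStorageTier(lastAccessed, now = new Date()) {
  const daysSinceAccess = (now - lastAccessed) / (1000 * 60 * 60 * 24);

  if (daysSinceAccess <= 30) {
    // Accessed within 30 days: small chunks, light compression, higher replication
    return { tier: 'hotStorage', chunkSize: 261120, compressionLevel: 6, replicationFactor: 3 };
  }
  if (daysSinceAccess <= 90) {
    // Accessed 30 to 90 days ago: larger chunks, heavier compression
    return { tier: 'coldStorage', chunkSize: 1048576, compressionLevel: 9, replicationFactor: 2 };
  }
  // Not accessed for more than 90 days
  return { tier: 'archiveStorage', chunkSize: 4194304, compressionLevel: 9, replicationFactor: 1 };
}

// Example: a file last touched 45 days ago lands in cold storage
console.log(classifyStorageTier(new Date(Date.now() - 45 * 24 * 60 * 60 * 1000)));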

SQL-Style GridFS Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB GridFS operations:

-- QueryLeaf advanced GridFS operations with SQL-familiar syntax for MongoDB

-- Configure GridFS bucket with comprehensive settings
CREATE GRIDFS_BUCKET media_files 
WITH chunk_size_bytes = 261120,
     enable_compression = true,
     compression_level = 6,
     enable_encryption = true,
     enable_metadata_indexing = true,
     enable_version_control = true,
     enable_thumbnail_generation = true,
     enable_content_analysis = true,

     -- Storage optimization
     enable_deduplication = true,
     enable_automatic_cleanup = true,
     storage_tier_management = true,

     -- Access control
     default_permissions = 'private',
     enable_access_logging = true,
     enable_audit_trail = true,

     -- Performance settings
     max_concurrent_uploads = 10,
     enable_parallel_processing = true,
     enable_streaming_optimization = true,

     -- Backup and replication
     enable_backup_integration = true,
     cross_region_replication = true,
     replication_factor = 3;

-- Advanced file upload with comprehensive metadata and processing
WITH file_uploads AS (
  SELECT 
    file_id,
    filename,
    file_size_bytes,
    content_type,
    upload_timestamp,
    upload_duration_seconds,
    upload_speed_mbps,

    -- Processing results
    compression_applied,
    compression_ratio,
    thumbnail_generated,
    content_analysis_completed,
    virus_scan_status,

    -- Metadata extraction
    JSON_EXTRACT(metadata, '$.originalPath') as original_path,
    JSON_EXTRACT(metadata, '$.uploadedBy') as uploaded_by,
    JSON_EXTRACT(metadata, '$.tags') as file_tags,
    JSON_EXTRACT(metadata, '$.category') as file_category,
    JSON_EXTRACT(metadata, '$.permissions') as access_permissions,

    -- File integrity
    JSON_EXTRACT(metadata, '$.checksums.md5') as md5_checksum,
    JSON_EXTRACT(metadata, '$.checksums.sha256') as sha256_checksum,

    -- Processing pipeline results
    JSON_EXTRACT(metadata, '$.processingResults') as processing_results,
    JSON_EXTRACT(metadata, '$.thumbnailPath') as thumbnail_path,
    JSON_EXTRACT(metadata, '$.contentAnalysis') as content_analysis

  FROM GRIDFS_FILES('media_files')
  WHERE upload_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
),

upload_performance AS (
  SELECT 
    file_category,
    content_type,
    COUNT(*) as total_uploads,
    SUM(file_size_bytes) as total_bytes_uploaded,
    AVG(upload_duration_seconds) as avg_upload_time,
    AVG(upload_speed_mbps) as avg_upload_speed,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY upload_duration_seconds) as p95_upload_time,

    -- Processing performance
    COUNT(*) FILTER (WHERE compression_applied = true) as compressed_files,
    AVG(compression_ratio) FILTER (WHERE compression_applied = true) as avg_compression_ratio,
    COUNT(*) FILTER (WHERE thumbnail_generated = true) as thumbnails_generated,
    COUNT(*) FILTER (WHERE content_analysis_completed = true) as content_analyzed,
    COUNT(*) FILTER (WHERE virus_scan_status = 'clean') as clean_files,
    COUNT(*) FILTER (WHERE virus_scan_status = 'threat_detected') as threat_files,

    -- File size distribution
    AVG(file_size_bytes) as avg_file_size_bytes,
    MAX(file_size_bytes) as largest_file_bytes,
    MIN(file_size_bytes) as smallest_file_bytes,

    -- Storage efficiency
    SUM(file_size_bytes) as original_total_bytes,
    SUM(CASE WHEN compression_applied THEN 
          file_size_bytes * (1 - compression_ratio) 
        ELSE file_size_bytes 
    END) as stored_total_bytes

  FROM file_uploads
  GROUP BY file_category, content_type
),

storage_analysis AS (
  SELECT 
    'media_files' as bucket_name,
    DATE_TRUNC('hour', upload_timestamp) as upload_hour,

    -- Upload volume analysis
    COUNT(*) as files_uploaded,
    SUM(file_size_bytes) as bytes_uploaded,
    AVG(upload_speed_mbps) as avg_hourly_upload_speed,

    -- Content type distribution
    COUNT(*) FILTER (WHERE content_type LIKE 'image/%') as image_files,
    COUNT(*) FILTER (WHERE content_type LIKE 'video/%') as video_files,
    COUNT(*) FILTER (WHERE content_type LIKE 'audio/%') as audio_files,
    COUNT(*) FILTER (WHERE content_type = 'application/pdf') as pdf_files,
    COUNT(*) FILTER (WHERE content_type NOT LIKE 'image/%' AND content_type NOT LIKE 'video/%'
                       AND content_type NOT LIKE 'audio/%' AND content_type != 'application/pdf') as other_files,

    -- Processing success rates
    COUNT(*) FILTER (WHERE virus_scan_status = 'clean') as safe_files,
    COUNT(*) FILTER (WHERE content_analysis_completed = true) as analyzed_files,
    COUNT(*) FILTER (WHERE thumbnail_generated = true) as thumbnail_files,

    -- Storage optimization metrics
    AVG(CASE WHEN compression_applied THEN compression_ratio ELSE 0 END) as avg_compression_ratio,
    SUM(CASE WHEN compression_applied THEN 
          file_size_bytes * compression_ratio 
        ELSE 0 
    END) as total_space_saved_bytes

  FROM file_uploads
  GROUP BY DATE_TRUNC('hour', upload_timestamp)
)

SELECT 
  up.file_category,
  up.content_type,
  up.total_uploads,

  -- Upload performance metrics
  ROUND(up.total_bytes_uploaded / 1024.0 / 1024.0, 2) as total_uploaded_mb,
  ROUND(up.avg_upload_time, 2) as avg_upload_time_seconds,
  ROUND(up.avg_upload_speed, 2) as avg_upload_speed_mbps,
  ROUND(up.p95_upload_time, 2) as p95_upload_time_seconds,

  -- Processing efficiency
  up.compressed_files,
  ROUND((up.compressed_files * 100.0) / up.total_uploads, 1) as compression_rate_percent,
  ROUND(up.avg_compression_ratio * 100, 1) as avg_compression_percent,
  up.thumbnails_generated,
  ROUND((up.thumbnails_generated * 100.0) / up.total_uploads, 1) as thumbnail_rate_percent,

  -- Content analysis results
  up.content_analyzed,
  ROUND((up.content_analyzed * 100.0) / up.total_uploads, 1) as analysis_rate_percent,

  -- Security metrics
  up.clean_files,
  up.threat_files,
  CASE 
    WHEN up.threat_files > 0 THEN 'security_issues_detected'
    ELSE 'all_files_clean'
  END as security_status,

  -- File size statistics
  ROUND(up.avg_file_size_bytes / 1024.0 / 1024.0, 2) as avg_file_size_mb,
  ROUND(up.largest_file_bytes / 1024.0 / 1024.0, 2) as largest_file_mb,
  ROUND(up.smallest_file_bytes / 1024.0, 2) as smallest_file_kb,

  -- Storage optimization
  ROUND(up.original_total_bytes / 1024.0 / 1024.0, 2) as original_storage_mb,
  ROUND(up.stored_total_bytes / 1024.0 / 1024.0, 2) as actual_storage_mb,
  ROUND(((up.original_total_bytes - up.stored_total_bytes) / up.original_total_bytes) * 100, 1) as storage_savings_percent,

  -- Performance assessment
  CASE 
    WHEN up.avg_upload_speed > 50 THEN 'excellent'
    WHEN up.avg_upload_speed > 20 THEN 'good'
    WHEN up.avg_upload_speed > 10 THEN 'acceptable'
    ELSE 'needs_optimization'
  END as upload_performance_rating,

  -- Processing health
  CASE 
    WHEN up.threat_files > 0 THEN 'security_review_required'
    WHEN (up.thumbnails_generated * 100.0 / up.total_uploads) < 80 AND up.content_type LIKE 'image/%' THEN 'thumbnail_generation_issues'
    WHEN (up.content_analyzed * 100.0 / up.total_uploads) < 90 THEN 'content_analysis_issues'
    ELSE 'processing_healthy'
  END as processing_health_status,

  -- Optimization recommendations
  ARRAY[
    CASE WHEN up.avg_upload_speed < 10 THEN 'Optimize network bandwidth or chunk size' END,
    CASE WHEN up.avg_compression_ratio < 0.3 AND up.content_type LIKE 'image/%' THEN 'Review image compression settings' END,
    CASE WHEN (up.thumbnails_generated * 100.0 / up.total_uploads) < 50 AND up.content_type LIKE 'image/%' THEN 'Fix thumbnail generation pipeline' END,
    CASE WHEN up.threat_files > 0 THEN 'Review security scanning configuration' END,
    CASE WHEN up.p95_upload_time > 300 THEN 'Optimize upload processing for large files' END
  ]::TEXT[] as optimization_recommendations

FROM upload_performance up
ORDER BY up.total_bytes_uploaded DESC, up.total_uploads DESC;

-- Advanced file search and retrieval with comprehensive filtering
WITH file_search_results AS (
  SELECT 
    file_id,
    filename,
    content_type,
    file_size_bytes,
    upload_timestamp,

    -- Metadata extraction
    JSON_EXTRACT(metadata, '$.category') as category,
    JSON_EXTRACT(metadata, '$.tags') as tags,
    JSON_EXTRACT(metadata, '$.uploadedBy') as uploaded_by,
    JSON_EXTRACT(metadata, '$.permissions') as permissions,
    JSON_EXTRACT(metadata, '$.contentAnalysis.description') as content_description,
    JSON_EXTRACT(metadata, '$.contentAnalysis.keywords') as content_keywords,
    JSON_EXTRACT(metadata, '$.processingResults.thumbnailAvailable') as has_thumbnail,
    JSON_EXTRACT(metadata, '$.processingResults.textExtracted') as has_text_content,

    -- File access patterns
    download_count,
    last_accessed,
    access_frequency_score,

    -- Storage tier information
    storage_tier,
    CASE storage_tier
      WHEN 'hot' THEN 1
      WHEN 'warm' THEN 2  
      WHEN 'cold' THEN 3
      WHEN 'archive' THEN 4
      ELSE 5
    END as tier_priority,

    -- File age and usage
    EXTRACT(DAY FROM (CURRENT_TIMESTAMP - upload_timestamp)) as file_age_days,
    EXTRACT(DAY FROM (CURRENT_TIMESTAMP - last_accessed)) as days_since_last_access

  FROM GRIDFS_FILES('media_files')
  WHERE 
    -- Content type filters
    (content_type IN ('image/jpeg', 'image/png', 'application/pdf', 'video/mp4') OR content_type LIKE '%/%')

    -- Size filters
    AND file_size_bytes BETWEEN 1024 AND 1073741824  -- 1KB to 1GB

    -- Date range filters
    AND upload_timestamp >= CURRENT_TIMESTAMP - INTERVAL '90 days'

    -- Category and tag filters
    AND (JSON_EXTRACT(metadata, '$.category') IS NOT NULL)
    AND (JSON_EXTRACT(metadata, '$.tags') IS NOT NULL)
),

file_analytics AS (
  SELECT 
    fsr.*,

    -- Content analysis scoring
    CASE 
      WHEN fsr.content_description IS NOT NULL AND fsr.content_keywords IS NOT NULL THEN 'fully_analyzed'
      WHEN fsr.content_description IS NOT NULL OR fsr.content_keywords IS NOT NULL THEN 'partially_analyzed'
      ELSE 'not_analyzed'
    END as analysis_completeness,

    -- Access pattern classification
    CASE 
      WHEN fsr.access_frequency_score > 0.8 THEN 'frequently_accessed'
      WHEN fsr.access_frequency_score > 0.4 THEN 'moderately_accessed'
      WHEN fsr.access_frequency_score > 0.1 THEN 'rarely_accessed'
      ELSE 'never_accessed'
    END as access_pattern,

    -- Storage optimization opportunities
    CASE 
      WHEN fsr.days_since_last_access > 90 AND fsr.storage_tier IN ('hot', 'warm') THEN 'candidate_for_cold_storage'
      WHEN fsr.days_since_last_access > 365 AND fsr.storage_tier != 'archive' THEN 'candidate_for_archive'
      WHEN fsr.access_frequency_score > 0.6 AND fsr.storage_tier IN ('cold', 'archive') THEN 'candidate_for_hot_storage'
      ELSE 'appropriate_storage_tier'
    END as storage_optimization,

    -- File health assessment
    CASE 
      WHEN fsr.has_thumbnail = false AND fsr.content_type LIKE 'image/%' THEN 'missing_thumbnail'
      WHEN fsr.has_text_content = false AND fsr.content_type = 'application/pdf' THEN 'text_extraction_needed'
      WHEN fsr.content_description IS NULL AND fsr.content_keywords IS NULL AND fsr.file_age_days > 7 THEN 'analysis_overdue'
      ELSE 'healthy'
    END as file_health_status

  FROM file_search_results fsr
),

usage_patterns AS (
  SELECT 
    content_type,
    category,
    access_pattern,
    storage_tier,
    COUNT(*) as file_count,
    SUM(file_size_bytes) as total_bytes,
    AVG(download_count) as avg_downloads,
    AVG(access_frequency_score) as avg_access_score,

    -- Storage tier distribution
    COUNT(*) FILTER (WHERE storage_tier = 'hot') as hot_tier_count,
    COUNT(*) FILTER (WHERE storage_tier = 'warm') as warm_tier_count,
    COUNT(*) FILTER (WHERE storage_tier = 'cold') as cold_tier_count,
    COUNT(*) FILTER (WHERE storage_tier = 'archive') as archive_tier_count,

    -- Health metrics
    COUNT(*) FILTER (WHERE file_health_status = 'healthy') as healthy_files,
    COUNT(*) FILTER (WHERE file_health_status != 'healthy') as unhealthy_files,

    -- Optimization opportunities
    COUNT(*) FILTER (WHERE storage_optimization LIKE 'candidate_for_%') as optimization_candidates

  FROM file_analytics
  GROUP BY content_type, category, access_pattern, storage_tier
)

SELECT 
  fa.file_id,
  fa.filename,
  fa.content_type,
  ROUND(fa.file_size_bytes / 1024.0 / 1024.0, 2) as file_size_mb,
  fa.category,
  fa.tags,
  fa.uploaded_by,

  -- Access and usage information
  fa.download_count,
  fa.access_pattern,
  fa.days_since_last_access,
  ROUND(fa.access_frequency_score, 3) as access_score,

  -- Storage and optimization
  fa.storage_tier,
  fa.storage_optimization,
  fa.file_health_status,

  -- Content analysis
  fa.analysis_completeness,
  CASE WHEN fa.has_thumbnail THEN 'yes' ELSE 'no' END as thumbnail_available,
  CASE WHEN fa.has_text_content THEN 'yes' ELSE 'no' END as text_content_available,

  -- File management recommendations
  ARRAY[
    CASE WHEN fa.storage_optimization LIKE 'candidate_for_%' THEN 
           'Move to ' || REPLACE(REPLACE(fa.storage_optimization, 'candidate_for_', ''), '_storage', ' storage')
         END,
    CASE WHEN fa.file_health_status = 'missing_thumbnail' THEN 'Generate thumbnail' END,
    CASE WHEN fa.file_health_status = 'text_extraction_needed' THEN 'Extract text content' END,
    CASE WHEN fa.file_health_status = 'analysis_overdue' THEN 'Run content analysis' END,
    CASE WHEN fa.days_since_last_access > 180 AND fa.download_count = 0 THEN 'Consider deletion' END
  ]::TEXT[] as recommendations,

  -- Priority scoring for operations
  CASE 
    WHEN fa.file_health_status != 'healthy' THEN 'high'
    WHEN fa.storage_optimization LIKE 'candidate_for_%' THEN 'medium' 
    WHEN fa.analysis_completeness = 'not_analyzed' THEN 'medium'
    ELSE 'low'
  END as maintenance_priority,

  -- Search relevance scoring
  (
    CASE WHEN fa.access_frequency_score > 0.5 THEN 2 ELSE 0 END +
    CASE WHEN fa.analysis_completeness = 'fully_analyzed' THEN 1 ELSE 0 END +
    CASE WHEN fa.file_health_status = 'healthy' THEN 1 ELSE 0 END +
    CASE WHEN fa.storage_tier = 'hot' THEN 1 ELSE 0 END
  ) as relevance_score

FROM file_analytics fa
WHERE 
  -- Apply additional search filters (file_size_mb is a SELECT alias, so filter on the raw column)
  fa.file_size_bytes / 1024.0 / 1024.0 BETWEEN 1 AND 100  -- 1 MB to 100 MB
  AND fa.file_age_days <= 60  -- Files from last 60 days
  AND fa.analysis_completeness != 'not_analyzed'  -- Only analyzed files

ORDER BY 
  -- Primary sort by maintenance priority (the alias cannot be used inside an
  -- ORDER BY expression, so its logic is repeated here), then relevance
  CASE 
    WHEN fa.file_health_status != 'healthy' THEN 1
    WHEN fa.storage_optimization LIKE 'candidate_for_%' THEN 2
    WHEN fa.analysis_completeness = 'not_analyzed' THEN 2
    ELSE 3
  END,
  relevance_score DESC,
  fa.access_frequency_score DESC,
  fa.upload_timestamp DESC
LIMIT 100;

-- GridFS storage tier management and optimization
CREATE VIEW gridfs_storage_optimization AS
WITH current_storage_state AS (
  SELECT 
    storage_tier,
    COUNT(*) as file_count,
    SUM(file_size_bytes) as total_bytes,
    AVG(file_size_bytes) as avg_file_size,
    AVG(access_frequency_score) as avg_access_frequency,
    AVG(EXTRACT(DAY FROM (CURRENT_TIMESTAMP - last_accessed))) as avg_days_since_access,

    -- Cost analysis (simplified model)
    SUM(file_size_bytes) * CASE storage_tier
      WHEN 'hot' THEN 0.023    -- $0.023 per GB/month
      WHEN 'warm' THEN 0.0125  -- $0.0125 per GB/month  
      WHEN 'cold' THEN 0.004   -- $0.004 per GB/month
      WHEN 'archive' THEN 0.001 -- $0.001 per GB/month
      ELSE 0.023
    END / 1024.0 / 1024.0 / 1024.0 as estimated_monthly_cost_usd,

    -- Performance characteristics  
    AVG(CASE storage_tier
      WHEN 'hot' THEN 10      -- 10ms avg access time
      WHEN 'warm' THEN 100    -- 100ms avg access time
      WHEN 'cold' THEN 1000   -- 1s avg access time  
      WHEN 'archive' THEN 15000 -- 15s avg access time
      ELSE 1000
    END) as avg_access_time_ms

  FROM GRIDFS_FILES('media_files')
  WHERE upload_timestamp >= CURRENT_TIMESTAMP - INTERVAL '365 days'
  GROUP BY storage_tier
),

optimization_base AS (
  -- Raw per-file figures are computed first so the derived expressions in the
  -- next CTE can reference days_since_access without repeating the EXTRACT call
  SELECT 
    file_id,
    filename,
    storage_tier,
    file_size_bytes,
    access_frequency_score,
    EXTRACT(DAY FROM (CURRENT_TIMESTAMP - last_accessed)) as days_since_access,
    download_count
  FROM GRIDFS_FILES('media_files')
  WHERE upload_timestamp >= CURRENT_TIMESTAMP - INTERVAL '365 days'
),

optimization_opportunities AS (
  SELECT 
    ob.*,

    -- Current cost
    ob.file_size_bytes * CASE ob.storage_tier
      WHEN 'hot' THEN 0.023
      WHEN 'warm' THEN 0.0125  
      WHEN 'cold' THEN 0.004
      WHEN 'archive' THEN 0.001
      ELSE 0.023
    END / 1024.0 / 1024.0 / 1024.0 as current_monthly_cost_usd,

    -- Recommended tier based on access patterns
    CASE 
      WHEN ob.access_frequency_score > 0.7 OR ob.days_since_access <= 7 THEN 'hot'
      WHEN ob.access_frequency_score > 0.3 OR ob.days_since_access <= 30 THEN 'warm'
      WHEN ob.access_frequency_score > 0.1 OR ob.days_since_access <= 90 THEN 'cold'
      ELSE 'archive'
    END as recommended_tier,

    -- Cost per GB/month at the recommended tier (used for the savings calculation below)
    CASE 
      WHEN ob.access_frequency_score > 0.7 OR ob.days_since_access <= 7 THEN 0.023
      WHEN ob.access_frequency_score > 0.3 OR ob.days_since_access <= 30 THEN 0.0125
      WHEN ob.access_frequency_score > 0.1 OR ob.days_since_access <= 90 THEN 0.004
      ELSE 0.001
    END as recommended_cost_per_gb

  FROM optimization_base ob
)

SELECT 
  css.storage_tier as current_tier,
  css.file_count,
  ROUND(css.total_bytes / 1024.0 / 1024.0 / 1024.0, 2) as storage_gb,
  ROUND(css.avg_file_size / 1024.0 / 1024.0, 2) as avg_file_size_mb,
  ROUND(css.avg_access_frequency, 3) as avg_access_frequency,
  ROUND(css.avg_days_since_access, 1) as avg_days_since_access,
  ROUND(css.estimated_monthly_cost_usd, 2) as current_monthly_cost_usd,
  ROUND(css.avg_access_time_ms, 0) as avg_access_time_ms,

  -- Optimization analysis
  (SELECT COUNT(*) 
   FROM optimization_opportunities oo 
   WHERE oo.storage_tier = css.storage_tier 
   AND oo.recommended_tier != oo.storage_tier) as files_needing_optimization,

  -- Net savings if recommendations are applied (negative when promotions to hotter tiers dominate)
  (SELECT SUM(oo.current_monthly_cost_usd - 
              (oo.file_size_bytes * oo.recommended_cost_per_gb / 1024.0 / 1024.0 / 1024.0))
   FROM optimization_opportunities oo 
   WHERE oo.storage_tier = css.storage_tier 
   AND oo.recommended_tier != oo.storage_tier) as potential_monthly_savings_usd,

  -- Tier health assessment
  CASE 
    WHEN css.avg_access_frequency < 0.1 AND css.storage_tier = 'hot' THEN 'overprovisioned'
    WHEN css.avg_access_frequency > 0.6 AND css.storage_tier IN ('cold', 'archive') THEN 'underprovisioned' 
    WHEN css.avg_days_since_access > 90 AND css.storage_tier IN ('hot', 'warm') THEN 'tier_too_hot'
    WHEN css.avg_days_since_access < 30 AND css.storage_tier IN ('cold', 'archive') THEN 'tier_too_cold'
    ELSE 'appropriately_tiered'
  END as tier_health_status,

  -- Recommendations
  CASE 
    WHEN css.avg_access_frequency < 0.1 AND css.storage_tier = 'hot' THEN 'Move files to cold or archive storage'
    WHEN css.avg_access_frequency > 0.6 AND css.storage_tier IN ('cold', 'archive') THEN 'Move files to hot storage'
    WHEN css.avg_days_since_access > 180 AND css.storage_tier != 'archive' THEN 'Consider archiving old files'
    ELSE 'Current tiering appears appropriate'
  END as optimization_recommendation

FROM current_storage_state css
ORDER BY css.estimated_monthly_cost_usd DESC;

-- QueryLeaf provides comprehensive GridFS capabilities:
-- 1. SQL-familiar syntax for MongoDB GridFS bucket configuration and management
-- 2. Advanced file upload and download operations with progress tracking
-- 3. Comprehensive metadata management and content analysis integration
-- 4. Intelligent storage tier management with cost optimization
-- 5. File search and retrieval with advanced filtering and relevance scoring
-- 6. Performance monitoring and optimization recommendations
-- 7. Enterprise security and compliance features built-in
-- 8. Automated file processing pipelines with thumbnail generation
-- 9. Storage efficiency analysis with deduplication and compression
-- 10. Production-ready file management with scalable architecture
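
The tier recommendations surfaced by the view above still need to be applied by the application. As a rough, driver-level illustration of what that could look like, the following Python sketch (using the official PyMongo driver) walks the files collection of a hypothetical media_files bucket and records the recommended tier on each file's metadata document. The connection details, database name, and the metadata fields access_frequency_score, last_accessed, and storage_tier are assumptions that mirror the SQL examples, not built-in GridFS fields.

# tier_rebalance.py - minimal sketch, assuming access statistics are tracked in file metadata
from datetime import datetime, timezone

from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017", tz_aware=True)  # assumed connection details
files = client["media_app"]["media_files.files"]  # files collection of the 'media_files' bucket


def recommend_tier(score: float, days_idle: float) -> str:
    """Mirror the tier thresholds used in the SQL view above."""
    if score > 0.7 or days_idle <= 7:
        return "hot"
    if score > 0.3 or days_idle <= 30:
        return "warm"
    if score > 0.1 or days_idle <= 90:
        return "cold"
    return "archive"


def rebalance_tiers(batch_size: int = 500) -> int:
    """Record the recommended storage tier on each file document."""
    now = datetime.now(timezone.utc)
    updates, moved = [], 0

    # 'access_frequency_score', 'last_accessed', and 'storage_tier' are
    # application-defined metadata fields assumed by this sketch.
    for doc in files.find({}, {"metadata": 1}):
        meta = doc.get("metadata") or {}
        last_accessed = meta.get("last_accessed")
        days_idle = (now - last_accessed).total_seconds() / 86400 if last_accessed else 9999
        target = recommend_tier(meta.get("access_frequency_score", 0.0), days_idle)

        if meta.get("storage_tier") != target:
            updates.append(UpdateOne({"_id": doc["_id"]},
                                     {"$set": {"metadata.storage_tier": target}}))
        if len(updates) >= batch_size:
            moved += files.bulk_write(updates).modified_count
            updates = []

    if updates:
        moved += files.bulk_write(updates).modified_count
    return moved


if __name__ == "__main__":
    print(f"Re-tiered {rebalance_tiers()} files")

The sketch only records the tiering decision; actually relocating the bytes (for example, to an archival bucket or an external cold store) would be a separate step in the same job.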

Best Practices for Production GridFS Deployment

File Storage Architecture Design Principles

Essential principles for effective MongoDB GridFS production deployment:

  1. Bucket Design Strategy: Organize files into logical buckets based on content type, access patterns, and retention requirements (see the driver-level sketch after this list)
  2. Chunk Size Optimization: Configure appropriate chunk sizes based on file types and access patterns for optimal performance
  3. Metadata Management: Design comprehensive metadata schemas for efficient searching, categorization, and content management
  4. Storage Tier Strategy: Implement intelligent storage tiering based on file access frequency and business requirements
  5. Security Integration: Establish comprehensive access controls, encryption, and audit logging for enterprise security
  6. Performance Monitoring: Monitor upload/download performance, storage efficiency, and system resource utilization
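
As a concrete illustration of the first three principles, the short Python sketch below (PyMongo) creates a dedicated bucket with an explicit chunk size, attaches a searchable metadata document at upload time, and adds a supporting index on the bucket's files collection. The bucket name, chunk size, metadata fields, and file path are illustrative assumptions rather than requirements.

# upload_with_metadata.py - minimal sketch of bucket, chunk size, and metadata conventions
from datetime import datetime, timezone

from gridfs import GridFSBucket
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection details
db = client["media_app"]  # assumed database name

# Principle 1: a dedicated bucket per content family.
# Principle 2: chunk size tuned to the content (1 MiB here for large video files).
video_bucket = GridFSBucket(db, bucket_name="video_files", chunk_size_bytes=1024 * 1024)

# Principle 3: a metadata document carrying everything the application will filter or sort on.
metadata = {
    "content_type": "video/mp4",
    "category": "training",
    "tags": ["onboarding", "2024"],
    "uploaded_by": "jane.doe",
    "storage_tier": "hot",
    "uploaded_at": datetime.now(timezone.utc),
}

with open("intro.mp4", "rb") as source:  # illustrative local file path
    file_id = video_bucket.upload_from_stream("intro.mp4", source, metadata=metadata)

# Secondary index on the bucket's files collection keeps metadata searches efficient.
db["video_files.files"].create_index(
    [("metadata.category", ASCENDING), ("metadata.uploaded_at", ASCENDING)]
)

print(f"Stored file {file_id}")

The same pattern extends naturally to separate buckets for thumbnails, documents, and archives, each with a chunk size and index set matched to its typical file size and access pattern.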

Enterprise File Management

Design GridFS systems for enterprise-scale file operations:

  1. Content Processing Pipeline: Implement automated file processing for thumbnails, content analysis, and format optimization
  2. Disaster Recovery: Design backup strategies and cross-region replication for business continuity
  3. Compliance Management: Ensure file operations meet regulatory requirements and data retention policies
  4. API Integration: Build RESTful APIs and SDK integrations for seamless application development
  5. Monitoring and Alerting: Implement comprehensive monitoring for storage usage, performance, and operational health
  6. Capacity Planning: Monitor growth patterns and plan storage capacity and performance requirements (a minimal aggregation sketch follows this list)
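
For the monitoring and capacity-planning items, per-bucket growth can be reported straight from the bucket's files collection, because GridFS stores length and uploadDate on every file document. The Python sketch below aggregates stored volume by month and content type for a hypothetical media_files bucket; metadata.content_type is an application-defined field, and the connection details are placeholders.

# storage_report.py - minimal sketch of a monthly capacity report over one GridFS bucket
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection details
files = client["media_app"]["media_files.files"]  # files collection of the 'media_files' bucket

pipeline = [
    {
        "$group": {
            "_id": {
                "month": {"$dateToString": {"format": "%Y-%m", "date": "$uploadDate"}},
                "content_type": "$metadata.content_type",  # application-defined metadata field
            },
            "file_count": {"$sum": 1},
            "total_bytes": {"$sum": "$length"},  # 'length' is a built-in GridFS files field
        }
    },
    {
        "$project": {
            "_id": 0,
            "month": "$_id.month",
            "content_type": "$_id.content_type",
            "file_count": 1,
            "total_gb": {"$round": [{"$divide": ["$total_bytes", 1024 * 1024 * 1024]}, 2]},
        }
    },
    {"$sort": {"month": 1, "total_gb": -1}},
]

for row in files.aggregate(pipeline):
    print(row)

Fed into the alerting pipeline over time, these figures provide the growth trend needed for capacity planning.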

Conclusion

MongoDB GridFS provides comprehensive large file storage capabilities that enable sophisticated binary data management, efficient streaming operations, and integrated metadata handling through distributed chunk-based storage, automatic replication, and transactional consistency. The native file management tools and streaming interfaces ensure that applications can handle large files efficiently with minimal infrastructure complexity.

Key MongoDB GridFS benefits include:

  • Efficient Binary Storage: Advanced chunk-based storage with compression, deduplication, and intelligent space optimization
  • Integrated Metadata Management: Comprehensive metadata handling with full-text search, tagging, and content analysis capabilities
  • Streaming Operations: High-performance upload and download streaming with progress tracking and parallel processing
  • Distributed Architecture: Built-in replication and distributed storage through MongoDB's replica set technology
  • Transaction Integration: Full transactional consistency between file operations and database operations within MongoDB
  • SQL Accessibility: Familiar SQL-style file management operations through QueryLeaf for accessible binary data operations

Whether you're building document management systems, media streaming platforms, enterprise content repositories, or distributed file storage solutions, MongoDB GridFS with QueryLeaf's familiar SQL interface provides the foundation for sophisticated, scalable file operations.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB GridFS operations while providing SQL-familiar syntax for file storage, retrieval, and management. Advanced file processing, content analysis, and storage optimization are seamlessly handled through familiar SQL constructs, making sophisticated binary data management accessible to SQL-oriented development teams.

The combination of MongoDB GridFS's robust file storage capabilities with SQL-style file operations makes it an ideal platform for applications requiring both large file handling and familiar database management patterns, ensuring your file storage infrastructure can scale efficiently while maintaining operational simplicity and developer productivity.