MongoDB Transaction Error Handling and Recovery Patterns: Building Resilient Applications with Advanced Error Management and Automatic Retry Strategies
Production MongoDB applications require error handling and recovery mechanisms that can gracefully manage transaction failures, network interruptions, server unavailability, and resource constraints while preserving data consistency and application reliability. Traditional database error handling rarely accounts for these distributed-system failure modes, which leads to incomplete transactions, data inconsistencies, and poor user experiences when complex failure scenarios occur.
MongoDB provides comprehensive transaction error handling through retryable error labels, detailed error classification, and recovery patterns that let applications stay consistent and reliable through network partitions, replica set failovers, and resource contention. Unlike databases that surface only raw error codes, MongoDB attaches error labels such as TransientTransactionError and UnknownTransactionCommitResult to failed operations, so drivers and applications can tell which failures are safe to retry and which require intervention.
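For reference, the pattern below is a minimal sketch of the label-based retry loop described in the MongoDB transaction documentation, assuming an already-connected MongoClient; the function and variable names are illustrative. It retries the whole transaction on TransientTransactionError and retries only the commit on UnknownTransactionCommitResult.

// Minimal label-based retry loop (sketch). Assumes `client` is a connected MongoClient
// and `txnFunc(session)` performs the transaction's reads and writes using the session.
async function runWithTransactionRetry(client, txnFunc) {
  const session = client.startSession();
  try {
    // Retry the whole transaction while errors carry the TransientTransactionError label
    while (true) {
      try {
        session.startTransaction({
          readConcern: { level: 'snapshot' },
          writeConcern: { w: 'majority' }
        });
        await txnFunc(session);

        // Retry only the commit while its outcome is unknown
        while (true) {
          try {
            await session.commitTransaction();
            return;
          } catch (commitError) {
            if (commitError.hasErrorLabel && commitError.hasErrorLabel('UnknownTransactionCommitResult')) {
              continue; // commit may or may not have applied -- safe to retry the commit itself
            }
            throw commitError;
          }
        }
      } catch (error) {
        try { await session.abortTransaction(); } catch (abortError) { /* best-effort cleanup */ }
        if (error.hasErrorLabel && error.hasErrorLabel('TransientTransactionError')) {
          continue; // e.g. transient write conflict or primary failover -- rerun the transaction
        }
        throw error;
      }
    }
  } finally {
    await session.endSession();
  }
}

The MongoTransactionManager class later in this article layers richer error analysis, metrics collection, and backoff on top of this basic loop.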
The Traditional Transaction Error Handling Challenge
Conventional approaches to database transaction error management in enterprise applications face significant limitations in resilience and recovery capabilities:
-- Traditional PostgreSQL transaction error handling - basic error management with limited recovery options
-- Simple transaction error tracking table
CREATE TABLE transaction_error_log (
error_id SERIAL PRIMARY KEY,
transaction_id UUID,
connection_id VARCHAR(100),
-- Basic error information
error_code VARCHAR(20),
error_message TEXT,
error_category VARCHAR(50), -- connection, constraint, timeout, etc.
-- Timing information
transaction_start_time TIMESTAMP,
error_occurred_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- Context information (limited)
table_name VARCHAR(100),
operation_type VARCHAR(20), -- INSERT, UPDATE, DELETE, SELECT
affected_rows INTEGER,
-- Simple retry tracking
retry_count INTEGER DEFAULT 0,
max_retries INTEGER DEFAULT 3,
retry_successful BOOLEAN DEFAULT FALSE,
-- Manual resolution tracking
resolved_at TIMESTAMP,
resolution_method VARCHAR(100),
resolved_by VARCHAR(100)
);
-- Basic transaction state tracking
CREATE TABLE active_transactions (
transaction_id UUID PRIMARY KEY,
connection_id VARCHAR(100) NOT NULL,
start_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- Simple state management
transaction_status VARCHAR(20) DEFAULT 'active', -- active, committed, rolled_back, failed
isolation_level VARCHAR(30),
read_only BOOLEAN DEFAULT FALSE,
-- Basic operation tracking
operations_count INTEGER DEFAULT 0,
tables_affected TEXT[], -- Simple array of table names
-- Timeout management (basic)
timeout_seconds INTEGER DEFAULT 300,
last_activity TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- Error tracking
error_count INTEGER DEFAULT 0,
last_error_message TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Manual transaction recovery procedure (limited functionality)
CREATE OR REPLACE FUNCTION recover_failed_transaction(
p_transaction_id UUID,
p_recovery_strategy VARCHAR(50) DEFAULT 'rollback'
) RETURNS TABLE (
recovery_status VARCHAR(20),
recovery_message TEXT,
operations_recovered INTEGER
) AS $$
DECLARE
v_transaction_record RECORD;
v_recovery_count INTEGER := 0;
v_retry_count INTEGER;
v_max_retries INTEGER;
BEGIN
-- Get transaction details
SELECT * INTO v_transaction_record
FROM active_transactions
WHERE transaction_id = p_transaction_id;
IF NOT FOUND THEN
RETURN QUERY SELECT 'error'::VARCHAR(20),
'Transaction not found'::TEXT,
0::INTEGER;
RETURN;
END IF;
-- Check retry limits (basic logic)
SELECT retry_count, max_retries INTO v_retry_count, v_max_retries
FROM transaction_error_log
WHERE transaction_id = p_transaction_id
ORDER BY error_occurred_at DESC
LIMIT 1;
IF v_retry_count >= v_max_retries THEN
RETURN QUERY SELECT 'failed'::VARCHAR(20),
'Maximum retries exceeded'::TEXT,
0::INTEGER;
RETURN;
END IF;
-- Simple recovery strategies
CASE p_recovery_strategy
WHEN 'rollback' THEN
BEGIN
-- Attempt to rollback (very basic)
UPDATE active_transactions
SET transaction_status = 'rolled_back',
updated_at = CURRENT_TIMESTAMP
WHERE transaction_id = p_transaction_id;
v_recovery_count := 1;
RETURN QUERY SELECT 'success'::VARCHAR(20),
'Transaction rolled back'::TEXT,
v_recovery_count::INTEGER;
EXCEPTION WHEN OTHERS THEN
RETURN QUERY SELECT 'error'::VARCHAR(20),
SQLERRM::TEXT,
0::INTEGER;
END;
WHEN 'retry' THEN
BEGIN
-- Basic retry logic (very limited)
UPDATE transaction_error_log
SET retry_count = retry_count + 1,
retry_successful = FALSE
WHERE transaction_id = p_transaction_id;
-- Reset transaction status for retry
UPDATE active_transactions
SET transaction_status = 'active',
error_count = 0,
last_error_message = NULL,
updated_at = CURRENT_TIMESTAMP
WHERE transaction_id = p_transaction_id;
v_recovery_count := 1;
RETURN QUERY SELECT 'retry'::VARCHAR(20),
'Transaction queued for retry'::TEXT,
v_recovery_count::INTEGER;
EXCEPTION WHEN OTHERS THEN
RETURN QUERY SELECT 'error'::VARCHAR(20),
SQLERRM::TEXT,
0::INTEGER;
END;
ELSE
RETURN QUERY SELECT 'error'::VARCHAR(20),
'Unknown recovery strategy'::TEXT,
0::INTEGER;
END CASE;
END;
$$ LANGUAGE plpgsql;
-- Basic transaction monitoring query (limited insights)
WITH transaction_health AS (
SELECT
DATE_TRUNC('hour', start_time) as hour_bucket,
-- Simple transaction metrics
COUNT(*) as total_transactions,
COUNT(CASE WHEN transaction_status = 'committed' THEN 1 END) as successful_transactions,
COUNT(CASE WHEN transaction_status = 'rolled_back' THEN 1 END) as rolled_back_transactions,
COUNT(CASE WHEN transaction_status = 'failed' THEN 1 END) as failed_transactions,
COUNT(CASE WHEN transaction_status = 'active' AND
EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - last_activity)) > timeout_seconds
THEN 1 END) as timed_out_transactions,
-- Basic performance metrics
AVG(operations_count) as avg_operations_per_transaction,
AVG(EXTRACT(EPOCH FROM (updated_at - start_time))) as avg_transaction_duration_seconds,
-- Simple error analysis
AVG(error_count) as avg_errors_per_transaction,
COUNT(CASE WHEN error_count > 0 THEN 1 END) as transactions_with_errors
FROM active_transactions
WHERE start_time >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
GROUP BY DATE_TRUNC('hour', start_time)
),
error_analysis AS (
SELECT
DATE_TRUNC('hour', error_occurred_at) as hour_bucket,
error_category,
-- Error statistics
COUNT(*) as error_count,
COUNT(CASE WHEN retry_successful = TRUE THEN 1 END) as successful_retries,
AVG(retry_count) as avg_retry_attempts,
-- Common errors
COUNT(CASE WHEN error_code LIKE 'SQLSTATE%' THEN 1 END) as sql_state_errors,
COUNT(CASE WHEN error_message ILIKE '%timeout%' THEN 1 END) as timeout_errors,
COUNT(CASE WHEN error_message ILIKE '%connection%' THEN 1 END) as connection_errors,
COUNT(CASE WHEN error_message ILIKE '%deadlock%' THEN 1 END) as deadlock_errors
FROM transaction_error_log
WHERE error_occurred_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
GROUP BY DATE_TRUNC('hour', error_occurred_at), error_category
)
SELECT
th.hour_bucket,
-- Transaction metrics
th.total_transactions,
th.successful_transactions,
th.failed_transactions,
ROUND((th.successful_transactions::DECIMAL / GREATEST(th.total_transactions, 1)) * 100, 2) as success_rate_percent,
-- Performance metrics
ROUND(th.avg_transaction_duration_seconds, 3) as avg_duration_seconds,
ROUND(th.avg_operations_per_transaction, 1) as avg_operations,
-- Error metrics
COALESCE(SUM(ea.error_count), 0) as total_errors,
COALESCE(SUM(ea.successful_retries), 0) as successful_retries,
COALESCE(ROUND(AVG(ea.avg_retry_attempts), 1), 0) as avg_retry_attempts,
-- Error categories
COALESCE(SUM(ea.timeout_errors), 0) as timeout_errors,
COALESCE(SUM(ea.connection_errors), 0) as connection_errors,
COALESCE(SUM(ea.deadlock_errors), 0) as deadlock_errors,
-- Health indicators
th.timed_out_transactions,
CASE
WHEN ROUND((th.successful_transactions::DECIMAL / GREATEST(th.total_transactions, 1)) * 100, 2) >= 95 THEN 'Healthy'
WHEN ROUND((th.successful_transactions::DECIMAL / GREATEST(th.total_transactions, 1)) * 100, 2) >= 90 THEN 'Warning'
ELSE 'Critical'
END as health_status
FROM transaction_health th
LEFT JOIN error_analysis ea ON th.hour_bucket = ea.hour_bucket
GROUP BY th.hour_bucket, th.total_transactions, th.successful_transactions,
th.failed_transactions, th.avg_transaction_duration_seconds,
th.avg_operations_per_transaction, th.timed_out_transactions
ORDER BY th.hour_bucket DESC;
-- Problems with traditional transaction error handling:
-- 1. Basic error categorization with limited diagnostic information
-- 2. Manual retry logic without intelligent backoff strategies
-- 3. No automatic recovery based on error type and context
-- 4. Limited visibility into transaction state and progress
-- 5. Basic timeout handling without consideration of operation complexity
-- 6. No integration with connection pool health and server status
-- 7. Manual intervention required for most recovery scenarios
-- 8. Limited support for distributed transaction patterns
-- 9. Basic error aggregation without trend analysis
-- 10. No automatic optimization based on error patterns
MongoDB's error reporting, combined with application-level retry logic, addresses these limitations:
// MongoDB advanced transaction error handling - intelligent and resilient
const { MongoClient } = require('mongodb');
// Comprehensive transaction error handling and recovery system
class MongoTransactionManager {
constructor(client, options = {}) {
this.client = client;
this.options = {
// Retry configuration
maxRetryAttempts: options.maxRetryAttempts || 5,
initialRetryDelayMs: options.initialRetryDelayMs || 100,
maxRetryDelayMs: options.maxRetryDelayMs || 5000,
retryDelayMultiplier: options.retryDelayMultiplier || 2,
jitterFactor: options.jitterFactor || 0.1,
// Transaction configuration
defaultTransactionOptions: {
readConcern: { level: options.readConcernLevel || 'snapshot' },
writeConcern: { w: options.writeConcernW || 'majority', j: true },
readPreference: options.readPreference || 'primary',
maxCommitTimeMS: options.maxCommitTimeMS || 10000
},
// Error handling configuration
retryableErrorCodes: options.retryableErrorCodes || [
112, // WriteConflict
117, // ConflictingOperationInProgress
133, // FailedToSatisfyReadPreference
134, // ReadConcernMajorityNotAvailableYet
208, // ExceededTimeLimit
225, // LockTimeout
// 244 (TransactionTooLarge) is intentionally excluded -- it is not retryable (see analyzeTransactionError)
251, // NoSuchTransaction
256, // TransactionAborted
261, // ExceededMaxTimeMS
263, // TemporarilyUnavailable
6 // HostUnreachable
],
// Monitoring configuration
// Use !== false so callers can explicitly disable these (|| true would always evaluate to true)
enableDetailedLogging: options.enableDetailedLogging !== false,
enableMetricsCollection: options.enableMetricsCollection !== false
};
this.transactionMetrics = {
totalTransactions: 0,
successfulTransactions: 0,
failedTransactions: 0,
retriedTransactions: 0,
totalRetryAttempts: 0,
errorsByCode: new Map(),
errorsByCategory: new Map(),
performanceStats: {
averageTransactionDuration: 0,
transactionDurations: [],
retryDelays: [],
averageRetryDelay: 0
}
};
this.activeTransactions = new Map();
}
// Execute transaction with comprehensive error handling and retry logic
async executeTransactionWithRetry(transactionFunction, transactionOptions = {}) {
const transactionId = this.generateTransactionId();
const startTime = Date.now();
// Merge transaction options
const mergedOptions = {
...this.options.defaultTransactionOptions,
...transactionOptions
};
let attempt = 1;
let lastError = null;
let session = null;
// Track active transaction
this.activeTransactions.set(transactionId, {
id: transactionId,
startTime: startTime,
attempt: attempt,
status: 'active',
operationsExecuted: 0,
errors: []
});
try {
while (attempt <= this.options.maxRetryAttempts) {
try {
// Create new session for each attempt
session = this.client.startSession();
this.log(`Starting transaction ${transactionId}, attempt ${attempt}`);
// Update transaction tracking
this.updateTransactionStatus(transactionId, 'active', { attempt });
// Execute transaction with intelligent error handling
const result = await session.withTransaction(
async (sessionContext) => {
try {
// Execute the user-provided transaction function
const transactionResult = await transactionFunction(sessionContext, {
transactionId,
attempt,
onOperation: (operation) => this.trackOperation(transactionId, operation)
});
this.log(`Transaction ${transactionId} executed successfully on attempt ${attempt}`);
return transactionResult;
} catch (error) {
this.log(`Transaction ${transactionId} error in user function:`, error);
throw error;
}
},
mergedOptions
);
// Transaction successful
const duration = Date.now() - startTime;
this.updateTransactionStatus(transactionId, 'committed', {
duration,
totalAttempts: attempt
});
this.recordSuccessfulTransaction(transactionId, duration, attempt);
this.log(`Transaction ${transactionId} committed successfully after ${attempt} attempts (${duration}ms)`);
return {
success: true,
result: result,
transactionId: transactionId,
attempts: attempt,
duration: duration,
metrics: this.getTransactionMetrics(transactionId)
};
} catch (error) {
lastError = error;
this.log(`Transaction ${transactionId} attempt ${attempt} failed:`, error);
// Record error for analysis
this.recordTransactionError(transactionId, error, attempt);
// Analyze error and determine if retry is appropriate
const errorAnalysis = this.analyzeTransactionError(error);
if (!errorAnalysis.retryable || attempt >= this.options.maxRetryAttempts) {
// Error is not retryable or max attempts reached
this.updateTransactionStatus(transactionId, 'failed', {
finalError: error,
totalAttempts: attempt,
errorAnalysis
});
break;
}
// Calculate intelligent retry delay
const retryDelay = this.calculateRetryDelay(attempt, errorAnalysis);
this.log(`Transaction ${transactionId} will retry in ${retryDelay}ms (attempt ${attempt + 1}/${this.options.maxRetryAttempts})`);
// Update metrics
this.transactionMetrics.totalRetryAttempts++;
this.transactionMetrics.performanceStats.retryDelays.push(retryDelay);
// Wait before retry
if (retryDelay > 0) {
await this.sleep(retryDelay);
}
attempt++;
} finally {
// Always close session
if (session) {
try {
await session.endSession();
} catch (sessionError) {
this.log(`Error ending session for transaction ${transactionId}:`, sessionError);
}
}
}
}
// Retries exhausted or the error was not retryable; `attempt` is the number of attempts actually made
const totalDuration = Date.now() - startTime;
this.recordFailedTransaction(transactionId, lastError, attempt, totalDuration);
this.log(`Transaction ${transactionId} failed after ${attempt} attempts (${totalDuration}ms)`);
return {
success: false,
error: lastError,
transactionId: transactionId,
attempts: attempt,
duration: totalDuration,
errorAnalysis: this.analyzeTransactionError(lastError),
metrics: this.getTransactionMetrics(transactionId),
recoveryRecommendations: this.generateRecoveryRecommendations(transactionId, lastError)
};
} finally {
// Clean up transaction tracking
this.activeTransactions.delete(transactionId);
}
}
// Intelligent error analysis for MongoDB transactions
analyzeTransactionError(error) {
const analysis = {
errorCode: error.code,
errorMessage: error.message,
errorName: error.name,
retryable: false,
category: 'unknown',
severity: 'medium',
recommendedAction: 'investigate',
estimatedRecoveryTime: 0,
contextualInfo: {}
};
// Categorize error based on code and message
if (error.code) {
// Transient errors that should be retried
if (this.options.retryableErrorCodes.includes(error.code)) {
analysis.retryable = true;
analysis.category = this.categorizeMongoError(error.code);
analysis.severity = 'low';
analysis.recommendedAction = 'retry';
analysis.estimatedRecoveryTime = this.estimateRecoveryTime(error.code);
}
// Specific error code analysis
switch (error.code) {
case 112: // WriteConflict
analysis.category = 'concurrency';
analysis.recommendedAction = 'retry_with_backoff';
analysis.contextualInfo.suggestion = 'Consider optimizing transaction scope to reduce conflicts';
break;
case 117: // ConflictingOperationInProgress
analysis.category = 'concurrency';
analysis.recommendedAction = 'retry_with_longer_delay';
analysis.contextualInfo.suggestion = 'Wait for conflicting operation to complete';
break;
case 133: // FailedToSatisfyReadPreference
analysis.category = 'availability';
analysis.recommendedAction = 'check_replica_set_status';
analysis.contextualInfo.suggestion = 'Verify replica set member availability';
break;
case 208: // ExceededTimeLimit
case 261: // ExceededMaxTimeMS
analysis.category = 'timeout';
analysis.recommendedAction = 'optimize_or_increase_timeout';
analysis.contextualInfo.suggestion = 'Consider breaking transaction into smaller operations';
break;
case 244: // TransactionTooLarge
analysis.category = 'resource';
analysis.retryable = false;
analysis.severity = 'high';
analysis.recommendedAction = 'reduce_transaction_size';
analysis.contextualInfo.suggestion = 'Split transaction into smaller operations';
break;
case 251: // NoSuchTransaction
analysis.category = 'state';
analysis.recommendedAction = 'restart_transaction';
analysis.contextualInfo.suggestion = 'Transaction may have been cleaned up by server';
break;
case 256: // TransactionAborted
analysis.category = 'aborted';
analysis.recommendedAction = 'retry_full_transaction';
analysis.contextualInfo.suggestion = 'Transaction was aborted due to conflict or timeout';
break;
}
}
// Network-related errors
if (error.message && (
error.message.includes('network') ||
error.message.includes('connection') ||
error.message.includes('timeout') ||
error.message.includes('unreachable')
)) {
analysis.retryable = true;
analysis.category = 'network';
analysis.recommendedAction = 'retry_with_exponential_backoff';
analysis.estimatedRecoveryTime = 5000; // 5 seconds
analysis.contextualInfo.suggestion = 'Check network connectivity and server status';
}
// Resource exhaustion errors
if (error.message && (
error.message.includes('memory') ||
error.message.includes('disk space') ||
error.message.includes('too many connections')
)) {
analysis.retryable = true;
analysis.category = 'resource';
analysis.severity = 'high';
analysis.recommendedAction = 'wait_for_resources';
analysis.estimatedRecoveryTime = 10000; // 10 seconds
analysis.contextualInfo.suggestion = 'Monitor server resource usage';
}
return analysis;
}
categorizeMongoError(errorCode) {
const errorCategories = {
112: 'concurrency', // WriteConflict
117: 'concurrency', // ConflictingOperationInProgress
133: 'availability', // FailedToSatisfyReadPreference
134: 'availability', // ReadConcernMajorityNotAvailableYet
208: 'timeout', // ExceededTimeLimit
225: 'concurrency', // LockTimeout
244: 'resource', // TransactionTooLarge
251: 'state', // NoSuchTransaction
256: 'aborted', // TransactionAborted
261: 'timeout', // ExceededMaxTimeMS
263: 'availability', // TemporarilyUnavailable
6: 'network' // HostUnreachable
};
return errorCategories[errorCode] || 'unknown';
}
estimateRecoveryTime(errorCode) {
const recoveryTimes = {
112: 100, // WriteConflict - quick retry
117: 500, // ConflictingOperationInProgress - wait for operation
133: 2000, // FailedToSatisfyReadPreference - wait for replica
134: 1000, // ReadConcernMajorityNotAvailableYet - wait for majority
208: 5000, // ExceededTimeLimit - wait before retry
225: 200, // LockTimeout - quick retry
251: 100, // NoSuchTransaction - immediate retry
256: 300, // TransactionAborted - short wait
261: 3000, // ExceededMaxTimeMS - moderate wait
263: 1000, // TemporarilyUnavailable - short wait
6: 5000 // HostUnreachable - wait for network
};
return recoveryTimes[errorCode] || 1000;
}
// Calculate intelligent retry delay with exponential backoff and jitter
calculateRetryDelay(attemptNumber, errorAnalysis) {
// Base delay calculation with exponential backoff
let baseDelay = Math.min(
this.options.initialRetryDelayMs * Math.pow(this.options.retryDelayMultiplier, attemptNumber - 1),
this.options.maxRetryDelayMs
);
// Adjust based on error analysis
if (errorAnalysis.estimatedRecoveryTime > 0) {
baseDelay = Math.max(baseDelay, errorAnalysis.estimatedRecoveryTime);
}
// Add jitter to prevent thundering herd
const jitterRange = baseDelay * this.options.jitterFactor;
const jitter = (Math.random() * 2 - 1) * jitterRange; // Random value between -jitterRange and +jitterRange
const finalDelay = Math.max(0, Math.floor(baseDelay + jitter));
this.log(`Calculated retry delay: base=${baseDelay}ms, jitter=${jitter.toFixed(1)}ms, final=${finalDelay}ms`);
return finalDelay;
}
// Generate recovery recommendations based on error patterns
generateRecoveryRecommendations(transactionId, error) {
const recommendations = [];
const errorAnalysis = this.analyzeTransactionError(error);
// Category-specific recommendations
switch (errorAnalysis.category) {
case 'concurrency':
recommendations.push({
type: 'optimization',
priority: 'medium',
description: 'Optimize transaction scope to reduce write conflicts',
actions: [
'Consider breaking large transactions into smaller operations',
'Review document access patterns for optimization opportunities',
'Implement optimistic locking where appropriate'
]
});
break;
case 'timeout':
recommendations.push({
type: 'configuration',
priority: 'high',
description: 'Address transaction timeout issues',
actions: [
'Increase maxCommitTimeMS if operations are legitimately slow',
'Optimize query performance with proper indexing',
'Consider breaking complex operations into smaller transactions'
]
});
break;
case 'resource':
recommendations.push({
type: 'scaling',
priority: 'high',
description: 'Address resource constraints',
actions: [
'Monitor server resource usage (CPU, memory, disk)',
'Consider vertical or horizontal scaling',
'Implement connection pooling optimization'
]
});
break;
case 'network':
recommendations.push({
type: 'infrastructure',
priority: 'high',
description: 'Address network connectivity issues',
actions: [
'Check network connectivity between application and database',
'Verify MongoDB server status and availability',
'Consider implementing circuit breaker pattern'
]
});
break;
case 'availability':
recommendations.push({
type: 'deployment',
priority: 'high',
description: 'Address replica set availability',
actions: [
'Check replica set member status',
'Verify read preference configuration',
'Monitor replica lag and catch-up status'
]
});
break;
}
// Pattern-based recommendations
const transactionHistory = this.getTransactionHistory(transactionId);
if (transactionHistory && transactionHistory.errors.length > 1) {
// Check for recurring error patterns
const errorCodes = transactionHistory.errors.map(e => e.errorCode); // error entries store errorCode, not code
const uniqueErrorCodes = [...new Set(errorCodes)];
if (uniqueErrorCodes.length === 1) {
recommendations.push({
type: 'pattern',
priority: 'high',
description: 'Recurring error pattern detected',
actions: [
`Address root cause of error ${uniqueErrorCodes[0]}`,
'Consider implementing circuit breaker pattern',
'Review application architecture for reliability improvements'
]
});
}
}
return recommendations;
}
// Advanced transaction monitoring and metrics collection
recordSuccessfulTransaction(transactionId, duration, attempts) {
this.transactionMetrics.totalTransactions++;
this.transactionMetrics.successfulTransactions++;
if (attempts > 1) {
this.transactionMetrics.retriedTransactions++;
}
// Update performance statistics
this.transactionMetrics.performanceStats.transactionDurations.push(duration);
// Keep only recent durations for average calculation
if (this.transactionMetrics.performanceStats.transactionDurations.length > 1000) {
this.transactionMetrics.performanceStats.transactionDurations =
this.transactionMetrics.performanceStats.transactionDurations.slice(-500);
}
// Recalculate average
this.transactionMetrics.performanceStats.averageTransactionDuration =
this.transactionMetrics.performanceStats.transactionDurations.reduce((sum, d) => sum + d, 0) /
this.transactionMetrics.performanceStats.transactionDurations.length;
this.log(`Transaction ${transactionId} metrics recorded: duration=${duration}ms, attempts=${attempts}`);
}
recordFailedTransaction(transactionId, error, attempts, duration) {
this.transactionMetrics.totalTransactions++;
this.transactionMetrics.failedTransactions++;
if (attempts > 1) {
this.transactionMetrics.retriedTransactions++;
}
// Record error statistics
const errorCode = error.code || 'unknown';
const currentCount = this.transactionMetrics.errorsByCode.get(errorCode) || 0;
this.transactionMetrics.errorsByCode.set(errorCode, currentCount + 1);
const errorCategory = this.categorizeMongoError(error.code);
const currentCategoryCount = this.transactionMetrics.errorsByCategory.get(errorCategory) || 0;
this.transactionMetrics.errorsByCategory.set(errorCategory, currentCategoryCount + 1);
this.log(`Transaction ${transactionId} failure recorded: error=${errorCode}, attempts=${attempts}, duration=${duration}ms`);
}
recordTransactionError(transactionId, error, attempt) {
const transaction = this.activeTransactions.get(transactionId);
if (transaction) {
transaction.errors.push({
attempt: attempt,
error: error,
timestamp: new Date(),
errorCode: error.code,
errorMessage: error.message,
analysis: this.analyzeTransactionError(error)
});
}
}
updateTransactionStatus(transactionId, status, additionalInfo = {}) {
const transaction = this.activeTransactions.get(transactionId);
if (transaction) {
transaction.status = status;
transaction.lastUpdated = new Date();
Object.assign(transaction, additionalInfo);
}
}
trackOperation(transactionId, operation) {
const transaction = this.activeTransactions.get(transactionId);
if (transaction) {
transaction.operationsExecuted++;
transaction.lastOperation = {
type: operation.type,
collection: operation.collection,
timestamp: new Date()
};
}
}
getTransactionMetrics(transactionId) {
const transaction = this.activeTransactions.get(transactionId);
return {
transactionId: transactionId,
operationsExecuted: transaction ? transaction.operationsExecuted : 0,
errors: transaction ? transaction.errors : [],
status: transaction ? transaction.status : 'unknown',
startTime: transaction ? transaction.startTime : null,
duration: transaction ? Date.now() - transaction.startTime : 0
};
}
getTransactionHistory(transactionId) {
return this.activeTransactions.get(transactionId);
}
// Comprehensive transaction health monitoring
getTransactionHealthReport() {
const report = {
timestamp: new Date(),
overall: {
totalTransactions: this.transactionMetrics.totalTransactions,
successfulTransactions: this.transactionMetrics.successfulTransactions,
failedTransactions: this.transactionMetrics.failedTransactions,
retriedTransactions: this.transactionMetrics.retriedTransactions,
totalRetryAttempts: this.transactionMetrics.totalRetryAttempts,
successRate: this.transactionMetrics.totalTransactions > 0 ?
(this.transactionMetrics.successfulTransactions / this.transactionMetrics.totalTransactions) * 100 : 0,
retryRate: this.transactionMetrics.totalTransactions > 0 ?
(this.transactionMetrics.retriedTransactions / this.transactionMetrics.totalTransactions) * 100 : 0
},
performance: {
averageTransactionDuration: this.transactionMetrics.performanceStats.averageTransactionDuration,
averageRetryDelay: this.transactionMetrics.performanceStats.retryDelays.length > 0 ?
this.transactionMetrics.performanceStats.retryDelays.reduce((sum, d) => sum + d, 0) /
this.transactionMetrics.performanceStats.retryDelays.length : 0,
totalRecentTransactions: this.transactionMetrics.performanceStats.transactionDurations.length
},
errors: {
byCode: Object.fromEntries(this.transactionMetrics.errorsByCode),
byCategory: Object.fromEntries(this.transactionMetrics.errorsByCategory),
mostCommonError: this.getMostCommonError(),
mostCommonCategory: this.getMostCommonErrorCategory()
},
activeTransactions: {
count: this.activeTransactions.size,
transactions: Array.from(this.activeTransactions.values()).map(t => ({
id: t.id,
status: t.status,
duration: Date.now() - t.startTime,
attempts: t.attempt,
operationsExecuted: t.operationsExecuted,
errorCount: t.errors ? t.errors.length : 0
}))
}
};
return report;
}
getMostCommonError() {
let maxCount = 0;
let mostCommonError = null;
for (const [errorCode, count] of this.transactionMetrics.errorsByCode.entries()) {
if (count > maxCount) {
maxCount = count;
mostCommonError = { code: errorCode, count: count };
}
}
return mostCommonError;
}
getMostCommonErrorCategory() {
let maxCount = 0;
let mostCommonCategory = null;
for (const [category, count] of this.transactionMetrics.errorsByCategory.entries()) {
if (count > maxCount) {
maxCount = count;
mostCommonCategory = { category: category, count: count };
}
}
return mostCommonCategory;
}
// Utility methods
generateTransactionId() {
return `txn_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`;
}
sleep(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
log(message, error = null) {
if (this.options.enableDetailedLogging) {
const timestamp = new Date().toISOString();
if (error) {
console.log(`[${timestamp}] ${message}`, error);
} else {
console.log(`[${timestamp}] ${message}`);
}
}
}
}
// Example usage with comprehensive error handling
async function demonstrateTransactionErrorHandling() {
const client = new MongoClient('mongodb://localhost:27017');
await client.connect();
const transactionManager = new MongoTransactionManager(client, {
maxRetryAttempts: 3,
initialRetryDelayMs: 100,
maxRetryDelayMs: 5000,
enableDetailedLogging: true,
enableMetricsCollection: true
});
try {
// Example transaction with comprehensive error handling
const result = await transactionManager.executeTransactionWithRetry(
async (session, context) => {
const { transactionId, attempt } = context;
console.log(`Executing business logic for transaction ${transactionId}, attempt ${attempt}`);
const db = client.db('ecommerce');
const ordersCollection = db.collection('orders');
const inventoryCollection = db.collection('inventory');
const accountsCollection = db.collection('accounts');
// Track operations for monitoring
context.onOperation({ type: 'insert', collection: 'orders' });
context.onOperation({ type: 'update', collection: 'inventory' });
context.onOperation({ type: 'update', collection: 'accounts' });
// Complex business transaction
const order = {
orderId: `order_${Date.now()}`,
customerId: 'customer_123',
items: [
{ productId: 'prod_456', quantity: 2, price: 29.99 },
{ productId: 'prod_789', quantity: 1, price: 49.99 }
],
totalAmount: 109.97,
status: 'pending',
createdAt: new Date()
};
// Insert order
const orderResult = await ordersCollection.insertOne(order, { session });
// Update inventory
for (const item of order.items) {
const inventoryUpdate = await inventoryCollection.updateOne(
{ productId: item.productId, quantity: { $gte: item.quantity } },
{ $inc: { quantity: -item.quantity } },
{ session }
);
if (inventoryUpdate.modifiedCount === 0) {
throw new Error(`Insufficient inventory for product ${item.productId}`);
}
}
// Update customer account
await accountsCollection.updateOne(
{ customerId: order.customerId },
{
$inc: { totalOrders: 1, totalSpent: order.totalAmount },
$set: { lastOrderDate: new Date() }
},
{ session }
);
return {
orderId: order.orderId,
orderResult: orderResult,
message: 'Order processed successfully'
};
},
{
// Custom transaction options
maxCommitTimeMS: 15000,
readConcern: { level: 'snapshot' },
writeConcern: { w: 'majority', j: true }
}
);
if (result.success) {
console.log('Transaction completed successfully:', result);
} else {
console.error('Transaction failed after all retries:', result);
}
// Get comprehensive health report
const healthReport = transactionManager.getTransactionHealthReport();
console.log('Transaction Health Report:', JSON.stringify(healthReport, null, 2));
} catch (error) {
console.error('Unexpected error:', error);
} finally {
await client.close();
}
}
// Benefits of MongoDB intelligent transaction error handling:
// - Automatic retry logic with exponential backoff and jitter
// - Intelligent error classification and recovery recommendations
// - Comprehensive transaction state tracking and monitoring
// - Advanced performance metrics and health reporting
// - Context-aware error analysis and recovery strategies
// - Built-in support for MongoDB-specific error patterns
// - Detailed logging and diagnostic information
// - Integration with MongoDB driver optimization features
// - Automatic detection of retryable vs. non-retryable errors
// - Production-ready resilience and reliability patterns
Advanced Error Recovery Patterns
Sophisticated recovery strategies for production-grade MongoDB applications:
// Advanced MongoDB error recovery patterns for enterprise resilience
class MongoResilienceManager {
constructor(client, options = {}) {
this.client = client;
this.transactionManager = new MongoTransactionManager(client, options);
this.recoveryStrategies = new Map();
this.circuitBreakers = new Map();
this.healthCheckers = new Map();
this.options = {
// Circuit breaker configuration
circuitBreakerThreshold: options.circuitBreakerThreshold || 5,
circuitBreakerTimeout: options.circuitBreakerTimeout || 60000,
circuitBreakerVolumeThreshold: options.circuitBreakerVolumeThreshold || 10,
// Health check configuration
healthCheckInterval: options.healthCheckInterval || 30000,
healthCheckTimeout: options.healthCheckTimeout || 5000,
// Recovery configuration
// !== false allows callers to explicitly disable automatic recovery
enableAutomaticRecovery: options.enableAutomaticRecovery !== false,
maxRecoveryAttempts: options.maxRecoveryAttempts || 3
};
this.initialize();
}
initialize() {
// Set up circuit breakers for different operation types
this.setupCircuitBreakers();
// Initialize health monitoring
this.startHealthMonitoring();
// Register recovery strategies
this.registerRecoveryStrategies();
}
setupCircuitBreakers() {
const operationTypes = ['transaction', 'query', 'update', 'insert', 'delete'];
operationTypes.forEach(opType => {
this.circuitBreakers.set(opType, {
state: 'closed', // closed, open, half-open
failureCount: 0,
lastFailureTime: null,
successCount: 0,
totalRequests: 0,
threshold: this.options.circuitBreakerThreshold,
timeout: this.options.circuitBreakerTimeout,
volumeThreshold: this.options.circuitBreakerVolumeThreshold
});
});
}
// Execute operation with circuit breaker protection
async executeWithCircuitBreaker(operationType, operation) {
const circuitBreaker = this.circuitBreakers.get(operationType);
if (!circuitBreaker) {
throw new Error(`No circuit breaker configured for operation type: ${operationType}`);
}
// Check circuit breaker state
const canExecute = this.checkCircuitBreaker(circuitBreaker);
if (!canExecute) {
throw new Error(`Circuit breaker is OPEN for ${operationType}. Service temporarily unavailable.`);
}
try {
// Execute operation
const result = await operation();
// Record success
this.recordCircuitBreakerSuccess(circuitBreaker);
return result;
} catch (error) {
// Record failure
this.recordCircuitBreakerFailure(circuitBreaker);
throw error;
}
}
checkCircuitBreaker(circuitBreaker) {
const now = Date.now();
switch (circuitBreaker.state) {
case 'closed':
return true;
case 'open':
// Check if timeout has elapsed
if (now - circuitBreaker.lastFailureTime >= circuitBreaker.timeout) {
circuitBreaker.state = 'half-open';
return true;
}
return false;
case 'half-open':
return true;
default:
return false;
}
}
recordCircuitBreakerSuccess(circuitBreaker) {
circuitBreaker.successCount++;
circuitBreaker.totalRequests++;
if (circuitBreaker.state === 'half-open') {
// Reset circuit breaker on successful half-open request
circuitBreaker.state = 'closed';
circuitBreaker.failureCount = 0;
}
}
recordCircuitBreakerFailure(circuitBreaker) {
circuitBreaker.failureCount++;
circuitBreaker.totalRequests++;
circuitBreaker.lastFailureTime = Date.now();
// Check if should open circuit
if (circuitBreaker.totalRequests >= circuitBreaker.volumeThreshold &&
circuitBreaker.failureCount >= circuitBreaker.threshold) {
circuitBreaker.state = 'open';
console.log(`Circuit breaker opened due to ${circuitBreaker.failureCount} failures`);
}
}
// Comprehensive transaction execution with full resilience features
async executeResilientTransaction(transactionFunction, options = {}) {
const operationType = 'transaction';
return await this.executeWithCircuitBreaker(operationType, async () => {
// Execute transaction with comprehensive error handling
const result = await this.transactionManager.executeTransactionWithRetry(
transactionFunction,
options
);
// If transaction failed, attempt recovery if enabled
if (!result.success && this.options.enableAutomaticRecovery) {
const recoveryResult = await this.attemptTransactionRecovery(result);
if (recoveryResult && recoveryResult.success) {
return recoveryResult;
}
}
return result;
});
}
// Intelligent transaction recovery based on error patterns
async attemptTransactionRecovery(failedResult) {
const { error, transactionId, attempts, errorAnalysis } = failedResult;
console.log(`Attempting recovery for failed transaction ${transactionId}`);
// Get appropriate recovery strategy
const recoveryStrategy = this.getRecoveryStrategy(errorAnalysis);
if (!recoveryStrategy) {
console.log(`No recovery strategy available for error category: ${errorAnalysis.category}`);
return null;
}
try {
const recoveryResult = await recoveryStrategy.execute(failedResult);
console.log(`Recovery attempt completed for transaction ${transactionId}:`, recoveryResult);
return recoveryResult;
} catch (recoveryError) {
console.error(`Recovery failed for transaction ${transactionId}:`, recoveryError);
return null;
}
}
registerRecoveryStrategies() {
// Network connectivity recovery
this.recoveryStrategies.set('network', {
execute: async (failedResult) => {
console.log('Executing network recovery strategy');
// Wait for network to recover
await this.waitForNetworkRecovery();
// Check server connectivity
const healthOk = await this.performHealthCheck();
if (healthOk) {
console.log('Network recovery successful, retrying transaction');
// Could retry the transaction here if the original function is available
return { success: true, recovered: true, strategy: 'network' };
}
return { success: false, recovered: false, strategy: 'network' };
}
});
// Resource recovery
this.recoveryStrategies.set('resource', {
execute: async (failedResult) => {
console.log('Executing resource recovery strategy');
// Wait for resources to become available
await this.waitForResourceAvailability();
// Check resource status
const resourceStatus = await this.checkResourceStatus(); // returns { healthy, ... }
if (resourceStatus.healthy) {
console.log('Resource recovery successful');
return { success: true, recovered: true, strategy: 'resource' };
}
return { success: false, recovered: false, strategy: 'resource' };
}
});
// Availability recovery (replica set issues)
this.recoveryStrategies.set('availability', {
execute: async (failedResult) => {
console.log('Executing availability recovery strategy');
// Check replica set status
const replicaSetStatus = await this.checkReplicaSetHealth(); // returns { healthy, ... }
if (replicaSetStatus.healthy) {
console.log('Availability recovery successful');
return { success: true, recovered: true, strategy: 'availability' };
}
// Wait for replica set to recover
await this.waitForReplicaSetRecovery();
const recoveredStatus = await this.checkReplicaSetHealth();
return {
success: recoveredStatus.healthy,
recovered: recoveredStatus.healthy,
strategy: 'availability'
};
}
});
}
getRecoveryStrategy(errorAnalysis) {
return this.recoveryStrategies.get(errorAnalysis.category);
}
// Health monitoring and recovery assistance
startHealthMonitoring() {
setInterval(async () => {
try {
await this.performComprehensiveHealthCheck();
} catch (error) {
console.error('Health monitoring error:', error);
}
}, this.options.healthCheckInterval);
}
async performComprehensiveHealthCheck() {
const healthStatus = {
timestamp: new Date(),
overall: 'unknown',
components: {}
};
try {
// Check basic connectivity
healthStatus.components.connectivity = await this.checkConnectivity();
// Check replica set status
healthStatus.components.replicaSet = await this.checkReplicaSetHealth();
// Check resource status
healthStatus.components.resources = await this.checkResourceStatus();
// Check circuit breaker status
healthStatus.components.circuitBreakers = this.getCircuitBreakerStatus();
// Check transaction manager health
healthStatus.components.transactionManager = this.transactionManager.getTransactionHealthReport();
// Determine overall health
const componentStatuses = Object.values(healthStatus.components);
const healthyComponents = componentStatuses.filter(status =>
status === true || (typeof status === 'object' && status.healthy !== false)
);
if (healthyComponents.length === componentStatuses.length) {
healthStatus.overall = 'healthy';
} else if (healthyComponents.length >= componentStatuses.length * 0.7) {
healthStatus.overall = 'degraded';
} else {
healthStatus.overall = 'unhealthy';
}
// Store health status
this.lastHealthStatus = healthStatus;
return healthStatus;
} catch (error) {
healthStatus.overall = 'error';
healthStatus.error = error.message;
return healthStatus;
}
}
async checkConnectivity() {
try {
const admin = this.client.db('admin');
await admin.command({ ping: 1 }, { maxTimeMS: this.options.healthCheckTimeout });
return true;
} catch (error) {
return false;
}
}
async checkReplicaSetHealth() {
try {
const admin = this.client.db('admin');
const status = await admin.command({ replSetGetStatus: 1 });
// Check if majority of members are healthy
const healthyMembers = status.members.filter(member =>
member.health === 1 && ['PRIMARY', 'SECONDARY'].includes(member.stateStr)
);
return {
healthy: healthyMembers.length >= Math.ceil(status.members.length / 2),
totalMembers: status.members.length,
healthyMembers: healthyMembers.length,
primaryAvailable: status.members.some(m => m.stateStr === 'PRIMARY')
};
} catch (error) {
// Might not be a replica set or insufficient privileges
return { healthy: true, note: 'Replica set status unavailable' };
}
}
async checkResourceStatus() {
try {
const admin = this.client.db('admin');
const serverStatus = await admin.command({ serverStatus: 1 });
const memUsage = serverStatus.mem.resident / serverStatus.mem.virtual;
const connectionUsage = serverStatus.connections.current / serverStatus.connections.available;
return {
healthy: memUsage < 0.9 && connectionUsage < 0.9,
memoryUsage: memUsage,
connectionUsage: connectionUsage,
connections: serverStatus.connections,
memory: serverStatus.mem
};
} catch (error) {
return { healthy: false, error: error.message };
}
}
getCircuitBreakerStatus() {
const status = {};
for (const [opType, breaker] of this.circuitBreakers.entries()) {
status[opType] = {
state: breaker.state,
failureCount: breaker.failureCount,
successCount: breaker.successCount,
totalRequests: breaker.totalRequests,
failureRate: breaker.totalRequests > 0 ?
(breaker.failureCount / breaker.totalRequests) * 100 : 0
};
}
return status;
}
// Recovery assistance methods
async waitForNetworkRecovery() {
const maxWaitTime = 30000; // 30 seconds
const checkInterval = 1000; // 1 second
let waited = 0;
while (waited < maxWaitTime) {
try {
const connected = await this.checkConnectivity();
if (connected) {
return true;
}
} catch (error) {
// Continue waiting
}
await this.sleep(checkInterval);
waited += checkInterval;
}
return false;
}
async waitForResourceAvailability() {
const maxWaitTime = 60000; // 60 seconds
const checkInterval = 5000; // 5 seconds
let waited = 0;
while (waited < maxWaitTime) {
try {
const resourceStatus = await this.checkResourceStatus();
if (resourceStatus.healthy) {
return true;
}
} catch (error) {
// Continue waiting
}
await this.sleep(checkInterval);
waited += checkInterval;
}
return false;
}
async waitForReplicaSetRecovery() {
const maxWaitTime = 120000; // 2 minutes
const checkInterval = 10000; // 10 seconds
let waited = 0;
while (waited < maxWaitTime) {
try {
const replicaStatus = await this.checkReplicaSetHealth();
if (replicaStatus.healthy) {
return true;
}
} catch (error) {
// Continue waiting
}
await this.sleep(checkInterval);
waited += checkInterval;
}
return false;
}
async performHealthCheck() {
const health = await this.performComprehensiveHealthCheck();
return health.overall === 'healthy' || health.overall === 'degraded';
}
sleep(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
// Get comprehensive resilience report
getResilienceReport() {
return {
timestamp: new Date(),
circuitBreakers: this.getCircuitBreakerStatus(),
transactionHealth: this.transactionManager.getTransactionHealthReport(),
lastHealthCheck: this.lastHealthStatus,
recoveryStrategies: Array.from(this.recoveryStrategies.keys()),
configuration: {
circuitBreakerThreshold: this.options.circuitBreakerThreshold,
circuitBreakerTimeout: this.options.circuitBreakerTimeout,
healthCheckInterval: this.options.healthCheckInterval,
automaticRecoveryEnabled: this.options.enableAutomaticRecovery
}
};
}
}
SQL-Style Error Handling with QueryLeaf
QueryLeaf provides familiar approaches to MongoDB transaction error handling and monitoring:
-- QueryLeaf transaction error handling with SQL-familiar syntax
-- Monitor transaction error patterns
SELECT
DATE_TRUNC('hour', error_timestamp) as hour_bucket,
error_category,
error_code,
-- Error statistics
COUNT(*) as error_count,
COUNT(DISTINCT transaction_id) as affected_transactions,
AVG(retry_attempts) as avg_retry_attempts,
COUNT(CASE WHEN recovery_successful = true THEN 1 END) as successful_recoveries,
-- Performance impact
AVG(transaction_duration_ms) as avg_failed_transaction_duration,
AVG(time_to_failure_ms) as avg_time_to_failure,
-- Recovery metrics
AVG(recovery_time_ms) as avg_recovery_time,
MAX(recovery_time_ms) as max_recovery_time,
-- Success rates
ROUND((COUNT(CASE WHEN recovery_successful = true THEN 1 END)::DECIMAL / COUNT(*)) * 100, 2) as recovery_success_rate
FROM TRANSACTION_ERROR_LOG()
WHERE error_timestamp >= NOW() - INTERVAL '24 hours'
GROUP BY DATE_TRUNC('hour', error_timestamp), error_category, error_code
ORDER BY hour_bucket DESC, error_count DESC;
-- Analyze transaction resilience patterns
WITH transaction_resilience AS (
SELECT
transaction_id,
transaction_type,
-- Transaction characteristics
operation_count,
total_duration_ms,
retry_attempts,
-- Error analysis
first_error_code,
first_error_category,
total_errors,
-- Recovery analysis
recovery_strategy_used,
recovery_successful,
recovery_duration_ms,
-- Final outcome
final_status, -- committed, failed, recovered
-- Timing analysis
created_at,
completed_at
FROM TRANSACTION_HISTORY()
WHERE created_at >= NOW() - INTERVAL '7 days'
),
resilience_patterns AS (
SELECT
transaction_type,
first_error_category,
-- Volume metrics
COUNT(*) as transaction_count,
COUNT(CASE WHEN final_status = 'committed' THEN 1 END) as successful_transactions,
COUNT(CASE WHEN final_status = 'recovered' THEN 1 END) as recovered_transactions,
COUNT(CASE WHEN final_status = 'failed' THEN 1 END) as failed_transactions,
-- Retry analysis
AVG(retry_attempts) as avg_retry_attempts,
MAX(retry_attempts) as max_retry_attempts,
COUNT(CASE WHEN retry_attempts > 0 THEN 1 END) as transactions_with_retries,
-- Recovery analysis
COUNT(CASE WHEN recovery_strategy_used IS NOT NULL THEN 1 END) as recovery_attempts,
COUNT(CASE WHEN recovery_successful = true THEN 1 END) as successful_recoveries,
AVG(recovery_duration_ms) as avg_recovery_duration,
-- Performance metrics
AVG(total_duration_ms) as avg_total_duration,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_duration_ms) as p95_duration,
-- Success rates
ROUND((COUNT(CASE WHEN final_status IN ('committed', 'recovered') THEN 1 END)::DECIMAL / COUNT(*)) * 100, 2) as overall_success_rate,
ROUND((COUNT(CASE WHEN recovery_successful = true THEN 1 END)::DECIMAL /
GREATEST(COUNT(CASE WHEN recovery_strategy_used IS NOT NULL THEN 1 END), 1)) * 100, 2) as recovery_success_rate
FROM transaction_resilience
GROUP BY transaction_type, first_error_category
)
SELECT
transaction_type,
first_error_category,
-- Volume and success metrics
transaction_count,
successful_transactions,
recovered_transactions,
failed_transactions,
overall_success_rate,
-- Retry patterns
avg_retry_attempts,
max_retry_attempts,
ROUND((transactions_with_retries::DECIMAL / transaction_count) * 100, 2) as retry_rate_percent,
-- Recovery effectiveness
recovery_attempts,
successful_recoveries,
recovery_success_rate,
avg_recovery_duration,
-- Performance characteristics
avg_total_duration,
p95_duration,
-- Health assessment
CASE
WHEN overall_success_rate >= 99 THEN 'Excellent'
WHEN overall_success_rate >= 95 THEN 'Good'
WHEN overall_success_rate >= 90 THEN 'Fair'
ELSE 'Poor'
END as resilience_grade,
-- Recommendations
CASE
WHEN recovery_success_rate < 50 AND recovery_attempts > 0 THEN 'Improve recovery strategies'
WHEN avg_retry_attempts > 3 THEN 'Review retry configuration'
WHEN failed_transactions > successful_transactions * 0.1 THEN 'Investigate error root causes'
ELSE 'Performance acceptable'
END as recommendation
FROM resilience_patterns
ORDER BY transaction_count DESC, overall_success_rate ASC;
-- Real-time transaction health monitoring
SELECT
-- Current status
COUNT(CASE WHEN status = 'active' THEN 1 END) as active_transactions,
COUNT(CASE WHEN status = 'retrying' THEN 1 END) as retrying_transactions,
COUNT(CASE WHEN status = 'recovering' THEN 1 END) as recovering_transactions,
COUNT(CASE WHEN status = 'failed' THEN 1 END) as failed_transactions,
-- Recent performance (last 5 minutes)
AVG(CASE WHEN completed_at >= NOW() - INTERVAL '5 minutes'
THEN duration_ms END) as recent_avg_duration_ms,
COUNT(CASE WHEN completed_at >= NOW() - INTERVAL '5 minutes'
AND final_status = 'committed' THEN 1 END) as recent_successful_transactions,
COUNT(CASE WHEN completed_at >= NOW() - INTERVAL '5 minutes'
AND final_status = 'failed' THEN 1 END) as recent_failed_transactions,
-- Error rates
ROUND((COUNT(CASE WHEN error_occurred_at >= NOW() - INTERVAL '5 minutes' THEN 1 END)::DECIMAL /
GREATEST(COUNT(CASE WHEN created_at >= NOW() - INTERVAL '5 minutes' THEN 1 END), 1)) * 100, 2)
as recent_error_rate_percent,
-- Circuit breaker status
COUNT(CASE WHEN circuit_breaker_state = 'open' THEN 1 END) as open_circuit_breakers,
COUNT(CASE WHEN circuit_breaker_state = 'half-open' THEN 1 END) as half_open_circuit_breakers,
-- Recovery metrics
COUNT(CASE WHEN recovery_in_progress = true THEN 1 END) as active_recoveries,
AVG(CASE WHEN recovery_completed_at >= NOW() - INTERVAL '5 minutes'
THEN recovery_duration_ms END) as recent_avg_recovery_time_ms,
-- Health indicators
CASE
WHEN COUNT(CASE WHEN status = 'failed' THEN 1 END) >
COUNT(CASE WHEN status = 'active' THEN 1 END) * 0.5 THEN 'Critical'
WHEN COUNT(CASE WHEN circuit_breaker_state = 'open' THEN 1 END) > 0 THEN 'Degraded'
WHEN COUNT(CASE WHEN status = 'retrying' THEN 1 END) >
COUNT(CASE WHEN status = 'active' THEN 1 END) * 0.3 THEN 'Warning'
ELSE 'Healthy'
END as overall_health_status,
NOW() as report_timestamp
FROM ACTIVE_TRANSACTION_STATUS()
CROSS JOIN CIRCUIT_BREAKER_STATUS()
CROSS JOIN RECOVERY_STATUS();
-- Transaction error prevention and optimization
CREATE ALERT TRANSACTION_ERROR_PREVENTION
ON TRANSACTION_ERROR_LOG()
WHEN (
-- High error rate
(SELECT COUNT(*) FROM TRANSACTION_ERROR_LOG()
WHERE error_timestamp >= NOW() - INTERVAL '5 minutes') > 10
OR
-- Circuit breaker opened
(SELECT COUNT(*) FROM CIRCUIT_BREAKER_STATUS()
WHERE state = 'open') > 0
OR
-- Recovery failing
(SELECT AVG(CASE WHEN recovery_successful = true THEN 1.0 ELSE 0.0 END)
FROM TRANSACTION_ERROR_LOG()
WHERE error_timestamp >= NOW() - INTERVAL '15 minutes'
AND recovery_strategy_used IS NOT NULL) < 0.5
)
NOTIFY ['ops-team@example.com', 'dba-team@example.com']
WITH MESSAGE TEMPLATE '''
{% raw %}
Transaction Error Alert
Current Status:
- Recent Errors (5 min): {{ recent_error_count }}
- Open Circuit Breakers: {{ open_circuit_breaker_count }}
- Active Recoveries: {{ active_recovery_count }}
- Recovery Success Rate: {{ recovery_success_rate }}%
Top Error Categories:
{{ top_error_categories }}
Recommended Actions:
{{ error_prevention_recommendations }}
Dashboard: https://monitoring.company.com/mongodb/transactions
{% endraw %}
'''
EVERY 1 MINUTES;
-- QueryLeaf transaction error handling provides:
-- 1. SQL-familiar error monitoring and analysis
-- 2. Comprehensive transaction resilience reporting
-- 3. Real-time health monitoring and alerting
-- 4. Intelligent error pattern detection
-- 5. Recovery strategy effectiveness analysis
-- 6. Circuit breaker status monitoring
-- 7. Performance impact assessment
-- 8. Automated prevention and optimization recommendations
-- 9. Integration with MongoDB's native error handling
-- 10. Production-ready operational visibility
Best Practices for MongoDB Transaction Error Handling
Error Classification Strategy
Optimal error handling configuration differs by application pattern (a configuration sketch follows this list):
- High-Frequency Applications: Aggressive retry policies with intelligent backoff
- Mission-Critical Systems: Comprehensive recovery strategies with circuit breakers
- Batch Processing: Extended timeout configurations with resource monitoring
- Real-time Applications: Fast-fail approaches with immediate fallback mechanisms
- Microservices: Distributed error handling with service-level circuit breakers
- Analytics Workloads: Specialized error handling for long-running operations
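As an illustration, the sketch below maps a few of these patterns onto the MongoTransactionManager options shown earlier in this article. The numeric values are assumptions intended as starting points to tune against your own workload, not benchmarks or prescriptions.

// Hypothetical per-pattern presets for the MongoTransactionManager defined above.
const transactionProfiles = {
  // High-frequency OLTP: retry aggressively but keep delays short
  highFrequency: {
    maxRetryAttempts: 5,
    initialRetryDelayMs: 50,
    maxRetryDelayMs: 1000,
    jitterFactor: 0.2
  },
  // Mission-critical: fewer, more patient retries plus strict durability settings
  missionCritical: {
    maxRetryAttempts: 3,
    initialRetryDelayMs: 200,
    maxRetryDelayMs: 10000,
    writeConcernW: 'majority',
    maxCommitTimeMS: 30000
  },
  // Real-time / latency-sensitive: fail fast and fall back in the application layer
  realTime: {
    maxRetryAttempts: 1,
    initialRetryDelayMs: 0,
    maxCommitTimeMS: 2000
  }
};

// Example: select a profile when constructing the manager
// const manager = new MongoTransactionManager(client, transactionProfiles.highFrequency);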
Recovery Strategy Guidelines
Essential patterns for production transaction recovery (a usage sketch follows this list):
- Automatic Retry Logic: Exponential backoff with jitter for transient failures
- Circuit Breaker Pattern: Prevent cascading failures with intelligent state management
- Health Monitoring: Continuous assessment of system and transaction health
- Recovery Automation: Context-aware recovery strategies for different error types
- Performance Monitoring: Track error impact on application performance
- Operational Alerting: Proactive notification of error patterns and recovery issues
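Tying these guidelines together, the following sketch shows how the MongoResilienceManager defined above might wrap a business transaction so that retry logic, circuit breaking, and health monitoring all apply to a single call. The connection string, collection names, and threshold values are placeholders.

// Illustrative wiring of the resilience patterns above; names and values are assumptions.
const { MongoClient } = require('mongodb');

async function main() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  const resilience = new MongoResilienceManager(client, {
    circuitBreakerThreshold: 5,        // open after 5 failures...
    circuitBreakerVolumeThreshold: 10, // ...once at least 10 requests have been observed
    healthCheckInterval: 30000,
    enableAutomaticRecovery: true
  });

  // Circuit breaker + retry + recovery all apply to this one call
  const result = await resilience.executeResilientTransaction(async (session) => {
    const orders = client.db('ecommerce').collection('orders');
    await orders.insertOne({ orderId: `order_${Date.now()}`, status: 'pending' }, { session });
    return 'ok';
  });

  console.log(result.success ? 'committed' : `failed: ${result.error?.message}`);

  // Note: in this sketch the manager's setInterval health loop keeps the process alive
  // after client.close(); a production version would expose a shutdown method.
  await client.close();
}

main().catch(console.error);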
Conclusion
MongoDB transaction error handling and recovery requires sophisticated strategies that balance reliability, performance, and operational complexity. By implementing intelligent retry mechanisms, comprehensive error classification, and automated recovery patterns, applications can maintain consistency and reliability even when facing distributed system challenges.
Key error handling benefits include:
- Intelligent Recovery: Automatic retry logic with context-aware recovery strategies
- Comprehensive Monitoring: Detailed error tracking and performance analysis
- Circuit Breaker Protection: Prevention of cascading failures with intelligent state management
- Health Assessment: Continuous monitoring of transaction and system health
- Operational Visibility: Real-time insights into error patterns and recovery effectiveness
- Production Resilience: Enterprise-grade reliability patterns for mission-critical applications
Whether you're building high-throughput web applications, distributed microservices, data processing pipelines, or real-time analytics platforms, MongoDB's intelligent transaction error handling with QueryLeaf's familiar management interface provides the foundation for resilient, reliable database operations. This combination enables you to leverage advanced error recovery capabilities while maintaining familiar database administration patterns and operational procedures.
QueryLeaf Integration: QueryLeaf automatically translates SQL-familiar error handling patterns into optimal MongoDB transaction configurations while providing comprehensive monitoring and recovery through SQL-style queries. Advanced error classification, recovery automation, and performance analysis are seamlessly managed through familiar database administration interfaces, making sophisticated error handling both powerful and accessible.
The integration of intelligent error handling with SQL-style database operations makes MongoDB an ideal platform for applications requiring both high reliability and familiar error management patterns, ensuring your transactions remain both consistent and resilient as they scale to meet demanding production requirements.