MongoDB Full-Text Search and Advanced Indexing: SQL-Style Text Queries and Search Optimization
Modern applications require sophisticated search capabilities that go beyond simple pattern matching. Whether you're building e-commerce product catalogs, content management systems, or document repositories, users expect fast, relevant, and intelligent search functionality that can handle typos, synonyms, and complex queries across multiple languages.
Traditional database text search often relies on basic LIKE patterns or regular expressions, which are limited in functionality and performance. MongoDB's full-text search capabilities, combined with advanced indexing strategies, provide enterprise-grade search functionality that rivals dedicated search engines while maintaining the simplicity of database queries.
The Text Search Challenge
Basic text search approaches have significant limitations:
-- SQL basic text search limitations
-- Simple pattern matching - case sensitive, no relevance
SELECT product_name, description, price
FROM products
WHERE product_name LIKE '%laptop%'
OR description LIKE '%laptop%';
-- Problems: Case sensitivity, no stemming, no relevance scoring
-- Regular expressions - expensive and limited
SELECT title, content, author
FROM articles
WHERE content ~* '(machine|artificial|deep).*(learning|intelligence)';
-- Problems: No ranking, poor performance on large datasets
-- Multiple keyword search - complex and inefficient
SELECT *
FROM products
WHERE (LOWER(product_name) LIKE '%gaming%' OR LOWER(description) LIKE '%gaming%')
AND (LOWER(product_name) LIKE '%laptop%' OR LOWER(description) LIKE '%laptop%')
AND (LOWER(product_name) LIKE '%performance%' OR LOWER(description) LIKE '%performance%');
-- Problems: Complex syntax, no semantic understanding, poor performance
MongoDB's text search addresses these limitations:
// MongoDB advanced text search capabilities
db.products.find({
$text: {
$search: "gaming laptop performance",
$language: "english",
$caseSensitive: false,
$diacriticSensitive: false
}
}, {
score: { $meta: "textScore" }
}).sort({
score: { $meta: "textScore" }
});
// Results include:
// - Stemming: "games" matches "gaming"
// - Language-specific tokenization
// - Relevance scoring based on term frequency and position
// - Multi-field search across indexed text fields
// - Performance optimized with specialized text indexes
Text Indexing Fundamentals
Creating Text Indexes
Build comprehensive text search functionality with MongoDB text indexes:
// Basic text index creation
db.products.createIndex({
product_name: "text",
description: "text",
category: "text"
});
// Weighted text index for relevance tuning
db.products.createIndex({
product_name: "text",
description: "text",
tags: "text",
category: "text"
}, {
weights: {
product_name: 10, // Product name is most important
description: 5, // Description has medium importance
tags: 8, // Tags are highly relevant
category: 3 // Category provides context
},
name: "product_text_search",
default_language: "english",
language_override: "language"
});
// Compound index combining text search with other criteria
db.products.createIndex({
category: 1, // Standard index for filtering
price: 1, // Range queries
product_name: "text", // Text search
description: "text"
}, {
weights: {
product_name: 15,
description: 8
}
});
// Multi-language text index
db.articles.createIndex({
title: "text",
content: "text"
}, {
default_language: "english",
language_override: "lang", // Document field that specifies language
weights: {
title: 20,
content: 10
}
});
SQL-style text indexing concepts:
-- SQL full-text search equivalent patterns
-- Create full-text index on multiple columns
CREATE FULLTEXT INDEX ft_products_search
ON products (product_name, description, tags);
-- Weighted full-text search with relevance ranking
SELECT
product_id,
product_name,
description,
MATCH(product_name, description, tags)
AGAINST('gaming laptop performance' IN NATURAL LANGUAGE MODE) AS relevance_score
FROM products
WHERE MATCH(product_name, description, tags)
AGAINST('gaming laptop performance' IN NATURAL LANGUAGE MODE)
ORDER BY relevance_score DESC;
-- Boolean full-text search with operators
SELECT *
FROM products
WHERE MATCH(product_name, description)
AGAINST('+gaming +laptop -refurbished' IN BOOLEAN MODE);
-- Full-text search with additional filtering
SELECT
product_name,
price,
category,
MATCH(product_name, description)
AGAINST('high performance gaming' IN NATURAL LANGUAGE MODE) AS score
FROM products
WHERE price BETWEEN 1000 AND 3000
AND category = 'computers'
AND MATCH(product_name, description)
AGAINST('high performance gaming' IN NATURAL LANGUAGE MODE)
ORDER BY score DESC
LIMIT 20;
Advanced Text Search Queries
Implement sophisticated search patterns:
// Advanced text search implementation
class TextSearchService {
constructor(db) {
this.db = db;
this.productsCollection = db.collection('products');
}
async basicTextSearch(searchTerm, options = {}) {
const query = {
$text: {
$search: searchTerm,
$language: options.language || "english",
$caseSensitive: options.caseSensitive || false,
$diacriticSensitive: options.diacriticSensitive || false
}
};
// Add additional filters
if (options.category) {
query.category = options.category;
}
if (options.priceRange) {
query.price = {
$gte: options.priceRange.min,
$lte: options.priceRange.max
};
}
const results = await this.productsCollection.find(query, {
projection: {
product_name: 1,
description: 1,
price: 1,
category: 1,
score: { $meta: "textScore" }
}
})
.sort({ score: { $meta: "textScore" } })
.limit(options.limit || 20)
.toArray();
return results;
}
async phraseSearch(phrase, options = {}) {
// Exact phrase search using quoted strings
const query = {
$text: {
$search: `"${phrase}"`,
$language: options.language || "english"
}
};
return await this.productsCollection.find(query, {
projection: {
product_name: 1,
description: 1,
score: { $meta: "textScore" }
}
})
.sort({ score: { $meta: "textScore" } })
.limit(options.limit || 10)
.toArray();
}
async booleanTextSearch(searchExpression, options = {}) {
// Boolean search with inclusion/exclusion operators
const query = {
$text: {
$search: searchExpression, // e.g., "laptop gaming -refurbished"
$language: options.language || "english"
}
};
return await this.productsCollection.find(query, {
projection: {
product_name: 1,
description: 1,
price: 1,
score: { $meta: "textScore" }
}
})
.sort({ score: { $meta: "textScore" } })
.limit(options.limit || 20)
.toArray();
}
async fuzzySearch(searchTerm, options = {}) {
// Combine text search with regex for fuzzy matching
const textResults = await this.basicTextSearch(searchTerm, options);
// Fuzzy fallback using regex for typos/variations
if (textResults.length < 5) {
const fuzzyPattern = this.buildFuzzyPattern(searchTerm);
const regexQuery = {
$or: [
{ product_name: { $regex: fuzzyPattern, $options: 'i' } },
{ description: { $regex: fuzzyPattern, $options: 'i' } }
]
};
const fuzzyResults = await this.productsCollection.find(regexQuery)
.limit(10 - textResults.length)
.toArray();
return [...textResults, ...fuzzyResults];
}
return textResults;
}
buildFuzzyPattern(term) {
// Create regex pattern allowing character variations
const chars = term.split('');
const pattern = chars.map(char => {
return `${char}.*?`;
}).join('');
return pattern;
}
async searchWithFacets(searchTerm, facetFields = ['category', 'brand', 'price_range']) {
const pipeline = [
{
$match: {
$text: { $search: searchTerm }
}
},
{
$addFields: {
score: { $meta: "textScore" },
price_range: {
$switch: {
branches: [
{ case: { $lte: ["$price", 500] }, then: "Under $500" },
{ case: { $lte: ["$price", 1000] }, then: "$500 - $1000" },
{ case: { $lte: ["$price", 2000] }, then: "$1000 - $2000" },
{ case: { $gt: ["$price", 2000] }, then: "Over $2000" }
],
default: "Unknown"
}
}
}
},
{
$facet: {
results: [
{ $sort: { score: -1 } },
{ $limit: 20 },
{
$project: {
product_name: 1,
description: 1,
price: 1,
category: 1,
brand: 1,
score: 1
}
}
],
category_facets: [
{ $group: { _id: "$category", count: { $sum: 1 } } },
{ $sort: { count: -1 } }
],
brand_facets: [
{ $group: { _id: "$brand", count: { $sum: 1 } } },
{ $sort: { count: -1 } }
],
price_facets: [
{ $group: { _id: "$price_range", count: { $sum: 1 } } },
{ $sort: { _id: 1 } }
]
}
}
];
const facetResults = await this.productsCollection.aggregate(pipeline).toArray();
return facetResults[0];
}
async autoComplete(prefix, field = 'product_name', limit = 10) {
// Auto-completion using regex and text search
const pipeline = [
{
$match: {
[field]: { $regex: `^${prefix}`, $options: 'i' }
}
},
{
$group: {
_id: `$${field}`,
count: { $sum: 1 }
}
},
{
$sort: { count: -1 }
},
{
$limit: limit
},
{
$project: {
suggestion: "$_id",
frequency: "$count",
_id: 0
}
}
];
return await this.productsCollection.aggregate(pipeline).toArray();
}
}
Multi-Language Text Search
Support international search requirements:
// Multi-language text search implementation
class MultiLanguageSearchService {
constructor(db) {
this.db = db;
this.documentsCollection = db.collection('documents');
// Language-specific stemming and stop words
this.languageConfig = {
english: {
stopwords: ['the', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by'],
stemming: true
},
spanish: {
stopwords: ['el', 'la', 'y', 'o', 'pero', 'en', 'con', 'por', 'para', 'de'],
stemming: true
},
french: {
stopwords: ['le', 'la', 'et', 'ou', 'mais', 'dans', 'sur', 'avec', 'par', 'pour', 'de'],
stemming: true
}
};
}
async setupMultiLanguageIndexes() {
// Create language-specific text indexes
for (const [language, config] of Object.entries(this.languageConfig)) {
await this.documentsCollection.createIndex({
title: "text",
content: "text",
tags: "text"
}, {
name: `text_search_${language}`,
default_language: language,
language_override: "lang",
weights: {
title: 15,
content: 10,
tags: 8
}
});
}
// Create compound index with language field
await this.documentsCollection.createIndex({
language: 1,
title: "text",
content: "text"
}, {
name: "multilang_text_search"
});
}
async searchMultiLanguage(searchTerm, targetLanguage = null, options = {}) {
const query = {
$text: {
$search: searchTerm,
$language: targetLanguage || "english",
$caseSensitive: false,
$diacriticSensitive: false
}
};
// Filter by specific language if provided
if (targetLanguage) {
query.language = targetLanguage;
}
const pipeline = [
{ $match: query },
{
$addFields: {
score: { $meta: "textScore" },
// Boost score for exact language match
language_bonus: {
$cond: {
if: { $eq: ["$language", targetLanguage || "english"] },
then: 1.5,
else: 1.0
}
}
}
},
{
$addFields: {
adjusted_score: { $multiply: ["$score", "$language_bonus"] }
}
},
{
$sort: { adjusted_score: -1 }
},
{
$limit: options.limit || 20
},
{
$project: {
title: 1,
content: { $substr: ["$content", 0, 200] }, // Excerpt
language: 1,
author: 1,
created_at: 1,
score: "$adjusted_score"
}
}
];
return await this.documentsCollection.aggregate(pipeline).toArray();
}
async detectLanguage(text) {
// Simple language detection based on common words
const words = text.toLowerCase().split(/\s+/);
const languageScores = {};
for (const [language, config] of Object.entries(this.languageConfig)) {
const stopwordMatches = words.filter(word =>
config.stopwords.includes(word)
).length;
languageScores[language] = stopwordMatches / words.length;
}
// Return language with highest score
return Object.entries(languageScores)
.sort(([,a], [,b]) => b - a)[0][0];
}
async searchWithLanguageDetection(searchTerm, options = {}) {
// Auto-detect search term language
const detectedLanguage = await this.detectLanguage(searchTerm);
return await this.searchMultiLanguage(searchTerm, detectedLanguage, options);
}
async translateAndSearch(searchTerm, sourceLanguage, targetLanguages = ['english']) {
// This would integrate with translation services
const searchResults = new Map();
for (const targetLanguage of targetLanguages) {
// Placeholder for translation service integration
const translatedTerm = await this.translateTerm(searchTerm, sourceLanguage, targetLanguage);
const results = await this.searchMultiLanguage(translatedTerm, targetLanguage);
searchResults.set(targetLanguage, results);
}
return searchResults;
}
async translateTerm(term, from, to) {
// Placeholder for translation service
// In practice, integrate with Google Translate, AWS Translate, etc.
return term; // Return original term for now
}
}
Advanced Search Features
Search Analytics and Optimization
Track and optimize search performance:
// Search analytics and performance optimization
class SearchAnalytics {
constructor(db) {
this.db = db;
this.searchLogsCollection = db.collection('search_logs');
this.productsCollection = db.collection('products');
}
async logSearchQuery(searchData) {
const logEntry = {
search_term: searchData.query,
user_id: searchData.userId,
session_id: searchData.sessionId,
timestamp: new Date(),
results_count: searchData.resultsCount,
clicked_results: [],
execution_time_ms: searchData.executionTime,
search_type: searchData.searchType, // basic, fuzzy, phrase, etc.
filters_applied: searchData.filters || {},
user_agent: searchData.userAgent,
ip_address: searchData.ipAddress
};
await this.searchLogsCollection.insertOne(logEntry);
return logEntry._id;
}
async trackSearchClick(searchLogId, clickedResult) {
await this.searchLogsCollection.updateOne(
{ _id: searchLogId },
{
$push: {
clicked_results: {
result_id: clickedResult.id,
result_position: clickedResult.position,
clicked_at: new Date()
}
}
}
);
}
async getSearchAnalytics(timeframe = 7) {
const since = new Date(Date.now() - timeframe * 24 * 60 * 60 * 1000);
const pipeline = [
{
$match: {
timestamp: { $gte: since }
}
},
{
$group: {
_id: {
search_term: { $toLower: "$search_term" },
date: { $dateToString: { format: "%Y-%m-%d", date: "$timestamp" } }
},
search_count: { $sum: 1 },
avg_results: { $avg: "$results_count" },
avg_execution_time: { $avg: "$execution_time_ms" },
unique_users: { $addToSet: "$user_id" },
click_through_rate: {
$avg: {
$cond: {
if: { $gt: [{ $size: "$clicked_results" }, 0] },
then: 1,
else: 0
}
}
}
}
},
{
$addFields: {
unique_user_count: { $size: "$unique_users" }
}
},
{
$sort: { search_count: -1 }
},
{
$limit: 100
}
];
return await this.searchLogsCollection.aggregate(pipeline).toArray();
}
async getPopularSearchTerms(limit = 20) {
const pipeline = [
{
$match: {
timestamp: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) }
}
},
{
$group: {
_id: { $toLower: "$search_term" },
frequency: { $sum: 1 },
avg_results: { $avg: "$results_count" },
click_rate: {
$avg: {
$cond: {
if: { $gt: [{ $size: "$clicked_results" }, 0] },
then: 1,
else: 0
}
}
}
}
},
{
$match: {
frequency: { $gte: 2 } // Only terms searched more than once
}
},
{
$sort: { frequency: -1 }
},
{
$limit: limit
},
{
$project: {
search_term: "$_id",
frequency: 1,
avg_results: { $round: ["$avg_results", 1] },
click_rate: { $round: ["$click_rate", 3] },
_id: 0
}
}
];
return await this.searchLogsCollection.aggregate(pipeline).toArray();
}
async identifyZeroResultQueries(limit = 50) {
const pipeline = [
{
$match: {
results_count: 0,
timestamp: { $gte: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) }
}
},
{
$group: {
_id: { $toLower: "$search_term" },
occurrence_count: { $sum: 1 },
last_searched: { $max: "$timestamp" }
}
},
{
$sort: { occurrence_count: -1 }
},
{
$limit: limit
},
{
$project: {
search_term: "$_id",
occurrence_count: 1,
last_searched: 1,
_id: 0
}
}
];
return await this.searchLogsCollection.aggregate(pipeline).toArray();
}
async optimizeSearchIndexes() {
// Analyze query patterns to optimize indexes
const searchPatterns = await this.getSearchAnalytics(30);
const optimizationRecommendations = [];
for (const pattern of searchPatterns) {
const searchTerm = pattern._id.search_term;
// Check if current indexes are efficient for common queries
const indexStats = await this.productsCollection.aggregate([
{ $indexStats: {} }
]).toArray();
// Analyze index usage for text searches
const textIndexUsage = indexStats.filter(stat =>
stat.name.includes('text') || stat.key.hasOwnProperty('_fts')
);
if (pattern.avg_execution_time > 100) { // Slow queries > 100ms
optimizationRecommendations.push({
issue: 'slow_search',
search_term: searchTerm,
avg_time: pattern.avg_execution_time,
recommendation: 'Consider adding compound index for frequent filters'
});
}
if (pattern.avg_results < 1) { // Very few results
optimizationRecommendations.push({
issue: 'low_recall',
search_term: searchTerm,
avg_results: pattern.avg_results,
recommendation: 'Consider fuzzy matching or synonym expansion'
});
}
}
return optimizationRecommendations;
}
async generateSearchSuggestions() {
// Generate search suggestions based on popular terms
const popularTerms = await this.getPopularSearchTerms(100);
const suggestions = [];
for (const term of popularTerms) {
// Extract keywords from successful searches
const keywords = term.search_term.split(' ').filter(word => word.length > 2);
for (const keyword of keywords) {
// Find related products to suggest similar searches
const relatedProducts = await this.productsCollection.find({
$text: { $search: keyword }
}, {
projection: { product_name: 1, category: 1, tags: 1 }
}).limit(5).toArray();
const relatedTerms = new Set();
relatedProducts.forEach(product => {
// Extract terms from product names and categories
const productWords = product.product_name.toLowerCase().split(/\s+/);
const categoryWords = product.category ? product.category.toLowerCase().split(/\s+/) : [];
const tagWords = product.tags ? product.tags.flatMap(tag => tag.toLowerCase().split(/\s+/)) : [];
[...productWords, ...categoryWords, ...tagWords].forEach(word => {
if (word.length > 2 && word !== keyword) {
relatedTerms.add(word);
}
});
});
if (relatedTerms.size > 0) {
suggestions.push({
base_term: keyword,
suggested_terms: Array.from(relatedTerms).slice(0, 5),
popularity: term.frequency
});
}
}
}
return suggestions.slice(0, 50); // Top 50 suggestions
}
}
Real-Time Search Suggestions
Implement dynamic search suggestions and autocomplete:
// Real-time search suggestions system
class SearchSuggestionEngine {
constructor(db) {
this.db = db;
this.suggestionsCollection = db.collection('search_suggestions');
this.productsCollection = db.collection('products');
}
async buildSuggestionIndex() {
// Create suggestions from product data
const products = await this.productsCollection.find({}, {
projection: {
product_name: 1,
category: 1,
brand: 1,
tags: 1,
description: 1
}
}).toArray();
const suggestionSet = new Set();
for (const product of products) {
// Extract searchable terms
const terms = this.extractTerms(product);
terms.forEach(term => suggestionSet.add(term));
}
// Convert to suggestion documents
const suggestionDocs = Array.from(suggestionSet).map(term => ({
text: term,
length: term.length,
frequency: 1, // Initial frequency
created_at: new Date()
}));
// Clear existing suggestions and insert new ones
await this.suggestionsCollection.deleteMany({});
if (suggestionDocs.length > 0) {
await this.suggestionsCollection.insertMany(suggestionDocs);
}
// Create indexes for fast prefix matching
await this.suggestionsCollection.createIndex({ text: 1 });
await this.suggestionsCollection.createIndex({ length: 1, frequency: -1 });
}
extractTerms(product) {
const terms = new Set();
// Product name - split and add individual words and phrases
if (product.product_name) {
const words = product.product_name.toLowerCase()
.replace(/[^\w\s]/g, ' ')
.split(/\s+/)
.filter(word => word.length >= 2);
words.forEach(word => terms.add(word));
// Add 2-word and 3-word phrases
for (let i = 0; i < words.length - 1; i++) {
terms.add(`${words[i]} ${words[i + 1]}`);
if (i < words.length - 2) {
terms.add(`${words[i]} ${words[i + 1]} ${words[i + 2]}`);
}
}
}
// Category and brand
if (product.category) {
terms.add(product.category.toLowerCase());
}
if (product.brand) {
terms.add(product.brand.toLowerCase());
}
// Tags
if (product.tags && Array.isArray(product.tags)) {
product.tags.forEach(tag => {
if (typeof tag === 'string') {
terms.add(tag.toLowerCase());
}
});
}
return Array.from(terms);
}
async getSuggestions(prefix, limit = 10) {
// Get suggestions starting with prefix
const suggestions = await this.suggestionsCollection.find({
text: { $regex: `^${prefix.toLowerCase()}`, $options: 'i' },
length: { $lte: 50 } // Reasonable length limit
})
.sort({ frequency: -1, length: 1 })
.limit(limit)
.project({ text: 1, frequency: 1, _id: 0 })
.toArray();
return suggestions.map(s => s.text);
}
async updateSuggestionFrequency(searchTerm) {
// Update frequency when user searches
await this.suggestionsCollection.updateOne(
{ text: searchTerm.toLowerCase() },
{
$inc: { frequency: 1 },
$set: { last_used: new Date() }
},
{ upsert: true }
);
}
async getFuzzySuggestions(term, maxDistance = 2, limit = 5) {
// Get fuzzy suggestions for typos
const pipeline = [
{
$project: {
text: 1,
frequency: 1,
distance: {
$function: {
body: function(text1, text2) {
// Levenshtein distance calculation
const a = text1.toLowerCase();
const b = text2.toLowerCase();
const matrix = [];
for (let i = 0; i <= b.length; i++) {
matrix[i] = [i];
}
for (let j = 0; j <= a.length; j++) {
matrix[0][j] = j;
}
for (let i = 1; i <= b.length; i++) {
for (let j = 1; j <= a.length; j++) {
if (b.charAt(i - 1) === a.charAt(j - 1)) {
matrix[i][j] = matrix[i - 1][j - 1];
} else {
matrix[i][j] = Math.min(
matrix[i - 1][j - 1] + 1,
matrix[i][j - 1] + 1,
matrix[i - 1][j] + 1
);
}
}
}
return matrix[b.length][a.length];
},
args: ["$text", term],
lang: "js"
}
}
}
},
{
$match: {
distance: { $lte: maxDistance }
}
},
{
$sort: {
distance: 1,
frequency: -1
}
},
{
$limit: limit
},
{
$project: {
text: 1,
distance: 1,
_id: 0
}
}
];
return await this.suggestionsCollection.aggregate(pipeline).toArray();
}
async contextualSuggestions(partialQuery, userContext = {}) {
// Provide contextual suggestions based on user behavior
const contextFilters = {};
if (userContext.previousSearches) {
// Weight suggestions based on user's search history
const historicalTerms = userContext.previousSearches.flatMap(search =>
search.split(' ')
);
contextFilters.historical_boost = {
$in: historicalTerms
};
}
if (userContext.category) {
// Boost suggestions from user's preferred category
contextFilters.category_match = userContext.category;
}
const pipeline = [
{
$match: {
text: { $regex: `^${partialQuery.toLowerCase()}`, $options: 'i' }
}
},
{
$addFields: {
context_score: {
$add: [
"$frequency",
// Boost for historical relevance
{
$cond: {
if: { $in: ["$text", userContext.previousSearches || []] },
then: 10,
else: 0
}
}
]
}
}
},
{
$sort: { context_score: -1, length: 1 }
},
{
$limit: 8
},
{
$project: { text: 1, _id: 0 }
}
];
const suggestions = await this.suggestionsCollection.aggregate(pipeline).toArray();
return suggestions.map(s => s.text);
}
}
Geospatial Text Search
Location-Based Search
Combine text search with geographic queries:
// Geospatial text search implementation
class GeoTextSearchService {
constructor(db) {
this.db = db;
this.businessesCollection = db.collection('businesses');
}
async setupGeoTextIndexes() {
// Create compound geospatial and text index
await this.businessesCollection.createIndex({
location: "2dsphere", // Geospatial index
name: "text", // Text search
description: "text",
tags: "text"
}, {
weights: {
name: 15,
description: 8,
tags: 10
}
});
// Alternative: separate indexes
await this.businessesCollection.createIndex({ location: "2dsphere" });
await this.businessesCollection.createIndex({
name: "text",
description: "text",
tags: "text"
});
}
async searchNearby(searchTerm, location, radius = 5000, options = {}) {
// Search for businesses near a location matching text criteria
const pipeline = [
{
$geoNear: {
near: {
type: "Point",
coordinates: [location.longitude, location.latitude]
},
distanceField: "distance_meters",
maxDistance: radius,
spherical: true,
query: {
$text: { $search: searchTerm }
}
}
},
{
$addFields: {
text_score: { $meta: "textScore" },
// Combine distance and text relevance scoring
combined_score: {
$add: [
{ $meta: "textScore" },
// Distance penalty (closer is better)
{ $multiply: [
{ $divide: [{ $subtract: [radius, "$distance_meters"] }, radius] },
5 // Distance weight factor
]}
]
}
}
},
{
$sort: { combined_score: -1 }
},
{
$limit: options.limit || 20
},
{
$project: {
name: 1,
description: 1,
address: 1,
location: 1,
distance_meters: { $round: ["$distance_meters", 0] },
text_score: { $round: ["$text_score", 2] },
combined_score: { $round: ["$combined_score", 2] }
}
}
];
return await this.businessesCollection.aggregate(pipeline).toArray();
}
async searchInArea(searchTerm, polygon, options = {}) {
// Search within a defined geographic area
const query = {
$and: [
{
location: {
$geoWithin: {
$geometry: polygon
}
}
},
{
$text: { $search: searchTerm }
}
]
};
return await this.businessesCollection.find(query, {
projection: {
name: 1,
description: 1,
address: 1,
location: 1,
score: { $meta: "textScore" }
}
})
.sort({ score: { $meta: "textScore" } })
.limit(options.limit || 20)
.toArray();
}
async clusterSearchResults(searchTerm, center, radius = 10000) {
// Group search results by geographic clusters
const pipeline = [
{
$match: {
$and: [
{
location: {
$geoWithin: {
$centerSphere: [
[center.longitude, center.latitude],
radius / 6378100 // Convert to radians (Earth radius in meters)
]
}
}
},
{
$text: { $search: searchTerm }
}
]
}
},
{
$addFields: {
text_score: { $meta: "textScore" },
// Create grid coordinates for clustering
grid_x: {
$floor: {
$multiply: [
{ $arrayElemAt: ["$location.coordinates", 0] },
1000 // Grid resolution
]
}
},
grid_y: {
$floor: {
$multiply: [
{ $arrayElemAt: ["$location.coordinates", 1] },
1000
]
}
}
}
},
{
$group: {
_id: {
grid_x: "$grid_x",
grid_y: "$grid_y"
},
businesses: {
$push: {
name: "$name",
location: "$location",
text_score: "$text_score",
address: "$address"
}
},
count: { $sum: 1 },
avg_score: { $avg: "$text_score" },
center_point: {
$avg: {
coordinates: "$location.coordinates"
}
}
}
},
{
$match: {
count: { $gte: 2 } // Only clusters with multiple businesses
}
},
{
$sort: { avg_score: -1 }
}
];
return await this.businessesCollection.aggregate(pipeline).toArray();
}
async spatialAutoComplete(prefix, location, radius = 10000, limit = 10) {
// Autocomplete suggestions based on nearby businesses
const pipeline = [
{
$match: {
location: {
$geoWithin: {
$centerSphere: [
[location.longitude, location.latitude],
radius / 6378100
]
}
}
}
},
{
$project: {
name: 1,
name_words: {
$split: [{ $toLower: "$name" }, " "]
}
}
},
{
$unwind: "$name_words"
},
{
$match: {
name_words: { $regex: `^${prefix.toLowerCase()}` }
}
},
{
$group: {
_id: "$name_words",
frequency: { $sum: 1 }
}
},
{
$sort: { frequency: -1 }
},
{
$limit: limit
},
{
$project: {
suggestion: "$_id",
frequency: 1,
_id: 0
}
}
];
return await this.businessesCollection.aggregate(pipeline).toArray();
}
}
SQL-style geospatial text search concepts:
-- SQL geospatial text search equivalent patterns
-- PostGIS extension for spatial queries with text search
CREATE EXTENSION IF NOT EXISTS postgis;
-- Spatial and text indexes
CREATE INDEX idx_businesses_location ON businesses USING GIST (location);
CREATE INDEX idx_businesses_text ON businesses USING GIN (
to_tsvector('english', name || ' ' || description || ' ' || array_to_string(tags, ' '))
);
-- Search nearby businesses with text matching
WITH nearby_businesses AS (
SELECT
business_id,
name,
description,
ST_Distance(location, ST_MakePoint(-122.4194, 37.7749)) AS distance_meters,
ts_rank(
to_tsvector('english', name || ' ' || description),
plainto_tsquery('english', 'coffee shop')
) AS text_relevance
FROM businesses
WHERE ST_DWithin(
location,
ST_MakePoint(-122.4194, 37.7749)::geography,
5000 -- 5km radius
)
AND to_tsvector('english', name || ' ' || description)
@@ plainto_tsquery('english', 'coffee shop')
)
SELECT
name,
description,
distance_meters,
text_relevance,
-- Combined scoring: text relevance + distance factor
(text_relevance + (1 - distance_meters / 5000.0)) AS combined_score
FROM nearby_businesses
ORDER BY combined_score DESC
LIMIT 20;
-- Spatial clustering with text search
SELECT
ST_ClusterKMeans(location, 5) OVER () AS cluster_id,
COUNT(*) AS businesses_in_cluster,
AVG(ts_rank(
to_tsvector('english', name || ' ' || description),
plainto_tsquery('english', 'restaurant')
)) AS avg_relevance,
ST_Centroid(ST_Collect(location)) AS cluster_center
FROM businesses
WHERE to_tsvector('english', name || ' ' || description)
@@ plainto_tsquery('english', 'restaurant')
AND ST_DWithin(
location,
ST_MakePoint(-122.4194, 37.7749)::geography,
10000
)
GROUP BY cluster_id
HAVING COUNT(*) >= 3
ORDER BY avg_relevance DESC;
Performance Optimization
Text Index Optimization
Optimize text search performance for large datasets:
// Text search performance optimization
class TextSearchOptimizer {
constructor(db) {
this.db = db;
}
async analyzeTextIndexPerformance(collection) {
// Get index statistics
const indexStats = await this.db.collection(collection).aggregate([
{ $indexStats: {} }
]).toArray();
const textIndexes = indexStats.filter(stat =>
stat.name.includes('text') || stat.key.hasOwnProperty('_fts')
);
const analysis = {
collection: collection,
text_indexes: textIndexes.length,
index_details: []
};
for (const index of textIndexes) {
const indexDetail = {
name: index.name,
size_bytes: index.size || 0,
accesses: index.accesses || {},
key_pattern: index.key,
// Calculate index efficiency
efficiency: this.calculateIndexEfficiency(index.accesses)
};
analysis.index_details.push(indexDetail);
}
return analysis;
}
calculateIndexEfficiency(accesses) {
if (!accesses || !accesses.ops || !accesses.since) {
return 0;
}
const ageHours = (Date.now() - accesses.since.getTime()) / (1000 * 60 * 60);
const operationsPerHour = accesses.ops / Math.max(ageHours, 1);
return {
ops_per_hour: Math.round(operationsPerHour),
total_operations: accesses.ops,
age_hours: Math.round(ageHours)
};
}
async optimizeTextIndexWeights(collection, sampleQueries = []) {
// Analyze query performance with different weight configurations
const fieldWeightTests = [
{ title: 20, content: 10, tags: 15 }, // Title-heavy
{ title: 10, content: 20, tags: 8 }, // Content-heavy
{ title: 15, content: 15, tags: 20 }, // Tag-heavy
{ title: 12, content: 12, tags: 12 } // Balanced
];
const testResults = [];
for (const weights of fieldWeightTests) {
// Create test index
const indexName = `text_test_${Date.now()}`;
try {
await this.db.collection(collection).createIndex({
title: "text",
content: "text",
tags: "text"
}, {
weights: weights,
name: indexName
});
// Test queries with this index configuration
const queryResults = [];
for (const query of sampleQueries) {
const startTime = Date.now();
const results = await this.db.collection(collection).find({
$text: { $search: query }
}, {
projection: { score: { $meta: "textScore" } }
})
.sort({ score: { $meta: "textScore" } })
.limit(10)
.toArray();
const executionTime = Date.now() - startTime;
queryResults.push({
query: query,
results_count: results.length,
execution_time: executionTime,
avg_score: results.reduce((sum, r) => sum + r.score, 0) / results.length || 0
});
}
testResults.push({
weights: weights,
query_performance: queryResults,
avg_execution_time: queryResults.reduce((sum, q) => sum + q.execution_time, 0) / queryResults.length,
avg_relevance: queryResults.reduce((sum, q) => sum + q.avg_score, 0) / queryResults.length
});
// Drop test index
await this.db.collection(collection).dropIndex(indexName);
} catch (error) {
console.error(`Failed to test weights ${JSON.stringify(weights)}:`, error);
}
}
// Find optimal weights
const bestConfig = testResults.reduce((best, current) => {
const bestScore = (best.avg_relevance || 0) - (best.avg_execution_time || 1000) / 1000;
const currentScore = (current.avg_relevance || 0) - (current.avg_execution_time || 1000) / 1000;
return currentScore > bestScore ? current : best;
});
return {
recommended_weights: bestConfig.weights,
test_results: testResults,
optimization_summary: {
performance_gain: bestConfig.avg_execution_time < 100 ? 'excellent' : 'good',
relevance_quality: bestConfig.avg_relevance > 1.0 ? 'high' : 'moderate'
}
};
}
async createOptimalTextIndex(collection, fields, sampleData = []) {
// Analyze field content to determine optimal index configuration
const fieldAnalysis = await this.analyzeFields(collection, fields);
// Calculate optimal weights based on content analysis
const weights = this.calculateOptimalWeights(fieldAnalysis);
// Determine language settings
const languageDistribution = await this.analyzeLanguageDistribution(collection);
const indexConfig = {
weights: weights,
default_language: languageDistribution.primary_language,
language_override: 'language',
name: `optimized_text_${Date.now()}`
};
// Create the optimized index
const indexSpec = {};
fields.forEach(field => {
indexSpec[field] = "text";
});
await this.db.collection(collection).createIndex(indexSpec, indexConfig);
return {
index_name: indexConfig.name,
configuration: indexConfig,
field_analysis: fieldAnalysis,
language_distribution: languageDistribution
};
}
async analyzeFields(collection, fields) {
const pipeline = [
{ $sample: { size: 1000 } }, // Sample for analysis
{
$project: fields.reduce((proj, field) => {
proj[field] = 1;
proj[`${field}_word_count`] = {
$size: {
$split: [
{ $ifNull: [`$${field}`, ""] },
" "
]
}
};
proj[`${field}_char_count`] = {
$strLenCP: { $ifNull: [`$${field}`, ""] }
};
return proj;
}, {})
}
];
const sampleDocs = await this.db.collection(collection).aggregate(pipeline).toArray();
const analysis = {};
for (const field of fields) {
const wordCounts = sampleDocs.map(doc => doc[`${field}_word_count`] || 0);
const charCounts = sampleDocs.map(doc => doc[`${field}_char_count`] || 0);
analysis[field] = {
avg_words: wordCounts.reduce((sum, count) => sum + count, 0) / wordCounts.length,
avg_chars: charCounts.reduce((sum, count) => sum + count, 0) / charCounts.length,
max_words: Math.max(...wordCounts),
non_empty_ratio: wordCounts.filter(count => count > 0).length / wordCounts.length
};
}
return analysis;
}
calculateOptimalWeights(fieldAnalysis) {
const weights = {};
let totalScore = 0;
// Calculate field importance scores
for (const [field, stats] of Object.entries(fieldAnalysis)) {
// Higher weight for fields with moderate word counts and high fill rates
const wordScore = Math.min(stats.avg_words / 10, 3); // Cap at reasonable level
const fillScore = stats.non_empty_ratio * 5;
const fieldScore = wordScore + fillScore;
weights[field] = Math.max(Math.round(fieldScore), 1);
totalScore += weights[field];
}
// Normalize weights to reasonable range (1-20)
const maxWeight = Math.max(...Object.values(weights));
if (maxWeight > 20) {
for (const field in weights) {
weights[field] = Math.round((weights[field] / maxWeight) * 20);
}
}
return weights;
}
async analyzeLanguageDistribution(collection) {
// Simple language detection based on common words
const pipeline = [
{ $sample: { size: 500 } },
{
$project: {
text_content: {
$concat: [
{ $ifNull: ["$title", ""] },
" ",
{ $ifNull: ["$content", ""] },
" ",
{ $ifNull: [{ $reduce: { input: "$tags", initialValue: "", in: { $concat: ["$$value", " ", "$$this"] } } }, ""] }
]
}
}
}
];
const samples = await this.db.collection(collection).aggregate(pipeline).toArray();
const languageScores = { english: 0, spanish: 0, french: 0, german: 0 };
// Language-specific common words
const languageMarkers = {
english: ['the', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by'],
spanish: ['el', 'la', 'y', 'o', 'pero', 'en', 'con', 'por', 'para', 'de', 'que', 'es'],
french: ['le', 'la', 'et', 'ou', 'mais', 'dans', 'sur', 'avec', 'par', 'pour', 'de', 'que'],
german: ['der', 'die', 'das', 'und', 'oder', 'aber', 'in', 'auf', 'mit', 'für', 'von', 'zu']
};
for (const sample of samples) {
const words = sample.text_content.toLowerCase().split(/\s+/);
for (const [language, markers] of Object.entries(languageMarkers)) {
const matches = words.filter(word => markers.includes(word)).length;
languageScores[language] += matches / words.length;
}
}
const totalSamples = samples.length;
for (const language in languageScores) {
languageScores[language] = languageScores[language] / totalSamples;
}
const primaryLanguage = Object.entries(languageScores)
.sort(([,a], [,b]) => b - a)[0][0];
return {
primary_language: primaryLanguage,
distribution: languageScores,
confidence: languageScores[primaryLanguage]
};
}
}
QueryLeaf Text Search Integration
QueryLeaf provides familiar SQL-style text search syntax with MongoDB's powerful full-text capabilities:
-- QueryLeaf text search with SQL-familiar syntax
-- Basic full-text search using SQL MATCH syntax
SELECT
product_id,
product_name,
description,
price,
MATCH(product_name, description) AGAINST('gaming laptop') AS relevance_score
FROM products
WHERE MATCH(product_name, description) AGAINST('gaming laptop')
ORDER BY relevance_score DESC
LIMIT 20;
-- Boolean text search with operators
SELECT
product_name,
category,
price,
MATCH_SCORE(product_name, description, tags) AS score
FROM products
WHERE FULL_TEXT_SEARCH(product_name, description, tags, '+gaming +laptop -refurbished')
ORDER BY score DESC;
-- Phrase search for exact matches
SELECT
article_id,
title,
author,
created_date,
TEXT_SCORE(title, content) AS relevance
FROM articles
WHERE PHRASE_SEARCH(title, content, '"machine learning algorithms"')
ORDER BY relevance DESC;
-- Multi-language text search
SELECT
document_id,
title,
content,
language,
MATCH_MULTILANG(title, content, 'artificial intelligence', language) AS score
FROM documents
WHERE MATCH_MULTILANG(title, content, 'artificial intelligence', language) > 0.5
ORDER BY score DESC;
-- Text search with geographic filtering
SELECT
b.business_name,
b.address,
ST_Distance(b.location, ST_MakePoint(-122.4194, 37.7749)) AS distance_meters,
MATCH(b.business_name, b.description) AGAINST('coffee shop') AS text_score
FROM businesses b
WHERE ST_DWithin(
b.location,
ST_MakePoint(-122.4194, 37.7749),
5000 -- 5km radius
)
AND MATCH(b.business_name, b.description) AGAINST('coffee shop')
ORDER BY (text_score * 0.7 + (1 - distance_meters/5000) * 0.3) DESC;
-- QueryLeaf automatically handles:
-- 1. MongoDB text index creation and optimization
-- 2. Language detection and stemming
-- 3. Relevance scoring and ranking
-- 4. Multi-field search coordination
-- 5. Performance optimization through proper indexing
-- 6. Integration with other query types (geospatial, range, etc.)
-- Advanced text analytics with SQL aggregations
WITH search_analytics AS (
SELECT
search_term,
COUNT(*) as search_frequency,
AVG(MATCH(product_name, description) AGAINST(search_term)) as avg_relevance,
COUNT(CASE WHEN clicked = true THEN 1 END) as click_count
FROM search_logs
WHERE search_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY search_term
)
SELECT
search_term,
search_frequency,
ROUND(avg_relevance, 3) as avg_relevance,
ROUND(100.0 * click_count / search_frequency, 1) as click_through_rate,
CASE
WHEN avg_relevance < 0.5 THEN 'LOW_QUALITY'
WHEN click_through_rate < 5.0 THEN 'LOW_ENGAGEMENT'
ELSE 'PERFORMING_WELL'
END as search_quality
FROM search_analytics
WHERE search_frequency >= 10
ORDER BY search_frequency DESC;
-- Auto-complete and suggestions using SQL
SELECT DISTINCT
SUBSTRING(product_name, 1, POSITION(' ' IN product_name || ' ') - 1) as suggestion,
COUNT(*) as frequency
FROM products
WHERE product_name ILIKE 'gam%'
AND LENGTH(product_name) >= 4
GROUP BY suggestion
HAVING COUNT(*) >= 2
ORDER BY frequency DESC, suggestion ASC
LIMIT 10;
-- Search result clustering and categorization
SELECT
category,
COUNT(*) as result_count,
AVG(MATCH(product_name, description) AGAINST('smartphone')) as avg_relevance,
MIN(price) as min_price,
MAX(price) as max_price,
ARRAY_AGG(DISTINCT brand ORDER BY brand) as available_brands
FROM products
WHERE MATCH(product_name, description) AGAINST('smartphone')
AND MATCH(product_name, description) AGAINST('smartphone') > 0.3
GROUP BY category
HAVING COUNT(*) >= 5
ORDER BY avg_relevance DESC;
Best Practices for Text Search
Search Implementation Guidelines
Essential practices for implementing MongoDB text search:
- Index Strategy: Create focused text indexes on relevant fields with appropriate weights
- Language Support: Configure proper language settings for stemming and tokenization
- Performance Monitoring: Track search query performance and optimize accordingly
- Relevance Tuning: Adjust field weights based on user behavior and search analytics
- Fallback Mechanisms: Implement fuzzy search for handling typos and variations
- Caching: Cache frequent search results and suggestions for improved performance
Search Quality Optimization
Improve search result quality and user experience:
- Analytics-Driven Optimization: Use search analytics to identify and fix poor-performing queries
- User Feedback Integration: Incorporate click-through rates and user interactions for relevance tuning
- Synonym Management: Implement synonym expansion for better search recall
- Personalization: Provide contextual suggestions based on user history and preferences
- Multi-Modal Search: Combine text search with filters, geospatial queries, and faceted search
- Real-Time Adaptation: Continuously update indexes and suggestions based on new content
Conclusion
MongoDB's full-text search capabilities provide enterprise-grade search functionality that rivals dedicated search engines while maintaining database integration simplicity. Combined with SQL-style query patterns, MongoDB text search enables familiar search implementation approaches while delivering the scalability and performance required for modern applications.
Key text search benefits include:
- Advanced Linguistics: Stemming, tokenization, and language-specific processing for accurate results
- Relevance Scoring: Built-in scoring algorithms with customizable field weights for optimal ranking
- Performance Optimization: Specialized text indexes and query optimization for fast search response
- Multi-Language Support: Native support for multiple languages with proper linguistic handling
- Integration Flexibility: Seamless integration with other MongoDB query types and aggregation pipelines
Whether you're building product catalogs, content management systems, or document search applications, MongoDB text search with QueryLeaf's familiar SQL interface provides the foundation for sophisticated search experiences. This combination enables you to implement powerful search functionality while preserving the development patterns and query approaches your team already knows.
The integration of advanced text search capabilities with SQL-style query management makes MongoDB an ideal platform for applications requiring both powerful search functionality and familiar database interaction patterns, ensuring your search features remain both comprehensive and maintainable as they scale and evolve.