Firecrawl Self-Host Guide: Complete Setup and Configuration
Table of Contents
- What is Firecrawl?
- Features of Firecrawl
- Why Self-Host Firecrawl?
- System Requirements
- Installation Guide
- Configuration
- API Usage
- Backup and Maintenance
- Troubleshooting
- FAQ
What is Firecrawl?
Firecrawl is an open-source web scraping and data extraction API that turns any website into clean, structured data. It's designed for developers who need reliable web scraping with intelligent content parsing, markdown conversion, and structured data extraction.
Features of Firecrawl
Intelligent Web Scraping
- Smart Content Extraction: Automatically identifies and extracts main content
- JavaScript Rendering: Handles SPAs and dynamic content
- Anti-Bot Handling: Works around common bot-detection countermeasures
- Rate Limiting: Respects robots.txt and enforces configurable request rates
Multiple Output Formats
- Markdown: Clean, readable text format
- Structured Data: JSON with extracted metadata
- Raw HTML: Original page source when needed
- Screenshots: Visual captures of pages
API-First Design
- RESTful API: Easy integration with any programming language
- Webhook Support: Real-time notifications for completed jobs
- Bulk Processing: Handle multiple URLs efficiently
- Queue Management: Background processing for large jobs
Advanced Features
- Link Discovery: Extract all links from pages
- Sitemap Crawling: Crawl entire websites systematically
- Content Filtering: Extract specific elements using CSS selectors
- Data Validation: Ensure extracted data meets your requirements
Why Self-Host Firecrawl?
Self-hosting provides complete control over your web scraping infrastructure.
Benefits of Self-Hosting Firecrawl
- Cost Control: No per-request pricing for high-volume usage
- Data Privacy: Keep scraped data within your infrastructure
- Customization: Modify scraping logic for specific needs
- Performance: Optimize for your specific use cases
- Compliance: Ensure adherence to your data governance policies
- Reliability: No external service dependencies
System Requirements
Minimum Requirements
- CPU: 4 cores
- RAM: 8GB
- Storage: 50GB SSD
- Network: High-speed internet connection
- OS: Linux (Ubuntu 20.04+ recommended)
- Docker and Docker Compose
Recommended Requirements
- CPU: 8+ cores
- RAM: 16GB+
- Storage: 200GB+ SSD
- Network: High-bandwidth, low-latency connection
- Load balancer for multiple instances
- Redis cluster for high availability
Installation Guide
Using Docker (Recommended)
- Clone the Firecrawl repository:
git clone https://github.com/mendableai/firecrawl.git
cd firecrawl
- Set up environment variables:
cp .env.example .env
# Edit .env with your configuration
- Start the services:
docker-compose up -d
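Once the containers are running, it's worth confirming the API responds before moving on. Here is a minimal check in Node 18+ using the built-in fetch API, assuming the default port 3002 and the /v0/health endpoint referenced later in the Troubleshooting section:
// healthcheck.mjs — confirm the self-hosted API is reachable
const BASE_URL = 'http://localhost:3002'; // adjust if you changed PORT in .env
try {
  const res = await fetch(`${BASE_URL}/v0/health`);
  console.log(`Firecrawl responded with HTTP ${res.status}`);
} catch (err) {
  console.error('Firecrawl is not reachable:', err.message);
}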
Manual Installation
- Install dependencies:
# Install Node.js 18+
curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
sudo apt-get install -y nodejs
# Install Python 3.9+
sudo apt-get install -y python3 python3-pip
# Install Redis
sudo apt-get install -y redis-server
# Install PostgreSQL
sudo apt-get install -y postgresql postgresql-contrib
- Clone and build:
git clone https://github.com/mendableai/firecrawl.git
cd firecrawl
npm install
npm run build
- Install Playwright browsers (run inside the repository so the project's Playwright version is used):
npx playwright install
npx playwright install-deps
- Set up the database (create the role first, then a database it owns, and set the password referenced by DATABASE_URL in the Configuration section):
sudo -u postgres createuser firecrawl
sudo -u postgres createdb -O firecrawl firecrawl
sudo -u postgres psql -c "ALTER USER firecrawl WITH PASSWORD 'password';"
- Start the services:
npm run start
Configuration
Environment Variables
# Server Configuration
PORT=3002
HOST=0.0.0.0
NODE_ENV=production
# Database Configuration
DATABASE_URL=postgresql://firecrawl:password@localhost:5432/firecrawl
# Redis Configuration
REDIS_URL=redis://localhost:6379
REDIS_RATE_LIMIT_URL=redis://localhost:6379
# API Configuration
NUM_WORKERS_PER_QUEUE=8
SCRAPER_NUM_WORKERS=10
SCRAPER_NUM_WORKERS_PER_QUEUE=2
# Authentication (optional)
API_KEY=your-secret-api-key
SUPABASE_ANON_TOKEN=your-supabase-token
SUPABASE_URL=https://your-project.supabase.co
# Scraping Configuration
SCRAPING_BEE_API_KEY=optional-scraping-service-key
OPENAI_API_KEY=optional-for-ai-extraction
# Rate Limiting
RATE_LIMIT_TEST_MODE=false
RATE_LIMIT_REQUESTS_PER_MINUTE=100
# Logging
LOG_LEVEL=info
SENTRY_DSN=optional-error-tracking
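Missing or misspelled variables are a common source of confusing startup failures. A small pre-flight script can fail fast before you launch; this is purely an illustrative helper, not part of Firecrawl itself:
// preflight.mjs — hypothetical helper: abort early if core settings are missing
const required = ['PORT', 'DATABASE_URL', 'REDIS_URL'];
const missing = required.filter((name) => !process.env[name]);

if (missing.length > 0) {
  console.error(`Missing environment variables: ${missing.join(', ')}`);
  process.exit(1);
}
console.log('Core environment variables are set.');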
Scraping Configuration
// firecrawl.config.js
module.exports = {
// Default scraping options
defaultOptions: {
formats: ['markdown', 'html'],
onlyMainContent: true,
includeHtml: false,
includeRawHtml: false,
waitFor: 0,
timeout: 30000,
},
// Crawler options
crawlerOptions: {
maxDepth: 3,
limit: 100,
allowBackwardCrawling: false,
allowExternalContentLinks: false,
},
// Browser configuration
browser: {
headless: true,
args: ['--no-sandbox', '--disable-dev-shm-usage'],
timeout: 30000,
}
};
Rate Limiting Setup
# Rate limiting configuration
RATE_LIMIT_ENABLED=true
RATE_LIMIT_REQUESTS_PER_MINUTE=100
RATE_LIMIT_BURST_LIMIT=20
RATE_LIMIT_WINDOW_MS=60000
# IP-based rate limiting
RATE_LIMIT_SKIP_SUCCESSFUL_REQUESTS=false
RATE_LIMIT_SKIP_FAILED_REQUESTS=false
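To see how these settings fit together: RATE_LIMIT_REQUESTS_PER_MINUTE caps how many requests a client may make inside each RATE_LIMIT_WINDOW_MS window, with counters kept in Redis. The sketch below shows that fixed-window pattern using the ioredis client; it is illustrative only, and Firecrawl's internal limiter may be implemented differently:
// rate-limit-sketch.mjs — fixed-window rate limiting backed by Redis
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_RATE_LIMIT_URL || 'redis://localhost:6379');

async function allowRequest(ip, limit = 100, windowMs = 60000) {
  // One counter per client per window; the key expires with the window.
  const windowId = Math.floor(Date.now() / windowMs);
  const key = `ratelimit:${ip}:${windowId}`;
  const count = await redis.incr(key);
  if (count === 1) await redis.pexpire(key, windowMs);
  return count <= limit;
}

// Example: requests from one client are allowed until the window fills up
const ok = await allowRequest('203.0.113.7');
console.log(ok ? 'allowed' : 'rejected');
await redis.quit();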
API Usage
Basic Scraping
curl -X POST http://localhost:3002/v0/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://example.com",
"formats": ["markdown", "html"]
}'
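The same request can be issued from Node 18+ without the SDK, since fetch is built in; the endpoint and body mirror the curl call above:
// scrape.mjs — POST a scrape request to the self-hosted API
const res = await fetch('http://localhost:3002/v0/scrape', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Authorization: 'Bearer YOUR_API_KEY',
  },
  body: JSON.stringify({
    url: 'https://example.com',
    formats: ['markdown', 'html'],
  }),
});
console.log(await res.json());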
Website Crawling
curl -X POST http://localhost:3002/v0/crawl \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://example.com",
"crawlerOptions": {
"limit": 100,
"maxDepth": 3
}
}'
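Crawls run as background jobs, so the response normally carries a job identifier rather than page data. The polling sketch below assumes the v0 response includes a jobId and that progress is exposed at /v0/crawl/status/<jobId>; verify both against your deployed version:
// crawl-poll.mjs — start a crawl, then poll until the job settles
const BASE = 'http://localhost:3002';
const headers = {
  'Content-Type': 'application/json',
  Authorization: 'Bearer YOUR_API_KEY',
};

const start = await fetch(`${BASE}/v0/crawl`, {
  method: 'POST',
  headers,
  body: JSON.stringify({
    url: 'https://example.com',
    crawlerOptions: { limit: 100, maxDepth: 3 },
  }),
});
const { jobId } = await start.json();

let job;
do {
  await new Promise((resolve) => setTimeout(resolve, 5000)); // poll every 5s
  const res = await fetch(`${BASE}/v0/crawl/status/${jobId}`, { headers });
  job = await res.json();
  console.log(`crawl status: ${job.status}`);
} while (job.status !== 'completed' && job.status !== 'failed');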
Search Functionality
curl -X POST http://localhost:3002/v0/search \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"query": "firecrawl documentation",
"limit": 10
}'
Batch Processing
const FirecrawlApp = require('@mendable/firecrawl-js').default;
const app = new FirecrawlApp({
  apiKey: 'YOUR_API_KEY',
  apiUrl: 'http://localhost:3002'
});
async function batchScrape() {
const urls = [
'https://example1.com',
'https://example2.com',
'https://example3.com'
];
const results = await Promise.all(
urls.map(url => app.scrapeUrl(url, {
formats: ['markdown'],
onlyMainContent: true
}))
);
return results;
}
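One caveat: Promise.all rejects as soon as any single URL fails, discarding the results that did succeed. When partial failures are expected, the Promise.all call inside batchScrape can be swapped for Promise.allSettled, which keeps every outcome:
// Resilient variant: separate successes from failures instead of failing fast
const settled = await Promise.allSettled(
  urls.map(url => app.scrapeUrl(url, { formats: ['markdown'], onlyMainContent: true }))
);
const succeeded = settled.filter(r => r.status === 'fulfilled').map(r => r.value);
const failed = settled.filter(r => r.status === 'rejected').map(r => r.reason);
return { succeeded, failed };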
Backup and Maintenance
Database Backup
#!/bin/bash
# Backup script for Firecrawl
# PostgreSQL backup (run as the postgres user to avoid a password prompt)
sudo -u postgres pg_dump firecrawl > /backup/firecrawl-$(date +%Y%m%d).sql
# Redis backup
redis-cli --rdb /backup/redis-$(date +%Y%m%d).rdb
# Cleanup old backups (keep 30 days)
find /backup -name "*.sql" -mtime +30 -delete
find /backup -name "*.rdb" -mtime +30 -delete
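Schedule the script to run unattended (a daily cron job is typical) and copy the dumps off the host, so a disk failure doesn't take the backups with it. Periodically test restoring a dump into a scratch database; an unverified backup is not one you can rely on.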
Log Rotation
# /etc/logrotate.d/firecrawl
/var/log/firecrawl/*.log {
daily
missingok
rotate 52
compress
delaycompress
notifempty
create 644 firecrawl firecrawl
postrotate
systemctl reload firecrawl
endscript
}
Performance Monitoring
# Monitor queue status
curl http://localhost:3002/v0/queue/stats
# Check system resources
docker stats firecrawl-app
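For ongoing visibility, the stats endpoint above can be polled on an interval. A minimal watcher in Node 18+, assuming /v0/queue/stats returns JSON as shown above:
// queue-watch.mjs — print queue stats once a minute
const BASE = 'http://localhost:3002';

setInterval(async () => {
  try {
    const res = await fetch(`${BASE}/v0/queue/stats`);
    console.log(new Date().toISOString(), await res.json());
  } catch (err) {
    console.error('stats poll failed:', err.message);
  }
}, 60000);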
Troubleshooting
Common Issues
Memory Errors
# Increase memory limits
echo "NODE_OPTIONS=--max-old-space-size=4096" >> .env
# Monitor memory usage
docker stats --no-stream firecrawl-app
Browser Issues
# Install missing dependencies
npx playwright install-deps
# Confirm the Playwright CLI is available (browsers were installed with `npx playwright install` above)
npx playwright --version
Rate Limiting Problems
# Check Redis connection
redis-cli ping
# Monitor rate limit status
curl http://localhost:3002/v0/health
Database Connection Errors
# Check PostgreSQL status
sudo systemctl status postgresql
# Test database connection
psql -h localhost -U firecrawl -d firecrawl -c "SELECT 1;"
Performance Optimization
- Increase worker count:
NUM_WORKERS_PER_QUEUE=16
SCRAPER_NUM_WORKERS=20
- Optimize browser settings:
BROWSER_POOL_SIZE=10
BROWSER_IDLE_TIMEOUT=30000
- Redis optimization:
# redis.conf
maxmemory 2gb
maxmemory-policy allkeys-lru
Log Analysis
# Application logs
docker-compose logs -f firecrawl-app
# Worker logs
docker-compose logs -f firecrawl-worker
# Database logs
tail -f /var/log/postgresql/postgresql-14-main.log
FAQ
How does Firecrawl handle JavaScript-heavy sites?
Firecrawl uses Playwright to render JavaScript, ensuring dynamic content is properly extracted. It waits for page load and can handle SPAs.
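For pages that render slowly, the waitFor option from the Scraping Configuration section delays extraction until dynamic content has had time to settle. A quick sketch using the app client from the Batch Processing example (the parameter mirrors the config key shown earlier):
// Give a heavy single-page app five seconds to render before extracting
const result = await app.scrapeUrl('https://example.com/dashboard', {
  formats: ['markdown'],
  waitFor: 5000, // milliseconds
});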
Can I customize the extraction logic?
Yes, you can modify the source code to add custom extraction rules, CSS selectors, or AI-powered content analysis.
What's the difference between scraping and crawling?
- Scraping: Extract data from a single URL
- Crawling: Discover and scrape multiple URLs from a website
How do I handle websites with anti-bot protection?
Firecrawl includes basic anti-detection features, but for advanced protection, you may need to:
- Add proxy rotation
- Implement custom headers
- Use residential IP addresses
Can I process large websites?
Yes, Firecrawl supports:
- Concurrent processing with multiple workers
- Queue-based architecture for large jobs
- Resumable crawling for interrupted jobs
How do I ensure data quality?
Firecrawl provides:
- Content validation options
- Structured data extraction
- Error handling and retries
- Quality scoring for extracted content
Is there a rate limit?
Rate limits are configurable and depend on your server resources. The default is 100 requests per minute per IP.
How do I scale Firecrawl horizontally?
You can run multiple Firecrawl instances behind a load balancer, sharing the same Redis and PostgreSQL instances.
Can I integrate with other tools?
Yes, Firecrawl provides:
- Webhook notifications
- REST API for integration
- SDKs for popular programming languages
- Export to various data formats