MCP Server
Use AnyCrawl's API through the Model Context Protocol (MCP)
AnyCrawl MCP Server
🚀 Powerful web scraping and crawling for Cursor, Claude, and other LLM clients via the Model Context Protocol (MCP).
Features
- Web scraping, crawling, and content extraction
- Search and discovery capabilities
- Batch processing and concurrency
- Cloud and self-hosted support
- Real-time progress tracking
- SSE support
Installation
Running with npx
ANYCRAWL_API_KEY=YOUR-API-KEY npx -y anycrawl-mcp
Using hosted API
{
"mcpServers": {
"anycrawl": {
"url": "https://mcp.anycrawl.dev/{YOUR_API_KEY}/mcp"
}
}
}
Manual installation
npm install -g anycrawl-mcp-server
ANYCRAWL_API_KEY=YOUR-API-KEY anycrawl-mcp
Running on Cursor
Add the AnyCrawl MCP server to Cursor:
- Open Cursor Settings
- Go to Features > MCP Servers
- Click "+ Add new global MCP server"
- Enter the following configuration:
{
"mcpServers": {
"anycrawl-mcp": {
"command": "npx",
"args": ["-y", "anycrawl-mcp"],
"env": {
"ANYCRAWL_API_KEY": "YOUR-API-KEY"
}
}
}
}
Running on VS Code
Add the following JSON block to your User Settings (JSON) file in VS Code:
{
"mcp": {
"inputs": [
{
"type": "promptString",
"id": "apiKey",
"description": "AnyCrawl API Key",
"password": true
}
],
"servers": {
"anycrawl": {
"command": "npx",
"args": ["-y", "anycrawl-mcp"],
"env": {
"ANYCRAWL_API_KEY": "${input:apiKey}"
}
}
}
}
}
Running on Claude Desktop
Add this to your Claude Desktop config file:
{
"mcpServers": {
"anycrawl": {
"url": "https://mcp.anycrawl.dev/{YOUR_API_KEY}/sse"
}
}
}
Configuration
Environment Variables
ANYCRAWL_API_KEY: Your AnyCrawl API key
- Required when using the cloud API (default)
- Optional when using a self-hosted instance
ANYCRAWL_BASE_URL: AnyCrawl API base URL
- Optional when using the cloud API
- Required for self-hosted instances
Configuration Examples
Basic Cloud Configuration
export ANYCRAWL_API_KEY="your-api-key-here"
npx -y anycrawl-mcp
Self-hosted Configuration
export ANYCRAWL_API_KEY="your-api-key"
export ANYCRAWL_BASE_URL="https://your-instance.com"
npx -y anycrawl-mcp
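The same variables can be supplied through an MCP client's env block instead of the shell. A minimal sketch for Cursor pointing at a self-hosted instance (the instance URL and key below are placeholders):
{
  "mcpServers": {
    "anycrawl-self-hosted": {
      "command": "npx",
      "args": ["-y", "anycrawl-mcp"],
      "env": {
        "ANYCRAWL_API_KEY": "your-api-key",
        "ANYCRAWL_BASE_URL": "https://your-instance.com"
      }
    }
  }
}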
Available Tools
1. Scrape Tool (anycrawl_scrape)
Scrape a single URL and extract content in various formats.
Best for:
- Extracting content from a single page
- Quick data extraction
- Testing specific URLs
Parameters:
- url (required): The URL to scrape
- engine (required): Scraping engine (playwright, cheerio, puppeteer)
- formats (optional): Output formats (markdown, html, text, screenshot, screenshot@fullPage, rawHtml, json)
- timeout (optional): Timeout in milliseconds (default: 300000)
- wait_for (optional): Wait time for the page to load
- include_tags (optional): HTML tags to include (see the second example below)
- exclude_tags (optional): HTML tags to exclude
- json_options (optional): Options for JSON extraction
Example:
{
"name": "anycrawl_scrape",
"arguments": {
"url": "https://example.com",
"engine": "cheerio",
"formats": ["markdown", "html"],
"timeout": 30000
}
}
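A second, hedged sketch that combines content filtering with a full-page screenshot. The tag lists are illustrative, and it assumes wait_for is given in milliseconds like timeout:
{
  "name": "anycrawl_scrape",
  "arguments": {
    "url": "https://example.com/article",
    "engine": "playwright",
    "formats": ["markdown", "screenshot@fullPage"],
    "include_tags": ["article", "main"],
    "exclude_tags": ["nav", "footer", "aside"],
    "wait_for": 2000,
    "timeout": 60000
  }
}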
2. Crawl Tool (anycrawl_crawl)
Start a crawl job to scrape multiple pages from a website. By default this waits for completion and returns aggregated results using the SDK's client.crawl (defaults: poll every 3 seconds, timeout after 60 seconds).
Best for:
- Extracting content from multiple related pages
- Comprehensive website analysis
- Bulk data collection
Parameters:
- url (required): The base URL to crawl
- engine (required): Scraping engine
- max_depth (optional): Maximum crawl depth (default: 10)
- limit (optional): Maximum number of pages (default: 100)
- strategy (optional): Crawling strategy (all, same-domain, same-hostname, same-origin)
- exclude_paths (optional): URL patterns to exclude
- include_paths (optional): URL patterns to include
- scrape_options (optional): Options for individual page scraping (see the sketch after the example below)
- poll_seconds (optional): Poll interval in seconds while waiting (default: 3)
- timeout_ms (optional): Overall timeout in milliseconds while waiting (default: 60000)
Example:
{
"name": "anycrawl_crawl",
"arguments": {
"url": "https://example.com/blog",
"engine": "playwright",
"max_depth": 2,
"limit": 50,
"strategy": "same-domain",
"poll_seconds": 3,
"timeout_ms": 60000
}
}
Returns: { "job_id": "...", "status": "completed", "total": N, "completed": N, "creditsUsed": N, "data": [...] }.
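The following sketch shows scrape_options controlling how each crawled page is scraped. It assumes scrape_options accepts the same fields as the scrape tool (as it does in the search example further down) and that exclude_paths takes glob-style patterns; verify both against the API docs:
{
  "name": "anycrawl_crawl",
  "arguments": {
    "url": "https://example.com/docs",
    "engine": "cheerio",
    "limit": 20,
    "strategy": "same-domain",
    "exclude_paths": ["/changelog/*"],
    "scrape_options": {
      "formats": ["markdown"],
      "exclude_tags": ["nav", "footer"]
    }
  }
}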
3. Crawl Status Tool (anycrawl_crawl_status)
Check the status of a crawl job.
Parameters:
- job_id (required): The crawl job ID
Example:
{
"name": "anycrawl_crawl_status",
"arguments": {
"job_id": "7a2e165d-8f81-4be6-9ef7-23222330a396"
}
}
4. Crawl Results Tool (anycrawl_crawl_results)
Get results from a crawl job.
Parameters:
- job_id (required): The crawl job ID
- skip (optional): Number of results to skip (for pagination)
Example:
{
"name": "anycrawl_crawl_results",
"arguments": {
"job_id": "7a2e165d-8f81-4be6-9ef7-23222330a396",
"skip": 0
}
}
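To page through a large result set, repeat the call with an increasing skip. For example, a follow-up request after the first batch might look like this (the batch size returned per call is determined by the API):
{
  "name": "anycrawl_crawl_results",
  "arguments": {
    "job_id": "7a2e165d-8f81-4be6-9ef7-23222330a396",
    "skip": 100
  }
}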
5. Cancel Crawl Tool (anycrawl_cancel_crawl)
Cancel a pending crawl job.
Parameters:
- job_id (required): The crawl job ID to cancel
Example:
{
"name": "anycrawl_cancel_crawl",
"arguments": {
"job_id": "7a2e165d-8f81-4be6-9ef7-23222330a396"
}
}
6. Search Tool (anycrawl_search)
Search the web using the AnyCrawl search engine.
Best for:
- Finding specific information across multiple websites
- Research and discovery
- When you don't know which website has the information
Parameters:
- query (required): Search query
- engine (optional): Search engine (google)
- limit (optional): Maximum number of results (default: 5)
- offset (optional): Number of results to skip (default: 0)
- pages (optional): Number of pages to search
- lang (optional): Language code
- country (optional): Country code
- scrape_options (optional): Options for scraping search results
- safeSearch (optional): Safe search level (0 = off, 1 = moderate, 2 = strict; see the second example below)
Example:
{
"name": "anycrawl_search",
"arguments": {
"query": "latest AI research papers 2024",
"engine": "google",
"limit": 5,
"scrape_options": {
"engine": "cheerio",
"formats": ["markdown"]
}
}
}
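A second sketch for a localized, filtered search. It assumes lang and country take standard two-letter codes, which is worth confirming against the API docs:
{
  "name": "anycrawl_search",
  "arguments": {
    "query": "open source web crawlers",
    "engine": "google",
    "limit": 10,
    "pages": 2,
    "lang": "en",
    "country": "us",
    "safeSearch": 1
  }
}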
Output Formats
Markdown
Clean, structured markdown content perfect for LLM consumption.
HTML
Raw HTML content with all formatting preserved.
Text
Plain text content with minimal formatting.
Screenshot
Visual screenshot of the page.
Screenshot@fullPage
Full-page screenshot including content below the fold.
Raw HTML
Unprocessed HTML content.
JSON
Structured data extraction using custom schemas.
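As a rough sketch of schema-based extraction with the json format: the example below assumes json_options accepts a JSON Schema under a schema key, which is an assumption; check the AnyCrawl API docs for the exact shape of json_options before relying on it.
{
  "name": "anycrawl_scrape",
  "arguments": {
    "url": "https://example.com/product",
    "engine": "playwright",
    "formats": ["json"],
    "json_options": {
      "schema": {
        "type": "object",
        "properties": {
          "title": { "type": "string" },
          "price": { "type": "string" }
        }
      }
    }
  }
}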
Engines
Cheerio
- Fast and lightweight
- Good for static content
- Server-side rendering
Playwright
- Full browser automation
- JavaScript rendering
- Best for dynamic content
Puppeteer
- Chrome/Chromium automation
- Good balance of features and performance
Error Handling
The server provides comprehensive error handling:
- Validation Errors: Invalid parameters or missing required fields
- API Errors: AnyCrawl API errors with detailed messages
- Network Errors: Connection and timeout issues
Logging
The server includes detailed logging:
- Debug: Detailed operation information
- Info: General operation status
- Warn: Non-critical issues
- Error: Critical errors and failures
Set log level with environment variable:
export LOG_LEVEL=debug # debug, info, warn, error
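When the server is launched by an MCP client, LOG_LEVEL can be set in the same env block as the API key (assuming the client passes env through to the server process, as in the Cursor configuration above):
{
  "mcpServers": {
    "anycrawl-mcp": {
      "command": "npx",
      "args": ["-y", "anycrawl-mcp"],
      "env": {
        "ANYCRAWL_API_KEY": "YOUR-API-KEY",
        "LOG_LEVEL": "debug"
      }
    }
  }
}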
Development
Prerequisites
- Node.js 18+
- npm
Setup
git clone https://github.com/any4ai/anycrawl-mcp-server.git
cd anycrawl-mcp-server
npm ci
Build
npm run build
Test
npm test
Lint
npm run lint
Format
npm run format
Contributing
- Fork the repository
- Create your feature branch
- Run tests:
npm test
- Submit a pull request
License
MIT License - see LICENSE file for details
Support
- GitHub Issues: Report bugs or request features
- Documentation: AnyCrawl API Docs
- Email: help@anycrawl.dev
About AnyCrawl
AnyCrawl is a powerful Node.js/TypeScript crawler that turns websites into LLM-ready data and extracts structured SERP results from Google/Bing/Baidu/etc. It features native multi-threading for bulk processing and supports multiple output formats.
- Website: https://anycrawl.dev
- GitHub: https://github.com/any4ai/anycrawl
- API: https://api.anycrawl.dev