Scrape
Scrape a URL, and turn it to LLM-ready structured data.
Introduction
AnyCrawl scrape API converts any webpage into structured data optimized for Large Language Models (LLM). It supports multiple scraping engines including Cheerio, Playwright, Puppeteer, and outputs in various formats such as HTML, Markdown, JSON, etc.
Key Features: The API returns data immediately and synchronously - no polling or webhooks required. It also natively supports high concurrency for large-scale scraping operations.
Core Features
- Multi-Engine Support: Supports
cheerio
(static HTML parsing, fastest),playwright
(cross-browser JavaScript rendering),puppeteer
(Chrome-optimized JavaScript rendering) - LLM Optimized: Automatically extracts and formats content, generates Markdown format for easy LLM processing
- Proxy Support: Supports HTTP/HTTPS proxy configuration
- Robust Error Handling: Comprehensive error handling and retry mechanisms
- High Performance: Native high concurrency support with asynchronous queue processing
- Immediate Response: Synchronous API - get results instantly without polling
API Endpoint
POST https://api.anycrawl.dev/v1/scrape
Request Parameters
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
url | string | Yes | - | URL to scrape, must be a valid HTTP/HTTPS address |
engine | enum | No | cheerio | Scraping engine type, options: cheerio , playwright , puppeteer |
formats | array | No | ["markdown"] | Output formats, options: markdown , html , text , screenshot , screenshot@fullPage , rawHtml , json |
timeout | number | No | 30000 | Timeout in milliseconds, default: 30000 |
wait_for | number | No | - | Wait for the page to load |
include_tags | array | No | - | Include tags, like: h1 |
exclude_tags | array | No | - | Exclude tags, like: h1 |
proxy | string (URI) | No | - | Proxy server address, format: http://proxy:port or https://proxy:port |
json_options | json | No | - | JSON options, like: {"schema": {}, "prompt": true} |
Engine Types
Important Note: Both playwright
and puppeteer
can use Chromium (Chrome's open-source version), but they serve different purposes and have different capabilities.
cheerio
(Default)
- Use Case: Static HTML content scraping
- Advantages: Fastest scraping speed, lowest resource consumption
- Limitations: Cannot execute JavaScript, cannot handle dynamic content
- Recommended For: News articles, blogs, static websites
playwright
- Use Case: Modern websites requiring JavaScript rendering, cross-browser testing
- Advantages: Robust auto-wait features, better stability
- Limitations: Higher resource consumption
- Recommended For: Complex web applications
puppeteer
- Advantages: Deep Chrome DevTools integration, excellent performance metrics, faster execution
- Limitations: Does not support ARM CPU architecture
Output Formats
You can specify which data formats to include in the response using the formats
parameter:
markdown
- Description: Converts HTML content to clean, structured Markdown format
- Use Case: Optimal for LLM processing, documentation, and content analysis
- Recommended For: Text-heavy content, articles, blogs
html
- Description: Returns cleaned and formatted HTML content
- Use Case: When you need structured HTML with preserved formatting
- Recommended For: Web content that needs to maintain HTML structure
text
- Description: Extracts plain text content without any formatting
- Use Case: Simple text extraction for basic content analysis
- Recommended For: Text-only processing, keyword extraction
screenshot
- Description: Captures a screenshot of the visible page area
- Use Case: Visual representation of the webpage
- Limitations: Only available with
playwright
andpuppeteer
engines - Recommended For: Visual verification, UI testing
screenshot@fullPage
- Description: Captures a full-page screenshot including content below the fold
- Use Case: Complete visual capture of the entire webpage
- Limitations: Only available with
playwright
andpuppeteer
engines - Recommended For: Complete page documentation, archival purposes
rawHtml
- Description: Returns the original, unprocessed HTML source
- Use Case: When you need the exact HTML as received from the server
- Recommended For: Technical analysis, debugging, preserving original structure
JSON options object
The json_options
parameter is an object that accepts the following parameters:
schema
: The schema to use for the extraction.prompt
: The prompt to use for the extraction without a schema.
Example
{
"schema": {},
"prompt": "Extract the title and content of the page"
}
or
{
"schema": {
"type": "object",
"properties": {
"title": {
"type": "string"
},
"company_name": {
"type": "string"
},
"summary": {
"type": "string"
},
"is_open_source": {
"type": "boolean"
}
},
"required": ["company_name", "summary"]
},
"prompt": "Extract the company name, summary, and if it is open source"
}
Response Format
Success Response (HTTP 200)
Successful Scraping
{
"success": true,
"data": {
"url": "https://mock.httpstatus.io/200",
"status": "completed",
"jobId": "c9fb76c4-2d7b-41f9-9141-b9ec9af58b39",
"title": "",
"metadata": [
{
"name": "color-scheme",
"content": "light dark"
}
],
"html": "<html><head><meta name=\"color-scheme\" content=\"light dark\"></head><body><pre style=\"word-wrap: break-word; white-space: pre-wrap;\">200 OK</pre></body></html>",
"screenshot": "http://localhost:8080/v1/public/storage/file/screenshot-c9fb76c4-2d7b-41f9-9141-b9ec9af58b39.jpeg",
"timestamp": "2025-07-01T04:38:02.951Z"
}
}
Error Responses
400 - Validation Error
{
"success": false,
"error": "Validation error",
"details": {
"issues": [
{
"field": "engine",
"message": "Invalid enum value. Expected 'playwright' | 'cheerio' | 'puppeteer', received 'invalid'",
"code": "invalid_enum_value"
}
],
"messages": [
"Invalid enum value. Expected 'playwright' | 'cheerio' | 'puppeteer', received 'invalid'"
]
}
}
401 - Authentication Error
{
"success": false,
"error": "Invalid API key"
}
Failed Scraping
{
"success": false,
"error": "Scrape task failed",
"message": "Page is not available: 404 ",
"data": {
"url": "https://mock.httpstatus.io/404",
"status": "failed",
"type": "http_error",
"message": "Page is not available: 404 ",
"code": 404,
"metadata": [
{
"name": "color-scheme",
"content": "light dark"
}
],
"jobId": "34cd1d26-eb83-40ce-9d63-3be1a901f4a3",
"title": "",
"html": "<html><head><meta name=\"color-scheme\" content=\"light dark\"></head><body><pre style=\"word-wrap: break-word; white-space: pre-wrap;\">404 Not Found</pre></body></html>",
"screenshot": "screenshot-34cd1d26-eb83-40ce-9d63-3be1a901f4a3.jpeg",
"timestamp": "2025-07-01T04:36:20.978Z",
"statusCode": 404,
"statusMessage": ""
}
}
or
{
"success": false,
"error": "Scrape task failed",
"message": "Page is not available: 502 ",
"data": {
"url": "https://mock.httpstatus.io/502",
"status": "failed",
"type": "http_error",
"message": "Page is not available: 502 ",
"code": 502,
"metadata": [
{
"name": "color-scheme",
"content": "light dark"
}
],
"jobId": "5fc50008-07e0-4913-a6af-53b0b3e0214b",
"title": "",
"html": "<html><head><meta name=\"color-scheme\" content=\"light dark\"></head><body><pre style=\"word-wrap: break-word; white-space: pre-wrap;\">502 Bad Gateway</pre></body></html>",
"screenshot": "screenshot-5fc50008-07e0-4913-a6af-53b0b3e0214b.jpeg",
"timestamp": "2025-07-01T04:39:59.981Z",
"statusCode": 502,
"statusMessage": ""
}
}
or
{
"success": false,
"error": "Scrape task failed",
"message": "Page is not available: 400 ",
"data": {
"url": "https://mock.httpstatus.io/400",
"status": "failed",
"type": "http_error",
"message": "Page is not available: 400 ",
"code": 400,
"metadata": [
{
"name": "color-scheme",
"content": "light dark"
}
],
"jobId": "0081747c-1fc5-44f9-800c-e27b24b55a2c",
"title": "",
"html": "<html><head><meta name=\"color-scheme\" content=\"light dark\"></head><body><pre style=\"word-wrap: break-word; white-space: pre-wrap;\">400 Bad Request</pre></body></html>",
"screenshot": "screenshot-0081747c-1fc5-44f9-800c-e27b24b55a2c.jpeg",
"timestamp": "2025-07-01T04:38:24.136Z",
"statusCode": 400,
"statusMessage": ""
}
}
or
{
"success": false,
"error": "Scrape task failed",
"message": "Page is not available",
"data": {
"url": "https://httpstat.us/401",
"status": "failed",
"type": "http_error",
"message": "Page is not available"
}
}
Usage Examples
cURL
Basic Scraping (using default cheerio engine)
curl -X POST "https://api.anycrawl.dev/v1/scrape" \
-H "Authorization: Bearer <your-api-key>" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com"
}'
Scraping Dynamic Content with Playwright Engine
curl -X POST "https://api.anycrawl.dev/v1/scrape" \
-H "Authorization: Bearer <your-api-key>" \
-H "Content-Type: application/json" \
-d '{
"url": "https://spa-example.com",
"engine": "playwright"
}'
Scraping with Proxy
curl -X POST "https://api.anycrawl.dev/v1/scrape" \
-H "Authorization: Bearer <your-api-key>" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"engine": "playwright",
"proxy": "http://proxy.example.com:8080"
}'
JavaScript/TypeScript
const response = await fetch("https://api.anycrawl.dev/v1/scrape", {
method: "POST",
headers: {
Authorization: "Bearer YOUR_API_KEY",
"Content-Type": "application/json",
},
body: JSON.stringify({
url: "https://example.com",
engine: "playwright",
}),
});
const result = await response.json();
console.log(result.data.markdown); // Get Markdown formatted content
Python
import requests
response = requests.post(
'https://api.anycrawl.dev/v1/scrape',
headers={
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
json={
'url': 'https://example.com',
'engine': 'cheerio'
}
)
result = response.json()
print(result['data']['markdown']) # Output Markdown content
Best Practices
Engine Selection Guidelines
- Static Content Websites (news, blogs, documentation) → Use
cheerio
- Complex Web Applications (SPAs requiring JavaScript rendering) → Use
playwright
orpuppeteer
Performance Optimization
- For bulk static content scraping, prioritize
cheerio
engine - Only use
playwright
orpuppeteer
when JavaScript rendering is required - For best results, use rotating proxies to avoid IP blocks and rate limits, and ensure proxy servers are stable and reliable
- Leverage native high concurrency - the API handles multiple concurrent requests efficiently
Error Handling
try {
const response = await fetch("/v1/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ url: "https://example.com" }),
});
const result = await response.json();
if (result.success && result.data.status === "completed") {
// Handle successful result
console.log(result.data.markdown);
} else {
// Handle scraping failure
console.error("Scraping failed:", result.data.error);
}
} catch (error) {
// Handle network error
console.error("Request failed:", error);
}
High Concurrency Usage
The API natively supports high concurrency. You can make multiple simultaneous requests without rate limiting concerns:
// Concurrent scraping example
const urls = ["https://example1.com", "https://example2.com", "https://example3.com"];
const scrapePromises = urls.map((url) =>
fetch("/v1/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ url, engine: "cheerio" }),
}).then((res) => res.json())
);
// All requests execute concurrently and return immediately
const results = await Promise.all(scrapePromises);
Frequently Asked Questions
Q: When should I use different engines?
A: Each engine has specific advantages:
- Cheerio: For static HTML content, fastest performance, no JavaScript execution
- Playwright: For complex web apps, better stability and auto-wait features, and it will support more browser types in the future
- Puppeteer: Chrome/Chromium only, does not work on ARM CPUs and we do not provide related docker images
Q: Why do some websites fail to scrape?
A: Possible reasons include:
- Website blocks crawlers (returns 403/404)
- JavaScript rendering required but using cheerio engine
- Website requires authentication or special headers
- Network connectivity issues
Q: How to handle websites requiring login?
A: Currently the API doesn't support authentication. Recommendations:
- Scrape publicly accessible pages
- Use alternative methods to obtain authenticated content
Q: What are the proxy configuration requirements?
A:
- Supports HTTP/HTTPS proxies
- Format:
http://host:port
orhttps://host:port
- Ensure proxy servers are stable and available
Q: Is there rate limiting for concurrent requests?
A: No, the API natively supports high concurrency. You can make multiple simultaneous requests without worrying about rate limits, and all requests return data immediately.