AnyCrawl

Scrape

Scrape a URL, and turn it to LLM-ready structured data.

Introduction

AnyCrawl scrape API converts any webpage into structured data optimized for Large Language Models (LLM). It supports multiple scraping engines including Cheerio, Playwright, Puppeteer, and outputs in various formats such as HTML, Markdown, JSON, etc.

Key Features: The API returns data immediately and synchronously - no polling or webhooks required. It also natively supports high concurrency for large-scale scraping operations.

Core Features

  • Multi-Engine Support: Supports cheerio (static HTML parsing, fastest), playwright (cross-browser JavaScript rendering), puppeteer (Chrome-optimized JavaScript rendering)
  • LLM Optimized: Automatically extracts and formats content, generates Markdown format for easy LLM processing
  • Proxy Support: Supports HTTP/HTTPS proxy configuration
  • Robust Error Handling: Comprehensive error handling and retry mechanisms
  • High Performance: Native high concurrency support with asynchronous queue processing
  • Immediate Response: Synchronous API - get results instantly without polling

API Endpoint

POST https://api.anycrawl.dev/v1/scrape

Request Parameters

ParameterTypeRequiredDefaultDescription
urlstringYes-URL to scrape, must be a valid HTTP/HTTPS address
engineenumNocheerioScraping engine type, options: cheerio, playwright, puppeteer
formatsarrayNo["markdown"]Output formats, options: markdown, html, text, screenshot, screenshot@fullPage, rawHtml, json
timeoutnumberNo30000Timeout in milliseconds, default: 30000
wait_fornumberNo-Wait for the page to load
include_tagsarrayNo-Include tags, like: h1
exclude_tagsarrayNo-Exclude tags, like: h1
proxystring (URI)No-Proxy server address, format: http://proxy:port or https://proxy:port
json_optionsjsonNo-JSON options, like: {"schema": {}, "prompt": true}

Engine Types

Important Note: Both playwright and puppeteer can use Chromium (Chrome's open-source version), but they serve different purposes and have different capabilities.

cheerio (Default)

  • Use Case: Static HTML content scraping
  • Advantages: Fastest scraping speed, lowest resource consumption
  • Limitations: Cannot execute JavaScript, cannot handle dynamic content
  • Recommended For: News articles, blogs, static websites

playwright

  • Use Case: Modern websites requiring JavaScript rendering, cross-browser testing
  • Advantages: Robust auto-wait features, better stability
  • Limitations: Higher resource consumption
  • Recommended For: Complex web applications

puppeteer

  • Advantages: Deep Chrome DevTools integration, excellent performance metrics, faster execution
  • Limitations: Does not support ARM CPU architecture

Output Formats

You can specify which data formats to include in the response using the formats parameter:

markdown

  • Description: Converts HTML content to clean, structured Markdown format
  • Use Case: Optimal for LLM processing, documentation, and content analysis
  • Recommended For: Text-heavy content, articles, blogs

html

  • Description: Returns cleaned and formatted HTML content
  • Use Case: When you need structured HTML with preserved formatting
  • Recommended For: Web content that needs to maintain HTML structure

text

  • Description: Extracts plain text content without any formatting
  • Use Case: Simple text extraction for basic content analysis
  • Recommended For: Text-only processing, keyword extraction

screenshot

  • Description: Captures a screenshot of the visible page area
  • Use Case: Visual representation of the webpage
  • Limitations: Only available with playwright and puppeteer engines
  • Recommended For: Visual verification, UI testing

screenshot@fullPage

  • Description: Captures a full-page screenshot including content below the fold
  • Use Case: Complete visual capture of the entire webpage
  • Limitations: Only available with playwright and puppeteer engines
  • Recommended For: Complete page documentation, archival purposes

rawHtml

  • Description: Returns the original, unprocessed HTML source
  • Use Case: When you need the exact HTML as received from the server
  • Recommended For: Technical analysis, debugging, preserving original structure

JSON options object

The json_options parameter is an object that accepts the following parameters:

  • schema: The schema to use for the extraction.
  • prompt: The prompt to use for the extraction without a schema.

Example

{
    "schema": {},
    "prompt": "Extract the title and content of the page"
}

or

{
    "schema": {
        "type": "object",
        "properties": {
            "title": {
                "type": "string"
            },
            "company_name": {
                "type": "string"
            },
            "summary": {
                "type": "string"
            },
            "is_open_source": {
                "type": "boolean"
            }
        },
        "required": ["company_name", "summary"]
    },
    "prompt": "Extract the company name, summary, and if it is open source"
}

Response Format

Success Response (HTTP 200)

Successful Scraping

{
    "success": true,
    "data": {
        "url": "https://mock.httpstatus.io/200",
        "status": "completed",
        "jobId": "c9fb76c4-2d7b-41f9-9141-b9ec9af58b39",
        "title": "",
        "metadata": [
            {
                "name": "color-scheme",
                "content": "light dark"
            }
        ],
        "html": "<html><head><meta name=\"color-scheme\" content=\"light dark\"></head><body><pre style=\"word-wrap: break-word; white-space: pre-wrap;\">200 OK</pre></body></html>",
        "screenshot": "http://localhost:8080/v1/public/storage/file/screenshot-c9fb76c4-2d7b-41f9-9141-b9ec9af58b39.jpeg",
        "timestamp": "2025-07-01T04:38:02.951Z"
    }
}

Error Responses

400 - Validation Error

{
    "success": false,
    "error": "Validation error",
    "details": {
        "issues": [
            {
                "field": "engine",
                "message": "Invalid enum value. Expected 'playwright' | 'cheerio' | 'puppeteer', received 'invalid'",
                "code": "invalid_enum_value"
            }
        ],
        "messages": [
            "Invalid enum value. Expected 'playwright' | 'cheerio' | 'puppeteer', received 'invalid'"
        ]
    }
}

401 - Authentication Error

{
    "success": false,
    "error": "Invalid API key"
}

Failed Scraping

{
    "success": false,
    "error": "Scrape task failed",
    "message": "Page is not available: 404 ",
    "data": {
        "url": "https://mock.httpstatus.io/404",
        "status": "failed",
        "type": "http_error",
        "message": "Page is not available: 404 ",
        "code": 404,
        "metadata": [
            {
                "name": "color-scheme",
                "content": "light dark"
            }
        ],
        "jobId": "34cd1d26-eb83-40ce-9d63-3be1a901f4a3",
        "title": "",
        "html": "<html><head><meta name=\"color-scheme\" content=\"light dark\"></head><body><pre style=\"word-wrap: break-word; white-space: pre-wrap;\">404 Not Found</pre></body></html>",
        "screenshot": "screenshot-34cd1d26-eb83-40ce-9d63-3be1a901f4a3.jpeg",
        "timestamp": "2025-07-01T04:36:20.978Z",
        "statusCode": 404,
        "statusMessage": ""
    }
}

or

{
    "success": false,
    "error": "Scrape task failed",
    "message": "Page is not available: 502 ",
    "data": {
        "url": "https://mock.httpstatus.io/502",
        "status": "failed",
        "type": "http_error",
        "message": "Page is not available: 502 ",
        "code": 502,
        "metadata": [
            {
                "name": "color-scheme",
                "content": "light dark"
            }
        ],
        "jobId": "5fc50008-07e0-4913-a6af-53b0b3e0214b",
        "title": "",
        "html": "<html><head><meta name=\"color-scheme\" content=\"light dark\"></head><body><pre style=\"word-wrap: break-word; white-space: pre-wrap;\">502 Bad Gateway</pre></body></html>",
        "screenshot": "screenshot-5fc50008-07e0-4913-a6af-53b0b3e0214b.jpeg",
        "timestamp": "2025-07-01T04:39:59.981Z",
        "statusCode": 502,
        "statusMessage": ""
    }
}

or

{
    "success": false,
    "error": "Scrape task failed",
    "message": "Page is not available: 400 ",
    "data": {
        "url": "https://mock.httpstatus.io/400",
        "status": "failed",
        "type": "http_error",
        "message": "Page is not available: 400 ",
        "code": 400,
        "metadata": [
            {
                "name": "color-scheme",
                "content": "light dark"
            }
        ],
        "jobId": "0081747c-1fc5-44f9-800c-e27b24b55a2c",
        "title": "",
        "html": "<html><head><meta name=\"color-scheme\" content=\"light dark\"></head><body><pre style=\"word-wrap: break-word; white-space: pre-wrap;\">400 Bad Request</pre></body></html>",
        "screenshot": "screenshot-0081747c-1fc5-44f9-800c-e27b24b55a2c.jpeg",
        "timestamp": "2025-07-01T04:38:24.136Z",
        "statusCode": 400,
        "statusMessage": ""
    }
}

or

{
    "success": false,
    "error": "Scrape task failed",
    "message": "Page is not available",
    "data": {
        "url": "https://httpstat.us/401",
        "status": "failed",
        "type": "http_error",
        "message": "Page is not available"
    }
}

Usage Examples

cURL

Basic Scraping (using default cheerio engine)

curl -X POST "https://api.anycrawl.dev/v1/scrape" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'

Scraping Dynamic Content with Playwright Engine

curl -X POST "https://api.anycrawl.dev/v1/scrape" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://spa-example.com",
    "engine": "playwright"
  }'

Scraping with Proxy

curl -X POST "https://api.anycrawl.dev/v1/scrape" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "engine": "playwright",
    "proxy": "http://proxy.example.com:8080"
  }'

JavaScript/TypeScript

const response = await fetch("https://api.anycrawl.dev/v1/scrape", {
    method: "POST",
    headers: {
        Authorization: "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    body: JSON.stringify({
        url: "https://example.com",
        engine: "playwright",
    }),
});

const result = await response.json();
console.log(result.data.markdown); // Get Markdown formatted content

Python

import requests

response = requests.post(
    'https://api.anycrawl.dev/v1/scrape',
    headers={
        'Authorization': 'Bearer YOUR_API_KEY',
        'Content-Type': 'application/json'
    },
    json={
        'url': 'https://example.com',
        'engine': 'cheerio'
    }
)

result = response.json()
print(result['data']['markdown'])  # Output Markdown content

Best Practices

Engine Selection Guidelines

  1. Static Content Websites (news, blogs, documentation) → Use cheerio
  2. Complex Web Applications (SPAs requiring JavaScript rendering) → Use playwright or puppeteer

Performance Optimization

  • For bulk static content scraping, prioritize cheerio engine
  • Only use playwright or puppeteer when JavaScript rendering is required
  • For best results, use rotating proxies to avoid IP blocks and rate limits, and ensure proxy servers are stable and reliable
  • Leverage native high concurrency - the API handles multiple concurrent requests efficiently

Error Handling

try {
    const response = await fetch("/v1/scrape", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ url: "https://example.com" }),
    });

    const result = await response.json();

    if (result.success && result.data.status === "completed") {
        // Handle successful result
        console.log(result.data.markdown);
    } else {
        // Handle scraping failure
        console.error("Scraping failed:", result.data.error);
    }
} catch (error) {
    // Handle network error
    console.error("Request failed:", error);
}

High Concurrency Usage

The API natively supports high concurrency. You can make multiple simultaneous requests without rate limiting concerns:

// Concurrent scraping example
const urls = ["https://example1.com", "https://example2.com", "https://example3.com"];

const scrapePromises = urls.map((url) =>
    fetch("/v1/scrape", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ url, engine: "cheerio" }),
    }).then((res) => res.json())
);

// All requests execute concurrently and return immediately
const results = await Promise.all(scrapePromises);

Frequently Asked Questions

Q: When should I use different engines?

A: Each engine has specific advantages:

  • Cheerio: For static HTML content, fastest performance, no JavaScript execution
  • Playwright: For complex web apps, better stability and auto-wait features, and it will support more browser types in the future
  • Puppeteer: Chrome/Chromium only, does not work on ARM CPUs and we do not provide related docker images

Q: Why do some websites fail to scrape?

A: Possible reasons include:

  • Website blocks crawlers (returns 403/404)
  • JavaScript rendering required but using cheerio engine
  • Website requires authentication or special headers
  • Network connectivity issues

Q: How to handle websites requiring login?

A: Currently the API doesn't support authentication. Recommendations:

  • Scrape publicly accessible pages
  • Use alternative methods to obtain authenticated content

Q: What are the proxy configuration requirements?

A:

  • Supports HTTP/HTTPS proxies
  • Format: http://host:port or https://host:port
  • Ensure proxy servers are stable and available

Q: Is there rate limiting for concurrent requests?

A: No, the API natively supports high concurrency. You can make multiple simultaneous requests without worrying about rate limits, and all requests return data immediately.