AnyCrawl

Crawl

Crawl an entire website and turn it into LLM-ready structured data.

Introduction

The AnyCrawl crawl API discovers and processes multiple pages starting from a seed URL, applying the same per-page extraction pipeline as /v1/scrape. Crawling is asynchronous: you receive a job_id immediately, then poll the job's status and fetch results in pages.
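
In JavaScript the whole cycle condenses to one request per endpoint. This is a minimal sketch; YOUR_API_KEY is a placeholder, and the full step-by-step examples follow in Usage Examples below.

// Condensed lifecycle: create a job, check its status, fetch a page of results.
const headers = { Authorization: "Bearer YOUR_API_KEY", "Content-Type": "application/json" };

// 1) Create a job and note the job_id
const created = await fetch("https://api.anycrawl.dev/v1/crawl", {
    method: "POST",
    headers,
    body: JSON.stringify({ url: "https://anycrawl.dev", engine: "cheerio" }),
}).then((r) => r.json());
const jobId = created.data.job_id;

// 2) Check progress
const progress = await fetch(`https://api.anycrawl.dev/v1/crawl/${jobId}/status`, { headers })
    .then((r) => r.json());

// 3) Fetch the first page of results (use skip to page through the rest)
const firstPage = await fetch(`https://api.anycrawl.dev/v1/crawl/${jobId}?skip=0`, { headers })
    .then((r) => r.json());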

Key Features

  • Asynchronous jobs: Queue a crawl and fetch results later
  • Multi-engine support: cheerio, playwright, puppeteer
  • Flexible scope control: strategy, max_depth, include_paths, exclude_paths
  • Per-page options: Reuse /v1/scrape options under scrape_options
  • Pagination: Stream results via skip to control payload size

API Endpoints

POST    https://api.anycrawl.dev/v1/crawl
GET     https://api.anycrawl.dev/v1/crawl/{jobId}/status
GET     https://api.anycrawl.dev/v1/crawl/{jobId}?skip=0
DELETE  https://api.anycrawl.dev/v1/crawl/{jobId}
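
If you call these endpoints from code, a small URL builder keeps the paths in one place. This is a sketch based only on the endpoint list above; the crawlApi name is arbitrary.

// URL builder for the crawl endpoints; jobId and skip slot into the path and query string.
const base = "https://api.anycrawl.dev/v1/crawl";
const crawlApi = {
    create: () => base,                                            // POST
    status: (jobId) => `${base}/${jobId}/status`,                  // GET
    results: (jobId, skip = 0) => `${base}/${jobId}?skip=${skip}`, // GET
    cancel: (jobId) => `${base}/${jobId}`,                         // DELETE
};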

Usage Examples

Create a crawl job

curl -X POST "https://api.anycrawl.dev/v1/crawl" \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://anycrawl.dev",
    "engine": "cheerio",
    "strategy": "same-domain",
    "max_depth": 5,
    "limit": 100,
    "exclude_paths": ["/blog/*"],
    "scrape_options": {
      "formats": ["markdown"],
      "timeout": 60000
    }
  }'

const start = await fetch("https://api.anycrawl.dev/v1/crawl", {
    method: "POST",
    headers: {
        Authorization: "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    body: JSON.stringify({
        url: "https://anycrawl.dev",
        engine: "cheerio",
        strategy: "same-domain",
        max_depth: 5,
        limit: 100,
        exclude_paths: ["/blog/*"],
        scrape_options: { formats: ["markdown"], timeout: 60000 },
    }),
});
const startResult = await start.json();
const jobId = startResult.data.job_id;

import requests
resp = requests.post(
    'https://api.anycrawl.dev/v1/crawl',
    headers={
        'Authorization': 'Bearer YOUR_API_KEY',
        'Content-Type': 'application/json'
    },
    json={
        'url': 'https://anycrawl.dev',
        'engine': 'cheerio',
        'strategy': 'same-domain',
        'max_depth': 5,
        'limit': 100,
        'exclude_paths': ['/blog/*'],
        'scrape_options': { 'formats': ['markdown'], 'timeout': 60000 }
    }
)
start = resp.json()
job_id = start['data']['job_id']

Poll status

curl -H "Authorization: Bearer <YOUR_API_KEY>" \
  "https://api.anycrawl.dev/v1/crawl/7a2e165d-8f81-4be6-9ef7-23222330a396/status"

const statusRes = await fetch(`https://api.anycrawl.dev/v1/crawl/${jobId}/status`, {
    headers: { Authorization: "Bearer YOUR_API_KEY" },
});
const status = await statusRes.json();
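
A single status request is only a snapshot; to wait for completion, poll until the job leaves the pending state. This is a sketch: the 5-second interval and the attempt limit are arbitrary client-side choices, not API requirements.

// Poll the status endpoint until the job reaches a terminal state (completed, failed, or cancelled).
async function waitForCrawl(jobId, { intervalMs = 5000, maxAttempts = 120 } = {}) {
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        const res = await fetch(`https://api.anycrawl.dev/v1/crawl/${jobId}/status`, {
            headers: { Authorization: "Bearer YOUR_API_KEY" },
        });
        const { data } = await res.json();
        if (data.status !== "pending") return data; // completed, failed, or cancelled
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
    throw new Error(`Timed out waiting for crawl ${jobId} to finish`);
}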

Fetch results (paginated)

curl -H "Authorization: Bearer <YOUR_API_KEY>" \
  "https://api.anycrawl.dev/v1/crawl/7a2e165d-8f81-4be6-9ef7-23222330a396?skip=0"

let skip = 0;
while (true) {
    const res = await fetch(`https://api.anycrawl.dev/v1/crawl/${jobId}?skip=${skip}`, {
        headers: { Authorization: "Bearer YOUR_API_KEY" },
    });
    const page = await res.json();
    // process page.data
    if (!page.next) break;
    const nextUrl = new URL(page.next);
    skip = Number(nextUrl.searchParams.get("skip") || 0);
}

Cancel a job

curl -X DELETE -H "Authorization: Bearer <YOUR_API_KEY>" \
  "https://api.anycrawl.dev/v1/crawl/7a2e165d-8f81-4be6-9ef7-23222330a396"

Request Parameters

Parameter       Type             Required  Default      Description
url             string           Yes       -            Seed URL to start crawling
engine          enum             No        cheerio      Per-page scraping engine: cheerio, playwright, puppeteer
exclude_paths   array of string  No        -            Path rules to exclude (glob-like), e.g. /blog/*
include_paths   array of string  No        -            Path rules to include (applied after exclusion)
max_depth       number           No        10           Max depth from the seed URL
strategy        enum             No        same-domain  Crawl scope: all, same-domain, same-hostname, same-origin
limit           number           No        100          Maximum number of pages to crawl
scrape_options  object           No        -            Per-page scrape options (almost the same as /v1/scrape request, but without url and engine at the top level)
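
As an illustration, a request body that keeps the crawl on same-hostname documentation pages could combine these parameters as follows (the values and the /docs/* path rules are hypothetical):

{
    "url": "https://anycrawl.dev",
    "engine": "cheerio",
    "strategy": "same-hostname",
    "max_depth": 3,
    "limit": 50,
    "include_paths": ["/docs/*"],
    "exclude_paths": ["/docs/changelog/*"],
    "scrape_options": { "formats": ["markdown"] }
}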

scrape_options fields

Field         Type             Default       Notes
formats       array of enum    ["markdown"]  Output formats: markdown, html, text, screenshot, screenshot@fullPage, rawHtml, json
timeout       number           60000         Per-request timeout (ms)
retry         boolean          false         Whether to retry on failure
wait_for      number           -             Delay before extraction (ms)
include_tags  array of string  -             Only include elements that match CSS selectors
exclude_tags  array of string  -             Exclude elements that match CSS selectors
proxy         string (URI)     -             Optional proxy URL
json_options  object           -             Options for structured JSON extraction (schema, user_prompt, schema_name, schema_description)
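
For example, scrape_options that request structured JSON extraction alongside markdown might look like this. This is a sketch: the schema is assumed to be JSON-Schema-like, which this page does not spell out, so check the /v1/scrape documentation for the authoritative shape.

{
    "formats": ["markdown", "json"],
    "timeout": 60000,
    "exclude_tags": ["nav", "footer"],
    "json_options": {
        "schema_name": "article",
        "schema_description": "Key fields of an article page",
        "user_prompt": "Extract the title and author of each page",
        "schema": {
            "type": "object",
            "properties": {
                "title": { "type": "string" },
                "author": { "type": "string" }
            }
        }
    }
}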

Response Format

1) Create (HTTP 200)

{
    "success": true,
    "data": {
        "job_id": "7a2e165d-8f81-4be6-9ef7-23222330a396",
        "status": "created",
        "message": "Crawl job has been queued for processing"
    }
}

Possible Errors

  • 400 Validation Error
{
    "success": false,
    "error": "Validation error",
    "message": "Invalid enum value...",
    "data": {
        "type": "validation_error",
        "issues": [
            { "field": "engine", "message": "Invalid enum value", "code": "invalid_enum_value" }
        ],
        "status": "failed"
    }
}
  • 401 Authentication Error
{ "success": false, "error": "Invalid API key" }

2) Status (HTTP 200)

{
    "success": true,
    "message": "Job status retrieved successfully",
    "data": {
        "job_id": "7a2e165d-8f81-4be6-9ef7-23222330a396",
        "status": "completed",
        "start_time": "2025-05-25T07:56:44.162Z",
        "expires_at": "2025-05-26T07:56:44.162Z",
        "credits_used": 0,
        "total": 120,
        "completed": 30,
        "failed": 2
    }
}

3) Results page (HTTP 200)

{
    "success": true,
    "status": "pending",
    "total": 120,
    "completed": 30,
    "creditsUsed": 12,
    "next": "https://api.anycrawl.dev/v1/crawl/7a2e165d-8f81-4be6-9ef7-23222330a396?skip=100",
    "data": [
        {
            "url": "https://anycrawl.dev/",
            "title": "AnyCrawl",
            "markdown": "# AnyCrawl...",
            "timestamp": "2025-05-25T07:56:44.162Z"
        }
    ]
}

Possible Errors

  • 400 Invalid job id / Not found
{ "success": false, "error": "Invalid job ID", "message": "Job ID must be a valid UUID" }

4) Cancel (HTTP 200)

{
    "success": true,
    "message": "Job cancelled successfully",
    "data": { "job_id": "7a2e165d-8f81-4be6-9ef7-23222330a396", "status": "cancelled" }
}

Possible Errors

  • 404 Not found
  • 409 Job already finished
{
    "success": false,
    "error": "Job already finished",
    "message": "Finished jobs cannot be cancelled"
}

Best Practices

  • Use /v1/scrape for single pages to minimize cost; use /v1/crawl for site-wide data.
  • Tune strategy, max_depth, and path rules to control scope and cost.
  • Use formats to limit output size to what you actually need.
  • Page through results via skip to avoid very large responses.

Error Handling Example

async function fetchAllResults(jobId) {
    let skip = 0;
    while (true) {
        const res = await fetch(`https://api.anycrawl.dev/v1/crawl/${jobId}?skip=${skip}`, {
            headers: { Authorization: "Bearer YOUR_API_KEY" },
        });
        if (!res.ok) {
            const err = await res.json().catch(() => ({}));
            throw new Error(err.message || `HTTP ${res.status}`);
        }
        const page = await res.json();
        // handle page.data here
        if (!page.next) break;
        const next = new URL(page.next);
        skip = Number(next.searchParams.get("skip") || 0);
    }
}

High Concurrency Usage

The crawl queue supports concurrent jobs. Submit multiple crawl jobs and poll each one independently:

const seeds = ["https://site-a.com", "https://site-b.com", "https://site-c.com"];
const jobs = await Promise.all(
    seeds.map((url) =>
        fetch("https://api.anycrawl.dev/v1/crawl", {
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify({ url, engine: "cheerio" }),
        }).then((r) => r.json())
    )
);
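
You can then track each submitted job on its own, for example by checking all status endpoints in parallel (a sketch building on the create response shape shown earlier):

// Collect the job ids from the create responses and check every job's status in parallel.
const jobIds = jobs.map((job) => job.data.job_id);
const statuses = await Promise.all(
    jobIds.map((id) =>
        fetch(`https://api.anycrawl.dev/v1/crawl/${id}/status`, {
            headers: { Authorization: "Bearer YOUR_API_KEY" },
        }).then((r) => r.json())
    )
);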

FAQ

How is /v1/crawl different from /v1/scrape?

/v1/scrape fetches a single URL synchronously and returns content immediately. /v1/crawl discovers multiple pages from a seed URL and runs asynchronously.

How do I limit crawl scope and cost?

Use strategy, max_depth, limit, and path rules (include_paths, exclude_paths).

How do I paginate results?

Use the skip query parameter. If the response includes next, follow it to fetch the next page.

Why do some jobs not return html/markdown?

Ensure the desired formats are included in scrape_options.formats and that your chosen engine supports them.

Status values

Job status follows the values defined in the job model:

Status     Meaning
pending    Job is queued or in progress (not finished yet)
completed  Job finished successfully
failed     Job ended with an error
cancelled  Job was cancelled; no further processing will be done
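
In client code these map to three outcomes: keep polling on pending, fetch results on completed, and surface failures or cancellations. The sketch below assumes progress is a status response as returned above; waitForCrawl and fetchAllResults refer to the earlier examples.

// Decide the next action from the reported job status.
switch (progress.data.status) {
    case "pending":
        // keep polling (see waitForCrawl above)
        break;
    case "completed":
        await fetchAllResults(jobId); // page through the results
        break;
    case "failed":
    case "cancelled":
        console.warn(`Crawl ${jobId} ended with status ${progress.data.status}`);
        break;
}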

OpenAPI (auto-generated)

See the API Reference for the auto-generated docs:

  • POST /v1/crawl
  • GET /v1/crawl/{jobId}/status
  • GET /v1/crawl/{jobId}
  • DELETE /v1/crawl/{jobId}