# Map

Extract all URLs from a website using sitemap, search engine, and page link analysis.

## Introduction

AnyCrawl Map API extracts URLs from a website by combining multiple discovery sources: sitemap parsing, search engine results, and HTML link extraction. This provides comprehensive URL discovery for site mapping, content indexing, and crawl planning.

**Key Features:** The API returns data immediately and synchronously; no polling or webhooks required. It combines three URL sources for maximum coverage.
## Core Features

- **Multi-Source Discovery**: Combines sitemap parsing, search engine results, and page link extraction
- **Sitemap Support**: Parses `robots.txt` and `sitemap.xml` (including sitemap indexes and gzip)
- **Search Integration**: Automatically uses search engines with the `site:` operator for URL discovery
- **Link Extraction**: Extracts all `<a href>` links from the target page
- **Domain Filtering**: Filter by exact domain or include subdomains
- **Immediate Response**: Synchronous API; get results instantly without polling
## API Endpoint

```
POST https://api.anycrawl.dev/v1/map
```

## Usage Examples

### cURL

#### Basic URL Mapping

```bash
curl -X POST "https://api.anycrawl.dev/v1/map" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'
```

#### Include Subdomains
```bash
curl -X POST "https://api.anycrawl.dev/v1/map" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "include_subdomains": true,
    "limit": 1000
  }'
```

#### Skip Sitemap Parsing
For faster results when you only need page links and search results:
```bash
curl -X POST "https://api.anycrawl.dev/v1/map" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "ignore_sitemap": true
  }'
```

## Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes | - | Target URL to map; must be a valid HTTP/HTTPS address |
| `limit` | number | No | 5000 | Maximum number of URLs to return (1-50000) |
| `include_subdomains` | boolean | No | false | Include URLs from subdomains (e.g., blog.example.com) |
| `ignore_sitemap` | boolean | No | false | Skip sitemap parsing; only use search engine and page links |
| `max_age` | number | No | - | Cache max age (ms). Use 0 to skip cache reads; omit to use the server default |
| `use_index` | boolean | No | true | Whether to use the Page Cache index (`page_cache`) as an additional URL source |
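The constraints in the table above can be checked client-side before a request is sent. A minimal sketch (the `validateMapRequest` helper is hypothetical, not part of any AnyCrawl SDK):

```javascript
// Validate a Map request body against the documented parameter constraints.
// Returns an array of error messages; an empty array means the body is valid.
function validateMapRequest(body) {
  const errors = [];
  if (typeof body.url !== "string" || !/^https?:\/\//.test(body.url)) {
    errors.push("url must be a valid HTTP/HTTPS address");
  }
  if (
    body.limit !== undefined &&
    (!Number.isInteger(body.limit) || body.limit < 1 || body.limit > 50000)
  ) {
    errors.push("limit must be an integer between 1 and 50000");
  }
  for (const flag of ["include_subdomains", "ignore_sitemap", "use_index"]) {
    if (body[flag] !== undefined && typeof body[flag] !== "boolean") {
      errors.push(`${flag} must be a boolean`);
    }
  }
  return errors;
}
```

Validating locally avoids spending a round trip on a request that would fail with a 400 validation error.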
### Cache Behavior

`max_age` controls Map Cache reads; `0` forces a refresh. `use_index=false` disables the Page Cache index source (Page Cache must be enabled for this to have any effect). `/v1/map` does not return `fromCache` in the response; cache usage is internal.
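The cache controls are ordinary request-body fields. Two illustrative bodies (field names as documented above; URLs are placeholders):

```javascript
// Force a fresh mapping: max_age of 0 skips Map Cache reads entirely.
const forceRefresh = {
  url: "https://example.com",
  max_age: 0,
};

// Exclude the Page Cache index (page_cache) as a URL source.
const withoutPageCacheIndex = {
  url: "https://example.com",
  use_index: false,
};
```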
## URL Discovery Sources

The Map API combines three sources to discover URLs:

### 1. Sitemap Parsing

- Parses `robots.txt` to find sitemap locations
- Tries common sitemap paths: `/sitemap.xml`, `/sitemap.xml.gz`
- Supports sitemap indexes (sitemaps containing other sitemaps)
- Supports gzip-compressed sitemaps

### 2. Search Engine Results

- Automatically uses the `site:domain.com` operator to discover indexed pages
- Provides title and description metadata for discovered URLs

### 3. Page Link Extraction

- Extracts all `<a href>` links from the target page HTML
- Captures link text as title metadata
- Captures the `title` attribute and `aria-label` as description metadata
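The domain filtering applied to discovered URLs can be sketched as follows. This mirrors the documented `include_subdomains` semantics, not AnyCrawl's actual implementation (the `filterByDomain` helper is hypothetical):

```javascript
// Keep only URLs on the target domain, optionally admitting subdomains
// such as blog.example.com when includeSubdomains is true.
function filterByDomain(urls, targetDomain, includeSubdomains = false) {
  return urls.filter((u) => {
    const host = new URL(u).hostname;
    return (
      host === targetDomain ||
      (includeSubdomains && host.endsWith("." + targetDomain))
    );
  });
}
```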
## Metadata Sources
| Source | Title | Description |
|---|---|---|
| Sitemap | - | - |
| Search Engine | Search result title | Search result snippet |
| Page Links | Link text or title attribute | aria-label attribute |
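For page links, the metadata resolution in the table above can be sketched like this. The `linkMetadata` helper and its input shape are hypothetical, chosen only to illustrate the documented fallback order (link text, then `title` attribute, for titles; `aria-label` for descriptions):

```javascript
// Build a response-style entry from an extracted <a href> link.
// Fields are omitted when no metadata was captured.
function linkMetadata(anchor) {
  const entry = { url: anchor.href };
  const title = anchor.text?.trim() || anchor.titleAttr;
  if (title) entry.title = title;
  if (anchor.ariaLabel) entry.description = anchor.ariaLabel;
  return entry;
}
```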
## Response Format

### Success Response (HTTP 200)

```json
{
    "success": true,
    "data": [
        {
            "url": "https://example.com/page1",
            "title": "Page Title",
            "description": "Page description from search results"
        },
        {
            "url": "https://example.com/page2",
            "title": "Another Page"
        },
        {
            "url": "https://example.com/page3"
        }
    ]
}
```

### Error Responses
#### 400 - Validation Error

```json
{
    "success": false,
    "error": "Validation error",
    "message": "Invalid url",
    "details": {
        "issues": [
            {
                "field": "url",
                "message": "Invalid url",
                "code": "invalid_string"
            }
        ]
    }
}
```

#### 402 - Insufficient Credits
```json
{
    "success": false,
    "error": "Insufficient credits",
    "message": "Estimated credits required (1) exceeds available credits (0).",
    "details": {
        "estimated_total": 1,
        "available_credits": 0
    }
}
```

#### 500 - Internal Server Error
```json
{
    "success": false,
    "error": "Internal server error",
    "message": "Error message details"
}
```

## Best Practices
### Use Cases
- Crawl Planning: Use Map to discover all URLs before starting a full crawl
- Content Indexing: Build a complete index of a website's pages
- Site Auditing: Find all pages for SEO or accessibility audits
- Link Analysis: Analyze internal linking structure
### Performance Tips

- Use `ignore_sitemap: true` for faster results when the sitemap is not needed
- Set an appropriate `limit` to avoid processing unnecessary URLs
- Use `include_subdomains: false` (the default) unless you need cross-subdomain discovery
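The tips above can be bundled into a request builder. A minimal sketch (the `fastMapRequest` helper is hypothetical; the endpoint and header shape follow the cURL examples earlier on this page):

```javascript
// Build fetch() options for a fast, narrow Map request: skip the sitemap,
// cap the result count, and stay on the exact domain (the default).
function fastMapRequest(url, apiKey, limit = 500) {
  return {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url,
      limit,
      ignore_sitemap: true,
      include_subdomains: false,
    }),
  };
}
```

Usage: `fetch("https://api.anycrawl.dev/v1/map", fastMapRequest("https://example.com", "<your-api-key>"))`.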
### Combining with Crawl

Map is ideal for planning crawl operations:

```javascript
// Step 1: Discover URLs with Map
const mapResponse = await fetch("https://api.anycrawl.dev/v1/map", {
  method: "POST",
  headers: {
    Authorization: "Bearer <your-api-key>",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://docs.example.com",
    limit: 100,
  }),
});
const { data: urls } = await mapResponse.json();

// Step 2: Use the discovered URLs to scope the crawl
const crawlResponse = await fetch("https://api.anycrawl.dev/v1/crawl", {
  method: "POST",
  headers: {
    Authorization: "Bearer <your-api-key>",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://docs.example.com",
    include_paths: urls.map((u) => new URL(u.url).pathname),
    limit: 100,
  }),
});
```

## Credits
| Operation | Credits |
|---|---|
| Map operation | 1 |
Total: 1 credit per request.
## Frequently Asked Questions
Q: What's the difference between Map and Crawl?
A: Map discovers URLs without fetching page content - it's fast and lightweight. Crawl fetches and processes the actual content of each page. Use Map for URL discovery and planning, Crawl for content extraction.
Q: Why are some URLs missing from the results?
A: Possible reasons:

- URLs are on a different domain/subdomain (use `include_subdomains: true`)
- Website doesn't have a sitemap
- URLs are generated dynamically via JavaScript
- URLs exceed the `limit` parameter
Q: How does search engine discovery work?
A: The Map API automatically queries search engines with `site:domain.com` to discover pages that have been indexed. This helps find URLs that may not be in the sitemap or linked from the main page.
Q: Does Map follow redirects?
A: Map extracts URLs as they appear in sitemaps and page links. It does not follow redirects to discover additional URLs.
Q: Is there a rate limit?
A: No, the API natively supports high concurrency. You can make multiple simultaneous requests without rate limiting concerns.