AnyCrawl

Map

Extract all URLs from a website using sitemap, search engine, and page link analysis.

Introduction

AnyCrawl Map API extracts URLs from a website by combining multiple discovery sources: sitemap parsing, search engine results, and HTML link extraction. This provides comprehensive URL discovery for site mapping, content indexing, and crawl planning.

Key Features: The API returns data immediately and synchronously - no polling or webhooks required. It combines three URL sources for maximum coverage.

Core Features

  • Multi-Source Discovery: Combines sitemap parsing, search engine results, and page link extraction
  • Sitemap Support: Parses robots.txt and sitemap.xml (including sitemap indexes and gzip)
  • Search Integration: Automatically queries search engines with the site: operator for URL discovery
  • Link Extraction: Extracts all <a href> links from the target page
  • Domain Filtering: Filter by exact domain or include subdomains
  • Immediate Response: Synchronous API - get results instantly without polling

API Endpoint

POST https://api.anycrawl.dev/v1/map

Usage Examples

cURL

Basic URL Mapping

curl -X POST "https://api.anycrawl.dev/v1/map" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'

Include Subdomains

curl -X POST "https://api.anycrawl.dev/v1/map" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "include_subdomains": true,
    "limit": 1000
  }'

Skip Sitemap Parsing

For faster results when you only need page links and search results:

curl -X POST "https://api.anycrawl.dev/v1/map" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "ignore_sitemap": true
  }'

Request Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes | - | Target URL to map; must be a valid HTTP/HTTPS address |
| limit | number | No | 5000 | Maximum number of URLs to return (1-50000) |
| include_subdomains | boolean | No | false | Include URLs from subdomains (e.g., blog.example.com) |
| ignore_sitemap | boolean | No | false | Skip sitemap parsing; only use search engine and page links |
| max_age | number | No | - | Cache max age (ms). Use 0 to skip cache reads; omit to use the server default |
| use_index | boolean | No | true | Whether to use the Page Cache index (page_cache) as an additional URL source |

Cache behavior

  • max_age controls Map Cache reads. 0 forces a refresh.
  • use_index=false disables the Page Cache index source (requires Page Cache to be enabled to have any effect).
  • /v1/map does not return fromCache in the response (cache usage is internal).
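
As an illustration of the cache parameters above, the sketch below builds a request body that forces a fresh read (max_age: 0) and disables the Page Cache index source. The helper name and target URL are our own placeholders, not part of the API:

```javascript
// Build a Map request body that bypasses cached results.
// max_age: 0 forces a refresh; use_index: false skips the page_cache source.
function buildFreshMapRequest(url, limit = 5000) {
    return {
        url,
        limit,
        max_age: 0,       // skip cache reads entirely
        use_index: false, // do not use the Page Cache index as a URL source
    };
}

const body = buildFreshMapRequest("https://example.com", 100);
```

Serialize this object with JSON.stringify as the POST body, exactly as in the cURL examples above.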

URL Discovery Sources

The Map API combines three sources to discover URLs:

1. Sitemap Parsing

  • Parses robots.txt to find sitemap locations
  • Tries common sitemap paths: /sitemap.xml, /sitemap.xml.gz
  • Supports sitemap indexes (sitemaps containing other sitemaps)
  • Supports gzip-compressed sitemaps
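
To illustrate the discovery order above, this hedged sketch derives the robots.txt location and the common sitemap paths that would be probed for a target URL (the helper is our own, not part of the API):

```javascript
// Given a target URL, list the locations the sitemap step would probe:
// robots.txt first (it may declare sitemap URLs), then common sitemap paths.
function sitemapCandidates(targetUrl) {
    const origin = new URL(targetUrl).origin;
    return [
        `${origin}/robots.txt`,
        `${origin}/sitemap.xml`,
        `${origin}/sitemap.xml.gz`,
    ];
}
```

Note that candidates are resolved against the site origin, so mapping a deep URL like https://example.com/docs/page still probes the site root.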

2. Search Engine Results

  • Automatically uses the site:domain.com operator to discover indexed pages
  • Provides title and description metadata for discovered URLs

3. Page Link Extraction

  • Extracts all <a href> links from the target page HTML
  • Uses link text (or the title attribute) as title metadata
  • Uses the aria-label attribute as description

Metadata Sources

| Source | Title | Description |
| --- | --- | --- |
| Sitemap | - | - |
| Search Engine | Search result title | Search result snippet |
| Page Links | Link text or title attribute | aria-label attribute |
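
Because the same URL can surface from several sources with different metadata, a client may want to merge entries so that sitemap-only entries pick up a title or description found elsewhere. This helper is an illustrative sketch on our part, not API behavior:

```javascript
// Merge URL entries from multiple discovery sources.
// Later sources fill in title/description missing from earlier ones.
function mergeEntries(...sources) {
    const byUrl = new Map();
    for (const entries of sources) {
        for (const entry of entries) {
            const existing = byUrl.get(entry.url) || { url: entry.url };
            byUrl.set(entry.url, {
                url: entry.url,
                title: existing.title ?? entry.title,
                description: existing.description ?? entry.description,
            });
        }
    }
    return [...byUrl.values()];
}
```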

Response Format

Success Response (HTTP 200)

{
    "success": true,
    "data": [
        {
            "url": "https://example.com/page1",
            "title": "Page Title",
            "description": "Page description from search results"
        },
        {
            "url": "https://example.com/page2",
            "title": "Another Page"
        },
        {
            "url": "https://example.com/page3"
        }
    ]
}
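
Since title and description are optional (sitemap-only URLs carry neither, as the sample shows), consumers should treat both fields as possibly missing. A small hedged helper that normalizes entries with fallbacks of our own choosing:

```javascript
// Normalize Map results: every entry gets a title (falling back to the
// URL pathname) and a description (falling back to an empty string).
function normalizeMapData(data) {
    return data.map((entry) => ({
        url: entry.url,
        title: entry.title ?? new URL(entry.url).pathname,
        description: entry.description ?? "",
    }));
}
```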

Error Responses

400 - Validation Error

{
    "success": false,
    "error": "Validation error",
    "message": "Invalid url",
    "details": {
        "issues": [
            {
                "field": "url",
                "message": "Invalid url",
                "code": "invalid_string"
            }
        ]
    }
}

402 - Insufficient Credits

{
    "success": false,
    "error": "Insufficient credits",
    "message": "Estimated credits required (1) exceeds available credits (0).",
    "details": {
        "estimated_total": 1,
        "available_credits": 0
    }
}

500 - Internal Server Error

{
    "success": false,
    "error": "Internal server error",
    "message": "Error message details"
}
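
The error payloads above share a success/error/message shape, so a client can branch on the HTTP status and surface the details field when present. A hedged sketch (the status handling convention is our own):

```javascript
// Interpret a Map API error response body by HTTP status code,
// using the documented payload shapes for 400, 402, and 500.
function describeMapError(status, body) {
    switch (status) {
        case 400:
            return `Validation failed: ${body.details.issues
                .map((i) => `${i.field}: ${i.message}`)
                .join("; ")}`;
        case 402:
            return `Out of credits: need ${body.details.estimated_total}, ` +
                `have ${body.details.available_credits}`;
        case 500:
            return `Server error: ${body.message}`;
        default:
            return body.message ?? body.error;
    }
}
```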

Best Practices

Use Cases

  1. Crawl Planning: Use Map to discover all URLs before starting a full crawl
  2. Content Indexing: Build a complete index of a website's pages
  3. Site Auditing: Find all pages for SEO or accessibility audits
  4. Link Analysis: Analyze internal linking structure

Performance Tips

  • Use ignore_sitemap: true for faster results when sitemap is not needed
  • Set appropriate limit to avoid processing unnecessary URLs
  • Use include_subdomains: false (default) unless you need cross-subdomain discovery

Combining with Crawl

Map is ideal for planning crawl operations:

// Step 1: Discover URLs with Map
const mapResponse = await fetch("https://api.anycrawl.dev/v1/map", {
    method: "POST",
    headers: {
        Authorization: "Bearer <your-api-key>",
        "Content-Type": "application/json",
    },
    body: JSON.stringify({
        url: "https://docs.example.com",
        limit: 100,
    }),
});
const { data: urls } = await mapResponse.json();

// Step 2: Use discovered URLs to plan crawl
const crawlResponse = await fetch("https://api.anycrawl.dev/v1/crawl", {
    method: "POST",
    headers: {
        Authorization: "Bearer <your-api-key>",
        "Content-Type": "application/json",
    },
    body: JSON.stringify({
        url: "https://docs.example.com",
        include_paths: urls.map((u) => new URL(u.url).pathname),
        limit: 100,
    }),
});
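
Mapped URLs can share pathnames (for example, the same page with different query strings), so the include_paths list built from them may contain duplicates. A hedged helper of our own to dedupe before passing to Crawl:

```javascript
// Collapse mapped URL entries into a deduplicated list of pathnames,
// suitable for an include_paths-style filter.
function uniquePaths(entries) {
    return [...new Set(entries.map((e) => new URL(e.url).pathname))];
}
```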

Credits

| Operation | Credits |
| --- | --- |
| Map operation | 1 |

Total: 1 credit per request.

Frequently Asked Questions

Q: What's the difference between Map and Crawl?

A: Map discovers URLs without fetching page content - it's fast and lightweight. Crawl fetches and processes the actual content of each page. Use Map for URL discovery and planning, Crawl for content extraction.

Q: Why are some URLs missing from the results?

A: Possible reasons:

  • URLs are on a different domain/subdomain (use include_subdomains: true)
  • Website doesn't have a sitemap
  • URLs are generated dynamically via JavaScript
  • URLs exceed the limit parameter

Q: How does search engine discovery work?

A: The Map API automatically queries search engines with site:domain.com to discover pages that have been indexed. This helps find URLs that may not be in the sitemap or linked from the main page.

Q: Does Map follow redirects?

A: Map extracts URLs as they appear in sitemaps and page links. It does not follow redirects to discover additional URLs.

Q: Is there a rate limit?

A: No, the API natively supports high concurrency. You can make multiple simultaneous requests without rate limiting concerns.