# Map

Extract all URLs from a website using sitemap, search engine, and page link analysis.

## Introduction

AnyCrawl Map API extracts URLs from a website by combining multiple discovery sources: sitemap parsing, search engine results, and HTML link extraction. This provides comprehensive URL discovery for site mapping, content indexing, and crawl planning.

**Key Features:** The API returns data immediately and synchronously; no polling or webhooks required. It combines three URL sources for maximum coverage.
## Core Features

- **Multi-Source Discovery**: Combines sitemap parsing, search engine results, and page link extraction
- **Sitemap Support**: Parses `robots.txt` and `sitemap.xml` (including sitemap indexes and gzip)
- **Search Integration**: Automatically uses search engines with the `site:` operator for URL discovery
- **Link Extraction**: Extracts all `<a href>` links from the target page
- **Domain Filtering**: Filter by exact domain or include subdomains
- **Immediate Response**: Synchronous API; get results instantly without polling
## API Endpoint

```
POST https://api.anycrawl.dev/v1/map
```

## Usage Examples

### cURL

#### Basic URL Mapping

```bash
curl -X POST "https://api.anycrawl.dev/v1/map" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'
```

#### Include Subdomains
```bash
curl -X POST "https://api.anycrawl.dev/v1/map" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "include_subdomains": true,
    "limit": 1000
  }'
```

#### Skip Sitemap Parsing
For faster results when you only need page links and search results:
```bash
curl -X POST "https://api.anycrawl.dev/v1/map" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "ignore_sitemap": true
  }'
```

## Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes | - | Target URL to map; must be a valid HTTP/HTTPS address |
| `limit` | number | No | 5000 | Maximum number of URLs to return (1-50000) |
| `include_subdomains` | boolean | No | false | Include URLs from subdomains (e.g., blog.example.com) |
| `ignore_sitemap` | boolean | No | false | Skip sitemap parsing; only use search engine and page links |
| `max_age` | number | No | - | Cache max age (ms). Use 0 to skip cache reads; omit to use the server default |
| `use_index` | boolean | No | true | Whether to use the Page Cache index (`page_cache`) as an additional URL source |
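The constraints in the table above can be checked client-side before a request is sent. A minimal sketch (the `validateMapRequest` helper is hypothetical, not part of any AnyCrawl SDK):

```javascript
// Validate a Map request body against the documented parameter constraints.
// Returns an array of error messages; an empty array means the body is valid.
function validateMapRequest(body) {
  const errors = [];
  if (typeof body.url !== "string" || !/^https?:\/\//.test(body.url)) {
    errors.push("url must be a valid HTTP/HTTPS address");
  }
  if (
    body.limit !== undefined &&
    (!Number.isInteger(body.limit) || body.limit < 1 || body.limit > 50000)
  ) {
    errors.push("limit must be an integer between 1 and 50000");
  }
  for (const flag of ["include_subdomains", "ignore_sitemap", "use_index"]) {
    if (body[flag] !== undefined && typeof body[flag] !== "boolean") {
      errors.push(`${flag} must be a boolean`);
    }
  }
  return errors;
}
```

Validating locally avoids spending a round trip on a request that would fail with a 400 validation error.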
### Cache Behavior

`max_age` controls Map Cache reads; `0` forces a refresh. `use_index=false` disables the Page Cache index source (Page Cache must be enabled for this to have any effect). `/v1/map` does not return `fromCache` in the response; cache usage is internal.
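The cache controls are ordinary request-body fields. Two illustrative bodies (field names as documented above; URLs are placeholders):

```javascript
// Force a fresh mapping: max_age of 0 skips Map Cache reads entirely.
const forceRefresh = {
  url: "https://example.com",
  max_age: 0,
};

// Exclude the Page Cache index (page_cache) as a URL source.
const withoutPageCacheIndex = {
  url: "https://example.com",
  use_index: false,
};
```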
## URL Discovery Sources

The Map API combines three sources to discover URLs:

### 1. Sitemap Parsing

- Parses `robots.txt` to find sitemap locations
- Tries common sitemap paths: `/sitemap.xml`, `/sitemap.xml.gz`
- Supports sitemap indexes (sitemaps containing other sitemaps)
- Supports gzip-compressed sitemaps

### 2. Search Engine Results

- Automatically uses the `site:domain.com` operator to discover indexed pages
- Provides title and description metadata for discovered URLs

### 3. Page Link Extraction

- Extracts all `<a href>` links from the target page HTML
- Captures link text as title metadata
- Captures the `title` attribute and `aria-label` as description metadata
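The domain filtering applied to discovered URLs can be sketched as follows. This mirrors the documented `include_subdomains` semantics, not AnyCrawl's actual implementation (the `filterByDomain` helper is hypothetical):

```javascript
// Keep only URLs on the target domain, optionally admitting subdomains
// such as blog.example.com when includeSubdomains is true.
function filterByDomain(urls, targetDomain, includeSubdomains = false) {
  return urls.filter((u) => {
    const host = new URL(u).hostname;
    return (
      host === targetDomain ||
      (includeSubdomains && host.endsWith("." + targetDomain))
    );
  });
}
```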
## Metadata Sources
| Source | Title | Description |
|---|---|---|
| Sitemap | - | - |
| Search Engine | Search result title | Search result snippet |
| Page Links | Link text or title attribute | aria-label attribute |
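For page links, the metadata resolution in the table above can be sketched like this. The `linkMetadata` helper and its input shape are hypothetical, chosen only to illustrate the documented fallback order (link text, then `title` attribute, for titles; `aria-label` for descriptions):

```javascript
// Build a response-style entry from an extracted <a href> link.
// Fields are omitted when no metadata was captured.
function linkMetadata(anchor) {
  const entry = { url: anchor.href };
  const title = anchor.text?.trim() || anchor.titleAttr;
  if (title) entry.title = title;
  if (anchor.ariaLabel) entry.description = anchor.ariaLabel;
  return entry;
}
```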
## Response Format

### Success Response (HTTP 200)

```json
{
    "success": true,
    "data": [
        {
            "url": "https://example.com/page1",
            "title": "Page Title",
            "description": "Page description from search results"
        },
        {
            "url": "https://example.com/page2",
            "title": "Another Page"
        },
        {
            "url": "https://example.com/page3"
        }
    ]
}
```

### Error Responses
#### 400 - Validation Error

```json
{
    "success": false,
    "error": "Validation error",
    "message": "Invalid url",
    "details": {
        "issues": [
            {
                "field": "url",
                "message": "Invalid url",
                "code": "invalid_string"
            }
        ]
    }
}
```

#### 402 - Insufficient Credits
```json
{
    "success": false,
    "error": "Insufficient credits",
    "message": "Estimated credits required (1) exceeds available credits (0).",
    "details": {
        "estimated_total": 1,
        "available_credits": 0
    }
}
```

#### 500 - Internal Server Error
```json
{
    "success": false,
    "error": "Internal server error",
    "message": "Error message details"
}
```

## Best Practices
### Use Cases
- Crawl Planning: Use Map to discover all URLs before starting a full crawl
- Content Indexing: Build a complete index of a website's pages
- Site Auditing: Find all pages for SEO or accessibility audits
- Link Analysis: Analyze internal linking structure
### Performance Tips

- Use `ignore_sitemap: true` for faster results when the sitemap is not needed
- Set an appropriate `limit` to avoid processing unnecessary URLs
- Use `include_subdomains: false` (the default) unless you need cross-subdomain discovery
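The tips above can be bundled into a request builder. A minimal sketch (the `fastMapRequest` helper is hypothetical; the endpoint and header shape follow the cURL examples earlier on this page):

```javascript
// Build fetch() options for a fast, narrow Map request: skip the sitemap,
// cap the result count, and stay on the exact domain (the default).
function fastMapRequest(url, apiKey, limit = 500) {
  return {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url,
      limit,
      ignore_sitemap: true,
      include_subdomains: false,
    }),
  };
}
```

Usage: `fetch("https://api.anycrawl.dev/v1/map", fastMapRequest("https://example.com", "<your-api-key>"))`.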
### Combining with Crawl

Map is ideal for planning crawl operations:

```javascript
// Step 1: Discover URLs with Map
const mapResponse = await fetch("https://api.anycrawl.dev/v1/map", {
  method: "POST",
  headers: {
    Authorization: "Bearer <your-api-key>",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://docs.example.com",
    limit: 100,
  }),
});
const { data: urls } = await mapResponse.json();

// Step 2: Use the discovered URLs to scope the crawl
const crawlResponse = await fetch("https://api.anycrawl.dev/v1/crawl", {
  method: "POST",
  headers: {
    Authorization: "Bearer <your-api-key>",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://docs.example.com",
    include_paths: urls.map((u) => new URL(u.url).pathname),
    limit: 100,
  }),
});
```

## Credits
| Operation | Credits |
|---|---|
| Map operation | 1 |
Total: 1 credit per request.
## Frequently Asked Questions
Q: What's the difference between Map and Crawl?
A: Map discovers URLs without fetching page content - it's fast and lightweight. Crawl fetches and processes the actual content of each page. Use Map for URL discovery and planning, Crawl for content extraction.
Q: Why are some URLs missing from the results?
A: Possible reasons:

- URLs are on a different domain/subdomain (use `include_subdomains: true`)
- Website doesn't have a sitemap
- URLs are generated dynamically via JavaScript
- URLs exceed the `limit` parameter
Q: How does search engine discovery work?
A: The Map API automatically queries search engines with `site:domain.com` to discover pages that have been indexed. This helps find URLs that may not be in the sitemap or linked from the main page.
Q: Does Map follow redirects?
A: Map extracts URLs as they appear in sitemaps and page links. It does not follow redirects to discover additional URLs.
Q: Is there a rate limit?
A: No, the API natively supports high concurrency. You can make multiple simultaneous requests without rate limiting concerns.