Proxy Configuration
Configure URL-based proxy routing for AnyCrawl
AnyCrawl supports flexible proxy routing based on URL patterns. You can configure different proxies for different websites or API endpoints.
Proxy Modes
AnyCrawl supports four proxy modes that can be specified in API requests:
| Mode | Description |
|---|---|
| auto | Automatically decide between base and stealth proxy. Uses stealth if available, otherwise base. |
| base | Use the proxy configured in ANYCRAWL_PROXY_URL (default) |
| stealth | Use the proxy configured in ANYCRAWL_PROXY_STEALTH_URL (typically a residential or premium proxy) |
| Custom URL | A full proxy URL string (e.g., http://user:pass@proxy:8080), returned as custom in responses |
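The mode table can be read as a small lookup. The sketch below is not AnyCrawl's code: the function names `resolve_proxy_url` and `response_label` are hypothetical, and the two URL arguments stand in for the values of ANYCRAWL_PROXY_URL and ANYCRAWL_PROXY_STEALTH_URL. This page does not say how named modes are reported in responses, so echoing them back is an assumption.

```python
from typing import Optional

NAMED_MODES = ("auto", "base", "stealth")


def resolve_proxy_url(value: str, base_url: Optional[str],
                      stealth_url: Optional[str]) -> Optional[str]:
    """Map a request's "proxy" value to the proxy URL that would be used."""
    if value == "auto":
        # Per the table: stealth if available, otherwise base.
        return stealth_url or base_url
    if value == "base":
        return base_url
    if value == "stealth":
        return stealth_url
    # Any other string is treated as a full custom proxy URL.
    return value


def response_label(value: str) -> str:
    """Custom URLs are reported as "custom"; echoing named modes is an assumption."""
    return value if value in NAMED_MODES else "custom"
```

With only a base proxy configured, `resolve_proxy_url("auto", base, None)` returns the base URL, matching the "otherwise base" clause in the table.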
Usage Example
{
  "url": "https://example.com",
  "engine": "playwright",
  "proxy": "auto"
}

{
  "url": "https://example.com",
  "engine": "playwright",
  "proxy": "stealth"
}

{
  "url": "https://example.com",
  "engine": "playwright",
  "proxy": "http://custom-proxy:8080"
}

Configuration Methods
Method 1: Simple Proxy Configuration (ANYCRAWL_PROXY_URL)
For simple use cases where you want to use the same proxy for all requests, set the ANYCRAWL_PROXY_URL environment variable:
# Single proxy
export ANYCRAWL_PROXY_URL=http://username:password@proxy.example.com:8080
# Multiple proxies (tiered mode)
export ANYCRAWL_PROXY_URL=http://proxy1:8080,http://proxy2:8080,http://proxy3:8080

When multiple proxies are provided (comma-separated), AnyCrawl uses a tiered proxy strategy:
- All requests start with the first proxy (tier 0)
- If a proxy fails for a domain, AnyCrawl automatically switches to the next tier for that domain
- Failover is automatic and tracked per domain, so a failure on one site does not move other domains off the first proxy
This is the simplest way to configure proxies when you don't need URL-based routing.
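To see which proxy lands in which tier, the comma-separated value can be split in order. This is a minimal sketch that assumes a plain split on commas; `parse_proxy_tiers` is a hypothetical helper, not part of AnyCrawl. Under that assumption, a comma inside a password would need to be percent-encoded as `%2C`.

```python
import os


def parse_proxy_tiers(raw: str) -> list:
    """Split a comma-separated proxy list into an ordered list (index = tier)."""
    return [part.strip() for part in raw.split(",") if part.strip()]


if __name__ == "__main__":
    # Reads the same variable the section above describes.
    tiers = parse_proxy_tiers(os.environ.get("ANYCRAWL_PROXY_URL", ""))
    for tier, proxy in enumerate(tiers):
        print(f"tier {tier}: {proxy}")
```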
Method 2: Advanced Configuration File (ANYCRAWL_PROXY_CONFIG)
For URL-based proxy routing, create a JSON configuration file (e.g., proxy-config.json) and set the ANYCRAWL_PROXY_CONFIG environment variable to its path:
ANYCRAWL_PROXY_CONFIG=/path/to/proxy-config.json

Note: If both ANYCRAWL_PROXY_URL and ANYCRAWL_PROXY_CONFIG are set, the configuration file rules take precedence, and ANYCRAWL_PROXY_URL serves as a fallback for URLs that don't match any rules.
Rule Types
AnyCrawl supports three types of proxy rules, applied in priority order:
1. URL Rules (Highest Priority)
Exact URL matching. Use this when you need a specific proxy for a specific endpoint.
{
"url": "https://api.example.com/v1/data",
"proxy": "http://username:password@proxy1.example.com:8080"
}

2. Pattern Rules (Medium Priority)
Full URL pattern matching with wildcards. Useful for matching URLs with specific paths or protocols.
{
"pattern": "https://*.github.com/api/*",
"proxy": "http://username:password@proxy2.example.com:8080"
}

3. Domain Rules (Lowest Priority)
Domain-only pattern matching. Routes all requests to a domain through a specific proxy.
{
"domain": "*.gov.au",
"proxy": "http://username:password@proxy3.example.com:8080"
}

Wildcard Patterns
- `*` - Matches any number of characters
- `?` - Matches exactly one character
- Patterns are case-insensitive
Examples
- `*.example.com` - Matches `api.example.com`, `www.example.com`, `test.example.com`
- `api-?.example.com` - Matches `api-1.example.com`, `api-2.example.com`, but not `api-10.example.com`
- `https://*.example.com/api/*` - Matches any HTTPS URL on any subdomain of example.com with /api/ path
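The wildcard semantics above can be reproduced with a small glob-to-regex translation, which is handy for checking a pattern before putting it in the config file. This is an illustrative sketch, not AnyCrawl's matcher: `wildcard_match` is a hypothetical helper, and it assumes a pattern must match the whole string and that `*` may also match zero characters.

```python
import re


def wildcard_match(pattern: str, text: str) -> bool:
    """Return True if text matches pattern, where * is any run and ? is one character."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # any number of characters
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    regex = "".join(parts)
    # IGNORECASE mirrors "patterns are case-insensitive".
    return re.fullmatch(regex, text, re.IGNORECASE | re.DOTALL) is not None
```

Under these assumptions `*.example.com` does not match the bare `example.com`, because the literal dot still has to be present.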
Complete Configuration Example
{
"rules": [
{
"url": "https://api.example.com/v1/users",
"proxy": "http://premium-proxy.example.com:8080"
},
{
"pattern": "https://api.github.com/*",
"proxy": "http://github-proxy.example.com:8080"
},
{
"domain": "*.gov.au",
"proxy": "http://au-proxy.example.com:8080"
}
]
}

Proxy URL Formats
AnyCrawl supports various proxy URL formats:
- HTTP: http://username:password@proxy.example.com:8080
- HTTPS: https://username:password@proxy.example.com:8443
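A quick way to catch a malformed proxy URL before it reaches the crawler is to parse it with the standard library. The sketch below uses a hypothetical helper, `describe_proxy`, and accepts only the two schemes listed above; whether AnyCrawl accepts other schemes is not stated here, so that restriction is an assumption.

```python
from urllib.parse import unquote, urlsplit


def describe_proxy(proxy_url: str) -> dict:
    """Split a proxy URL into its parts, rejecting schemes not listed above."""
    parts = urlsplit(proxy_url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported proxy scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("proxy URL has no host")
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,  # None when the URL has no explicit port
        "username": unquote(parts.username) if parts.username else None,
        # Report only whether a password is present, to keep it out of logs.
        "has_password": parts.password is not None,
    }
```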
Debugging
With debug logging enabled, you'll see messages like:
Using proxy from request userData: http://custom-proxy:8080
Found proxy for URL https://example.com: http://proxy.example.com:8080 By matching a rule.
Proxy matched by domain pattern: *.gov.au → http://proxy.example.com:8080
Using tiered proxy: http://default-proxy:8080

Priority Example
Given the URL https://api.github.com/repos/owner/repo, the following rules would be checked in order:
- URL match: "url": "https://api.github.com/repos/owner/repo"
- Pattern match: "pattern": "https://api.github.com/*"
- Domain match: "domain": "*.github.com"
The first matching rule wins.
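The lookup order can be sketched as three passes over the rule list, one per rule type, so that file order has no effect on the result. This is an approximation of the documented behavior rather than AnyCrawl's implementation: `find_proxy` and the proxy hostnames in the usage note are hypothetical, and the wildcard handling assumes whole-string, case-insensitive matching.

```python
import re
from typing import Optional
from urllib.parse import urlsplit


def _wildcard_match(pattern: str, text: str) -> bool:
    """* matches any run of characters, ? matches exactly one; case-insensitive."""
    regex = "".join(
        ".*" if c == "*" else "." if c == "?" else re.escape(c) for c in pattern
    )
    return re.fullmatch(regex, text, re.IGNORECASE | re.DOTALL) is not None


def find_proxy(url: str, rules: list, fallback: Optional[str] = None) -> Optional[str]:
    """Return the proxy for url: URL rules first, then pattern rules, then domain rules."""
    host = urlsplit(url).hostname or ""
    for rule in rules:  # 1. exact URL rules
        if rule.get("url") == url:
            return rule["proxy"]
    for rule in rules:  # 2. full-URL wildcard patterns
        if "pattern" in rule and _wildcard_match(rule["pattern"], url):
            return rule["proxy"]
    for rule in rules:  # 3. domain-only wildcard patterns
        if "domain" in rule and _wildcard_match(rule["domain"], host):
            return rule["proxy"]
    return fallback     # e.g. the value of ANYCRAWL_PROXY_URL
```

Given the three rules from the example, listed in any order, `find_proxy("https://api.github.com/repos/owner/repo", rules)` returns the proxy of the URL rule. Removing that rule makes the pattern rule win, and removing both leaves the domain rule.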
Best Practices
- Use domain rules for broad proxy requirements (e.g., all requests to a country's government sites)
- Use pattern rules when you need to match specific paths or protocols
- Use URL rules for exact endpoints that need special handling
- Order doesn't matter in the configuration file - priority is determined by rule type
- Test your patterns using the debug logging to ensure they match as expected
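Alongside debug logging, a small pre-flight check on the configuration file can catch rules that would never match. The sketch below assumes each rule carries exactly one of `url`, `pattern`, or `domain` plus a `proxy` string, as in the examples on this page; AnyCrawl may be more lenient. `check_config` is a hypothetical helper.

```python
RULE_KEYS = ("url", "pattern", "domain")


def check_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    rules = config.get("rules")
    if not isinstance(rules, list):
        return ['top-level "rules" must be a list']
    problems = []
    for i, rule in enumerate(rules):
        if not isinstance(rule, dict):
            problems.append(f"rule {i}: must be a JSON object")
            continue
        matchers = [key for key in RULE_KEYS if key in rule]
        if len(matchers) != 1:
            problems.append(f"rule {i}: expected exactly one of {RULE_KEYS}, found {matchers}")
        proxy = rule.get("proxy")
        if not isinstance(proxy, str) or not proxy:
            problems.append(f"rule {i}: missing proxy URL")
    return problems
```

To check a file on disk, load it with `json.load` and pass the resulting dict to `check_config`.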
Tiered Proxy System
When using multiple proxies with ANYCRAWL_PROXY_URL, AnyCrawl employs an intelligent tiered proxy system:
How It Works
- Initial State: All domains start using the first proxy (tier 0)
- Error Detection: When a proxy fails for a specific domain, that domain is promoted to the next tier
- Domain-Specific: Each domain maintains its own tier level independently
Example Scenario
export ANYCRAWL_PROXY_URL=http://fast-proxy:8080,http://stable-proxy:8080,http://backup-proxy:8080

- Initial requests to `example.com` → Use `fast-proxy:8080` (tier 0)
- If `fast-proxy` fails for `example.com` → Switch to `stable-proxy:8080` (tier 1)
- Meanwhile, `github.com` might still use `fast-proxy:8080` if it's working fine
- System will periodically retry `fast-proxy` for `example.com` to check if it's recovered
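The per-domain bookkeeping in this scenario can be modelled with a dictionary from domain to tier index. This is a simplified sketch and the `TieredProxies` class is hypothetical. It does not model the periodic retry of lower tiers, because the retry schedule is not described here, and it assumes a domain stays on the last tier once it gets there.

```python
from collections import defaultdict


class TieredProxies:
    """Track, per domain, which proxy tier is currently in use."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self.proxies = list(proxies)
        # Every domain starts at tier 0 (the first proxy in the list).
        self.tier_by_domain = defaultdict(int)

    def proxy_for(self, domain: str) -> str:
        return self.proxies[self.tier_by_domain[domain]]

    def report_failure(self, domain: str) -> None:
        # Move only this domain to the next tier, stopping at the last one.
        last_tier = len(self.proxies) - 1
        self.tier_by_domain[domain] = min(self.tier_by_domain[domain] + 1, last_tier)


pool = TieredProxies([
    "http://fast-proxy:8080",
    "http://stable-proxy:8080",
    "http://backup-proxy:8080",
])
pool.report_failure("example.com")
print(pool.proxy_for("example.com"))  # example.com has moved to tier 1
print(pool.proxy_for("github.com"))   # github.com is still on tier 0
```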
Benefits
- Automatic Failover: No manual intervention needed when proxies fail
- Domain Optimization: Each domain uses the best available proxy
- Resource Efficiency: Failed proxies aren't completely abandoned
- Self-Healing: Automatically returns to optimal proxies when they recover
Complete Example: Using Both Methods
Here's an example setup that uses both configuration methods:
# Set a default proxy for general use
export ANYCRAWL_PROXY_URL=http://default-proxy:8080
# Set up URL-based routing for specific sites
export ANYCRAWL_PROXY_CONFIG=/path/to/proxy-config.json

With this proxy-config.json:
{
"rules": [
{
"domain": "*.gov.au",
"proxy": "http://au-residential-proxy:8080"
},
{
"pattern": "https://api.*.com/*",
"proxy": "http://api-optimized-proxy:3128"
}
]
}

Result:
- `https://www.homeaffairs.gov.au/` → Uses `au-residential-proxy:8080` (domain rule match)
- `https://api.github.com/repos` → Uses `api-optimized-proxy:3128` (pattern rule match)
- `https://example.com/` → Uses `default-proxy:8080` (fallback to ANYCRAWL_PROXY_URL)
Environment Variables Summary
| Variable | Purpose | Example |
|---|---|---|
| ANYCRAWL_PROXY_URL | Base proxy configuration (single or multiple) | http://proxy:8080 or http://p1:8080,http://p2:8080 |
| ANYCRAWL_PROXY_STEALTH_URL | Stealth/premium proxy (residential proxies) | http://residential-proxy:8080 |
| ANYCRAWL_PROXY_CONFIG | Path to JSON config file for URL-based routing | /path/to/proxy-config.json |
Priority Order
- Highest: Proxy mode or custom URL specified in request options

  { "url": "https://example.com", "engine": "playwright", "proxy": "stealth" }

  Or with a custom URL:

  { "url": "https://example.com", "engine": "cheerio", "proxy": "http://custom-proxy:8080" }

- High: URL-based rules from ANYCRAWL_PROXY_CONFIG
- Low: Tiered proxies from ANYCRAWL_PROXY_URL (fallback)