Proxy Configuration
Configure URL-based proxy routing for AnyCrawl
Proxy Configuration
AnyCrawl supports flexible proxy routing based on URL patterns. You can configure different proxies for different websites or API endpoints.
Configuration Methods
Method 1: Simple Proxy Configuration (ANYCRAWL_PROXY_URL)
For simple use cases where you want to use the same proxy for all requests, set the ANYCRAWL_PROXY_URL
environment variable:
# Single proxy
export ANYCRAWL_PROXY_URL=http://username:password@proxy.example.com:8080
# Multiple proxies (tiered mode)
export ANYCRAWL_PROXY_URL=http://proxy1:8080,http://proxy2:8080,http://proxy3:8080
When multiple proxies are provided (comma-separated), AnyCrawl uses a tiered proxy strategy:
- All requests start with the first proxy (tier 0)
- If a proxy fails for a domain, AnyCrawl automatically switches to the next tier for that domain
- This provides intelligent failover and optimal proxy usage
This is the simplest way to configure proxies when you don't need URL-based routing.
Method 2: Advanced Configuration File (ANYCRAWL_PROXY_CONFIG)
For URL-based proxy routing, create a JSON configuration file (e.g., proxy-config.json
) and set the ANYCRAWL_PROXY_CONFIG
environment variable to its path:
ANYCRAWL_PROXY_CONFIG=/path/to/proxy-config.json
Note: If both ANYCRAWL_PROXY_URL
and ANYCRAWL_PROXY_CONFIG
are set, the configuration file rules take precedence, and ANYCRAWL_PROXY_URL
serves as a fallback for URLs that don't match any rules.
Rule Types
AnyCrawl supports three types of proxy rules, applied in priority order:
1. URL Rules (Highest Priority)
Exact URL matching. Use this when you need a specific proxy for a specific endpoint.
{
"url": "https://api.example.com/v1/data",
"proxy": "http://username:password@proxy1.example.com:8080"
}
2. Pattern Rules (Medium Priority)
Full URL pattern matching with wildcards. Useful for matching URLs with specific paths or protocols.
{
"pattern": "https://*.github.com/api/*",
"proxy": "http://username:password@proxy2.example.com:8080"
}
3. Domain Rules (Lowest Priority)
Domain-only pattern matching. Routes all requests to a domain through a specific proxy.
{
"domain": "*.gov.au",
"proxy": "http://username:password@proxy3.example.com:8080"
}
Wildcard Patterns
*
- Matches any number of characters?
- Matches exactly one character- Patterns are case-insensitive
Examples
*.example.com
- Matchesapi.example.com
,www.example.com
,test.example.com
api-?.example.com
- Matchesapi-1.example.com
,api-2.example.com
, but notapi-10.example.com
https://*.example.com/api/*
- Matches any HTTPS URL on any subdomain of example.com with /api/ path
Complete Configuration Example
{
"rules": [
{
"url": "https://api.example.com/v1/users",
"proxy": "http://premium-proxy.example.com:8080"
},
{
"pattern": "https://api.github.com/*",
"proxy": "http://github-proxy.example.com:8080"
},
{
"domain": "*.gov.au",
"proxy": "http://au-proxy.example.com:8080"
}
]
}
Proxy URL Formats
AnyCrawl supports various proxy URL formats:
- HTTP:
http://username:password@proxy.example.com:8080
- HTTPS:
https://username:password@proxy.example.com:8443
Debugging
You'll see messages like:
Using proxy from request userData: http://custom-proxy:8080
Found proxy for URL https://example.com: http://proxy.example.com:8080 By matching a rule.
Proxy matched by domain pattern: *.gov.au → http://proxy.example.com:8080
Using tiered proxy: http://default-proxy:8080
Priority Example
Given the URL https://api.github.com/repos/owner/repo
, the following rules would be checked in order:
- URL match:
"url": "https://api.github.com/repos/owner/repo"
- Pattern match:
"pattern": "https://api.github.com/*"
- Domain match:
"domain": "*.github.com"
The first matching rule wins.
Best Practices
- Use domain rules for broad proxy requirements (e.g., all requests to a country's government sites)
- Use pattern rules when you need to match specific paths or protocols
- Use URL rules for exact endpoints that need special handling
- Order doesn't matter in the configuration file - priority is determined by rule type
- Test your patterns using the debug logging to ensure they match as expected
Tiered Proxy System
When using multiple proxies with ANYCRAWL_PROXY_URL
, AnyCrawl employs an intelligent tiered proxy system:
How It Works
- Initial State: All domains start using the first proxy (tier 0)
- Error Detection: When a proxy fails for a specific domain, that domain is promoted to the next tier
- Domain-Specific: Each domain maintains its own tier level independently
Example Scenario
export ANYCRAWL_PROXY_URL=http://fast-proxy:8080,http://stable-proxy:8080,http://backup-proxy:8080
- Initial requests to
example.com
→ Usefast-proxy:8080
(tier 0) - If
fast-proxy
fails forexample.com
→ Switch tostable-proxy:8080
(tier 1) - Meanwhile,
github.com
might still usefast-proxy:8080
if it's working fine - System will periodically retry
fast-proxy
forexample.com
to check if it's recovered
Benefits
- Automatic Failover: No manual intervention needed when proxies fail
- Domain Optimization: Each domain uses the best available proxy
- Resource Efficiency: Failed proxies aren't completely abandoned
- Self-Healing: Automatically returns to optimal proxies when they recover
Complete Example: Using Both Methods
Here's an example setup that uses both configuration methods:
# Set a default proxy for general use
export ANYCRAWL_PROXY_URL=http://default-proxy:8080
# Set up URL-based routing for specific sites
export ANYCRAWL_PROXY_CONFIG=/path/to/proxy-config.json
With this proxy-config.json:
{
"rules": [
{
"domain": "*.gov.au",
"proxy": "http://au-residential-proxy:8080"
},
{
"pattern": "https://api.*.com/*",
"proxy": "http://api-optimized-proxy:3128"
}
]
}
Result:
https://www.homeaffairs.gov.au/
→ Usesau-residential-proxy:8080
(domain rule match)https://api.github.com/repos
→ Usesapi-optimized-proxy:3128
(pattern rule match)https://example.com/
→ Usesdefault-proxy:8080
(fallback to ANYCRAWL_PROXY_URL)
Environment Variables Summary
Variable | Purpose | Example |
---|---|---|
ANYCRAWL_PROXY_URL | Tiered proxy configuration (single or multiple) | http://proxy:8080 or http://p1:8080,http://p2:8080 |
ANYCRAWL_PROXY_CONFIG | Path to JSON config file for URL-based routing | /path/to/proxy-config.json |
Priority Order
- Highest: Proxy specified in request options (user-provided in API request)
{ "url": "https://example.com", "engine": "cheerio", "proxy": "http://custom-proxy:8080" }
- High: URL-based rules from
ANYCRAWL_PROXY_CONFIG
- Low: Tiered proxies from
ANYCRAWL_PROXY_URL
(fallback)