JSON mode
Extract structured data from web pages using LLMs with JSON schema validation.
Introduction
JSON mode enables you to extract structured data from any webpage using Large Language Models (LLMs).
AnyCrawl allows you to define a JSON schema and prompts to guide the extraction process, ensuring the output matches your desired format.
Usage Examples
JSON mode supports three structured extraction modes:
- Prompt Only: Use a natural language prompt to describe what you want to extract. The LLM determines the output structure.
- Schema Only: Define a JSON schema to strictly constrain the output structure. The LLM fills in the content.
- Prompt + Schema: Combine both schema and prompt for precise structure and content guidance.
1. Prompt Only (Flexible Structure)
Steps:
- Set the target webpage URL
- Write a prompt describing the information to extract
- Send the request
curl -X POST "https://api.anycrawl.dev/v1/scrape" \
-H "Authorization: Bearer <your-api-key>" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"json_options": {
"user_prompt": "Extract the company mission, open source status, and employee count."
}
}'
const response = await fetch("https://api.anycrawl.dev/v1/scrape", {
method: "POST",
headers: {
Authorization: "Bearer YOUR_API_KEY",
"Content-Type": "application/json",
},
body: JSON.stringify({
url: "https://example.com",
json_options: {
user_prompt: "Extract the company mission, open source status, and employee count.",
},
}),
});
const result = await response.json();
console.log(result.data.json);
import requests
response = requests.post(
'https://api.anycrawl.dev/v1/scrape',
headers={
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
json={
'url': 'https://example.com',
'json_options': {
'user_prompt': 'Extract the company mission, open source status, and employee count.'
}
}
)
print(response.json()['data']['json'])
2. Schema Only (Strict Structure)
Steps:
- Set the target webpage URL
- Define a JSON schema for the desired fields and types
- Send the request
curl -X POST "https://api.anycrawl.dev/v1/scrape" \
-H "Authorization: Bearer <your-api-key>" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"json_options": {
"schema": {
"type": "object",
"properties": {
"company_mission": { "type": "string" },
"is_open_source": { "type": "boolean" },
"employee_count": { "type": "number" }
},
"required": ["company_mission"]
}
}
}'
const response = await fetch("https://api.anycrawl.dev/v1/scrape", {
method: "POST",
headers: {
Authorization: "Bearer YOUR_API_KEY",
"Content-Type": "application/json",
},
body: JSON.stringify({
url: "https://example.com",
json_options: {
schema: {
type: "object",
properties: {
company_mission: { type: "string" },
is_open_source: { type: "boolean" },
employee_count: { type: "number" },
},
required: ["company_mission"],
},
},
}),
});
const result = await response.json();
console.log(result.data.json);
schema = {
"type": "object",
"properties": {
"company_mission": {"type": "string"},
"is_open_source": {"type": "boolean"},
"employee_count": {"type": "number"}
},
"required": ["company_mission"]
}
import requests
response = requests.post(
'https://api.anycrawl.dev/v1/scrape',
headers={
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
json={
'url': 'https://example.com',
'json_options': {
'schema': schema
}
}
)
print(response.json()['data']['json'])
3. Prompt + Schema (Guided Structure & Content)
Steps:
- Set the target webpage URL
- Define a JSON schema for the output structure
- Write a prompt to guide the extraction
- Send the request
curl -X POST "https://api.anycrawl.dev/v1/scrape" \
-H "Authorization: Bearer <your-api-key>" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"json_options": {
"schema": {
"type": "object",
"properties": {
"company_mission": { "type": "string" },
"is_open_source": { "type": "boolean" },
"employee_count": { "type": "number" }
},
"required": ["company_mission"]
},
"user_prompt": "Extract the company mission (string), open source status (boolean), and employee count (number)."
}
}'
const response = await fetch("https://api.anycrawl.dev/v1/scrape", {
method: "POST",
headers: {
Authorization: "Bearer YOUR_API_KEY",
"Content-Type": "application/json",
},
body: JSON.stringify({
url: "https://example.com",
json_options: {
schema: {
type: "object",
properties: {
company_mission: { type: "string" },
is_open_source: { type: "boolean" },
employee_count: { type: "number" },
},
required: ["company_mission"],
},
user_prompt:
"Extract the company mission (string), open source status (boolean), and employee count (number).",
},
}),
});
const result = await response.json();
console.log(result.data.json);
schema = {
"type": "object",
"properties": {
"company_mission": {"type": "string"},
"is_open_source": {"type": "boolean"},
"employee_count": {"type": "number"}
},
"required": ["company_mission"]
}
import requests
response = requests.post(
'https://api.anycrawl.dev/v1/scrape',
headers={
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
json={
'url': 'https://example.com',
'json_options': {
'schema': schema,
'user_prompt': 'Extract the company mission (string), open source status (boolean), and employee count (number).'
}
}
)
print(response.json()['data']['json'])
JSON options object
The json_options
parameter is an object that accepts the following parameters:
schema
: The schema to use for the extraction.user_prompt
: The user prompt to use for the extraction.schema_name
: Optional name of the output that should be generated.schema_description
: Optional description of the output that should be generated.
JSON Schema
The schema to use for the extraction. Common fields include:
type
: The type of the value, supported types show below.properties
: (forobject
) An object specifying the fields and their schemas.required
: (forobject
) An array of required field names.
Supported Types
Type | Description | Example |
---|---|---|
string | Text data | Company name, descriptions, titles |
number | Numeric values | Prices, quantities, ratings |
boolean | True/false values | Availability flags, feature indicators |
object | Nested structures | Address details, product specifications |
array | Lists of values or objects | Tags, categories, feature lists |
Schema Example
{
"type": "object",
"properties": {
"company_name": {
"type": "string"
},
"is_open_source": {
"type": "object",
"properties": {
"answer": {
"type": "string"
},
"value": {
"type": "boolean"
}
}
},
"employee_count": {
"type": "number"
}
},
"required": ["company_name"]
}