AnyCrawl

JSON mode

Extract structured data from web pages using LLMs with JSON schema validation.

Introduction

JSON mode enables you to extract structured data from any webpage using Large Language Models (LLMs).

AnyCrawl allows you to define a JSON schema and prompts to guide the extraction process, ensuring the output matches your desired format.

Usage Examples

JSON mode supports three structured extraction modes:

  1. Prompt Only: Use a natural language prompt to describe what you want to extract. The LLM determines the output structure.
  2. Schema Only: Define a JSON schema to strictly constrain the output structure. The LLM fills in the content.
  3. Prompt + Schema: Combine both schema and prompt for precise structure and content guidance.

1. Prompt Only (Flexible Structure)

Steps:

  1. Set the target webpage URL
  2. Write a prompt describing the information to extract
  3. Send the request
curl -X POST "https://api.anycrawl.dev/v1/scrape" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "json_options": {
      "user_prompt": "Extract the company mission, open source status, and employee count."
    }
  }'
const response = await fetch("https://api.anycrawl.dev/v1/scrape", {
    method: "POST",
    headers: {
        Authorization: "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    body: JSON.stringify({
        url: "https://example.com",
        json_options: {
            user_prompt: "Extract the company mission, open source status, and employee count.",
        },
    }),
});
const result = await response.json();
console.log(result.data.json);
import requests

response = requests.post(
    'https://api.anycrawl.dev/v1/scrape',
    headers={
        'Authorization': 'Bearer YOUR_API_KEY',
        'Content-Type': 'application/json'
    },
    json={
        'url': 'https://example.com',
        'json_options': {
            'user_prompt': 'Extract the company mission, open source status, and employee count.'
        }
    }
)
print(response.json()['data']['json'])

2. Schema Only (Strict Structure)

Steps:

  1. Set the target webpage URL
  2. Define a JSON schema for the desired fields and types
  3. Send the request
curl -X POST "https://api.anycrawl.dev/v1/scrape" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "json_options": {
      "schema": {
        "type": "object",
        "properties": {
          "company_mission": { "type": "string" },
          "is_open_source": { "type": "boolean" },
          "employee_count": { "type": "number" }
        },
        "required": ["company_mission"]
      }
    }
  }'
const response = await fetch("https://api.anycrawl.dev/v1/scrape", {
    method: "POST",
    headers: {
        Authorization: "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    body: JSON.stringify({
        url: "https://example.com",
        json_options: {
            schema: {
                type: "object",
                properties: {
                    company_mission: { type: "string" },
                    is_open_source: { type: "boolean" },
                    employee_count: { type: "number" },
                },
                required: ["company_mission"],
            },
        },
    }),
});
const result = await response.json();
console.log(result.data.json);
schema = {
    "type": "object",
    "properties": {
        "company_mission": {"type": "string"},
        "is_open_source": {"type": "boolean"},
        "employee_count": {"type": "number"}
    },
    "required": ["company_mission"]
}

import requests

response = requests.post(
    'https://api.anycrawl.dev/v1/scrape',
    headers={
        'Authorization': 'Bearer YOUR_API_KEY',
        'Content-Type': 'application/json'
    },
    json={
        'url': 'https://example.com',
        'json_options': {
            'schema': schema
        }
    }
)
print(response.json()['data']['json'])

3. Prompt + Schema (Guided Structure & Content)

Steps:

  1. Set the target webpage URL
  2. Define a JSON schema for the output structure
  3. Write a prompt to guide the extraction
  4. Send the request
curl -X POST "https://api.anycrawl.dev/v1/scrape" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "json_options": {
      "schema": {
        "type": "object",
        "properties": {
          "company_mission": { "type": "string" },
          "is_open_source": { "type": "boolean" },
          "employee_count": { "type": "number" }
        },
        "required": ["company_mission"]
      },
      "user_prompt": "Extract the company mission (string), open source status (boolean), and employee count (number)."
    }
  }'
const response = await fetch("https://api.anycrawl.dev/v1/scrape", {
    method: "POST",
    headers: {
        Authorization: "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    body: JSON.stringify({
        url: "https://example.com",
        json_options: {
            schema: {
                type: "object",
                properties: {
                    company_mission: { type: "string" },
                    is_open_source: { type: "boolean" },
                    employee_count: { type: "number" },
                },
                required: ["company_mission"],
            },
            user_prompt:
                "Extract the company mission (string), open source status (boolean), and employee count (number).",
        },
    }),
});
const result = await response.json();
console.log(result.data.json);
schema = {
    "type": "object",
    "properties": {
        "company_mission": {"type": "string"},
        "is_open_source": {"type": "boolean"},
        "employee_count": {"type": "number"}
    },
    "required": ["company_mission"]
}

import requests

response = requests.post(
    'https://api.anycrawl.dev/v1/scrape',
    headers={
        'Authorization': 'Bearer YOUR_API_KEY',
        'Content-Type': 'application/json'
    },
    json={
        'url': 'https://example.com',
        'json_options': {
            'schema': schema,
            'user_prompt': 'Extract the company mission (string), open source status (boolean), and employee count (number).'
        }
    }
)
print(response.json()['data']['json'])

JSON options object

The json_options parameter is an object that accepts the following parameters:

  • schema: The schema to use for the extraction.
  • user_prompt: The user prompt to use for the extraction.
  • schema_name: Optional name of the output that should be generated.
  • schema_description: Optional description of the output that should be generated.

JSON Schema

The schema to use for the extraction. Common fields include:

  • type: The type of the value, supported types show below.
  • properties: (for object) An object specifying the fields and their schemas.
  • required: (for object) An array of required field names.

Supported Types

TypeDescriptionExample
stringText dataCompany name, descriptions, titles
numberNumeric valuesPrices, quantities, ratings
booleanTrue/false valuesAvailability flags, feature indicators
objectNested structuresAddress details, product specifications
arrayLists of values or objectsTags, categories, feature lists

Schema Example

{
    "type": "object",
    "properties": {
        "company_name": {
            "type": "string"
        },
        "is_open_source": {
            "type": "object",
            "properties": {
                "answer": {
                    "type": "string"
                },
                "value": {
                    "type": "boolean"
                }
            }
        },
        "employee_count": {
            "type": "number"
        }
    },
    "required": ["company_name"]
}