scraper Agent

Production-grade web scraping. Handles failures, timeouts, and rate limits.

Basic Usage

agents:
  - name: scrape-page
    agent: scraper
    inputs:
      url: https://example.com/data
Output:
{
  "html": "<html>...</html>",
  "text": "extracted text content",
  "status": 200,
  "headers": {...}
}

Inputs

inputs:
  url:
    type: string
    required: true
    description: URL to scrape

  method:
    type: string
    default: GET
    description: HTTP method

  headers:
    type: object
    description: Custom headers

  timeout:
    type: number
    default: 30000
    description: Request timeout (ms)

  userAgent:
    type: string
    default: "Conductor/1.0"
    description: User-Agent header

  followRedirects:
    type: boolean
    default: true
    description: Follow HTTP redirects
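A sketch combining the documented inputs; the URL and header values are placeholders:

agents:
  - name: scrape-api
    agent: scraper
    inputs:
      url: https://example.com/api/data
      method: GET
      headers:
        Accept: application/json
      timeout: 10000
      userAgent: "Conductor/1.0 (+https://example.com/bot)"
      followRedirects: false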

Configuration

Basic Scrape

agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}

With Custom Headers

agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}
      headers:
        Authorization: Bearer ${env.API_KEY}
        Accept: application/json

With Retries

agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}
    retry:
      maxAttempts: 5
      backoff: exponential
      retryOn: [500, 502, 503, 504]

With Caching

agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}
    cache:
      ttl: 3600  # 1 hour
      key: scrape-${input.url}

Advanced Patterns

Multiple URLs with Fallback

agents:
  - name: scrape-primary
    agent: scraper
    inputs:
      url: https://primary.example.com/data
    retry:
      maxAttempts: 2

  - name: scrape-backup
    condition: ${scrape-primary.failed}
    agent: scraper
    inputs:
      url: https://backup.example.com/data
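Appended to the list above, a hedged follow-up step (pick-result is an illustrative name) that returns whichever scrape succeeded, assuming outputs interpolate into code the same way as in the Content Aggregation example below:

  - name: pick-result
    operation: code
    config:
      code: |
        // Use the backup result only if the primary scrape failed.
        return ${scrape-primary.failed}
          ? ${scrape-backup.output}
          : ${scrape-primary.output};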

Rate-Limited Scraping

agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}
      timeout: 60000
    retry:
      maxAttempts: 3
      backoff: exponential
      initialDelay: 1000
      maxDelay: 10000

Extract Specific Content

agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}

  - name: extract
    operation: think
    config:
      provider: openai
      model: gpt-4o-mini
      prompt: |
        Extract product information from this HTML:
        ${scrape.output.html}

        Return JSON with: name, price, description, availability

Scrape + Validate

agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}

  - name: validate
    agent: validator
    inputs:
      data: ${scrape.output}
      schema:
        status: number
        html: string

Output Schema

{
  html: string;        // Full HTML content
  text: string;        // Extracted text (no tags)
  status: number;      // HTTP status code
  headers: object;     // Response headers
  url: string;         // Final URL (after redirects)
  redirected: boolean; // Whether redirects occurred
}
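A small sketch using the redirect fields; it assumes conditions can reference arbitrary output fields, not just .failed:

agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}

  - name: log-redirect
    condition: ${scrape.output.redirected}
    operation: code
    config:
      code: |
        // Record the final URL when the request was redirected.
        return { finalUrl: '${scrape.output.url}' };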

Error Handling

The scraper agent handles:
  • Network timeouts: Automatic retry with exponential backoff
  • 5xx errors: Retries with configurable attempts
  • 4xx errors: No retry (client error)
  • Redirect loops: Fails after 10 redirects
  • Invalid URLs: Immediate failure
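The no-retry default for 4xx can be overridden per agent. A hedged sketch, assuming retryOn accepts arbitrary status codes, that also retries 429 rate-limit responses:

agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}
    retry:
      maxAttempts: 4
      backoff: exponential
      retryOn: [429, 500, 502, 503, 504]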
Access error details:
agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}

  - name: handle-error
    condition: ${scrape.failed}
    operation: code
    config:
      code: |
        return {
          error: ${scrape.error.message},
          status: ${scrape.error.status},
          retry: ${scrape.error.retryable}
        };

Best Practices

1. Always Set Timeouts
inputs:
  timeout: 30000  # 30 seconds
2. Use Caching for Static Content
cache:
  ttl: 86400  # 24 hours for slow-changing content
3. Respect robots.txt (see the pre-flight check sketch after this list)
inputs:
  userAgent: "YourBot/1.0 (+https://yoursite.com/bot)"
4. Handle Failures Gracefully
agents:
  - name: scrape
    agent: scraper
  - name: fallback
    condition: ${scrape.failed}
    operation: code
    config:
      code: return { cached: true };
5. Rate Limit Proactively
retry:
  maxAttempts: 3
  initialDelay: 1000
  maxDelay: 5000
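For item 3, a hedged pre-flight check built from the same primitives. Step names are illustrative, the path matching is deliberately simplified, and it assumes conditions can reference output fields:

agents:
  - name: fetch-robots
    agent: scraper
    inputs:
      url: https://example.com/robots.txt

  - name: check-robots
    operation: code
    config:
      code: |
        // Simplified check: block only if an exact Disallow rule matches
        // the target path. Real robots.txt parsing is more involved.
        const rules = ${fetch-robots.output.text}
          .split('\n')
          .filter(line => line.startsWith('Disallow:'))
          .map(line => line.slice('Disallow:'.length).trim());
        return { allowed: !rules.includes('/data') };

  - name: scrape
    condition: ${check-robots.output.allowed}
    agent: scraper
    inputs:
      url: https://example.com/data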

Common Use Cases

Company Research
agents:
  - name: scrape-homepage
    agent: scraper
    inputs:
      url: https://${input.company_domain}

  - name: extract-info
    operation: think
    config:
      prompt: Extract company info from: ${scrape-homepage.output.text}
Price Monitoring
agents:
  - name: scrape-product
    agent: scraper
    inputs:
      url: ${input.product_url}
    cache:
      ttl: 3600

  - name: extract-price
    operation: code
    config:
      code: |
        // Find the first price-like pattern (e.g. $19.99) in the page HTML.
        const match = ${scrape-product.output.html}.match(/\$(\d+\.\d{2})/);
        return { price: match ? parseFloat(match[1]) : null };
Content Aggregation
agents:
  - name: scrape-sources
    agent: scraper
    inputs:
      url: ${input.sources}  # Array of URLs

  - name: combine
    operation: code
    config:
      code: |
        return {
          content: ${scrape-sources.output}.map(r => r.text).join('\n\n')
        };

Limitations

  • JavaScript rendering: Not supported (use Puppeteer via the code operation; see the sketch below)
  • CAPTCHA: Cannot bypass (requires manual intervention)
  • Authentication: Basic only (OAuth via custom implementation)
  • File downloads: Binary content not stored (use http operation)
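For the JavaScript-rendering limitation, a hedged sketch of the Puppeteer workaround via the code operation; it assumes the code runtime can require puppeteer and await async calls:

agents:
  - name: render-page
    operation: code
    config:
      code: |
        // Assumes puppeteer is installed and available to the code runtime.
        const puppeteer = require('puppeteer');
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto('${input.url}', { waitUntil: 'networkidle0' });
        const html = await page.content();
        const text = await page.evaluate(() => document.body.innerText);
        await browser.close();
        return { html, text };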

Next Steps