scraper Agent

Production-grade web scraping. Handles failures, timeouts, and rate limits.

Basic Usage

agents:
  - name: scrape-page
    agent: scraper
    inputs:
      url: https://example.com/data
Output:
{
  "html": "<html>...</html>",
  "text": "extracted text content",
  "status": 200,
  "headers": {...}
}

Inputs

inputs:
  url:
    type: string
    required: true
    description: URL to scrape

  method:
    type: string
    default: GET
    description: HTTP method

  headers:
    type: object
    description: Custom headers

  timeout:
    type: number
    default: 30000
    description: Request timeout (ms)

  userAgent:
    type: string
    default: "Conductor/1.0"
    description: User-Agent header

  followRedirects:
    type: boolean
    default: true
    description: Follow HTTP redirects
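A sketch combining the documented inputs; the URL and header values are placeholders:

agents:
  - name: scrape-api
    agent: scraper
    inputs:
      url: https://example.com/api/data
      method: GET
      headers:
        Accept: application/json
      timeout: 10000
      userAgent: "Conductor/1.0 (+https://example.com/bot)"
      followRedirects: false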

Configuration

Basic Scrape

agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}

With Custom Headers

agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}
      headers:
        Authorization: Bearer ${env.API_KEY}
        Accept: application/json

With Retries

agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}
    retry:
      maxAttempts: 5
      backoff: exponential
      retryOn: [500, 502, 503, 504]

With Caching

agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}
    cache:
      ttl: 3600  # 1 hour
      key: scrape-${input.url}

Advanced Patterns

Multiple URLs with Fallback

agents:
  - name: scrape-primary
    agent: scraper
    inputs:
      url: https://primary.example.com/data
    retry:
      maxAttempts: 2

  - name: scrape-backup
    condition: ${scrape-primary.failed}
    agent: scraper
    inputs:
      url: https://backup.example.com/data
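Appended to the list above, a hedged follow-up step (pick-result is an illustrative name) that returns whichever scrape succeeded, assuming outputs interpolate into code the same way as in the Content Aggregation example below:

  - name: pick-result
    operation: code
    config:
      code: |
        // Use the backup result only if the primary scrape failed.
        return ${scrape-primary.failed}
          ? ${scrape-backup.output}
          : ${scrape-primary.output};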

Rate-Limited Scraping

agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}
      timeout: 60000
    retry:
      maxAttempts: 3
      backoff: exponential
      initialDelay: 1000
      maxDelay: 10000

Extract Specific Content

agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}

  - name: extract
    operation: think
    config:
      provider: openai
      model: gpt-4o-mini
      prompt: |
        Extract product information from this HTML:
        ${scrape.output.html}

        Return JSON with: name, price, description, availability

Scrape + Validate

agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}

  - name: validate
    agent: validator
    inputs:
      data: ${scrape.output}
      schema:
        status: number
        html: string

Output Schema

{
  html: string;        // Full HTML content
  text: string;        // Extracted text (no tags)
  status: number;      // HTTP status code
  headers: object;     // Response headers
  url: string;         // Final URL (after redirects)
  redirected: boolean; // Whether redirects occurred
}
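A small sketch using the redirect fields; it assumes conditions can reference arbitrary output fields, not just .failed:

agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}

  - name: log-redirect
    condition: ${scrape.output.redirected}
    operation: code
    config:
      code: |
        // Record the final URL when the request was redirected.
        return { finalUrl: '${scrape.output.url}' };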

Error Handling

The scraper agent handles:
  • Network timeouts: Automatic retry with exponential backoff
  • 5xx errors: Retries with configurable attempts
  • 4xx errors: No retry (client error)
  • Redirect loops: Fails after 10 redirects
  • Invalid URLs: Immediate failure
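The no-retry default for 4xx can be overridden per agent. A hedged sketch, assuming retryOn accepts arbitrary status codes, that also retries 429 rate-limit responses:

agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}
    retry:
      maxAttempts: 4
      backoff: exponential
      retryOn: [429, 500, 502, 503, 504]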
Access error details:
agents:
  - name: scrape
    agent: scraper
    inputs:
      url: ${input.url}

  - name: handle-error
    condition: ${scrape.failed}
    operation: code
    config:
      code: |
        return {
          error: ${scrape.error.message},
          status: ${scrape.error.status},
          retry: ${scrape.error.retryable}
        };

Best Practices

1. Always Set Timeouts
inputs:
  timeout: 30000  # 30 seconds
2. Use Caching for Static Content
cache:
  ttl: 86400  # 24 hours for slow-changing content
3. Respect robots.txt (see the pre-flight check sketch after this list)
inputs:
  userAgent: "YourBot/1.0 (+https://yoursite.com/bot)"
4. Handle Failures Gracefully
agents:
  - name: scrape
    agent: scraper
  - name: fallback
    condition: ${scrape.failed}
    operation: code
    config:
      code: return { cached: true };
5. Rate Limit Proactively
retry:
  maxAttempts: 3
  initialDelay: 1000
  maxDelay: 5000
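For item 3, a hedged pre-flight check built from the same primitives. Step names are illustrative, the path matching is deliberately simplified, and it assumes conditions can reference output fields:

agents:
  - name: fetch-robots
    agent: scraper
    inputs:
      url: https://example.com/robots.txt

  - name: check-robots
    operation: code
    config:
      code: |
        // Simplified check: block only if an exact Disallow rule matches
        // the target path. Real robots.txt parsing is more involved.
        const rules = ${fetch-robots.output.text}
          .split('\n')
          .filter(line => line.startsWith('Disallow:'))
          .map(line => line.slice('Disallow:'.length).trim());
        return { allowed: !rules.includes('/data') };

  - name: scrape
    condition: ${check-robots.output.allowed}
    agent: scraper
    inputs:
      url: https://example.com/data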

Common Use Cases

Company Research
agents:
  - name: scrape-homepage
    agent: scraper
    inputs:
      url: https://${input.company_domain}

  - name: extract-info
    operation: think
    config:
      prompt: Extract company info from: ${scrape-homepage.output.text}
Price Monitoring
agents:
  - name: scrape-product
    agent: scraper
    inputs:
      url: ${input.product_url}
    cache:
      ttl: 3600

  - name: extract-price
    operation: code
    config:
      code: |
        // Find the first price-like pattern (e.g. $19.99) in the page HTML.
        const match = ${scrape-product.output.html}.match(/\$(\d+\.\d{2})/);
        return { price: match ? parseFloat(match[1]) : null };
Content Aggregation
agents:
  - name: scrape-sources
    agent: scraper
    inputs:
      url: ${input.sources}  # Array of URLs

  - name: combine
    operation: code
    config:
      code: |
        return {
          content: ${scrape-sources.output}.map(r => r.text).join('\n\n')
        };

Limitations

  • JavaScript rendering: Not supported (use Puppeteer via the code operation; see the sketch below)
  • CAPTCHA: Cannot bypass (requires manual intervention)
  • Authentication: Basic only (OAuth via custom implementation)
  • File downloads: Binary content not stored (use http operation)
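For the JavaScript-rendering limitation, a hedged sketch of the Puppeteer workaround via the code operation; it assumes the code runtime can require puppeteer and await async calls:

agents:
  - name: render-page
    operation: code
    config:
      code: |
        // Assumes puppeteer is installed and available to the code runtime.
        const puppeteer = require('puppeteer');
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto('${input.url}', { waitUntil: 'networkidle0' });
        const html = await page.content();
        const text = await page.evaluate(() => document.body.innerText);
        await browser.close();
        return { html, text };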

Next Steps