Starter Kit: ships with your template. You own it; modify it freely.

Basic Usage

agents:
  - name: scrape-page
    agent: scrape
    inputs:
      url: https://example.com/data
Output:
{
  "success": true,
  "url": "https://example.com/data",
  "markdown": "# Example Domain\n\nThis domain is for...",
  "html": "<html>...</html>",
  "text": "extracted text content",
  "title": "Example Domain",
  "tier": 1,
  "duration": 350,
  "botProtectionDetected": false,
  "contentLength": 1256
}

Inputs

inputs:
  url:
    type: string
    format: uri
    required: true
    description: URL to scrape

Configuration

config:
  strategy:
    type: string
    enum: [fast, balanced, aggressive]
    default: balanced
    description: |
      fast: Only Tier 1, return fallback on failure
      balanced: Tier 1 + Tier 2, then fallback
      aggressive: All 3 tiers

  returnFormat:
    type: string
    enum: [markdown, html, text]
    default: markdown
    description: Format for the returned content

  blockResources:
    type: boolean
    default: true
    description: Block images, fonts, etc. for faster scraping

  userAgent:
    type: string
    description: Custom user agent string

  timeout:
    type: integer
    default: 30000
    description: Request timeout in milliseconds

Scraping Strategies

The scrape agent uses a 3-tier approach with automatic fallback:

Tier 1: Fast Browser Rendering (~350ms)

  • Uses domcontentloaded wait strategy
  • Fastest scraping method
  • Best for simple static pages

Tier 2: Slow Browser Rendering (~2s)

  • Uses networkidle2 wait strategy
  • Waits for network activity to settle
  • Good for JavaScript-heavy pages

Tier 3: HTML Parsing Fallback (~1.5s)

  • Pure HTML parsing without browser
  • Used when bot protection is detected
  • Most reliable but least feature-rich
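
Conceptually, the fallback order looks like the sketch below. This is an illustration only, assuming a Puppeteer-style rendering helper (renderWithBrowser is hypothetical); it is not the agent's source code.

// A minimal sketch of the 3-tier fallback order described above.
// Only the wait strategies and the tier ordering come from this documentation.

type WaitStrategy = 'domcontentloaded' | 'networkidle2'

// Assumed helper wrapping a Puppeteer-style page.goto(); not part of the agent's API
declare function renderWithBrowser(
  url: string,
  options: { waitUntil: WaitStrategy; timeout: number }
): Promise<string>

async function scrapeWithFallback(url: string, timeout = 30000) {
  // Tier 1: fast render, resolve as soon as the DOM is parsed (~350ms typical)
  try {
    return { tier: 1 as const, html: await renderWithBrowser(url, { waitUntil: 'domcontentloaded', timeout }) }
  } catch { /* fall through to Tier 2 */ }

  // Tier 2: slower render, wait for network activity to settle (~2s typical)
  try {
    return { tier: 2 as const, html: await renderWithBrowser(url, { waitUntil: 'networkidle2', timeout }) }
  } catch { /* fall through to Tier 3 */ }

  // Tier 3: plain HTML fetch with no browser, used when rendering is blocked
  const response = await fetch(url, { signal: AbortSignal.timeout(timeout) })
  return { tier: 3 as const, html: await response.text() }
}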

Strategy Options

Fast Strategy
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    config:
      strategy: fast
      returnFormat: text
Balanced Strategy (Default)
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    config:
      strategy: balanced
      returnFormat: markdown
Aggressive Strategy
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    config:
      strategy: aggressive
      returnFormat: html

Return Formats

Markdown Format (Default)

agents:
  - name: scrape
    agent: scrape
    inputs:
      url: https://example.com
    config:
      returnFormat: markdown
Output includes clean markdown conversion of the page content.

HTML Format

agents:
  - name: scrape
    agent: scrape
    inputs:
      url: https://example.com
    config:
      returnFormat: html
Output includes raw HTML for further processing.

Text Format

agents:
  - name: scrape
    agent: scrape
    inputs:
      url: https://example.com
    config:
      returnFormat: text
Output includes plain text extraction (no tags).

Advanced Patterns

With Caching

agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    cache:
      ttl: 3600  # 1 hour
      key: scrape-${input.url}

With Retries

agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    retry:
      maxAttempts: 5
      backoff: exponential

Multiple URLs with Fallback

agents:
  - name: scrape-primary
    agent: scrape
    inputs:
      url: https://primary.example.com/data
    retry:
      maxAttempts: 2

  - name: scrape-backup
    condition: ${scrape-primary.failed}
    agent: scrape
    inputs:
      url: https://backup.example.com/data

Rate-Limited Scraping

agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    config:
      timeout: 60000
    retry:
      maxAttempts: 3
      backoff: exponential
      initialDelay: 1000
      maxDelay: 10000

Extract Specific Content

agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    config:
      returnFormat: html

  - name: extract
    operation: think
    config:
      provider: openai
      model: gpt-4o-mini
      prompt: |
        Extract product information from this HTML:
        ${scrape.output.html}

        Return JSON with: name, price, description, availability

Scrape + Validate

agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}

  - name: validate
    agent: validate
    inputs:
      data: ${scrape.output}
      schema:
        success: boolean
        url: string
        markdown: string

Output Schema

{
  success: boolean;           // Whether scraping succeeded
  url: string;                // Final URL (after redirects)
  markdown?: string;          // Content as markdown (if returnFormat: markdown)
  html?: string;              // Full HTML content (if returnFormat: html)
  text?: string;              // Plain text extraction (if returnFormat: text)
  title: string;              // Page title
  tier: 1 | 2 | 3;           // Which tier successfully scraped
  duration: number;           // Total scrape time in ms
  botProtectionDetected: boolean;  // Whether bot protection was detected
  contentLength: number;      // Content length in bytes
}
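
Downstream code steps can bind to these fields directly. As an illustration only, a hypothetical script that normalizes whichever content field was produced might look like this (pick-content.ts and its input wiring are not part of the starter kit):

// scripts/pick-content.ts (hypothetical; assumes the step receives the full scrape
// output under input key `scrape`, e.g. `scrape: ${scrape.output}` in the ensemble YAML)
import type { AgentExecutionContext } from '@ensemble-edge/conductor'

export default function pickContent(context: AgentExecutionContext) {
  const { scrape } = context.input

  if (!scrape.success) {
    throw new Error(`Scrape failed for ${scrape.url}`)
  }

  // Only one of markdown/html/text is expected, depending on returnFormat
  const content = scrape.markdown ?? scrape.html ?? scrape.text ?? ''
  return { content, tier: scrape.tier, durationMs: scrape.duration }
}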

Error Handling

The scrape agent handles:
  • Network timeouts: Automatic retry with exponential backoff
  • Bot protection: Automatic fallback to HTML parsing
  • JavaScript rendering: Multi-tier strategy with progressive enhancement
  • Redirect loops: Follows redirects up to the configured limit
  • Invalid URLs: Immediate failure with a descriptive error
Access error details:
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}

  - name: handle-error
    condition: ${scrape.failed}
    operation: code
    config:
      script: scripts/handle-scrape-error
    input:
      error_message: ${scrape.error.message}
      tier_reached: ${scrape.output.tier}
      bot_detected: ${scrape.output.botProtectionDetected}
// scripts/handle-scrape-error.ts
import type { AgentExecutionContext } from '@ensemble-edge/conductor'

export default function handleScrapeError(context: AgentExecutionContext) {
  const { error_message, tier_reached, bot_detected } = context.input

  if (bot_detected) {
    return {
      error: 'Bot protection detected',
      recommendation: 'Consider using a different scraping approach'
    }
  }

  return {
    error: error_message,
    tier: tier_reached,
    retry: true
  }
}

Best Practices

1. Choose the Right Strategy
config:
  strategy: fast      # For simple, static pages
  strategy: balanced  # For most use cases (default)
  strategy: aggressive # For complex, bot-protected sites
2. Use Caching for Static Content
cache:
  ttl: 86400  # 24 hours for slow-changing content
3. Set Appropriate Timeouts
config:
  timeout: 30000  # 30 seconds (default)
  timeout: 60000  # 60 seconds for slower sites
4. Respect robots.txt
config:
  userAgent: "YourBot/1.0 (+https://yoursite.com/bot)"
5. Handle Failures Gracefully
agents:
  - name: scrape
    agent: scrape
  - name: fallback
    condition: ${scrape.failed}
    operation: code
    config:
      script: scripts/return-cached-response
// scripts/return-cached-response.ts
import type { AgentExecutionContext } from '@ensemble-edge/conductor'

export default function returnCachedResponse(context: AgentExecutionContext) {
  return { cached: true, content: 'Fallback content' }
}
6. Rate Limit Proactively
retry:
  maxAttempts: 3
  initialDelay: 1000
  maxDelay: 5000
7. Monitor Scraping Performance
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}

  - name: log-metrics
    operation: code
    config:
      script: scripts/log-scrape-metrics
    input:
      duration: ${scrape.output.duration}
      tier: ${scrape.output.tier}
      bot_detected: ${scrape.output.botProtectionDetected}
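The referenced scripts/log-scrape-metrics is not shown elsewhere in these docs; as an assumption, a minimal version could simply emit a structured log line:
// scripts/log-scrape-metrics.ts (a minimal sketch, not the starter-kit version)
import type { AgentExecutionContext } from '@ensemble-edge/conductor'

export default function logScrapeMetrics(context: AgentExecutionContext) {
  const { duration, tier, bot_detected } = context.input

  // Emit structured JSON so the metrics can be queried from your log pipeline
  console.log(JSON.stringify({
    event: 'scrape.metrics',
    durationMs: duration,
    tier,
    botProtectionDetected: bot_detected
  }))

  return { logged: true }
}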

Common Use Cases

Company Research
agents:
  - name: scrape-homepage
    agent: scrape
    inputs:
      url: https://${input.company_domain}
    config:
      returnFormat: text

  - name: extract-info
    operation: think
    config:
      prompt: Extract company info from: ${scrape-homepage.output.text}
Price Monitoring
agents:
  - name: scrape-product
    agent: scrape
    inputs:
      url: ${input.product_url}
    config:
      returnFormat: html
    cache:
      ttl: 3600

  - name: extract-price
    operation: code
    config:
      script: scripts/extract-price-from-html
    input:
      html: ${scrape-product.output.html}
// scripts/extract-price-from-html.ts
import type { AgentExecutionContext } from '@ensemble-edge/conductor'

export default function extractPriceFromHtml(context: AgentExecutionContext) {
  const { html } = context.input
  const match = html.match(/\$(\d+\.\d{2})/)

  if (!match) {
    throw new Error('Price not found in HTML')
  }

  return { price: parseFloat(match[1]) }
}
Content Aggregation
agents:
  - name: scrape-sources
    agent: scrape
    inputs:
      url: ${input.sources}  # Array of URLs
    config:
      returnFormat: markdown

  - name: combine
    operation: code
    config:
      script: scripts/combine-scraped-content
    input:
      scraped_results: ${scrape-sources.output}
// scripts/combine-scraped-content.ts
import type { AgentExecutionContext } from '@ensemble-edge/conductor'

export default function combineScrapedContent(context: AgentExecutionContext) {
  const { scraped_results } = context.input

  return {
    content: scraped_results.map((r: any) => r.markdown).join('\n\n')
  }
}
SEO Analysis
agents:
  - name: scrape-page
    agent: scrape
    inputs:
      url: ${input.url}
    config:
      returnFormat: html

  - name: analyze-seo
    operation: think
    config:
      provider: openai
      model: gpt-4o-mini
      prompt: |
        Analyze the SEO of this page:
        Title: ${scrape-page.output.title}
        HTML: ${scrape-page.output.html}

        Check for:
        - Meta descriptions
        - Header hierarchy
        - Alt text on images
        - Internal linking
        - Page speed considerations

Performance Characteristics

Tier     Speed    Success Rate   Use Case
Tier 1   ~350ms   70%            Static pages
Tier 2   ~2s      90%            JavaScript-heavy pages
Tier 3   ~1.5s    95%            Bot-protected sites

Limitations

  • CAPTCHA: Cannot bypass (requires manual intervention)
  • Authentication: Basic authentication supported via headers
  • File downloads: Binary content handling depends on returnFormat
  • Rate limiting: Respect target site’s rate limits
  • Legal compliance: Ensure scraping is allowed per site’s terms of service

Next Steps