Starter Kit: ships with your template. You own it; modify it freely.

Basic Usage

agents:
  - name: scrape-page
    agent: scrape
    inputs:
      url: https://example.com/data
Output:
{
  "success": true,
  "url": "https://example.com/data",
  "markdown": "# Example Domain\n\nThis domain is for...",
  "html": "<html>...</html>",
  "text": "extracted text content",
  "title": "Example Domain",
  "tier": 1,
  "duration": 350,
  "botProtectionDetected": false,
  "contentLength": 1256
}

Inputs

inputs:
  url:
    type: string
    format: uri
    required: true
    description: URL to scrape

Configuration

config:
  strategy:
    type: string
    enum: [fast, balanced, aggressive]
    default: balanced
    description: |
      fast: Only Tier 1, return fallback on failure
      balanced: Tier 1 + Tier 2, then fallback
      aggressive: All 3 tiers

  returnFormat:
    type: string
    enum: [markdown, html, text]
    default: markdown
    description: Format for the returned content

  blockResources:
    type: boolean
    default: true
    description: Block images, fonts, etc. for faster scraping

  userAgent:
    type: string
    description: Custom user agent string

  timeout:
    type: integer
    default: 30000
    description: Request timeout in milliseconds

Scraping Strategies

The scrape agent uses a 3-tier approach with automatic fallback:

Tier 1: Fast Browser Rendering (~350ms)

  • Uses domcontentloaded wait strategy
  • Fastest scraping method
  • Best for simple static pages

Tier 2: Slow Browser Rendering (~2s)

  • Uses networkidle2 wait strategy
  • Waits for network activity to settle
  • Good for JavaScript-heavy pages

Tier 3: HTML Parsing Fallback (~1.5s)

  • Pure HTML parsing without browser
  • Used when bot protection is detected
  • Most reliable but least feature-rich
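
Conceptually, the fallback order looks like the sketch below. This is an illustration only, assuming a Puppeteer-style rendering helper (renderWithBrowser is hypothetical); it is not the agent's source code.

// A minimal sketch of the 3-tier fallback order described above.
// Only the wait strategies and the tier ordering come from this documentation.

type WaitStrategy = 'domcontentloaded' | 'networkidle2'

// Assumed helper wrapping a Puppeteer-style page.goto(); not part of the agent's API
declare function renderWithBrowser(
  url: string,
  options: { waitUntil: WaitStrategy; timeout: number }
): Promise<string>

async function scrapeWithFallback(url: string, timeout = 30000) {
  // Tier 1: fast render, resolve as soon as the DOM is parsed (~350ms typical)
  try {
    return { tier: 1 as const, html: await renderWithBrowser(url, { waitUntil: 'domcontentloaded', timeout }) }
  } catch { /* fall through to Tier 2 */ }

  // Tier 2: slower render, wait for network activity to settle (~2s typical)
  try {
    return { tier: 2 as const, html: await renderWithBrowser(url, { waitUntil: 'networkidle2', timeout }) }
  } catch { /* fall through to Tier 3 */ }

  // Tier 3: plain HTML fetch with no browser, used when rendering is blocked
  const response = await fetch(url, { signal: AbortSignal.timeout(timeout) })
  return { tier: 3 as const, html: await response.text() }
}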

Strategy Options

Fast Strategy
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    config:
      strategy: fast
      returnFormat: text
Balanced Strategy (Default)
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    config:
      strategy: balanced
      returnFormat: markdown
Aggressive Strategy
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    config:
      strategy: aggressive
      returnFormat: html

Return Formats

Markdown Format (Default)

agents:
  - name: scrape
    agent: scrape
    inputs:
      url: https://example.com
    config:
      returnFormat: markdown
Output includes clean markdown conversion of the page content.

HTML Format

agents:
  - name: scrape
    agent: scrape
    inputs:
      url: https://example.com
    config:
      returnFormat: html
Output includes raw HTML for further processing.

Text Format

agents:
  - name: scrape
    agent: scrape
    inputs:
      url: https://example.com
    config:
      returnFormat: text
Output includes plain text extraction (no tags).

Advanced Patterns

With Caching

agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    cache:
      ttl: 3600  # 1 hour
      key: scrape-${input.url}

With Retries

agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    retry:
      maxAttempts: 5
      backoff: exponential

Multiple URLs with Fallback

agents:
  - name: scrape-primary
    agent: scrape
    inputs:
      url: https://primary.example.com/data
    retry:
      maxAttempts: 2

  - name: scrape-backup
    condition: ${scrape-primary.failed}
    agent: scrape
    inputs:
      url: https://backup.example.com/data

Rate-Limited Scraping

agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    config:
      timeout: 60000
    retry:
      maxAttempts: 3
      backoff: exponential
      initialDelay: 1000
      maxDelay: 10000

Extract Specific Content

agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    config:
      returnFormat: html

  - name: extract
    operation: think
    config:
      provider: openai
      model: gpt-4o-mini
      prompt: |
        Extract product information from this HTML:
        ${scrape.output.html}

        Return JSON with: name, price, description, availability

Scrape + Validate

agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}

  - name: validate
    agent: validate
    inputs:
      data: ${scrape.output}
      schema:
        success: boolean
        url: string
        markdown: string

Output Schema

{
  success: boolean;           // Whether scraping succeeded
  url: string;                // Final URL (after redirects)
  markdown?: string;          // Content as markdown (if returnFormat: markdown)
  html?: string;              // Full HTML content (if returnFormat: html)
  text?: string;              // Plain text extraction (if returnFormat: text)
  title: string;              // Page title
  tier: 1 | 2 | 3;           // Which tier successfully scraped
  duration: number;           // Total scrape time in ms
  botProtectionDetected: boolean;  // Whether bot protection was detected
  contentLength: number;      // Content length in bytes
}
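
Downstream code steps can bind to these fields directly. As an illustration only, a hypothetical script that normalizes whichever content field was produced might look like this (pick-content.ts and its input wiring are not part of the starter kit):

// scripts/pick-content.ts (hypothetical; assumes the step receives the full scrape
// output under input key `scrape`, e.g. `scrape: ${scrape.output}` in the ensemble YAML)
import type { AgentExecutionContext } from '@ensemble-edge/conductor'

export default function pickContent(context: AgentExecutionContext) {
  const { scrape } = context.input

  if (!scrape.success) {
    throw new Error(`Scrape failed for ${scrape.url}`)
  }

  // Only one of markdown/html/text is expected, depending on returnFormat
  const content = scrape.markdown ?? scrape.html ?? scrape.text ?? ''
  return { content, tier: scrape.tier, durationMs: scrape.duration }
}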

Error Handling

The scrape agent handles:
  • Network timeouts: Automatic retry with exponential backoff
  • Bot protection: Automatic fallback to HTML parsing
  • JavaScript rendering: Multi-tier strategy with progressive enhancement
  • Redirect loops: Follows redirects up to the configured limit
  • Invalid URLs: Immediate failure with a descriptive error
Access error details:
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}

  - name: handle-error
    condition: ${scrape.failed}
    operation: code
    config:
      script: scripts/handle-scrape-error
    input:
      error_message: ${scrape.error.message}
      tier_reached: ${scrape.output.tier}
      bot_detected: ${scrape.output.botProtectionDetected}
// scripts/handle-scrape-error.ts
import type { AgentExecutionContext } from '@ensemble-edge/conductor'

export default function handleScrapeError(context: AgentExecutionContext) {
  const { error_message, tier_reached, bot_detected } = context.input

  if (bot_detected) {
    return {
      error: 'Bot protection detected',
      recommendation: 'Consider using a different scraping approach'
    }
  }

  return {
    error: error_message,
    tier: tier_reached,
    retry: true
  }
}

Best Practices

1. Choose the Right Strategy
config:
  strategy: fast      # For simple, static pages
  strategy: balanced  # For most use cases (default)
  strategy: aggressive # For complex, bot-protected sites
2. Use Caching for Static Content
cache:
  ttl: 86400  # 24 hours for slow-changing content
3. Set Appropriate Timeouts
config:
  timeout: 30000  # 30 seconds (default)
  timeout: 60000  # 60 seconds for slower sites
4. Respect robots.txt
config:
  userAgent: "YourBot/1.0 (+https://yoursite.com/bot)"
5. Handle Failures Gracefully
agents:
  - name: scrape
    agent: scrape
  - name: fallback
    condition: ${scrape.failed}
    operation: code
    config:
      script: scripts/return-cached-response
// scripts/return-cached-response.ts
import type { AgentExecutionContext } from '@ensemble-edge/conductor'

export default function returnCachedResponse(context: AgentExecutionContext) {
  return { cached: true, content: 'Fallback content' }
}
6. Rate Limit Proactively
retry:
  maxAttempts: 3
  initialDelay: 1000
  maxDelay: 5000
7. Monitor Scraping Performance
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}

  - name: log-metrics
    operation: code
    config:
      script: scripts/log-scrape-metrics
    input:
      duration: ${scrape.output.duration}
      tier: ${scrape.output.tier}
      bot_detected: ${scrape.output.botProtectionDetected}
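The referenced scripts/log-scrape-metrics is not shown elsewhere in these docs; as an assumption, a minimal version could simply emit a structured log line:
// scripts/log-scrape-metrics.ts (a minimal sketch, not the starter-kit version)
import type { AgentExecutionContext } from '@ensemble-edge/conductor'

export default function logScrapeMetrics(context: AgentExecutionContext) {
  const { duration, tier, bot_detected } = context.input

  // Emit structured JSON so the metrics can be queried from your log pipeline
  console.log(JSON.stringify({
    event: 'scrape.metrics',
    durationMs: duration,
    tier,
    botProtectionDetected: bot_detected
  }))

  return { logged: true }
}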

Common Use Cases

Company Research
agents:
  - name: scrape-homepage
    agent: scrape
    inputs:
      url: https://${input.company_domain}
    config:
      returnFormat: text

  - name: extract-info
    operation: think
    config:
      prompt: Extract company info from: ${scrape-homepage.output.text}
Price Monitoring
agents:
  - name: scrape-product
    agent: scrape
    inputs:
      url: ${input.product_url}
    config:
      returnFormat: html
    cache:
      ttl: 3600

  - name: extract-price
    operation: code
    config:
      script: scripts/extract-price-from-html
    input:
      html: ${scrape-product.output.html}
// scripts/extract-price-from-html.ts
import type { AgentExecutionContext } from '@ensemble-edge/conductor'

export default function extractPriceFromHtml(context: AgentExecutionContext) {
  const { html } = context.input
  const match = html.match(/\$(\d+\.\d{2})/)

  if (!match) {
    throw new Error('Price not found in HTML')
  }

  return { price: parseFloat(match[1]) }
}
Content Aggregation
agents:
  - name: scrape-sources
    agent: scrape
    inputs:
      url: ${input.sources}  # Array of URLs
    config:
      returnFormat: markdown

  - name: combine
    operation: code
    config:
      script: scripts/combine-scraped-content
    input:
      scraped_results: ${scrape-sources.output}
// scripts/combine-scraped-content.ts
import type { AgentExecutionContext } from '@ensemble-edge/conductor'

export default function combineScrapedContent(context: AgentExecutionContext) {
  const { scraped_results } = context.input

  return {
    content: scraped_results.map((r: any) => r.markdown).join('\n\n')
  }
}
SEO Analysis
agents:
  - name: scrape-page
    agent: scrape
    inputs:
      url: ${input.url}
    config:
      returnFormat: html

  - name: analyze-seo
    operation: think
    config:
      provider: openai
      model: gpt-4o-mini
      prompt: |
        Analyze the SEO of this page:
        Title: ${scrape-page.output.title}
        HTML: ${scrape-page.output.html}

        Check for:
        - Meta descriptions
        - Header hierarchy
        - Alt text on images
        - Internal linking
        - Page speed considerations

Performance Characteristics

Tier     Speed    Success Rate   Use Case
Tier 1   ~350ms   70%            Static pages
Tier 2   ~2s      90%            JavaScript-heavy pages
Tier 3   ~1.5s    95%            Bot-protected sites

Limitations

  • CAPTCHA: Cannot bypass (requires manual intervention)
  • Authentication: Basic authentication supported via headers
  • File downloads: Binary content handling depends on returnFormat
  • Rate limiting: Respect target site’s rate limits
  • Legal compliance: Ensure scraping is allowed per site’s terms of service

Next Steps