Overview

The Scrape member extracts content from websites with automatic HTML cleaning, CSS selector support, and markdown conversion. Built for edge performance with sub-100ms response times and intelligent caching, it is well suited for content extraction, competitor monitoring, and data-gathering workflows.

Basic Usage

name: scrape-article
description: Extract article content from website

flow:
  - member: scrape-content
    type: Scrape
    input:
      url: "https://example.com/article"

output:
  content: ${scrape-content.output.content}
  title: ${scrape-content.output.title}

Configuration

Input Parameters

input:
  url: string              # Required: URL to scrape
  selector: string         # Optional: CSS selector for content
  cleanHtml: boolean       # Optional: Remove scripts/styles (default: true)
  convertToMarkdown: boolean  # Optional: Convert to markdown (default: false)
  timeout: number          # Optional: Request timeout in ms (default: 30000)
  headers: object          # Optional: Custom HTTP headers
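
Putting these together, a single member can combine any of the parameters above. The sketch below is illustrative; the URL, selector, and header values are placeholders:

- member: scrape-example
  type: Scrape
  input:
    url: "https://example.com/docs"
    selector: "main"
    cleanHtml: true
    convertToMarkdown: true
    timeout: 10000
    headers:
      User-Agent: "MyBot/1.0"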

Output Format

output:
  content: string          # Extracted content
  title: string           # Page title
  html: string            # Raw HTML (if requested)
  markdown: string        # Markdown (if requested)
  url: string             # Final URL (after redirects)
  status: number          # HTTP status code
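
Downstream members can reference any of these fields. As a sketch (member names are illustrative, and standard comparison operators are assumed in condition expressions), the HTTP status can gate further processing:

- member: scrape
  type: Scrape
  input:
    url: ${input.url}

# Only process pages that returned HTTP 200
- member: process
  condition: ${scrape.output.status == 200}
  type: Transform
  input:
    data: ${scrape.output.content}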

Examples

Basic Scraping

- member: scrape-page
  type: Scrape
  input:
    url: "https://example.com"

With CSS Selector

- member: scrape-article
  type: Scrape
  input:
    url: "https://blog.example.com/post"
    selector: "article.post-content"
    cleanHtml: true

Convert to Markdown

- member: scrape-to-markdown
  type: Scrape
  input:
    url: "https://docs.example.com"
    selector: "main"
    convertToMarkdown: true

output:
  markdown: ${scrape-to-markdown.output.markdown}

With Custom Headers

- member: scrape-with-auth
  type: Scrape
  input:
    url: "https://api.example.com/page"
    headers:
      Authorization: "Bearer ${env.API_TOKEN}"
      User-Agent: "MyBot/1.0"

Multiple Pages in Parallel

flow:
  - parallel:
      - member: scrape-page-1
        type: Scrape
        input:
          url: "https://example.com/page1"

      - member: scrape-page-2
        type: Scrape
        input:
          url: "https://example.com/page2"

      - member: scrape-page-3
        type: Scrape
        input:
          url: "https://example.com/page3"

  - member: combine-content
    type: Transform
    input:
      data:
        - ${scrape-page-1.output}
        - ${scrape-page-2.output}
        - ${scrape-page-3.output}

Common Patterns

Content Extraction Pipeline

name: extract-and-analyze
description: Scrape, clean, and analyze content

flow:
  # Scrape the page
  - member: scrape
    type: Scrape
    input:
      url: ${input.url}
      selector: "article"
      convertToMarkdown: true

  # Analyze with AI
  - member: analyze
    type: Think
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
    input:
      prompt: |
        Analyze this article and extract key insights:
        ${scrape.output.markdown}

output:
  content: ${scrape.output.markdown}
  analysis: ${analyze.output.text}

Competitor Monitoring

name: monitor-competitor
description: Track competitor content changes

flow:
  # Scrape competitor page
  - member: scrape-competitor
    type: Scrape
    input:
      url: "https://competitor.com/pricing"
      selector: ".pricing-table"
    cache:
      ttl: 3600  # Cache for 1 hour

  # Get previous version from KV
  - member: get-previous
    type: Data
    config:
      storage: kv
      operation: get
      binding: CACHE
    input:
      key: "competitor-pricing"

  # Compare versions
  - member: detect-changes
    type: Function
    input:
      current: ${scrape-competitor.output.content}
      previous: ${get-previous.output.value}

  # Alert if changed
  - member: send-alert
    condition: ${detect-changes.output.changed}
    type: API
    config:
      url: "${env.SLACK_WEBHOOK}"
      method: POST
    input:
      body:
        text: "Competitor pricing changed!"
        changes: ${detect-changes.output.diff}

  # Save current version
  - member: save-current
    type: Data
    config:
      storage: kv
      operation: put
      binding: CACHE
    input:
      key: "competitor-pricing"
      value: ${scrape-competitor.output.content}

output:
  changed: ${detect-changes.output.changed}

News Aggregation

name: aggregate-news
description: Scrape and aggregate news from multiple sources

flow:
  # Scrape multiple news sites
  - parallel:
      - member: scrape-techcrunch
        type: Scrape
        input:
          url: "https://techcrunch.com"
          selector: ".post-block"

      - member: scrape-verge
        type: Scrape
        input:
          url: "https://theverge.com"
          selector: ".c-entry-box"

      - member: scrape-hn
        type: Scrape
        input:
          url: "https://news.ycombinator.com"
          selector: ".itemlist"

  # Combine and deduplicate
  - member: combine-articles
    type: Transform
    input:
      data:
        - ${scrape-techcrunch.output}
        - ${scrape-verge.output}
        - ${scrape-hn.output}
      expression: |
        $distinct($.content[])

  # Summarize with AI
  - member: summarize
    type: Think
    config:
      provider: openai
      model: gpt-4o-mini
    input:
      prompt: |
        Summarize the top tech news from today:
        ${combine-articles.output}

output:
  articles: ${combine-articles.output}
  summary: ${summarize.output.text}

Documentation Sync

name: sync-docs
description: Scrape external docs and sync to knowledge base

flow:
  # Scrape documentation pages
  - member: scrape-docs
    foreach: ${input.docUrls}
    type: Scrape
    input:
      url: ${item}
      selector: "main.docs"
      convertToMarkdown: true

  # Generate embeddings
  - member: generate-embeddings
    foreach: ${scrape-docs.output}
    type: Think
    config:
      provider: openai
      model: text-embedding-3-small
    input:
      text: ${item.markdown}

  # Store in vector database
  - member: store-vectors
    foreach: ${generate-embeddings.output}
    type: RAG
    config:
      operation: insert
      vectorizeBinding: "VECTORIZE"
      indexName: "docs"
    input:
      text: ${item.text}
      embedding: ${item.embedding}
      metadata:
        url: ${item.url}
        title: ${item.title}

output:
  synced: ${scrape-docs.output.length}

Caching

Cache Scraped Content

- member: scrape
  type: Scrape
  cache:
    ttl: 3600  # Cache for 1 hour
  input:
    url: "https://example.com"

Conditional Caching

- member: scrape
  type: Scrape
  cache:
    ttl: ${input.cacheDuration || 0}
  input:
    url: ${input.url}

Cache Key Customization

- member: scrape
  type: Scrape
  cache:
    ttl: 3600
    key: "scrape:${input.url}:${input.selector}"
  input:
    url: ${input.url}
    selector: ${input.selector}

Error Handling

Retry on Failure

- member: scrape
  type: Scrape
  retry:
    maxAttempts: 3
    backoff: exponential
  input:
    url: "https://unreliable-site.com"

Fallback URL

flow:
  - member: scrape-primary
    type: Scrape
    continue_on_error: true
    input:
      url: "https://primary.com"

  - member: scrape-fallback
    condition: ${!scrape-primary.success}
    type: Scrape
    input:
      url: "https://fallback.com"

output:
  content: ${scrape-primary.success ? scrape-primary.output.content : scrape-fallback.output.content}

Handle Missing Selectors

- member: scrape
  type: Scrape
  input:
    url: ${input.url}
    selector: ${input.selector || "body"}  # Fallback to body

Performance Optimization

Parallel Scraping

parallel:
  - member: scrape-1
    type: Scrape
  - member: scrape-2
    type: Scrape
  - member: scrape-3
    type: Scrape

Set Appropriate Timeouts

- member: scrape-fast
  type: Scrape
  input:
    timeout: 5000  # 5 seconds for fast sites

- member: scrape-slow
  type: Scrape
  input:
    timeout: 60000  # 60 seconds for slow sites

Minimize Content

- member: scrape
  type: Scrape
  input:
    selector: "article"  # Only scrape what you need
    cleanHtml: true      # Remove unnecessary HTML

Rate Limiting

Basic Rate Limiting

flow:
  - member: check-rate-limit
    type: Data
    config:
      storage: kv
      operation: get
      binding: CACHE
    input:
      key: "rate-limit:scrape"

  - member: scrape
    condition: ${(check-rate-limit.output.value || 0) < 100}
    type: Scrape

  - member: increment-counter
    condition: ${scrape.success}
    type: Data
    config:
      storage: kv
      operation: put
      binding: CACHE
    input:
      key: "rate-limit:scrape"
      value: ${(check-rate-limit.output.value || 0) + 1}
      expirationTtl: 3600

Delay Between Requests

flow:
  - member: scrape-page-1
    type: Scrape

  - member: wait
    type: Schedule
    config:
      delay: 1000  # Wait 1 second

  - member: scrape-page-2
    type: Scrape

Testing

import { describe, it, expect } from 'vitest';
import { TestConductor } from '@ensemble-edge/conductor/testing';

describe('scrape member', () => {
  it('should scrape website', async () => {
    const conductor = await TestConductor.create({
      mocks: {
        http: {
          responses: {
            'https://example.com': {
              status: 200,
              body: '<html><head><title>Test</title></head><body><h1>Hello</h1></body></html>'
            }
          }
        }
      }
    });

    const result = await conductor.executeMember('scrape', {
      url: 'https://example.com'
    });

    expect(result).toBeSuccessful();
    expect(result.output.title).toBe('Test');
    expect(result.output.content).toContain('Hello');
  });

  it('should use CSS selector', async () => {
    const conductor = await TestConductor.create({
      mocks: {
        http: {
          responses: {
            'https://example.com': {
              body: '<html><body><article>Article content</article><aside>Sidebar</aside></body></html>'
            }
          }
        }
      }
    });

    const result = await conductor.executeMember('scrape', {
      url: 'https://example.com',
      selector: 'article'
    });

    expect(result.output.content).toContain('Article content');
    expect(result.output.content).not.toContain('Sidebar');
  });

  it('should convert to markdown', async () => {
    const conductor = await TestConductor.create({
      mocks: {
        http: {
          responses: {
            'https://example.com': {
              body: '<html><body><h1>Title</h1><p>Paragraph</p></body></html>'
            }
          }
        }
      }
    });

    const result = await conductor.executeMember('scrape', {
      url: 'https://example.com',
      convertToMarkdown: true
    });

    expect(result.output.markdown).toContain('# Title');
    expect(result.output.markdown).toContain('Paragraph');
  });
});

Best Practices

  1. Use CSS selectors - Extract only what you need
  2. Enable caching - Reduce redundant requests
  3. Set timeouts - Don’t wait forever
  4. Handle errors - Use retry and fallback
  5. Rate limit - Respect target servers
  6. Clean HTML - Remove unnecessary content
  7. Parallel scraping - Scrape multiple pages concurrently
  8. Test thoroughly - Mock HTTP responses
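
Putting several of these practices together in one member, a sketch (the URL and selector are placeholders; option names follow the examples above):

- member: scrape-tuned
  type: Scrape
  cache:
    ttl: 3600              # cache results for an hour
  retry:
    maxAttempts: 3         # retry transient failures
    backoff: exponential
  input:
    url: "https://example.com/article"
    selector: "article"    # extract only what you need
    cleanHtml: true        # strip scripts and styles
    timeout: 10000         # fail fast instead of waiting forever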

Limitations

  • JavaScript rendering: Does not execute JavaScript (use Puppeteer/Playwright for SPAs)
  • Dynamic content: Cannot interact with pages that require user actions to load content
  • File downloads: Cannot download binary files
  • Authentication: Header-based auth only (no OAuth flows)
  • Rate limiting: Not built in; respect robots.txt and target-site limits (see Rate Limiting above)