Overview

The scrape built-in member fetches and extracts web page content using a three-tier fallback strategy: fast (Cloudflare Browser Rendering), slow (Browserless), and direct HTML parsing. A minimal configuration:
- member: scrape-page
  type: Function
  config:
    builtin: scrape
    url: https://example.com
    format: markdown

Configuration

url (string, required)
  The URL to scrape.

format (string, default: "markdown")
  Output format: markdown, html, or text.

strategy (string, default: "auto")
  Scraping strategy: auto, fast, slow, or html.

selector (string, optional)
  CSS selector for content extraction.

waitFor (string, optional)
  CSS selector to wait for before scraping.

timeout (number, default: 30000)
  Timeout in milliseconds.
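
Putting the options together, a fully specified member might look like this (a sketch; the member name and values are illustrative):

- member: scrape-docs
  type: Function
  config:
    builtin: scrape
    url: https://example.com/docs
    format: markdown       # markdown | html | text
    strategy: auto         # auto | fast | slow | html
    selector: 'main'       # extract only the main element
    waitFor: '#app-ready'  # wait for this selector before scraping
    timeout: 45000         # give up after 45 seconds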

Strategies

Auto (Default)

The auto strategy tries each tier in order until one succeeds:
  1. Fast - Cloudflare Browser Rendering (fastest)
  2. Slow - Browserless with full browser (slower but reliable)
  3. HTML - Direct HTML parsing (fallback)
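
Because auto can fall back, the strategy field on the result (see Output below) reports which tier actually succeeded. A minimal sketch, with an illustrative member name:

- member: auto-scrape
  type: Function
  config:
    builtin: scrape
    url: ${input.url}
    strategy: auto  # optional; auto is the default

Downstream steps can read ${auto-scrape.output.strategy} if, for example, html results need extra cleanup.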

Fast

Uses the Cloudflare Browser Rendering API:
- member: fast-scrape
  type: Function
  config:
    builtin: scrape
    url: ${input.url}
    strategy: fast

Slow

Uses Browserless for JavaScript-heavy sites:
- member: slow-scrape
  type: Function
  config:
    builtin: scrape
    url: ${input.url}
    strategy: slow
    waitFor: '.content-loaded'

HTML

Direct HTML parsing (no JavaScript):
- member: html-scrape
  type: Function
  config:
    builtin: scrape
    url: ${input.url}
    strategy: html
    selector: 'article.content'

Output Formats

Markdown

Clean markdown output:
- member: to-markdown
  type: Function
  config:
    builtin: scrape
    url: ${input.url}
    format: markdown
Example output:
# Page Title

Content paragraph...

## Section Heading

More content...

HTML

Preserves the HTML structure:
- member: to-html
  type: Function
  config:
    builtin: scrape
    url: ${input.url}
    format: html

Text

Plain text only:
- member: to-text
  type: Function
  config:
    builtin: scrape
    url: ${input.url}
    format: text

Examples

Basic Scraping

flow:
  - member: scrape-article
    type: Function
    config:
      builtin: scrape
      url: https://blog.example.com/article-123
      format: markdown

output:
  content: ${scrape-article.output.content}
  title: ${scrape-article.output.title}

With Selector

- member: scrape-content
  type: Function
  config:
    builtin: scrape
    url: ${input.url}
    selector: 'main.article-content'
    format: markdown

Wait for Dynamic Content

- member: scrape-spa
  type: Function
  config:
    builtin: scrape
    url: ${input.url}
    strategy: slow
    waitFor: '[data-content-loaded="true"]'
    timeout: 60000

Batch Scraping

flow:
  - foreach: ${input.urls}
    as: url
    do:
      - member: scrape-page
        type: Function
        config:
          builtin: scrape
          url: ${url}
          format: markdown
    cache:
      enabled: true
      ttl: 3600000  # Cache 1 hour
      key: ${url}

Output

interface ScrapeOutput {
  content: string;                      // Scraped content in the requested format
  title?: string;                       // Page title, when one can be extracted
  strategy: 'fast' | 'slow' | 'html';   // Strategy that ultimately succeeded
  duration: number;                     // Scrape duration in milliseconds
  cached?: boolean;                     // True when served from cache
}
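
In a flow, these fields are read from the member's output. A sketch reusing the scrape-article member from Basic Scraping above (durationMs is just an illustrative name for the flow's own output key):

output:
  content: ${scrape-article.output.content}
  title: ${scrape-article.output.title}
  strategy: ${scrape-article.output.strategy}    # which tier succeeded
  durationMs: ${scrape-article.output.duration}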

Error Handling

- member: safe-scrape
  type: Function
  config:
    builtin: scrape
    url: ${input.url}
  retry:
    maxAttempts: 3
    backoff: exponential
Error codes:
  - SCRAPE_TIMEOUT - Exceeded the configured timeout
  - SCRAPE_FAILED - All strategies failed
  - INVALID_URL - Malformed URL
  - NETWORK_ERROR - Connection failed
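
As a mitigation sketch using only the options shown above: a generous timeout reduces SCRAPE_TIMEOUT failures on slow sites, while exponential-backoff retries absorb transient NETWORK_ERROR failures (the member name is illustrative):

- member: resilient-scrape
  type: Function
  config:
    builtin: scrape
    url: ${input.url}
    timeout: 60000          # avoid SCRAPE_TIMEOUT on slow pages
  retry:
    maxAttempts: 3
    backoff: exponential    # smooths over transient NETWORK_ERROR failures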

Best Practices

  1. Use the auto strategy - Let the scraper pick the best method
  2. Cache results - Avoid re-scraping the same pages (see Batch Scraping above)
  3. Set reasonable timeouts - Prevent flows from hanging on slow sites
  4. Handle errors - Sites may be down; configure retries
  5. Respect robots.txt - Be a good citizen
  6. Rate limit - Don't overwhelm target servers
  7. Use selectors - Extract only the content you need
  8. Test with various sites - Page structures differ widely