> **Starter Kit** - Ships with your template. You own it - modify freely.

## Basic Usage

```yaml
agents:
  - name: scrape-page
    agent: scrape
    inputs:
      url: https://example.com/data
```

Output:

```json
{
  "success": true,
  "url": "https://example.com/data",
  "markdown": "# Example Domain\n\nThis domain is for...",
  "html": "<html>...</html>",
  "text": "extracted text content",
  "title": "Example Domain",
  "tier": 1,
  "duration": 350,
  "botProtectionDetected": false,
  "contentLength": 1256
}
```

## Inputs

```yaml
inputs:
  url:
    type: string
    format: uri
    required: true
    description: URL to scrape
```
## Configuration

```yaml
config:
  strategy:
    type: string
    enum: [fast, balanced, aggressive]
    default: balanced
    description: |
      fast: Only Tier 1, return fallback on failure
      balanced: Tier 1 + Tier 2, then fallback
      aggressive: All 3 tiers
  returnFormat:
    type: string
    enum: [markdown, html, text]
    default: markdown
    description: Format for the returned content
  blockResources:
    type: boolean
    default: true
    description: Block images, fonts, etc. for faster scraping
  userAgent:
    type: string
    description: Custom user agent string
  timeout:
    type: integer
    default: 30000
    description: Request timeout in milliseconds
```
## Scraping Strategies

The scrape agent uses a 3-tier approach with automatic fallback:

### Tier 1: Fast Browser Rendering (~350ms)

- Uses the `domcontentloaded` wait strategy
- Fastest scraping method
- Best for simple static pages

### Tier 2: Slow Browser Rendering (~2s)

- Uses the `networkidle2` wait strategy
- Waits for network activity to settle
- Good for JavaScript-heavy pages

### Tier 3: HTML Parsing Fallback (~1.5s)

- Pure HTML parsing without a browser
- Used when bot protection is detected
- Most reliable but least feature-rich
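The fallback flow described above can be sketched in TypeScript. This is an illustrative sketch, not the agent's actual implementation; the tier functions are hypothetical stand-ins, and the strategy-to-tier-count mapping follows the `strategy` config described earlier.

```typescript
// Illustrative sketch of the 3-tier fallback (not Conductor's real code).
// Each tier function resolves with content or throws; a throw falls
// through to the next tier, up to the limit the strategy allows.

type TierFn = () => Promise<string>

interface TierResult {
  content: string
  tier: 1 | 2 | 3
}

async function scrapeWithFallback(
  tiers: TierFn[],   // e.g. [fastRender, slowRender, htmlParse]
  maxTiers: number   // fast = 1, balanced = 2, aggressive = 3
): Promise<TierResult> {
  let lastError: unknown
  for (let i = 0; i < Math.min(maxTiers, tiers.length); i++) {
    try {
      const content = await tiers[i]()
      return { content, tier: (i + 1) as 1 | 2 | 3 }
    } catch (err) {
      lastError = err // try the next tier
    }
  }
  throw lastError
}
```

Under this sketch, `strategy: balanced` would attempt only the first two tiers before giving up, which matches the strategy descriptions above.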
## Strategy Options

### Fast Strategy

```yaml
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    config:
      strategy: fast
      returnFormat: text
```

### Balanced Strategy (Default)

```yaml
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    config:
      strategy: balanced
      returnFormat: markdown
```

### Aggressive Strategy

```yaml
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    config:
      strategy: aggressive
      returnFormat: html
```
## Return Formats

### Markdown Format

```yaml
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: https://example.com
    config:
      returnFormat: markdown
```

Output includes a clean markdown conversion of the page content.

### HTML Format

```yaml
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: https://example.com
    config:
      returnFormat: html
```

Output includes raw HTML for further processing.

### Text Format

```yaml
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: https://example.com
    config:
      returnFormat: text
```

Output includes plain text extraction (no tags).
## Advanced Patterns

### With Caching

```yaml
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    cache:
      ttl: 3600  # 1 hour
      key: scrape-${input.url}
```

### With Retries

```yaml
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    retry:
      maxAttempts: 5
      backoff: exponential
```
### Multiple URLs with Fallback

```yaml
agents:
  - name: scrape-primary
    agent: scrape
    inputs:
      url: https://primary.example.com/data
    retry:
      maxAttempts: 2

  - name: scrape-backup
    condition: ${scrape-primary.failed}
    agent: scrape
    inputs:
      url: https://backup.example.com/data
```
### Rate-Limited Scraping

```yaml
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    config:
      timeout: 60000
    retry:
      maxAttempts: 3
      backoff: exponential
      initialDelay: 1000
      maxDelay: 10000
```
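Assuming `backoff: exponential` follows the conventional definition (the delay doubles each attempt and is capped at `maxDelay`), the retry schedule for the config above can be computed like this. The formula is an assumption about how the retries are scheduled, not Conductor's documented algorithm.

```typescript
// Conventional exponential backoff: delay_n = min(initialDelay * 2^(n-1), maxDelay).
// An assumption about the scheduling, mirroring the retry config above.
function backoffDelay(attempt: number, initialDelay: number, maxDelay: number): number {
  return Math.min(initialDelay * 2 ** (attempt - 1), maxDelay)
}

// With initialDelay: 1000 and maxDelay: 10000, successive attempts would
// wait 1000ms, 2000ms, 4000ms, 8000ms, then cap at 10000ms.
```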
### Scrape + Extract

```yaml
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}
    config:
      returnFormat: html

  - name: extract
    operation: think
    config:
      provider: openai
      model: gpt-4o-mini
      prompt: |
        Extract product information from this HTML:
        ${scrape.output.html}

        Return JSON with: name, price, description, availability
```
### Scrape + Validate

```yaml
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}

  - name: validate
    agent: validate
    inputs:
      data: ${scrape.output}
      schema:
        success: boolean
        url: string
        markdown: string
```
## Output Schema

```typescript
{
  success: boolean;               // Whether scraping succeeded
  url: string;                    // Final URL (after redirects)
  markdown?: string;              // Content as markdown (if returnFormat: markdown)
  html?: string;                  // Full HTML content (if returnFormat: html)
  text?: string;                  // Plain text extraction (if returnFormat: text)
  title: string;                  // Page title
  tier: 1 | 2 | 3;                // Which tier successfully scraped
  duration: number;               // Total scrape time in ms
  botProtectionDetected: boolean; // Whether bot protection was detected
  contentLength: number;          // Content length in bytes
}
```
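Because only one of `markdown`, `html`, or `text` is present (depending on `returnFormat`), downstream code scripts often want whichever content field is set. A minimal accessor sketch; the `ScrapeOutput` interface transcribes the schema above, while `getContent` is a hypothetical helper, not part of the starter kit.

```typescript
// ScrapeOutput transcribes the output schema documented above.
interface ScrapeOutput {
  success: boolean
  url: string
  markdown?: string
  html?: string
  text?: string
  title: string
  tier: 1 | 2 | 3
  duration: number
  botProtectionDetected: boolean
  contentLength: number
}

// Hypothetical helper: returns whichever content field the scrape produced.
function getContent(output: ScrapeOutput): string {
  const content = output.markdown ?? output.html ?? output.text
  if (content === undefined) {
    throw new Error(`No content in scrape output for ${output.url}`)
  }
  return content
}
```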
## Error Handling

The scrape agent handles:

- **Network timeouts**: automatic retry with exponential backoff
- **Bot protection**: automatic fallback to HTML parsing
- **JavaScript rendering**: multi-tier strategy with progressive enhancement
- **Redirect loops**: follows redirects up to the configured limit
- **Invalid URLs**: immediate failure with a descriptive error

Access error details:

```yaml
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}

  - name: handle-error
    condition: ${scrape.failed}
    operation: code
    config:
      script: scripts/handle-scrape-error
    input:
      error_message: ${scrape.error.message}
      tier_reached: ${scrape.output.tier}
      bot_detected: ${scrape.output.botProtectionDetected}
```
```typescript
// scripts/handle-scrape-error.ts
import type { AgentExecutionContext } from '@ensemble-edge/conductor'

export default function handleScrapeError(context: AgentExecutionContext) {
  const { error_message, tier_reached, bot_detected } = context.input

  if (bot_detected) {
    return {
      error: 'Bot protection detected',
      recommendation: 'Consider using a different scraping approach'
    }
  }

  return {
    error: error_message,
    tier: tier_reached,
    retry: true
  }
}
```
## Best Practices

### 1. Choose the Right Strategy

```yaml
config:
  strategy: fast        # For simple, static pages
  strategy: balanced    # For most use cases (default)
  strategy: aggressive  # For complex, bot-protected sites
```

### 2. Use Caching for Static Content

```yaml
cache:
  ttl: 86400  # 24 hours for slow-changing content
```

### 3. Set Appropriate Timeouts

```yaml
config:
  timeout: 30000  # 30 seconds (default)
  timeout: 60000  # 60 seconds for slower sites
```

### 4. Respect robots.txt

```yaml
config:
  userAgent: "YourBot/1.0 (+https://yoursite.com/bot)"
```
### 5. Handle Failures Gracefully

```yaml
agents:
  - name: scrape
    agent: scrape

  - name: fallback
    condition: ${scrape.failed}
    operation: code
    config:
      script: scripts/return-cached-response
```

```typescript
// scripts/return-cached-response.ts
import type { AgentExecutionContext } from '@ensemble-edge/conductor'

export default function returnCachedResponse(context: AgentExecutionContext) {
  return { cached: true, content: 'Fallback content' }
}
```
### 6. Rate Limit Proactively

```yaml
retry:
  maxAttempts: 3
  initialDelay: 1000
  maxDelay: 5000
```

### 7. Monitor Scraping Performance

```yaml
agents:
  - name: scrape
    agent: scrape
    inputs:
      url: ${input.url}

  - name: log-metrics
    operation: code
    config:
      script: scripts/log-scrape-metrics
    input:
      duration: ${scrape.output.duration}
      tier: ${scrape.output.tier}
      bot_detected: ${scrape.output.botProtectionDetected}
```
## Common Use Cases

### Company Research

```yaml
agents:
  - name: scrape-homepage
    agent: scrape
    inputs:
      url: https://${input.company_domain}
    config:
      returnFormat: text

  - name: extract-info
    operation: think
    config:
      prompt: "Extract company info from: ${scrape-homepage.output.text}"
```
### Price Monitoring

```yaml
agents:
  - name: scrape-product
    agent: scrape
    inputs:
      url: ${input.product_url}
    config:
      returnFormat: html
    cache:
      ttl: 3600

  - name: extract-price
    operation: code
    config:
      script: scripts/extract-price-from-html
    input:
      html: ${scrape-product.output.html}
```

```typescript
// scripts/extract-price-from-html.ts
import type { AgentExecutionContext } from '@ensemble-edge/conductor'

export default function extractPriceFromHtml(context: AgentExecutionContext) {
  const { html } = context.input
  const match = html.match(/\$(\d+\.\d{2})/)

  if (!match) {
    throw new Error('Price not found in HTML')
  }

  return { price: parseFloat(match[1]) }
}
```
### Content Aggregation

```yaml
agents:
  - name: scrape-sources
    agent: scrape
    inputs:
      url: ${input.sources}  # Array of URLs
    config:
      returnFormat: markdown

  - name: combine
    operation: code
    config:
      script: scripts/combine-scraped-content
    input:
      scraped_results: ${scrape-sources.output}
```

```typescript
// scripts/combine-scraped-content.ts
import type { AgentExecutionContext } from '@ensemble-edge/conductor'

export default function combineScrapedContent(context: AgentExecutionContext) {
  const { scraped_results } = context.input

  return {
    content: scraped_results.map((r: any) => r.markdown).join('\n\n')
  }
}
```
### SEO Analysis

```yaml
agents:
  - name: scrape-page
    agent: scrape
    inputs:
      url: ${input.url}
    config:
      returnFormat: html

  - name: analyze-seo
    operation: think
    config:
      provider: openai
      model: gpt-4o-mini
      prompt: |
        Analyze the SEO of this page:
        Title: ${scrape-page.output.title}
        HTML: ${scrape-page.output.html}

        Check for:
        - Meta descriptions
        - Header hierarchy
        - Alt text on images
        - Internal linking
        - Page speed considerations
```
## Performance by Tier

| Tier | Speed | Success Rate | Use Case |
|------|-------|--------------|----------|
| Tier 1 | ~350ms | 70% | Static pages |
| Tier 2 | ~2s | 90% | JavaScript-heavy pages |
| Tier 3 | ~1.5s | 95% | Bot-protected sites |
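Taking the per-tier speeds and success rates above at face value, you can estimate the expected latency of each strategy. The sketch below assumes a failed tier consumes its full duration before the next tier starts and that tier outcomes are independent; the numbers are the table's figures, not fresh measurements.

```typescript
// Expected latency: each tier's cost is weighted by the probability that
// all earlier tiers failed (so the run actually reaches it).
function expectedLatency(tiers: { ms: number; successRate: number }[]): number {
  let expected = 0
  let pReach = 1 // probability this tier is reached at all
  for (const { ms, successRate } of tiers) {
    expected += pReach * ms
    pReach *= 1 - successRate
  }
  return expected
}

const tier1 = { ms: 350, successRate: 0.7 }
const tier2 = { ms: 2000, successRate: 0.9 }
const tier3 = { ms: 1500, successRate: 0.95 }

// fast:       350ms (Tier 1 only)
// balanced:   350 + 0.3 * 2000 = ~950ms
// aggressive: 950 + 0.3 * 0.1 * 1500 = ~995ms
```

Under these assumptions, `balanced` costs only ~600ms more than `fast` in expectation while covering most JavaScript-heavy pages, which is why it is the default.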
## Limitations

- **CAPTCHA**: cannot bypass (requires manual intervention)
- **Authentication**: basic authentication supported via headers
- **File downloads**: binary content handling depends on `returnFormat`
- **Rate limiting**: respect the target site’s rate limits
- **Legal compliance**: ensure scraping is allowed per the site’s terms of service
## Next Steps

- **Validate Agent**: validate scraped data
- **HTTP Operation**: custom HTTP requests
- **RAG Pipeline**: scrape + embed workflow
- **Starter Kit Overview**: all starter kit agents