Web Page Scrape

Scrape text, markdown or HTML from a website

The "Web Page Scrape" Step allows you to automate a text/markdown/HTML scrape from a specific URL. You can combine this with an Iteration Step to scrape through multiple websites, and parse the output separately.

Configuring the "Web Page Scrape" Step

Configuring the step requires setting the parameters shown below:

URL

Add the specific URL you want the step to scrape.

Maximum Length

Optionally, you can limit the number of characters returned by the step.

This parameter is helpful for capping the amount of text passed to a subsequent LLM step, which has a limited context window.

One token is approximately 4 characters in English. To estimate the number of characters you should pass to an LLM step, multiply the number of tokens you want to pass by 4.
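
For example, to keep the scraped text within roughly an 8,000-token budget for a downstream LLM step (an illustrative budget; choose one that fits your model's context window), set Maximum Length to about 8,000 × 4 = 32,000 characters.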

How to continue if the Web Page Scrape step fails

By default, the Web Page Scrape step will terminate the workflow if it fails. To continue the workflow instead, click Continue at the bottom of the step.

The step will then return the following keys:

  • output : this will be null

  • error :

    • message : the error message returned from the step

    • code : the error code representing the type of error
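
For illustration, a failed run that continues might return a payload shaped like this (the message and code values are hypothetical examples, not an exhaustive list):

{
    "output": null,
    "error": {
        "message": "Request timed out",
        "code": "TIMEOUT"
    }
}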

Enable JavaScript rendering?

By default, the Web Page Scrape Step will not render websites that use JavaScript.

Check this box to enable scraping of websites that rely on JavaScript to render dynamic content (for example, Facebook and Airbnb).

Timeout

The maximum time to wait, in milliseconds, for the scraped results to return. For example, a value of 30000 waits up to 30 seconds.

Type of Proxy

Use the residential proxy for sites that require higher reliability and success rates; use the datacenter proxy when reliability and success rates are less of a concern.

Datacenter

  • Private IP addresses housed in data centers

  • Offer higher speed but are less reliable in terms of anonymity

  • More likely to be detected and blocked by websites and internet services

Residential

  • A real IP address attached to a physical location

  • Web scraping will appear to come from a residential home in a certain location

  • Considered more legitimate and less likely to be blocked by websites

Note: A residential proxy is more expensive than a datacenter proxy, so weigh the cost against your use case when deciding which to use.

Headers

Use this field to pass custom headers for the web scrape request. Ensure the headers are formatted as valid JSON. For example:

{
    "authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "{{ a_variable }}"
}
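
In the example above, {{ a_variable }} is a variable reference; values wrapped in {{ }} are resolved at run time, so you can populate a header from a workflow input or an earlier step.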

Output Type

  • Text: plain text with no formatting included

  • HTML: extract headings and formatting from the page as HTML

  • Markdown: extract headings from the page as Markdown
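
As a rough illustration (actual output depends on the page), a page heading and paragraph might come back in each format as:

Text:

Pricing
Plans for every team.

HTML:

<h1>Pricing</h1>
<p>Plans for every team.</p>

Markdown:

# Pricing
Plans for every team.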
