Web Page Scrape

Scrape text, markdown or HTML from a website

The "Web Page Scrape" Step scrapes text, Markdown, or HTML from a specific URL. You can combine it with an Iteration Step to scrape multiple websites and parse each output separately.

Configuring the "Web Page Scrape" Step

Configuring the step requires setting the parameters shown below:

URL

Add the specific URL you want the step to scrape.

Maximum Length

Optionally, you may limit the number of characters returned by the step.

This parameter helps limit the amount of text passed to a subsequent LLM step, which has a limited context window.

1 token is approximately 4 characters in English. To estimate the number of characters you should pass to an LLM step, multiply the number of tokens you want to pass by 4.
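As a rough illustration of that arithmetic, the sketch below converts a token budget into a Maximum Length value. The function names and the 2,000-token budget are hypothetical and are not part of the step itself.

```python
# Rule of thumb from the text above: about 4 characters per token in English.
CHARS_PER_TOKEN = 4


def max_length_for_tokens(token_budget: int) -> int:
    """Estimate the character limit that corresponds to a token budget."""
    return token_budget * CHARS_PER_TOKEN


def truncate(text: str, max_length: int) -> str:
    """Keep at most max_length characters, as a character limit would."""
    return text[:max_length]


# Hypothetical budget: reserve 2,000 tokens of the LLM context for scraped text.
print(max_length_for_tokens(2000))  # 8000
```

Because the ratio is only an approximation, leaving some headroom below the model's actual context limit is a reasonable precaution.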

Hard fail?

If checked, the step triggers an error and forces the app to fail when no results can be scraped. If left unchecked, the step returns an empty string instead.

Use a conditional step or a Liquid conditional to check whether the Web Page Scrape Step returned an empty string.
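The two behaviors can be sketched in Python as follows. This is only a model of the semantics described above, not the step's implementation; `ScrapeError` and `finish_scrape` are hypothetical names.

```python
class ScrapeError(Exception):
    """Hypothetical error raised when hard fail is on and nothing was scraped."""


def finish_scrape(scraped_text: str, hard_fail: bool) -> str:
    """Model the 'Hard fail?' option for a scrape result."""
    if scraped_text:
        return scraped_text
    if hard_fail:
        # Checked: stop the whole run with an error.
        raise ScrapeError("no results could be scraped")
    # Unchecked: hand an empty string to the next step.
    return ""


# With hard fail off, a later step should check for the empty string itself.
result = finish_scrape("", hard_fail=False)
if result == "":
    print("Nothing scraped; skipping the LLM step.")
```

Leaving the box unchecked is the more forgiving choice when scraping many URLs in an iteration, since a single empty page does not stop the whole run.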

Enable Javascript rendering?

By default, the Web Page Scrape Step does not render JavaScript, so content that a website loads dynamically may be missing from the result.

Check this box to scrape websites that rely on JavaScript to render dynamic content (examples include Facebook and Airbnb).

Timeout

The maximum time, in milliseconds, to wait for the scraped results to return.

Type of Proxy:

Use the residential proxy for sites where you need higher reliability and success rates. Use the datacenter proxy when reliability and success rates are less of a concern.

Datacenter:

  • Private IP addresses that are housed in data centers

  • Offer higher speeds but less anonymity

  • More likely to be detected and blocked by websites and internet services

Residential:

  • Real IP addresses attached to physical locations

  • Requests appear to come from a residential home in a specific location

  • Considered more legitimate and less likely to be blocked by websites

Note: A residential proxy costs more than a datacenter proxy, so weigh that cost against your use case when deciding which proxy to use.

Output Type

  • Text: No formatting included

  • HTML: Extracts the website's headers and formatting as HTML

  • Markdown: Extracts the website's headers as Markdown
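To illustrate the difference between the Text and HTML output types, here is a rough sketch that strips the tags from a small HTML fragment using Python's standard library. The fragment and the helper names are made up for illustration, and the step's actual extraction logic may differ.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text content of a page, discarding all tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        stripped = data.strip()
        if stripped:
            self.parts.append(stripped)


def html_to_text(html: str) -> str:
    """Return the text of an HTML fragment with no formatting."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical fragment, as the HTML output type might return it.
sample_html = "<h1>Pricing</h1><p>Plans start at <b>$10</b> per month.</p>"

print(html_to_text(sample_html))  # Pricing Plans start at $10 per month.
```

Plain text is usually the most compact form to pass to an LLM step, while HTML or Markdown keeps structure that a later parsing step can rely on.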
