Web Page Scrape
Scrape text, markdown or HTML from a website
The "Web Page Scrape" Step lets you automatically scrape a specific URL as text, markdown, or HTML. You can combine it with an Iteration Step to scrape multiple websites and parse each output separately.
Configuring the step requires setting the parameters shown below:
Add the specific URL you want the step to scrape.
Optionally, you may limit the number of characters returned by the step.
This parameter can be helpful to limit the amount of text passed to a subsequent LLM step, which has a limited context window.
1 token is approximately 4 characters in English. To estimate the number of characters to pass to an LLM step, multiply the number of tokens you want to pass by 4.
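The estimate above can be written as a small helper. This is a minimal sketch of the 4-characters-per-token rule of thumb only; the function name `max_characters_for` and the 2,000-token budget are illustrative assumptions, not part of the step.

```python
# Rough English average from the rule of thumb above (an approximation, not exact).
CHARS_PER_TOKEN = 4


def max_characters_for(token_budget: int) -> int:
    """Estimate a character limit that roughly fits the given token budget."""
    if token_budget < 0:
        raise ValueError("token_budget must be non-negative")
    return token_budget * CHARS_PER_TOKEN


# Hypothetical example: leave about 2,000 tokens of room for scraped content.
print(max_characters_for(2000))  # 8000
```

The result is the value you would enter as the character limit. Because the ratio is only an average, it is reasonable to leave some headroom below the model's actual context window.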
By default, the step will terminate the workflow if it fails. To continue the workflow when the step fails, click Continue at the bottom of the step.
The step will return the following keys:
output: this will be null
error:
  message: the message returned from the step
  code: the error code representing the error
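If downstream code receives the step result as a dictionary with the keys listed above, a failure can be detected as sketched here. The function name `scraped_output_or_none`, the sample message and code values, and the shape of a successful result (a populated `output` with no `error`) are assumptions made for illustration.

```python
from typing import Any, Optional


def scraped_output_or_none(step_result: dict[str, Any]) -> Optional[str]:
    """Return the scraped content, or None if the step reported an error."""
    error = step_result.get("error")
    if error:
        # On failure, `output` is null and `error` carries a message and a code.
        print(f"Scrape failed ({error.get('code')}): {error.get('message')}")
        return None
    return step_result.get("output")


# Hypothetical failed result; the message and code values are made up.
failed = {"output": None, "error": {"message": "Request timed out", "code": "TIMEOUT"}}
print(scraped_output_or_none(failed))  # None
```

Checking for the error before reading the output lets later steps skip or retry a URL instead of processing a null value.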
By default, the Web Page Scrape Step will not render websites that use JavaScript.
Check this box to enable scraping from websites that rely on JavaScript to render dynamic content (examples include Facebook and Airbnb).
The maximum time, in milliseconds, to wait for the scraped results to return.
Use the residential proxy for sites that require more reliability and higher success rates. Use the datacenter proxy where reliability and success rates are less of a concern.
Datacenter:
Private IP addresses that are housed in data centers
Offer higher speed but are less reliable in terms of anonymity
More likely to be detected and blocked by websites and internet services.
Residential:
A real IP address attached to a physical location
Scraping requests will appear to come from a residential home in a certain location
Considered more legitimate and less likely to be blocked by websites
Note: Using a residential proxy is more expensive than using the datacenter proxy, so weigh the extra cost against your use case when deciding which proxy to use.
Text: No formatting included
HTML: Extract headers and formatting from a website in HTML
Markdown: Extract the headers from a website in markdown
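To illustrate the difference between the HTML and Text formats, the sketch below strips the tags from a small HTML snippet using Python's standard `html.parser` module. It is an illustration of the idea only and not how the step itself is implemented; the sample HTML and the helper names are made up.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document, dropping all tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        stripped = data.strip()
        if stripped:
            self.parts.append(stripped)


def html_to_text(html: str) -> str:
    """Return the text content of `html` with no formatting."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page fragment: the HTML format keeps the tags, Text drops them.
sample_html = "<h1>Pricing</h1><p>Plans start at <b>$10</b> per month.</p>"
print(html_to_text(sample_html))  # Pricing Plans start at $10 per month.
```

The Text form is the most compact, which helps when the output feeds an LLM step, while HTML and Markdown preserve the heading structure.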