Output Evaluation

Offline evaluation framework to accelerate development

Evaluations let you check whether the output of an LLM step's prompt meets specified criteria. They also let you rapidly validate new prompts.

How do Evaluations work?

An evaluation determines whether your sample (inputs) and prompt generate an output that meets the given criteria, and returns true or false. A minimal sketch of these pieces appears after the list below.

  • Sample: a set of inputs whose resulting output is evaluated against a set of criteria.

  • Criteria: a rule or specification written in natural language to evaluate the output of your prompt.

    • E.g. The output must be a JSON object that begins with { and ends with }.

    • E.g. The output must not include a conclusion.

  • Evaluation: returns true or false based on the criteria you specified, along with the rationale.

  • LLM Step History: a history of executions, including the inputs, prompt, and output. You can rerun the same inputs against new prompts by selecting the executions you want to reuse.
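
To make these pieces concrete, here is a minimal, purely illustrative sketch in Python of how a sample, its criteria, and the resulting verdicts relate. The names (EvaluationResult, sample, criteria) are hypothetical and are not part of the AirOps API; they only mirror the concepts described above.

```python
# Illustrative sketch only -- these names are not the AirOps API.
from dataclasses import dataclass

@dataclass
class EvaluationResult:
    passed: bool      # the true/false verdict for one criterion
    rationale: str    # natural-language explanation of the verdict

# A sample is the set of step inputs to rerun.
sample = {"topic": "unicorns"}

# Criteria are rules written in natural language.
criteria = [
    "The output must be a story of roughly 50 words.",
    "The output must be about the topic given in the inputs.",
]

# Running an evaluation yields one verdict (plus rationale) per criterion.
results = [
    EvaluationResult(passed=True, rationale="The story is 50 words long."),
    EvaluationResult(passed=True, rationale="The story is about unicorns."),
]
```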

Evaluating an LLM step

Let's use an example workflow to walk through how to perform an evaluation. The workflow has a single LLM step and a topic input.

On the LLM step, we write a prompt that tells the LLM to write a simple 50-word story about the topic.
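
For illustration, such a prompt might look like the following, using a Liquid-style reference to the workflow's topic input (the exact variable name and syntax here are assumptions, not copied from the product):

```
Write a simple, engaging short story about {{ topic }}.
The story must be exactly 50 words long.
```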

We test it with the topic "unicorns" and get a nice 50-word story about unicorns.

Everything seems to be working as expected, so we can create an Evaluation. Hitting the "Create Evaluation" button gives us some suggested Criteria to evaluate against.

We now have our first Sample and Criteria for evaluation.

We can run an evaluation for the created sample by hitting the "evaluate" button.

The evaluation succeeded: the sample's inputs were used to execute the workflow and generate a new output, and that output was evaluated against the defined evaluation criteria. Now let's modify the original prompt and simply ask it to say some nonsense.

We head back to the Evaluation tab and run the evaluation again. This time, the criterion that checks that the story is about the required topic fails. In this way, we can verify that changes we make to the prompt don't break previous behavior.
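
As a rough illustration of what that failure looks like in terms of the result shape sketched earlier, the failing criterion comes back as a false verdict with its rationale (the wording below is hypothetical):

```python
# Hypothetical failing verdict for the topic criterion (illustrative only).
failed_result = {
    "passed": False,
    "rationale": "The output is unrelated nonsense and does not mention the topic 'unicorns'.",
}
```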

Adding samples from history

Under the history tab, we can see all the executions of the LLM step, and we can use any of those executions as samples for evaluation.

Test + Evaluate

You can use the evaluation criteria you previously defined to evaluate a prompt you are currently working on. Just use the "Test + Evaluate" option and the output of the LLM will be evaluated right there.
