Model Selection Guide
Determine which large language model to use
Which Model Should I Use?

What to Consider
Choosing a model depends on the following:
Context Window: the number of tokens you can provide to an LLM in a single request. Roughly, 1 token ≈ 4 characters (see the quick estimate sketched after this list).
Task Complexity: more capable models are generally better suited for complex logic.
Web Access: whether the use case you're building requires the model to have web access.
Cost: more capable models are generally more expensive. For example, o1 is more expensive than GPT-4o.
Speed: more capable models are generally slower to execute.
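As a quick illustration of the ~4 characters per token rule of thumb, the sketch below estimates a token count from character count and checks it against a context window; the prompt text and the 128K limit are placeholders, not values from this guide.

```python
def approximate_tokens(text: str) -> int:
    """Rough estimate based on the ~4 characters per token rule of thumb."""
    return len(text) // 4

# Placeholder prompt; in practice this would be your full system + user prompt.
prompt = "Summarize the key findings of the attached quarterly report in five bullet points."

estimated = approximate_tokens(prompt)
context_window = 128_000  # e.g. a 128K-token model

# Leave headroom for the model's output tokens, not just the input.
print(f"~{estimated} tokens; fits in context window: {estimated < context_window}")
```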
AirOps Popular LLMs
| Model | Provider | Description | Context Window |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5 | OpenAI | Flagship model for complex tasks | 400K | ✓ | ✓ | ✓ |
| GPT-4.1 | OpenAI | For complex tasks, vision-capable | 1M | ✓ | ✓ | - |
| GPT-4o Search Preview | OpenAI | Flagship model for online web research | 128K | ✓ | ✓ | ✓ |
| o4-mini | OpenAI | Fast multi-step reasoning for complex tasks | 128K | - | ✓ | - |
| o3 | OpenAI | Advanced reasoning for complex tasks | 128K | - | ✓ | - |
| o3-mini | OpenAI | Fast multi-step reasoning for complex tasks | 128K | - | ✓ | - |
| Claude Opus 4.1 | Anthropic | Powerful model for complex and writing tasks | 200K | ✓ | - | - |
| Claude Sonnet 4 | Anthropic | Hybrid reasoning: fast answers or deep thinking | 200K | ✓ | - | - |
| Gemini 2.5 Pro | Google | Advanced reasoning for complex tasks | 1M | ✓ | ✓ | ✓ |
| Gemini 2.5 Flash | Google | Fast and intelligent model for lightweight tasks | 1M | ✓ | ✓ | ✓ |
| Perplexity Sonar | Perplexity | Balanced model for online web research | 128K | - | ✓ | ✓ |
Differences between “o-series” and “GPT” models
GPT-5 Series: Built-In Reasoning
GPT-5 Models: OpenAI's first model series to combine the reasoning paradigm with traditional LLM capabilities. Features reasoning levels (minimal, low, medium, high) that control how much reasoning the model performs.
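If you are calling GPT-5 directly through the OpenAI API rather than through an AirOps step, a minimal sketch of setting the reasoning level might look like the following; the prompt and effort value are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Use minimal reasoning for a simple rewrite task to reduce latency and cost.
response = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "minimal"},
    input="Rewrite this sentence in a friendlier tone: The request was denied.",
)

print(response.output_text)
```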
O-series Models (o3, o4-mini): Pure Reasoning Specialists
Specialized exclusively for deep reasoning and step-by-step problem solving. These models excel at complex, multi-stage tasks requiring logical thinking and tool use. Choose these when maximum accuracy and reasoning depth are paramount. Features reasoning levels (low, medium, high) for controlling reasoning token usage.
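Called directly via the Chat Completions API, the o-series models accept a reasoning-effort hint; a sketch with an illustrative prompt might look like this.

```python
from openai import OpenAI

client = OpenAI()

# Ask o4-mini to spend more reasoning tokens on a multi-step planning task.
completion = client.chat.completions.create(
    model="o4-mini",
    reasoning_effort="high",
    messages=[
        {"role": "user", "content": "Outline a five-step plan to migrate a blog from WordPress to a static site."},
    ],
)

print(completion.choices[0].message.content)
```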
GPT Models (4.1, 4o): Traditional General-Purpose
Optimized for general-purpose tasks with excellent instruction following. GPT-4.1 excels with long contexts (1M tokens) while GPT-4o has variants for realtime speech, text-to-speech, and speech-to-text. GPT-4.1 also comes in mini and nano variants, while GPT-4o has a mini variant. These variants are cheaper and faster than their full-size counterparts. Strong in structured output generation.
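For the structured-output strength mentioned above, a minimal sketch using the OpenAI Python SDK's parse helper with a Pydantic schema might look like this; the schema and prompt are illustrative, and other providers expose similar JSON-schema features.

```python
from openai import OpenAI
from pydantic import BaseModel

class ArticleBrief(BaseModel):
    title: str
    target_keyword: str
    word_count: int

client = OpenAI()

# Ask GPT-4.1 to return output matching the ArticleBrief schema.
completion = client.beta.chat.completions.parse(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Draft a brief for an article about choosing an LLM."}],
    response_format=ArticleBrief,
)

brief = completion.choices[0].message.parsed
print(brief.title, brief.word_count)
```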
How much will it cost to run?
The cost to run a model depends on the number of input and output tokens.
Token Approximation
Input tokens: to approximate the total input tokens, copy and paste your system, user, and assistant prompts into the OpenAI tokenizer
Output tokens: to approximate the total output tokens, copy and paste your output into the OpenAI tokenizer
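For a scriptable alternative to the web tokenizer, the tiktoken library gives the same kind of approximation; the encoding name and prompts below are illustrative, and counts for non-OpenAI models will differ slightly.

```python
import tiktoken

# o200k_base is the encoding used by recent GPT-4o-family models; counts for
# other providers' tokenizers will differ, so treat this as an approximation.
enc = tiktoken.get_encoding("o200k_base")

system_prompt = "You are a helpful research assistant."
user_prompt = "Summarize the attached article in five bullet points."

input_tokens = len(enc.encode(system_prompt)) + len(enc.encode(user_prompt))
print(f"Approximate input tokens: {input_tokens}")
```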
Cost Approximation
OpenAI: divide the input and output tokens by 1,000; then multiply by their respective costs based on OpenAI pricing*
Anthropic: divide the input and output tokens by 1,000,000; then multiply by their respective costs based on Anthropic pricing*
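Putting the two approximations together, a sketch of the OpenAI-style calculation might look like the following; the per-1K prices are placeholders, so substitute the current rates from the provider's pricing page (and divide by 1,000,000 instead for providers that quote per-1M prices, as with Anthropic above).

```python
# Placeholder per-1K-token prices; replace with the current rates from the
# provider's pricing page before relying on the result.
INPUT_PRICE_PER_1K = 0.0025
OUTPUT_PRICE_PER_1K = 0.0100

def approximate_cost(input_tokens: int, output_tokens: int) -> float:
    """Divide token counts by 1,000, then multiply by the per-1K prices."""
    return (
        (input_tokens / 1_000) * INPUT_PRICE_PER_1K
        + (output_tokens / 1_000) * OUTPUT_PRICE_PER_1K
    )

# Example: a 3,000-token prompt that produces an 800-token response.
print(f"${approximate_cost(input_tokens=3_000, output_tokens=800):.4f}")  # $0.0155
```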