Inference

A practical guide to understanding what inference means in AI and why it matters for product performance and cost.

What inference is, how it differs from training, and what product and design teams need to understand about its implications for speed, cost, and reliability.

22 May 2026 · 4 min read

What it is

Inference is the process of using a trained model to generate outputs: producing a response, completing a task, or making a prediction based on a given input.

Every time a user sends a message to an AI chatbot, asks an AI tool to summarise a document, or requests a generated image, that is inference in action.

Inference is distinct from training. Training is the process of building and improving the model using data. Inference is the process of using the finished model.

From a product perspective, inference is almost always what matters day-to-day. Training happens once, or periodically when the model is updated. Inference happens every time a user interacts with the feature.

glossaryInferenceInference is the process of using a trained model to generate outputs or make predictions based on new input data.Open glossary term has cost, glossaryLatencyLatency is the time delay between a user action and the system response.Open glossary term, and glossaryReliabilityReliability is the ability of a system to consistently perform as expected without failure.Open glossary term implications that directly affect the user experience.

When to use it

Understand when inference considerations are practically relevant. They matter most when:

You are estimating or managing the running costs of an AI feature

Response speed is a meaningful part of the user experience

You are designing for high-volume usage where cost and performance at scale matter

You are troubleshooting slow or inconsistent AI responses

They matter less when:

You are in early exploration or prototyping and performance is not yet a concern

Key takeaway

Inference is the operational reality of AI products. Understanding it helps you design features that are fast, cost-effective, and reliable under real-world usage.

How it works

Understand the basic mechanism. When a user sends an input to a language model, the model processes it and builds the response token by token: at each step it generates a probability distribution over possible next tokens and samples from it.
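
To make the sampling step concrete, here is a minimal sketch in Python. The four-word vocabulary and the scores are invented for illustration; a real model computes its scores from the whole input and chooses from a far larger vocabulary. The sketch only shows how scores become probabilities and how one token is then drawn at random.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution that sums to 1."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, logits, rng, temperature=1.0):
    """Pick one token at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical toy vocabulary and scores (assumptions, not real model output).
vocab = ["cat", "dog", "car", "."]
logits = [2.0, 1.5, 0.2, -1.0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
probs = softmax(logits)
token = sample_next_token(vocab, logits, rng)
print({t: round(p, 3) for t, p in zip(vocab, probs)})
print("sampled:", token)
```

Generating a full response repeats this step once per output token, which is why longer outputs take longer.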

This happens on powerful servers maintained by the model provider. The cost and speed of inference depend on the size of the model, the length of the input and output, and the infrastructure being used.

Longer inputs and outputs cost more and take longer. Smaller, more efficient models are cheaper and faster but may produce lower-quality outputs.
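
A rough way to reason about this trade-off is a back-of-envelope estimate. The sketch below assumes two made-up model profiles; the names, prices, and token speeds are placeholders, not figures from any real provider. It also ignores the time spent reading the input, so treat the result as a lower bound on response time.

```python
from dataclasses import dataclass

@dataclass
class ModelProfile:
    """Hypothetical per-model figures; replace with your provider's numbers."""
    name: str
    input_price_per_1k: float   # currency units per 1,000 input tokens
    output_price_per_1k: float  # currency units per 1,000 output tokens
    output_tokens_per_second: float

def estimate_cost(model, input_tokens, output_tokens):
    """Cost of one request: input and output tokens are priced separately."""
    return (input_tokens / 1000 * model.input_price_per_1k
            + output_tokens / 1000 * model.output_price_per_1k)

def estimate_generation_seconds(model, output_tokens):
    """Time to generate the output, ignoring input processing and network."""
    return output_tokens / model.output_tokens_per_second

# Placeholder profiles (assumptions for illustration only).
large = ModelProfile("large-model", 0.010, 0.030, 40.0)
small = ModelProfile("small-model", 0.001, 0.002, 120.0)

for model in (large, small):
    cost = estimate_cost(model, input_tokens=2000, output_tokens=500)
    seconds = estimate_generation_seconds(model, output_tokens=500)
    print(f"{model.name}: cost ~{cost:.4f}, generation ~{seconds:.1f}s")
```

Swapping in real prices and measured speeds turns this into a quick comparison of candidate models for the same task.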

What this means for designers and product teams. Inference latency, the time between sending a request and receiving a response, is a direct UX consideration. Long response times frustrate users and reduce the perceived value of AI features.

Streaming responses, where the model's output is displayed progressively as it is generated rather than all at once after completion, can significantly improve the perceived responsiveness of AI features even when total generation time is unchanged.
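
The effect of streaming can be illustrated with a small simulation. This sketch does not call a real model: `generate_tokens` is a hypothetical stand-in that pretends each word takes a fixed 50 ms to produce, and the code compares how long the user waits before seeing anything in each mode.

```python
def generate_tokens(text, seconds_per_token):
    """Simulated model: yields (elapsed_seconds, token) pairs one at a time."""
    elapsed = 0.0
    for token in text.split():
        elapsed += seconds_per_token
        yield elapsed, token

def time_to_first_visible_output(text, seconds_per_token, streaming):
    """Seconds until the user sees anything on screen."""
    timeline = list(generate_tokens(text, seconds_per_token))
    if not timeline:
        return 0.0
    first_elapsed, _ = timeline[0]
    last_elapsed, _ = timeline[-1]
    # Streaming shows the first token immediately; buffering waits for the last.
    return first_elapsed if streaming else last_elapsed

reply = "Inference is the process of using a trained model to generate outputs"
per_token = 0.05  # hypothetical: 50 ms per token

streamed = time_to_first_visible_output(reply, per_token, streaming=True)
buffered = time_to_first_visible_output(reply, per_token, streaming=False)
print(f"streaming: first text after {streamed:.2f}s")
print(f"buffered:  first text after {buffered:.2f}s")
```

Total generation time is identical in both cases; only the wait before the first visible text changes, and that gap grows with the length of the response.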

What to look for

Focus on:

Latency — whether response times are acceptable for the intended use case

Cost per interaction — whether inference costs are sustainable at projected usage volumes

Streaming — whether displaying responses progressively would improve perceived performance

Model selection — whether a smaller, faster model could deliver acceptable quality at lower cost

Error rates — how often inference fails or returns errors, and how those failures are handled
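
Latency and error rate are easier to discuss with numbers. The sketch below assumes you can export a log of calls as (latency, succeeded) pairs; the ten sample calls are invented. It reports the median and 95th-percentile latency of successful calls alongside the share of calls that failed.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarise(calls):
    """calls: list of (latency_seconds, succeeded) tuples from request logs."""
    latencies = [latency for latency, ok in calls if ok]
    failures = sum(1 for _, ok in calls if not ok)
    return {
        "p50_seconds": percentile(latencies, 50),
        "p95_seconds": percentile(latencies, 95),
        "error_rate": failures / len(calls),
    }

# Hypothetical log of ten inference calls (one timed out and failed).
calls = [(0.8, True), (1.1, True), (0.9, True), (4.2, True), (1.0, True),
         (1.3, True), (0.7, True), (30.0, False), (1.2, True), (0.9, True)]
summary = summarise(calls)
print(summary)
```

Looking at the 95th percentile rather than the average matters because a handful of very slow responses can shape how the feature feels even when the typical response is quick.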

Where it goes wrong

Ignoring inference costs until production scale reveals them as a problem is a common and expensive mistake. Most issues come from:

No cost modelling before launch, leading to unexpected bills at scale

Latency only tested on fast connections and not representative of real user conditions

No streaming implementation, making response times feel longer than they are

Over-engineering with large, expensive models for tasks that a smaller model handles adequately

What you get from it

Understanding inference gives you:

A clearer picture of where AI running costs come from

Better criteria for balancing model quality against performance and cost

A basis for designing AI features that feel responsive and reliable

More informed conversations with engineers about performance optimisation

Key takeaway

Every AI response is an inference call. Designing around its cost, latency, and reliability is as important as designing the experience itself.

FAQ

Common questions

A few practical answers to the questions that usually come up around this topic.

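What is inference in AI?

Inference is the process of using a trained model to generate outputs or make predictions based on new input data.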

How is inference different from training?

Training is the process of building the model using data. Inference is the process of using the finished model. Training is expensive, slow, and happens infrequently. Inference is what your product does constantly in production.

Why do AI responses sometimes take a long time?

Because generating a long response token by token takes time, and that time increases with the length of both the input and the output. Model size, infrastructure load, and network conditions also affect response speed. Streaming responses can make the experience feel faster by displaying output as it is generated.

How much does inference cost?

It varies significantly by model size, provider, and input and output length. Most providers charge by the token, with prices quoted per thousand or per million tokens. Small, efficient models can cost a fraction of a cent per interaction. Large, complex models with long inputs and outputs can cost several cents per interaction, which adds up quickly at scale.
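
To see how a small per-interaction cost adds up, a simple projection helps. The figures below (three cents per interaction, ten interactions per user per day, and the two user counts) are assumptions chosen for illustration, not benchmarks.

```python
def monthly_inference_cost(cost_per_interaction, interactions_per_user_per_day,
                           daily_active_users, days=30):
    """Projected spend for one month of usage, in the same currency units."""
    daily_interactions = interactions_per_user_per_day * daily_active_users
    return cost_per_interaction * daily_interactions * days

# Hypothetical scenarios: the same feature at pilot scale and at launch scale.
scenarios = {"pilot": 500, "launch": 50_000}
for label, users in scenarios.items():
    total = monthly_inference_cost(
        cost_per_interaction=0.03,
        interactions_per_user_per_day=10,
        daily_active_users=users,
    )
    print(f"{label}: {users} users -> about {total:,.0f} per month")
```

Running a projection like this before launch, with your own usage assumptions, is one way to avoid being surprised by the bill later.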

Can inference speed be improved?

Yes, through several approaches: using smaller models where quality requirements allow, caching common responses, optimising prompt length, and choosing providers with lower-latency infrastructure. Streaming responses also improves perceived speed without changing actual generation time.
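
Caching is the easiest of these to sketch. In the example below, `fake_model` is a hypothetical stand-in for a real inference call and `CachedModel` is an illustrative wrapper, not part of any library. Repeated prompts are answered from memory, so the model is only called once per distinct question.

```python
class CachedModel:
    """Wraps a model call so repeated prompts skip inference entirely."""

    def __init__(self, call_model):
        self._call_model = call_model  # any function: prompt -> response
        self._cache = {}
        self.model_calls = 0

    @staticmethod
    def _key(prompt):
        # Normalise whitespace and case so trivially different prompts match.
        return " ".join(prompt.lower().split())

    def ask(self, prompt):
        key = self._key(prompt)
        if key not in self._cache:
            self.model_calls += 1
            self._cache[key] = self._call_model(prompt)
        return self._cache[key]

def fake_model(prompt):
    """Stand-in for a real (slow, paid) inference call."""
    return f"answer to: {prompt}"

model = CachedModel(fake_model)
first = model.ask("What is inference?")
second = model.ask("what is   inference?")  # served from the cache
print("model calls:", model.model_calls)
```

This only suits responses that are safe to reuse across users and do not depend on personal or fast-changing data, so it tends to fit FAQ-style prompts better than open-ended conversations.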

Quick take

Inference is what happens every time an AI generates a response — and understanding it helps you design better AI features and have more informed conversations about performance and cost.
