AI
Inference
A practical guide to understanding what inference means in AI and why it matters for product performance and cost.
What inference is, how it differs from training, and what product and design teams need to understand about its implications for speed, cost, and reliability.
What it is
Inference is the process of using a trained model to generate outputs: producing a response, completing a task, or making a prediction based on a given input.
Every time a user sends a message to an AI chatbot, asks an AI tool to summarise a document, or requests a generated image, that is inference in action.
Inference is distinct from training. Training is the process of building and improving the model using data. Inference is the process of using the finished model.
From a product perspective, inference is almost always what matters day-to-day. Training happens once, or periodically when the model is updated. Inference happens every time a user interacts with the feature.
Inference has cost, latency, and reliability implications that directly affect the user experience.
When to use it
Understand when inference considerations are practically relevant. They matter most when:
They matter less when:
Key takeaway
Inference is the operational reality of AI products. Understanding it helps you design features that are fast, cost-effective, and reliable under real-world usage.
How it works
Understand the basic mechanism. When a user sends an input to a language model, the model processes it as a sequence of tokens and builds its response one token at a time: at each step it generates a probability distribution over possible next tokens and samples from it.
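The token-by-token loop can be illustrated with a toy example. This is a minimal sketch, not how any particular provider implements inference: the four-word vocabulary and the `toy_model` scoring function are made up for illustration, standing in for a real trained model.

```python
import math
import random

# Toy vocabulary (an assumption for this sketch); "<end>" marks the end of the response.
VOCAB = ["yes", "no", "maybe", "<end>"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(tokens):
    """Hypothetical stand-in for a trained model: one score per vocabulary entry.

    The score for "<end>" grows with sequence length so generation eventually stops.
    """
    return [1.0, 0.8, 0.5, 0.4 * len(tokens)]


def generate(prompt_tokens, max_new_tokens=20, seed=0):
    """Build a response one token at a time by sampling from the distribution."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    response = []
    for _ in range(max_new_tokens):
        probs = softmax(toy_model(tokens))  # distribution over next tokens
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]  # sample one
        if next_token == "<end>":
            break
        tokens.append(next_token)
        response.append(next_token)
    return response


print(generate(["is", "it", "ready", "?"]))
```

Each pass through the loop is one step of generation, which is why longer outputs take proportionally longer to produce.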
This happens on powerful servers maintained by the model provider. The cost and speed of inference depend on the size of the model, the length of the input and output, and the infrastructure being used.
Longer inputs and outputs cost more and take longer. Smaller, more efficient models are cheaper and faster but may produce lower-quality outputs.
What this means for designers and product teams. Inference latency (the time between sending a request and receiving a response) is a direct UX consideration. Long response times frustrate users and reduce the perceived value of AI features.
Streaming responses, where the model's output is displayed progressively as it is generated rather than all at once after completion, can significantly improve the perceived responsiveness of AI features even when total generation time is unchanged.
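The difference between perceived and total time can be sketched with a simulated stream. This is an illustrative model only: the token count and the 0.05 seconds per token are made-up figures, and elapsed time is computed rather than actually waited.

```python
def token_stream(tokens, seconds_per_token):
    """Yield (elapsed_seconds, token) pairs, simulating a streamed response.

    No real waiting happens; elapsed time is computed so the sketch is deterministic.
    """
    for index, token in enumerate(tokens, start=1):
        yield index * seconds_per_token, token


def compare_waits(tokens, seconds_per_token):
    """Return how long the user waits before seeing anything, with and without streaming."""
    first_visible = None
    finished_at = 0.0
    for elapsed, _token in token_stream(tokens, seconds_per_token):
        if first_visible is None:
            first_visible = elapsed  # streaming UI can show this token immediately
        finished_at = elapsed  # non-streaming UI shows nothing until the last token
    return {
        "streaming_first_output_s": first_visible,
        "non_streaming_first_output_s": finished_at,
        "total_generation_s": finished_at,
    }


# Assumed figures for illustration: 200 tokens at 0.05 seconds each.
waits = compare_waits(["tok"] * 200, seconds_per_token=0.05)
print(waits)
```

Under these assumed figures the total generation time is identical in both cases; only the moment the user first sees output changes.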
What to look for
Focus on:
Where it goes wrong
Most issues come from: Ignoring inference costs until production scale reveals them as a problem is a common and expensive mistake.
What you get from it
Understanding inference gives you:
Key takeaway
Every AI response is an inference call. Designing around its cost, latency, and reliability is as important as designing the experience itself.
FAQ
Common questions
A few practical answers to the questions that usually come up around this topic.
What is inference in AI?
Inference is the process of using a trained model to generate outputs: producing a response, completing a task, or making a prediction. It is what happens every time a user interacts with an AI feature.
How is inference different from training?
Training is the process of building the model using data. Inference is the process of using the finished model. Training is expensive, slow, and happens infrequently. Inference is what your product does constantly in production.
Why do AI responses sometimes take a long time?
Because generating a long response token by token takes time, and that time increases with the length of both the input and the output. Model size, infrastructure load, and network conditions also affect response speed. Streaming responses can make the experience feel faster by displaying output as it is generated.
How much does inference cost?
It varies significantly by model size, provider, and input and output length. Most providers charge by token count, typically quoted per thousand or per million tokens. Small, efficient models can cost a fraction of a penny per interaction. Large, complex models with long inputs and outputs can cost several cents per interaction, which adds up quickly at scale.
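A rough estimate of what this means at scale can be sketched as follows. The prices, token counts, and usage figures are placeholders rather than any provider's real rates, and the separate input and output prices are an assumption about how the provider bills.

```python
def cost_per_interaction(input_tokens, output_tokens,
                         input_price_per_1k, output_price_per_1k):
    """Estimate the cost of one inference call from per-thousand-token prices."""
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (output_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost


def monthly_cost(cost_per_call, calls_per_user_per_day, users, days=30):
    """Scale a per-call cost up to a monthly figure."""
    return cost_per_call * calls_per_user_per_day * users * days


# Placeholder figures for illustration only, not real pricing.
per_call = cost_per_interaction(
    input_tokens=1500,
    output_tokens=500,
    input_price_per_1k=0.001,
    output_price_per_1k=0.002,
)
per_month = monthly_cost(per_call, calls_per_user_per_day=10, users=10_000)
print(f"per call: {per_call:.4f}, per month: {per_month:.2f}")
```

With these placeholder numbers a call that costs a quarter of a cent becomes thousands per month, which is the kind of jump worth estimating before launch rather than after.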
Can inference speed be improved?
Yes, through several approaches: using smaller models where quality requirements allow, caching common responses, optimising prompt length, and choosing providers with lower-latency infrastructure. Streaming responses also improves perceived speed without changing actual generation time.
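Of these, caching is the easiest to sketch in a few lines. The example below is a minimal in-memory illustration, assuming that the same question should get the same answer for every user; `ResponseCache` and `fake_model` are hypothetical names, with `fake_model` standing in for a real inference call.

```python
class ResponseCache:
    """Serve repeated prompts from memory instead of calling the model again."""

    def __init__(self, call_model):
        self._call_model = call_model  # function that performs the real inference call
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt):
        # Normalise case and whitespace so trivially different prompts share an entry.
        return " ".join(prompt.lower().split())

    def get(self, prompt):
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._call_model(prompt)
        self._store[key] = response
        return response


def fake_model(prompt):
    """Hypothetical stand-in for a real (slow, billed) inference call."""
    return f"answer to: {prompt}"


cache = ResponseCache(fake_model)
cache.get("What is inference?")
cache.get("  what is   INFERENCE? ")  # served from the cache, no second model call
print(cache.hits, cache.misses)
```

A cache like this only helps when prompts genuinely repeat and the answers are not personalised; a production version would also need expiry and size limits.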
Quick take
Inference is what happens every time an AI generates a response — and understanding it helps you design better AI features and have more informed conversations about performance and cost.