Gartner Inc. projects that running inference on a large language model (LLM) with 1 trillion parameters will cost GenAI providers over 90% less by 2030 compared with 2025, driven by improvements in chips, infrastructure, and model design.

The research and consulting firm defines inference as the process by which AI models generate responses after processing inputs, often measured in “tokens,” which represent chunks of data. In Gartner's analysis, one token is roughly 3.5 bytes, or about four characters.
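The characters-per-token figure gives a quick back-of-the-envelope way to size a workload. A minimal sketch, assuming Gartner's rough four-characters-per-token rule (real tokenizers vary by model, and the function name here is illustrative):

```python
# Rough token estimate from text length, using the article's rule of thumb
# of about four characters per token. Actual tokenizer output will differ.
def estimate_tokens(text: str) -> int:
    # Round to the nearest whole token; treat any non-empty input as >= 1 token.
    return max(1, round(len(text) / 4))

prompt = "Explain the projected decline in LLM inference costs."
print(estimate_tokens(prompt), "tokens (approx.)")
```

At Gartner's stated 3.5 bytes per token, the same estimate can be expressed in bytes rather than characters; for plain ASCII text the two are nearly interchangeable.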

Gartner said several factors will contribute to the steep decline in costs, including more efficient semiconductors, higher chip utilization, model design innovations, and the growing use of hardware designed specifically for AI inference workloads. Edge devices, which process data closer to where it is generated, are also expected to play a role in certain use cases.

“These cost improvements will be driven by a combination of semiconductor and infrastructure efficiency improvements, model design innovations, higher chip utilization, increased use of inference-specialized silicon, and application of edge devices for specific use cases,” said Will Sommer, senior director analyst at Gartner.

Gartner also forecasts that by 2030, LLMs could be up to 100 times more cost-efficient than models of similar size introduced in 2022.

The report outlines two scenarios used in its projections. In “frontier” scenarios, models run on cutting-edge chips, while “legacy blend” scenarios use a mix of available semiconductor technologies. Costs in the legacy blend setup are significantly higher because the older chips deliver lower computing performance than advanced ones.

Despite falling costs per token, Gartner noted that savings may not fully translate to lower prices for enterprise customers. Advanced AI systems are expected to use far more tokens per task than current tools. For example, agentic AI models, which can perform complex multi-step tasks, may require 5 to 30 times more tokens than standard chatbot interactions.

This means overall usage could rise faster than per-token costs decline, potentially increasing total spending on AI inference even as unit costs fall.

“Chief Product Officers (CPOs) should not confuse the deflation of commodity tokens with the democratization of frontier reasoning,” Sommer said. “As commoditized intelligence trends toward near-zero cost, the compute and systems needed to support advanced reasoning remain scarce. CPOs who mask architectural inefficiencies with cheap tokens today will find agentic scale elusive tomorrow.”

Gartner added that future value will likely shift toward platforms that can manage and distribute workloads across different types of AI models. Routine tasks are expected to run on smaller, more efficient models, while complex reasoning will rely on more powerful systems that are used selectively because of their higher cost.
