Jensen Huang's GTC Interview: Low-latency inference will become the next explosive driver of the AI economy, and the tight supply-demand balance in power and chips will persist over the long term.

AI is moving from “generating information” to “performing tasks.” Low-latency, high-throughput inference scenarios, with coding agents as the leading example, are opening the next major phase of AI infrastructure commercialization. On the supply side, power, chips, and data center construction all lack redundancy, and a tight balance may become the industry’s long-term norm.

After his GTC 2026 keynote, NVIDIA CEO Jensen Huang sat down with Stratechery founder Ben Thompson and shared his views on core topics including the AI inference economy, CPU strategy, the logic behind acquiring Groq, and supply chain tensions.

Huang pointed out that AI has crossed a critical threshold in the past year: improved reasoning capabilities have, for the first time, enabled models to generate real economic value, and the explosion of coding agents is the clearest sign of this shift. NVIDIA has officially added ultra-fast, low-latency inference to its product lineup.

On the supply side, Huang was frank that “almost every link is tight”: whether in power or in chip supply, doubling capacity would be difficult. Although NVIDIA says its supply chain is planned out for “this year and next,” he hopes that land, electricity, and data centers can be brought online faster, since their pace directly determines how quickly compute expansion and capital expenditure can be realized.

Inference Economy: Low Latency Becomes the Next Monetization Engine

Huang attributes the core breakthrough in AI over the past year to the maturing of reasoning capabilities. He said that early generative AI was hard to commercialize because of hallucinations, but reasoning lets models “ground” themselves through reflection, retrieval, and search, moving them from merely providing information to actually completing tasks.

“Search is a service that no one pays for because the barrier to access information is too low to justify payment,” Huang explained. “We have now crossed that threshold—AI can not only converse with people but also do things for them.”

Programming is one of his most typical examples. He pointed out that code generation is not just another language modality: it requires the model to reflect on, verify, and execute whole blocks of code. As this capability matures, engineers can shift their focus from line-by-line coding to architecture and specification design.
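
To make this concrete, here is a minimal, invented sketch of a generate-execute-verify loop. It is not NVIDIA’s or any vendor’s actual agent; `generate` below is a placeholder for a model call. The point is only that the generated block is actually run, and failures are fed back for another attempt.

```python
# Minimal illustrative sketch of a "reflect, verify, execute" coding loop.
# Invented for illustration only; `generate` stands in for a model call.
import subprocess
import tempfile


def run_candidate(code: str) -> tuple[bool, str]:
    """Execute a generated code block and report whether it ran cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
    return proc.returncode == 0, proc.stdout + proc.stderr


def coding_agent(task: str, generate, max_rounds: int = 3) -> str:
    """Ask the model for whole blocks of code until one executes successfully."""
    feedback = ""
    for _ in range(max_rounds):
        code = generate(task, feedback)   # the model writes a complete block
        ok, output = run_candidate(code)  # the block is actually executed
        if ok:
            return code                   # grounded: the result was verified
        feedback = output                 # reflect on the failure and retry
    raise RuntimeError("no passing candidate within the round budget")
```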

He revealed that 100% of NVIDIA’s internal software engineers are now using coding agents. “Many haven’t written a line of code manually for a while, but their productivity is extremely high.”

Based on this judgment, NVIDIA has decided to add low-latency inference capabilities to its product lines. Huang explained that existing GPU systems inherently balance maximizing throughput against maximizing the quality of intelligent tokens, and that high-value coding agent users are willing to pay a premium for a 10x increase in token generation speed.

“If Anthropic launches a Claude Code service layer that boosts coding speed by 10 times, I would pay for it—no doubt. I am building this product for myself.”
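
The trade-off Huang describes can be illustrated with a toy batching model (not from the interview; the numbers and the linear cost assumption are invented): larger batches raise the total tokens per second an operator can sell, while lowering the token speed each individual user experiences, which is why a dedicated low-latency tier can command a premium.

```python
# Toy model of batched LLM decoding: all constants are invented for illustration.
def serving_profile(batch_size: int,
                    base_step_ms: float = 20.0,
                    per_request_overhead_ms: float = 1.5) -> tuple[float, float]:
    """Return (per-user tokens/sec, total tokens/sec) for one decode configuration."""
    step_ms = base_step_ms + per_request_overhead_ms * batch_size
    per_user = 1000.0 / step_ms          # speed one user experiences
    total = per_user * batch_size        # aggregate throughput the operator sells
    return per_user, total


for bs in (1, 8, 32, 128):
    per_user, total = serving_profile(bs)
    print(f"batch={bs:4d}  per-user={per_user:6.1f} tok/s  total={total:8.1f} tok/s")
```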

Acquiring Groq: A Strategic Move to Decompose the Inference Pipeline

Huang sees NVIDIA’s acquisition of Groq not as a sudden move but as a natural extension of its long-term positioning in inference infrastructure.

He said that when NVIDIA released the Dynamo inference scheduling framework a year ago, it was already thinking about how to decompose the inference process at a finer granularity across heterogeneous infrastructure. The collaboration with Groq began about six months before the acquisition was announced. The core of the deal is bringing in Groq’s team and licensing its technology, not taking over its cloud service business.

Technologically, NVIDIA plans to extend the inference pipeline decomposition into the decoding stage, with Vera Rubin GPUs handling high-FLOP attention calculations, and Groq’s LPU architecture taking on parts requiring extremely high token rates and very low latency. Related products are expected to launch within this year.
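
A rough sketch of what such stage-level routing could look like is shown below. The class names, methods, and routing policy are invented for illustration and do not represent Dynamo’s or NVIDIA’s actual interfaces; the point is only that FLOP-heavy prefill can run on one accelerator while latency-critical decode runs on another.

```python
# Hypothetical sketch of disaggregating inference stages across heterogeneous
# accelerators. Invented names; not a real Dynamo or NVIDIA API.
from dataclasses import dataclass


@dataclass
class Request:
    prompt: str
    latency_critical: bool  # e.g. an interactive coding-agent session


class GpuPool:
    """Stands in for Vera Rubin GPUs: FLOP-heavy prefill and attention work."""

    def prefill(self, prompt: str) -> dict:
        # Build the KV cache for the whole prompt (compute-bound, batch-friendly).
        return {"kv": f"kv-cache({len(prompt)} chars)"}

    def decode(self, state: dict) -> str:
        return "tokens decoded in large batches (throughput-optimized)"


class LpuPool:
    """Stands in for Groq-style LPUs: very high token rate, very low latency."""

    def decode(self, state: dict) -> str:
        return "tokens streamed at a very high rate (latency-optimized)"


class Scheduler:
    """Routes each stage of a request to the accelerator best suited to it."""

    def __init__(self) -> None:
        self.gpus, self.lpus = GpuPool(), LpuPool()

    def run(self, req: Request) -> str:
        state = self.gpus.prefill(req.prompt)        # attention stays on GPUs
        pool = self.lpus if req.latency_critical else self.gpus
        return pool.decode(state)                    # decode goes where it fits best


print(Scheduler().run(Request("refactor this module", latency_critical=True)))
```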

He also acknowledged that this solution is not suitable for every customer. For platforms that mainly serve free users with low paid-conversion rates, introducing Groq would add cost and complexity that may not be worthwhile. He added:

“But if your business is similar to Anthropic or OpenAI, and Codex is generating real economic value, and you want to produce more tokens, then adding this accelerator can significantly increase revenue.”

Huang compared Groq to NVIDIA’s previous acquisition of Mellanox—both represent NVIDIA’s consistent logic of integrating external dedicated architectures into its computing stack to achieve system-level optimization. “NVIDIA is an accelerated computing company, not just a GPU company. We don’t care where the computation happens; we just want to accelerate applications.”

CPU Strategy: Redefining Server Architecture for the AI Agent Era

NVIDIA has long been perceived as a GPU company, but Huang systematically explained its logic for entering the CPU market and the design philosophy behind its self-developed Vera CPU.

He pointed out that over the past decade, CPU design has been optimized for large-scale cloud computing: maximizing the number of rentable cores, with single-thread performance a low priority. In AI agent scenarios, however, single-thread CPU performance while handling and waiting on tool calls directly impacts overall system efficiency. “You can never let the GPU sit idle,” he said.
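
Some back-of-the-envelope arithmetic (not from the interview; the numbers are invented) shows why this matters: if CPU-side tool handling runs serially with GPU work rather than being overlapped, even modest per-step CPU latency leaves the GPU idle a large fraction of the time.

```python
# Toy arithmetic: GPU utilization when GPU decode and CPU tool handling alternate
# without overlapping. All numbers are invented for illustration.
def gpu_utilization(gpu_ms: float, cpu_tool_ms: float) -> float:
    """Fraction of wall-clock time the GPU is busy in one agent step."""
    return gpu_ms / (gpu_ms + cpu_tool_ms)


for cpu_ms in (5, 20, 80):
    busy = gpu_utilization(gpu_ms=100, cpu_tool_ms=cpu_ms)
    print(f"CPU tool handling {cpu_ms:3d} ms per step -> GPU busy {busy:.0%} of the time")
```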

Vera CPU’s key differentiator is memory and I/O bandwidth: each CPU core has three times the bandwidth of current CPUs, designed to prevent I/O bottlenecks from dragging down GPU performance. He also mentioned collaboration with Intel on NVLink to meet enterprise computing market needs for x86 ecosystem continuity.

Huang divides AI tool usage into two categories: structured tools, including CLI, API, and database queries; and unstructured tools, such as web browsing via multimodal perception applications on PCs. NVIDIA is actively developing in both areas.
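
As a hypothetical illustration of the “structured” category (the schema and names below are invented, not any specific vendor’s tool format), a structured tool exposes a typed, declared interface the agent can call directly, whereas an unstructured tool such as browsing has no fixed schema and relies on perception of the screen.

```python
# Hypothetical structured-tool declaration (invented schema for illustration):
# the agent calls a named function with typed parameters, e.g. a database query.
run_sql = {
    "name": "run_sql",
    "description": "Execute a read-only SQL query against an analytics database.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "A single SELECT statement."}
        },
        "required": ["query"],
    },
}

# An unstructured tool, by contrast, would be something like driving a web browser
# through multimodal perception of the rendered page, with no fixed call schema.
```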

Supply Tightness: Power and Chip Capacity Both Under Strain

Addressing ongoing concerns about AI compute supply, Huang gave the most direct assessment to date: power and chip capacity are both in tight balance, with no room for doubling in the short term.

“I don’t think we have twice the power, nor twice the chip supply; there’s no double redundancy in any area,” he said. “But based on what I see, our supply chain can support it.”

He said NVIDIA has about 200 long-term supply chain partners and has planned capacity up and down the chain accordingly, and that he is optimistic about large-scale growth over the next one to two years.

However, he admitted that the biggest bottleneck may not be the chips themselves but how quickly land, power, and data center buildings can be brought online. “What I hope most is that these infrastructures can be completed faster.”

When asked whether NVIDIA is the biggest beneficiary of compute scarcity, Huang acknowledged that the company is the largest and most prepared, attributing this to long-term planning rather than market luck.

