Jensen Huang's Token Economics

How AI Token Economics Will Change Data Center Profit Models

Reporter Zheng Chenye

The NVIDIA GTC conference, known as the annual industry benchmark for AI, was held this year from March 16 to 19 in San Jose, California.

At 11 a.m. local time on March 16, which is 2 a.m. Beijing time on March 17, NVIDIA CEO Jensen Huang delivered a keynote speech lasting over two hours at the San Jose SAP Center.

In his speech, Huang predicted that global demand for AI infrastructure will reach $1 trillion by 2027. He also mentioned that actual demand could be much higher, and NVIDIA’s products might even be in short supply.

After this figure was announced, NVIDIA's US stock price surged by over 4% instantly. However, a few hours later, when the A-share market opened, stocks in the computing industry chain collectively declined, with Tianfu Communication (300394.SZ) closing down over 10% and Changguang Huaxin (688048.SH) down 9.72%. Most leading stocks gave back nearly five days' worth of gains.

On one side is the trillion-dollar expectation; on the other, a sharp decline in industry chain stocks. The difference stems from the time scale.

Huang was talking about future demand expectations, but the next-generation Feynman chip architecture he previewed won't be available until 2028. Additionally, a March 16 report from Wanjia Securities pointed out that the average P/E ratio of the A-share electronics sector was about 82 times as of March 15, suggesting the market worries that valuations have climbed uncomfortably high.

What’s more noteworthy about Huang’s speech isn’t the trillion-dollar figure itself, but the fact that he spent two hours presenting a new business logic: data centers are shifting from being places for training models to factories for producing Tokens.

Token Factory

Tokens are the basic units of information processing in large language models and can be roughly understood as fragments of text generated or processed by AI. One Chinese character typically corresponds to one or two Tokens.

In the past two years, Token consumption has experienced several jumps in scale.

Huang traced this development to three key moments: the launch of ChatGPT at the end of 2022, which enabled AI to generate content and sharply increased Token consumption; the emergence of OpenAI's o1 model, which taught AI reasoning and reflection, requiring it to internally generate large amounts of Tokens for self-assessment; and the release of Claude Code (an AI programming tool developed by Anthropic), which can read files, write code, and run tests, with each task consuming many times more Tokens than a simple conversation.

Huang mentioned that all NVIDIA software engineers are now using AI to assist with programming.

AI work involves two stages: training, which makes the model smarter and requires a large upfront investment; and inference, which is the model performing tasks daily, with increasing demand. Previously, global GPU (graphics processing unit, the core hardware for AI computation) purchases were mainly for training, but now the focus is shifting toward inference.

Huang said that the business scale of inference service providers has grown 100 times in the past year. IDC China analyst Du Yunlong also told Caijing that domestically, inference servers are now growing faster than training servers, and their market share has exceeded 60% based on server shipment value.

While inference demand is exploding, Token pricing has not yet formed a market pricing system.

Huang outlined five future pricing tiers: a free tier with high output but slow response; a mid-tier at about $3 per million Tokens; a premium tier at about $6 per million Tokens; a high-speed tier at about $45 per million Tokens; and a top-tier at about $150 per million Tokens. Larger models, longer contexts, and faster responses make Tokens more expensive.

He gave an example of the top tier: a research team using 50 million Tokens daily would spend only about $7,500 at $150 per million Tokens, which is not significant for enterprises. When the context window expands from 32K to 400K Tokens, AI can read an entire contract or codebase at once, enabling tasks previously impossible, at a correspondingly higher price.
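The arithmetic behind that figure can be checked in a few lines (a sketch; the price and usage numbers are those quoted in the speech, the annual extrapolation is our own):

```python
# Daily Token spend at the top tier, as quoted in the keynote.
PRICE_PER_MILLION = 150.0   # top-tier price, USD per million Tokens
DAILY_TOKENS = 50_000_000   # Tokens consumed by the research team per day

daily_cost = DAILY_TOKENS / 1_000_000 * PRICE_PER_MILLION
print(f"Daily cost: ${daily_cost:,.0f}")          # Daily cost: $7,500
print(f"Annual cost: ${daily_cost * 365:,.0f}")   # Annual cost: $2,737,500
```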

With tiered pricing, the economic model of data centers changes.

Huang explained that every data center is limited by power. A 1GW (gigawatt) data center will never become 2GW, as dictated by electricity and land constraints. Under fixed power, the key is to maximize Tokens produced per watt, at the lowest possible production cost. In other words, with the same amount of electricity, the more Tokens generated, the more revenue earned.

He presented a set of figures: a 1GW data center, allocating compute power across different price tiers, could generate annual revenue of about $30 billion with NVIDIA’s current Blackwell architecture, about $150 billion with the new Vera Rubin architecture, and up to $300 billion with Groq’s LPU inference accelerators. Switching equipment in the same data center could result in a tenfold difference in revenue.
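Since power is the fixed constraint, the tenfold spread is easiest to see as revenue per watt (a sketch; only the three annual-revenue figures come from the speech, the per-watt division is our own illustration):

```python
# Revenue per watt implied by the keynote figures, for a fixed 1GW data center.
POWER_WATTS = 1_000_000_000  # 1GW of total power

annual_revenue = {           # USD per year, as quoted in the speech
    "Blackwell":  30e9,
    "Vera Rubin": 150e9,
    "Groq LPU":   300e9,
}

for arch, rev in annual_revenue.items():
    # Same electricity bill, very different output value per watt.
    print(f"{arch:>10}: ${rev / POWER_WATTS:.0f} per watt per year")
```

At the same power draw, the implied revenue rises from $30 to $300 per watt per year, which is the tenfold difference Huang described.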

NVIDIA’s full-year revenue for fiscal 2026 was $215.9 billion, with data center business contributing $193.7 billion.

According to Huang’s logic, existing data centers are underutilized; upgrading to new-generation equipment under the same power conditions could multiply revenue several times. The trillion-dollar expectation isn’t due to chip price increases but because the same electricity can produce more and higher-value Tokens.

Huang said that in the future, every CEO will focus on the efficiency of their Token factory because that directly correlates with revenue.

He also described a change happening in Silicon Valley: more engineers are now using AI daily for coding, research, and document processing—all of which consume Tokens. Companies will need to pay for employees’ AI usage.

Huang predicted this expense will become so large that it will require separate budgeting, just like providing employees with computers and software.

He further stated that each engineer will receive an annual Token budget upon hiring, roughly half of their base salary.
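To make the claim concrete: at the mid-tier price, a half-salary Token budget buys an enormous volume of Tokens (a sketch; the salary figure is a hypothetical assumption, only the half-of-salary rule and the $3 mid-tier price come from the speech):

```python
# Illustration of the "Token budget" claim. The salary is a HYPOTHETICAL
# example; the half-of-salary ratio and mid-tier price are from the keynote.
base_salary = 200_000            # USD/year -- hypothetical assumption
token_budget = base_salary / 2   # "roughly half of their base salary"
mid_tier_price = 3.0             # USD per million Tokens (mid tier)

tokens_per_year = token_budget / mid_tier_price * 1_000_000
print(f"${token_budget:,.0f} buys about {tokens_per_year / 1e9:.1f} billion Tokens/year")
```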

Two Types of Chips

The hardware corresponding to Huang’s Token economics is the Vera Rubin platform, officially announced at GTC.

Huang said that in the past, when discussing the Hopper architecture, he would hold up a chip, but Vera Rubin isn’t just a chip; it’s an entire system. This system achieves 100% liquid cooling, reducing installation time from two days to two hours.

Vera Rubin consists of seven chips. The core rack NVL72 integrates 72 Rubin GPUs and 36 Vera CPUs, connected via NVLink 6 (NVIDIA’s high-speed interconnect technology). Compared to the previous Blackwell generation, inference throughput per watt has increased tenfold, and the cost per Token has been reduced to one-tenth.

NVIDIA also released a new 88-core Vera CPU, optimized for AI agent scenarios involving tool invocation and data processing.

Huang mentioned that Microsoft CEO Satya Nadella has confirmed that the first Vera Rubin racks are already running on Azure.

However, Vera Rubin has a shortcoming: when each user needs to generate more than 400 Tokens per second, NVL72’s bandwidth becomes insufficient. To address this, NVIDIA acquired the technology and core team of Groq, a US-based AI accelerator chip company founded in 2016.

Groq’s LPU (Language Processing Unit) and GPU are entirely different chips. GPUs have large memory and high computing power; a single Rubin GPU has 288GB of memory, suitable for complex calculations. LPUs have small but extremely fast read/write speeds, with only 500MB of storage—unable to hold full model parameters but capable of generating Tokens faster with lower latency than GPUs.

NVIDIA uses a software called Dynamo to split inference into two steps: context understanding, which requires massive compute and memory and is handled by Vera Rubin; and Token generation, which is latency-sensitive and handled by Groq’s LPU. These two chips are connected via high-speed Ethernet, working collaboratively to reduce latency by about half.

Huang calls this approach “decoupled inference,” meaning splitting inference tasks across different chips. The core idea is that high throughput and low latency are inherently contradictory, so it’s better to let each chip do what it’s best at.

He said this combination achieves performance improvements of 35 times over the previous generation at the high price tiers of $45 and $150 per million Tokens.

Looking at a longer time horizon, the same 1GW data center could increase Token generation from 22 million per second to 700 million per second within two years.

Huang advised clients that if their work mainly involves high-throughput batch inference, they should fully adopt Vera Rubin; if they have large programming or real-time interaction needs, they can allocate about 25% of their data center compute power to Groq’s LPU.

He said that the Groq 3 LPU is being mass-produced by Samsung and is expected to ship in Q3 this year.

On the software side, NVIDIA launched the enterprise AI agent platform NemoClaw, supporting the popular open-source project OpenClaw. OpenClaw has become the fastest-growing open-source project on GitHub in recent weeks. Huang compared its importance to Linux, calling it the operating system for intelligent agent computing.

However, deploying open-source OpenClaw directly in enterprise environments poses security risks, as agents can access sensitive data, execute code, and communicate externally. NemoClaw adds an enterprise security layer to OpenClaw. Seventeen companies, including Adobe, Salesforce, and SAP, have announced adoption of NVIDIA’s Agent Toolkit for developing intelligent agents.

Regarding the roadmap, NVIDIA previewed the next-generation Feynman architecture, scheduled for release in 2028, which will support both copper cabling and CPO (co-packaged optics) interconnects.

This year also marks the 20th anniversary of CUDA, NVIDIA’s GPU computing platform, which is considered the foundation of NVIDIA’s software ecosystem. Huang mentioned that currently, 60% of NVIDIA’s business comes from the top five global cloud providers, with the remaining 40% spread across sovereign AI, enterprise, industrial, and robotics sectors.

At this GTC, NVIDIA also announced collaborations with Uber, BYD, Geely, Hyundai, Nissan, and Isuzu in autonomous driving. Driven by this news, the Hong Kong auto sector rallied on the 17th, with Geely (00175.HK) rising over 5% intraday and closing up 4.55%.
