First evaluation of its kind released: AI mostly makes code "worse" when modifying it. So programmers don't need to worry about their jobs?
In recent years, the programming capabilities of large AI models have advanced rapidly, with major AI companies continuously breaking records in programming benchmark tests. This has caused many programmers to worry: Is AI about to take our jobs?
However, a recent joint study by Sun Yat-sen University and Alibaba provides programmers with some reassurance.
On March 4th, the two organizations released their evaluation results. The benchmark, called “SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration,” is the first systematic assessment of long-term code maintenance ability, covering 18 AI models from eight leading companies, including Anthropic, OpenAI, Kimi, and DeepSeek.
The test included 100 tasks, with over 10 billion tokens consumed. The results show that the Claude Opus series leads overall performance.
In terms of controlling performance regression, most models, including Qwen, DeepSeek, MiniMax, Kimi, and Doubao, performed poorly. In other words, during long-term maintenance, AI may make code “worse the more it changes it.”
China’s team launches the world’s first evaluation system for AI models’ long-term code maintenance ability
For a long time, mainstream benchmarks for AI programming skills have been snapshot-based, focusing on “receiving a request once and producing a solution once.”
However, such evaluations only test whether a large model can produce correct code once; they fail to reflect the core needs of real software development: continuous iteration and long-term maintenance.
In reality, mature software is rarely built in a single step; it is the result of long-term maintenance. Lehman’s laws of software evolution observe that software quality naturally declines over time unless actively maintained, and maintenance accounts for 60% to 80% of total software lifecycle cost.
To evaluate AI performance in long-term code maintenance, Sun Yat-sen University and Alibaba jointly developed the SWE-CI benchmark. This is the world’s first evaluation system specifically designed to assess AI agents’ performance in long-term code maintenance. It no longer just examines whether AI can produce “correct once” code but evaluates whether AI can maintain code quality consistently over months or even years, like a real software engineer.
The SWE-CI benchmark was constructed through four layers of strict filtering, resulting in a high-quality evaluation set.
First, the team selected 4,923 repositories from GitHub with over three years of maintenance, more than 500 stars, dependency files, complete unit test suites, and permissive licenses like MIT or Apache 2.0. Then, they extracted 8,311 candidate samples with stable dependencies and over 1,000 lines of code changes per commit. Using automated Docker environment setup and self-healing dependency mechanisms, they retained 1,458 runnable pairs. Finally, through startup validation, pass rate differences, and sorting by time span and commits, they finalized 100 tasks.
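As a rough illustration, the first filtering layer described above can be sketched as a simple predicate over repository metadata. The field names, thresholds, and `Repo` type below are assumptions for illustration only, not the team's actual pipeline code:

```python
from dataclasses import dataclass

@dataclass
class Repo:
    stars: int
    years_maintained: float
    license: str
    has_dependency_file: bool
    has_unit_tests: bool

# Permissive licenses named in the selection criteria
PERMISSIVE = {"MIT", "Apache-2.0"}

def passes_first_layer(r: Repo) -> bool:
    """Mirror the stated first-layer criteria: >3 years of maintenance,
    >500 stars, dependency files, unit tests, and a permissive license."""
    return (r.stars > 500
            and r.years_maintained > 3
            and r.license in PERMISSIVE
            and r.has_dependency_file
            and r.has_unit_tests)

candidates = [
    Repo(stars=1200, years_maintained=5.0, license="MIT",
         has_dependency_file=True, has_unit_tests=True),
    Repo(stars=300, years_maintained=4.0, license="MIT",
         has_dependency_file=True, has_unit_tests=True),  # too few stars
]
kept = [r for r in candidates if passes_first_layer(r)]
print(len(kept))  # 1
```

The later layers (Docker environment setup, self-healing dependencies, startup validation) would add further filters on top of a predicate like this.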
Each of these 100 tasks corresponds to the complete evolution of a real-world software project, averaging 233 days of development and 71 consecutive code commits. The team also designed an “architect-programmer” dual-agent collaboration mechanism inspired by common roles on real software teams: the architect agent analyzes requirements and plans the technical solution, while the programmer agent handles the actual coding.
To adapt to long-term iteration evaluation, SWE-CI introduces two core metrics: “Normalized Change” and “EvoScore.”
“Normalized Change” maps code states to the [-1, 1] range based on test case pass counts, with positive indicating functionality improvement and negative indicating degradation.
EvoScore, in turn, emphasizes how well a model performs on subsequent modification tasks over the course of the project’s evolution.
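The article does not give the exact formula for Normalized Change, only that it maps code states to [-1, 1] based on test-case pass counts. A minimal sketch, assuming the score is the signed fraction of tests whose pass status changed (a hypothetical formula, not the benchmark's actual definition):

```python
def normalized_change(passed_before: int, passed_after: int, total: int) -> float:
    """Map a code change to [-1, 1] from unit-test pass counts.

    Hypothetical formula: signed fraction of the test suite gained or lost,
    clipped to [-1, 1]. Positive = functionality improved, negative = degraded.
    """
    if total == 0:
        return 0.0
    score = (passed_after - passed_before) / total
    return max(-1.0, min(1.0, score))

print(normalized_change(60, 90, 100))  # 0.3  -> functionality improved
print(normalized_change(90, 60, 100))  # -0.3 -> functionality degraded
```

Under any such mapping, a model that keeps the full suite passing scores at or above zero on every step, while a model that breaks previously passing tests drifts negative.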
Test results: Claude Opus leads by a large margin; most models break existing code in over 75% of tasks
The team systematically tested 18 mainstream AI models from eight companies: Moonshot AI (Kimi), Anthropic, Zhipu, Qwen, MiniMax, DeepSeek, OpenAI, and Doubao, consuming over 10 billion tokens. An experiment of this scale is unprecedented in AI programming evaluation.
Results show that, over time, AI models’ code maintenance abilities have improved at an accelerating pace.
From the chart, it can be seen that newer versions of models from the same company generally outperform earlier ones, with a notable leap after 2026, and higher EvoScores. This indicates that current large models are evolving from static bug fixing toward continuous, long-term code maintenance.
Among all models tested, the Claude Opus series performs the best, with EvoScores rising to about 0.9 from Claude-opus-4.5 to Claude-opus-4.6, clearly surpassing competitors.
Among Chinese models, Zhipu’s GLM series has made significant progress and is now the most competitive in the second tier, followed by Qwen and MiniMax, both trending upward overall. Kimi and Doubao have improved but without a clear breakthrough.
The study also found clear differences in training strategies among vendors.
Specifically, MiniMax, DeepSeek, and OpenAI’s GPT series models favor long-term benefits, showing advantages in long-term code maintenance tasks. This suggests that these models tend to generate code strategies conducive to long-term evolution and stability rather than short-term fixes.
In contrast, Kimi and Zhipu’s GLM series lean toward short-term optimization.
Meanwhile, models like Qwen, Doubao, and the Claude series strike a balance between short-term effects and long-term maintainability.
Another key finding is that, in long-term code maintenance, all models perform poorly in controlling performance regression.
Regression is a core indicator of software quality stability. If a unit test passes before code update but fails afterward, it indicates a regression. Once regression occurs, it not only impacts user experience but, over many modifications, can lead to systemic quality decline.
The team measured the “Zero Regression Rate”—the proportion of tasks with no functionality breakage throughout maintenance. A higher zero regression rate indicates more stable systems.
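The regression definition and the Zero Regression Rate above can be expressed as a toy computation. This is an illustrative sketch of the stated definitions, not the benchmark's implementation; the test-result dictionaries are made-up data:

```python
def has_regression(before: dict[str, bool], after: dict[str, bool]) -> bool:
    """A regression: a unit test that passed before the change fails after it."""
    return any(before[t] and not after.get(t, False) for t in before)

def zero_regression_rate(tasks: list[tuple[dict, dict]]) -> float:
    """Fraction of maintenance tasks completed without breaking any passing test."""
    clean = sum(1 for before, after in tasks if not has_regression(before, after))
    return clean / len(tasks)

# Hypothetical (before, after) test results for four maintenance tasks
tasks = [
    ({"t1": True, "t2": True}, {"t1": True, "t2": True}),   # no change
    ({"t1": True, "t2": True}, {"t1": True, "t2": False}),  # t2 regressed
    ({"t1": True, "t2": False}, {"t1": True, "t2": True}),  # improvement only
    ({"t1": True}, {"t1": True}),                           # no change
]
print(zero_regression_rate(tasks))  # 0.75
```

Note that newly fixed tests (task three) do not offset a regression elsewhere: one broken previously-passing test is enough to disqualify a task from the "zero regression" count.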
Results show that, among all 18 models tested, only Anthropic’s Claude Opus maintained over 50% zero regression rate, with most models below 25%.
Specifically, Claude-opus-4.6 leads with a 76% zero regression rate, meaning it maintains stability in most scenarios. Claude-opus-4.5 follows with 51%. In comparison, Kimi-K2.5 (37%) and GLM-5 (36%) are in the second tier, showing some stability but still lagging behind top models.
The remaining 14 models, including GPT-5.2, Qwen3.5-plus, MiniMax-M2.5, and DeepSeek-V3.2, have zero regression rates below 25%, indicating that in over 75% of long-term maintenance tasks, they tend to break original code functions, causing performance degradation.
From a version iteration perspective, top vendors’ models are improving rapidly. For example, Claude-opus’ zero regression rate increased from 51% in version 4.5 to 76% in 4.6; Zhipu’s GLM series jumped from 14% in GLM-4.6 and GLM-4.7 to 36% in GLM-5.
However, even with these improvements, most models still struggle to eliminate performance regression in long-term maintenance, indicating a significant gap from reliable automated long-term development.
The release of the SWE-CI benchmark results has made the industry realize that “writing code” and “maintaining code” are two fundamentally different skills. For model developers, continuously improving maintainability, controlling performance regression, and strengthening architecture design may be the key to winning the next phase of competition.
(Disclaimer: The content and data in this article are for reference only and do not constitute investment advice. Verify before use. Use at your own risk.)
Reporter | Lansou Ying, Chang Songzhen (Intern)
Editors | He Xiaotao, Wang Jiaqing, Du Hengfeng
Proofreading | Duan Lian
Daily Economic News (nbdnews) | Original article
Unauthorized reproduction, excerpting, copying, or mirroring is prohibited.