Anthropic’s latest release marks a bold turn for Claude coding. In enterprise and internal trials, Claude Sonnet 4.5 sustained autonomous coding for up to 30 hours, a leap from the seven-hour limit reported earlier in the product line [1]. Reuters also noted the model’s OS-task benchmark score rising to 60% from 40%, underscoring measurable progress in long-horizon reliability [1]. A separate report highlighted a single 30-hour run that produced an 11,000-line web application, framing the jump in terms of practical engineering output [2].
The new model is rolling out simultaneously across major enterprise stacks. Amazon’s Bedrock added Claude Sonnet 4.5 with cross-region inference, external memory, and context editing for long-running agents [4]. Google Cloud’s Vertex AI also onboarded the model, emphasizing agent orchestration and multistep, continuous automation across cybersecurity, finance, and code-heavy workflows [5].
Key Takeaways
– Shows Claude coding autonomously for 30 hours, up from a seven-hour limit, and delivering an 11,000-line web app in real-world trials.
– Reveals OS-task benchmark scores rising to 60% from 40%, signaling a 20-point improvement in long-horizon problem solving and reliability.
– Demonstrates enterprise readiness via AWS Bedrock and Vertex AI availability, with cross-region inference and external memory enabling long-running, multi-step agents.
– Indicates full-stack execution: databases stood up, domains purchased, and SOC 2 audit tasks completed during trials, supported by checkpoints and multi-agent tooling.
– Suggests near-term impact in cybersecurity and finance as clients report 30-hour autonomous runs, better memory, and orchestrated agents for continuous automation.
How Claude coding leapt from 7 to 30 hours
Anthropic disclosed that Claude Sonnet 4.5 sustained autonomous coding for about 30 hours in internal and client scenarios, a step-change beyond earlier seven-hour ceilings [1]. The Verge reported a 30-hour session that produced roughly 11,000 lines of code for a working web application, illustrating not just duration but throughput in a tangible software deliverable [2]. TechCrunch corroborated enterprise trials that reached the 30-hour mark, positioning the update as a practical advance in long-horizon, unattended engineering [3].
That increase matters because software projects often demand multi-hour chains of build, test, debug, and deploy. The durability to run for 30 hours suggests fewer human handoffs and reduced context loss over long cycles. It also aligns with Anthropic’s pitch for agentic workflows that can progress from planning to execution with minimal oversight, particularly in regulated or time-sensitive environments [2].
Inside the stack: memory, agents, and checkpoints
Several technical additions help explain the duration gains. Anthropic highlighted multi-agent support, virtual machines, and improved memory as core enablers for long-horizon autonomy and orchestration, where one agent can delegate tasks to another or persist state across steps [2]. AWS emphasized context editing and external memory, letting long-running agents retrieve or update state without losing track of objectives during extended sessions. Global cross-region inference provides resiliency and lower-latency access as workloads scale [4].
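Anthropic has not published the internals of these memory and context-editing features, but the pattern is straightforward to sketch. The Python snippet below is a minimal illustration, assuming a file-backed store stands in for whatever external memory service a real deployment would use; the function names (`load_memory`, `edit_context`) and the JSON schema are hypothetical, not vendor APIs.

```python
import json
from pathlib import Path

# Hypothetical external memory: a JSON file stands in for whatever
# store (S3, a database, a managed memory service) a real agent would use.
MEMORY_PATH = Path("agent_memory.json")

def load_memory() -> dict:
    """Fetch persisted state so a resumed agent keeps its objective."""
    if MEMORY_PATH.exists():
        return json.loads(MEMORY_PATH.read_text())
    return {"objective": None, "completed_steps": []}

def save_memory(memory: dict) -> None:
    """Write state back after every step, not just at the end of the run."""
    MEMORY_PATH.write_text(json.dumps(memory, indent=2))

def edit_context(messages: list[dict], max_messages: int = 40) -> list[dict]:
    """Crude stand-in for context editing: keep the first message (the
    objective) and the most recent turns, dropping the stale middle."""
    if len(messages) <= max_messages:
        return messages
    return [messages[0]] + messages[-(max_messages - 1):]

memory = load_memory()
memory["objective"] = memory["objective"] or "Build and test the billing service"
messages = [{"role": "user", "content": memory["objective"]}]

# In a real loop each iteration would call the model and run tools;
# persisting after every step is what lets a 30-hour run survive restarts.
for step in ("scaffold project", "write tests", "run test suite"):
    memory["completed_steps"].append(step)
    messages.append({"role": "assistant", "content": f"Done: {step}"})
    messages = edit_context(messages)
    save_memory(memory)
```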
Anthropic and partner tooling now also includes checkpoints—developer-facing save points that capture intermediate progress and allow recovery or branching, key for long runs that would otherwise be brittle to failures [3]. This stack-level reliability narrows the gap between a promising demo and a production-grade agent that can survive network hiccups, API limits, or environment changes mid-run [3]. Together, these capabilities aim to reduce silent failures and rewrites that derail multiday automations [4].
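Checkpoint mechanics vary by vendor, and the sources do not document Anthropic’s exact API, so the sketch below is a generic, file-based approximation: save intermediate state after each phase, and resume from the latest save point after a crash or API limit. The `run_id` naming scheme and the phase list are assumptions for illustration.

```python
import json
import time
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")
CHECKPOINT_DIR.mkdir(exist_ok=True)

def save_checkpoint(run_id: str, state: dict) -> Path:
    """Capture intermediate progress so a failed run can resume or branch."""
    path = CHECKPOINT_DIR / f"{run_id}-{int(time.time())}.json"
    path.write_text(json.dumps(state, indent=2))
    return path

def latest_checkpoint(run_id: str) -> dict | None:
    """Recover the most recent save point after a failure mid-run."""
    candidates = sorted(CHECKPOINT_DIR.glob(f"{run_id}-*.json"))
    return json.loads(candidates[-1].read_text()) if candidates else None

PHASES = ["plan", "implement", "test", "deploy"]

# Resume where the last run stopped; otherwise start from the first phase.
state = latest_checkpoint("build-webapp") or {"phase": "plan", "files_written": 0}

for phase in PHASES[PHASES.index(state["phase"]):]:
    state["phase"] = phase
    # ... the agent's actual work for this phase would happen here ...
    save_checkpoint("build-webapp", state)
```

Re-running the interrupted phase from its last checkpoint, rather than restarting the whole pipeline, is exactly the brittleness fix the reporting describes.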
Benchmarks and real-world tasks validate Claude coding
The quantitative picture is improving. Reuters cited a 60% score on OS tasks, up from 40%, indicating a 20-point gain versus prior results for comparable tasks that stress planning and tool use [1]. AWS characterized the model as a leader on SWE-bench, a community benchmark for software engineering tasks, further positioning the release for complex, code-centric workloads. While benchmark methodologies vary, the claim signals confidence on practical bug-fixing and patch-generation exercises [4].
Real-world trial details add credibility. TechCrunch reported that Claude Sonnet 4.5 stood up databases, purchased domains, and even executed SOC 2 audit tasks during enterprise tests, a breadth that reflects both code generation and procedural compliance steps typical in production environments [3]. The Verge’s account of an 11,000-line application built autonomously in about 30 hours underscores sustained throughput beyond unit tests or toy programs, pointing to scaffolded services and integration-level work [2]. These results collectively bridge synthetic metrics with tasks enterprises actually request of AI systems [1].
Claude coding in major clouds broadens enterprise access
On September 29, 2025, Amazon Bedrock announced general availability of Claude Sonnet 4.5, describing the model as suited for long-running agents and detailing features like context editing, external memory, and cross-region inference to support enterprise-grade workloads at scale [4]. The same day, Google Cloud’s Vertex AI confirmed availability, focusing on multi-agent orchestration and continuous automation patterns spanning coding, cybersecurity, and financial analysis [5].
Practical implications include standardized identity, security, and monitoring via Bedrock and Vertex AI. Teams can route agent tasks through existing IAM, logging, cost controls, and network policies, rather than building bespoke plumbing. AWS advises customers to consult regional availability and integration guides to match deployment to data locality and compliance needs [4]. Google Cloud encourages testing Sonnet 4.5 on multistep, hours-long workflows to validate performance before fleet-wide rollout [5].
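For teams starting on AWS, invoking the model through Bedrock’s standard Converse API looks roughly like the sketch below. The boto3 call shapes are the documented Bedrock Runtime interface, but the model ID, including the `us.` cross-region inference prefix, is illustrative; confirm the exact identifier and regional availability in the Bedrock console before deploying [4].

```python
import boto3

# Standard Bedrock Runtime client; credentials, region, and IAM policy
# come from the usual AWS configuration chain.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Assumed identifier: the "us." prefix requests a cross-region inference
# profile. Confirm the exact model ID for Sonnet 4.5 in the Bedrock console.
MODEL_ID = "us.anthropic.claude-sonnet-4-5-20250929-v1:0"

response = client.converse(
    modelId=MODEL_ID,
    messages=[
        {
            "role": "user",
            "content": [{"text": "Draft a migration plan for the orders schema."}],
        }
    ],
    inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
)

# The Converse API returns the assistant turn under output.message.
print(response["output"]["message"]["content"][0]["text"])
```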
What 30 hours unlocks for cybersecurity and finance
Anthropic’s product leads and partners cast the model as a breakthrough for long-horizon coding and agentic workflows in cybersecurity and finance—domains where tasks span reconnaissance, configuration, validation, and reporting over extended periods [2]. On the security side, agents that persist for 30 hours can iteratively triage vulnerabilities, propose patches, open pull requests, and generate compliance evidence, including SOC 2 documentation steps previously handled by humans in pieces [3]. In finance, agents could reconcile data from multiple systems, run scenario analyses, and build dashboards without frequent resets [5].
The cloud rollouts augment this potential. Cross-region inference provides continuity if a region becomes unavailable, and external memory lets an agent retain audit trails for regulators or risk reviews during long runs [4]. Google’s orchestration patterns integrate these agents with scheduler services and alerts, enabling “always-on” automations that can escalate to humans when thresholds or anomalies occur [5]. The 30-hour mark doesn’t guarantee correctness, but it raises the ceiling on how much value can be extracted in one cohesive workflow [2].
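Neither source specifies the escalation mechanics, but a threshold-based guard is easy to sketch. In the hypothetical Python loop below, the error-rate threshold, the metrics shape, and the `notify_on_call` hook are all assumptions; a production system would wire the alert into PagerDuty, Cloud Monitoring, or a similar service and run under a real scheduler.

```python
import time

# Assumed policy: pause the automation and page a human if more than
# 5% of agent tasks fail in a cycle. The threshold is illustrative.
ERROR_RATE_THRESHOLD = 0.05

def notify_on_call(summary: str) -> None:
    """Placeholder for a real alert hook (PagerDuty, Cloud Monitoring, email)."""
    print(f"[ESCALATION] {summary}")

def run_agent_cycle() -> dict:
    """Stand-in for one orchestrated agent pass; a real cycle would invoke
    the model and its tools, then report structured metrics."""
    return {"tasks_run": 20, "tasks_failed": 2}

for _ in range(3):  # a real deployment runs under a scheduler, not a bounded loop
    metrics = run_agent_cycle()
    error_rate = metrics["tasks_failed"] / max(metrics["tasks_run"], 1)
    if error_rate > ERROR_RATE_THRESHOLD:
        notify_on_call(f"Agent error rate {error_rate:.1%} exceeded threshold; pausing.")
        break  # hand control to a human instead of compounding failures
    time.sleep(1)  # stand-in for the scheduler interval between cycles
```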
The business case: throughput, cost, and oversight
Quantitatively, longer runs concentrate more build-test-deploy cycles into a single automated window, which can compress lead times for features or fixes that formerly spilled across shifts. The 20-point OS-task benchmark gain (40% to 60%) provides an external signal that error rates and retries may be trending in the right direction, although enterprise SLAs will require their own acceptance thresholds [1]. Tooling such as checkpoints helps convert time into consistent throughput by avoiding restarts after transient failures [3].
Enterprises still need guardrails. Reuters noted Anthropic’s emphasis on safety and the company’s deep ties to Alphabet and Amazon, relationships that often shape compliance and procurement pathways for large customers [1]. AWS and Google provide platform-level controls, but domain-specific reviews—especially for regulated data or production changes—remain essential. Organizations should instrument agent runs with clear stop conditions, human-in-the-loop checkpoints for sensitive actions, and structured test gates before deployment [4].
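One concrete form such a guardrail can take is an approval gate in front of sensitive actions. The sketch below is illustrative only: the action allowlist, the blocking `input()` prompt, and the function names are assumptions, standing in for a real ticketing or chat-based approval flow.

```python
# Assumed policy: anything touching production systems or spend requires
# explicit human sign-off before the agent may proceed.
SENSITIVE_ACTIONS = {"deploy_to_production", "purchase_domain", "modify_iam_policy"}

def request_human_approval(action: str, details: str) -> bool:
    """Blocking console prompt for illustration; a real gate would open a
    ticket or chat approval and wait asynchronously."""
    answer = input(f"Approve '{action}' ({details})? [y/N] ")
    return answer.strip().lower() == "y"

def execute(action: str, details: str) -> None:
    """Run an agent action, stopping at the gate if it is sensitive."""
    if action in SENSITIVE_ACTIONS and not request_human_approval(action, details):
        print(f"Stopped: '{action}' was not approved.")
        return  # stop condition: halt rather than proceed unreviewed
    print(f"Executing {action}: {details}")

execute("run_unit_tests", "billing service test suite")
execute("purchase_domain", "example-billing-app.com")  # triggers the gate
```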
What to watch next for Claude coding
Three signals bear monitoring. First, independent replication of the 30-hour runs across varied environments will test portability beyond hand-picked trials. Second, transparent SWE-bench and OS-task reporting—including prompt design and tool access—will help quantify whether the 60% figure generalizes to customer codebases at scale [1]. Third, cloud cost profiles for long-running agents need to be modeled against developer productivity gains to avoid shifting bottlenecks from human effort to compute budgets [4].
On the product side, expect tighter integration between multi-agent orchestration and enterprise CI/CD, change management, and secrets governance. Google Cloud has already highlighted agent orchestration patterns, and customer feedback will likely drive templates and reference architectures for common pipelines in security and finance [5]. As availability on Bedrock and Vertex AI matures, customers should see more granular examples and cost calculators to right-size deployments for their workloads [4].
Bottom line
Anthropic’s claim that Claude Sonnet 4.5 can code autonomously for 30 hours redefines expectations for AI engineering agents, moving the bar from a few hours to an all-day-plus window [1]. The combined evidence—a 20-point benchmark jump to 60% on OS tasks, a 30-hour build of an 11,000-line app, and cloud availability with enterprise-grade features—suggests tangible momentum from lab to line-of-business delivery [1]. The opportunity now is to translate that raw endurance into safely governed, auditable pipelines that deliver measurable ROI across coding, cybersecurity, and finance [5].
Sources:
[1] Reuters – Anthropic launches Claude 4.5, touts better abilities, targets business customers: https://www.reuters.com/business/retail-consumer/anthropic-launches-claude-45-touts-better-abilities-targets-business-customers-2025-09-29/
[2] The Verge – Anthropic releases Claude Sonnet 4.5 in latest bid for AI agents and coding supremacy: https://www.theverge.com/ai-artificial-intelligence/787524/anthropic-releases-claude-sonnet-4-5-in-latest-bid-for-ai-agents-and-coding-supremacy
[3] TechCrunch – Anthropic launches Claude Sonnet 4.5, its best AI model for coding: https://techcrunch.com/2025/09/29/anthropic-launches-claude-sonnet-4-5-its-best-ai-model-for-coding/
[4] AWS (Amazon Bedrock) – Anthropic’s Claude Sonnet 4.5 is now in Amazon Bedrock: https://aws.amazon.com/about-aws/whats-new/2025/09/anthropics-claude-sonnet-4-5-amazon-bedrock/
[5] Google Cloud (Vertex AI blog) – Announcing Claude Sonnet 4.5 on Vertex AI: https://cloud.google.com/blog/products/ai-machine-learning/announcing-claude-sonnet-4-5-on-vertex-ai