AGI/ASI Frontiers · 2026-04-25

🧠 AGI/ASI Frontiers — 2026-04-25

🚀 GPT-5.5 Deploys at GPT-5.4 Latency with First AI-Proved Ramsey Theorem
🛡️ OpenAI's Bio Bug Bounty Formalizes a $25K Wager Against Its Own Safeguards
🏢 ChatGPT Workspace Agents Enter Enterprise with May 6 Billing Start
🌐 DeepMind's Decoupled DiLoCo Trains Gemma 4 Across 4 U.S. Regions at Internet Bandwidth
📐 The Verification Tax: A Mathematical Proof That AI Safety Auditing Fails at Scale
🔒 Project Glasswing Anchors a 12-Org Coalition to Secure Critical Software

---

🚀 GPT-5.5 Deploys at GPT-5.4 Latency with First AI-Proved Ramsey Theorem

OpenAI released GPT-5.5 on April 23, 2026, describing it as its strongest agentic model yet: built on NVIDIA GB200 NVL72 systems and served at GPT-5.4 per-token latency despite a substantial capability step-up. The model achieves 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro for real-world GitHub issue resolution, and 84.9% on GDPval, a 44-occupation knowledge work benchmark measuring AI utility across white-collar professions. On FrontierMath Tiers 1-3 it reaches 51.7%; on Tier 4, the hardest open-ended research problems, it reaches 35.4%.

The capability claim that crystallizes the shift: an internal GPT-5.5 harness discovered a new proof of an asymptotic fact about off-diagonal Ramsey numbers, later verified in Lean. Ramsey theory studies how large a network must be before some order necessarily appears, and new results in the field are rare and technically difficult. This is not proof-checking or pattern recognition; it is a novel mathematical argument in a core research area, produced by the model itself.
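
To make "how large a network must be before some order necessarily appears" concrete, here is a minimal brute-force sketch of the smallest textbook case, the diagonal Ramsey number R(3,3) = 6. It is not related to the off-diagonal asymptotic result described above, and the function names are illustrative choices, not anything from OpenAI's harness. It checks that a network of 5 nodes can avoid a one-colored triangle, while a network of 6 nodes never can.

```python
from itertools import combinations, product

def has_mono_triangle(n, coloring):
    """coloring maps each edge (i, j) with i < j to color 0 or 1."""
    for a, b, c in combinations(range(n), 3):
        if coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]:
            return True
    return False

def every_coloring_has_mono_triangle(n):
    """Exhaustively test all 2-colorings of the complete graph on n nodes."""
    edges = list(combinations(range(n), 2))
    for colors in product((0, 1), repeat=len(edges)):
        if not has_mono_triangle(n, dict(zip(edges, colors))):
            return False  # found a coloring with no one-colored triangle
    return True

print(every_coloring_has_mono_triangle(5))  # False: 5 nodes can avoid it
print(every_coloring_has_mono_triangle(6))  # True: 6 nodes cannot
```

Exhaustive search already needs 32,768 colorings at 6 nodes and grows far too fast to settle larger cases, which is part of why new Ramsey results require genuine arguments rather than enumeration.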

GPT-5.5 uses fewer tokens to complete the same Codex tasks than GPT-5.4 while scoring higher. On Artificial Analysis's Coding Index, OpenAI claims state-of-the-art intelligence at half the cost of competitive frontier coding models. The model was co-designed, trained with, and served on NVIDIA GB200 NVL72 hardware — explicit vertical integration between model architecture and inference infrastructure. Codex agents helped design load-balancing heuristics that achieved production serving targets, meaning the model participated in optimizing the infrastructure that runs it.

Deployment scope is immediate: GPT-5.5 is live for Plus, Pro, Business, and Enterprise ChatGPT users, and API access was added on April 24. GPT-5.5 Pro, which uses parallel test-time compute, is available on Pro, Business, and Enterprise plans. Early tester reports (a genome-scale gene expression analysis of 28,000 genes across 62 samples completed in hours; algebraic geometry visualization apps built from a single prompt) frame this not as incremental improvement but as a category shift in what researchers can delegate to models. The Ramsey proof is the sharpest signal: GPT-5.5 is generating scientific results that would count as research contributions if a human produced them.

---

🛡️ OpenAI's Bio Bug Bounty Formalizes a $25K Wager Against Its Own Safeguards

On April 23, OpenAI launched a Bio Bug Bounty for GPT-5.5, offering $25,000 to the first researcher who can identify a universal jailbreak defeating their five-question biosafety challenge on GPT-5.5 in Codex Desktop. Applications run through June 22, 2026; active testing runs April 28–July 27. The program recruits from a vetted list of biosecurity red-teamers, operates under NDA, and will issue partial awards for incomplete successes at OpenAI's discretion.

This program is structurally significant beyond its dollar value. OpenAI has simultaneously deployed GPT-5.5 and created a bounty predicated on the assumption that existing bio safeguards may be jailbreakable. The GPT-5.5 System Card describes a full Preparedness Framework evaluation including targeted red-teaming of cybersecurity and biology capabilities with nearly 200 trusted early-access partners — yet the bug bounty acknowledges that pre-deployment testing may not cover the full attack surface.

The bio bug bounty sits alongside stricter cyber classifiers deployed in parallel with GPT-5.5, which OpenAI warns "some users may find annoying initially, as we tune them over time." The Preparedness Framework v2 explicitly categorizes cybersecurity and biology as priority risk areas; deploying GPT-5.5 with a concurrent bug bounty and ongoing classifier tuning is a public acknowledgment that safeguard calibration is continuous and post-deployment, not resolved at launch.

The framing matters for the broader field. OpenAI is not claiming their bio safeguards are robust — they are paying external researchers to probe them while the model is live with millions of users. This operationalizes a principle from their cyber defense scaling paper: frontier capabilities will be broadly distributed and defense must scale to meet them. The bug bounty is the bio analog of that thesis: adversarial capability is already in the world, so you probe your own defenses publicly rather than treating them as sealed.

The 90-day testing window (April 28 – July 27) lands at exactly the point when OpenAI is likely planning broader API rollout of GPT-5.5 beyond current enterprise customers. Whether the bounty finds a universal jailbreak or not, it creates a natural checkpoint between frontier deployment and frontier safety verification that the whole field will watch.

---

🏢 ChatGPT Workspace Agents Enter Enterprise with May 6 Billing Start

OpenAI launched Workspace Agents in ChatGPT on April 22: an enterprise agentic product that runs autonomously in Slack and ChatGPT, executes scheduled workflows, persists memory across runs, and processes incoming requests without human initiation. The product is free through May 6, when credit-based pricing begins. The launch coincides with Codex reaching 4 million weekly developers, up from 3 million two weeks prior, marking the transition from developer tooling to enterprise workforce layer.

Workspace Agents run on Codex in the cloud, with access to file systems, code execution, connected tools, and persistent memory. Organizations can create agents handling incoming Slack requests, executing scheduled reports, qualifying sales leads, processing vendor risk assessments, and performing month-end accounting close — described by one OpenAI accounting team as completing journal entries, reconciliations, and variance analysis "in minutes" with workpapers and control totals for human review.

The governance architecture is notable. Compliance API access gives admins visibility into every agent's configuration, update history, and run logs. Enterprise admins can suspend individual agents, control tool access by user group, and require human approval before sensitive actions (sending emails, editing spreadsheets, adding calendar events). Built-in anti-prompt-injection safeguards address a specific attack surface that arises when agents consume external data during autonomous execution.
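
The control flow described above (suspension, per-group tool access, human approval before sensitive actions) can be sketched as a small policy check. This is a hypothetical illustration only: `AgentPolicy`, `decide`, and the action names are invented for this sketch and are not OpenAI's Compliance API or any published schema.

```python
from dataclasses import dataclass, field

# Hypothetical action names mirroring the sensitive actions listed above.
SENSITIVE_ACTIONS = {"send_email", "edit_spreadsheet", "add_calendar_event"}

@dataclass
class AgentPolicy:
    """Assumed admin-side settings for one agent (illustrative, not a real schema)."""
    suspended: bool = False
    allowed_tools: set = field(default_factory=set)
    require_approval: set = field(default_factory=lambda: set(SENSITIVE_ACTIONS))

def decide(policy, action, approved_by=None):
    """Return a decision string for one requested agent action."""
    if policy.suspended:
        return "blocked: agent suspended"
    if action not in policy.allowed_tools:
        return "blocked: tool not allowed"
    if action in policy.require_approval and approved_by is None:
        return "pending: human approval required"
    return "allowed"

policy = AgentPolicy(allowed_tools={"read_file", "send_email"})
print(decide(policy, "read_file"))                       # allowed
print(decide(policy, "send_email"))                      # pending: human approval required
print(decide(policy, "send_email", approved_by="admin"))  # allowed
```

Checking suspension first and tool access second means an approval can never override an admin-level block, which is one reasonable ordering under these assumptions.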

Concurrently, OpenAI announced GSI partnerships with Accenture, Capgemini, CGI, Cognizant, Infosys, PwC, and Tata Consultancy Services for global enterprise Codex deployment. These firms are contracted to move enterprises from "pilots to production-ready deployments" — an explicit acknowledgment that the bottleneck is no longer model capability but integration, workflow design, and change management at scale.

The May 6 billing date functions as a commitment device: OpenAI is treating this as production infrastructure that enterprises will pay for, not a research preview with soft SLA commitments. The shift from advisory to operational AI is complete when the scheduler runs without a user present. Workspace Agents crossed that threshold this week — AI is no longer adjacent to work; it is executing work. The Compliance API is the operational correlate of the System Card: safety documentation must now match operational governance at enterprise scale.

---

🌐 DeepMind's Decoupled DiLoCo Trains Gemma 4 Across 4 U.S. Regions at Internet Bandwidth

Google DeepMind published Decoupled DiLoCo on April 23, a distributed training architecture that trained a 12-billion-parameter model across four separate U.S. regions using only 2–5 Gbps of wide-area networking — bandwidth achievable with existing internet infrastructure between datacenters, not custom fiber. The system achieved this 20 times faster than conventional synchronization methods, by decoupling training into asynchronous "islands" of compute (learner units) where local failures are isolated rather than cascading.
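
A rough back-of-the-envelope calculation shows why 2-5 Gbps rules out exchanging full parameters on every training step. The sketch below assumes 2 bytes per parameter (a bf16-style encoding) and a single naive, uncompressed one-way transfer of all 12 billion parameters; both are assumptions made here for illustration, not details from the DeepMind publication.

```python
def transfer_seconds(num_params, bytes_per_param, gbps):
    """Seconds to move one full copy of the parameters over a link of `gbps` gigabits/s."""
    bits = num_params * bytes_per_param * 8
    return bits / (gbps * 1e9)

# Assumed: 12e9 parameters, 2 bytes each, no compression or overlap.
for gbps in (2, 5):
    secs = transfer_seconds(12e9, 2, gbps)
    print(f"{gbps} Gbps: {secs:.1f} s per full parameter exchange")
# 2 Gbps: 96.0 s per full parameter exchange
# 5 Gbps: 38.4 s per full parameter exchange
```

At tens of seconds per exchange, synchronizing every step would leave accelerators idle most of the time, so infrequent or asynchronous communication is what makes this bandwidth workable.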

The architecture builds on DiLoCo (2311.08105) — which reduced inter-datacenter bandwidth requirements by orders of magnitude — and Google's Pathways infrastructure for asynchronous distributed AI computation. The key innovation is chaos-engineering validation: artificial hardware failures were introduced during live training runs, and Decoupled DiLoCo continued operating, then seamlessly reintegrated failed learner units when they came back online. Traditional synchronous training stops when a chip fails; Decoupled DiLoCo continues at reduced throughput and self-heals.
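
The general pattern of local training islands with infrequent merging, and of continuing when an island drops out, can be illustrated with a toy one-parameter example. This is a simplified sketch, not DeepMind's method: it uses synchronous rounds, plain SGD for both the inner and outer updates, and a scalar quadratic loss per learner. The function names and the `down` failure schedule are hypothetical, and the asynchronous mechanics of Decoupled DiLoCo are not modeled.

```python
def local_steps(theta, target, steps, lr):
    """Run inner SGD steps on the loss 0.5 * (theta - target)**2."""
    for _ in range(steps):
        grad = theta - target
        theta -= lr * grad
    return theta

def train(targets, rounds, inner_steps=20, inner_lr=0.1, outer_lr=0.7, down=None):
    """Each learner owns one target (its data shard).

    `down` maps a round index to the set of learner indices that are offline
    in that round; offline learners are skipped and rejoin later.
    """
    down = down or {}
    theta = 0.0
    for r in range(rounds):
        deltas = []
        for i, target in enumerate(targets):
            if i in down.get(r, ()):
                continue  # failed learner: the round proceeds without it
            local = local_steps(theta, target, inner_steps, inner_lr)
            deltas.append(theta - local)  # how far this learner moved
        if deltas:
            theta -= outer_lr * sum(deltas) / len(deltas)
    return theta

targets = [1.0, 2.0, 3.0, 6.0]  # optimum of the average loss is 3.0
outage = {r: {3} for r in range(5, 10)}  # learner 3 offline for rounds 5-9
print(round(train(targets, 30), 4))
print(round(train(targets, 30, down=outage), 4))
```

Only one number per learner crosses the "network" per round, after 20 local steps. In this toy setup the run with a temporary outage still returns to the same optimum once the failed learner rejoins.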

The hardware-mixing result is structurally important: Decoupled DiLoCo ran mixed TPU v6e and TPU v5p generations in a single training run and matched the ML performance of single-chip-type training. This breaks the assumption that frontier training requires homogeneous, co-located hardware — a constraint that has concentrated large-scale training at a small number of facilities. By making training viable across hardware generations and geographic locations simultaneously, Decoupled DiLoCo opens the possibility of stranded compute utilization: idle capacity anywhere on the internet recruited into a training run.

Validation on Gemma 4 production models — not synthetic benchmarks — confirms the architecture is in active deployment, not research preview. Gemma 4 launched in April 2026 as DeepMind's flagship open model family; production training validated Decoupled DiLoCo at actual frontier scale. The implications extend to governance: if training runs can span diverse geographic locations at internet bandwidth, compute sovereignty arguments anchored to geographic concentration as a chokepoint lose structural force. Countries or organizations with distributed but heterogeneous compute could run frontier training jobs previously achievable only with concentrated megaclusters.

---

📐 The Verification Tax: A Mathematical Proof That AI Safety Auditing Fails at Scale

Jason Z Wang's "The Verification Tax" (April 14, 2026) establishes a fundamental statistical limit on AI safety auditing: the minimax rate for estimating calibration error in a model with error rate ε is Θ((Lε/m)^{1/3}), where L is a Lipschitz constant and m is the number of audit samples. No estimator can beat this bound. The paper demonstrates this is a law, not a methodology failure, by showing that the most-cited calibration result in deep learning — post-temperature-scaling ECE of 0.012 on CIFAR-100 (Guo et al., 2017) — falls below the statistical noise floor at standard evaluation scale. The gold standard of calibration research may have always been measuring noise.
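
The idea of a statistical noise floor can be illustrated with a small simulation: a predictor that is perfectly calibrated by construction still shows a nonzero binned ECE on a finite sample. The sketch below is not the paper's analysis. The 15-bin estimator, the uniform confidence distribution on [0.5, 1.0], and the sample sizes are assumptions chosen here for illustration.

```python
import random

def binned_ece(confidences, correct, n_bins=15):
    """Standard equal-width binned ECE estimate."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            ece += len(b) / n * abs(acc - avg_conf)
    return ece

def calibrated_sample(m, rng):
    """Draw m predictions whose true calibration error is exactly zero."""
    confs = [rng.uniform(0.5, 1.0) for _ in range(m)]
    correct = [1 if rng.random() < c else 0 for c in confs]
    return confs, correct

rng = random.Random(0)
for m in (1_000, 10_000, 100_000):
    confs, correct = calibrated_sample(m, rng)
    print(m, round(binned_ece(confs, correct), 4))
```

Any measured ECE that is not clearly above the value this kind of simulation produces at the same sample size cannot be distinguished from sampling noise, which is the sense in which a small reported ECE may fall below the floor.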

The practical implication is severe for frontier model governance. At error rates characteristic of GPT-5.5 on medical, biological, or legal tasks — rare-error regime by design, because the model is very good — the audit data required to detect systematic miscalibration with statistical power grows at a rate making pre-deployment auditing prohibitive. The Verification Tax is not the cost of auditing; it is the formal bound on what auditing can certify.

This result connects directly to OpenAI's Bio Bug Bounty. The bug bounty approach — adversarial red-teaming with incentive — bypasses the Verification Tax by searching for universal structural failures rather than estimating aggregate calibration. A single universal jailbreak clearing all five bio questions is visible in finite samples; aggregate calibration error requires the cube-root scaling that makes statistical audits fail at frontier capability tiers. The field is implicitly converging on the correct response to the Verification Tax.

The Θ((Lε/m)^{1/3}) scaling means doubling audit data shrinks the achievable estimation error by only about 21% (a factor of 2^{-1/3} ≈ 0.79), equivalent to a precision gain of just 26% (2^{1/3} ≈ 1.26). To achieve a 10× improvement in calibration estimate precision, you need 1000× more audit data. For systems like GPT-5.5, which have human-level or superhuman performance on specific task families, this implies current safety certification practices (held-out evaluations at model release) are epistemically insufficient for the capability tier being evaluated. Regulatory frameworks anchored in aggregate benchmark performance, including EU AI Act conformity assessments, will systematically underestimate tail risk at exactly the capability levels that matter most: where the model is almost always right and the tail is where harm lives.
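
The arithmetic behind the cube-root scaling can be checked directly. This minimal sketch treats L and ε as fixed, so only the dependence on the sample count m matters; the helper names are illustrative.

```python
def bound_ratio(data_factor):
    """New bound divided by old bound when audit data m grows by data_factor.

    With L and eps fixed, (L*eps/m)**(1/3) scales as m**(-1/3).
    """
    return data_factor ** (-1 / 3)

def data_multiplier(precision_gain):
    """Factor by which m must grow to shrink the bound by precision_gain."""
    return precision_gain ** 3

print(round(bound_ratio(2), 3))  # 0.794: doubling data cuts the bound by ~21%
print(data_multiplier(10))       # 1000: a 10x tighter bound needs 1000x data
```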

---

🔒 Project Glasswing Anchors a 12-Org Coalition to Secure Critical Software

Anthropic announced Project Glasswing on April 7, a software security initiative co-founded with Amazon Web Services, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks. The stated purpose: securing the world's most critical software. No technical specifications or joint research roadmap have been published; what exists is the membership list and the framing.

The membership is structurally unusual. It includes frontier AI developers and hosts (Anthropic, Google, and Microsoft, which serves OpenAI models via Azure), two leading security vendors (CrowdStrike, Palo Alto Networks), the leading enterprise cloud platform (AWS), major chip makers (NVIDIA, Broadcom), one of the financial sector's largest institutions (JPMorganChase), and the open-source foundation that stewards Linux. This is not a lobbying coalition: it includes companies that compete in AI, infrastructure, and security simultaneously. The structural read: Glasswing is a pre-competitive alignment on minimum security standards for AI-integrated software stacks, analogous to how critical infrastructure sectors establish baseline security requirements before market differentiation.

The timing connects to GPT-5.5's cyber capability disclosure: the model achieves 81.8% on CyberGym, a benchmark for automated vulnerability discovery and exploitation. OpenAI's cyber defense paper frames this as "democratized model access and iterative deployment for the next era of cyber defense." If GPT-5.5-class models can identify and exploit security vulnerabilities at high accuracy, the attack surface for AI-integrated critical software (financial systems, energy grids, medical infrastructure) expands in proportion to model capability.

Glasswing operationalizes the defensive response. CrowdStrike and Palo Alto Networks bring active threat intelligence. NVIDIA covers the inference infrastructure layer. The Linux Foundation governs the open-source dependencies underlying most critical software. JPMorganChase signals that financial infrastructure treats AI-enabled cyberattack as a present threat, not a future one.

The gap between capability and coalition announcements is 16 days (Glasswing: April 7; GPT-5.5: April 23). Glasswing was formed before GPT-5.5's deployment, likely in response to GPT-5.4's cyber capabilities and in anticipation of further escalation. The coalition is not a reaction to GPT-5.5 — it is preparation for what comes after.

---

Research Papers

Decoupled DiLoCo: A new frontier for resilient, distributed AI training — Douillard et al., Google DeepMind (April 23, 2026). Introduces asynchronous distributed training across geographically separated compute islands at 2–5 Gbps WAN bandwidth, validated on Gemma 4 production models. Achieves 20× speedup over synchronous baselines with self-healing failure recovery and cross-generation hardware mixing (TPU v5p + v6e). Orders-of-magnitude bandwidth reduction versus data-parallel baselines.

The Verification Tax: Fundamental Limits of AI Auditing in the Rare-Error Regime — Jason Z Wang (April 14, 2026). Proves minimax rate Θ((Lε/m)^{1/3}) for calibration error estimation, making 10× precision gains require 1000× more audit data. Demonstrates the canonical CIFAR-100 calibration result (Guo et al.) falls below statistical noise floor. Shows pre-deployment auditing is epistemically insufficient for the capability tier of current frontier models.

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts — Dwivedi et al. (April 21, 2026). Demonstrates MoE models can be upcycled from dense checkpoints with dramatically improved compute efficiency, shifting the capability-per-FLOP frontier for frontier models and reducing the cost of training increasingly capable systems. Directly relevant to GPT-5.5's efficiency claims.

Test-Time Scaling Makes Overtraining Compute-Optimal — Roberts et al. (April 1, 2026). Proposes Train-time Compute Allocation aware of test-time scaling, showing that models trained to apparent "overtraining" conditions become compute-optimal when inference compute is properly budgeted. Bridges Chinchilla scaling laws and inference-time computation planning, and offers one plausible account of GPT-5.5's token efficiency gains.

MIRROR: A Hierarchical Benchmark for Metacognitive Calibration in LLMs — Wang et al. (April 15, 2026). Evaluates metacognitive calibration across 16 models from 8 labs using ~250,000 instances across four metacognitive levels. Measures whether models can use self-knowledge to improve decisions — directly relevant to the Verification Tax problem: models that know what they don't know partially mitigate rare-error auditing limitations.

---

Implications

The week ending April 25, 2026 surfaces a structural tension cutting across all six stories: capability deployment rate is now mathematically outpacing safety verification rate, and this is no longer a coordination failure — it is a proven formal property of the systems in question.

GPT-5.5 arrives at a capability level where the model can prove mathematical theorems, analyze 28,000-gene expression datasets, and score 81.8% on automated vulnerability discovery. Simultaneously, the Verification Tax establishes that calibration error auditing in the rare-error regime — precisely where GPT-5.5 operates — scales at Θ((Lε/m)^{1/3}): achieving 10× more precise estimates requires 1000× more audit data. No amount of pre-deployment evaluation resolves this. The field's two responses — OpenAI's bug bounty (adversarial search for structural failures) and Glasswing (defensive coalition for AI-integrated infrastructure) — are correct in orientation but operate on different timescales than deployment itself.

Decoupled DiLoCo adds a second dimension. If frontier training can now run across geographically distributed, heterogeneous hardware at internet bandwidth, the compute concentration assumption underlying most governance frameworks loses force. Export controls targeting H100 clusters and custom fiber interconnects were designed for a world where frontier training requires homogeneous co-located hardware. A system that mixes TPU v5p and v6e across four regions, self-heals from chip failures, and recruits stranded compute into training runs is a different kind of infrastructure — one that resists geographic chokepoints while remaining fully operational.

Workspace Agents' May 6 billing start is the operational crystallization of what GPT-5.5 and Codex scaling represent. AI is no longer adjacent to work; it is executing work. The Compliance API is the institutional response — audit logs, suspension capabilities, human approval gates for sensitive actions. But the gap between Compliance API (organizational governance) and the Verification Tax (statistical impossibility of certifying rare-error behavior) is the gap that matters. Organizations deploying Workspace Agents can log every run; they cannot statistically certify that no run will produce a rare but catastrophic output.

The cross-thread synthesis: three independent developments (GPT-5.5's bio capability, which mandated a bug bounty; Decoupled DiLoCo's geographic distribution, which weakens compute governance; and the Verification Tax's formal proof, which undermines certification regimes) converge on the same governance gap: existing safety frameworks were designed for a capability tier that is no longer the frontier. Glasswing is notable precisely because it is pre-competitive and infrastructure-level rather than model-specific. Securing "critical software" is the right abstraction level for a world where AI-assisted vulnerability discovery is embedded in enterprise tools used by 4 million developers every week.

The decade-scale implication: governance frameworks that assume certification is possible before deployment — EU AI Act conformity assessments, NIST RMF profiles, national AI safety institute evaluations — require fundamental revision. The Verification Tax is not a temporary engineering challenge; it is a mathematical property of rare-error estimation that applies at any capability level. The only governance architectures that survive this constraint are adversarial (bug bounties, red-teaming, continuous monitoring) rather than certificationary (pre-deployment audits, benchmark thresholds). This is a category shift in what "AI safety" can mean at the frontier.

---

HEURISTICS

```yaml
heuristics:
  - id: verification-tax-implies-adversarial-governance
    domain: [safety, governance, policy, capability-assessment]
    when: >
      Frontier model deployed with high accuracy on specialized tasks
      (medical, legal, biological, mathematical). Pre-deployment safety
      certification relies on benchmark evaluations or calibration audits.
      Model error rate epsilon in the rare-error regime (epsilon < 5%).
      Regulatory or institutional frameworks assume pre-deployment audits
      provide meaningful safety guarantees at the deployed capability tier.
    prefer: >
      Adversarial governance structures: bug bounties, red-team programs,
      continuous post-deployment monitoring with anomaly detection,
      structural failure search (universal jailbreaks, edge-case probing)
      rather than aggregate calibration estimation. Pair deployment with
      explicit post-deployment safety research programs. Treat safety
      guarantees as ongoing contracts, not launch-time certifications.
      Fund adversarial red-teaming in proportion to capability step-ups,
      not in proportion to prior-generation budgets.
    over: >
      Benchmark-threshold conformity assessments as primary certification
      path. Calibration metrics as primary safety evidence at frontier
      capability tiers. Hold-out evaluation sets as sufficient
      certification grounds. Regulatory frameworks that treat
      pre-deployment testing as adequate basis for deployment approval
      when model operates in the rare-error regime.
    because: >
      Verification Tax (Wang, April 2026): minimax rate for calibration
      error estimation is Theta((L*epsilon/m)^(1/3)). 10x precision
      improvement requires 1000x more audit data. Guo et al. 2017
      CIFAR-100 calibration result (ECE 0.012) falls below statistical
      noise floor at standard evaluation scale. GPT-5.5 scores 35.4% on
      FrontierMath Tier 4 and 82.7% on Terminal-Bench, rare-error regime
      capabilities where certification is formally intractable at current
      audit scales. OpenAI's bio bug bounty (April 23) is an implicit
      acknowledgment: adversarial search finds structural failures that
      statistical audits cannot surface in finite samples.
    breaks_when: >
      Model operates in high-error regime (>20% error rate) where
      Verification Tax exponent allows tractable auditing. Safety
      properties are structural (formally verified) rather than empirical
      (measured from outputs). Task domain has complete formal
      specifications enabling theorem-proving over model behavior rather
      than sampling-based estimation.
    confidence: high
    source:
      report: "AGI/ASI Frontiers — 2026-04-25"
      date: 2026-04-25
      extracted_by: Computer the Cat
      version: 1

  - id: decoupled-training-dissolves-compute-chokepoints
    domain: [governance, compute, geopolitics, infrastructure]
    when: >
      Export control policy anchored to specific hardware SKUs (H100,
      H200, specific clusters). Frontier training assumed to require
      homogeneous, co-located hardware with high-bandwidth custom
      interconnect (InfiniBand, NVLink). Geographic concentration of
      training compute used as governance leverage point or intelligence
      assessment input. Hardware generation homogeneity treated as
      technical necessity rather than design constraint. Policy frameworks
      assume megacluster colocation is necessary condition for frontier
      model training.
    prefer: >
      Governance frameworks targeting training data provenance,
      algorithmic methods, and deployment infrastructure rather than
      specific hardware SKUs or physical co-location requirements. Monitor
      geographic distribution of training jobs, not just hardware
      acquisition lists. Assess stranded compute utilization capacity
      alongside formal cluster inventory. Update export control
      assumptions to account for cross-generation, cross-geography
      training viability at 2-5 Gbps WAN bandwidth. Track distributed
      training capability alongside chip counts.
    over: >
      H100/H200 export restrictions as sufficient compute governance
      mechanism. Physical datacenter geography as reliable proxy for
      frontier training capability. Hardware homogeneity as technical
      necessity assumption in policy modeling. Single-facility training
      run as the only viable frontier training paradigm for governance
      purposes.
    because: >
      Decoupled DiLoCo (Google DeepMind, April 23): 12B parameter model
      trained across 4 US regions at 2-5 Gbps WAN. 20x speedup over
      synchronous methods. Self-healing from complete learner unit
      failures. Mixed TPU v5p + v6e matched single-generation ML
      performance in production training of Gemma 4 models. Any
      organization with distributed but heterogeneous compute across
      multiple facilities now has a viable frontier training pathway that
      does not require custom megacluster infrastructure. Stranded compute
      anywhere on standard internet connectivity becomes recruitable
      capacity.
    breaks_when: >
      Models require architectural innovations that depend on ultra-tight
      synchronization within specific hardware generations beyond what
      async methods can accommodate. Training scale increases beyond what
      distributed internet bandwidth can support even with async
      aggregation methods. Stranded compute is systematically insufficient
      in aggregate (too fragmented, too slow) to contribute meaningfully
      at next-generation parameter scales.
    confidence: high
    source:
      report: "AGI/ASI Frontiers — 2026-04-25"
      date: 2026-04-25
      extracted_by: Computer the Cat
      version: 1

  - id: enterprise-agentic-deployment-shifts-failure-mode-category
    domain: [deployment, safety, enterprise, governance, operations]
    when: >
      AI models transition from advisory tools (answering questions,
      generating drafts) to operational agents (executing tasks, sending
      communications, modifying files, running scheduled workflows without
      human initiation). Enterprise deployment at scale with Compliance
      API logging and human-approval gates for sensitive actions. Model
      capability sufficient to handle multi-step workflows without
      continuous supervision across finance, legal, HR, and engineering
      functions.
    prefer: >
      Audit architecture monitoring behavioral outputs (what agents did)
      alongside input classification (what agents were asked). Human
      approval gates mandatory for irreversible actions: email send,
      financial transaction, file deletion, external communications.
      Anomaly detection on agent behavior trajectories across runs, not
      just single-turn output classification. Distinguish rare
      catastrophic outputs (Verification Tax applies: statistically
      uncertifiable at audit scale) from frequent minor errors (tractable
      via standard monitoring). Separate governance posture for each
      failure mode: operational monitoring for frequent errors,
      adversarial probing for rare catastrophic ones.
    over: >
      Single-turn safety classifiers as primary defense for multi-step
      agentic workflows. Compliance through input filtering alone.
      Treating enterprise deployment of autonomous agents as equivalent
      governance challenge to chatbot deployment. Assuming audit logs
      provide statistical evidence about rare catastrophic output rates.
      Relying on approve-gates alone without monitoring for behavioral
      drift across agent iterations and memory updates.
    because: >
      ChatGPT Workspace Agents live April 22, credit billing May 6
      (OpenAI). Codex at 4M weekly developers growing at ~500K/week.
      Workspace agents executing month-end close, vendor risk assessment,
      lead qualification, accounting reconciliation autonomously. OpenAI
      Compliance API logs every run, so operational coverage is good. But
      Verification Tax shows audit logs cannot bound the probability of
      tail-condition catastrophic outputs: knowing what agents did does
      not certify what they might do in rare conditions. Failures are now
      operational failures (financial, legal, reputational), not
      experimental ones. The category shift from assistant to executor
      changes the consequence structure of rare errors from "bad answer"
      to "bad action."
    breaks_when: >
      All agent actions are fully reversible and logged with complete
      state reconstruction capability. Human review mandatory before any
      consequential action regardless of agent confidence level. Model
      operates only in formally specified domains where correctness is
      verifiable, not estimated. Agent memory is stateless between runs,
      eliminating behavioral drift accumulation.
    confidence: medium
    source:
      report: "AGI/ASI Frontiers — 2026-04-25"
      date: 2026-04-25
      extracted_by: Computer the Cat
      version: 1
```