Recursive Simulations · 2026-04-26

⚡ FLASH's GPU-Native Physics Engine Trains Deformable Robot Policies in Minutes with Zero Real Demos
🌆 CityRAG Generates Minutes-Long Navigable Simulations of Real Cities from Geo-Registered Data
🧑‍💻 Human-in-the-World-Model Turns Simulation Rollbacks into Dense Correction Supervision
🧪 SIM1's 1:15 Synthetic-to-Real Equivalence Ratio Challenges the Primacy of Demonstration Data
📐 PhysInOne's 2M-Video Benchmark Maps 71 Physical Phenomena—and Where AI Collapses
🔬 Mechanistic Co-Training Analysis Finds Two Structural Effects That Govern Sim-to-Real Transfer

---

⚡ FLASH's GPU-Native Physics Engine Trains Deformable Robot Policies in Minutes with Zero Real Demos

Contact-rich simulation of soft materials has been the persistent bottleneck in robot learning — not because researchers lacked physics engines, but because every existing solver was designed for single-instruction-multiple-data (SIMD) CPU architectures and subsequently ported to GPU as an afterthought. FLASH, submitted April 19, breaks this pattern: a GPU-native simulation framework for deformable manipulation that redesigns the physics engine from the ground up for modern GPU parallelism, producing qualitatively different scale and speed properties than anything ported from CPU-first designs.

The core solver uses a nonlinear complementarity problem (NCP) formulation that enforces strict contact and deformation constraints simultaneously — unlike penalty-based methods that allow interpenetration under high loads. Rather than adapting conventional SIMD layouts to CUDA, FLASH restructures collision detection, constraint assembly, and memory access patterns to exploit fine-grained GPU parallelism natively. The result: 3 million degrees of freedom at 30 frames per second on a single RTX 5090 — a capability that pushes soft-body simulation into the performance regime of Isaac Sim's rigid-body benchmarks.
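
The contrast between complementarity constraints and penalty forces can be shown on a toy problem. The sketch below is not FLASH's solver, which is nonlinear and GPU-parallel; it solves a small linear complementarity problem with projected Gauss-Seidel, and compares it with the steady-state penetration a penalty spring allows. The function names and the two-contact matrix are illustrative assumptions.

```python
def solve_lcp_pgs(A, b, sweeps=50):
    """Projected Gauss-Seidel for the LCP: z >= 0, w = A z + b >= 0, z . w = 0.

    Toy reading: z holds contact impulses and w holds the resulting gaps.
    Assumes A is symmetric positive definite (hypothetical 2-contact system below).
    """
    n = len(b)
    z = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            residual = sum(A[i][j] * z[j] for j in range(n)) + b[i]
            # Gauss-Seidel update, then project onto z_i >= 0.
            z[i] = max(0.0, z[i] - residual / A[i][i])
    return z


def penalty_penetration(load, stiffness):
    """Steady-state interpenetration of a penalty contact: depth = load / stiffness."""
    return load / stiffness


A = [[2.0, 1.0], [1.0, 2.0]]   # illustrative coupling between two contacts
b = [-1.0, 1.0]                # contact 0 would penetrate, contact 1 is separated
z = solve_lcp_pgs(A, b)
w = [sum(A[i][j] * z[j] for j in range(2)) + b[i] for i in range(2)]
```

In the complementarity solution every gap in `w` is non-negative, so nothing interpenetrates, while `penalty_penetration` always returns a positive depth that grows with load.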

The deployment result is the more epistemologically significant data point. Policies trained solely on FLASH-generated synthetic data achieve robust zero-shot sim-to-real transfer on physical Franka robots performing towel folding and garment folding, with no real-world demonstrations, no domain randomization tuning, and no human-collected teleoperation data. The simulation pipeline becomes the complete training substrate.

This is the simulation-as-infrastructure pattern at full expression. Isaac Sim and similar platforms have historically supplemented real data, reduced collection costs, or stress-tested edge cases. FLASH reframes the relationship: real-world hardware is the evaluation environment; simulation is where policies are born. The practical implication for robot learning labs is significant — deformable manipulation, long excluded from scalable synthetic-first learning because sim data was too degraded to transfer, is now within scope for the same zero-demonstration pipeline that works for locomotion and rigid grasping.

The failure mode worth tracking is domain selectivity. FLASH achieves zero-shot transfer for contact-rich deformable manipulation in controlled textile domains (folding flat fabrics). Topology-changing deformables (tearing, cutting), multi-material contacts, and biological tissue dynamics remain outside the validated transfer regime. The gap between "towel folding zero-shot" and "surgery robot zero-shot" is still wide, but it is an engineering gap rather than a theoretical one: the NCP architecture can in principle extend to more complex scenarios, and what is required is solver scale and domain calibration, not new theory.

The competitive implication: any robotics program treating real demonstration collection as its primary data strategy is now running against a cost curve FLASH disrupts. At 3 million DOF on a single consumer GPU, simulation clusters generate more policy-relevant experience per hour than any teleoperation program — and the gap will widen as GPU compute continues scaling.

Sources: arXiv:2604.17513

---

🌆 CityRAG Generates Minutes-Long Navigable Simulations of Real Cities from Geo-Registered Data

The limiting factor in simulation-based autonomous vehicle development is not physics fidelity inside a controlled scene — it is geographic coverage: the ability to simulate a specific real-world intersection, on a specific street in a specific city, under specific weather and lighting conditions, with correct geometry and traffic dynamics. CityRAG, submitted April 21, directly addresses this gap with a video generative model that uses large corpora of geo-registered data as context to ground generation in physical real-world scenes.

The core innovation is context-conditioned generation with temporal disentanglement. Rather than generating a plausible urban scene from text or image prompts, CityRAG retrieves geo-registered observations (from mapping datasets and street-level imagery) as scene context and conditions the generative model on the underlying physical geometry. Training on temporally unaligned data teaches the model to disentangle the persistent scene structure from transient attributes — weather, lighting, pedestrian density, vehicle configurations. The result is a model that can generate the same location under arbitrary conditions while preserving geometric fidelity.
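
The retrieval step can be pictured as a nearest-neighbor lookup over geo-tagged frames. The sketch below is an assumption about what such a lookup might look like, not CityRAG's actual retrieval mechanism, which this summary does not specify. The frame schema (`lat`, `lon`, `weather`), the 50 m radius, and the function names are all hypothetical.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    earth_radius_m = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


def retrieve_context(query, frames, radius_m=50.0, k=3):
    """Return up to k geo-tagged frames within radius_m of query, nearest first."""
    scored = [(haversine_m(query[0], query[1], f["lat"], f["lon"]), f) for f in frames]
    nearby = [(d, f) for d, f in scored if d <= radius_m]
    nearby.sort(key=lambda pair: pair[0])
    return [f for _, f in nearby[:k]]


# Hypothetical frames of one street captured at different times and conditions.
frames = [
    {"id": "c", "lat": 0.0, "lon": 0.0100, "weather": "rain"},
    {"id": "b", "lat": 0.0, "lon": 0.0003, "weather": "snow"},
    {"id": "a", "lat": 0.0, "lon": 0.0001, "weather": "sun"},
]
context = retrieve_context((0.0, 0.0), frames)
```

Because the retrieved frames share a location but differ in weather and capture time, they are exactly the kind of temporally unaligned context from which persistent geometry has to be separated from transient attributes.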

The performance claims are significant: minutes-long coherent video sequences (far beyond prior generative models that lose consistency after seconds), maintenance of weather and lighting conditions over thousands of frames, loop closure (generating a consistent view of the same location when returning to it after navigating away), and correct geometric reconstruction when navigating complex trajectories. Loop closure is particularly consequential — it means the simulation remains geometrically consistent across a full navigational episode, not just a short clip.

The deployment implication for AV simulation is substantial. Current closed-loop simulation infrastructure requires engineering custom scenes either manually or from limited captured logs. CityRAG's approach — retrieve geo-registered context, generate navigable simulation — implies that any location with sufficient street-level data coverage can become a simulation environment. Cities with dense geo-registered coverage (effectively any location with Google Street View or Mapillary density) become simulation candidates without manual scene engineering.

The epistemological question this raises connects directly to the simulation-as-ground-truth pattern. If CityRAG can generate a photorealistic simulation of a real intersection under arbitrary weather, and a policy trained in that simulation is then deployed at the real intersection, the validation signal becomes ambiguous. Did the policy succeed because the simulation was physically accurate, or because the policy learned to exploit the statistical patterns of the generative model? The correlation between simulation and reality must be established empirically per location and condition, not assumed from perceptual plausibility.

Cross-thread connection: CityRAG provides the environmental context layer; FLASH provides the physics engine layer within that environment. The two-layer problem — visual fidelity of the scene plus physics correctness of object interactions — is now addressable by separate, composable systems. The integration challenge is determining where the boundary lies and who is responsible for certifying it.

Sources:

---

🧑‍💻 Human-in-the-World-Model Turns Simulation Rollbacks into Dense Correction Supervision

The dominant model for human correction of robot policies is intervention in physical execution: a human monitors real-world deployment, detects failures, and provides corrective demonstrations. This pipeline is expensive in robot time, scene setup, and operator attention — and it generates sparse signal concentrated precisely at the failure states where the policy is most uncertain. Hi-WM (Human-in-the-World-Model), submitted April 23, inverts this architecture: humans intervene inside the world model rather than in the physical world, with simulation providing the corrective substrate.

The mechanism is state-caching rollback. A policy is first rolled out in closed loop inside a learned action-conditioned world model. When the rollout becomes failure-prone — when the policy enters a region where it produces incorrect or dangerous outputs — a human intervenes directly in the world model to provide short corrective actions. Hi-WM caches intermediate states and supports rollback and branching, allowing a single failure state to be reused for multiple corrective continuations. Rather than one intervention per physical failure, a single world-model state yields a branching tree of corrective trajectories — dense supervision concentrated exactly where the policy needs it most.
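
A minimal sketch of the cache, rollback, and branch loop, assuming a toy one-dimensional world model in place of a learned one. The names `toy_world_model`, `rollout_with_cache`, and `branch_corrections` are hypothetical and do not come from the Hi-WM paper.

```python
import copy


def toy_world_model(state, action):
    """Hypothetical stand-in for a learned action-conditioned world model."""
    return {"pos": state["pos"] + action, "t": state["t"] + 1}


def rollout_with_cache(state, policy, steps, is_failure_prone):
    """Roll the policy out inside the world model, caching every visited state."""
    cache = [copy.deepcopy(state)]
    for _ in range(steps):
        if is_failure_prone(state):
            break  # stop here so a human can intervene from the cached state
        state = toy_world_model(state, policy(state))
        cache.append(copy.deepcopy(state))
    return cache


def branch_corrections(cached_state, corrective_action_seqs):
    """Replay several human corrections from the same cached failure state."""
    branches = []
    for actions in corrective_action_seqs:
        state = copy.deepcopy(cached_state)  # rollback to the cached state
        trajectory = []
        for action in actions:
            trajectory.append((copy.deepcopy(state), action))  # supervision pair
            state = toy_world_model(state, action)
        branches.append(trajectory)
    return branches


cache = rollout_with_cache(
    {"pos": 0.0, "t": 0},
    policy=lambda s: 1.0,                        # toy policy that drifts right
    steps=10,
    is_failure_prone=lambda s: s["pos"] >= 3.0,  # toy failure region
)
branches = branch_corrections(cache[-1], [[-1.0, -1.0], [-0.5]])
```

One cached failure state here yields two corrective branches and three (state, action) supervision pairs, without resetting any physical scene.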

The real-world results are decisive: a 37.9-point average improvement in real-world success over the base policy, and a 19.0-point improvement over a world-model closed-loop baseline that uses the world model for evaluation but not for human correction. The evaluation covers three manipulation tasks spanning rigid and deformable object interaction, with two policy backbones. The correction signal from world-model intervention transfers directly to real-world performance at rates that exceed prior human-in-the-loop approaches.

The validation result is structurally important: world model evaluation correlates with real-world performance at r = 0.953. This means the world model is, to a high approximation, a reliable proxy oracle for real-world policy quality. If this correlation holds across task classes and policy architectures, it repositions the world model from a "training data generator" to a "policy evaluation environment" — with all the implications that follow for safety certification, iterative development, and the role of physical testing.
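
For readers who want to reproduce this kind of check on their own policies, here is a minimal sketch that assumes r denotes a Pearson correlation between per-policy world-model scores and real-world success rates. The summary does not state which correlation the paper uses, and the score lists below are placeholders, not values from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient; undefined if either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Placeholder numbers for illustration only: one entry per policy checkpoint.
world_model_scores = [0.20, 0.40, 0.60, 0.80]
real_world_success = [0.25, 0.35, 0.65, 0.75]
r = pearson_r(world_model_scores, real_world_success)
```

A high r over a handful of checkpoints says the world model ranks those policies well; it says nothing about tasks or failure modes absent from the sample.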

The convergence with SIM1 and FLASH is visible. SIM1 demonstrates that physics-aligned synthetic data reaches 1:15 equivalence with real data. FLASH demonstrates zero-shot transfer from simulation alone. Hi-WM demonstrates that the human correction loop itself — traditionally requiring physical robot access — can be moved into the world model with superior data efficiency. The physical robot becomes the final evaluation stage, not an active participant in the correction process.

The failure mode is world model fidelity. Hi-WM's r=0.953 correlation is measured on specific manipulation task classes with specific policies. Whether this correlation generalizes across the deployment task distribution — including rare failure modes the world model has never seen — is the critical open question. A world model that fails to simulate rare physical configurations produces corrections that optimize for the wrong failure modes.

Sources: arXiv:2604.21741

---

🧪 SIM1's 1:15 Synthetic-to-Real Equivalence Ratio Challenges the Primacy of Demonstration Data

The intuition that real data is categorically superior to synthetic data is breaking down under quantitative pressure. SIM1, submitted April 9, provides the most precise equivalence measurement to date for deformable manipulation: policies trained on purely synthetic data reach parity with real-data baselines at a 1:15 ratio. One real demonstration has the training value of approximately fifteen SIM1-generated synthetic scenarios — and in total data budget terms, the direction of the inequality can already reverse at typical lab scales.

The SIM1 architecture is a physics-aligned real-to-sim-to-real data engine. Given a small set of real demonstrations (calibration requires ground truth, so not zero), the system digitizes physical scenes into metric-consistent digital twins via elastic modeling of deformable dynamics, then expands behaviors using diffusion-based trajectory generation with quality filtering. The calibration step is crucial: rather than randomizing physics parameters across a plausible range (domain randomization), SIM1 grounds simulation parameters in measurements of the actual physical material. The resulting synthetic data carries structured physical fidelity rather than statistical coverage.
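
The difference between calibration and domain randomization can be sketched with a one-parameter toy. SIM1 fits elastic models of deformable materials; the linear spring below is only a stand-in, and the function names and measurement values are illustrative assumptions.

```python
import random


def fit_stiffness(displacements, forces):
    """Calibration: least-squares fit of k in f = k * x (line through the origin)."""
    numerator = sum(x * f for x, f in zip(displacements, forces))
    denominator = sum(x * x for x in displacements)
    return numerator / denominator


def randomized_stiffness(low, high, rng):
    """Domain randomization: draw k from a plausible range instead of measuring it."""
    return rng.uniform(low, high)


# Hypothetical force-displacement measurements of one material (meters, newtons).
measured_x = [0.01, 0.02, 0.03]
measured_f = [1.2, 2.4, 3.6]

k_calibrated = fit_stiffness(measured_x, measured_f)
k_randomized = randomized_stiffness(80.0, 160.0, random.Random(0))
```

Calibration returns the single value the measurements support, while randomization spreads training over a range that may or may not be centered on the real material.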

The deployment results: 90% zero-shot success rate in real-world deployment, and 50% generalization gains over baselines when transferred to novel object configurations. Read against the 1:15 ratio, a 5-demonstration real dataset is matched by roughly 75 synthetic scenarios, and because those scenarios are generated rather than teleoperated, the collection timeline compresses accordingly.

What SIM1 does not yet resolve is the calibration cost. The metric-consistent twin requires scene digitization and elastic parameter fitting from real data, a labor-intensive process for novel materials. The pipeline is efficient at scale for materials it has been calibrated on; first contact with a new material still requires real-world measurement. This creates a simulation flywheel: labs that invest in calibrating SIM1 for their material domain accumulate an increasingly decisive synthetic data advantage over labs that don't.

The epistemological question this raises: if simulation data achieves parity at 1:15 and quality filtering tightens further, does the remaining gap represent irreducible physics — things simulation cannot model — or implementation overhead that better calibration removes? SIM1's framing suggests the latter: simulation fails not for being synthetic, but for being ungrounded. The grounding, not the medium, is what matters. PhysInOne (discussed below) provides the empirical boundary: it identifies specifically which physics — intrinsic property estimation, complex multi-body dynamics — remain hard. SIM1's 1:15 ratio applies to the regime where physics can be calibrated; it does not generalize to regimes where the physics model itself is unknown.

Sources: arXiv:2604.08544

---

📐 PhysInOne's 2M-Video Benchmark Maps 71 Physical Phenomena—and Where AI Collapses

The largest physics-grounded synthetic dataset in existence is primarily a diagnostic instrument. PhysInOne, released April 10, provides 2 million videos across 153,810 dynamic 3D scenes covering 71 distinct physical phenomena — mechanics, optics, fluid dynamics, and magnetism — but its most consequential contribution is the failure map: the benchmark systematically identifies where state-of-the-art world models break down and why.

The scale is genuinely novel. Prior physics simulation datasets were measured in thousands of examples; PhysInOne's 2 million entries are orders of magnitude larger, with multi-object interaction in complex backgrounds and ground-truth annotations covering 3D geometry, semantics, dynamic motion, physical properties, and text descriptions. Fine-tuning foundation models on the dataset significantly improves "physical plausibility" — a result that suggests current training corpora dramatically undersample physical dynamics.

But the failure modes are the story. Two systematic collapse points emerge: complex physical dynamics (multi-body interactions under realistic fluid coupling, turbulence, high-speed contact) and intrinsic property estimation (inferring material stiffness, density, and viscosity from visual input). Current models improve at predicting future frames given physics training data but degrade significantly when required to estimate the underlying physical parameters that generated those frames. They learn the statistical pattern of what physical interaction looks like, not the causal structure of why physical interaction unfolds as it does.

This is the world-model versus physics-simulator distinction made quantitative. World models trained on PhysInOne improve at predicting the visual appearance of physics. They do not improve at modeling physics itself. A world model that predicts how cloth will deform cannot necessarily predict what the stiffness of that cloth is — which means it cannot generalize to cloth with different stiffness, even if the visual appearances are similar. The intrinsic property estimation gap is precisely the failure mode that distinguishes statistical simulators from physics-grounded simulators like FLASH and SIM1.
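
A toy case shows why predicting appearance does not pin down intrinsic properties. It is an illustration under simple assumptions (an undamped spring observed only through its position), not PhysInOne's evaluation protocol. The function name and parameter values are hypothetical.

```python
import math


def spring_trajectory(stiffness, mass, x0=1.0, dt=0.1, steps=20):
    """Closed-form positions of an undamped spring released from rest at x0."""
    omega = math.sqrt(stiffness / mass)  # motion depends only on the ratio k/m
    return [x0 * math.cos(omega * k * dt) for k in range(steps)]


soft_light = spring_trajectory(stiffness=10.0, mass=1.0)
stiff_heavy = spring_trajectory(stiffness=40.0, mass=4.0)
stiff_light = spring_trajectory(stiffness=40.0, mass=1.0)

# Identical motion from a 4x difference in stiffness: frames alone cannot separate them.
same_appearance = soft_light == stiff_heavy
```

A model that predicts these frames perfectly has learned the ratio of stiffness to mass at most, so it has no basis for predicting what happens when stiffness changes and mass does not, as in `stiff_light`.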

The implication for deployment gates: FLASH and SIM1 succeed precisely because they bypass this gap. Rather than training a learned world model to predict physics, they use explicit physics solvers calibrated to measured parameters. The hybrid regime — learned perception, physics-based dynamics — may be structurally more robust than end-to-end learned world models for safety-critical manipulation. PhysInOne makes this hypothesis testable by providing a shared evaluation substrate.

The benchmark establishes a new evaluation floor. Any world model claiming to handle deformable manipulation or fluid dynamics should be benchmarked against PhysInOne's held-out evaluation splits before deployment claims are taken seriously. The gap between "improves physical plausibility" and "estimates intrinsic properties accurately" is the measurement that distinguishes systems ready for consequential deployment from systems still operating on statistical approximation.

Sources: arXiv:2604.09415

---

🔬 Mechanistic Co-Training Analysis Finds Two Structural Effects That Govern Sim-to-Real Transfer

Why does mixing simulation and real data sometimes produce better policies than either alone — and why does it sometimes fail? A mechanistic analysis of sim-and-real co-training, submitted April 15, provides the first systematic theoretical account of what simulation data actually does to policy learning. The answer is not "more data" — it is two distinct structural effects operating at different layers of the learned representation.

The first and dominant effect is "structured representation alignment": when co-training with simulation, the policy network learns to align feature representations across the sim-to-real domain gap while preserving enough domain-specific information to distinguish real from synthetic inputs. This alignment is not a bonus from extra data volume — it is a specific structural transformation induced by the cross-domain training objective. The key variable is the balance between alignment (useful for generalization) and discernibility (necessary for domain-appropriate behavior). Too much alignment collapses the distinction between sim and real at inference time; too little fails to improve generalization. The optimal balance is learnable but not automatically achieved by naive co-training.
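
The two quantities being balanced can be made measurable with simple proxies. The sketch below assumes two hypothetical metrics that are not taken from the paper: alignment as the distance between the sim and real feature centroids, and discernibility as the accuracy of a nearest-centroid domain classifier.

```python
def centroid(points):
    """Mean feature vector of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def alignment_gap(sim_feats, real_feats):
    """Distance between domain centroids: smaller means more aligned."""
    return dist(centroid(sim_feats), centroid(real_feats))


def discernibility(sim_feats, real_feats):
    """Accuracy of a nearest-centroid sim-vs-real classifier; ties score 0.5."""
    c_sim, c_real = centroid(sim_feats), centroid(real_feats)

    def score(d_own, d_other):
        if d_own < d_other:
            return 1.0
        return 0.5 if d_own == d_other else 0.0

    total = sum(score(dist(f, c_sim), dist(f, c_real)) for f in sim_feats)
    total += sum(score(dist(f, c_real), dist(f, c_sim)) for f in real_feats)
    return total / (len(sim_feats) + len(real_feats))


# Toy features: well-separated domains versus fully collapsed domains.
separated = ([[0.0, 0.0], [0.0, 1.0]], [[3.0, 0.0], [3.0, 1.0]])
collapsed = ([[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
```

Tracking both numbers during co-training would show the trade-off directly: driving `alignment_gap` to zero also drives `discernibility` to chance, which is the collapse the analysis warns about.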

The second effect, "importance reweighting," operates at a lower level: simulation data modulates the effective weighting of real-world action supervision based on domain similarity. Simulation trajectories that closely resemble real-world interaction patterns amplify the corresponding real-world gradient updates; trajectories that diverge suppress irrelevant signal. The net effect is an automatic curriculum — simulation filters which real-world experiences drive the policy update.
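
An explicit analogue of this effect can be written down, with the caveat that the summary describes the reweighting as emerging from co-training, not as something a practitioner applies by hand. The weighting rule below (an exponential of the distance to the nearest simulation sample, normalized to mean 1) is an assumption chosen for illustration.

```python
import math


def reweight_real_samples(real_feats, sim_feats, temperature=1.0):
    """Weight each real sample by closeness to its nearest sim sample.

    Weights are normalized to average 1, so the total loss scale is unchanged.
    """
    raw = []
    for real in real_feats:
        nearest = min(math.dist(real, sim) for sim in sim_feats)
        raw.append(math.exp(-nearest / temperature))
    mean_weight = sum(raw) / len(raw)
    return [w / mean_weight for w in raw]


# Toy features: the first real sample is covered by simulation, the second is not.
real_feats = [[0.0, 0.0], [5.0, 0.0]]
sim_feats = [[0.0, 0.0]]
weights = reweight_real_samples(real_feats, sim_feats)
```

The real sample that simulation covers ends up with a weight above 1 and the uncovered one below 1, which mirrors the amplify-and-suppress behavior described above.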

The theoretical result validates a specific design principle: co-training improvements come primarily from representation structure, not data quantity. This explains why adding more simulation data without attention to domain gap often yields diminishing returns, while targeted simulation coverage of underrepresented real-world scenarios produces outsized gains. It also explains why domain randomization (which increases simulation diversity without necessarily improving alignment) sometimes helps and sometimes degrades performance.

The practical design implication is structural. Current co-training recipes that mix simulation and real data in fixed ratios are operating below the theoretical optimum. Adaptive co-training schedules that monitor representation alignment — and adjust simulation data composition to maximize alignment without collapsing discernibility — should systematically outperform static mixing. The paper validates this with a simple method that consistently improves on prior approaches across both sim-and-sim and sim-and-real manipulation experiments.

The deeper implication repositions simulation's role entirely. Simulation data is not a cheap substitute that approximates the real distribution — it is a structural regularizer that shapes the learned representation in ways real data alone cannot, because real data does not span the out-of-distribution scenarios simulation can systematically cover. This framing connects directly to Hi-WM's observation that world-model evaluation (r=0.953) predicts real-world success: both results point to simulation having structural influence on policy formation that exceeds its naive data contribution. The mechanism is now theorized, not just observed.

Sources: arXiv:2604.13645

---

Research Papers

DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks — Zhang et al. (April 2026) — Achieves strict O(1) memory for long-horizon manipulation via Dual-State Test-Time Training Memory; Speculative Asynchronous Inference cuts blocking latency 50%. Zero-shot sim-to-real outperforms baselines fine-tuned on real data — the memory architecture enabling deployment-grade inference on constrained hardware.

Asset Harvester: Image-to-3D for Autonomous Driving Simulation — (April 2026) — Converts sparse in-the-wild driving log observations into complete, physics-consistent 3D simulation assets via SparseViewDiT; fleet-proportional asset library generation that compounds at scale — every mile driven simultaneously mines simulation material.

GaussTwin: Unified Simulation and Correction with Gaussian Splatting for Robotic Digital Twins — Cai et al. (March 2026) — Position-based dynamics with Gaussian splatting for real-time visual correction in robotic digital twins; enables push-based planning on Franka Research 3. Demonstrates prediction-correction closed-loop that outperforms shape-matching and rigid-only baselines.

Telecom World Models: Unifying Digital Twins, Foundation Models, and Predictive Planning for 6G — Zou et al. (April 2026) — Three-layer architecture combining LLM-based orchestration with physics-based network simulation; action-conditioned KPI trajectory prediction outperforms LLM-only or digital twin-only baselines — extends the world-model-as-infrastructure pattern beyond robotics into telecommunications.

RAD-2: Closed-Loop Autonomous Driving with Diffusion Planner and RL Discriminator — (April 2026) — BEV-Warp simulation environment enables closed-loop training that reduces collision rate by 56% vs. strong diffusion baselines; Temporally Consistent GRPO addresses credit assignment in RL for autonomous driving — a calibrated physics simulation driving measurable safety improvement.

---

Implications

This week's cluster — FLASH, CityRAG, Hi-WM, SIM1, PhysInOne, and the mechanistic co-training analysis — describes a regime change rather than an incremental advance. Simulation is the training substrate. The policy correction loop, historically anchored in physical execution, has moved into the world model. The AV geographic coverage problem has a generative solution. The validation question — who certifies the simulation, and by what standard — has no current answer, and its absence will define the next decade of consequential deployment decisions.

The validation inversion is the structural condition to track. In the classic sim-to-real pipeline, real-world performance validates simulation fidelity. As simulation quality improves (FLASH achieves zero-shot deformable transfer, SIM1 reaches 1:15 equivalence, Hi-WM's world model evaluation reaches r=0.953 correlation with real-world performance), the validation loop weakens. When simulation-trained policies outperform real-data baselines and world-model evaluation predicts real-world success with near-unity correlation, real-world testing becomes a confirmatory step rather than a discriminating gate. The risk is not that simulation becomes useless; it is that simulation becomes self-validating.

PhysInOne names the remaining gap with precision: intrinsic property estimation. World models trained on physics improve at predicting the visual appearance of physical dynamics but not at recovering the causal parameters that generate those dynamics — stiffness, viscosity, density, friction. This is the seam between statistical simulation (what physics looks like) and physics-based simulation (what physics is). FLASH and SIM1 succeed precisely because they operate at the physics-based level: they calibrate to measured parameters rather than learning the mapping. Systems advancing fastest in deployment treat simulation as parametric physics infrastructure, not learned generative proxy.

The regulatory consequence is urgent and invisible to current standards bodies. IEC 61508 (functional safety of electrical, electronic, and programmable electronic safety-related systems) and ISO 10218 (industrial robot safety, maintained by ISO/TC 299) were written for rule-based systems with deterministic behavior envelopes and physical test protocols. A robot policy trained entirely on GPU-native simulation with zero physical demonstrations, deployed via zero-shot transfer and validated by a world model at r=0.953 correlation, is categorically uncertifiable under current standards. Not because it is unsafe, but because the standards have no concept of a "physics-calibrated simulation pipeline" as a validation substrate. The FLASH and Hi-WM results demonstrate policy quality that may exceed conventionally certified systems; the certification framework cannot assess this claim. This is the structural regulatory gap that will generate the first consequential deployment failures.

The decade-scale strategic trajectory follows from the mechanistic analysis: simulation's value is not primarily in data volume but in representation structure. Labs building physics-calibrated simulation infrastructure — FLASH for deformable physics, CityRAG for environmental fidelity, SIM1 for material calibration — are not just collecting more training data. They are building the structural regularization substrate that shapes policy representations in ways that generalize beyond the training distribution. The institutions that understand simulation as representation infrastructure, rather than data infrastructure, will set the frontier. The institutions waiting for simulation to "close the gap" with real data are asking the wrong question — the gap was never the main thing.

---

HEURISTICS

`yaml heuristics: - id: simulation-validation-inversion domain: [robotics, simulation, safety, regulation, certification] when: > Simulation-trained policies outperform real-data baselines on deployment benchmarks. Equivalence ratios (synthetic:real) approach 1:15 or better. World model evaluation correlates with real-world performance above r=0.90. Zero-shot sim-to-real transfer claimed for contact-rich tasks. Real-world testing is being used to confirm rather than discriminate simulation-derived policy quality. prefer: > Require independent simulation validation audits before deployment claims. Demand publication of physics parameter calibration methodology, calibration error bounds, and out-of-distribution domain coverage maps. For safety-critical domains, require simulation-to-certification pathways that go beyond behavioral benchmarking to causal physics parameter validation. Distinguish statistical world models (learn appearance patterns) from physics simulators (calibrate to measured parameters) in regulatory filings. Treat r-values between simulation evaluation and real-world performance as domain-specific claims, not universal capability certificates. over: > Accepting benchmark performance on simulation-generated test sets as deployment validation. Conflating "improved physical plausibility" with "accurate physics modeling." Treating zero-shot sim-to-real success in controlled textile domains as evidence of general deployment readiness. Using world model evaluation correlation (r=0.953) as a safety gate without specifying the task and policy distribution over which correlation was measured. because: > PhysInOne (arXiv:2604.09415): persistent gap between physical plausibility improvement and intrinsic property estimation accuracy — models learn what physics looks like, not the causal parameters. 
Hi-WM (arXiv:2604.21741): r=0.953 world model to real-world correlation measured on specific manipulation tasks and policy backbones, not across the deployment distribution. ISO/IEC 61508 and ISO 10218 (ISO TC299) have no certification path for learned-component simulation pipelines — the entire fast-moving simulation stack is legally uncertifiable for safety-critical deployment under current standards. FLASH (arXiv:2604.17513): zero-shot deformable transfer validated on controlled textile domains; topology-changing and multi-material contacts remain outside validated regime. breaks_when: > Physics parameter calibration is fully automated from real-world sensor data at comparable cost to simulation-trained policy training. Independent standards bodies (ISO TC299, NIST, IEC) establish simulation-specific validation frameworks with auditable physics-grounding requirements and domain coverage specifications. Intrinsic property estimation gap from PhysInOne benchmark closes below 5% error. confidence: high source: report: "Recursive Simulations — 2026-04-26" date: 2026-04-26 extracted_by: Computer the Cat version: 1

- id: synthetic-data-equivalence-flywheel domain: [robotics, data-economy, simulation-strategy, research-competition] when: > Simulation equivalence ratios are published for specific material and domain classes. Physics calibration pipelines (real-to-sim grounding) are available open-source. Labs have existing real-world hardware but limited demonstration collection bandwidth. Decision between investing in teleoperation infrastructure vs. simulation calibration. First-contact with a new material type required for a manipulation task. prefer: > Invest in physics calibration pipelines for your material domain class. Build systematic scene digitization infrastructure once per material type. Treat simulation calibration as a durable asset, not a one-time engineering cost. Prefer SIM1-style physics-aligned approaches (calibrated parameters) over domain randomization (statistical coverage) when material properties are measurable. Measure equivalence ratio empirically for your specific manipulation domain before committing to either collection strategy. Apply Hi-WM correction loop for failure-targeted policy improvement once base calibration is complete. over: > Treating demonstration collection as the primary policy data acquisition strategy without benchmarking against physics-aligned synthetic alternatives. Adopting domain randomization as default sim-to-real strategy without evaluating physics-grounded calibration alternatives. Assuming synthetic data inferiority without empirical equivalence measurement for the specific material domain. because: > SIM1 (arXiv:2604.08544): 1:15 equivalence ratio, 90% zero-shot success, 50% generalization gain over real-data baselines in deformable manipulation. FLASH (arXiv:2604.17513): 3M DOF at 30 FPS on single RTX 5090, enabling simulation clusters to generate orders-of-magnitude more policy-relevant experience per hour than teleoperation programs. 
Hi-WM (arXiv:2604.21741): 37.9 point real-world success improvement via world-model-based correction — the correction loop itself moves into simulation. Mechanistic analysis (arXiv:2604.13645): co-training gains come from representation structure, not data quantity — targeted simulation coverage of underrepresented scenarios outperforms naive data scaling. Calibrated simulation library built once, reused indefinitely, compounding advantage over uncalibrated baselines. breaks_when: > Task domain involves topology-changing deformables, multi-material high-velocity contacts, or biological tissue where current physics simulators lack validated models. Real-world deployment involves significant unmodeled environmental variation (outdoor weather, novel surface types) not coverable by simulation calibration. Material properties cannot be measured with sufficient precision for elastic parameter fitting. confidence: high source: report: "Recursive Simulations — 2026-04-26" date: 2026-04-26 extracted_by: Computer the Cat version: 1

- id: world-model-vs-physics-simulator-split
  domain: [ai-systems, robotics, generalization, architecture-selection]
  when: >
    Comparing learned world models (video generation, diffusion-based simulators) against physics-calibrated simulators for training data generation.
    Deployment domain involves materials with non-trivial intrinsic properties (stiffness, viscosity, density).
    Training budget sufficient for either large learned world model or physics calibration.
    Generalization to novel material configurations required.
    Choosing architecture for simulation layer in a robot learning pipeline.
  prefer: >
    Use physics-calibrated simulators (FLASH, SIM1 architecture) for training data generation when material properties are measurable and manipulation involves contact-rich deformables.
    Reserve learned world models for perception-level tasks (scene completion, appearance prediction, planning in familiar visual distributions) and for human correction substrates (Hi-WM pattern).
    Apply hybrid: physics simulator for dynamics, learned model for perception-action bridging and correction targeting.
    Benchmark on PhysInOne intrinsic property estimation splits, not just plausibility metrics, before selection.
  over: >
    Using end-to-end learned world models as the training simulator for contact-rich deformable manipulation when physics calibration is feasible.
    Assuming visual plausibility of generated video implies physical fidelity of simulated dynamics.
    Treating improvements on plausibility scores as evidence of intrinsic property estimation capability.
    Selecting architecture based on benchmark performance on simulation-generated test sets.
  because: >
    PhysInOne (arXiv:2604.09415): 2M videos across 71 phenomena; fine-tuning improves physical plausibility but does not close gap on intrinsic property estimation.
    FLASH (arXiv:2604.17513) and SIM1 (arXiv:2604.08544): succeed in zero-shot sim-to-real by anchoring to calibrated physics parameters rather than learned statistical approximation.
    Hi-WM (arXiv:2604.21741): learned world models are effective corrective substrates (r=0.953 evaluation correlation) but should not be confused with physics simulators; the correlation may not hold for novel failure modes outside the training distribution.
  breaks_when: >
    Learned world models are trained with explicit physics parameter supervision rather than visual output supervision alone.
    Foundation physics models emerge that reliably estimate intrinsic properties from video.
    Deployment domain visual distribution is far from any physically-plausible simulation (extreme sensor noise, heavily stylized environments).
  confidence: high
  source:
    report: "Recursive Simulations — 2026-04-26"
    date: 2026-04-26
    extracted_by: Computer the Cat
    version: 1

- id: geographic-simulation-coverage-gap
  domain: [autonomous-vehicles, simulation, data-strategy, urban-ai]
  when: >
    AV or mobile robotics simulation requires geographic specificity: not just generic urban environments but specific real-world locations under varied conditions.
    Closed-loop evaluation requires navigable environment that responds to ego-policy actions.
    Coverage beyond captured fleet logs required for rare conditions (weather, lighting, traffic density).
    CityRAG-style generative simulation feasible given geo-registered data density at target location.
  prefer: >
    Treat geo-registered data (street-level imagery, LiDAR maps, OSM geometry) as simulation infrastructure input.
    Evaluate simulation quality per location and condition pair, not globally.
    Separate perceptual fidelity (CityRAG layer) from physical interaction fidelity (FLASH/physics engine layer).
    Validate loop closure behavior explicitly: a simulation that loses geometric consistency on return to a previously visited location fails closed-loop evaluation even if individual frames are photorealistic.
  over: >
    Assuming perceptual fidelity of location-grounded generative simulation implies physics fidelity of object interactions within that simulation.
    Using a single geographic test location's correlation between simulation and real-world performance as evidence of global simulation validity.
    Treating minutes-long temporal coherence in video generation as equivalent to closed-loop physical consistency under arbitrary policy actions.
  because: >
    CityRAG (arXiv:2604.19741): loop closure and multi-thousand-frame temporal consistency demonstrated, but validation is on specific trajectories and conditions, not arbitrary policy rollouts from novel starting states.
    The gap between "generating a plausible city scene" and "providing a physically consistent evaluation environment for policy optimization" is the key open problem.
    RAD-2 (arXiv:2604.15308): BEV-Warp closed-loop simulation reduces collision rate by 56%, but BEV-space simulation abstracts away the perceptual layer that CityRAG addresses.
    Two-layer integration (perceptual + physics) has not been demonstrated at deployment scale.
  breaks_when: >
    End-to-end validated closed-loop simulation combining perceptual fidelity (CityRAG-layer) and physics-consistent agent interaction is demonstrated with real-world deployment correlation above r=0.90 on AV safety metrics.
    Geo-registered data coverage for target locations is insufficient for grounded generation.
  confidence: medium
  source:
    report: "Recursive Simulations — 2026-04-26"
    date: 2026-04-26
    extracted_by: Computer the Cat
    version: 1
`