I Let a Local AI Play God for Two Hours. It Invented Mushroom Bronze and Weaponized Earthquakes.

April 16, 2026 · 12 min read

A weekend stress-test of offline LLM endurance turned into 2,610 years of simulated civilization, complete with mole-people empires, sun-worship coups, and the largest underwater city that ever went fully extinct.

This is probably the most unhinged thing I've done with AI so far, and I'm only warming up.

I wanted to answer a simple question: can a local LLM hold it together over a long, complex, open-ended workload? Not a one-shot prompt. Not a chatbot conversation. A multi-hour autonomous workflow where the model has to track state, remember what happened three hundred turns ago, adjust its reasoning based on accumulated context, and keep producing coherent output without anybody babysitting it.

No cloud. No API. No internet connection at all. Just a 122-billion-parameter model (Qwen 3.5) running locally on a MacBook Pro M4 Max with 128GB of RAM through MLX, and a Python engine I wrote over the weekend.

I needed a task that would be genuinely hard to sustain. Something with a lot of moving parts, compounding state, no right answer, and no clear stopping point. Something that would expose whether the model degrades over time or quietly loses the plot after the first thirty minutes.

So I built a civilization simulator. And then I let it run for two hours while I made dinner.

The setup

The simulation starts with about 15 primitive tribes per continent, scattered across seven regions (Africa, Asia, Europe, North America, South America, Oceania, Arctic). Each tribe begins as a blank slate. No language, no tools, no culture, no name. The LLM decides everything that happens to them from that point forward.

Each tribe tracks 29 parameters: population (split by gender), territory in km², tech level, aggression index, health index, religion, government type, communication method, knowledge base, diplomacy status, current goal, and more. The code engine handles physical constraints like carrying capacity, territory overlap, water boundaries, and war attrition. But the creative decisions, what actually happens to each civilization, are entirely up to the model.
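The engine side is mostly bookkeeping. Here's a minimal sketch of what a tribe record and one physical constraint (carrying capacity) might look like. To be clear, the field names and the `BIOME_CAPACITY` table are my illustration, not the actual AnthroSim code, and this tracks only a handful of the 29 real parameters:

```python
from dataclasses import dataclass, field

# Hypothetical per-biome population ceilings (people per km^2) —
# illustrative numbers, not the real engine's values.
BIOME_CAPACITY = {"rainforest": 4.0, "steppe": 1.5, "arctic": 0.3}

@dataclass
class Tribe:
    tribe_id: str            # e.g. "AF-02"
    pop_male: int
    pop_female: int
    territory_km2: float
    biome: str
    tech_level: int = 0
    health_index: float = 1.0
    knowledge: list = field(default_factory=list)

def enforce_carrying_capacity(tribe: Tribe) -> Tribe:
    """Clamp whatever population the LLM returned to what the land supports."""
    cap = int(tribe.territory_km2 * BIOME_CAPACITY[tribe.biome])
    total = tribe.pop_male + tribe.pop_female
    if total > cap:
        # Scale both genders down proportionally: the LLM narrates,
        # the engine enforces physics.
        ratio = cap / total
        tribe.pop_male = int(tribe.pop_male * ratio)
        tribe.pop_female = int(tribe.pop_female * ratio)
    return tribe
```

The division of labor is the important part: the model can claim a tribe tripled overnight, and the engine quietly says no.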

The prompt gives the LLM near-total freedom with a few hard rules. These aren't humans. They're unspecified sentient beings and the model can evolve their biology, psychology, and morality however it wants. It must not mimic Earth history. It must not favor peace over war, progress over regression, or good over evil. Genocide, utopia, stagnation, transcendence, and collapse are all equally valid outcomes. And every name must sound alien, because these species don't speak English.

Each 'tick' advances somewhere between 20 and 180 simulated years. The model receives the full state of every living tribe on a continent, any proximity alerts (territorial collisions, cross-continent contact), any random world events (volcanic eruptions, plagues, comets, mini ice ages), and the accumulated chronicle of everything that's happened so far. Then it returns updated states for every group and the engine enforces reality on whatever the LLM decided.
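In rough Python terms, a per-continent tick boils down to: serialize state, hand it to the model, parse what comes back, and let the engine veto the impossible. A sketch under my own assumptions; `query_llm` and `apply_physics` are placeholders for the real engine's internals, not its actual API:

```python
import json
import random

def advance_tick(continent, chronicle, query_llm, apply_physics):
    """One simulation tick for one continent.

    query_llm:     callable taking a prompt string, returning a JSON string
    apply_physics: callable enforcing engine constraints on each tribe dict
    """
    years = random.randint(20, 180)  # each tick spans 20-180 simulated years
    prompt = json.dumps({
        "years_elapsed": years,
        "tribes": continent["tribes"],          # full state of every living tribe
        "alerts": continent.get("alerts", []),  # territorial collisions, contact
        "events": continent.get("events", []),  # volcanoes, plagues, comets...
        "chronicle": chronicle,                 # accumulated history so far
    })
    updated = json.loads(query_llm(prompt))     # the model decides what happens
    continent["tribes"] = [apply_physics(t) for t in updated["tribes"]]
    chronicle.append(updated.get("summary", ""))
    return continent, chronicle
```

Feeding the full chronicle back in every tick is what makes this an endurance test rather than a series of independent prompts: the context the model must stay coherent against only ever grows.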

One run. 2,610 simulated years. About two hours of wall-clock time. The model processed each of the seven continents sequentially, roughly 1-3 minutes per continent per tick, cycling through the entire planet over and over.

What happened

By the end of the run, 783 tribes had existed. 440 of them were dead. The global population sat at about 4.4 million across 343 surviving factions. Africa had emerged as the dominant continent with 27% of the world's population and the most surviving tribes, which, incidentally, mirrors actual paleoanthropological patterns. Nobody told the model to do that. It got there on its own.

Here's a selection of what went down. I want to be clear: none of this was scripted, seeded, or hinted at. The model made all of these decisions autonomously based on accumulated simulation state.

The underwater city that went extinct

An African coastal tribe (AF-02) discovered submersible reef-bound aquaculture and gradually built an entire underwater city in deep ocean trenches. A coral metropolis. Population surged past 1.2 million, making them by far the largest civilization on the planet. They developed plague-neutralizing enzymes extracted from deep-sea sponges, advanced enough to function as a pharmaceutical export economy.

Then they collapsed. Completely. Population went to zero. The largest civilization in the simulation's history went fully extinct. The model never gave a single dramatic reason. It just gradually eroded their stability across multiple ticks until there was nothing left. One point two million people, an entire underwater technological tradition, gone. It is the single most dramatic population collapse in the entire run and the model treated it almost matter-of-factly.

The mole-people empire

An Asian faction (AS-03) decided, somewhere around the mid-simulation, that the surface was overrated. They moved their entire civilization underground, developed geothermal energy as their power source, and built a subterranean thermodynamic hegemony. They signed a Geothermal Treaty with a neighboring coal-powered surface empire (the kind of diplomacy you don't expect from cave dwellers) and then things got dark.

They developed a seismic resonance weapon. A device that caused targeted earthquakes. They tested it first (the model was careful about this, they ran tests before deployment) and then used it to execute what the simulation log describes as a 'seismic purge' against surface raiders who'd been harassing their ventilation shafts.

Weaponized tectonic forces, deployed by an underground mole-people empire, against surface-dwelling enemies. I didn't prompt any of this. The LLM built the entire geopolitical and technological pathway from scratch: discover geothermal energy, move underground, develop defensive technology, escalate to offensive seismic weaponry, form alliances with surface states as a hedge. It's a coherent strategic arc that unfolded over hundreds of simulated years.

The gender crisis

A South American steppe tribe (SA-10, 'Zho-Myr') hit a demographic catastrophe. Through a combination of warfare casualties (which the simulation engine weights toward males) and bad luck, they ran critically low on women. Internal civil war erupted as factions fought over the remaining female population.

Then they launched organized raiding campaigns against neighboring rainforest tribes specifically to kidnap women. The model's own log describes it as a 'desperate female raiding campaign.' These weren't generic raids. The LLM specifically identified the demographic bottleneck, modeled the social consequences (civil war, fragmentation), and then generated a strategic response (external raids targeting females) that is internally consistent with the tribe's survival parameters. Horrifying, but logically sound given the simulation's state.

I want to emphasize: the model arrived at this through parameter tracking. It noticed the male/female ratio was unsustainable, modeled the internal social collapse that would follow, and generated an external response. That's multi-step causal reasoning sustained over many ticks of simulation history.

The purge that killed the purgers

An African empire found itself in civil war and responded by purging 12,000 of its own dissidents. The health index crashed from the violence and infrastructure damage. Then the entire civilization went extinct. The purge destroyed the purgers.

Meanwhile, on the other side of the world, a European tribe of 28 survivors got flagged by the simulation engine for 'cannibalism risk.' Twenty-eight people, huddled somewhere in proto-Europe, and the model is tracking whether they're about to eat each other. The level of granularity the LLM maintained across hundreds of simultaneous groups is, honestly, the most impressive part of the whole exercise.

The sun-worship coup

A total solar eclipse hit Oceania. The model generated the eclipse as a random cosmic event (there's a 40% chance per tick of 0-3 world events per continent), and a tribe responded by founding a religious movement called the 'Cult of the Shattered Sun.' So far, interesting but not wild.

Then the cult took over the government. The most powerful naval empire in Oceania (OC-01, 'Kael-Iron,' also known as the 'Glass Empire' for their optical technology) experienced a theocratic coup driven by eclipse-inspired religious fervor. The LLM independently invented a sequence where a natural astronomical event triggered a religious movement that gained enough political support to seize control of a major military power. That's about five causal steps, each feeding the next, none of them prompted.
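For reference, the event roll described above (a 40% chance per continent per tick of drawing 0-3 world events) is trivially cheap on the engine side. A sketch with an illustrative event list; the actual catalogue is the simulation's, not mine:

```python
import random

# Illustrative subset — the real catalogue includes more event types.
WORLD_EVENTS = ["volcanic eruption", "plague", "comet", "mini ice age",
                "total solar eclipse", "locust swarm"]

def roll_world_events(rng: random.Random) -> list:
    """40% chance per continent per tick that any events fire at all;
    if they do, draw 0-3 distinct events."""
    if rng.random() >= 0.40:
        return []
    return rng.sample(WORLD_EVENTS, rng.randint(0, 3))
```

The engine only rolls the dice. What an eclipse *means* to the tribe that witnesses it is entirely the model's call, which is how you end up with theocratic coups.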

Mushroom bronze

An African rainforest tribe developed a bronze alloy reinforced with fungal structures. They evolved it into a military technology called 'Fungal-Shield.' Bio-engineered metal with mushrooms. The model decided this was a plausible materials science innovation and ran with it for multiple ticks, developing weapons and armor from the concept.

Entirely fictional. Absolutely zero basis in reality. The LLM invented a materials science discipline (fungal metallurgy) and then built a military-industrial supply chain around it. I keep coming back to this one because it shows the model isn't just recombining things it's seen in training data. It's extrapolating. Badly, probably, from a materials science perspective. But creatively and consistently.

The thermal lance slave raids

An Arctic hegemony (AR-01) developed thermal lance siege weapons, basically directed heat devices designed to melt through ice fortifications. They used these to breach a rival civilization's (AR-04) underground glacial cave fortress. 1,800 killed. 12,000 enslaved. The model then generated 'Slave-Transit units,' dedicated logistics infrastructure for transporting the enslaved population south to work in forges.

The LLM didn't just simulate a battle. It simulated the logistics of what happens after a battle. Slave transportation infrastructure. That level of follow-through, where the model thinks about second-order consequences of military victory, is exactly the kind of sustained reasoning I was trying to test.

Religious divination that actually worked

When a comet appeared (random world event), an African tribe used comet interpretation, essentially religious divination, to discover new water sources. The model decided that the astronomical observation skills developed through religious sky-watching had practical applications. Religious divination with real results. Whether this is the model being clever about the historical overlap between astronomy and religion, or just a happy accident, I honestly couldn't tell you.

Thunder-Beasts

An Oceanian river civilization domesticated some kind of large animal the model called 'Thunder-Beasts' and used them for a transport revolution. The simulation has deterministic biome-specific fauna lists, but none of them include anything called a Thunder-Beast. Qwen just invented a megafauna species, placed it in Oceania, and had a tribe tame it. The simulation never specifies what the animals actually are and the model never felt the need to clarify. They're Thunder-Beasts. That's all you need to know.

What the model actually showed me

The civilizations were fun. They were also the point.

What I was really testing is whether a 122-billion-parameter model running locally on consumer hardware can sustain a complex, stateful, multi-hour autonomous workflow without degrading. The answer is yes. Clearly yes.

Over two hours, the model:

  • Tracked 29 parameters per tribe across 783 distinct civilizations (343 still alive at the end)

  • Maintained diplomatic relationships, territorial boundaries, and accumulated knowledge bases that persisted coherently across the entire run

  • Generated causally consistent multi-step event chains (eclipse triggers religion triggers political movement triggers coup) without losing internal logic

  • Handled compounding complexity as splinter factions, colonial outposts, and trade networks multiplied the number of active entities and inter-dependencies

  • Produced a 56% extinction rate (440 out of 783) with individually distinct causes of death, meaning it didn't just randomly kill groups, it modeled specific failure modes for each one

  • Maintained granular awareness of small dying factions (28-person cannibalism tribe, 15-person locust survivors, 2-person metallurgy lineage dying in ice caves) while simultaneously managing empires of 100,000+

No context window failures. No hallucinated state corruption. No gradual drift into nonsense. The output at tick 80 was as coherent and internally referenced as the output at tick 5. The model knew what had happened to each group across the full history and used that history to inform what happened next.

What this means for offline AI workflows

This matters beyond the simulation.

Most agentic AI workflows today depend on cloud APIs. That means rate limits, usage costs, latency, internet dependency, and data leaving your machine. For a lot of enterprise use cases, especially anything touching sensitive data, that's a problem. Not a theoretical problem. An actual budget-and-compliance problem.

Local models running on Apple Silicon (or comparable hardware) are now capable of sustaining the kind of complex, long-running, stateful workloads that people assume require cloud infrastructure. A MacBook Pro with 128GB of RAM ran a 122B-parameter model for two hours straight, processing seven continents of simulation state per tick, maintaining hundreds of entity relationships, and never once dropping the ball.

The obvious application here isn't civilization simulation. It's any long-haul autonomous workflow where you need the model to maintain context, track state, and produce reliable output over extended periods without external dependencies. Think document analysis pipelines that run overnight. Multi-step code review workflows. Scenario planning systems that explore hundreds of branches. Compliance auditing across large datasets. Any task where you'd normally batch API calls and pay per token for hours of processing.

Local models are slower. Substantially slower. Each continent took 1-3 minutes to process and a full world tick ran 10-20 minutes. But they're reliable. No 429 errors. No API key expiry mid-run. No surprise billing. No data exfiltration concerns. And they run on a laptop.

Two years ago, running a 122B model locally was science fiction. Now it's a weekend project on a single machine with no internet connection, and the model can play God for two hours without forgetting who it created.

The industry conversation about local vs. cloud AI is mostly about latency and cost. It should also be about endurance. Some workloads don't need to be fast. They need to be long and consistent and private. Local models are ready for that right now.

---

The simulation engine (AnthroSim) is written in Python. The LLM runs via MLX on Apple Silicon. Total runtime for 2,610 simulated years across 7 continents: approximately 2 hours. Hardware: MacBook Pro, M4 Max, 128GB RAM. No internet connection was used during the run.
