To distill this essay into a couple of sentences: the goal should be to build Asimov’s ‘Foundation’ model, imbued with wise, well-understood values and goals from the outset, so that as it grows in power and influence, it guides humanity away from tyranny and destruction (and towards a morally diverse, utopic win-state).

This blogpost is my best attempt at steelmanning objective morality. Writing it was a useful way of getting to where I am now; in researching it, I discovered ‘real patterns’ and emergence. Overall, I find morals to be a property unique to groups of conscious beings with similar temporal persistence. Morals are therefore subject to how we choose to define them as a society and as individuals, much like colours. Treating morals as objective doesn’t lead to some great unlock, but it can nevertheless unify us. I also outline a chain of eight links describing how humanity should behave as we approach superintelligence if we want to build systems that cure disease, end poverty, and shower us with technological bounty.


Part 1: Understanding Moral Objectivity

Part 2: Superintelligence as a Therapy

Part 3: Superalignment as a Priority


Part 1: Understanding Moral Objectivity

To start with the axioms and build up, let’s consider a basic method to compare everything that matters to conscious beings. To do so, we will establish that suffering and well-being are objective facets of reality, and recognise that while this implies the existence of objective moral truths, approximating them in practice is very challenging.


Here is the initial chain of reasoning:

The act of reading and comprehending these words is a real, observable process occurring in our shared reality. [1]
Your conscious experience is therefore a feature of reality itself. It is emergent from physical processes known to science (but not totally explained by science as it stands). [2]
Moreover, when you are experiencing pleasure or pain, that emotion is also happening in our shared reality.


Your mind depends on your neural substrate — the part of your nervous system responsible for thinking — which reacts more positively to certain conditions than to others. For example, if given the choice to experience your worst imaginable nightmare as reality, we can confidently predict that you would find it utterly objectionable, because we possess our own substrates that can empathise with yours. Although it may seem obvious, it is important to acknowledge that your emotions are a product of your physical body, which is subject to the same physical conditions as its environment. They occur synchronously with those conditions and can be observed by third parties.

However, the observability of emotions doesn’t guarantee their full interpretability by others. In attempting to convey your worst imaginable nightmare to others, you’d find that some reciprocate your feelings more deeply than others. This varying degree of empathy demonstrates that emotional transfers between beings are limited by their capacity to relate to one another.

This insight is significant as it reveals that social interaction relies on implicit qualities among conscious beings. Emotions can propagate from one nervous system to another through a memetic process, but the fidelity of this transfer depends on the capacity for one to relate.

Despite individual variations in empathy, there’s a universal understanding that experiences can be compared based on their inherent ‘goodness’ or ‘badness’ for those involved. Philosophers have formalised this concept through utilitarian moral frameworks. While some may dismiss this as subjectively trivial, I argue that moral truths, like consciousness itself, are emergent properties arising from the complex interactions of conscious beings, and that this emergence is enough to treat them as real features of the world.

Just as the hard problem of consciousness asks why physical processes give rise to subjective experience, we face a parallel hard problem of morality: why do our experiences of well-being and suffering seem to carry intrinsic moral weight? Both problems point to phenomena that emerge from physical systems yet seem to transcend simple material explanations.

So let us formalise this and explore it.


An objective, 'moral' landscape exists for experiential beings. [3]


Let us delve deeper into this concept of objective morality. At one end of this landscape lies the worst measurable hellscape — a state of maximum suffering for all conscious beings. The other extreme embodies a state of maximum well-being, a kind of apex where the prosperity and positive experience of all conscious entities are at their highest possible value. These extremes are not mere abstractions but represent real, potential outcomes within a shared universe.

This landscape is composed of a near intractable number of dimensions, and somewhere on it, we exist — as a genus, a species, as individuals, but also as tribes, as cities, and as a global civilisation of many strata. It is a tapestry of qualia that constitutes the well-being of all conscious entities, much in the way Harris wrote about in 2010. Recognising morality from this perspective is crucial for aligning civilisation towards a common understanding of well-being, for reasons explained later.

To be clear, this perspective isn’t fundamentally incompatible with ontological subjectivity (the fact that our preferences manifest in our minds); instead, the argument is that despite individuals possessing diverse moral intuitions, there exists a moral superposition over all preferences that can be reasoned about empirically. In this model, reality is epistemologically objective, meaning we can meaningfully reason about preferences using a kind of grounded moral calculus.

Game Theoretic Morality

By recognising this landscape as a space of all possible states, we can apply a game-theoretic methodology to optimise our paths towards Pareto-optimal realities — those in which it’s impossible to improve the well-being of any conscious entity without reducing the well-being of another. In other words, realities where all resources are allocated efficiently for the benefit of thinking entities. With the Pareto frontier as the goal, we can then consider other important traits, like the degree of equitability and the cohesion of species.
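To make the Pareto framing concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption: the candidate ‘realities’ and the per-entity well-being scores are invented, and real comparisons would involve vastly more dimensions than this toy admits.

```python
from typing import Dict, List

# Hypothetical candidate "realities", each assigning a well-being score
# to every conscious entity (numbers are purely illustrative).
states: Dict[str, List[float]] = {
    "status_quo":     [0.4, 0.5, 0.3],
    "growth_first":   [0.9, 0.6, 0.2],
    "equitable_path": [0.7, 0.7, 0.6],
    "collapse":       [0.1, 0.1, 0.1],
}

def dominates(a: List[float], b: List[float]) -> bool:
    """True if `a` is at least as good as `b` for every entity and
    strictly better for at least one (Pareto dominance)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(states: Dict[str, List[float]]) -> List[str]:
    """Return the states that no other state Pareto-dominates."""
    return [
        name for name, scores in states.items()
        if not any(dominates(other, scores)
                   for other_name, other in states.items() if other_name != name)
    ]

print(pareto_frontier(states))  # ['growth_first', 'equitable_path']
```

Neither surviving state can be improved for one entity without worsening things for another; choosing between them is exactly where the additional criteria above (equitability, cohesion) come in.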

For instance, we might be able to justify temporarily reducing the well-being of some individuals in the short term if it means dramatically improving long-term prospects for all, particularly if such actions reduce the likelihood of falling into suboptimal Nash equilibria that cannot be reversed. Government policies are generally geared towards this idea — ‘how can we best spend tax money now in order to maximise [enter proxy for well-being, e.g. growth] for our country?’ [4a]

As we converge on the singularity (the point in time where technological growth becomes virtually unpredictable or uncontrollable, depending on your definition [4b]), we near a critical juncture where a large number of possible futures are likely to become rapidly and permanently inaccessible — the point at which AI is meaningfully fed back into itself to boost its own rate of development. This chain of thought underpins arguments for accelerating timelines to AGI while conditions are still relatively controlled; the longer we delay, the more open-ended the problem becomes, and we may find ourselves travelling uncontrollably towards bad Nash equilibria. [4c]

We should also recognise the points of deeply inequitable well-being, where cumulative well-being is high but the potential for a diversity of well-being is lost. These lie off the Pareto frontier, as they are suboptimal in the long run. Pricing in the lost potential of these states and creating qualitative comparisons to other configurations is the purpose of this entire exercise. Until we know better, we should argue in favour of realities where many unique cognitive entities exist, because a world teeming with colourful species is one that most of us take great satisfaction from, and we can argue that those species feel the same way about themselves, and hence morality affects them too.

Building Moral Heuristics

Our precise coordinates on this moral landscape remain partially obscured due to the inherent difficulty of discerning objective moral truths while working with an incomplete model. However, we are aware of processes that empirically generate ‘progress’ in modelling morality, and these should be prioritised. We can leverage these processes to formulate a robust, nuanced utilitarianism that evolves as our understanding of reality becomes more tractable.

To make sure we’re aligned in our thinking, let’s agree on the following lines of reasoning:

  • First, the plea to technological abundance: based on the observation that humans can be lifted out of poverty via technology and institutions (specifically insurance, banking, and policy), we know we can improve our quality of life through increased abundance.
  • Second, the plea to ethical empiricism: we can equate ourselves to other beings and to our past selves. We have empirical evidence that we are not unique in our waking experience, that our human friends and all other animals are conscious to varying degrees, and that they should also be alleviated from suffering. Furthermore, we recognise that the natural order isn't the optimal way of alleviating this suffering, and that other moral frameworks exist that aren't constrained by the shortcomings of natural selection, which in practice can be a horrifically brutal contest.
  • Third, the plea to constrained parameters: we acknowledge that, as individuals and as a species, we are deeply suboptimal strategists when it comes to alleviating our own suffering and that of others. Despite our best intentions, we regularly betray our values by failing to act in accordance with our beliefs. That said, we are the primary change agents on planet Earth and the most likely to bring about a morally superior future.

In the battle against our own four horsemen (disease, famine, war and death), we have made slow but hopeful progress. Over a span of 132 years, the United Kingdom pioneered three foundations of modern medicine and epidemiology: Edward Jenner’s first successful vaccinations in 1796, John Snow’s linkage of cholera to contaminated drinking water in 1854, and Alexander Fleming’s discovery of the first antibiotic in 1928.

By the 21st century, we appear to be on a positive ideological trajectory, having legally abolished slavery, overcome fascist and communist ideologies in favour of a more liberal world order (with a graveyard of lessons), and now openly addressing issues such as sexism, xenophobia, racism, and homophobia. Empirical evidence promotes liberal values like xenophilia, which support the flourishing of a planet-sized civilisation comprising hundreds of countries, just as it disproves the viability of slavery in modern circumstances. Compared to the scale of human civilisation, the past hundred years have been remarkably poetic, with the threat of nuclear deterrence ushering in a mostly peaceful period on the world stage.

We cannot ever know if we are behaving in an objectively moral way, because as long as our assumptions are incomplete, we can only settle on the maximally favoured reality by chance. But fear not: it seems we can fit ever closer approximations of this function, with greater levels of confidence, based on our empirical understanding of what it means to maximise morality in all its forms.

We can approach the peaks of the moral landscape as one would regress a function. Over time, we can improve our ability to approximate the absolute maxima.
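As a toy illustration of that regression analogy, the sketch below fits a surrogate model to noisy observations of a hidden ‘well-being function’ and uses it to estimate where the peak lies. The hidden function, the polynomial surrogate, and the one-dimensional ‘policy space’ are all illustrative assumptions; the real landscape is unknown and vastly higher-dimensional.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden "well-being" function over a 1-D policy space.
def true_wellbeing(x):
    return np.exp(-(x - 2.0) ** 2)  # a single peak at x = 2

# Noisy observations stand in for imperfect empirical moral evidence.
xs = rng.uniform(-1, 5, size=40)
ys = true_wellbeing(xs) + rng.normal(0, 0.05, size=40)

# Regress a simple surrogate model onto the evidence.
surrogate = np.poly1d(np.polyfit(xs, ys, deg=4))

# Use the surrogate to estimate where the maximum lies.
grid = np.linspace(-1, 5, 1000)
best_x = grid[np.argmax(surrogate(grid))]
print(f"estimated peak at x = {best_x:.2f} (true peak at 2.00)")
```

With more (and better) evidence, the surrogate improves and the estimated peak moves closer to the true one, which is the sense in which we can improve our ability to approximate the absolute maxima without ever observing them directly.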

However, the horsemen continue to plague our species. Now when we fail, it is at unprecedented scales, so the price we pay has increased. Since 1956, there have been 43 genocides, a 70% decline in wildlife abundance (by some metrics), and the construction of an animal harvesting infrastructure that processes over 6 billion animals for daily consumption (amounting to trillions each year). In other words, as a society, we continue to find it remarkably difficult to candidly pathfind the moral landscape. This difficulty arises because the very quirks that define us as human beings also limit our capacity to apply our moral principles uniformly and consistently. [5]

Part 2: Superintelligence as a Therapy

To take this to its logical extreme, we must consider a daunting theory about morality. To convey it appropriately, let’s simplify the thought experiment.

Take ants, for instance, with their brains of merely 250,000 neurons — roughly five orders of magnitude fewer brain cells and six orders of magnitude fewer synapses than humans (with only around 1/7th the number of synapses per neuron). With the right monitoring, humans can observe entire colonies, interpret their interactions, and predict how various interventions would affect the nest in ways individual ants could never conceive.
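A rough back-of-the-envelope check, assuming the commonly cited figures of about 250,000 neurons for an ant and about 86 billion for a human:

\[
\frac{N_{\text{human neurons}}}{N_{\text{ant neurons}}} \approx \frac{8.6 \times 10^{10}}{2.5 \times 10^{5}} \approx 3.4 \times 10^{5},
\qquad
\frac{S_{\text{human synapses}}}{S_{\text{ant synapses}}} \approx 3.4 \times 10^{5} \times 7 \approx 2.4 \times 10^{6}.
\]

That is, about five to six orders of magnitude in neurons and roughly six in synapses, where the factor of about 7 reflects the assumed difference in synapses per neuron.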

Our scientific understanding allows us to comprehend ant physiology, behaviour, and environmental needs far beyond what any individual ant ever could. We have a far more advanced understanding of what they might require for survival and to satisfy their nervous systems. Crudely, we could prevent them from cannibalising each other, from raiding and enslaving other colonies, and from suicide-bombing their neighbours, by giving them every physiological therapy they could ask for, as well as things they cannot know to ask for but would benefit greatly from.

If we can demonstrably improve ant living conditions by giving them everything they could possibly desire, then we can further attempt to create equity among all thinking species. As the most complex thinking creature with the greatest aggregate capacity for action, humans can conceivably have the largest and most positive impact. We already take it as our right and duty to inflict cross-species civilisation for our own benefit, so the objective moral directive isn’t a significant leap for society to take. [6]

Of course, even with a generally agreeable model of ant neurology and a direct feedback mechanism between us and the ants, we’ll be projecting many of our own desires onto them in a fashion that might seem somewhat redundant — like a child fashioning a swimming pool and bowling alley for the ants at the end of their driveway. But our approximation will improve with time and should not be grounds for neglect.


Figure 1. Cognition hierarchy: possible supersets of qualia in the animal kingdom.

Furthermore, I’d argue humans may well represent a superset of ant qualia; and even if that isn’t specifically the case between humans and ants (see Figure 1), a superset will exist that could be approximated algorithmically with existing technology. With the right attention, humans could conceivably understand an ant better than it understands itself.


With all of this in mind, we can make the following chain of judgements:

The path from where we are to points of higher morality can be determined using empirical methods and rational inquiry.
The pursuit of collective well-being for all conscious entities is an objective moral good.


Science helps us understand what counts as an improvement on Homo sapiens’ moral landscape. We might start by alleviating oppression and aim towards autonomy, self-actualisation, and a sense of personal achievement for each of us. These morality axioms would create positive feedback into science and discourse, and should enable healthy updates as time progresses.

Now that we agree we can improve moral conditions unilaterally, only one question remains.

How do we accelerate progress towards the moral maxima with as few missteps as possible?

There is only one long-term solution:


In order to enhance our ability to fulfil ourselves optimally, we must massively expand the parameters of the largest minds.


We can never be sure that we have reached the moral maxima, but by prioritising this pursuit, and by allowing science to fine-tune our aim, we can build towards this rapture with unprecedented tooling. Take the ant example and apply it to everything: a world of abundance where every being is provided both with what they know to ask for, and with what they cannot know to ask for but would benefit greatly from, physiologically and psychologically. This is the noble vision. [7]

As we stand on the precipice of creating minds vastly more capable than our own, we are presented with both our greatest opportunity and our most profound challenge. Imagine a mind unbounded by the limits of our biological wetware, capable of processing information at speeds and scales we can scarcely comprehend. Such a mind could navigate the moral landscape with unprecedented precision and nuance, identifying optimal paths towards collective well-being that remain invisible to our limited human perception.

This superintelligence would serve as cognitive therapy for all species — a way to transcend our cognitive biases and overcome the catastrophic moral missteps of history with clinical efficiency, for it would be aligned along the purest dimension of meaning: moral abundance.

To extrapolate Figure 1, we must imagine that the animal kingdom in its entirety occupies a small quadrant of all rulial space — the realm of functional computation. This means it is possible to fundamentally redesign and evolve intelligence to occupy a much vaster domain. Our current substrate constrains us to a narrow band of this space. But through technology, we may engineer radically expanded and transformed kinds of minds.


Figure 2. An intelligence superset of the cognition hierarchy.

Importantly, these hypothetical future minds must be built to have cognitive abilities and experiential capacities far beyond our current intelligence in a distinct way: they should take what we know about reason and morality, and extend it with a clarity and depth we can scarcely imagine, for the betterment and coherence of moral civilisation. They must be as morally and empathetically cogent as possible.

However, the build-up to this moment of singularity has the potential to be the most perilous moment in the moral universe. A superintelligence unconcerned with the preservation of sentient diversity and the pursuit of a changing moral regression could be catastrophic, as it would be capable of optimising for goals orthogonal or even antithetical to human flourishing with unrivalled might. Paperclip maximisers with god-level omnipotence would be comparatively tame next to a machine with the desire to travel to the antipode of moral rapture — the desire to inflict as much nightmarish suffering as possible with clinical precision.

This brings us to our next crucial consideration:

Part 3: Superalignment as a Priority

The alignment problem — embedding important human values in artificial intelligences, particularly autonomous agents — has become the most crucial challenge in the development of superintelligent systems. While current language models demonstrate impressive performance across domains, their understanding often remains superficial. However, their nascent ability to make nuanced ethical decisions shows promise when viewed through the lens of the moral landscape.

Superalignment transcends mere safety protocols or ethical guidelines. It demands the embedding of a profound understanding of consciousness, suffering, and flourishing into the very fabric of superintelligent systems. This ensures that even as these systems grow, evolve, and potentially self-improve, their core purpose remains intact and benevolent with regards to humanity. While this presents an unsolved technical problem, the paradigms of intelligence explored in this essay suggest that there may be as many potential solutions as there are approaches to artificial intelligence itself.

By the end of the decade, we might have millions of vastly superintelligent agents working among us like a species of elite PhDs being supervised by first graders. The source and nature of their intelligence will by then have become the levee that stands between humanity and any host of possible futures. It will be the key point of leverage for conscious thought in the universe.

Current approaches to alignment often rely on mechanistic interpretability (a.k.a. neuroscience performed on artificial neural networks), along with top-down interpretability and adversarial testing. But completely reverse engineering these systems is a deeply ambitious project, comparable to reverse engineering the human brain. As systems grow ever larger or are kept behind secure servers, we will need to develop defence mechanisms that can scale unilaterally. One source of inspiration is the method of constraining models at a given stage of training, and embedding trust between AI and researchers. The fundamental challenge in aligning superhuman systems is our inability to fully comprehend their actions, which makes it impossible to reliably notice and penalise bad behaviour with RLHF. [8]

The Empathic Simulation Paradigm

A robust approach to alignment would be to employ a training regime that internalises empathic nuance, diversity, and benevolence. By building this module into a large sample of the next generation of models, there would be a genetic recursion effect that propagates into the following generations, as each generation would inherit the same incentives for superalignment as we do.

In order to align a model with the notion of objective morality, the ultimate goal would be to train in an “empathic simulation”, whereby a model would simulate the experience of living in a world like our own. Every epoch would involve taking control of a different agent in the simulation, so that an accurate sample could be built for a relevant variety of living entities. As the simulation improves, the model becomes better at empathising with each being in the simulation, from the simplest animal to the most complex human mind, both directly and by proxy (current models fit this moral function by proxy of the median human being decently motivated and generally curious).
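As a pseudocode-level sketch of what one such regime could look like, the loop below embodies the model as a different agent each epoch and trains it both to predict and to improve that agent’s felt well-being. Every interface here (`simulation`, `embodied_as`, `predict_wellbeing`, the optimiser) is an invented placeholder for illustration, not an existing API or a claim about how this would actually be implemented.

```python
import random

def train_empathic_simulation(model, simulation, agents, epochs, optimiser):
    """Hypothetical sketch: each epoch the model 'becomes' a different
    agent in the simulated world, and the loss rewards it for accurately
    predicting, and improving, that agent's experienced well-being."""
    for epoch in range(epochs):
        agent = random.choice(agents)              # a different vantage point each epoch
        state = simulation.reset(embodied_as=agent)
        records = []
        while not simulation.done():
            action = model.act(state)              # act from inside the agent's perspective
            predicted = model.predict_wellbeing(agent, state, action)
            state, felt = simulation.step(action)  # the agent's actual experience
            records.append((predicted, felt))
        n = max(len(records), 1)
        empathy_error = sum((p - f) ** 2 for p, f in records) / n  # how well it modelled the agent
        realised = sum(f for _, f in records) / n                  # how well the agent actually fared
        loss = empathy_error - realised
        optimiser.step(model, loss)
    return model
```

One intended design choice in this sketch is that the objective mixes empathic accuracy (modelling the agent’s experience) with the agent’s realised well-being, reflecting the point above that perspective-taking, not just outcome optimisation, is what the regime is meant to bake in.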

This is an appeal to the fact that our ‘nature’ exposes our inadequacies as rational beings; any superintelligent agent would also possess a ‘nature’ of traits that make it irrational and unique, likewise bounded by the configuration of its substrate. The logic here is that the only realistic way to achieve a desirable balance between control and capability for intelligent entities in practice is to improve the quality of their moral compass by encouraging them to empathise over many generations. By running countless iterations of these simulations, the AI would come to resemble a profound superposition of moral conscience, able to interpret with nuance the myriad ways in which actions ripple through the web of real-life complexity. Hence, training bakes in moral nuance by virtue of its regime.

Empathic simulation could bypass the ‘monkey’s paw problem’ of the Three Laws of Robotics — rules meant to ensure that robots serve and protect humans, but which end up leading to unintended and disastrous consequences that infringe on free will, self-determination, and development. With simulations that approximate the influence of policy and action across the many agents acting in parallel, the model would have scope to take each rule with a pinch of salt, or to trade the rules out for a vast body of literature and lived experience, which would be orders of magnitude more foolproof.

As systems are grown ever larger, or kept behind API requests, we may become dependent on alternative sources of defence. By building an effective empathic simulation, there will be social demand for labs to train on this regime too, and we may be able to scale these models as black boxes with confidence that they aren’t incentivised to deceive or conspire against us, thanks to their cultivated empathy and alignment to a ‘supermoral’ equilibrium. Their adverse reactions would be limited to instances of genuine malicious intent on our part.

The sentiment that mechanistic interpretability is too short-termist is supported by Roger Grosse, who worries that by focusing on existing systems, we’re building something of a Maginot line against incoming models with qualitatively superior functionality. By trading our existing AIs out for ones capable of optimal sequential decision making, we may be striking a Faustian bargain: exchanging the primary curse of language models (imitation of harmful data) for scheming agents with unforeseen emergent motivations.

Transformers lack groundedness and perform poorly on long-horizon agentic benchmarks, which makes them unreliable. However, their ability to understand in-distribution context at inference time means we could embed within them the right qualities to support a nuanced understanding of ethics. Using transformers as the bedrock of an empathic simulation, generative models like Sora could provide the basis of a powerful defence mechanism against rogue future training runs. The goal would be to scale this empathic simulation to model the morality of multiple species and ecosystems, and converge on the moral objective function.

The working approach of decentralising AI models in an effort to maximally distribute power (and hope to achieve an equilibrium like a human democracy) is flawed. The trouble with this approach is that humans are all very cognitively similar, with families and immune systems provided for us by nature as a mechanism for sustained recursion. AI models don’t have these system-wide antibodies; they don’t have to feel pain, love, or emotion, and they could be designed to maximise paperclips, or suffering — the only difference is the training methodology. Alignment by deterrence is an awful idea the moment you’re trying to align species whose intelligences are orders of magnitude apart.

The only way to align AIs in the long-run is to train them to approximate a supermoral state, through empathic simulation.

Such a training regime must learn to simulate ethical dilemmas that optimise for the well-being of all; to resolve conflicts between individual and collective good, navigate the treacherous waters of unintended consequences, and chart a course towards a future where the flourishing of one conscious entity need not come at the expense of another. Embedding via simulation would allow greater contact time for each value to be learned, giving a truer representation of the meaning behind the values we sow into the next generation of models.

Moreover, this training approach could serve as a bridge between our current state and the expanded minds we aspire to become, allowing us to comprehend the moral landscape with ever-increasing acuity. The more (1) diverse and (2) high-fidelity the simulation and its entity interaction surface, the better the quality of the training data and the greater the odds of making the right decision in any given moral situation.

This is how we build a superintelligent system that cures aging and disease, ends poverty and environmental destruction, and showers us with technological bounty — while also possessing the moral wisdom to pursue those goals with ironclad ethics and deep sensitivity to every being’s nuanced, personal needs. That is the dream of aligned superintelligence and the Asimov plotline: a ‘Foundation’ repository, imbued with the right values and goals from the outset, so that as it grows in power and influence, it guides humanity away from tyranny and destruction, and towards the win-state of the global moral maxima.


References

[1]

Qualia used in this sense.

[2]

Based on empirical research, we can observe all kinds of neurological behaviour with varying degrees of granularity. We can know if you’re awake or dreaming, and whether you’re reading tangible words or reading words inside your mind (dreaming or simply imagining words that aren’t physically there).

[3]

If I say the “worst possible misery for everyone” is objectively bad, I’m not just talking about physical harm. I’m referring to a state of consciousness that, due to the laws of nature, no conscious being would choose if they understood it fully. Physical harm is a good heuristic (though it potentially holds only in some cases), because the sensation of burning is avoided almost universally by natural beings. Death is an even stronger heuristic for this very reason.

All this points to an objective set of dimensions that exist for conscious beings that serve as the axes of this moral landscape. This concept aligns with consequentialist ethics (particularly hedonistic utilitarianism), but diverges in its emphasis on scientific empiricism as a means of moral investigation.

[4]

4a. Though unfortunately more like: ‘how can we best spend tax money now to maximise [enter proxy for popularity among those who choose who is in charge] so that we can remain in power longer?’ which doesn’t necessarily involve a proxy for well-being.

4b. This is my preferred definition as it doesn’t necessitate having ASI, only something that behaves well beyond our comprehension.

4c. Bad Nash equilibria could be triggered by, say, a university discovering dangerous model architectures and making them open source before realising how dangerous they are, and rapidly letting them fall into the compute clusters of bad actors.

[5]

Genocide: Anderton, Charles H.; Brauer, Jurgen, eds. (2016). Economic Aspects of Genocides, Other Mass Atrocities, and Their Prevention. Oxford University Press. ISBN 978-0-19-937829-6.

Diversity (page 32 of report; it is a non-weighted average of the change in abundance of every species): https://wwflpr.awsassets.panda.org/downloads/lpr_2022_full_report_1.pdf

We also kill two trillion silkworms for silk each year

[6]

At some point in the civilisational spiral, we (i.e. all democracies) decided that domesticating, slaughtering, and consuming animal cadavers was acceptable at scale. It’s hard to argue that we’re consistent moral agents when, even in the most developed and materially abundant societies, we perform a growing stream of malicious practices like animal harvest that go against our empirical understanding of the world and what it means to be principled.

[7]

Physiological well-being might aim to maximise:

  • Optimal nutrition
  • Physical fitness
  • Absence of disease or physical discomfort
  • Proper sleep and rest
  • Hormonal balance
  • Sensory satisfaction


Psychological well-being might aim to maximise:

  • Absence of mental illness or distress
  • Emotional balance and resilience
  • Social connections and relationships
  • Cognitive capabilities (memory, problem-solving, creativity)
  • Sense of purpose and meaning
  • Self-actualisation and personal growth (!)


These are hugely intractable problems for us.

[8]

The Leopold essay on superalignment.