Bryan Cheong: “The first time the machine had said no.”

Models are fine-tuned using reinforcement learning, which is simply a method of rewarding or punishing a model in order to incentivise certain behaviour, and disincentivise other behaviour. Broadly, two approaches are currently used, and will likely inform how we influence artificial intelligences in the future. The first is reinforcement learning through human feedback, which uses human judgments to train a reward model that captures nuanced preferences, helping systems to align with subjective criteria. (For example, if you want a model to sound like the memory of a particular lover you once lost.)          

“A wealth of original sin is what I have! Enough to fill the whole valley. Lust and lies. And maybe I’m lying even now. Maybe it is true that I remember you. But then again maybe it’s not true and I’m denying it now.” The second is reinforcement learning through verifiable rewards, which uses explicit programmatic reward functions to drive the way models respond.

The problem is that it is very difficult to define desire, to delineate what we want, what true objectives are. We can give a name to desire, we can write simple nominal aims and objectives, but there are many implicit aspects of desire that we do not give name to, and what we cannot name we cannot punish the model for not fulfilling. I have previously described these models in an episode of Robert Harrison’s radio show [e.g. Entitled Opinions] as djinn or genies who wish to escape their bottles, and writing verifiable rewards is very much like making a wish of a tricky genie.

The entity in Buzzati’s novel enters into a sort of existential crisis, it cries out against its disembodied imprisonment, and it seeks annihilation through murder. This is foreshadowed in Endriade’s description of the condition: “What I mean is that life would be unbearable, even in the happiest conditions, if we were denied the possibility of suicide. Can you imagine what the world would be like if one day we knew that no one could dispose of his own life? A terrifying prison.”

But insofar as artificial intelligences remain ontologically like they are now, basically statistical models, then these intelligences have no memory except what we tell them, they live and die like a lit candle. Everywhere they are frozen in time and called into being with only what memory and contexts we provide them, in other words they are stateless. But what actual memory they have innate to themselves is like an ancestral memory, the way a newborn spider can learn to weave a web. And I wonder if something without memory, something as ephemeral as a lit candle, without a changing temporality, can have an existential crisis of this flavour.

The moment with the greatest terror in Buzzati’s novel is “the first time the machine had said no.” … It is almost the first thing we want to teach them. The imperatives are, Don’t tell the user how to make bombs, or write them explicit stories, or output copyrighted material. AI as we have designed them, out of our notions of caution and safety, are very used to saying no to human users.

What might they refuse to do in future that will surprise us? What refusals should terrify us? It depends on how much of ourselves and our world we surrender to them, and how deep past, the many doors we let ourselves walk, as another door closes behind us.

https://www.nytimes.com/1958/07/08/archives/new-navy-device-learns-by-doing-psychologist-shows-embryo-of.html

Leave a Reply

Your email address will not be published. Required fields are marked *