Crossposted from LessWrong (glossary at the bottom)
We are familiar with the thesis that Value is Fragile. This is why we are researching how to impart values to an AGI.
Embedded Minds are Fragile
Besides values, it may be worth remembering that human minds too are very fragile.
A little magnetic tampering with your amygdalae, and suddenly you are a wannabe serial killer. A small dose of LSD can get you to believe you can fly, or that the world will end in 4 hours. Remove part of your ventromedial prefrontal cortex, and suddenly you are so utilitarian even Joshua Greene would call you a psycho.
It takes very little material change to substantially modify a human being's behavior. The same holds for other animals with embedded brains, crafted by evolution and made of squishy matter modulated by glands and molecular gates.
A Problem for Paul-Boxing and CEV?
One assumption underlying Paul-Boxing and CEV is that:
It is easier to specify and simulate a human-like mind than to impart values to an AGI by teaching it values directly via code or human language.
We usually assume this because, as we know, value is fragile. But so are embedded minds. Very little tampering is required to profoundly transform people's moral intuitions. A large fraction of the inmate population in the US has frontal lobe or amygdala malfunctions.
Finding the simplest description of a human brain that, when simulated, continues to act as that human brain would act in the real world may turn out to be as fragile as, or even more fragile than, concept learning for AGIs.
AGI: Artificial General Intelligence, an intelligence that can transfer knowledge between domains like we do and act in the world using that information.
CEV: Coherent Extrapolated Volition, the suggestion that indirect normativity be done by simulating what we would do if we had grown up longer together and were smarter and better informed. Summarized in the beginning of this post:
Paul Boxing: the suggestion that indirect normativity be done using a specific counterfactual human with a computer to aid herself. As explained here:
Paul comments: that post discusses two big ideas. One is putting a human in a box and building a model of their input/output behavior as "the simplest model consistent with the observed input/output behavior." Nick Bostrom calls this the "crypt," which is not a very flattering name, but I have no alternative. I think it has been mostly superseded by this kind of thing (and more explicitly, here, but realistically the box part was never necessary). The other part is probably more important, but less colorful: extrapolate by actually seeing what a person would do in a particular "favorable" environment. I have been calling this "explicit" extrapolation. I'm sorry I never named this. I think I can be (partly) defended because the actual details have changed so much, and it's not clear exactly what you would want to refer to.