EA - Three pillars for avoiding AGI catastrophe: Technical alignment, deployment decisions, and coordination by alexlintz
The Nonlinear Library: EA Forum - A podcast by The Nonlinear Fund
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Three pillars for avoiding AGI catastrophe: Technical alignment, deployment decisions, and coordination, published by alexlintz on August 3, 2022 on The Effective Altruism Forum.

Epistemic status: This model is loosely inspired by a conversation with Nate Soares but has since warped into something perhaps significantly different from his original intent. I think this model is a useful thinking tool for examining potential interventions to mitigate AI risk and for getting a grasp of the problem we face. That said, I'm sure the model has its flaws; I just haven't spent enough time to find them. I've tried to write this post so it's very skimmable (except for the hypothetical scenarios section): reading the first sentence or two under each heading and skipping the bullet points should give you the gist of the model.

Summary

The three pillars model attempts to describe the conditions needed to successfully avoid the deployment of unaligned AGI. It proposes that, to succeed, we need to achieve some sufficient combination of success on all three of the following:

1. Technical alignment research
2. Safety-conscious deployment decisions
3. Coordination between potential AI deployers

While how difficult success is depends on how hard each individual pillar turns out to be, the model points toward why we may well fail to avoid AGI catastrophe: we need to simultaneously succeed at three difficult problems. More generally, the model aims to help longtermists flesh out our mental pictures of what success on AGI risk looks like. In particular, it suggests that a strategy aimed solely at a single pillar is unlikely to be sufficient, and that our community might need to take ambitious actions in several directions at once. This model is intended as an imperfect but hopefully useful thinking tool. Further work in this area could add nuance to the model, formalize it, and try to use it to describe different viewpoints (e.g. Christiano's stories of AI failure) or strategies (e.g. raising awareness of AI risk, which might affect several pillars).

The three pillars

Background assumption: Deploying unaligned AGI means doom. If humanity builds and deploys unaligned AGI, it will almost certainly kill us all. We won't be saved by being able to stop the unaligned AGI, or by it happening to converge on values that make it want to let us live, or by anything else.

We need sufficient success on some combination of three pillars to reach a future where AGI does not kill us: technical safety, safety-conscious deployment decisions, and coordination between deployers. Success on any given pillar can, to some extent, substitute for success on another: we could avoid doom by being extremely successful on one pillar, quite successful on two, or fairly successful on all three. We do generally need at least a minimal amount of progress on each pillar to succeed. For example, even a low-cost and easy-to-implement technical solution to alignment still needs to be adopted by leading AI developers.

One conceptualization which might be useful for driving intuition is to set the bar for victory at 100 pillar points. We can have partial success on all pillars, e.g. 33, 33, and 34 points from pillars 1, 2, and 3 respectively; or we can get almost everything from one pillar, e.g. a 90-5-5 split with pillar 1 supporting most of the weight.
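To make the pillar-points picture concrete, here is a minimal toy sketch of that scoring idea. The victory bar of 100 and the example splits come from the paragraph above; the per-pillar floor of 5 points and the specific numbers in the last example are made-up illustrative assumptions, not claims from the post.

```python
# Toy sketch of the "pillar points" framing described above.
# The 100-point bar follows the text; the floor of 5 points per pillar
# is an assumed stand-in for "at least a minimal amount of progress on each pillar".

VICTORY_BAR = 100
MIN_PER_PILLAR = 5  # assumption: rough floor for "minimal progress"

def avoids_doom(technical: int, deployment: int, coordination: int) -> bool:
    scores = [technical, deployment, coordination]
    # Some minimal progress is needed on every pillar...
    if any(s < MIN_PER_PILLAR for s in scores):
        return False
    # ...beyond that, success on one pillar can substitute for another,
    # so only the total matters in this toy model.
    return sum(scores) >= VICTORY_BAR

print(avoids_doom(33, 33, 34))  # partial success on all pillars -> True
print(avoids_doom(90, 5, 5))    # one pillar carrying most of the weight -> True
print(avoids_doom(95, 0, 20))   # no progress on deployment decisions -> False
```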
In practice, we should expect progress on the pillars to be correlated: success on each pillar is more likely if we're in a world with more competence, better-calibrated awareness of AGI risk, a more effective and influential longtermist community, etc. In addition, success on one pillar can increase the return on investment in the others. For example, if technical success were to give us the foolproof ability to mind-read AGIs, it would be trivial t...
