<h1 id="risk-decomposition">4.2 Risk Decomposition</h1>
<p>To reduce risks, we need to understand the factors contributing to
them. In this section, we will define some key terms from safety
engineering. We will also discuss how we can decompose risks into
various factors and create risk equations based on these factors. These
equations are useful for quantitatively assessing and comparing
different risks, as well as for identifying which aspects of risk we can
influence.</p>
<h2 id="failure-modes-hazards-and-threats">4.2.1 Failure Modes, Hazards, and
Threats</h2>
<p>Failure modes, hazards, and threats are basic words in a safety
engineer’s vocabulary. We will now define and give examples of each
term.</p>
<p><strong>A failure mode is a specific way a system could
fail.</strong> There are many ways in which different systems can fail
to carry out their intended functions. A valve leaking fluid could
prevent the rest of the system from working, a flat tire can prevent a
car from driving properly, and losing connection to the Internet can
drop a video call. We can refer to all these examples as <em>failure
modes</em> of different systems. Possible failure modes of AI include
AIs pursuing the wrong goals, or AIs pursuing simplified goals in the
most efficient possible way, without regard for unintended side
effects.</p>
<p><strong>A hazard is a source of danger that could cause
harm.</strong> Some systems can fail in ways that are dangerous. If a
valve is leaking a flammable substance, the substance could catch fire.
A flammable substance is an example of a <em>hazard</em>, or <em>risk
source</em>, because it could cause harm. Other physical examples of
hazards are stray electrical wires and broken glass. Note that a hazard
does not pose a risk automatically; a shark is a hazard but does not
pose a risk if no one goes in the water. For AI systems, one possible
hazard is a rapidly changing environment (“distribution shift”) because
an AI might behave unpredictably in conditions different from those it
was trained in.</p>
<p><strong>A threat is a hazard with malicious or adversarial
intent.</strong> If an individual is deliberately trying to cause harm,
they present a specific type of hazard: a <em>threat</em>. Examples of
threats include a hacker trying to exploit a weakness in a system to
obtain sensitive data or a hostile nation gaining more sophisticated
weapons. One possible AI-related threat is someone deliberately
contaminating training data to cause an AI to make incorrect and
potentially harmful decisions based on hidden malicious
functionality.</p>
<h2 id="the-classic-risk-equation">4.2.2 The Classic Risk Equation</h2>
<p>Different hazards and threats present different levels of risk, so it
can be helpful to quantify and compare them. We can do this by
decomposing risk into multiple factors and constructing an equation. We
will now break down risk into two components, and then later discuss a
more detailed four-factor decomposition.</p>
<p><strong>Risk can be broken down into probability and
severity.</strong> Two main components affect the level of risk a hazard
poses: the probability that an accident will occur and the amount of
harm that will be done if it does happen. This can be represented
mathematically as follows: <span
class="math display">Risk(hazard) = <em>P</em>(hazard) × severity(hazard).</span></p>
<p>where <span class="math inline"><em>P</em>(⋅)</span> indicates the
probability of an event. This is the classic formulation of risk.<p>
The risk posed by a volcano can be assessed using the probability of
eruption, denoted as <span
class="math inline"><em>P</em>(eruption)</span>, and the severity of its
impact, denoted as severity(eruption). If a volcano is highly likely to
erupt but the surrounding area is uninhabited, the risk posed by the
volcano is low because severity(eruption) is low. On the other hand, if
the volcano is dormant and likely to remain so but there are many people
living near the volcano, the risk is also low because <span
class="math inline"><em>P</em>(eruption)</span> is low. In order to
accurately evaluate the risk, both the probability of eruption and its
severity need to be taken into account.</p>
<p><strong>Applying the classic risk equation to AI.</strong> The
equation above tells us that, to evaluate the risk associated with an AI
system, we need information about two aspects of it: the probability
that it will do something unintended, and the severity of the
consequences if it does. For example, an AI system may be capable of
seizing control over critical infrastructure, with potentially
catastrophic consequences for humans. However, to estimate the level of
this risk, we also need to know how likely it is that the system will
try to capture power. A demonstration of a potential catastrophic hazard
does not necessarily imply the risk is high.</p>
<p><strong>The total risk of a system is the sum of the risks of all
associated hazards.</strong> In general, there may be multiple hazards
associated with a system or situation. For example, a car driving safely
depends on many vehicle components functioning as intended, and also
depends on environmental factors, such as weather conditions and the
behavior of other drivers and pedestrians. There are therefore multiple
hazards associated with driving. To find the total risk, we can apply
the risk equation to each hazard separately and then add the results
together.</p>
<p><span
class="math display">Risk = ∑<sub>hazard</sub><em>P</em>(hazard) × severity(hazard)</span></p>
<p><strong>We may not always have exact numerical values.</strong> We
may not always be able to assign exact quantities to the probability and
severity of all the hazards, and may therefore be unable to precisely
quantify total risk. However, even in these circumstances, we can use
estimates. If estimates are difficult to obtain, it can still be useful
to have an equation that helps us understand how different factors
contribute to risk.</p>
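<p>As a concrete illustration, the following minimal Python sketch
applies the classic risk equation to a handful of hazards. The hazard
names and all numerical estimates are hypothetical, chosen only to show
how per-hazard risks are computed and summed.</p>
<pre><code># A minimal sketch of the classic risk equation, with made-up estimates.
# Each hazard has a probability of occurring (say, per year) and a
# severity (expected harm if it occurs, in arbitrary units).
hazards = {
    "valve leak":       {"probability": 0.05, "severity": 10.0},
    "electrical fault": {"probability": 0.01, "severity": 50.0},
    "operator error":   {"probability": 0.10, "severity": 5.0},
}

# Risk of each hazard: P(hazard) x severity(hazard).
for name, h in hazards.items():
    print(f"{name}: risk = {h['probability'] * h['severity']:.2f}")

# Total risk of the system: the sum of the per-hazard risks.
total_risk = sum(h["probability"] * h["severity"] for h in hazards.values())
print(f"total risk = {total_risk:.2f}")
</code></pre>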
<h2 id="framing-the-goal-as-risk-reduction">4.2.3 Framing the Goal as Risk
Reduction</h2>
<p><strong>We should aim for risk reduction rather than trying to
achieve zero risk.</strong> It might be an appealing goal to reduce the
risk to zero by seeking ways of reducing the probability or severity to
zero. However, in the real world, risk is never zero. In the AI safety
research community, for example, some talk of “solving the alignment
problem”—aligning AI with human values perfectly. This could, in theory,
result in zero probability of AIs making a catastrophic decision and
thus eliminate AI risk entirely.<p>
However, reducing risk to zero is likely impossible. Framing the goal as
eliminating risk implies that finding a perfect, airtight solution for
removing risk is possible and realistic. Focusing narrowly on this goal
could be counterproductive, as it might distract us from developing and
implementing practical measures that significantly reduce risk. In other
words, we should not “let the perfect be the enemy of the good.” When
thinking about creating AI, we do not talk about “solving the
intelligence problem” but about “improving capabilities.” Similarly,
when thinking about AI safety, we should not talk about “solving the
alignment problem” but rather about “making AI safer” or “reducing risk
from AIs.” A better goal could be to make catastrophic risks negligible
(for instance, less than a 0.01% chance of an existential catastrophe
per century) rather than trying to reduce the risk to exactly zero.</p>
<h2 id="disaster-risk-equation">4.2.4 Disaster Risk Equation </h2>
<p>The classic risk equation is a useful starting point for evaluating
risk. However, if we have more information about the situation, we can
break down the risk from a hazard into finer categories. First we can
think about the <em>intrinsic hazard level</em>, which is a shorthand
for probability and severity as in the classic risk equation.
Additionally, we can consider how the hazard interacts with the people
at risk: we can consider the amount of <em>exposure</em> and the level
of <em>vulnerability</em> <span class="citation"
data-cites="Hendrycks2022xrisk">[1]</span>.</p>
<figure id="fig:risk-breakdown">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/hazards.png" class="tb-img-full"/>
<p class="tb-caption">Figure 4.1: Risk can be broken down into exposure, probability, severity, and vulnerability. Probability
and severity together determine the “intrinsic hazard level.”</p>
<!--<figcaption>Breaking down risk into four factors.</figcaption>-->
</figure>
<h2 id="elements-of-the-risk-equation">4.2.5 Elements of the Risk
Equation</h2>
<p>Exposure and probability are relevant before the accident, while
severity and vulnerability matter during it. We can explain these terms
in the context of a floor that is hazardously slippery:</p>
<ol>
<li><p><strong>Exposure is a measure of how much we are exposed to a
hazard.</strong> It will be more likely that someone will slip on a wet
floor if there are more people walking across it and if the floor
remains slippery for longer. We can say that this is because the
<em>exposure</em> to the possibility of someone slipping is
higher.</p></li>
<li><p><strong>Probability tells us how likely it is that a hazard will
lead to an accident.</strong> The more slippery the floor is, the more
likely it is that someone will slip and fall on it. This is separate
from exposure: it already assumes that they are walking on the floor.
Probability and exposure together determine how likely an accident
is.</p></li>
<li><p><strong>Severity indicates how intrinsically harmful a hazard
is.</strong> A carpeted floor is less risky than a marble one. Part of
the severity of a slippery floor would be how hard it is. This term
represents the extent of damage an accident would inflict. We can refer
to probability and severity together as the <em>intrinsic hazard
level</em>.</p></li>
<li><p><strong>Vulnerability measures how susceptible we are to harm
from a hazard.</strong> If someone does slip on a floor, the harm caused
is likely to depend partly on factors such as bone density and age, which
together determine how susceptible they are to bone fracture. We can
refer to this as <em>vulnerability</em>. Vulnerability and severity
together determine how harmful an accident is.</p></li>
</ol>
<p>Note that probability and severity are mostly about the hazard
itself, whereas exposure and vulnerability tell us more about those
subject to the risk.</p>
<p><strong>With these terms, we can construct a more detailed
equation.</strong> Sometimes, but not always, it is more convenient to
use this risk decomposition. Rather than being mathematically rigorous,
this equation is intended to convey that increasing any of the terms
being multiplied will increase the risk, and reducing any of these terms
will reduce it. Additionally, reducing any of the multiplicative terms
to zero will reduce the risk to zero for a given hazard, regardless of
how large the other factors are. Once more, we can add together the risk
estimates for each independent hazard to find the total risk for a
system.<p>
<span class="math display">$$\displaystyle \text{Risk} =
\sum_{\substack{\text{hazardous} \\ \text{event } h}} P(h) \times
\text{severity}(h) \times \text{exposure}(h) \times
\text{vulnerability}(h)$$</span></p>
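<p>To make the four-factor decomposition concrete, here is a minimal
Python sketch applying it to the slippery-floor example. All numbers
are hypothetical and serve only to show that the factors combine
multiplicatively, so setting any one of them to zero removes the risk
from that hazard entirely.</p>
<pre><code># A minimal sketch of the four-factor disaster risk equation,
# with made-up numbers for the slippery-floor example.
def risk(probability, severity, exposure, vulnerability):
    """Risk from one hazardous event as the product of four factors."""
    return probability * severity * exposure * vulnerability

wet_floor = risk(
    probability=0.2,    # chance that a person crossing the floor slips
    severity=3.0,       # intrinsic harm of a fall on this surface
    exposure=50,        # people walking across the floor per day
    vulnerability=0.5,  # susceptibility of those people to injury
)
print(f"wet floor risk = {wet_floor}")  # 0.2 * 3.0 * 50 * 0.5 = 15.0

# Closing off the corridor drives exposure, and hence risk, to zero.
print(risk(probability=0.2, severity=3.0, exposure=0, vulnerability=0.5))  # 0.0
</code></pre>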
<h2 id="applying-the-disaster-risk-equation">4.2.6 Applying the Disaster Risk
Equation</h2>
<p><strong>Infection by a virus is an example hazard.</strong> Consider
these ideas in the context of a viral infection. The severity and
probability of the hazard refer to how bad the symptoms are and how
infectious the virus is. An individual’s exposure relates to how much
they come into contact with the virus. Their vulnerability relates to
how strong their immune system is and whether they are vaccinated. If
infection with the virus is almost certain to be deadly, we might
consider ourselves extremely vulnerable.</p>
<p><strong>Decomposing risks in detail can help us identify practical
ways of reducing them.</strong> As well as helping us evaluate a risk,
the equation above can help us understand what we can do to mitigate it.
We might be unable to change how intrinsically virulent a virus is, for
instance, and so unable to affect the hazard’s severity. However, we
could reduce exposure to it by wearing masks, avoiding large gatherings,
and washing hands. We could reduce our vulnerability by maintaining a
healthy lifestyle and getting vaccinated. Taking any of these actions
will decrease the overall risk. If we are facing a deadly disease, then we might take extreme actions like quarantining ourselves to reduce exposure to the hazard, thereby bringing down overall risk to manageable levels.</p>
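<p>The sketch below, again with purely hypothetical numbers, shows how
such interventions act on different factors of the equation: because
the factors multiply, halving exposure and cutting vulnerability to a
fifth together reduce the overall risk tenfold.</p>
<pre><code># A minimal sketch of how interventions reduce risk multiplicatively,
# with made-up numbers for the viral-infection example.
def risk(probability, severity, exposure, vulnerability):
    return probability * severity * exposure * vulnerability

# Baseline: no precautions (all values are illustrative).
baseline = risk(probability=0.3, severity=8.0, exposure=1.0, vulnerability=1.0)

# Masks and avoiding gatherings halve exposure; vaccination cuts
# vulnerability to a fifth. The virus's intrinsic severity is unchanged.
mitigated = risk(probability=0.3, severity=8.0, exposure=0.5, vulnerability=0.2)

print(f"baseline risk:  {baseline:.2f}")   # 2.40
print(f"mitigated risk: {mitigated:.2f}")  # 0.24
</code></pre>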
<p><strong>Not all risks can be calculated precisely, but decomposition
still helps reduce them.</strong> An important caveat to the disaster
risk equation is that not all risks are straightforward to calculate, or
even to predict. Nonetheless, even if we cannot put an exact number on
the risk posed by a given hazard, we can still reduce it by decreasing
our exposure or vulnerability, or the intrinsic hazard level itself,
where possible. Similarly, even if we cannot predict all hazards
associated with a system—for example if we face a risk of unknown
unknowns, which are explored later in this chapter—we can still reduce
the overall risk by addressing the hazards we are aware of.</p>
<p><strong>In AI safety, the risk equation suggests three important
research areas.</strong> As with other hazards, we should look for
multiple ways of preventing and protecting against potential adverse
events associated with AI. There are three key areas of research that
can each be viewed as inspired by a component of the disaster risk
equation: robustness (e.g. adversarial robustness), monitoring (e.g.
transparency, trojan detection, anomaly detection), and control (e.g.
reducing power-seeking drives, representation control). These research
areas correspond, respectively, to reducing vulnerability (making AIs
robust to adversarial attacks), reducing exposure to hazards (by
monitoring for and avoiding them), and reducing the hazard level itself
(the probability and severity of potential damage, by ensuring AIs are
controllable and inherently less hazardous). To reduce AI risk, it is
crucial to pursue and develop all three, rather than relying on just
one. <strong>Example hazard: proxy gaming.</strong> Consider proxy
gaming, a risk we face from AIs that was discussed in the Single Agent
Safety chapter. Proxy gaming might arise when we give AI systems goals
that are imperfect proxies of our goals. An AI might then learn to
“game” or over-optimize these proxies in unforeseen ways that diverge
from human values. We can tackle this threat in many different
ways:<p>
</p>
<ol>
<li><p>Reduce our exposure to this proxy gaming hazard by improving our
abilities to monitor anomalous behavior and flag any signs that a system
is proxy gaming at an early stage.</p></li>
<li><p>Reduce the hazard level by making AIs want to optimize an
idealized goal and make mistakes less hazardous by controlling the power
the AI has, so that if it does overoptimize the proxy it would do less
harm.</p></li>
<li><p>Reduce our vulnerability by making our proxies more accurate, by
making AIs more adversarially robust, or by reducing our dependence on
AIs.</p></li>
</ol>
<p><strong>Systemic safety addresses factors external to the AI
itself.</strong> The three fields outlined above focus on reducing risk
through the design of AIs themselves. Another approach, called systemic
safety, considers the environment within which the AI operates and
attempts to remove or reduce the hazards that it might otherwise
interact with. For example, improving information security reduces the
chance of a malicious actor accessing a lethal autonomous weapon, while
addressing inequality and improving mental health across society could
reduce the number of people who might seek to harness AI to cause
harm.</p>
<p><strong>Adding ability to cope can improve the disaster risk
equation.</strong> There are other factors that could be included in the
disaster risk equation. We can return to our example of the slippery
floor to illustrate one of these factors. After slipping on the floor,
we might take less time to recover if we have access to better medical
technology. This tells us to what extent we would be able to recover
from the damage the hazard caused. We can refer to the capacity to
recover as our <em>ability to cope</em>. Unlike the other factors that
multiply together to give us an estimate of risk, we might divide by
ability to cope to reduce our estimate of the risk if our ability to
cope with it is higher. This is a common extension to the disaster risk
equation.<p>
Some hazards are extremely damaging and eliminate any chance of
recovery: the severity of the hazard and our vulnerability are high,
while our ability to cope is tiny. This constitutes a <em>risk of
ruin</em>–—permanent, system-complete destruction. In this case, the
equation would involve multiplying together two large numbers and
dividing by a small number; we would calculate the risk as being
extremely large. If the damage cannot be recovered from, like an
existential catastrophe (e.g., a large asteroid or sufficiently powerful
rogue AIs), the risk equation would tell us that the risk is
astronomically large or infinite.</p>
<p><strong>Summary.</strong> We can evaluate a risk by breaking it down
into the probability of an adverse event and the amount of harm it would
cause. This enables us to quantitatively compare various kinds of risks.
If we have enough information, we can analyze risks further in terms of
our level of exposure to them and how vulnerable we are to damage from
them, as well as our ability to cope. Even if we cannot assign an exact
numerical value to a risk, we can estimate it. If our estimates are
unreliable, this decomposition can still help us to systematically
identify practical measures we can take to reduce the different factors
and thus the overall risk.</p>
<br>
<br>
<h3>References</h3>
<div id="refs" class="references csl-bib-body" data-entry-spacing="0"
role="list">
<div id="ref-Hendrycks2022xrisk" class="csl-entry" role="listitem">
<div class="csl-left-margin">[1] D.
Hendrycks and M. Mazeika, <span>“X-risk analysis for AI
research.”</span> 2022. Available: <a
href="https://arxiv.org/abs/2206.05862">https://arxiv.org/abs/2206.05862</a></div>
</div>
</div>