WIT Press


METHOD FOR MITIGATING SYSTEM FAILURES

Price

Free (open access)

Volume

174

Pages

12

Page Range

213 - 224

Published

2018

Size

519 KB

Paper DOI

10.2495/SAFE170201

Copyright

WIT Press

Author(s)

TAKAFUMI NAKAMURA

Abstract

This paper proposes a method for mitigating system failures that addresses the shortcomings of current state-of-the-art methodologies (i.e., crisis management, risk management, normal accident theory, and high reliability organization). Most current methodologies for ICT systems take a reductionist approach (i.e., one that lacks a holistic view). More holistic methodologies are therefore needed to mitigate system failures. Examples of system failures abound. The Tokyo Stock Exchange crashed on the 1st of November 2005 because of an operational error, severely impacting the global economy. Such failures severely impact not only ICT systems but also social systems (transportation systems, nuclear plant systems, etc.). A JR West train derailed and overturned in Amagasaki, Japan on the 25th of April 2005 because of driver misconduct, costing 106 passengers' lives. On the 11th of March 2011, a massive earthquake fiercely shook eastern Japan and was followed by a devastating tsunami. This caused a hydrogen explosion at the Fukushima No. 1 nuclear plant, the aftermath of which will take a long time to clean up. The progress of ICT technologies (i.e., cloud, virtual, and network technologies) inevitably makes ICT systems more complex, with tightly interacting domains. More than ever, this trend requires a novel way to mitigate system failures and promote system safety. Promoting system safety requires dealing with emergent properties, so crisis management should focus on holistic properties rather than on individual components. This paper introduces a system failure framework that promotes a holistic view for managing, and thereby mitigating, system failures. An application to ICT system failures demonstrates the effectiveness of this methodology.

Keywords

risk management, crisis management, normal accident theory (NAT), high reliability organization (HRO), information and communication technology (ICT)