Enhancing Data Reliability and Recovery Performance in Erasure Coded Storage Systems

Osama Khan, Johns Hopkins University

Ensuring data reliability has always been a key concern in large storage systems. Numerous schemes have been put forward to make data more durable, including replication and erasure coding. Remote auditing can be used to ensure that data is available and free from corruption or deletion. In the first part of this dissertation, we outline how to make remote data auditing more robust against small targeted deletions. In the second part, we focus on the recovery problem: given that a (hardware or software) failure has occurred, the cost of recovering the data affected by that failure should be minimized. While such an open-ended problem has many facets, we focus on its I/O aspect. We examine ways of minimizing the amount of I/O needed for recovery in the context of an erasure coded storage system. This involves analyzing the recovery characteristics of existing codes, as well as introducing a new class of erasure codes that are specifically designed for optimal recovery I/O performance. We provide an in-depth experimental evaluation of the theoretical results to show that reading a minimal set of symbols during recovery does indeed translate into practical savings in reconstruction I/O cost.

Speaker Biography

Osama Khan received the B.Sc. (Hons.) degree in Computer Science from Lahore University of Management Sciences, Lahore, Pakistan, in 2002. In 2003 he moved to Germany, where he completed the Master's program offered by the University of Saarland in collaboration with the German Center for Artificial Intelligence and Max-Planck Institut for Computer Science. He was then awarded the Fulbright Fellowship to pursue his Ph.D. at Johns Hopkins University in 2006. At JHU, he was a member of HSSL, which is headed by Dr. Randal Burns. His research focuses on data reliability and recovery issues in large storage systems. In the summer of 2012, he interned at Microsoft Research, where he worked on the ThinCloud project, dealing with issues related to fault tolerance in a large distributed storage system. His research interests include replication, erasure coding, data auditing, cloud file systems and distributed storage systems.