ECE 753: Fault-Tolerant Computing, Spring 2010-2011 (Jan 2011 - May 2011)

Kewal K. Saluja


THIS SITE IS UNDER CONSTRUCTION

Lecture Location and Time: Room 1143 Mechanical Engineering. Tuesdays and Thursdays 11:00-12:15

Instructor Office Hours: Mondays 2:00-3:30; Tuesdays 2:00-3:30; Wednesdays 2:00-3:00; other days by appointment. Room 4611 Engineering Hall
(Optional) Discussion Session and Location: None at present

THIS SITE IS UNDER CONSTRUCTION AND IT IS still being updated - some links may not work

PDF files
  • Cover Sheet
  • Conduct
  • Outline
  • General Reference List
  • Reading List

  • Some papers from the reading list in PDF file format (I AM STILL WORKING ON IT; THIS LINK IS STILL NOT UP)
  • [aviz:95] Building Dependable Systems: How to keep up with complexity
  • [siew:95] Niche Successes to Ubiquitous Invisibility: Fault Tolerant Computing, Past, Present and Future
  • [mull:93] Fault Tolerant Broadcast and Related Problems
  • [cris:91] Understanding Fault-Tolerant Distributed Systems

  • PROJECTS REVIEWS PAPERS PRESENTATIONS
    Class Schedule and material covered in the lectures - Spring 2010-2011
    
    1/18
     
  • Lecture set 1 in .ppt
  • Lecture set 1 in pdf (six slides per page) Motivation About the course Introduction - historical perspective - Terminology and definitions - Fundamental Principles - Fault - Error - Failure chain
    1/20 Introduction (contd.)
  • Fault Characteristics and Barriers in pdf - Fault - Error - Failure chain - Methods to break the FEF chain (You must read the first three papers in the reading list, namely [aviz:95], [siew:95] and [cris:91])
  • Lecture set 2 in .ppt
  • Lecture set 2 in pdf (six slides per page) Fault Modeling - (please read [abra:86] and [mull:93] papers in the reading list) Fault characteristics Fault Modeling Introduction Models at different levels - gate level, function level, system level Error models System fault models and high level failure models Modeling other faults
    1/25 Fault Modeling (contd.)
  • Fault Models at System Level in pdf
  • Lecture set 3 in .ppt
  • Lecture set 3 in pdf (six slides per page) Test Generation and fault simulation Introduction Basics of Testing Complexity of testing and complexity reduction methods
    1/26
  • Homework set 1 in PDF Homework 1 due on Thursday February 3
    1/27 Test Generation and fault simulation (contd.) Complexity of testing and complexity reduction methods Fault equivalences Fault simulation - 2, 3 value Combinational test generation - Random - Deterministic (PODEM)
    2/1 Test Generation and fault simulation (contd.) Combinational test generation - Random - Deterministic (PODEM) - 2, 3, 5, and 9 value simulation, heuristics Sequential test generation Design for testability (DFT) Built-in Self-test (BIST)
    2/3 Homework 1 due Today
  • Lecture set 4 in .ppt
  • Lecture set 4 in pdf (six slides per page) Concepts in fault tolerance Introduction An Example showing the scope of the course Hardware redundancy techniques - Passive, Active, Hybrid Information redundancy - coding Information redundancy - self-checking
    2/7
  • Homework set 2 in PDF Homework 2 due on Thursday February 17
    2/8 Concepts in fault tolerance (contd.) Information redundancy - self-checking Time redundancy Software redundancy About Project - I would like to see project decided by March 1
    2/10 Concepts in fault tolerance (contd.) Software redundancy About Project - I would like to see project decided by March 1
  • Lecture set 5A in .ppt
  • Lecture set 5A in pdf (six slides per page) Various fault tolerant measures
  • Lecture set 5B in .ppt
  • Lecture set 5B in pdf (six slides per page) Reliability modeling and analysis recap, need for modeling and analysis mathematical formulation reliability computation
    2/11
  • Homework set 1 Solution in PDF
    2/15 Reliability modeling and analysis (contd.) reliability computation reliability block diagrams - series, parallel, and series/parallel systems - non series/parallel systems
  • Slides by Koren and Krishna - canonical structures (six slides per page) combinatorial methods Markov models
    2/17 Homework 2 due Today Reliability modeling and analysis (contd.) Reliability block diagrams - non series/parallel systems Markov models solution methods examples
    2/22 Reliability modeling and analysis (contd.) Markov models solution methods examples other parameters such as safety, availability Availability General remarks - overhead, Mission time improvement, law of diminishing returns Petri net model and solution - brief discussion
    2/24
  • Homework set 3 in PDF Homework 3 due March 8 (Tuesday)
  • Lecture set 6 in .ppt
  • Lecture set 6 in pdf (six slides per page) System level diagnosis Introduction System and system test model one-step diagnosis - design
    3/1 System level diagnosis (contd.) one-step diagnosis - design other models sequential diagnosis Various other formulations and general comments
    3/3 Project choice and team due today System level diagnosis (contd.) Various other formulations and general comments
  • Lecture set 7 in .ppt
  • Lecture set 7 in pdf (six slides per page) Low level fault tolerance: Error correction coding Introduction and motivation Hamming code by example Algebraic coding - Theory
  • Homework set 2 Solution in PDF
    3/8 Homework 3 due Today Low level fault tolerance: Error correction coding (contd.) Algebraic coding - Theory
    3/10 Project Proposal due today
  • Homework set 3 Solution in PDF Low level fault tolerance: Error correction coding (contd.) Algebraic coding - Theory Linear Block codes: SEC-DED code Linear Block Codes: Theorems Odd weight column code Hardware issues
    3/12 to 3/20 Spring Break
    3/22 Low level fault tolerance: Error correction coding (contd.) Algebraic coding - Theory Odd weight column code Hardware issues single error correcting - single byte error detecting codes Cyclic codes - basics - (Time permitting)
    3/24 Low level fault tolerance: Error correction coding (contd.) Cyclic codes - basics and an example
  • Lecture set 8 in .ppt
  • Lecture set 8 in pdf (six slides per page)
  • DSN 2010 Presentation - Chip Multiprocessors Low level fault tolerance: Watchdog and Re-execution Introduction Watchdog based methods Timers, processors
  • Homework set 4 in PDF Homework 4 due April 5 (Tuesday)
    3/29 Low level fault tolerance: Watchdog and Re-execution (contd.) Watchdog based methods - path signature and branch address hashing Re-execution based methods instruction re-execution, program re-executions and variations thereof Case studies
  • Case Study 1 - CRAY (file in pdf)
  • Case Study 2 - ARSMT (file in pdf)
  • Case Study 3 - Multiscalar (file in pdf)
  • Case Study 4 - DSN 2010 Presentation - Chip Multiprocessors
    3/31 Project progress report due today Low level fault tolerance: Watchdog and Re-execution (contd.) Case study 4 (contd.)
  • Lecture set 9 (set 1) in .ppt
  • Lecture set 9 (set 1) in pdf (six slides per page) High level fault tolerance: Checkpointing and recovery basic concept fault model and coverage checkpointing in uniprocessor systems
  • Lecture set 9 (set 2) in .ppt
  • Lecture set 9 (set 2) in pdf (six slides per page) High level fault tolerance: Checkpointing and recovery (contd.) Case for checkpointing
  • Homework set 5 in PDF Homework 5 due April 12 (Tuesday)
    4/5 Homework 4 due Today
  • Homework set 4 Solution in PDF High level fault tolerance: Checkpointing and recovery (contd.) Definitions Issues in checkpointing - kernel, user, application Optimal checkpointing (contd.) Reducing overhead Checkpointing in distributed systems system model consistent state, recovery line, domino effect, livelocks
    4/7
  • Lecture set 9 (set 3) in .ppt
  • Lecture set 9 (set 3) in pdf (six slides per page) Will cover only Coordinated checkpointing Message Logging optimistic and pessimistic logging
  • Lecture set 9 (set 4) in ppt
  • Lecture set 9 (set 4) in pdf (six slides per page) Will cover Communication induced and log based recovery Forward error recovery
  • Lecture set 10 in ppt
  • Lecture set 10 in pdf (six slides per page) Software fault-tolerance Causes of Errors, Techniques to reduce errors, Acceptance Tests Single Version Fault Tolerance Wrapper Rejuvenation
    4/12 Homework 5 due today Review before exam
  • Homework set 5 Solution in PDF
    4/13 EXAM Wednesday Evening 7:15-8:45 PM Room 3024 Engineering Hall
    Syllabus for the Exam (PDF file)
    Old exam - I am placing the exam from the last offering (Spring 2008-09) and its solution here. Please keep in mind that some of the material covered this year is different from the previous offering of the course.
  • Exam from Spring 2008-2009 in PDF
  • Exam Solution from Spring 2008-2009 in PDF
    Solutions to the exam this year:
  • Exam Solution in PDF
    4/14 EXPO (4/14 - 4/16) No Lecture
    4/19 Software fault-tolerance (contd.) Single Version Fault Tolerance Data Diversity SIHFT RESO N-version Fault Tolerance Consistent comparison problem confidence signals Independent vs. correlated failures achieving version independence Recovery Block approach
  • Lecture set 11 in .ppt
  • Lecture set 11 in pdf (six slides per page) Reconfiguration/Network fault tolerance Introduction and models n-cube architecture
  • Byzantine agreement problem - a .ppt file
  • Byzantine agreement problem - a pdf file Byzantine agreement problem - Sensor network context
    4/21 Project presentation
    Christopher Karle and Deepika Ganju: Project 3 - Flash Aware RAID
    Summary: Flash based SSDs are increasing in popularity, performance, and capacity. The advancements made in capacity come at a cost of decreased reliability. Traditional error correcting codes capable of repairing single-bit errors are no longer completely effective as multiple-bit errors grow more common. Because of this, flash based storage should leverage strategies used in RAID configurations, though flash has several distinct characteristics which need to be considered when applying RAID techniques. This paper gives a brief background in RAID, while highlighting many of its problems. The remainder of the content is used to present several flash aware RAID adaptations.
  • Project 3 - Flash Aware RAID by Deepika Ganju and Chris Karle in .pptx format
    4/26 Project presentation (60 minutes) + Course Evaluation
    Jie Liu: Project 7 - Metamorphic Testing and its Application in Self-Tests
    Summary: In testing, people normally use the “golden model” to judge the correctness of the tested unit. What if the unit outputs don't exactly match the “golden model”, but it is still considered to be correct? Metamorphic testing was first introduced to software testing to target such problems. In this presentation, we will first discuss the idea of applying metamorphic testing to hardware fault-tolerance. It can be accomplished by two methods: time redundancy and hardware redundancy. We will then compare it with a DMR implementation of hardware redundancy, due to their similar functionalities, and discuss the differences in performance and area. In addition, we will look at the advantages of metamorphic testing in hardware, and different goals achieved by time redundancy and hardware redundancy. Finally, we will also look at the disadvantages such as single-point of failure, and its inability to give a rational judgment of the correct output if the test fails. And then, we can discuss the proposed solutions for those disadvantages.
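    Illustration (not part of the project material): the idea of judging correctness without a golden model can be sketched with a textbook metamorphic relation, sin(x) = sin(pi - x). The function names below (`unit_under_test`, `faulty_unit`, `metamorphic_check`) are hypothetical, and running the unit twice on related inputs corresponds only loosely to the time-redundancy variant mentioned in the summary.

```python
import math


def unit_under_test(x: float) -> float:
    """Hypothetical stand-in for the unit being tested; here simply math.sin."""
    return math.sin(x)


def faulty_unit(x: float) -> float:
    """Deliberately broken variant, used only to show a violated relation."""
    return math.sin(x) + 0.01 * x


def metamorphic_check(f, x: float, tol: float = 1e-9) -> bool:
    """Check the relation f(x) == f(pi - x) without knowing the true value.

    The unit is executed twice on related inputs and the two outputs are
    compared with each other, not with a reference ("golden") result.
    """
    return abs(f(x) - f(math.pi - x)) <= tol


if __name__ == "__main__":
    print(metamorphic_check(unit_under_test, 0.3))  # relation holds
    print(metamorphic_check(faulty_unit, 0.3))      # relation is violated
```

    A failed check only signals that some execution was wrong; as the summary notes, it does not say which output is the correct one.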
  • Project 7 - Metamorphic Testing and its Application in Self-Tests by Jie Liu in .ppt format
    4/28 Project report due today Project presentation
    Ramkumar Ravikumar and Adithya Krishnamurthy: Project 6 - Fault Tolerance in Automotive Systems
    Summary: Design of fault tolerant electronics systems has become a standard requirement in the automotive sector these days. These systems increase the overall automotive and passenger safety by liberating the driver from handling routine tasks and also assisting the driver during critical situations. This paper introduces the reader to fault tolerant design practices followed across several layers in the automotive industry. We start off by analyzing X-by-wire systems, which are fault tolerant distributed systems that are fail-operational and can maintain a reliable state all the time. Next, we investigate fault tolerance techniques used in the design of automotive software and how they help to improve the overall reliability and dependability of the system. Sensors and Actuators, considered to be the basic building blocks of sophisticated electronic control units in automobiles, need to be fault tolerant to ensure smooth operation of various electronic systems in the vehicle. A separate section, detailing the design of fail-safe Sensors and Actuators, is a part of this paper as well. We conclude the paper with a section on the design of automotive communication systems and protocols and their ability to ensure reliable communication between various ECUs in the vehicle.
  • Project 6 - Fault Tolerance in Automotive Systems by Ramkumar Ravikumar and Adithya Krishnamurthy in .ppt format
    5/3 Project presentation
    Matt Sinclair and Felix Loh: Project 1 - G-CP: Providing Fault Tolerance on the GPU through Software Checkpointing
    Summary: GPUs have become increasingly popular in recent years, in large part due to their potential to offer a large amount of computational power at low prices. GPU designers have also made GPU pipelines more general purpose and more programmable, which has made GPUs more attractive to a wider audience. Thus, it is increasingly important to provide fault tolerance. However, pre-Fermi NVIDIA GPUs do not provide fault tolerance in any form. Since GPUs are often used in high performance computing and other areas where data integrity is important, there exists the need to provide support for fault tolerance. In this project, we present G-CP, a mechanism for providing fault tolerance support in GPUs through use of software checkpointing combined with time and/or space redundancy. By doing this, GPU algorithms will be able to periodically checkpoint their work. If a fault has occurred, then the user can roll back to the last checkpoint and continue executing.
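    Illustration (not the G-CP implementation): the checkpoint-and-rollback cycle described in the summary can be sketched on the CPU in a few lines. Everything here is an assumption made for the example: the names `CheckpointedRun` and `run_with_checkpoints`, the toy workload (summing integers), and the premise that the injected fault is detected immediately, whereas the project relies on time and/or space redundancy for detection.

```python
import copy


class CheckpointedRun:
    """Holds the working state plus the most recent saved copy of it."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        # Save a full copy of the current state.
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Discard the current state and resume from the last checkpoint.
        self.state = copy.deepcopy(self._saved)


def run_with_checkpoints(steps, interval, fault_at=None):
    """Sum the integers 0..steps-1, checkpointing every `interval` steps.

    `fault_at` injects one simulated fault at that step; the fault is
    assumed to be detected at once (an assumption of this sketch).
    Returns (total, number_of_rollbacks).
    """
    run = CheckpointedRun({"i": 0, "total": 0})
    fault_pending = fault_at is not None
    rollbacks = 0
    while run.state["i"] < steps:
        i = run.state["i"]
        if fault_pending and i == fault_at:
            run.state["total"] = -1  # simulated corruption of the result
            fault_pending = False
            run.rollback()           # lose work done since the last checkpoint
            rollbacks += 1
            continue
        run.state["total"] += i
        run.state["i"] = i + 1
        if run.state["i"] % interval == 0:
            run.checkpoint()
    return run.state["total"], rollbacks


if __name__ == "__main__":
    print(run_with_checkpoints(10, 4))              # fault-free run
    print(run_with_checkpoints(10, 4, fault_at=6))  # one fault, one rollback
```

    The checkpoint interval sets the trade-off: frequent checkpoints cost more copying, while infrequent ones mean more work is repeated after a rollback.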
  • Project 1 - G-CP: Providing Fault Tolerance on the GPU through Software Checkpointing by Matt Sinclair and Felix Loh in .ppt format
    5/5 Project presentation
    Varun Vats and Raghuvardhan Moola: Project 5 - Exploiting Accidental Heterogeneity in Multicore Processors
    Summary: Of late, fault tolerance in commodity multi-core processors has assumed great importance because of the highly miniaturized but unreliable fabrication technologies in use and even more so because of the cost associated with them. Redundancy (and hence reconfigurability) has been crucial to fault tolerance and can be applied at five distinct levels in multi-core processors. In increasing order of granularity, these are: gate level, micro-architectural level, stage level, architectural level and core level. Core level reconfigurability techniques are the easiest to apply but have poor returns in terms of lifetime extension whereas gate level techniques have tremendous overhead due to the large number of redundant components and high routing area associated with them. The other three techniques have a significant number of advantages over these two that make them highly viable options. In our presentation, we discuss various architectures proposed in the literature that employ reconfigurability in these levels and present a comparison based on the complexity, hardware cost, and performance overhead associated with them and fault coverage they provide.
  • Project 5 - Exploiting Accidental Heterogeneity in Multicore Processors by Varun Vats and Raghuvardhan Moola in .pptx format
    5/6 (Friday) Review report due today
    5/8 Sunday - Project Presentations 10:00 AM to 12:30 PM Room 4610 Engineering Hall Project presentations
    Andrew Nere and David Palframan: Project 2 - Fault Analysis and Fault Tolerance in Cortical Models
    Summary: For many decades, computing devices based on the von Neumann architecture have benefited from Moore's Law, utilizing extra devices to improve processor performance. However, power constraints, the difficulty of programming CMPs, and limited throughput between memory and the CPU (i.e. the von Neumann bottleneck) have pushed researchers to investigate alternative computing designs, including systems modeled after the structure and functionality of the brain. The fault models of these alternative computing systems have yet to be fully investigated. In this work, we investigate the fault behavior of a cortical column model used to implement a simplified representation of the mammalian visual cortex. We evaluate the behavior of different cortical hierarchies in the presence of various types of faults, including stuck-at and coupling-like faults. Finally, we discuss the fault tolerance inherent in such systems as well as biological and digital parallels to the faults we investigate.
  • Project 2 - Fault Analysis and Fault Tolerance in Cortical Models - by Andrew Nere and David Palframan in .pptx format
    Anurag Patel and Kamlesh Prakash: Project 4 - Fault-tolerant features of modern processors – A Case Study
    Summary: Processor manufacturers have taken several approaches for providing fault tolerance in microprocessors. We present a survey of various techniques that have been implemented in various microprocessors; these include design for testability, redundant/self-checking circuits, error correcting memory subsystems, and software based fault tolerance. We take a look at the power and performance overheads associated with some of these techniques and possible ways to overcome these overheads. With technology scaling, power-based performance metrics have become very important. Several new fault tolerance techniques have been proposed recently that provide fault tolerance at a lower overhead. We looked at several of these new technologies and are presenting here the ones we found to be most promising.
  • Project 4 - Fault-tolerant features of modern processors – A Case Study - by Anurag Patel and Kamlesh Prakash in .ppt format
    5/13 Friday: Final Exam, 10:05 AM (no exam; instead there may be project presentations)