ECE 753: Fault-Tolerant Computing,
Spring 2010-2011 (Jan 2011 - May 2011)
Kewal K. Saluja
THIS SITE IS UNDER CONSTRUCTION
Lecture Location and Time:
Room 1143 Mechanical Engineering. Tuesdays and Thursdays 11:00-12:15
Instructor Office Hours:
Mondays 2:00-3:30;
Tuesdays 2:00-3:30;
Wednesdays 2:00-3:00; and
Other days - by appointment.
Room 4611 Engineering Hall
(Optional) Discussion Session and Location: None at present
This site is still being updated - some links may not work
Lecture set 1 in pdf (six slides per page)
Motivation
About the course
Introduction - historical perspective
- Terminology and definitions
- Fundamental Principles
- Fault - Error - Failure chain
1/20 Introduction (contd.)
Fault Characteristics and Barriers in pdf
- Fault - Error - Failure chain
- Methods to break the FEF chain
(You must read the first three papers in the reading list, namely
[aviz:95], [siew:95] and [cris:91])
Lecture set 2 in pdf (six slides per page)
Fault Modeling - (please read [abra:86] and [mull:93] papers in the reading list)
Fault characteristics
Fault Modeling
Introduction
Models at different levels
gate level, function level, system level
Error models
System fault models and high level failure models
Modeling other faults
1/25 Fault Modeling (contd.)
Lecture set 3 in pdf (six slides per page)
Test Generation and fault simulation
Introduction
Basics of Testing
Complexity of testing and complexity reduction methods
1/26
Homework set 1 in PDF
Homework 1 due on Thursday February 3
1/27
Test Generation and fault simulation (contd.)
Complexity of testing and complexity reduction methods
Fault equivalences
Fault simulation - 2, 3 value
Combinational test generation
- Random
- Deterministic (PODEM)
2/1
Test Generation and fault simulation (contd.)
Combinational test generation
- Random
- Deterministic (PODEM)
- 2, 3, 5, and 9 value simulation, heuristics
Sequential test generation
Design for testability (DFT)
Built-in Self-test (BIST)
2/3
Homework 1 due Today
Lecture set 4 in pdf (six slides per page)
Concepts in fault tolerance
Introduction
An Example showing the scope of the course
Hardware redundancy techniques
- Passive, Active, Hybrid
Information redundancy - coding
Information redundancy - self-checking
2/7
Homework set 2 in PDF
Homework 2 due on Thursday February 17
2/8
Concepts in fault tolerance (contd.)
Information redundancy - self-checking
Time redundancy
Software redundancy
About Project - I would like to see projects decided by March 1
2/10
Concepts in fault tolerance (contd.)
Software redundancy
About Project - I would like to see projects decided by March 1
Homework set 1 Solution in PDF
2/15
Reliability modeling and analysis (contd.)
reliability computation
reliability block diagrams
- series, parallel, and series/parallel systems
- non series/parallel systems
Slides by Koren and Krishna - canonical structures (six slides per page)
combinatorial methods
Markov models
2/17 Homework 2 due Today
Reliability modeling and analysis (contd.)
Reliability block diagrams
- non series/parallel systems
Markov models
solution methods
examples
2/22
Reliability modeling and analysis (contd.)
Markov models
solution methods
examples
other parameters such as safety, availability
Availability
General remarks - overhead, Mission time improvement,
law of diminishing returns
Petri net model and solution - brief discussion
2/24
Lecture set 6 in pdf (six slides per page)
System level diagnosis
Introduction
System and system test model
one-step diagnosis - design
3/1
System level diagnosis (contd.)
one-step diagnosis - design
other models
sequential diagnosis
Various other formulations and general comments
3/3 Project choice and team due today
System level diagnosis (contd.)
Various other formulations and general comments
Lecture set 7 in pdf (six slides per page)
Low level fault tolerance: Error correction coding
Introduction and motivation
Hamming code by example
Algebraic coding - Theory
Homework set 2 Solution in PDF
3/8 Homework 3 due Today
Low level fault tolerance: Error correction coding (contd.)
Algebraic coding - Theory
3/10 Project Proposal due today
Homework set 3 Solution in PDF
Low level fault tolerance: Error correction coding (contd.)
Algebraic coding - Theory
Linear Block codes: SEC-DED code
Linear Block Codes: Theorems
Odd weight column code
Hardware issues
3/12 to 3/20
Spring Break
3/22
Low level fault tolerance: Error correction coding (contd.)
Algebraic coding - Theory
Odd weight column code
Hardware issues
single error correcting - single byte error detecting codes
Cyclic codes - basics - (Time permitting)
3/24
Low level fault tolerance: Error correction coding (contd.)
Cyclic codes - basics and an example
Homework set 4 in PDF
Homework 4 due April 5 (Tuesday)
3/29
Low level fault tolerance: Watchdog and Re-execution (contd.)
Watchdog based methods - path signature and branch address hashing
Re-execution based methods
instruction re-execution,
program re-executions and variations thereof
Case studies
Homework set 4 Solution in PDF
High level fault tolerance: Checkpointing and recovery (contd.)
Definitions
Issues in checkpointing - kernel, user, application
Optimal checkpointing (contd.)
Reducing overhead
Checkpointing in distributed systems
system model
consistent state, recovery line, domino effect, livelocks
4/7
Lecture set 10 in pdf (six slides per page)
Software fault-tolerance
Causes of Errors, Techniques to reduce errors, Acceptance Tests
Single Version Fault Tolerance
Wrapper
Rejuvenation
4/12 Homework 5 due today
Review before exam
Homework set 5 Solution in PDF
4/13
EXAM Wednesday Evening 7:15-8:45 PM Room 3024 Engineering Hall
Syllabus for the Exam (PDF file)
Old exam - I am placing the exam from the last offering (Spring 2008-09) and its solution here.
Please keep in mind that some of the material covered this
year is different from the previous offering of the course.
Byzantine agreement problem - a pdf file
Byzantine agreement problem - Sensor network context
4/21 Project presentation
Christopher Karle and Deepika Ganju: Project 3 - Flash Aware RAID
Summary: Flash based SSDs are increasing in popularity, performance, and capacity.
The advancements made in capacity come at a cost of decreased reliability.
Traditional error correcting codes capable of repairing single-bit errors are
no longer completely effective as multiple-bit errors grow more common.
Because of this, flash based storage should leverage strategies used in RAID configurations,
though flash has several distinct characteristics which need to be considered when applying
RAID techniques. This paper gives a brief background in RAID, while highlighting many of
its problems. The remainder of the content is used to present several flash aware
RAID adaptations.
Project 3 - Flash Aware RAID by Deepika Ganju and Chris Karle in .pptx format
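The single-parity idea behind classic RAID, which the summary above takes as its starting point, can be sketched as follows. This is a minimal illustration of XOR parity only (RAID 4/5 style), not one of the flash-aware adaptations from the project; the function names and the sample blocks are assumptions made for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_parity(data_blocks):
    """Parity block stored alongside the data blocks."""
    return xor_blocks(data_blocks)

def rebuild(surviving_blocks, parity):
    """Recover the single missing data block from the survivors and parity."""
    return xor_blocks(list(surviving_blocks) + [parity])

# Hypothetical stripe of three data blocks, one per device.
data = [b"flash", b"aware", b"RAID!"]
parity = make_parity(data)

# Simulate losing the second device and rebuilding its block.
recovered = rebuild([data[0], data[2]], parity)
```

A single parity block tolerates the loss of only one block per stripe, and every data update also rewrites the parity block, which matters for flash because of its limited write endurance.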
4/26 Project presentation (60 minutes) + Course Evaluation
Jie Liu: Project 7 - Metamorphic Testing and its Application in Self-Tests
Summary: In testing, people normally use the “golden model” to judge the correctness of
the tested unit. What if the unit outputs don't exactly match the “golden model”,
but it is still considered to be correct? Metamorphic testing was first introduced
to software testing to target such problems. In this presentation, we will first discuss
the idea of applying metamorphic testing to hardware fault-tolerance.
It can be accomplished by two methods: time redundancy and hardware redundancy.
We will then compare it with a DMR implementation of hardware redundancy,
due to their similar functionalities, and discuss the differences in performance and area.
In addition, we will look at the advantages of metamorphic testing in hardware,
and different goals achieved by time redundancy and hardware redundancy.
Finally, we will also look at the disadvantages, such as a single point of failure and the
inability to give a rational judgment of the correct output if the test fails,
and then discuss the proposed solutions for those disadvantages.
Project 7 - Metamorphic Testing and its Application in Self-Tests by Jie Liu in .ppt format
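As a minimal sketch of the idea in the summary above, the check below uses time redundancy: the same unit is run twice, once on the original input and once on a transformed input, and the two outputs are compared through a known relation instead of against a golden model. The sine function, the relation sin(x) = sin(pi - x), and the injected fault are all assumptions chosen for illustration; they are not taken from the project.

```python
import math

def unit_under_test(x):
    """Stand-in for the hardware or software unit being checked."""
    return math.sin(x)

def faulty_unit(x):
    """Hypothetical fault whose error depends on the input value."""
    return math.sin(x) + 0.01 * x

def metamorphic_check(unit, x, tol=1e-9):
    """Run the unit on x and on pi - x and compare the two results.

    No golden output is needed: the results only have to satisfy
    the metamorphic relation sin(x) == sin(pi - x).
    """
    first = unit(x)
    second = unit(math.pi - x)
    return abs(first - second) <= tol

ok_result = metamorphic_check(unit_under_test, 0.5)   # relation holds
bad_result = metamorphic_check(faulty_unit, 0.5)      # relation violated
```

As the summary notes, a failed check only says that something is wrong; it does not say which of the two outputs is the correct one. A fault that shifts both outputs by the same amount would also satisfy this particular relation and go undetected.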
4/28 Project report due today
Project presentation
Ramkumar Ravikumar and Adithya Krishnamurthy: Project 6 - Fault Tolerance in Automotive Systems
Summary: Design of fault tolerant electronics systems has become a standard requirement
in the automotive sector these days. These systems increase the overall automotive
and passenger safety by liberating the driver from handling routine tasks and
also assisting the driver during critical situations. This paper introduces the
reader to fault tolerant design practices followed across several layers in
the automotive industry. We start off by analyzing X-by-wire systems, which
are fault tolerant distributed systems that are fail-operational and can maintain
a reliable state all the time. Next, we investigate fault tolerance techniques
used in the design of automotive software and how they help to improve the overall
reliability and dependability of the system. Sensors and Actuators, considered
to be the basic building blocks of sophisticated electronic control units in
automobiles, need to be fault tolerant to ensure smooth operation of various
electronic systems in the vehicle. A separate section, detailing the design of
fail-safe Sensors and Actuators is a part of this paper as well. We conclude
the paper with a section on the design of automotive communication systems and
protocols and their ability to ensure reliable communication between various
ECUs in the vehicle.
Project 6 - Fault Tolerance in Automotive Systems by Ramkumar Ravikumar and Adithya Krishnamurthy in .ppt format
5/3 Project presentation
Matt Sinclair and Felix Loh: Project 1 - G-CP: Providing Fault Tolerance on the GPU through Software Checkpointing
Summary: GPUs have become increasingly popular in recent years, in large part
due to their potential to offer a large amount of computational power
at low prices. GPU designers have also made GPU pipelines more general
purpose and more programmable, which has made GPUs more attractive to
a wider audience. Thus, it is increasingly important to provide fault
tolerance. However, pre-Fermi NVIDIA GPUs do not provide fault
tolerance in any form. Since GPUs are often used in high performance
computing and other areas where data integrity is important, there
exists the need to provide support for fault tolerance. In this
project, we present G-CP, a mechanism for providing fault tolerance
support in GPUs through use of software checkpointing combined with
time and/or space redundancy. By doing this, GPU algorithms will be
able to periodically checkpoint their work. If a fault has occurred,
then the user can roll back to the last checkpoint and continue
executing.
Project 1 - G-CP: Providing Fault Tolerance on the GPU through Software Checkpointing by Matt Sinclair and Felix Loh in .ppt format
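The checkpoint-and-rollback loop described in the summary above can be sketched on the CPU as follows. This is only an illustration under stated assumptions and is not the G-CP implementation: `run_with_checkpoints`, `step`, and `is_valid` are hypothetical names, the fault detector is a stand-in for a comparison of redundant executions, and the fault is a single injected transient error.

```python
import copy

def run_with_checkpoints(state, step, is_valid, total_steps, interval):
    """Advance `state` with `step`, checkpointing every `interval` steps.

    On a detected fault, roll back to the last checkpoint and re-execute.
    A permanent fault would make this loop forever; a real system would
    bound the number of retries.
    """
    checkpoint = (0, copy.deepcopy(state))
    rollbacks = 0
    i = 0
    while i < total_steps:
        state = step(state, i)
        i += 1
        if i % interval == 0 or i == total_steps:
            if is_valid(state, i):
                checkpoint = (i, copy.deepcopy(state))
            else:
                i, state = checkpoint[0], copy.deepcopy(checkpoint[1])
                rollbacks += 1
    return state, rollbacks

faults_left = [1]  # inject exactly one transient fault

def step(state, i):
    """Running sum of the step indices, corrupted once at step 7."""
    state = state + i
    if i == 7 and faults_left[0] > 0:
        faults_left[0] -= 1
        state += 1000  # corrupted result
    return state

def is_valid(state, i):
    """Hypothetical detector: compare against a redundant computation."""
    return state == sum(range(i))

final, rollbacks = run_with_checkpoints(0, step, is_valid, 20, 5)
```

A shorter checkpoint interval reduces the work lost on a rollback but increases the time spent saving state, which is the trade-off studied under optimal checkpointing earlier in the course.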
5/5 Project presentation
Varun Vats and Raghuvardhan Moola: Project 5 - Exploiting Accidental Heterogeneity in Multicore Processors
Summary: Of late, fault tolerance in commodity multi-core processors has assumed great
importance because of the highly miniaturized but unreliable fabrication
technologies in use and even more so because of the cost associated with them.
Redundancy (and hence reconfigurability) has been crucial to fault tolerance
and can be applied at five distinct levels in multi-core processors.
In increasing order of granularity, these are: gate level, micro-architectural
level, stage level, architectural level and core level. Core level reconfigurability
techniques are the easiest to apply but have poor returns in terms of lifetime
extension whereas gate level techniques have tremendous overhead due to large
number of redundant components and high routing area associated with them.
The other three techniques have a significant number of advantages over these
two that make them highly viable options. In our presentation, we discuss
various architectures proposed in the literature that employ reconfigurability
in these levels and present a comparison based on the complexity, hardware cost,
and performance overhead associated with them and fault coverage they provide.
Project 5 - Exploiting Accidental Heterogeneity in Multicore Processors by Varun Vats and Raghuvardhan Moola in .pptx format
5/6 (Friday) Review report due today
5/8 Sunday - Project Presentations 10:00 AM to 12:30 PM Room 4610 Engineering Hall
Project presentations
Andrew Nere and David Palframan: Project 2 - Fault Analysis and Fault Tolerance in Cortical Models
Summary: For many decades, computing devices based on the von Neumann
architecture have benefited from Moore's Law, utilizing extra devices to improve
processor performance. However, power constraints, the difficulty of programming CMPs,
and limited throughput between memory and the CPU (i.e. the von Neumann bottleneck) have
pushed researchers to investigate alternative computing designs, including systems modeled
after the structure and functionality of the brain. The fault models of these alternative
computing systems have yet to be fully investigated. In this work, we investigate the
fault behavior of a cortical column model used to implement a simplified representation
of the mammalian visual cortex. We evaluate the behavior of different cortical hierarchies
in the presence of various types of faults, including stuck-at and coupling-like faults.
Finally, we discuss the fault tolerance inherent in such systems as well as biological
and digital parallels to the faults we investigate.
Project 2 - Fault Analysis and Fault Tolerance in Cortical Models - by Andrew Nere and David Palframan in .pptx format
Anurag Patel and Kamlesh Prakash: Project 4 - Fault-tolerant features of modern processors – A Case Study
Summary: Processor manufacturers have taken several approaches to
providing fault tolerance in microprocessors. We present a survey of
techniques that have been implemented in various microprocessors,
including design for testability, redundant/self-checking circuits,
error-correcting memory subsystems, and software-based fault
tolerance. We look at the power and performance overheads associated
with some of these techniques and possible ways to overcome them.
With technology scaling, power-based performance metrics have become
very important. Several new fault tolerance techniques have been
proposed recently that provide fault tolerance at a lower overhead.
We looked at several of these new techniques and present here the
ones we found to be most promising.