ECE 753: Fault-Tolerant Computing, Spring 2013-2014 (Jan 2014 - May 2014)

Kewal K. Saluja

Lecture Location and Time: Room 1153 Mechanical Engineering. Tuesdays and Thursdays 9:30-10:45

Instructor Office Hours: Mondays 2:00-3:00; Tuesdays 2:00-3:00; Wednesdays 2:00-3:00; other days by appointment. Room 4611 Engineering Hall
(Optional) Discussion Session and Location: None at present

THIS SITE IS UNDER CONSTRUCTION AND IS STILL BEING UPDATED - some links may not work

PDF files
  • Cover Sheet
  • Conduct
  • Outline
  • General Reference List
  • Reading List

  • Some papers from the reading list in PDF format (I AM STILL WORKING ON IT; THIS LINK IS STILL NOT UP)
  • [aviz:95] Building Dependable Systems: How to keep up with complexity
  • [siew:95] Niche Successes to Ubiquitous Invisibility: Fault Tolerant Computing, Past, Present and Future
  • [cris:91] Understanding Fault-Tolerant Distributed Systems
  • [aviz:04] Basic Concepts and Taxonomy of Dependable and Secure Computing
  • [mull:93] Fault Tolerant Broadcast and Related Problems
  • [kala:13] A Survey of Checker Architectures
  • [goel:81] An Implicit Enumeration Algorithm to Generate Tests for Combinational Circuits

    Class Schedule and material covered in the lectures - Spring 2013-2014
  • Lecture set 1 in .ppt
  • Lecture set 1 in pdf (six slides per page) Motivation About the course Introduction - historical perspective - Terminology and definitions (You must read the first four papers in the reading list, namely [aviz:95], [siew:95], [cris:91], and [aviz:04])
    1/23 Introduction (contd.)
  • Fault Characteristics and Barriers in pdf - Fundamental Principles - Fault - Error - Failure chain - Methods to break the FEF chain
  • Lecture set 2 in .ppt
  • Lecture set 2 in pdf (six slides per page) Fault Modeling - (please read [abra:86], [kala:13], and [mull:93] papers in the reading list) Fault characteristics Fault Modeling Introduction Models at different levels: gate level, function level, system level Error models System fault models and high level failure models Modeling other faults
  • Fault Models at System Level in pdf
    1/28 Classes cancelled by the University - No lecture today Fault Modeling (contd.) Modeling faults at higher level
  • Lecture set 3 in .ppt
  • Lecture set 3 in pdf (six slides per page) Test Generation and fault simulation Introduction Basics of Testing Complexity of testing and complexity reduction methods Fault equivalences
  • Homework set 1 in PDF Homework 1 due on Thursday February 6
    1/30 We will cover the material that was scheduled to be covered on Tuesday
    2/4 Test Generation and fault simulation (contd.) Complexity of testing and complexity reduction methods Fault equivalences Fault simulation - 2, 3 value Combinational test generation - Random - Deterministic (PODEM) - 2, 3, 5, and 9 value simulation, heuristics
    2/6 Homework 1 due Today Test Generation and fault simulation (contd.) Combinational test generation - Deterministic (PODEM) - 2, 3, 5, and 9 value simulation, heuristics Sequential test generation Design for testability (DFT) Built-in Self-test (BIST)
    2/11 Test Generation and fault simulation (contd.) Built-in Self-test (BIST)
  • Lecture set 4 in .ppt
  • Lecture set 4 in pdf (six slides per page) Concepts in fault tolerance Introduction An Example showing the scope of the course Hardware redundancy techniques - Passive, Active, Hybrid
  • Active approach to Fault Tolerance in pdf
    2/13 Concepts in fault tolerance (contd.) Information redundancy - self-checking Time redundancy
  • Homework set 2 in PDF Correction in Problem 1c made on Feb 19, 2014 Homework 2 due on Tuesday February 25
    2/18 Concepts in fault tolerance (contd.) Software redundancy
  • Lecture set 5A in .ppt
  • Lecture set 5A in pdf (six slides per page) Various fault tolerant measures
  • Lecture set 5B in .ppt
  • Lecture set 5B in pdf (six slides per page) Reliability modeling and analysis recap, need for modeling and analysis mathematical formulation reliability computation About the Project - I have posted the guidelines - I would like to see projects fully assigned by March 6
  • Guidelines and deadlines (team, project, progress, oral presentation, written report, and reviews) (pdf file)
  • Homework set 1 Solution in PDF
    2/20 Reliability modeling and analysis (contd.) reliability computation reliability block diagrams - series, parallel, and series/parallel systems - non series/parallel systems exact computation upper and lower bounds
  • Slides by Koren and Krishna - upper and lower bounds in .ppt
  • Slides by Koren and Krishna - upper and lower bounds (six slides per page)
    2/25 Homework 2 due Today
  • Homework set 2 Solution in PDF Reliability modeling and analysis (contd.) Markov models solution methods examples other parameters such as safety, maintainability, ... Availability General remarks - overhead, Mission time improvement, law of diminishing return Petri net model and solution - brief discussion (time permitting)
    2/27 Reliability modeling and analysis (contd.) Markov models Availability General remarks - overhead, Mission time improvement, law of diminishing return Petri net model and solution - brief discussion (time permitting)
  • Lecture set 6 in .ppt
  • Lecture set 6 in pdf (six slides per page) System level diagnosis Introduction System and system test model one-step diagnosis - design
  • Homework set 3 in PDF Homework 3 due March 11 (Tuesday) - UPDATED deadline
    3/4 Decide your project and team by now System level diagnosis (contd.) one-step diagnosis - design other models sequential diagnosis YOUR TEAM, PROJECT TITLE, AND PROJECT NUMBER ARE POSTED; SEE
  • PROJECT TEAMS and LIST OF PROJECTS - March 4
    3/6 Project choice and team due today March 6 (Thursday) in the form of a proposal System level diagnosis (contd.) Various other formulations and general comments
  • Lecture set 7 in .ppt
  • Lecture set 7 in pdf (six slides per page) Low level fault tolerance: Error correction coding Introduction and motivation Hamming code by example
    3/11 Homework 3 due Today UPDATED deadline Low level fault tolerance: Error correction coding (contd.) Hamming Code - Example (contd.) Algebraic coding - Theory Linear Block codes: SEC-DED code
    3/13 Low level fault tolerance: Error correction coding (contd.) Algebraic coding - Theory Linear Block codes: SEC-DED code and theorems Odd weight column code
  • Homework set 3 Solution in PDF
    3/15 to 3/23 Spring Break
  • Homework set 4 in PDF Homework 4 due April 3 (Thursday) Low level fault tolerance: Error correction coding (contd.) Algebraic coding - Theory Odd weight column code Hardware issues single error correcting - single byte error detecting codes Cyclic codes - basics - (Time permitting)
    3/27 Project progress report due today March 27 Thursday Low level fault tolerance: Error correction coding (contd.) Cyclic codes - basics and an example
  • Lecture set 8 in .ppt
  • Lecture set 8 in pdf (six slides per page) Low level fault tolerance: Watchdog and Re-execution Introduction Watchdog based methods Timers, processors path signature and branch address hashing (please read [mahm:88] [rotenberg:99] [rashid:00] [subra:10] and [kala:13] papers in the reading list)
    4/1 Low level fault tolerance: Watchdog and Re-execution (contd.) Re-execution based methods instruction re-execution, program re-executions and variations thereof Case studies
  • Case Study 1 - CRAY (file in pdf)
  • Case Study 2 - ARSMT (file in pdf)
  • Case Study 3 - Multiscalar (file in pdf)
  • Case Study 4 - DSN 2010 Presentation - Chip Multiprocessors
    4/3 Homework 4 due Today Project review and presentation issues Low level fault tolerance: Watchdog and Re-execution (contd.) Case studies 4 (contd.)
  • Lecture set 9 (set 1) in .ppt
  • Lecture set 9 (set 1) in pdf (six slides per page) High level fault tolerance: Checkpointing and recovery basic concept fault model and coverage checkpointing in uniprocessor systems We lost a good deal of time today due to a fire in the building
  • Lecture set 9 (set 2) in .ppt
  • Lecture set 9 (set 2) in pdf (six slides per page) Case for checkpointing
  • Homework set 4 Solution in PDF
  • Homework set 5 in PDF Homework 5 due April 10 (Thursday)
    4/8 High level fault tolerance: Checkpointing and recovery (contd.) checkpointing in uniprocessor systems New slide set: Case for checkpointing Definitions Issues in checkpointing - kernel, user, application Optimal checkpointing (contd.) Reducing overhead Checkpointing in distributed systems system model consistent state, recovery line, domino effect, livelocks
    4/10 Homework 5 due today
  • Lecture set 9 (set 3) in .ppt
  • Lecture set 9 (set 3) in pdf (six slides per page) Will cover only Coordinated checkpointing Message Logging optimistic and pessimistic logging Checkpointing in real-time systems Other uses of checkpointing
  • Lecture set 9 (set 4) in .ppt
  • Lecture set 9 (set 4) in pdf (six slides per page) Will cover Communication induced and log based recovery Forward error recovery
  • Homework set 5 Solution in PDF
    4/15 EXAM Tuesday Evening 7:15 PM Room 1153 Mechanical Engineering (Lecture room) Syllabus for the Exam (PDF file) Old exam - I am placing the exam from the last offering (Spring 2010-11) and its solution here. Please keep in mind that some of the material covered this year is different from the previous offering of the course.
  • Exam from Spring 2010-2011 in PDF
  • Exam Solution from Spring 2010-2011 in PDF
    4/17
  • Lecture set 10 in .ppt
  • Lecture set 10 in pdf (six slides per page) Software fault-tolerance Causes of Errors, Techniques to reduce errors, Acceptance Tests Single Version Fault Tolerance Wrapper Rejuvenation Data Diversity SIHFT RESO N-version Fault Tolerance Consistent comparison problem confidence signals Independent vs. correlated failures achieving version independence Recovery Block approach
  • Exam in PDF
  • Exam Solution in PDF Exam discussion N-version Fault Tolerance (contd.) Consistent comparison problem confidence signals Independent vs. correlated failures achieving version independence Recovery Block approach Algorithm Based Fault-Tolerance (ABFT)
  • Lecture set 11 in .ppt
  • Lecture set 11 in pdf (six slides per page) Reconfiguration/Network fault tolerance Introduction and models n-cube architecture
    4/24 Project Discussion Reconfiguration/Network fault tolerance (contd.) n-cube architecture de Bruijn networks
  • Reliability of n-cube in .ppt (Koren and Krishna)
  • Reliability of n-cube (text by Koren and Krishna) in PDF
  • Byzantine agreement problem - a .ppt file
  • Byzantine agreement problem - a pdf file Byzantine agreement problem - Sensor network context
    4/28 (Monday) Project report due today, Monday April 28
    4/29 Project Presentations
    Project # 9 (40 Minutes) Yaman Sangar, Amitesh Narayan, and Snehal Mhatre
    Project: Probabilistic modeling of performance parameters of Carbon Nanotube Transistors and a comparison with their CMOS counterparts
    Summary: CMOS technology has come to a standstill and not much scope remains for further advancements. With a motivation to find alternatives, we explore the space of CNTFETs. We start with a brief explanation of the structure and working of a CNTFET and compare the delay and power consumption of basic gates in both technologies. We find that CNTFETs have better delay and power characteristics, which gives us strong motivation to look further. We then look at some of the issues with CNTFETs and, based on probabilistic modelling of these faults, we have tried to infer how these effects can be minimized.
    Project # 2 (30 Minutes) Addison Floyd and Dan Fisher
    Project: Survey of Detection, Diagnosis and Fault Tolerance Methods in FPGAs
    Summary: FPGAs are important in today's computing community. They are highly reconfigurable, and thus highly complex. FPGAs must work reliably, and so methods of detection, diagnosis, and fault tolerance have been developed to make sure that FPGAs work when they are needed, regardless of present circumstances.
    5/1 Project presentations
    Project # 7 (40 Minutes) Rakesh Roshan Amalraj, Rasika Joshi, and Gayatri Vishwanathan
    Project: A comparison of Deterministic and Stochastic Logic for Image Processing Applications
    Summary: Stochastic computing has gained a lot of attention mainly because of its fault tolerance capability, which is a key requirement for deep sub-micron technology. It processes information using digitized probabilities for exploiting randomized algorithms. We have made a comparison of deterministic and stochastic implementations of a commonly used image processing application, gamma correction, in terms of area, delay, power and fault tolerance. Performance verification has been done using 4-bit and 8-bit gray scale images, with PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index) as image output metrics. Study of the fault-injected outputs for the gamma correction application shows that stochastic computing has a higher fault tolerance capability than deterministic computing.
    Project # 10 (30 Minutes) Sriharsha Yerramalla and Karan Maini
    Project: Tool for customizing Fault Tolerance in a System
    Summary: There are numerous real-time and operation-critical systems in which the failure of the system is unacceptable at any stage of processing. Fault detection and even correction of internal faults during normal operation is of prime concern in such applications. Fault Tolerance (F-T) has been taken into account for many years during the design process of these applications, but it has not leveraged any of the recent advances in CAD tools that automate the design process. Therefore, inserting fault tolerant structures into a circuit has been considered a challenge. In our work, we propose a new tool for the automatic insertion of fault-tolerant structures in an HDL synthesizable description of the design. With this tool, Fault Tolerance could be included into any instance of a module with little overhead in area and extra cost. An automatic fault tolerant design is produced based on user specifications, provided on the GUI, and this design in Verilog HDL is simulated and synthesized using commercial tools like ModelSim, Quartus and Design Vision. The GUI has been implemented using Java Swing technology and interacts frequently with a Perl script running as a separate process. Numerous implementation techniques have been considered to demonstrate the capabilities of this tool. A comparative analysis of these techniques has been carried out in terms of area and timing delay.
    5/6 Project presentations
    Project # 5 (40 Minutes) Sumanth Suraneni, Sharath Prasad, and Harsha Sutanto
    Project: Find an efficient method for check-pointing on CPU-GPUs
    Summary: Our project is on implementing an efficient method of check-pointing on CPU-GPU systems. This project deals with checkpointing the kernel state while executing a workload on the AMD Southern Islands (SI) GPU. When a fault occurs, we need to roll back to the nearest stored checkpoint and restart the program execution from that location. We have checkpointed randomly during the execution flow and restarted from that location. In order to verify our design, we have run the AMDAPP benchmarks. We also reduced the memory overhead that has to be stored during a checkpoint.
    Project # 3 (30 Minutes) Afrin Shafiuddin and Swetha Srinivasan
    Project: Fault tolerance in (real time systems) scheduling
    Summary: Fault tolerance in uniprocessor systems is usually handled by using time redundancy in the schedule so that any task instance can be re-executed in the presence of faults during the execution. In this project, we make a comparison between the Fault tolerant Earliest Deadline First (EDF) scheduling policy and a proposed algorithm based on Round Robin and Shortest Remaining Time First execution for periodic real-time tasks. These schemes are used to tolerate transient faults during the execution of tasks.
    5/8 Project Presentations
    Project # 1 (40 Minutes) Sakshi Gupta, Chandru Loganathan, and Vignesh Chandrasekaran
    Project: Hardware implementation of Self-checking circuits
    Summary: The project aims at exploring and implementing different self-checking circuits on an FPGA with the intention of reconfiguring the hardware to postpone a system halt. In a system with hardware redundant function blocks, the reconfiguration logic tries to detect the faulty block, isolate it, and ensure that the correct output is still produced with the non-faulty function blocks. Implemented self-checking circuits include a Residue checker and a Less than or equal to checker (LTOETC), which was extended to implement the Range checker. Based on the synthesis report, the hardware utilization was analyzed for all the circuits, and for the residue checker the hardware complexity was computed as a function of the modulus.
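The residue checker named in the Project # 1 summary can be illustrated with a minimal software sketch. This is not the project's FPGA implementation; it only shows the general residue-check idea for an adder, assuming a check modulus of 3. The names `checked_add` and `faulty_adder` are hypothetical and introduced here for illustration.

```python
def checked_add(a, b, modulus=3, adder=lambda x, y: x + y):
    """Add a and b with `adder`, then verify the result with a residue check.

    The check relies on (a + b) mod m == ((a mod m) + (b mod m)) mod m.
    Returns (result, error_detected).
    """
    result = adder(a, b)
    # Residue predicted independently from the operands (the "checker" path).
    expected_residue = ((a % modulus) + (b % modulus)) % modulus
    error_detected = (result % modulus) != expected_residue
    return result, error_detected


def faulty_adder(x, y):
    """Hypothetical faulty adder: bit 2 of the sum is flipped."""
    return (x + y) ^ 0b100


if __name__ == "__main__":
    print(checked_add(25, 17))                       # fault-free adder
    print(checked_add(25, 17, adder=faulty_adder))   # single-bit fault injected
```

A single flipped bit changes the sum by plus or minus a power of two, and no power of two is divisible by 3, so a modulus of 3 detects every single-bit error in the result. Larger moduli change the checker cost, which is the kind of trade-off the summary refers to when it relates hardware complexity to the modulus.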
  • Lecture set 12 in .ppt
  • Lecture set 12 in pdf (six slides per page) Course conclusion - Topics we covered - Topics we did not cover and why - comments on future directions Remaining presentations on May 14 in Room 4610 Engineering Hall, 5:05 PM
    5/9 (Friday) Review report due today May 9 Friday by midnight (11:59 PM)
    5/14 Wednesday: Final Exam, 5:05 PM (no exam; instead there may be project presentations) Location: Room 4610 Engineering Hall
    Project # 8 (30 Minutes) Younggyun Cho and Hrushikesh Chavan
    Project: Structural Fault Tolerance for SOC Circuits
    Summary: We implemented a fully functional 5 stage pipeline. We then replaced the two critical stages of the processor, i.e. ID/EX and EX/MEM, with BISER and Razor flip flops. A BISER FF is able to detect faults in sequential circuits; however, to detect faults arising in combinational circuits we need to provide additional hardware redundancy. This is where a Razor FF comes into the picture. We analyze the area and power ratings of both designs. Also, if the pipeline is stalled on the occurrence of an error, we calculate the additional CPI arising because of this.
    Project # 4 (40 Minutes) Jing Tu, Tongxin Zheng, and Yuewen Lei
    Project: Implement various FT techniques on a simplescalar processor in Verilog and compare their performance
    Summary: In this project, we implemented three redundancy methods on a five-stage pipelined CPU in Verilog to realize fault tolerance: information redundancy, hardware redundancy and time redundancy. We ran branch intensive, arithmetic operation intensive and load store intensive benchmarks on each redundancy implementation to obtain their failure rates under different fault occurrence circumstances. We also synthesized our redundancy designs to find the area overhead and performance degradation. By comparing failure rate, area and performance overhead, we identified the advantages and disadvantages of each redundancy method when it comes to practical use.
    Project # 6 (30 Minutes) Han Lin and Jiun-Yi Lin
    Project: On-Chip Reliability Monitor for Measuring Frequency Degradation
    Summary: So far, we have finished the design of the majority voting circuit, the beat frequency detector and the counter module. We have built the ring oscillator and the phase comparator. We are using three versions of the technology file (130nm, 90nm and 45nm) in order to observe the NBTI effect on the circuit and to validate the ability of the silicon odometer to detect the NBTI effect. In this way, we want to make sure the silicon odometer is sensitive to NBTI.
    Project # 11 (30 Minutes) Nekhil Baid and Tejashree Meka
    Project: Fault Tolerant techniques for Yield Enhancement of Memories
    Summary: Embedded memories occupy a large portion of the chip area, and yield enhancement of memories has become a growing concern of chip manufacturers, resulting in a need for fault tolerant memories. A survey of the various fault tolerant techniques which have been implemented so far has been performed. The algorithms are compared based on their repair rate and hardware overhead. Also, a simulator has been implemented which evaluates the repair rate for a relatively new address scrambling technique for a specific memory size, number of defects, number of spare rows and spare columns.