ECE 753: Fault-Tolerant Computing
Spring 2010-2011
January - May, 2011

Partial list of projects and a few key references

Some of the more recent resources and topics that may help you decide on your project:

1. Consider looking at the following:
   Nov/Dec issue of IEEE Design and Test: deals with reliability
   IEEE Transactions on Computers - Jan 2011 issue: deals with dependable computing architecture
   IEEE Computer - February 2011 issue: deals with nanoscale architectures
   IEEE Transactions on Computers - Jan 2009 issue: deals with nanowire decoders
   IEEE Transactions on Computers - Dec 2008 issue: deals with parallel applications
   Architecture-level fault tolerance - see papers in IEEE Micro, ASPLOS and SELSE (SELSE papers are not published but can potentially be obtained by contacting the authors)
2. Reducing storage burden via data deduplication (IEEE Computer, Dec 2008, pp 15-17): its impact on fault tolerance, and a comparison with other methods such as data compression or a mix of data deduplication and file compression.
3. Output commit level fault tolerance using Condor in combination with forward recovery (different from forward recovery through checkpointing)
4. Fault tolerance in wired and wireless systems - for example, use of network coding
5. Evolution of homogeneous systems into heterogeneous systems in the presence of faults and reconfiguration capability.
6. Nanotubes
7. On-chip reliability monitors (such as aging sensors) - Microelectronics Reliability (journal) and conferences like ICCD

Some of the topics that I have noticed in the past 5 to 6 years, even though some of these are very old topics:

1. RAID Levels, Architectures and Relative Performance
   Numerous papers, including a recent paper (see ref below) dealing with separable codes
   (Feng, Deng, Bao, and Shen, "New and efficient MDS array codes for RAID Part I: Reed-Solomon-Like Codes for Tolerating Three Disc Failures", IEEE Transactions on Computers, Sept 2005.)
2. Self Checking "state" checker based method
   Refs: ITC paper by Mitra in the conference Proc of 2000 and the paper included in the reading material
   Special Issue of JETTA - August 2005 has two papers on this topic
   An annual conference, "IEEE Online test conf", can be a rich source of papers in this area.
   Book by Lala (see ref list) has many general references
3. Checkpointing, Rollback, Roll-forward - see text 1 and many recent conferences.
   A somewhat recent ref is Ssu, Fuchs, and Jiau, IEEE TC Feb 2003
4. Routing and reconfiguration in systems with faulty nodes/links
   Avresky and Natchev, "Dynamic reconfiguration in computer clusters with irregular topologies in the ...", IEEE TC, May 2005
5. Fault tolerance in cellular networks
   Yang et al., "A fault-tolerant distributed channel allocation scheme for cellular networks", IEEE TC, May 2005
6. Crosstalk tolerant bus encoding schemes
7. Life span computation using multiple voltage and multiple frequency controls
   Weglarz, Saluja and Mak, "Testing of hard faults in simultaneous multithreaded processors," International On-Line Test Symposium, June 2004.
   This is a research project; you can see Eric Weglarz and me about this.
8. Use of recursive redundancy to improve reliability
   IEEE Design and Test, Aug/Sep 2005
9. Fault tolerant methods in modern speculative processors
10. Comparative study of reliability and performance evaluation tools

Additional topics

Case Studies:

1. Tandem/Compaq: Prepare a survey of the fault tolerance techniques of the Compaq NonStop Himalaya Servers. (Expand it to widen the scope and include more recent products of various manufacturers of ICs and systems.)
   http://himalaya.compaq.com/view.asp?IOID=565#2
2. Fault tolerance in automotive systems: Prepare a survey of fault tolerance techniques that are used in automobiles. Include systems like engine management, drive by wire and steer by wire.
   (A System-Safety Process for By-Wire Systems; Delphi Secured Microcontroller Architecture; Motronic Engine Management. http://www.delphi.com/ http://www.delphiautomotive.com/news/ contains all tech publications)
3. Fault-tolerant features of modern processors - compare and contrast. You may look at the websites of Intel, IBM, HP, Sun, etc.

Software Testing:

1. Software Testing: there is a need for a good survey article in the area.
2. Tools and demos of software testing - the book by Lyu, "software reliability engineering", and many conferences and the web should provide a rich resource

Hardware Test (Fault Tolerance Orientation):

1. Determining the diagnostic resolution of a test set
   Requires developing a fault simulator
   Background needed - data structures, testing (553), excellent programming skills
2. Use of on-line testing methods in fault tolerance
   There is an annual workshop that deals with this issue: "IEEE Online test conf"
3. Defect tolerance
   See many papers by Koren

Broader categories:

1. Arriving at agreement in interconnected systems - algorithm implementations and relative performances
2. Bio-computing, alternative technologies (such as high risk technologies)
3. Quantum Computing
4. Provide an alternative classification of software fault-tolerant techniques. Includes a survey of all methods, such as classical methods (N version programming, recovery block) and methods more often used in practice such as checkpointing, shadowing, etc. Survey papers on recent methods are available and classical methods can be found in texts.
5. Clock synchronization
6. Atomic and reliable broadcast
7. Algorithm-based fault tolerance
8. System level diagnosis - distributed algorithms
9. Fault-tolerant transaction processing systems
10. Measures of software reliability
11. Validation and verification techniques
12. Modeling and evaluation tools
13. Fault injection methods
14. Fault tolerance in wireless systems
15. Fault tolerance and reconfigurable memory systems
16. MEM based systems and fault-tolerance requirements
17. Reconfiguration for fault tolerance (use of FPGAs)
18. Evaluation tools such as SHARP and USAN - compare and contrast
19. Evaluation tools such as PARMA

List of projects completed in the past four offerings of this course

1. Survey of rollback-recovery techniques in wired and wireless networks - by Emily Yuet Ching and Veradej Phipatanasuphorn
2. Fault tolerance in wireless systems - by Ying Rao and Chao Wang
3. RAID - by Jeng-Liang Tsai and Yu-Chi Lai
4. A fault model for SETI-style distributed computing - by Matt King and Joe Riley
5. Performance assessment of complex systems by SAN - by Karan Mehra and Parikshit Thakur
6. Error detection methods in arithmetic operations - by Lin Tsung-Chi and Chen Yi-Ting
7. Reducing Cross-coupling effects using bit ordering - by Bradford Beckmann and Anitha Mani
8. Crosstalk aware fault-tolerant techniques - by Hsinwei (Frank) Chou and Shengkai Chiu
9. Fault tolerance in modern operating systems - by Govindraj Chandrasekar and Shyam Sundararaman
10. Characterizing non-determinism in cores of future processors - by Saisanthosh Balakrishnan and Shiliang Hu
11. Fault tolerant techniques for on chip cache memory - by DongHyun Baik and SangHoon Kim
12. Routing in systems with faulty nodes/links - by Wang Wong and Kai Ting
13. Bio-inspired fault tolerant for cellular arrays - by Jie Zhang and Dabin Wu
14. Bit-sliced architecture for fault tolerance - by Erika Gunadi and Arisandi Widjaja
15. Software testing and verifiable system design - by Diwakar Iyer and Arun Ram Kumaran
16. Fault tolerant sensor network algorithms and techniques - by Ranjit John Mathai and Cheryan Jacob
17. Fault Tolerant real-time systems - by Venkatesh Janakiraman and Shyamsundar Sekhar
18. Case Study: IBM S390 system - fault tolerant and availability - by Zhiyu Liu and Xun Zhang
19. On-line testing for fault tolerance - by Ami Mehta and Parikshit Narkhede
20. Evaluating fault tolerant techniques for superscalar processors - by Devang Sachdev and Shamik Valia
21. Fault-Tolerance in E-Commerce Web Servers - by Vishnu Vijayaraghavan and Subramanian Rama
22. Incorporating fault tolerance in reconfigurable architectures - by Karthik Kalyanam and Kaushik Raghunath
23. The fault-tolerant FFT butterfly network - by Jiong Fan and Cheng-Pei Liu
24. Extended life span testing - by Xiangning Yang
25. Linux application fault tolerance - by Scott Finley
26. Encoding for crosstalk tolerance busses - by Neil Hockert
27. Fault Tolerance in Automotive X-by-wire - by Diana Palsetia and Sean Pieper
28. Survey of fault-tolerant techniques in modern micro-processors - by Sukanya Thiagrajan
29. Fault-tolerance in Quantum Computing - by Terry Yao-Lin Chen and Chien-Liang Chen
30. Fault tolerant in Automotive systems case study - by Divya Jhalani and Shikha Dhir
31. Case Study: Non Stop Servers - by Qing Lu and Chunhura Yao
32. An analysis and extension of SMT redundancy for heavily-threaded workloads - by Mitchell Hayenga
33. Performance and reliability analysis of RAID-based memories - by Vishal Mehta and Ashwin Alapati
34. Process redundancy for future-generation CMP fault-tolerance - by Dan Gibson
35. NBTI degradation of digital Circuits - by Shriram Vijayakumar and Warin Sootkaneung
36. A survey of aging issues in digital circuits - by Meng-She and Cheng-Han Sung
37. Roll-forward techniques for fault detection and correction in Condor environment - by Vaishali Karanth and Janaki Jillella
38. A survey of software testing methods - by Tao Feng and Kasturi Bidarkar
39. Performance cost analysis of software-implemented hardware fault tolerance methods in general-purpose GPU computing - by Tony Gregerson and Ameya Abhyankar
40. Can network coding help improve the reliability of TCP? - by Chin-Ya Huang and Aishwarya Nagarajan
41. Evaluation of fault-tolerance in optical multistage interconnection networks - by Hytham Alihassan and Srinivas Kattamuri