BUILDING RESILIENT DISTRIBUTED SYSTEMS: FAULT-TOLERANT DESIGN PATTERNS FOR STATEFUL WORKFLOWS

Preetham Vemasani; Suraj Modi

Authors

Preetham Vemasani Uber Technologies Inc, USA Author
Suraj Modi Uber Technologies Inc, USA Author

Keywords:

Resilient Design Patterns, Fault-Oblivious Systems, Distributed Computing, Stateful Workflows, Self-Healing Mechanisms

Abstract

Resilient design patterns play a crucial role in developing fault-oblivious stateful workflow systems in distributed computing. This article explores advanced techniques and strategies for building resilient distributed systems that can gracefully handle failures and maintain operational continuity. It delves into fault tolerance strategies, such as data redundancy, checkpointing, and transactional consistency, to ensure system reliability and data integrity. The article discusses the benefits of microservices architecture in achieving fault isolation and minimizing the impact of failures. It highlights the importance of self-healing mechanisms, including automated fault detection and correction, to ensure continuous operation. Scalability and load balancing techniques, such as dynamic resource adjustment and workload distribution, are explored to accommodate fluctuating demands and optimize system performance. The article also examines error handling and recovery mechanisms, including automated rollbacks and distributed consensus protocols, to maintain data consistency and coordinate recovery actions across nodes. Additionally, it emphasizes the significance of proactive system health monitoring and rapid fault identification and resolution in minimizing downtime and ensuring a smooth user experience. The article concludes by discussing emerging trends, open research problems, and providing recommendations for building resilient distributed systems that can withstand the challenges of modern computing environments.

References

M. Armbrust et al., "A view of cloud computing," Commun. ACM, vol. 53, no. 4, pp. 50-58, Apr. 2010, doi: 10.1145/1721654.1721672.

R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility," Future Gener. Comput. Syst., vol. 25, no. 6, pp. 599-616, Jun. 2009, doi: 10.1016/j.future.2008.12.001.

M. Zaharia et al., "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proc. 9th USENIX Conf. Networked Systems Design and Implementation, 2012, pp. 15-28.

S. Meesala, S. K. Chinthaparthi, and P. S. V. S. Sridhar, "A survey on fault tolerance in workflow management systems," in Proc. 2nd Int. Conf. Inventive Systems and Control (ICISC), 2018, pp. 52-57, doi: 10.1109/ICISC.2018.8399050.

E. Deelman et al., "Pegasus: A framework for mapping complex scientific workflows onto distributed systems," Sci. Program., vol. 13, no. 3, pp. 219-237, Jul. 2005, doi: 10.1155/2005/128026.

M. Bux and U. Leser, "Dynamiccloudsim: Simulating heterogeneity in computational clouds," Future Gener. Comput. Syst., vol. 46, pp. 85-99, May 2015, doi: 10.1016/j.future.2014.09.007.

G. Kandaswamy, A. Mandal, and D. A. Reed, "Fault tolerance and recovery of scientific workflows on computational grids," in Proc. 8th IEEE Int. Symp. Cluster Computing and the Grid (CCGRID '08), 2008, pp. 777-782, doi: 10.1109/CCGRID.2008.79.

[8] Y. Chen and A. L. N. Reddy, "Failure recovery in workflow management systems," in Proc. 13th IEEE Int. Symp. Pacific Rim Dependable Computing (PRDC 2007), 2007, pp. 150-157, doi: 10.1109/PRDC.2007.36.

S. Das, A. D. Kshemkalyani, and I. Gupta, "Transparent fault-oblivious computing," in Proc. 2013 43rd Annual IEEE/IFIP Int. Conf. Dependable Systems and Networks (DSN), 2013, pp. 1-12, doi: 10.1109/DSN.2013.6575332.

B. Randell, "System structure for software fault tolerance," IEEE Trans. Softw. Eng., vol. SE-1, no. 2, pp. 220-232, Jun. 1975, doi: 10.1109/TSE.1975.6312842.

D. A. Patterson, G. Gibson, and R. H. Katz, "A case for redundant arrays of inexpensive disks (RAID)," in Proc. 1988 ACM SIGMOD Int. Conf. Management of Data, 1988, pp. 109-116, doi: 10.1145/50202.50214.

J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1992.

S. Newman, Building Microservices: Designing Fine-Grained Systems. Sebastopol, CA, USA: O'Reilly Media, Inc., 2015.

J. O. Kephart and D. M. Chess, "The vision of autonomic computing," Computer, vol. 36, no. 1, pp. 41-50, Jan. 2003, doi: 10.1109/MC.2003.1160055.

D. Ongaro and J. Ousterhout, "In search of an understandable consensus algorithm," in Proc. 2014 USENIX Annual Technical Conf. (USENIX ATC 14), 2014, pp. 305-319.

L. A. Barroso and U. Hölzle, "The datacenter as a computer: An introduction to the design of warehouse-scale machines," Synthesis Lectures on Computer Architecture, vol. 4, no. 1, pp. 1-108, 2009, doi: 10.2200/S00193ED1V01Y200905CAC006.

A. S. Tanenbaum and M. Van Steen, Distributed Systems: Principles and Paradigms, 2nd ed. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 2006.

N. Neves and P. Veríssimo, "Software-implemented fault tolerance: A survey," in Proc. 2001 IEEE Int. Symp. Software Reliability Engineering, 2001, pp. 365-374, doi: 10.1109/ISSRE.2001.989495.

J. C. Laprie, "Dependable computing and fault-tolerance: Concepts and terminology," in Proc. 15th IEEE Int. Symp. Fault-Tolerant Computing (FTCS-15), 1985, pp. 2-11, doi: 10.1109/FTCSH.1985.6498838.

P. A. Lee and T. Anderson, Fault Tolerance: Principles and Practice, 2nd ed. Vienna, Austria: Springer-Verlag, 1990.

W. W. Chu, C.-M. Sit, and K. S. Leung, "Task response time for real-time distributed systems with resource contentions," IEEE Trans. Softw. Eng., vol. 17, no. 10, pp. 1076-1092, Oct. 1991, doi: 10.1109/32.99189.

S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proc. 19th ACM Symp. Operating Systems Principles (SOSP '03), 2003, pp. 29-43, doi: 10.1145/945445.945450.

E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, "A survey of rollback-recovery protocols in message-passing systems," ACM Comput. Surv., vol. 34, no. 3, pp. 375-408, Sep. 2002, doi: 10.1145/568522.568525.

J. S. Plank, "An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance," University of Tennessee, Knoxville, TN, USA, Tech. Rep. UT-CS-97-372, 1997.

H. Weatherspoon and J. D. Kubiatowicz, "Erasure coding vs. replication: A quantitative comparison," in Peer-to-Peer Systems, P. Druschel, F. Kaashoek, and A. Rowstron, Eds. Berlin, Heidelberg: Springer, 2002, pp. 328-337.

K. M. Greenan, E. L. Miller, and J. J. Wylie, "Reliability of flat XOR-based erasure codes on heterogeneous devices," in Proc. 2008 IEEE Int. Conf. Dependable Systems and Networks With FTCS and DCC (DSN), 2008, pp. 147-156, doi: 10.1109/DSN.2008.4630084.

S. Muralidhar et al., "f4: Facebook's warm BLOB storage system," in Proc. 11th USENIX Conf. Operating Systems Design and Implementation (OSDI'14), 2014, pp. 383-398.

G. Weikum and G. Vossen, Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2001.

T. Haerder and A. Reuter, "Principles of transaction-oriented database recovery," ACM Comput. Surv., vol. 15, no. 4, pp. 287-317, Dec. 1983, doi: 10.1145/289.291.

M. K. Aguilera and S. Toueg, "A simple bivalency proof that t-resilient consensus requires t+1 rounds," Inf. Process. Lett., vol. 71, no. 3-4, pp. 155-158, Aug. 1999, doi: 10.1016/S0020-0190(99)00100-3.

D. Skeen and M. Stonebraker, "A formal model of crash recovery in a distributed system," IEEE Trans. Softw. Eng., vol. SE-9, no. 3, pp. 219-228, May 1983, doi: 10.1109/TSE.1983.236608.

M. Shanker, M. Misra, and A. K. Sarje, "Distributed real time database systems: Background and literature review," Distrib. Parallel Databases, vol. 23, no. 2, pp. 127-149, Apr. 2008, doi: 10.1007/s10619-007-7022-z.

R. Sumathi and S. Esakkirajan, Fundamentals of Relational Database Management Systems. Berlin, Heidelberg: Springer-Verlag, 2007.

J. N. Gray, "Notes on data base operating systems," in Operating Systems: An Advanced Course, R. Bayer, R. M. Graham, and G. Seegmüller, Eds. Berlin, Heidelberg: Springer, 1978, pp. 393-481.

P. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems. Reading, MA, USA: Addison-Wesley, 1987.

J. Lewis and M. Fowler, "Microservices," martinfowler.com, 25 Mar. 2014. [Online]. Available: https://martinfowler.com/articles/microservices.html. [Accessed: 15 May 2023].

S. Newman, Building Microservices: Designing Fine-Grained Systems. Sebastopol, CA, USA: O'Reilly Media, Inc., 2015.

C. Richardson, Microservices Patterns: With Examples in Java. Shelter Island, NY, USA: Manning Publications, 2018.

N. Dragoni et al., "Microservices: Yesterday, today, and tomorrow," in Present and Ulterior Software Engineering, M. Mazzara and B. Meyer, Eds. Cham: Springer International Publishing, 2017, pp. 195–216.

A. Balalaie, A. Heydarnoori, and P. Jamshidi, "Microservices architecture enables DevOps: Migration to a cloud-native architecture," IEEE Softw., vol. 33, no. 3, pp. 42–52, May 2016, doi: 10.1109/MS.2016.64.

O. Babaoglu et al., "Self-star properties in complex information systems," in Conceptual Structures, Reasoning and Learning in a Networked World, S. Akoka et al., Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 50–74.

P. Mayer et al., "The autonomic cloud: A vision of voluntary, peer-2-peer cloud computing," in Proc. 2013 IEEE 7th Int. Conf. Self-Adaptation and Self-Organizing Systems Workshops, 2013, pp. 89–94, doi: 10.1109/SASOW.2013.16.

M. Salehie and L. Tahvildari, "Self-adaptive software: Landscape and research challenges," ACM Trans. Auton. Adapt. Syst., vol. 4, no. 2, pp. 14:1–14:42, May 2009, doi: 10.1145/1516533.1516538.

J. O. Kephart and D. M. Chess, "The vision of autonomic computing," Computer, vol. 36, no. 1, pp. 41–50, Jan. 2003, doi: 10.1109/MC.2003.1160055.

B. Cheng et al., "Self-adaptive systems: A research roadmap," in Software Engineering for Self-Adaptive Systems II, R. de Lemos et al., Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 1–32.

M. Armbrust et al., "A view of cloud computing," Commun. ACM, vol. 53, no. 4, pp. 50–58, Apr. 2010, doi: 10.1145/1721654.1721672.

X. Pu et al., "Understanding performance interference of I/O workload in virtualized cloud environments," in Proc. 2010 IEEE 3rd Int. Conf. Cloud Computing, 2010, pp. 51–58, doi: 10.1109/CLOUD.2010.65.

N. R. Herbst, S. Kounev, and R. Reussner, "Elasticity in cloud computing: What it is, and what it is not," in Proc. 10th Int. Conf. Autonomic Computing (ICAC 13), 2013, pp. 23–27.

V. Cardellini, M. Colajanni, and P. S. Yu, "Dynamic load balancing on web-server systems," IEEE Internet Comput., vol. 3, no. 3, pp. 28–39, May 1999, doi: 10.1109/4236.769420.

A. M. Alakeel, "A guide to dynamic load balancing in distributed computer systems," Int. J. Comput. Sci. Inf. Secur., vol. 10, no. 6, pp. 153–160, 2010.

M. Randles, D. Lamb, and A. Taleb-Bendiab, "A comparative study into distributed load balancing algorithms for cloud computing," in Proc. 2010 IEEE 24th Int. Conf. Advanced Information Networking and Applications Workshops, 2010, pp. 551–556, doi: 10.1109/WAINA.2010.85.

E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, "A survey of rollback-recovery protocols in message-passing systems," ACM Comput. Surv., vol. 34, no. 3, pp. 375–408, Sep. 2002, doi: 10.1145/568522.568525.

B. Randell, "System structure for software fault tolerance," IEEE Trans. Softw. Eng., vol. SE-1, no. 2, pp. 220–232, Jun. 1975, doi: 10.1109/TSE.1975.6312842.

J. Xu, B. Randell, A. Romanovsky, C. M. F. Rubira, R. J. Stroud, and Z. Wu, "Fault tolerance in concurrent object-oriented software through coordinated error recovery," in Proc. 25th Int. Symp. Fault-Tolerant Computing, 1995, pp. 499–508, doi: 10.1109/FTCS.1995.466978.

A. F. Garcia, C. M. F. Rubira, A. Romanovsky, and J. Xu, "A comparative study of exception handling mechanisms for building dependable object-oriented software," J. Syst. Softw., vol. 59, no. 2, pp. 197–222, Nov. 2001, doi: 10.1016/S0164-1212(01)00073-4.

P. A. Buhr and W. Y. R. Mok, "Advanced exception handling mechanisms," IEEE Trans. Softw. Eng., vol. 26, no. 9, pp. 820–836, Sep. 2000, doi: 10.1109/32.877844.

R. Miller and A. Tripathi, "Issues with exception handling in object-oriented systems," in ECOOP'97 — Object-Oriented Programming, M. Aksit and S. Matsuoka, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 1997, pp. 85–103.

F. B. Schneider, "Implementing fault-tolerant services using the state machine approach: A tutorial," ACM Comput. Surv., vol. 22, no. 4, pp. 299–319, Dec. 1990, doi: 10.1145/98163.98167.

D. Ongaro and J. Ousterhout, "In search of an understandable consensus algorithm," in Proc. 2014 USENIX Annu. Tech. Conf. (USENIX ATC 14), 2014, pp. 305–319.

H. Howard, M. Schwarzkopf, A. Madhavapeddy, and J. Crowcroft, "Raft refloated: Do we have consensus?," ACM SIGOPS Oper. Syst. Rev., vol. 49, no. 1, pp. 12

L. Lamport, "The part-time parliament," ACM Trans. Comput. Syst., vol. 16, no. 2, pp. 133–169, May 1998, doi: 10.1145/279227.279229.

M. J. Fischer, N. A. Lynch, and M. S. Paterson, "Impossibility of distributed consensus with one faulty process," J. ACM, vol. 32, no. 2, pp. 374–382, Apr. 1985, doi: 10.1145/3149.214121.

J. C. Corbett et al., "Spanner: Google's globally-distributed database," in Proc. 10th USENIX Conf. Operating Systems Design and Implementation (OSDI'12), 2012, pp. 251–264.

A. Lakshman and P. Malik, "Cassandra: A decentralized structured storage system," ACM SIGOPS Oper. Syst. Rev., vol. 44, no. 2, pp. 35–40, Apr. 2010, doi: 10.1145/1773912.1773922.

D. Haryadi, R. Suchithra, and M. H. M. Noor, "Advanced monitoring systems for large-scale cloud computing: A survey," in Proc. 2020 6th Int. Conf. Advanced Computing and Communication Systems (ICACCS), 2020, pp. 1170–1175, doi: 10.1109/ICACCS48705.2020.9074309.

A. Brito, S. Fetzer, H. Sturzrehm, and P. Felber, "Multithreading-enabled active replication for event stream processing operators," in Proc. 28th Int. Symp. Software Reliability Engineering (ISSRE), 2017, pp. 168–178, doi: 10.1109/ISSRE.2017.17.

M. Cinque, R. Della Corte, and A. Pecchia, "Microservices monitoring with event logs and black box execution tracing," IEEE Trans. Serv. Comput., vol. 14, no. 4, pp. 1097–1111, Aug. 2021, doi: 10.1109/TSC.2019.2940009.

D. Yuan et al., "Be conservative: Enhancing failure diagnosis with proactive logging," in Proc. 10th USENIX Conf. Operating Systems Design and Implementation (OSDI'12), 2012, pp. 293–306.

A. Gehani and D. Tariq, "SPADE: Support for provenance auditing in distributed environments," in Middleware 2012, P. Narasimhan and P. Triantafillou, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 101–120.

J. Wettinger, U. Breitenbücher, and F. Leymann, "Standards-based DevOps automation and integration using TOSCA," in Proc. 2014 IEEE/ACM 7th Int. Conf. Utility and Cloud Computing, 2014, pp. 59–68, doi: 10.1109/UCC.2014.14.

N. Dragoni et al., "Microservices: How to make your application scale," in Proc. 11th Int. Symp. Advances in Artificial Intelligence and Applications (AAIA'16), 2016, pp. 95–104.

I. Baldini et al., "Serverless computing: Current trends and open problems," in Research Advances in Cloud Computing, S. Chaudhary, G. Somani, and R. Buyya, Eds. Singapore: Springer, 2017, pp. 1–20.

A. Basiri et al., "Chaos engineering," IEEE Softw., vol. 33, no. 3, pp. 35–41, May 2016, doi: 10.1109/MS.2016.60.

Z. Wen, R. Yang, P. Garraghan, T. Lin, J. Xu, and M. Rovatsos, "Fog orchestration for internet of things services," IEEE Internet Comput., vol. 21, no. 2, pp. 16–24, Mar. 2017, doi: 10.1109/MIC.2017.36.

P. Bailis, A. Davidson, A. Fekete, A. Ghodsi, J. M. Hellerstein, and I. Stoica, "Highly available transactions: Virtues and limitations," Proc. VLDB Endow., vol. 7, no. 3, pp. 181–192, Nov. 2013, doi: 10.14778/2732232.2732237.

D. B. Terry, V. Prabhakaran, R. Kotla, M. Balakrishnan, M. K. Aguilera, and H. Abu-Libdeh, "Consistency-based service level agreements for cloud storage," in Proc. 24th ACM Symp. Operating Systems Principles (SOSP '13), 2013, pp. 309–324, doi: 10.1145/2517349.2522731.

J. C. Corbett et al., "Spanner: Google's globally-distributed database," in Proc. 10th USENIX Conf. Operating Systems Design and Implementation (OSDI'12), 2012, pp. 251–264.

BUILDING RESILIENT DISTRIBUTED SYSTEMS: FAULT-TOLERANT DESIGN PATTERNS FOR STATEFUL WORKFLOWS

Authors

Keywords:

Abstract

References

Downloads

Published

Issue

Section

How to Cite

cover