During this talk, Gremlin's Jacob Plicque (Chaos Engineer & Resilience Coach, former Senior SRE @ Fanatics) will answer the 5 W's (Who, What, When, Where & Why) of Chaos Engineering.
* The systematic way to begin Chaos Engineering
* The value of running chaos experiments to build more reliable systems and confidence in your remediation processes
* How other companies are using Chaos Engineering, and the positive results they've seen in creating reliable distributed systems with it
Chaos Engineering is NOT:
* Applying failure modes randomly
* Applying failures to your entire infrastructure straight away
* Applying failures to systems without communicating first
* Creating a one-off fix to be run once and then abandoned
Chaos Engineering IS:
* Applying failures carefully, and with an explicit hypothesis
* Starting small and growing the blast radius
* Communicating plans clearly with all stakeholders
* Designing a well-defined practice that requires constant attention
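The principles above can be sketched in code. What follows is a minimal illustration, not Gremlin's API or the speaker's method: the `Experiment` class, the `run_experiment` function, and the `inject_failure`, `halt_failure`, `measure_error_rate`, and `notify` callables are all hypothetical names chosen for this example. It assumes an experiment states its hypothesis up front, announces itself to stakeholders, starts with a small blast radius, and widens it only while the hypothesis holds.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Experiment:
    """A hypothetical description of one chaos experiment."""
    hypothesis: str                      # what we expect to stay true
    blast_radius_steps: Sequence[float]  # fraction of hosts targeted per stage
    max_error_rate: float                # abort threshold for the hypothesis


def run_experiment(
    exp: Experiment,
    hosts: Sequence[str],
    inject_failure: Callable[[List[str]], None],
    halt_failure: Callable[[List[str]], None],
    measure_error_rate: Callable[[], float],
    notify: Callable[[str], None],
) -> bool:
    """Run each stage in order; return True only if the hypothesis held throughout."""
    # Communicate the plan before touching anything.
    notify(f"Planned experiment: {exp.hypothesis}")

    for fraction in exp.blast_radius_steps:
        # Start small: always at least one host, never more than the fraction allows.
        count = max(1, int(len(hosts) * fraction))
        targets = list(hosts[:count])

        inject_failure(targets)
        try:
            error_rate = measure_error_rate()
        finally:
            # Always stop the failure, even if measurement raises.
            halt_failure(targets)

        notify(f"{count}/{len(hosts)} hosts affected, error rate {error_rate:.2%}")

        if error_rate > exp.max_error_rate:
            # Do not grow the blast radius once the hypothesis fails.
            notify("Hypothesis not upheld; stopping before widening the blast radius")
            return False

    notify("Hypothesis held at every stage")
    return True


if __name__ == "__main__":
    # Simulated run: the failure hooks do nothing and the error rates are made up.
    demo_hosts = [f"host-{i}" for i in range(20)]
    simulated_rates = iter([0.001, 0.004, 0.03])
    demo = Experiment(
        hypothesis="Losing hosts keeps the error rate under 1%",
        blast_radius_steps=[0.05, 0.25, 0.5],
        max_error_rate=0.01,
    )
    run_experiment(
        demo,
        demo_hosts,
        inject_failure=lambda targets: None,
        halt_failure=lambda targets: None,
        measure_error_rate=lambda: next(simulated_rates),
        notify=print,
    )
```

In a real setup the hooks would call whatever failure-injection and monitoring tooling is in use; passing them in as callables keeps the sketch independent of any particular tool. Rerunning the same experiment definition on a schedule is one way to treat it as an ongoing practice instead of a one-off.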
* How to improve the availability and reliability of Kubernetes clusters using the discipline of Chaos Engineering
* How to use Chaos Engineering to safely inject failure into your applications and nodes in order to detect weaknesses
* Specific Chaos Experiments for you to run on Kubernetes to ensure you've designed a reliable system
Jacob is a Solutions Architect at Gremlin where he works on Chaos Engineering, the facilitation of controlled experiments to identify systemic weaknesses and create more reliable applications. Jacob has worked on Chaos Engineering across a variety of verticals, including finance, e-commerce, airlines, retail, and insurance. Jacob is also the co-host of the Break Things on Purpose podcast, a series dedicated to sharing Chaos Engineering experiences with industry experts. Jacob previously worked at Fanatics as a Senior Site Reliability Engineer, where he was responsible for providing a reliable e-commerce experience to process over 1,100 orders a minute on peak days such as Cyber Monday and Black Friday.
Freewheel / Comcast