10 on Tech Episode 003 – Eric Wright (@discoposse) on Root Cause Analysis


In this show, Eric Wright is with us to talk about root cause analysis. We talked about:

Eric’s initial blog post

Edward’s response/thoughts

Eric’s follow-up post

Show Transcript
James Green: Hello everybody and welcome to 10 On Tech. I’m James Green with ActualTech Media. If you’re not familiar with ActualTech Media, we are a tech marketing company. You can go over to www.actualtechmedia.com to learn more about us. I’m honored to be joined today by Eric Wright, who is from VMTurbo. He’s also very widely known in the community as a whole. We love Eric; I’ve known Eric for a while now and really enjoy when we get to meet up and chat. I have recently been reading some stuff that Eric has been writing, and I thought it was really interesting, so I invited him on so we can talk a little bit more about it. Welcome Eric, do you want to tell us a little bit more about yourself and where we can find you online?
Eric Wright: Yeah, thanks James. It’s always great to chat. You can find me anywhere, DiscoPosse is the easiest way. I’m @DiscoPosse on Twitter. Of course I’m the technology evangelist at VMTurbo. The articles that we talked about were actually on About Virtualization; you can just go to aboutvirtualization.com. You can tell I’m Canadian by my about.
James Green: About.
Eric Wright: Yeah. Definitely DiscoPosse is the place to find me. I have my personal at discoposse.com. Yeah, I try and touch on a lot of things. It was funny that this article seemed to hit home with a few folks. Actually I wrote a follow-on article as a result of it.
James Green: Yeah, I read them both and they were great. Let me set the stage a little bit. The topic of the article was more or less the way that architectures are changing changes the way that we’re going to do root cause analysis. You laid out the idea that as we move towards a more micro-services based service oriented application architecture, the added resiliency from doing that allows us to tolerate failures differently and changes the way that we approach root cause analysis. Do you want to expand on that a little bit?
Eric Wright: Sure. It’s funny because we’ve got this idea that we wanted to create real strength and resiliency at the lowest possible layers. We have multiple network cards, we have multiple plugs in the back of every server, multiple fans, so everything can just fall apart in rather large chunks of our infrastructure. That was where we do all of our resiliency typically. What’s happening is then we’ve got this real neat top-down approach that’s happening, and that’s the real application oriented folks that kind of grew up in the web, and this idea that, you and I we grew up listening to the first people, “You go to H-T-T-P colon forward-” That was how we heard about people learning about the web. These folks were raised and there was no non-internet world.
  They’ve got this idea that you just go to Amazon, you buy database as a service and it’s immediately resilient. It’s like, “Wow, okay.” All of the stuff that we had been creating as this resilient underlying infrastructure, and that we’d been learning about root cause analysis around, becomes a little less of a focus because we’ve moved further up the stack in the way that we create environments. It’s not that it doesn’t happen that there is this stuff underneath; serverless is on servers. We’ve all gotten over that thing, can we please stop saying that? Here we’ve got this idea that we’ve got resiliency at the application tier because in the Amazon methodology you design specifically for failure.
  What happens as a result of that is you widely accept that failure will occur, and then you focus more on the application than you do on all of the underpinnings that are keeping your application alive. Then it was funny that I got this response right away. A good friend, Edward Haletky, he’s @Texiwill on Twitter. Edward, he’s the CEO at the Virtualization Practice. He’s an analyst and long time virtualization expert, in specifics around security and root cause analysis. He does some really cool stuff. At any rate, Edward says, “Hey, I pretty heavily disagree with what you’re saying here, that there are tons of reasons why root cause analysis is of absolute importance.” All of a sudden I was trapped, I’m like, “Oh no, he’s right, and I’m right. Dang it.” Got to go back to the well.
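The "design for failure" idea Eric describes here can be illustrated with a small sketch. It is not taken from either article; the function and replica names are hypothetical, and it only shows the general pattern of an application assuming any single backend may fail and moving on to the next one.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when every replica in the list has failed."""


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each replica in order and return the first successful result.

    Failures are collected rather than investigated on the spot, which is
    the application-tier mindset: expect failure, keep serving.
    """
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # any failure just means "try the next one"
            errors.append(exc)
    raise AllReplicasFailed(errors)


# Hypothetical replicas: one that is down and one that is healthy.
def failing_replica() -> str:
    raise ConnectionError("replica A is unreachable")


def healthy_replica() -> str:
    return "ok from replica B"


result = call_with_failover([failing_replica, healthy_replica])
```

The collected errors are still there for later analysis, which fits the point made later in the conversation: the root cause work shifts rather than disappears.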
James Green: That was exactly my feeling, I read your first article and then I read his article and I was like, “Yeah, they’re both right.”
Eric Wright: Now I thought, “Okay, I maybe need to frame it a bit better.” We had a great, literally we spent an hour just chatting. You get that fun of sharing old stories, like, “I remember this time where there was this weird memory issue that was hot like … ” It’s just shielded memory, literally stuff that you don’t even have to think about a lot of times but coming out of infrastructure we still have to think about that. I didn’t want it to detract from what root cause analysis is in its foundation for infrastructure.
  But what we’re finding is that a lot of technologists are buying into infrastructure as a service, which in and of itself has resiliency built underneath, usually. Cloud services obviously you’re not even allowed to know what it’s running on because it doesn’t matter, you simply access an API or a web URL, it’s a service that you’re purchasing. That’s this dichotomy we had, that I’m like, “Yes, you do need to know root cause analysis because most of virtualization today is not cloud based. It is not super resilient.”
  We still need to know, did the server fail? Did someone unplug it? Did something weird happen when it lost access to the network? Water coming from the air conditioner poured onto a server rack, that stuff happens, and being able to understand how to troubleshoot and do root cause analysis is still critically important. But because of the way that we’re designing applications going forward, we need to understand how it consumes those services and move the resiliency into stuff where it’s a little bit more grey and we don’t need to necessarily know or be concerned about it.
  It’s not that we don’t care, but we don’t have to be as concerned. We do care if it goes up or down, because AWS goes down, large hunks of it go down on occasion, so we’ve still got to be ready. Again to Edward’s point, the truth is we still need to know because some performance issue happens you need to be able to track it back to something. But at VMTurbo this is one of the things we always bump into, is that our founders, the previous company they did was all wrapped around doing root cause analysis in software, and it couldn’t really be done because there’s a point where you need humans to deal with it.
  The reaction to it is one of two ways: you either get really good at becoming the human that deals with it, or you get really good at building infrastructure that doesn’t have to fall down as often. Then you can focus on what does the application do? Is it answering business requirements? It’s less knob turning and stuff like that.
James Green: Yeah, it’s obviously in an ideal world the situation you would get to, is where you have built some sort of software tool so that you don’t have to be the human that’s really good at doing that, but there’s an infinite number of variables to the degree that we’re never going to be able to build that tool. In Edward’s response article he gave a bunch of examples of things that they still needed to use root cause analysis to figure out. One of the ones that he mentioned that I liked was the example of encryption.
  There was a machine that was being tested with encryption locally and it was I think he said 2% CPU overhead to do this encryption. When they moved it to the cloud suddenly it’s 20% CPU overhead to do this encryption, and they’re trying to figure out why is this. In the end it turned out that it was a different available instruction set on the CPU locally, as opposed to where it was running in the cloud. Is that something that a software tool could be programmed to figure out? Maybe, but there’s literally almost an infinite number of possibilities of things like that, and at some point like you’re saying, a human is really the only way we can get to the bottom of it.
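As a rough illustration of the kind of check a human might script once they suspect an instruction set difference, here is a minimal sketch. The transcript does not say which instructions were involved; the `aes` flag (hardware AES support as reported in `/proc/cpuinfo` on Linux x86) and the sample text are assumptions made for the example.

```python
from typing import Set


def parse_cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the CPU feature flags from /proc/cpuinfo-style text.

    Looks for the first "flags" line (x86) or "Features" line (ARM).
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("flags", "Features"):
            return set(value.split())
    return set()


def missing_flags(local: Set[str], remote: Set[str]) -> Set[str]:
    """Flags available locally that the remote machine does not expose."""
    return local - remote


# Hypothetical samples; on a real Linux host you would read /proc/cpuinfo.
LOCAL_CPUINFO = "processor\t: 0\nflags\t\t: fpu sse2 aes avx\n"
CLOUD_CPUINFO = "processor\t: 0\nflags\t\t: fpu sse2\n"

missing = missing_flags(parse_cpu_flags(LOCAL_CPUINFO),
                        parse_cpu_flags(CLOUD_CPUINFO))
```

A script like this only helps after someone has already guessed where to look, which is James's point about the near-infinite number of possible causes.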
Eric Wright: I think the careful thing that I should have framed it around was that root cause analysis as a troubleshooting methodology for anything is super important. Absolutely every technologist, every person should be able to do root cause analysis. However, root cause analysis in the context of just failure of services is where we’re seeing more and more focus on service delivery versus server delivery. It’s neat to see that evolution and we’ve got a long way to go. Oh boy do we. But as we go, don’t put away those skills yet; troubleshooting is still of absolute importance. However, make sure you look further up the stack because that’s where things are heading.
James Green: I think you titled the article, micro-services something something something and the shift away from root cause analysis. I was wondering after this whole exchange, maybe what you meant is the shift away from root cause analysis as we know it now. Is that a more reasonable way to explain it?
Eric Wright: Definitely.
James Green: Because it’s totally still relevant, Edward’s point was totally valid, it’s just that in your point, which is also totally valid, it’s not going to be like it has been and we’re going to need to do root cause analysis in a different area of the infrastructure, in a little bit different way.
Eric Wright: Yeah, if you spent 20% of your time doing troubleshooting on physical hardware in the past, you’re going to find about 15% freed up hopefully in the future because you’re [crosstalk 00:10:36]
James Green: Right, that would be the idea.
Eric Wright: Don’t spend the time figuring what’s going to backfill that 15%, or whatever it is. But yeah, it’s the shift more than the cease of that activity.
James Green: Right, awesome. Thank you for coming on Eric, appreciate the chat, the clarification. It’s always insightful and great to talk with you. Appreciate it.
Eric Wright: Thank you very much.
James Green

James is a Partner at ActualTech Media and writes, speaks, and consults on Enterprise IT. He has worked in the IT industry as an administrator, architect, and consultant, and has also published numerous articles, whitepapers, and books. James is a 2014 - 2016 vExpert and VCAP-DCD/DCA. Follow James on Twitter
