10 on Tech Episode 002 – DRS Doctor with Brian Graf

Joining us for the second episode of 10 on Tech is Brian Graf (@vBrianGraf) to discuss the new VMware Fling called DRS Doctor. You can see an example of the tool and how to use it on the official VMware blog about the release of the Fling. Don’t forget that the full show transcript is available below the player if you’d rather read than listen!

Show Transcript
James: Hello everybody, it’s James Green with ActualTech Media, and I am here again with Brian Graf. Brian is a product manager at VMware, and works specifically with the DRS and High Availability features of vSphere, and Brian is a friend that I met a couple years ago at VMworld, and we’ve stayed in touch since then. I have recently been talking about the new Fling that VMware released called DRS Doctor. Brian is familiar with that, and I believe may even have been involved in the creation of that, but we will ask.
  At any rate, he knows a lot about it, and so I wanted to bring Brian on to just talk about what the Fling does, what problems it solves, why it’s good for people, and how you can leverage it in your environment. First of all, Brian, welcome to the show; you want to just introduce yourself real fast?
Brian: Yeah, James, thanks for having me. I am excited to be on the podcast, ActualTech Media is a great company that I like to follow, and you’ve got some good content, so thanks for having me. Like you said, I am the PM – the product manager – at VMware for the resource management portions of vCenter. DRS, HA, DPM, and a number of other things that are in the works. You can find me on the Twittersphere at @vBrianGraf.
James: Cool, that’s one F by the way. One F. Awesome, so like I said there is a new Fling called DRS Doctor, and I’ve been playing with it a little bit. I wrote about it a little bit, and essentially the premise is that it’s a tool that gives you more insight into the decisions that DRS is making in your environment. Traditionally, DRS has been extremely useful, but a little bit opaque from an end-user’s standpoint, especially when trying to troubleshoot issues and that kind of thing.
  Sometimes you wind up in a situation where you’re pulling out logs to then send them off to support who can make some meaning out of them and let you know what was going on. It’s my understanding that the main purpose of DRS Doctor is to allow somebody – an administrator – to run it against their cluster, and understand why, how, and when DRS is making decisions about their machines without having to go through the whole kind of painful process they had to go through before. Is that about the size of it?
Brian: Yeah, that’s a pretty accurate description of what we’re doing. Just to take a quick step back here, VMware Flings traditionally are projects that some of the engineers at VMware tend to do on their own. They tend to go along and say, “You know what, I got this idea, and I want to try and build it.” DRS Doctor is one that we have three engineers based out of India who have spent quite a bit of time developing this, and they came to me once I became the product manager and said, “Hey we want to get this out here, we think it’s going to be very beneficial to the customers. Can you take a look at it and see what you think?”
  The Fling site itself – if you want to take a look at this while you are listening to the podcast – is actually labs.vmware.com/flings/drsdoctor. We sat down with the three engineers, so Adarsh, Sai, and Vikas, (you will see their names and their bios up on the Fling site) and they basically came back and said, “Here is the problem.” Just like you said James, DRS is great, a lot of people like to use it. In those situations though where maybe there is an issue, or people think that there is a perceived issue, it is really hard to troubleshoot. If you are on a Windows vCenter the DRM logs for DRS are in a specific folder that you can grab, but they are still kind of cryptic, and if you are on the VCSA you have to generate essentially the vCenter log bundle, and then extract the DRM dumps out of there.
  They said, “There’s got to be an easier way, and we want to do it with this Fling.” DRS Doctor runs in just another VM. Matt Meyer, the tech marketing engineer with VMware, created a step-by-step walk-through of how to install and configure DRS Doctor on a CentOS 7 (I believe) or 6 VM, and then I’m going to release in the next couple days how to do that on VMware’s Photon OS as well. It requires Python, and a few other requirements that you’ve blogged about as well.
  Essentially what it does is it connects to your vCenter server and it places DRS into a partially automated mode so that it can grab the recommendations that DRS gives, creates the logs and it executes on them. Even though it’s in partially automated mode, because DRS Doctor is running, it acts as if it is in fully automated mode, and it will create log files every five minutes. Every time that DRS runs it will generate one log file, and it has a bunch of information in there that talks about the cluster, what’s going on, what has occurred in those five minutes, and why. We can jump into that if you would like.
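[Editor's note] The cycle Brian describes (collect the pending recommendations, write them to a log, then carry them out) can be sketched roughly as below. This is only an illustration and not DRS Doctor's actual code: the `Recommendation` fields, the `fetch` and `apply` callables, and the log line layout are all assumptions standing in for whatever the tool really does against vCenter.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable, List

# DRS is described as running every five minutes; a real loop would wait this
# long between passes. The constant is illustrative and unused in this demo.
PASS_INTERVAL_SECONDS = 300


@dataclass
class Recommendation:
    """Hypothetical stand-in for one DRS migration recommendation."""
    key: str
    vm: str
    source_host: str
    target_host: str
    reason: str


def run_one_pass(
    fetch: Callable[[], Iterable[Recommendation]],
    apply: Callable[[Recommendation], None],
    now: datetime,
) -> List[str]:
    """Record every pending recommendation, then apply it.

    Logging first and applying second is what lets a partially automated
    cluster behave as if it were fully automated while leaving a trail.
    """
    lines = [f"# DRS pass at {now.isoformat()}"]
    for rec in fetch():
        lines.append(
            f"{rec.vm}: {rec.source_host} -> {rec.target_host} ({rec.reason})"
        )
        apply(rec)
    return lines


if __name__ == "__main__":
    # Fake data in place of a live vCenter connection.
    pending = [Recommendation("1", "web01", "esx-a", "esx-b", "balance CPU load")]
    applied: List[Recommendation] = []
    for line in run_one_pass(lambda: pending, applied.append,
                             datetime.now(timezone.utc)):
        print(line)
```

Passing `fetch` and `apply` in as functions keeps the sketch runnable without a vCenter; in a real tool they would be calls into the vSphere API.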
James: Let me ask you a quick question about it in terms of actually using it. I think there are two different scenarios here: there is one scenario where you are perceiving an issue, as you said, and you’re trying to do some troubleshooting. The other is you just want to be more aware and understand better what DRS is doing all the time. In the case of my testing, the tool was actually just a handful of Python scripts, and so I just ran those from my local machine against the vCenter that I wanted to interact with, and I got what I needed. That’s one way you could do it.
  It sounds like the way you are describing setting it up there is also a case where you would kind of just have this running all the time, and that way whenever you want to pop in and get some historical data it would have been running and captured the stuff you wanted to look at. Is that correct?
Brian: Yeah, exactly. If you are perceiving an issue, chances are the issue has already occurred, or the logs that you want to look at have already been generated. At that time, if the issue continues to occur, then it works really well to pop in, turn on DRS Doctor maybe from your local machine, and run it for maybe an hour to generate like 20 logs, and take a look. Then you have the other option which is to just essentially have this up and running and kind of get a better idea of what DRS is doing.
  A number of years ago, Duncan Epping and Frank Denneman did a really good job of writing a book that they updated a few times, as well as a number of blog posts around DRS and HA and how they work. A lot of people back then read that book, read the blog articles, and were really up-to-date on how DRS works.
  Since then, we’ve kind of seen a drop off of the number of people that have been leveraging that, and so a lot of people are trying to understand, “Why is DRS doing what it is doing? Why did DRS make the moves that it chose to make?” Right? This is a great way for them to just easily jump in there and see exactly why DRS did that.
James: For somebody who has read that book, let me ask you this; how different is DRS today from the time when they wrote that book which I believe was 5.0?
Brian: Yeah, that’s a good question. We have a lot of things that go on there. To the naked eye, I think a lot of people would say that DRS hasn’t changed over the years. We have had a number of updates, and so a lot of it has to do with fine-tuning the DRS algorithm. We’ve added additional advanced options for people that are trying to really tweak their environment, or have special use cases. The overall purpose of how DRS works, and why it does what it does is still the same.
James: Okay, cool. Well, we’ve got just a minute left, and I want to just ask you: tell me how to run this thing? If I get the Fling downloaded and I want to try it, I know you can get up and running really fast if you want to test it out. Walk me through how to do that.
Brian: Okay.
James: I know you mentioned that you have a few dependencies, so let’s just assume you’ve taken care of that and you are ready to run the tool. What does that look like?
Brian: Yeah, so once you’ve taken care of the dependencies, there is a config file where you just have to go in and add in the address, username, and cluster that you want to monitor. Once you’ve saved that, then you essentially run the DRS Doctor file, it will ask for the password for the username that you put in the config file, and as soon as you do that, you will see that it starts generating the first log. You will start seeing logs pop up in the DRS Doctor window every five minutes.
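[Editor's note] As a rough illustration of that setup step, the sketch below reads an address, username, and cluster name from a config file and asks for the password only at run time. The file layout, the `[vcenter]` section name, and the function names are assumptions made for this example; the config file that ships with the Fling may look quite different.

```python
import configparser
import getpass
from typing import Callable, Dict

# Hypothetical layout; the real DRS Doctor config file may differ.
SAMPLE_CONFIG = """
[vcenter]
address = vcenter.example.com
username = administrator@vsphere.local
cluster = Cluster01
"""

REQUIRED_KEYS = ("address", "username", "cluster")


def load_settings(text: str) -> Dict[str, str]:
    """Parse the config text and fail early if a required value is missing."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section("vcenter"):
        raise ValueError("config is missing the [vcenter] section")
    section = parser["vcenter"]
    missing = [k for k in REQUIRED_KEYS if not section.get(k, fallback="").strip()]
    if missing:
        raise ValueError("config is missing: " + ", ".join(missing))
    return {k: section[k].strip() for k in REQUIRED_KEYS}


def ask_password(username: str,
                 prompt: Callable[[str], str] = getpass.getpass) -> str:
    """Prompt for the password so it never has to be stored in the file."""
    return prompt(f"Password for {username}: ")


if __name__ == "__main__":
    settings = load_settings(SAMPLE_CONFIG)
    print(f"Would connect to {settings['address']} as {settings['username']} "
          f"and watch cluster {settings['cluster']}")
```

Calling `ask_password(settings["username"])` in an interactive session shows the prompt without echoing what is typed; the `prompt` parameter exists only so the function can be exercised without a terminal.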
  Once those are done, when you are ready to turn that off you can just close out of DRS Doctor, jump back into the cluster, and make sure that you set DRS back to fully automated mode. There is also a small parsing script in the DRS Doctor folder that you can just right-click and run that will go through and take an aggregate of all your log files and dump them out in a summary, an audit, and an entitlement log for the entire time so you don’t have to go through and look at every five minutes individually.
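[Editor's note] To give a feel for what rolling many five-minute logs into one report involves, here is a minimal sketch. It covers only a summary-style count of moves per VM, and the `MIGRATE <vm> <source> <target>` line format is invented for the example; the Fling's own logs and its parsing script are richer and laid out differently.

```python
from collections import Counter
from typing import Iterable, List

# Hypothetical log line format: "MIGRATE <vm> <source_host> <target_host>".
# Any line that does not match is ignored.


def summarize(logs: Iterable[str]) -> List[str]:
    """Collapse many per-pass logs into one line per VM, busiest first."""
    moves: Counter = Counter()
    passes = 0
    for text in logs:
        passes += 1
        for line in text.splitlines():
            parts = line.split()
            if len(parts) == 4 and parts[0] == "MIGRATE":
                moves[parts[1]] += 1
    report = [f"passes: {passes}", f"migrations: {sum(moves.values())}"]
    # Sort by count descending, then by name so the output is stable.
    for vm, count in sorted(moves.items(), key=lambda kv: (-kv[1], kv[0])):
        report.append(f"{vm}: {count}")
    return report


if __name__ == "__main__":
    sample = [
        "MIGRATE web01 esx-a esx-b\nMIGRATE db01 esx-b esx-a\n",
        "no moves this pass\n",
        "MIGRATE web01 esx-b esx-a\n",
    ]
    for line in summarize(sample):
        print(line)
```

A VM that appears near the top of such a report across a short window is a natural place to start when asking why DRS keeps moving it.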
James: Awesome. What you’re going to get once you are all the way to the end is three files that will give you a kind of breakdown of all the information you are really concerned about over the period of time that you are running the tool?
Brian: Yep.
James: Awesome. Well, thank you Brian so much for being on and talking with us about DRS Doctor, I really appreciate your time, and I look forward to seeing how this tool develops. I imagine that over time it will grow a few features, or sometimes we see things that started as a Fling wind up in vSphere proper, so who knows?
Brian: Yep. For everyone that is listening, go out and give it a try and give us some feedback, because that’s how we learn and how we can generate these future features. Thanks.
James: Awesome.
James Green

James is a Partner at ActualTech Media and writes, speaks, and consults on Enterprise IT. He has worked in the IT industry as an administrator, architect, and consultant, and has also published numerous articles, whitepapers, and books. James is a 2014 - 2016 vExpert and VCAP-DCD/DCA. Follow James on Twitter
