As a senior consultant, I often come to a client’s site to diagnose and fix a problem they have. It can be a performance problem, strange database behavior, a database crash, and more.
The variety of problems and people to handle is huge, and dealing with it is not easy at all. In some cases the client understands exactly how things work and can explain the problem perfectly. In other cases the client doesn’t know anything (they might not be technical at all) and it’s really difficult to even understand what the problem is.
In this post I decided to give a short guideline on how to start the diagnosis process. As it’s impossible to cover every possible cause and solution, I will talk about what I usually do from my side when coming to a client I’m not familiar with. I’ll try to give general examples to explain my points, but I’ll leave real examples to a different post I plan to write soon.
So let’s start.
Knowing the Other Person
It is very important to understand who is standing in front of you. They might be an IT person, a developer, a DBA or an application user. It is just as important to talk to them in their own language. If an application user says that the application is down, I will never ask them: “Did you connect to SQL*Plus to see if the database is up?”. I will ask for information that they have, or at least phrase my questions so that they understand what I’m asking.
Asking Questions
The first thing is to understand the problem: not to look for a solution yet, but to get as much information as possible about the problem. We do that by asking questions.
When asking something like “what’s the problem” we might get answers like “the database is slow”. This answer tells us almost nothing, for several reasons:
- What is slow in the database? Is it everything or only a certain operation? Sometimes people say “everything is slow” and then I realize that “everything” for them is the specific application flow they are using and everything else for everyone else is working as usual.
- Is it really the database? Maybe the network is slow or down? Maybe there is a storage problem? Maybe the developers changed something in the application?
- Is it slow now? I have had cases where there was a performance problem, but by the time I got there the problem was gone and I needed to find out what had happened in the past, so looking at the system behavior at that moment was irrelevant.
Because of that, the first step is to ask questions to understand exactly what’s going on. And we should do that before we even log in to the system.
As I said, the problems are different every time, but here is a short list of example questions that might be relevant, and why:
- Is it the first time you see the problem? If it happens occasionally, does it happen on the same day every time? Same hour?
I had a case where the problem consistently occurred on Monday mornings, and it turned out to be a backup that didn’t finish on time.
- When did the problem start? It might change the picture (and our research) if this started this morning or a week ago.
- Was there any change in the environment (application upgrade/OS upgrade/some maintenance) before the problem started?
- Do you see the problem all over the application or only in specific flows/screens?
- Can you please show me what you see? You won’t believe how important this step is. First, because you might see something that they don’t, and second, sometimes the problem will be gone when they try to show you. That doesn’t mean you’re done, but it means that you need to check different things (such as logs instead of current behavior).
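The "same day, same hour?" question above can be checked against whatever incident times the client can recall. Below is a minimal sketch, assuming the reported times have already been turned into `datetime` values; the function name `recurrence_pattern` and the sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Fixed English names so the result does not depend on the system locale.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def recurrence_pattern(timestamps):
    """Count incidents per (weekday, hour) bucket, most frequent first."""
    buckets = Counter((DAYS[ts.weekday()], ts.hour) for ts in timestamps)
    return buckets.most_common()

# Hypothetical incident times, as a client might report them.
incidents = [
    datetime(2024, 1, 1, 8, 15),    # a Monday
    datetime(2024, 1, 8, 8, 40),    # a Monday
    datetime(2024, 1, 15, 8, 5),    # a Monday
    datetime(2024, 1, 17, 14, 30),  # a Wednesday
]

pattern = recurrence_pattern(incidents)
print(pattern[0])  # the busiest (weekday, hour) bucket and its count
```

If one bucket clearly dominates, it is worth asking what else runs at that time (a backup, a batch job, a scheduled report).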
Getting Information
After we have all the relevant information from the client, the next step is to gather some information ourselves. It depends on what the problem is, but at this point I would connect to the database server, check its utilization, generate an AWR or Statspack report, check the OS and database logs, and more.
Gather as much information as you can (related to the problem, of course) and think about what else you need to gather or check before you continue. For example, if I get to a server and the database is down, I won’t simply start it. The client will suffer another couple of minutes of downtime while I check the alert log to understand the problem. When the database needs recovery, I will make sure we have a backup of the current state (or create a backup if possible), because performing recovery updates the files. I have had cases where the recovery wasn’t successful and I wanted to get back to the starting point and try a different method.
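As an illustration of checking the alert log before touching anything, here is a minimal sketch that pulls the `ORA-` error codes out of the last lines of a log. The function name `recent_ora_errors`, the `tail` default and the sample lines are assumptions made up for this example; they are not taken from a real system, and a real alert log will look different.

```python
import re
from collections import deque

# Oracle error codes look like "ORA-" followed by five digits.
ORA_ERROR = re.compile(r"ORA-\d{5}")

def recent_ora_errors(lines, tail=500):
    """Return the ORA- codes found in the last `tail` lines, in order."""
    last_lines = deque(lines, maxlen=tail)  # keeps only the newest lines
    found = []
    for line in last_lines:
        found.extend(ORA_ERROR.findall(line))
    return found

# Made-up sample lines, for illustration only.
sample = [
    "Starting ORACLE instance (normal)",
    "Errors in file /hypothetical/path/trace.trc:",
    "ORA-01157: cannot identify/lock data file 5 - see DBWR trace file",
    "ORA-01110: data file 5: '/hypothetical/path/users01.dbf'",
]

print(recent_ora_errors(sample))
```

Because the function accepts any iterable of lines, an open file object can be passed directly; the location of the alert log depends on the installation, so the path has to be looked up on the server in question.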
Think of a Hypothesis and Prove It
If this is a complex problem, at this point we need to come up with a hypothesis. What might cause the problem? At this stage, anything is allowed, including blaming other components (application, storage, etc.). But blaming someone else is easy, just as easy as it is for someone else to blame us, so we need to be very careful. That’s why we need to prove our hypothesis as much as we can before actually telling it to the relevant people.
In order to prove a hypothesis, we might need to check more logs, ask other people to check their logs, and maybe run several tools or tests. At this point there is no shame in contacting someone you know who might be helpful. If the problem is really strange and complex, I usually make a phone call to another consultant and talk it over. Brainstorming might bring up things you didn’t think about at the beginning, or help you prove your suspicions. Sometimes I suspect the OS, so I call a sysadmin I know who can help me (even though I’m familiar with operating systems, they will probably know better).
Solve the Problem (or Try Again)
Now that you think you know what the problem is, you can fix it (or let others fix it). But it is also important to verify that it was indeed fixed. If you made the change or applied the fix and it didn’t help, you need to think of a different cause and try again.
Summary
I know that this post is very general, but I meant it to be a general guideline that brings some order to the diagnosis process. I hope it will help you when you get to your next crisis.