As DBAs, we usually sit in the middle, between IT (sysadmins, storage and network people) and development. Some of my best friends are sysadmins and developers (well, friends anyway), but I still want to dedicate this post to letting off some steam about them.
I hear many complaints from DBAs about sysadmins (and I include OS, network and storage people here) and developers, and, well, they are usually justified. That doesn’t mean they don’t complain about us (again, justifiably), but still…
So I want to tell a few stories from my own experience. Feel free to leave comments with your own (funny or sad) stories.
Somehow it’s always us…
Because developers work with the database, but not with the network or the OS, whenever something goes wrong it’s always our fault, right?
I’ve seen people (DBAs as well, I have to admit) who simply shift the responsibility to others too quickly. Sentences like “the database is slow, why do you think it’s the storage?” (from storage people) or “the query was fast yesterday and slow today, it must be the database” (from developers) are, unfortunately, very common. That’s why (and if you read my posts you know that by now) I always try to prove my theories about the problem as much as I can. Sometimes I make mistakes as well and move the problem to someone else, only for us to eventually figure out that it was indeed the database. But that happens only after I have really done my best and honestly concluded it was not the database. I expect them (sysadmins, developers, etc.) to do the same and check themselves first, before blaming us.
We didn’t change anything…
Many times I’ve arrived at a client who complained about performance problems. The first things I ask in these cases are “when did it start?” and “what did you do right before it started?”. Many times the answer was something like “we upgraded the application, but haven’t changed anything relevant to this screen or flow”. Somehow, I always find a “bad” SQL in the end (and between us, this is usually not that difficult), and when I show it to them they say “oh, OK, yeah, this is a new SQL”. Is it really easier to call me, wait for me and pay for my time just to realize that you added a new SQL to this flow yesterday? Really?
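By the way, finding that “new SQL” rarely requires anything fancy. Here is a minimal sketch of the kind of query I mean (hypothetical, not from any specific client): pull the most expensive statements from V$SQL and check when they first showed up.

```sql
-- A minimal sketch: top statements by total elapsed time.
-- FIRST_LOAD_TIME often shows that the "unchanged" application
-- started running a brand new SQL right after the upgrade.
SELECT *
FROM   (SELECT sql_id,
               first_load_time,
               executions,
               ROUND(elapsed_time / 1000000) AS elapsed_sec
        FROM   v$sql
        ORDER  BY elapsed_time DESC)
WHERE  ROWNUM <= 10;
```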
It’s not us, everything looks good here…
Another very common problem is when the application people blame the database (as usual), but the DBA realizes that it is actually an OS or storage problem. I had such a case about 10 years ago: one of the DBAs on my team (I managed a team of consultants back then) phoned me for assistance. The client’s database had been slow for about two days, so I asked “what happened two days ago?”, and the answer was that the storage people had made some changes to their storage (an EMC Clariion if I remember correctly). All the evidence supported a storage issue as well (AWR reports, etc.). So it was easy, it was a storage issue, right?
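The database-side evidence in cases like this tends to look the same, by the way. Here is a minimal sketch of the kind of check involved (I don’t remember the exact queries we ran back then): list the top User I/O wait events and their average latency.

```sql
-- A minimal sketch: top User I/O wait events with average latency.
-- Single-block reads ("db file sequential read") averaging tens of
-- milliseconds point at the I/O path, not at the SQL or the database.
SELECT *
FROM   (SELECT event,
               total_waits,
               ROUND(time_waited_micro / 1000000) AS time_waited_sec,
               ROUND(time_waited_micro / NULLIF(total_waits, 0) / 1000, 1) AS avg_wait_ms
        FROM   v$system_event
        WHERE  wait_class = 'User I/O'
        ORDER  BY time_waited_micro DESC)
WHERE  ROWNUM <= 10;
```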
The problem was that the storage people said that everything worked fine on their side. They even copied some files around to check the storage performance. In case you don’t know, copying a file and Oracle activity are totally, but TOTALLY, different: a file copy is large, sequential I/O, while a busy database mostly issues small, random, synchronous reads and writes. The fact that copying a file works perfectly simply doesn’t prove anything. But they insisted.
The next step was to guide my guy to check the system logs, which, surprisingly, showed plenty of “SCSI error” messages. With this information he went to the storage people again, who simply said that this was OK and we could ignore these errors! At that point I gave up; this was clearly a storage issue, but they denied it. I told my guy to report that to whoever was in charge and say that this was the best we could do. After about 30 minutes he called me back: the storage people had realized (maybe after googling it or something) that the SCSI errors could not be ignored, and found that during the maintenance they had messed up the FC routing tables. Once this was fixed, everything worked just fine.
Nothing strange here…
Another case of blaming the database was a couple of years ago, when we got to a client with a “slow database”. It was a small application, so it was installed on a small server outside the computer room, and it used a NAS with iSCSI for storage. It didn’t take us long to understand that this was a storage issue, but the storage people claimed that everything was OK on their side (obviously). The people we worked with were sure that this was a database issue, but it clearly wasn’t. We tried to get to know the environment a little better and understand whether there was something between the database and the storage, but they insisted that the database was the problem.

I think it was a few hours later that someone else came to sit with us, and when we talked about their environment again he said “maybe there is a problem with the firewall”. We immediately responded: “Wait a minute, what firewall? Is there a firewall between the database server and the storage? But we asked about something like that and you said nothing… Oh well, anyway, this is your problem; remove the firewall from the path and watch the performance problem disappear”.
Summary
I’ve met some excellent developers and sysadmins over the years. I learned a lot from some of them, and I really enjoyed working with them and solving problems together. But there are also the others, the ones who don’t even try to understand, or simply don’t want to. I don’t know what they are trying to achieve, and I don’t know why they think it’s good for them (or maybe they know it isn’t but don’t care). Because if I’m going to solve the problem eventually, isn’t it better for them to be part of the solution instead of part of the problem?
I even once faced a performance issue that eventually turned out to be hardware: the FA switch was faulty, and when we replaced it, performance was blazing fast. Even hardware issues contribute to database performance.
Of course hardware issues can be the root cause; we just need some assistance from the relevant team to identify and resolve the problem.