Bash Critical Mistakes

Lately I wrote a couple of posts about bash scripting for DBAs (part 1 and part 2). I promised to post an example script as well, and I will. But before that, I thought it was important to give several examples of how destructive scripts can be. The examples in this post are real, and I was involved in all of them (either as the one who caused the problem or as the one who fixed it).

Deleting a Table

The first example is from about 12-13 years ago. It’s not a bash example but an SQL script. I decided to include it here because it’s related to something I wrote in the previous post, and it’s very relevant to scripting in general.

If you remember, I wrote that the exit code is important: we should use it to decide what to do when something goes wrong. So here is what happens if we don’t.

At my workplace back then, we needed to write a script to delete many rows from a table. Using “delete” to remove many rows is very slow and inefficient. The better way is to move the data we need to keep to a different table, truncate the original table and move the data back. That way we can bypass some internal maintenance mechanisms (like redo) and get better performance. One of my colleagues wrote the script to be used during the system downtime. The script performed an “insert as select” from the original table into a temporary table, then truncated the original table and inserted the data back. When the downtime started, he executed the script and waited.

Remember I wrote about “whenever sqlerror exit”? He didn’t use it. What happened is that the temporary table grew and grew until it filled up the tablespace, and the insert failed with an error. Since nothing told the script to stop on that error, it continued and truncated the original table…

It took us a few hours to restore the database from backup, as we lost all the data.
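The same lesson carries over to bash. Here is a minimal sketch of the idea, not the original script (which was SQL and is not shown here): each step runs only if the previous one succeeded. The three step functions are hypothetical stand-ins, and the first one simulates the failed insert by returning a non-zero exit code.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the three steps of the maintenance job.
# copy_rows_to_keep simulates the failure (e.g. the tablespace filling up).
copy_rows_to_keep() { echo "copying rows to the temporary table"; return 1; }
truncate_original() { echo "truncating the original table"; }
copy_rows_back()    { echo "copying rows back to the original table"; }

run_maintenance() {
    # Check the exit code of every step and stop at the first failure,
    # so the truncate is never reached if the copy did not complete.
    copy_rows_to_keep || { echo "copy failed, original table left untouched" >&2; return 1; }
    truncate_original || { echo "truncate failed" >&2; return 1; }
    copy_rows_back    || { echo "copy back failed" >&2; return 1; }
}

# Capture the output and exit code so the result can be inspected.
output=$(run_maintenance 2>&1)
status=$?
echo "$output"
echo "exit status: $status"
```

With the simulated failure, the function returns right after the first step and the truncate never runs.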

Deleting a Server

OK, this one is completely my fault, and it happened only a few years ago. I wrote a bash script for a customer (it is probably the largest I’ve ever written, and it works really nicely now) to do all kinds of things. To make it easier, I created a function called “fail” that cleans up all the temporary files I was using in the process, prints the error message and exits.

It’s convenient and easy to use such a function. After every command I can simply check the return code and if something is wrong call “fail”. It does everything for me.
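To give an idea of the pattern, here is a small sketch of such a function. It is not the customer’s script; the name “work_file” and the failing step are made up for the illustration, and the demo runs in a subshell so that the “exit” inside “fail” ends only the subshell.

```shell
#!/usr/bin/env bash
# Hypothetical temporary file used by the script.
work_file=$(mktemp)

# fail: remove the temporary file, print the error message and exit.
fail() {
    rm -f -- "$work_file"
    echo "ERROR: $1" >&2
    exit 1
}

# Run the steps in a subshell so that 'exit' in fail stops only the demo.
(
    echo "scratch data" > "$work_file" || fail "could not write $work_file"
    false || fail "the second step failed"   # 'false' simulates a failing command
    echo "this line is never reached"
)
status=$?
echo "subshell exit status: $status"

if [ ! -e "$work_file" ]; then
    echo "temporary file was cleaned up"
fi
```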

As I said, the script used many different temporary files and directories, so I had to clean them up in the “fail” function. One of them was a directory I created inside another directory that I got as an input. So the “fail” function used “rm -rf ${user_dir}/${tmp_dir}” to delete it. So far so good.

After a few days I added some input validation, so that the script would fail if something was wrong with the input. As part of that I performed checks on the variables, and if something was wrong, I called “fail”. The only thing I missed is that the “user_dir” and “tmp_dir” variables were generated from the user’s input, and the first call to “fail” came before these variables were populated. You see where I’m going, right? If these variables are not populated and I call “fail”, what happens when it executes “rm -rf ${user_dir}/${tmp_dir}”? Right, both variables expand to nothing, so it executes “rm -rf /”, and we all know how bad this is.

When I tested my input validation, I intentionally entered invalid data, and then I got the error message “cannot remove /proc/…”. I was puzzled, but immediately understood that something was very wrong. I stopped the script and wanted to check how much damage I had done. I tried to see the files on the server using “ls”. When I got the error “ls: command not found” I knew that the game was over.

I was really lucky to have my own Linux server for development, so the fact that I ruined it didn’t have any effect on the rest of the company. But since then I’ve been very careful about using variables before populating them.
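One way to guard against this is the “${var:?message}” expansion, which makes the shell report an error instead of expanding an unset or empty variable. The sketch below is only an illustration of that idea: “safe_cleanup” is a hypothetical helper, and the demo works on a throwaway directory created with “mktemp -d”.

```shell
#!/usr/bin/env bash
# safe_cleanup is a hypothetical helper. If user_dir or tmp_dir is unset
# or empty, the ${var:?} expansion fails and rm is never executed.
safe_cleanup() {
    rm -rf -- "${user_dir:?user_dir is not set}/${tmp_dir:?tmp_dir is not set}"
}

# Throwaway directory for the demo.
base=$(mktemp -d)
mkdir "$base/work"

# Case 1: the variables were never populated. The expansion error stops
# the subshell before rm runs.
( unset user_dir tmp_dir; safe_cleanup )
unset_status=$?

# Case 2: the variables are populated, so only $base/work is removed.
( user_dir="$base"; tmp_dir="work"; safe_cleanup )
set_status=$?

echo "status with unset variables: $unset_status"
echo "status with populated variables: $set_status"
```

Adding “set -u” at the top of a script is a related safety net, but it only catches unset variables, not variables that are set to an empty string, so the explicit guard is still worth having on anything destructive.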

Summary

Scripts are very useful, but they can be very dangerous. These are only two examples, but there are many more. Scripts that don’t stop on failures, don’t check return codes, or use unpopulated variables (and these are only a few of the things we can do wrong) can lead to a mess.

Before you write your next script, try to remember this post and eliminate at least these devastating scenarios.
