I’ve worked with OEM in the past, but these days I’m working with it much more than before. I have a client who uses it as their main monitoring system, so we upgraded it to 13.2, and since then I have created custom alerts (metric extensions), installed patches, used it to diagnose issues, and more.
It was always quite complicated to use, which makes sense given how many features and capabilities it has. However, some things just don’t seem right. I wonder how many people use it as their main monitoring system and, if it’s quite a lot, whether they suffer from the same things or it’s just me.
Metric Extension on DB Target
In this specific environment I have both single-instance databases and RAC databases. When I create a metric extension I have to choose between “database instance” and “cluster database”. I wanted to create one metric extension to be used on all of the databases, but this is quite a problem. If I choose “database instance”, which instance should I choose for the RAC databases? I want it to be able to use a different instance if one of them is down, but I don’t want to deploy it to all instances. If I choose “cluster database”, I can’t deploy it to the non-RAC databases. Am I missing something? Any ideas? Why can’t there be a simple “database” target that runs once on each database, whether it’s RAC or single instance?
Viewing All Incidents
This is a tricky one. I wanted to see all the incidents/problems/events/whatever you call them for my entire environment, but the data seems to be inconsistent. I can go to the “database” screen and then click on the number of critical or warning issues next to a database. This way I see the information for that specific database. To see all incidents, I went to the “incident manager”. I should be able to see everything there, right? But I didn’t. There were issues (like the ones raised by metric extensions) that I simply couldn’t see, no matter how I modified the search. And some of the issues I saw in the incident manager were not there when I went through the database page. I can’t really understand how these reports work and why they are not identical.
Tablespace Free Space
Tablespace free space is a very important alert. In this database I saw that I got alerts all the time, so I wanted to adjust the threshold. The problem is that this is a large database and some tablespaces are really big. For a 10TB tablespace, even if I set the threshold to 1%, this is still 100GB, which is quite a lot. When working with large databases, I expect the alert system to allow a size threshold as well. For example, alert at 1% free, but only if the free space is 50GB or less. That way I don’t need to change the threshold for each tablespace to a fraction of a percent, and I still don’t get false alerts. OEM doesn’t have this option.
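To make the rule I have in mind concrete, here is a minimal Python sketch of it. This is not something OEM offers; the function name, the parameters, and the default thresholds (1% and 50GB) are just my illustration of the example above.

```python
def should_alert(size_gb, free_gb, pct_threshold=1.0, abs_threshold_gb=50.0):
    """Hypothetical combined rule: alert only when free space is below
    BOTH the percentage threshold and the absolute size threshold."""
    if size_gb <= 0:
        raise ValueError("size_gb must be positive")
    free_pct = 100.0 * free_gb / size_gb
    return free_pct <= pct_threshold and free_gb <= abs_threshold_gb


# A 10TB (10000GB) tablespace with 80GB free is under 1%, but still has
# more than 50GB free, so no alert is raised.
print(should_alert(10000, 80))   # False
# The same tablespace with 40GB free crosses both thresholds.
print(should_alert(10000, 40))   # True
```

With a rule like this, small tablespaces are still governed by the percentage, and only the huge ones are protected by the size cap.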
More About Tablespace Free Space
Another thing about tablespaces, and this really boggled my mind, is autoextendable files. Apparently (and it’s described in MOS note 2101403.1) OEM checks whether the tablespace can grow based on the free disk space. This makes sense, but for some reason it requires the free disk space to cover growth all the way to the maximum tablespace size. So let’s take a tablespace with 10 files, each of them 10GB but able to grow to 32GB. This means that the tablespace size is 100GB, and the maximum size is 320GB. OEM will expect the disk to have 220GB free, or it will alert on tablespace free space. We will get an alert even if there is 100GB of free disk space, which means that the tablespace can double its size before we have an issue. In my opinion, in this case the maximum available size should be the smaller of the maximum tablespace size and the current size plus the free disk space. Based on that we should calculate the available space and alert if it’s below the threshold. In the example above, if the tablespace (100GB) is full and we still have 100GB on the disk, the maximum available size is 200GB, so the calculated metric is that the tablespace is 50% full of the maximum available capacity and there is no reason to alert.
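The arithmetic I’m proposing can be sketched in a few lines of Python. This is my suggested calculation, not what OEM does, and the function and parameter names are made up for the illustration. It also assumes all the datafiles share one filesystem, so a single free-space figure applies.

```python
def pct_used_of_reachable(used_gb, current_size_gb, max_size_gb, disk_free_gb):
    """Percent used relative to the size the tablespace can actually reach.

    The reachable size is capped both by the autoextend limit (max_size_gb)
    and by what the disk can hold (current size plus free disk space).
    """
    reachable_gb = min(max_size_gb, current_size_gb + disk_free_gb)
    return 100.0 * used_gb / reachable_gb


# The example above: 10 files of 10GB (100GB, completely full), maximum size
# 320GB, and 100GB free on the disk. The reachable size is 200GB.
print(pct_used_of_reachable(100, 100, 320, 100))  # 50.0
# With 220GB or more free on the disk, the autoextend limit is the cap.
print(pct_used_of_reachable(100, 100, 320, 500))  # 31.25
```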
TFA
This is probably the least problematic one, as it’s related to diagnosing OEM itself. I had a performance issue and wanted to open an SR. Usually, when these things happen, I try to upload TFA information so the support engineer will have all the information they need. The thing is that the OMS repository database is on a different host; the OMS server has only the OMS software. Following the documentation, I executed “./tfactl diagcollect -srdc emdebugon” to enable debug. It asked me for the DB details (host, port, etc.), but then failed with “Returned error message: TNSPINGNOTFOUND.”. Yes, I don’t have TNSPING on this host, but why do you need it? Then I ran “./tfactl diagcollect -srdc emomscrash” to collect the information, and it asked me for the repository database name. It didn’t ask me for host, port, and db name; it expected just the database name itself. When I entered something I got “database does not exist”. Quite frustrating. I ended up collecting the log files manually and uploading them to the SR.
OEM 13 is supposed to be quite mature, as it is a long way from the initial release. However, there are still quite a few things to add or change in order to make it a really wonderful and easy-to-use monitoring system.
Metric Extension on DB Target
Was having the same trouble.
When the ME runs, check that it is running on the instance with the minimum instance number; if it’s not, exit.
So for me, the role has to be primary and the instance number has to be the minimum one.
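A rough Python sketch of that check, as one reading of the approach: the function name and its inputs are hypothetical. In a real ME the values would presumably come from V$DATABASE (DATABASE_ROLE), V$INSTANCE (INSTANCE_NUMBER) and GV$INSTANCE (the instances that are currently up), and the check would be done inside the ME’s own query or script.

```python
def should_run_metric(database_role, my_instance_number, open_instance_numbers):
    """Return True only on the one instance that should evaluate the metric.

    database_role: role of the database, e.g. "PRIMARY".
    my_instance_number: instance number of the instance running the ME.
    open_instance_numbers: instance numbers of all instances currently up.
    """
    if database_role != "PRIMARY":
        return False
    if not open_instance_numbers:
        return False
    # Only the lowest-numbered running instance does the work.
    return my_instance_number == min(open_instance_numbers)


# RAC with instances 1 and 2 up: only instance 1 runs the metric.
print(should_run_metric("PRIMARY", 1, [1, 2]))  # True
print(should_run_metric("PRIMARY", 2, [1, 2]))  # False
# Instance 1 is down: instance 2 takes over.
print(should_run_metric("PRIMARY", 2, [2]))     # True
```

A single-instance database passes the same check trivially, since its only instance is always the minimum one.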