Oracle SEHA – Failover When VIP Goes Down

Lately I’ve been writing about Standard Edition High Availability (SEHA). After publishing my introduction post about SEHA, I got a comment from Purav with a case I didn’t cover in my testing SEHA post. So I wanted to thank Purav for this and cover this problem here.

Problem Description

When we have a SEHA environment, Oracle will automatically fail over the database from one node to the other if needed, and I checked all kinds of different scenarios. The scenario I didn’t check is what happens when the public interface goes down.

When one server cannot access the public network, its resources should relocate. When this happens we can see that the VIP is moved to one of the other nodes, but what about the database? With RAC, each instance runs on its own server, so if the network goes down, the instance stays up but obviously cannot serve clients. With SEHA, however, we have only one instance that moves between the servers, so we would expect this instance to fail over, but it doesn’t.

I tested this on my SEHA environment, where I have both nodes as VirtualBox machines. It’s easy to just go to the running instance and bring the network down.

This Is What It Looks Like

This is a partial output when everything is up and running:

[oracle@seha2 ~]$ crsctl status res -t
...
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
...
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       seha1                    STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       seha1                    STABLE
ora.se1.db
      1        ONLINE  ONLINE       seha1                    Open,HOME=/oracle/db
                                                             /19,STABLE
ora.seha1.vip
      1        ONLINE  ONLINE       seha1                    STABLE
ora.seha2.vip
      1        ONLINE  ONLINE       seha2                    STABLE
--------------------------------------------------------------------------------

And this is what it looks like after disconnecting the public network interface of the seha1 server:

[oracle@seha2 ~]$ crsctl status res -t
...
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
...
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       seha2                    STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       seha2                    STABLE
ora.se1.db
      1        ONLINE  ONLINE       seha1                    Open,HOME=/oracle/db
                                                             /19,STABLE
ora.seha1.vip
      1        ONLINE  INTERMEDIATE seha2                    FAILED OVER,STABLE
ora.seha2.vip
      1        ONLINE  ONLINE       seha2                    STABLE
--------------------------------------------------------------------------------

As you can see, both virtual IPs are now running on seha2, as are the SCAN VIP and the SCAN listener. The database se1, however, is still running on the seha1 server and is practically inaccessible to clients.
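A quick way to spot this state from a script is to compare the node that hosts the database with the node that hosts that node's own VIP. The following is only a minimal sketch, not an Oracle-provided tool: the function names resource_node and db_stranded are made up for illustration, and the code assumes the crsctl status res -t layout shown above and that node VIPs are named ora.<node>.vip, as they are in this environment.

```shell
# Hypothetical helpers (not part of Oracle Clusterware).
# Both read `crsctl status res -t` output from stdin.

# Print the server hosting the first running instance of resource $1.
resource_node() {
  awk -v res="$1" '
    NF == 1 && $1 == res { found = 1; next }   # the resource name line
    found && /^ora\./    { exit }              # next resource reached: not running
    # instance line fields: number, target, state, server, details
    found && ($3 == "ONLINE" || $3 == "INTERMEDIATE") { print $4; exit }
  '
}

# Exit 0 when database resource $1 runs on a node whose own VIP
# (assumed to be named ora.<node>.vip) is hosted by a different node.
db_stranded() {
  out=$(cat)
  db_node=$(printf '%s\n' "$out" | resource_node "$1")
  [ -n "$db_node" ] || return 1
  vip_node=$(printf '%s\n' "$out" | resource_node "ora.${db_node}.vip")
  [ -n "$vip_node" ] && [ "$vip_node" != "$db_node" ]
}
```

On a cluster node this could be used as: crsctl status res -t | db_stranded ora.se1.db && echo "se1 is cut off from the public network". With the second output above it would report the problem, because se1 is on seha1 while ora.seha1.vip has failed over to seha2.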

Solution

Before I could get a response from Oracle about this, Purav reached out again. They opened an SR and wanted to let me know about the response.

The solution is to create a service for this database. Once a service exists, when the public network goes down, the database will relocate to the other node.

I attached the network again and created a service:

[oracle@seha1 ~]$ srvctl add service -db se1 -service se1_seha
[oracle@seha1 ~]$ srvctl start service -db se1 -service se1_seha
[oracle@seha2 ~]$ crsctl status res -t
...
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
...
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       seha2                    STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       seha2                    STABLE
ora.se1.db
      1        ONLINE  ONLINE       seha1                    Open,HOME=/oracle/db
                                                             /19,STABLE
ora.se1.se1_seha.svc
      1        ONLINE  ONLINE       seha1                    STABLE
ora.seha1.vip
      1        ONLINE  ONLINE       seha1                    STABLE
ora.seha2.vip
      1        ONLINE  ONLINE       seha2                    STABLE
--------------------------------------------------------------------------------

Let’s see what happens now after I disconnect the public network of seha1:

[oracle@seha2 ~]$ crsctl status res -t
...
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
...
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       seha2                    STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       seha2                    STABLE
ora.se1.db
      1        ONLINE  ONLINE       seha2                    Open,HOME=/oracle/db
                                                             /19,STABLE
ora.se1.se1_seha.svc
      1        ONLINE  ONLINE       seha2                    STABLE
ora.seha1.vip
      1        ONLINE  INTERMEDIATE seha2                    FAILED OVER,STABLE
ora.seha2.vip
      1        ONLINE  ONLINE       seha2                    STABLE
--------------------------------------------------------------------------------

When the network got disconnected I saw that the VIP of seha1 got relocated as before, but Oracle immediately realized that the service is down as well. This caused the database to relocate and after a short while the database and service were up and running on seha2 as expected.

A Bit of Research

Why does this happen? The reason lies in resource dependencies. Each resource in the cluster depends on others and this causes resources to stop, start, and relocate based on the behavior of other resources in the cluster.

Let’s look at the dependencies of the database and the service:

[oracle@seha2 ~]$ crsctl status resource ora.se1.db -dependency
================================================================================
Resource Start Dependencies
================================================================================
                                   ora.se1.db
--------------------------------------------------------------------------------
ora.se1.db(ora.database.type)->
| ora.DATA.dg(ora.diskgroup.type)[hard:global:uniform,pullup:global]
| | ora.asm(ora.asm.type)[hard,pullup:always]
| | | type:ora.asm_listener.type[hard:type,pullup:type]
| | | | ora.asmnet1.asmnetwork(ora.asm_network.type)[hard,pullup]
| | | ora.ASMNET1LSNR_ASM.lsnr(ora.asm_listener.type)[weak]
| | | | ora.asmnet1.asmnetwork(ora.asm_network.type)[hard,pullup]
| type:ora.listener.type[weak:type]
| | type:ora.cluster_vip_net1.type[hard:type,pullup:type]
| | | ora.net1.network(ora.network.type)[hard,pullup]
| | | ora.gns<Resource not found>[weak:global]
| ora.ons(ora.ons.type)[weak:uniform]
| | ora.net1.network(ora.network.type)[hard,pullup]
--------------------------------------------------------------------------------
[oracle@seha2 ~]$ crsctl status resource ora.se1.se1_seha.svc -dependency
================================================================================
Resource Start Dependencies
================================================================================
                              ora.se1.se1_seha.svc
--------------------------------------------------------------------------------
ora.se1.se1_seha.svc(ora.service.type)->
| ora.se1.db(ora.database.type)[hard,pullup:always]
| | ora.DATA.dg(ora.diskgroup.type)[hard:global:uniform,pullup:global]
| | | ora.asm(ora.asm.type)[hard,pullup:always]
| | | | type:ora.asm_listener.type[hard:type,pullup:type]
| | | | | ora.asmnet1.asmnetwork(ora.asm_network.type)[hard,pullup]
| | | | ora.ASMNET1LSNR_ASM.lsnr(ora.asm_listener.type)[weak]
| | | | | ora.asmnet1.asmnetwork(ora.asm_network.type)[hard,pullup]
| | type:ora.listener.type[weak:type]
| | | type:ora.cluster_vip_net1.type[hard:type,pullup:type]
| | | | ora.net1.network(ora.network.type)[hard,pullup]
| | | | ora.gns<Resource not found>[weak:global]
| | ora.ons(ora.ons.type)[weak:uniform]
| | | ora.net1.network(ora.network.type)[hard,pullup]
| type:ora.cluster_vip_net1.type[hard:type,pullup:type]
| | ora.net1.network(ora.network.type)[hard,pullup]
| | ora.gns<Resource not found>[weak:global]
| type:ora.listener.type[weak:type]
| | type:ora.cluster_vip_net1.type[hard:type,pullup:type]
| | | ora.net1.network(ora.network.type)[hard,pullup]
| | | ora.gns<Resource not found>[weak:global]
| type:ora.service.type[dispersion:type]
--------------------------------------------------------------------------------

Look at the type:ora.cluster_vip_net1.type[hard:type,pullup:type] line that sits directly under the service in its dependency list. This is a hard dependency of the service on the virtual IP resource, one that the database doesn’t have directly (the database reaches the VIP only through its weak dependency on the listener). When the public network dies, the VIP goes down on this node, and because of this dependency, the service goes down as well. When the cluster tries to relocate the service and start it on seha2, it can’t, because the service depends on the database being up, and the database is up on seha1, not on seha2. This is the trigger that causes the database to relocate to seha2, and once the database is up, the service can start successfully too.
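To make the difference between the two dependency trees easier to see, the direct (first-level) dependencies can be pulled out of the -dependency output. This is a rough sketch under assumptions: direct_deps and has_direct_dep are hypothetical helper names, and the parsing relies on the tree layout shown above, where a first-level entry starts with a single "| " prefix and its attributes follow in square brackets.

```shell
# Hypothetical helpers (not part of Oracle Clusterware).
# Both read `crsctl status resource <name> -dependency` output from stdin.

# Print the first-level dependencies, one per line, without their attributes.
direct_deps() {
  awk '
    /^\| [^|]/ {            # exactly one "| " prefix means depth 1
      sub(/^\| /, "")       # drop the tree prefix
      sub(/\[.*$/, "")      # drop the [hard,pullup,...] attribute list
      print
    }
  ' | sort -u
}

# Exit 0 when $1 is a first-level dependency of the resource on stdin.
has_direct_dep() {
  direct_deps | grep -qxF -- "$1"
}
```

For example, crsctl status resource ora.se1.se1_seha.svc -dependency | has_direct_dep type:ora.cluster_vip_net1.type should succeed, while the same check against ora.se1.db should fail, since the database lists the VIP type only underneath the listener entry.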

One last thing: in the documentation (under 2.9.2), Oracle says that a service must be created in SEHA, but it doesn’t explain why or how it affects SEHA behavior.

This concludes this scenario as well. I wanted to thank Purav again for the comment and especially for the followup with the solution.
