Event Scripting, or RIND (RIND is not dependencies) is a way for administrators to trigger service transitions based on a number of things which occur during cluster operation. Programmatically, it makes most of the business logic of RGManager external - and therefore, customizable. This gives administrators the flexibility to create specific failover policies which are not handled by rgmanager out of the box.
The Language Itself
Event scripts are written in a language called S-Lang; documentation specifics about the language are available at http://www.s-lang.org
Notes
There is no SLang-to-C glue written to support migration of virtual machines. Do not use RIND with virtual machines if you also intend to use live or pause migration. You may use RIND with virtual machines if you make the virtual machine a part of a service within cluster.conf. For example:
<service name="vm1"> <vm name="vm1"/> </service>
Basic configuration specification
<rm>
<events>
<event class="node"/> <!-- all node events -->
<event class="node"
node="bar"/> <!-- events concerning 'bar' -->
<event class="node"
node="foo"
node_state="up"/> <!-- 'up' events for 'foo' -->
<event class="node"
node_id="3"
node_state="down"/> <!-- 'down' events for node ID 3 -->
(note, all service ops and such deal with node ID, not
with node names)
<event class="service"/> <!-- all service events-->
<event class="service"
service="A"/> <!-- events concerning 'A' -->
<event class="service"
service="B"
service_state="started"/> <!-- when 'B' is started... -->
<event class="service"
service="B"
service_state="started"/>
service_owner="3"/> <!-- when 'B' is started on node 3... -->
<event class="service"
priority="1"
service_state="started"/>
service_owner="3"/> <!-- when 'B' is started on node 3, do this
before the other event handlers ... -->
</events>
...
</rm>
General globals available from all scripts
node_self - local node ID
event_type - event class, one of:
EVENT_NONE - unspecified / unknown
EVENT_NODE - node transition
EVENT_SERVICE - service transition
EVENT_USER - a user-generated request
EVENT_CONFIG - cluster.conf change. At this time, one may not trigger events scripts off of this.
Node event globals (i.e. when event_type == EVENT_NODE)
node_id - node ID which is transitioning
node_name - name of node which is transitioning
node_state - new node state
NODE_ONLINE or 1 - the node is running rgmanager
NODE_OFFLINE or 0 - the node is not running rgmanager
node_clean - 0 if the node has not been fenced, 1 if the node has been fenced. Deprecated - in STABLE3, service operations are blocked on fencing, so node_clean can be ignored.
Service event globals (i.e. when event_type == EVENT_SERVICE)
service_name - Name of service which transitioned
service_state - new state of service
service_owner - new owner of service (or a value <0 if service is no longer running)
service_last_owner - Last owner of service if known. Used for when service_state = "recovering" generally, in order to apply restart/relocate/disable policy.
User event globals (i.e. when event_type == EVENT_USER)
service_name - service to perform request upon
user_target - target node ID if applicable
user_request - request to perform. The default user request handlers are defined below.
USER_ENABLE - Start the disabled or stopped service named service_name on the optionally-specified user_target.
USER_DISABLE - Stop service_name service and place it in to the disabled state. This may also be used to instruct rgmanager that a failed service has been successfully cleaned up by rgmanager. Disabled services may not be started from event scripts.
USER_STOP - Stop service_name, placing it in to the stopped state.
USER_RELOCATE - Move service_name to another node in the cluster on the optionally-specified user_target.
USER_RESTART - Restart service_name on the node where it currently lies.
USER_FREEZE - Freeze service_name in place, preventing service operations and status checks.
USER_UNFREEZE - Unfreeze service_name, allowing service operations and status checks again.
USER_MIGRATE - Migrate service_name - for virtual machines TBD
Scripting functions - Informational
node_list = nodes_online();
- Returns a list of all online nodes.
service_list = service_list();
- Returns a list of all configured services.
(restarts_exceeded, restarts, last_owner, owner, state) = service_status( service_name );
Returns the state, owner, last_owner, restarts, and whether the restart threshold has been exceeded. Note that all return values are optional, but you must place commas showing separation between each element. For example:
(,,,, state) = service_status(service_name); (, restarts,,,) = service_status(service_name);
(nofailback, restricted, ordered, node_list) = service_domain_info( service_name );
- Returns the failover domain specification, if it exists, for the specified service name. The node list returned is an ordered list according to priority levels. In the case of unordered domains, the ordering of the returned list is pseudo-random.
value = service_property( service_name, property_name );
Returns the value for a given service property from cluster.conf. For example, the following obtains the failover domain value for myservice:
domain = service_property("myservice", "domain");
Scripting functions - Operational
err = service_start( service_name, node_list, (avoid_list) );
- Start a non-running, (but runnable, i.e. not failed) service on the first node in node_list. Failing that, start it on the second node in node_list and so forth. One may also specify an avoid list, but it's better to just use the subtract() function below.
err = service_stop( service_name, (0 = stop, 1 = disable) );
- Stop a running service. The second parameter is optional, and if non-zero is specified, the service will enter the disabled state.
err = service_migrate( service_name, target_node_id );
TBD Migrate a running virtual machine. Both arguments are required.
Utility functions - Node list manipulation
node_list = union( left_node_list, right_node_list );
Calculates the union between the two node lists, removing duplicates and preserving ordering according to left_node_list. Any added values from right_node_list will appear in their order, but after left_node_list in the returned list. For example: union([1 2 3 5], [3 2 1 4]) returns [1 2 3 5 4]
node_list = intersection( left_node_list, right_node_list );
Calculates the intersection (items in both lists) between the two node lists, removing duplicates and preserving ordering according to left_node_list. For example: intersection([1 2 3 5], [3 2 1 4]) returns [1 2 3]
node_list = delta( left_node_list, right_node_list );
Calculates the delta (items not in both lists) between the two node lists, removing duplicates and preserving ordering according to left_node_list. Any added values from right_node_list will appear in their order, but after left_node_list in the returned list. Any added values from right_node_list will appear in their order, but after left_node_list in the returned list. For example: delta([1 2 3 5], [3 2 1 4]) returns [5 4]
node_list = subtract( left_node_list, right_node_list );
Removes items specified in right_node_list from left_node_list. For example: subtract([1 2 3 5], [3 2 1 4]) returns [5]. Example:
all_nodes = nodes_online(); allowed_nodes = subtract(nodes_online, node_to_avoid);
stop_processing();
- Calling this function will prevent further event scripts from being executed on a particular event. Call this script if, for example, you do not wish for the default event handler to process the event. Note: This does NOT terminate the caller script; that is, the script being executed will run to completion.
Utility functions - Logging
Log levels correspond to the syslog.conf(5) manual page. Resource scripts inherit rgmanager's log filtering level, log facility, and log file (if specified).
debug( item1, item2, ... );
- LOG_DEBUG level
info( item1, item2, ... );
- LOG_INFO level
notice( item1, item2, ... );
- LOG_NOTICE level
warning( item1, item2, ... );
- LOG_WARNING level
err( item1, item2, ... );
- LOG_ERR level
crit( item1, item2, ... );
- LOG_CRIT level
alert( item1, item2, ... );
- LOG_ALERT level
emerg( item1, item2, ... );
- LOG_EMERG level
Notes:
items - These can be strings, integer lists, or integers. Logging string lists (e.g. from service_list() is not supported.
level - the level is consistent with syslog(8)
Error Conditions
SUCCESS - Operation succeeded
FAIL - General failure; see logs.
ERR_ABORT - Service is in failed state; operation aborted
ERR_INVALID - Operation invalid for resource. For example, relocating a disabled service.
ERR_DEPEND - Operation violates a dependency rule. For example, trying to move an exclusive service to a node where a service is running.
ERR_DOMAIN - Operation violates a failover domain constraint. For example, trying to move a service outside of its restricted failover domain.
ERR_RUNNING - Service is already running. For example, enable a service when it is not stopped.
Example 1: creating a follows-but-avoid-after-start behavior
%
% If the main queue server and replication queue server are on the same
% node, relocate the replication server somewhere else if possible.
%
define my_sap_event_trigger()
{
variable state, owner_rep, owner_main;
variable nodes, allowed;
%
% If this was a service event, don't execute the default event
% script trigger after this script completes.
%
if (event_type == EVENT_SERVICE) {
stop_processing();
}
(,,, owner_main, state) = service_status("service:main_queue");
(,,, owner_rep, state) = service_status("service:replication_server");
if ((event_type == EVENT_NODE) and (owner_main == node_id) and
(node_state == NODE_OFFLINE) and (owner_rep >= 0)) {
%
% uh oh, the owner of the main server died. Restart it
% on the node running the replication server
%
notice("Starting Main Queue Server on node ", owner_rep);
()=service_start("service:main_queue", owner_rep);
return;
}
%
% S-Lang doesn't short-circuit prior to 2.1.0
%
if ((owner_main >= 0) and
((owner_main == owner_rep) or (owner_rep < 0))) {
%
% Get all online nodes
%
nodes = nodes_online();
%
% Drop out the owner of the main server
%
allowed = subtract(nodes, owner_main);
if ((owner_rep >= 0) and (length(allowed) == 0)) {
%
% Only one node is online and the rep server is
% already running. Don't do anything else.
%
return;
}
if ((length(allowed) == 0) and (owner_rep < 0)) {
%
% Only node online is the owner ... go ahead
% and start it, even though it doesn't increase
% availability to do so.
%
allowed = owner_main;
}
%
% Move the replication server off the node that is
% running the main server if a node's available.
%
if (owner_rep >= 0) {
()=service_stop("service:replication_server");
}
()=service_start("service:replication_server", allowed);
}
return;
}
my_sap_event_trigger();Relevant <rm> section from cluster.conf:
<rm central_processing="1">
<events>
<event name="main-start" class="service"
service="service:main_queue"
service_state="started"
file="/tmp/sap.sl"/>
<event name="rep-start" class="service"
service="service:replication_server"
service_state="started"
file="/tmp/sap.sl"/>
<event name="node-up" node_state="up"
class="node"
file="/tmp/sap.sl"/>
</events>
<failoverdomains>
<failoverdomain name="all" ordered="1" restricted="1">
<failoverdomainnode name="molly"
priority="2"/>
<failoverdomainnode name="frederick"
priority="1"/>
</failoverdomain>
</failoverdomains>
<resources/>
<service name="main_queue"/>
<service name="replication_server" autostart="0"/>
<!-- replication server is started when main-server start
event completes -->
</rm>