-
Notifications
You must be signed in to change notification settings - Fork 23
Expand file tree
/
Copy pathsusChkSrv.py.7
More file actions
488 lines (488 loc) · 15.3 KB
/
susChkSrv.py.7
File metadata and controls
488 lines (488 loc) · 15.3 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
.\" Version: 1.2
.\"
.TH susChkSrv.py 7 "27 Oct 2025" "" "SAPHanaSR"
.\"
.SH NAME
.\"
susChkSrv.py \- Provider for SAP HANA srHook method srServiceStateChanged().
.PP
.\"
.SH DESCRIPTION
.\"
susChkSrv.py can be used to provide a script for the SAP HANA srHook method
srServiceStateChanged().
.PP
The SAP HANA nameserver provides a Python-based API ("HA/DR providers"), which
is called at important points of the host auto-failover and system replication
takeover processes. These so called hooks can be used for arbitrary operations
that need to be executed. The method srServiceStateChanged() is called when
HANA processes are failing, starting or stopping.
.PP
Purpose of susChkSrv.py is to detect failing HANA indexserver processes and
trigger a fast takeover to the secondary site. With regular configuration of an
HANA database, the resource agent (RA) for HANA in a Linux cluster does not
trigger a takeover to the secondary site when:
.br
- A software failure causes one or more HANA processes to be restarted in place
by the HANA daemon (hdbdaemon).
.br
- A hardware error (e.g. SIGBUS from an uncorrectable memory error) causes the
indexserver to restart locally.
.br
See also SAPHanaSR(7) or SAPHanaSR-ScaleOut(7).
.PP
The hook script susChkSrv.py is called on any srServiceStateChanged() event.
The script checks for
'isIndexserver and serviceRestart and serviceWasActiveBefore and hostActive and databaseActive'.
If it finds the correct entries, it executes the predefined action. As soon as
the HANA landscapeHostConfiguration status changes to 1, the Linux cluster will
take action. The action depends on HANA system replication status and the RA´s
configuration parameters PREFER_SITE_TAKEOVER, AUTOMATED_REGISTER and ON_FAIL_ACTION. See manual
page ocf_suse_SAPHana(7) or ocf_suse_SAPHanaController(7).
.PP
Customising of HANA daemon timeout parameters might be needed for adapting the
solution to a given environment. Please refer to SAP HANA documentation.
.PP
This hook script needs to be installed, configured and activated on all HANA
nodes.
.PP
.\"
.SH SUPPORTED PARAMETERS
.\"
* The "HA/DR providers" API accepts the following parameters for the
ha_dr_provider_suschksrv section in global.ini:
.TP
\fB[ha_dr_provider_suschksrv]\fP
.TP
\fBprovider = susChkSrv\fP
Mandatory. Must not be changed.
.TP
\fBpath = /usr/share/SAPHanaSR-angi\fP
Mandatory. Delivered within RPM package. Please change only if requested.
.TP
\fBexecution_order = [ \fIINTEGER\fB ]\fP
Mandatory. Order might depend on other hook scripts.
.TP
\fBaction_on_lost = [ ignore | stop | kill | fence ]\fP
.\" TODO \fBaction_on_lost = [ ignore | stop | kill | fence | suicide ]\fP
Action to be processed when a lost indexserver is identified.
.br
- \fBignore\fP: do nothing, just write to tracefiles.
.br
- \fBstop\fP: do 'sapcontrol ... StopSystem'.
If this is combined with SAPHanaController RA parameter 'AUTOMATED_REGISTER=true',
HANA needs to release all OS resources prior to the automated registering. See
also manual page ocf_suse_SAPHanaController(7).
.br
- \fBkill\fP: do 'HDB kill-<\fIsignal\fR>'. The signal can be defined by parameter 'kill_signal'.
This action is primarily meant for scale-up, and scale-out with host auto-failover
(aka BW).
.br
On scale-out without host auto-failover (aka ERP), it depends on runtime effects,
whether HANA will try a restart.
.br
If this is combined with SAPHanaController RA parameter 'AUTOMATED_REGISTER=true', HANA
needs to release all OS resources prior to the automated registering.
.br
- \fBfence\fP: do 'crm node fence <\fIhost\fR>'. This needs a Linux cluster
STONITH method and sudo permission.
For scale-out, SAPHanaSR-alert-fencing needs to be configured additionally, see
manual page SAPHanaSR-alert-fencing(8) for details.
.br
.\" TODO - suicide: do 'systemctl reboot'. Do NOT use this!
.\" .br
Optional. Default is ignore.
.TP
\fBkill_signal = [ \fIINTEGER\fB ]\fP
Signal to be used with 'HDB kill-<\fIsignal\fR>'.
.br
Optional. Default is 9.
.\" TODO
.\" .TP
.\" \fBignore_srhook = [ yes | no ]\fP
.\" Initiate takeover even if HANA system replication (srHook) is not in sync.
.\" .br
.\" Advanced. Default is no. Please use only if requested.
.\" .TP
.\" \fBmonitor_services = [ <service>,<service>,... ]\fP
.\" HANA services (processes) to look at.
.\" Represented by dictionary entry "service_name".
.\" .br
.\" Optional. Default is service "indexserver".
.\" .TP
.\" \fBmonitor_tenants = [ <tenant>,<tenant>,... ]\fP
.\" HANA tenants to look at.
.\" Represented by dictionary entry "database".
.\" .br
.\" Optional. Default is tenant TODO.
.TP
\fBstop_timeout = [ \fIINTEGER\fB ]\fP
How many seconds to wait for 'sapcontrol ... StopSystem' to return.
Should be greater than value of HANA parameter 'forcedterminationtimeout'.
See also SAPHanaSR_basic_cluster(7).
.br
Optional. Default is 20 seconds.
.TP
* The "HA/DR providers" API accepts the following parameter for the trace section in global.ini:
.TP
\fB[trace]\fP
.TP
\fBha_dr_suschksrv = [ info | debug ]\fP
Optional. Default is info. Will be added automatically if not set.
.PP
* The HANA daemon TODO for the daemon section of daemon.ini:
.\" TODO check the below values with SAP
.PP
\fB[daemon]\fP
.TP
\fBterminationtimeout = [ \fIINTEGER\fB ]\fP
.br
See also SAPHanaSR_basic_cluster(7).
Optional. Timeout in milliseconds. Default is 30000.
.TP
\fBforcedterminationtimeout = [ \fIINTEGER\fB ]\fP
.br
See also SAPHanaSR_basic_cluster(7).
Optional. Timeout in milliseconds. Default is 270000.
.PP
* The HANA daemon TODO for the indexserver.<tenant> section of daemon.ini:
.\" TODO check the below values with cloud partner
.TP
\fB[indexserver.<\fItenant\fR>]\fP
.TP
\fBgracetime = [ \fIINTEGER\fB ]\fP
TODO Should be 6000.
.br
Optional. Timeout in milliseconds. Default is 2000.
.PP
.\"
.SH RETURN CODES
.\"
.B 0
Successful program execution.
.br
.B >0
Usage, syntax or execution errors.
.PP
.\"
.SH EXAMPLES
.\"
\fB*\fP Example for minimal entry in SAP HANA scale-up global configuration
/hana/shared/$SID/global/hdb/custom/config/global.ini
.PP
In case of a failing indexserver, the event is logged. No action is performed.
The section ha_dr_provider_suschksrv is needed on all HANA nodes.
The HANA has to be stopped before the file can be changed.
.PP
.RS 2
[ha_dr_provider_suschksrv]
.br
provider = susChkSrv
.br
path = /usr/share/SAPHanaSR-angi/
.br
execution_order = 3
.RE
.PP
\fB*\fP Example for entry in SAP HANA scale-up global configuration
/hana/shared/HA1/global/hdb/custom/config/global.ini
.PP
In case of a failing indexserver, the complete node will get fenced by
calling 'crm node fence ...'.
Unlike fence actions from inside the Linux HA cluster, this hook script fence
action will be issued even when the cluster is in maintenance and stonith is
temporarily disabled. The fence request then is queued until the cluster is set
back into active state. See example below for removing queued fence actions.
.br
The section ha_dr_provider_suschksrv is needed on all HANA nodes.
The HANA has to be stopped before the file can be changed.
Alterntively use SAPHanaSR-manageProvider for applying an HA/DR provider hook
configuration, see manual page SAPHanaSR-manageProvider(8).
.PP
.RS 2
[ha_dr_provider_suschksrv]
.br
provider = susChkSrv
.br
path = /usr/share/SAPHanaSR-angi/
.br
execution_order = 2
.br
action_on_lost = fence
.RE
.PP
\fB*\fP Example for entry in SAP HANA scale-out global configuration
/hana/shared/HA1/global/hdb/custom/config/global.ini
.PP
In case of a failing indexserver, the HANA instance will be stopped with
'sacpcontrol ... StopSystem'. HANA timeouts might be adapted to speed up the
stop.
If this is combined with SAPHanaController parameter 'AUTOMATED_REGISTER=true',
HANA needs to release all OS resources prior to the automated registering.
.\" TODO This action is recommended for scale-out. ?
.br
The hook script should wait maximum 25 seconds on the sapcontrol command to
return.
.br
The section ha_dr_provider_suschksrv is needed on all HANA nodes.
The HANA has to be stopped before the file can be changed.
.br
Note: HANA scale-out is supported only with exactly one master nameserver.
No HANA host auto-failover.
.PP
.RS 2
[ha_dr_provider_suschksrv]
.br
provider = susChkSrv
.br
path = /usr/share/SAPHanaSR-angi/
.br
execution_order = 2
.br
action_on_lost = stop
.br
stop_timeout = 25
.RE
.PP
\fB*\fP Example for entry in SAP HANA daemon configuration
/hana/shared/HA1/global/hdb/custom/config/daemon.ini
.PP
TODO
Example SID is HA1, tenant is HA1.
.br
The sections daemon and indexserver.HA1 are needed on all HANA nodes.
The HANA has to be stopped before the file can be changed.
Please refer to SAP documentation before setting this parameters.
.PP
.RS 2
[daemon]
.br
terminationtimeout = 45000
.br
forcedterminationtimeout = 15000
.PP
[indexserver.HA1]
.br
gracetime = 6000
.RE
.PP
\fB*\fP Example for sudo permissions in /etc/sudoers.d/SAPHanaSR .
.PP
SID is HA1. See also manual page SAPHanaSR-hookHelper(8).
.PP
.RS 2
# SAPHanaSR needs for susChkSrv
.br
ha1adm ALL=(ALL) NOPASSWD: /usr/bin/SAPHanaSR-hookHelper --sid=HA1 --case=fenceMe
.RE
.PP
\fB*\fP Example for looking up the sudo permission for the hook script.
.PP
All related files (/etc/sudoers and /etc/sudoers.d/*) are scanned.
Example SID is HA1.
.PP
.RS 2
# sudo -U ha1adm -l | grep "NOPASSWD.*/usr/bin/SAPHanaSR-hookHelper"
.RE
.PP
\fB*\fP Example for checking the HANA tracefiles for srServiceStateChanged() events.
.PP
Example SID is HA1. To be executed on the respective HANA master nameserver.
.br
If the HANA nameserver process is killed, in some cases hook script actions do
not make it into the nameserver tracefile. In such cases the hook script´s own
tracefile might help, see respective example.
.PP
.RS 2
# su - ha1adm
.br
~> cdtrace
.br
~> grep susChkSrv.*srServiceStateChanged nameserver_*.trc
.br
~> grep -C2 Executed.*StopSystem nameserver_*.trc
.RE
.PP
\fB*\fP Example for checking the HANA tracefiles for when the hook script has been loaded.
.PP
Example SID is HA1. To be executed on both sites' master nameservers.
.PP
.RS 2
# su - ha1adm
.br
~> cdtrace
.br
~> grep HADR.*load.*susChkSrv nameserver_*.trc
.br
~> grep susChkSrv.init nameserver_*.trc
.RE
.PP
\fB*\fP Example for checking the hook script tracefile for actions.
.PP
Example SID is HA1. To be executed on all nodes. All incidents are logged on
the node where it happens.
.PP
.RS 2
# su - ha1adm
.br
~> cdtrace
.br
~> egrep '(LOST:|STOP:|START:|DOWN:|init|load|fail)' nameserver_suschksrv.trc
.RE
.PP
\fB*\fP Example for checking the hook script tracefile for node fence actions.
.PP
Example SID is HA1. To be executed on both sites' master nameservers. See also
manual page SAPHanaSR-hookHelper(8).
.PP
.RS 2
# su - ha1adm
.br
~> cdtrace
.br
~> grep fence.node nameserver_suschksrv.trc
.RE
.PP
\fB*\fP Example for revoking a queued fence request from the Linux cluster.
.PP
This could be done if an HANA indexserver failure has triggerd an node fence
action while the Linux cluster is in maintenance. Before revoking a fence request,
be sure it has been issued by the HA/DR provider hook script. See example above
for checking the hook script tracefile for node fence actions.
Example node is node2. To be executed on that node.
See also manual pages SAPHanaSR-hookHelper(8) and crm_attribute(8).
.br
Note: This removes the node attribute terminate=true from the Linux cluster CIB.
It does not touch any fencing device.
.PP
.RS 2
# grep fenced:.termination.was.requested /var/log/pacemaker/pacemaker.log
.br
# crm_attribute -t status -N 'node2' -D -n terminate
.br
# crm_attribute -t status -N 'node2' -G -n terminate
.RE
.PP
\fB*\fR Example for killing HANA hdbindexserver process.
.PP
This could be done for testing the HA/DR provider hook script integration.
Killing HANA processes is dangerous. This test should not be done
on production systems.
Please refer to SAP HANA documentation. See also manual page killall(1).
.br
Note: Understand the impact before trying.
.PP
1. Check HANA and Linux cluster for clean idle state.
.PP
2. On secondary master name server, kill the hdbindexserver process.
.RS 2
# killall -9 hdbindexserver
.RE
.PP
3. Check the nameserver tracefile for srServiceStateChanged() events.
.PP
4. Check HANA and Linux cluster for clean idle state.
.RE
.PP
.\"
.SH FILES
.\"
.TP
/usr/share/SAPHanaSR-angi/susChkSrv.py
the hook provider, delivered with the RPM
.TP
/usr/bin/SAPHanaSR-hookHelper
the external script for node fencing
.TP
/etc/sudoers, /etc/sudoers.d/*
the sudo permissions configuration
.TP
/hana/shared/$SID/global/hdb/custom/config/global.ini
the on-disk representation of HANA global system configuration
.TP
/hana/shared/$SID/global/hdb/custom/config/daemon.ini
the on-disk representation of HANA daemon configuration
.TP
/usr/sap/$SID/HDB$nr/$HOST/trace
path to HANA tracefiles
.TP
/usr/sap/$SID/HDB$nr/$HOST/trace/nameserver_suschksrv.trc
HADR provider hook script tracefile
.PP
.\"
.SH REQUIREMENTS
.\"
1. SAP HANA 2.0 SPS05 or later provides the HA/DR provider hook method
srServiceStateChanged() with needed parameters.
.PP
2. No other HADR provider hook script should be configured for the
srServiceStateChanged() method. Hook scripts for other methods, provided in
SAPHanaSR and SAPHanaSR-ScaleOut, can be used in parallel to susChkSrv.py, if
not documented contradictingly.
.PP
3. The user ${sid}adm needs execution permission as user root for the command
SAPHanaSR-hookHelper.
.PP
4. The hook provider needs to be added to the HANA global configuration, in
memory and on disk (in persistence).
.PP
5. HANA daemon timeout TODO
.PP
6. The hook script runs in the HANA nameserver. It runs on the node where the event
srServiceStateChanged() occurs.
.PP
7. HANA scale-out is supported only with exactly one master nameserver. HANA
host auto-failover is not supported. Thus no standby nodes.
.PP
8. A Linux cluster STONITH method for all nodes is needed, particularly if
susChkSrv.py parameter 'action_on_lost=fence' is set.
.PP
9. If susChkSrv.py parameter 'action_on_lost=stop' is set and the RA SAPHana or
SAPHanaController parameter 'AUTOMATED_REGISTER=true' is set, it depends on HANA
to release all OS resources prior to the registering attempt.
.PP
10. For HANA scale-out, the susChkSrv.py parameter 'action_on_lost=fence' should
be used only, if the SAPHanaSR-alert-fencing is configured.
.PP
11. If the hook provider should be pre-compiled, the particular Python version
that comes with SAP HANA has to be used.
.\"
.SH BUGS
.\"
In case of any problem, please use your favourite SAP support process to open
a request for the component BC-OP-LNX-SUSE.
Please report any other feedback and suggestions to [email protected].
.PP
.\"
.SH SEE ALSO
.\"
\fBSAPHanaSR\fP(7) , \fBSAPHanaSR-ScaleOut\fP(7) , \fBSAPHanaSR.py\fP(7) ,
\fBocf_suse_SAPHanaTopology\fP(7) , \fBocf_suse_SAPHanaController\fP(7) ,
\fBSAPHanaSR-hookHelper\fP(8) , \fBSAPHanaSR-manageProvider\fP(8) ,
\fBSAPHanaSR-alert-fencing\fP(8) ,
\fBcrm\fP(8) , \fBcrm_attribute\fP(8) ,
\fBpython3\fP(8) , \fBkillall\fP(1) ,
.br
https://help.sap.com/docs/SAP_HANA_PLATFORM?locale=en-US
.br
https://help.sap.com/docs/SAP_HANA_PLATFORM/42668af650f84f9384a3337bcd373692/e2064c4aa47f443ab6a107f9ab7f5edd.html?version=2.0.01
.br
https://help.sap.com/docs/SAP_HANA_PLATFORM/6b94445c94ae495c83a19646e7c3fd56/5df2e766549a405e95de4c5d7f2efc2d.html?locale=en-US
.br
SAP note 2177064
.PP
.\"
.SH AUTHORS
.\"
A.Briel, F.Herschel, L.Pinne.
.PP
.\"
.SH COPYRIGHT
.\"
(c) 2022-2026 SUSE LLC
.br
susChkSrv.py comes with ABSOLUTELY NO WARRANTY.
.br
For details see the GNU General Public License at
http://www.gnu.org/licenses/gpl.html
.\"