
FS#5820 — rbx-s1/rbx-s2 ace

Attached to Project— Reseau Internet et Baies
Incident
Entire OVH Network
CLOSED
100%
We have an incident on the ACE on rbx-s1. We are looking for the cause of the problem.
Date:  Tuesday, 27 September 2011, 09:51AM
Reason for closing:  Done
Comment by OVH - Sunday, 25 September 2011, 00:58AM

We will probably be forced to reboot. We are preparing the card.


Comment by OVH - Sunday, 25 September 2011, 01:04AM

We are rebooting the card.

20w1d: SP: The PC in slot 2 is shutting down. Please wait ...
20w1d: SP: PC shutdown completed for module 2
Sep 25 00:07:45 GMT: %C6KPWR-SP-4-DISABLED: power to module in slot 2 set off (Reset)

20w1d: Processor 0 of module in slot 2 cannot service session requests.

20w2d: Processor 0 of module in slot 2 cannot service session requests.

20w2d: Processor 0 of module in slot 2 cannot service session requests.

20w2d: Processor 0 of module in slot 2 cannot service session requests.

Sep 25 00:13:03 GMT: %DIAG-SP-6-RUN_MINIMUM: Module 2: Running Minimal Diagnostics...
Sep 25 00:13:14 GMT: %DIAG-SP-6-DIAG_OK: Module 2: Passed Online Diagnostics
Sep 25 00:13:18 GMT: %OIR-SP-6-INSCARD: Card inserted in slot 2, interfaces are now online


Comment by OVH - Sunday, 25 September 2011, 01:14AM

Done. The card is up again.


Comment by OVH - Sunday, 25 September 2011, 03:07AM

The slave ACE card on s2, which had taken over the load from s1, has crashed.

Sep 25 01:38:28 GMT: %OIR-SP-3-PWRCYCLE: Card in module 2, is being power-cycled 'off (Reset - Module Reloaded During Download)'
Sep 25 01:38:29 GMT: %C6KPWR-SP-4-DISABLED: power to module in slot 2 set off (Reset - Module Reloaded During Download)
Sep 25 01:38:30 GMT: %DIAG-SP-3-TEST_FAIL: Module 2: TestAsicSync{ID=3} has failed. Error code = 0x76 (DIAG_QUERY_HYPERION_SYNC_ERROR)

The card came back up with this message giving the cause of the crash:
last boot reason: SB Wdog uspace big loadavg


Comment by OVH - Sunday, 25 September 2011, 03:11AM

Sep 25 02:03:20 GMT: %OIR-SP-3-PWRCYCLE: Card in module 2, is being power-cycled 'off (Reset - Module Reloaded During Download)'
Sep 25 02:03:20 GMT: %C6KPWR-SP-4-DISABLED: power to module in slot 2 set off (Reset - Module Reloaded During Download)
Sep 25 02:08:52 GMT: %DIAG-SP-6-RUN_MINIMUM: Module 2: Running Minimal Diagnostics...
Sep 25 02:09:05 GMT: %DIAG-SP-6-DIAG_OK: Module 2: Passed Online Diagnostics
Sep 25 02:09:08 GMT: %OIR-SP-6-INSCARD: Card inserted in slot 2, interfaces are now online

The card is up, with this reboot message:
last boot reason: SB Wdog uspace big loadavg
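
As a rough illustration, the time each card spent powered off can be read from the syslog lines quoted in this ticket, taking the C6KPWR "set off" line as the start and the OIR "interfaces are now online" line as the end. This is only a sketch: the helper names are hypothetical, and the year (2011) is an assumption taken from the ticket date because the log lines do not carry one.

```python
from datetime import datetime

def parse_ts(line: str) -> datetime:
    """Parse the leading 'Sep 25 00:07:45' stamp of a syslog line.

    The year is not present in the log; 2011 is assumed from the ticket date.
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"2011 {stamp}", "%Y %b %d %H:%M:%S")

def downtime_seconds(off_line: str, on_line: str) -> int:
    """Seconds between the power-off line and the back-online line."""
    return int((parse_ts(on_line) - parse_ts(off_line)).total_seconds())

# Manual reboot of the card (lines quoted in the 01:04AM comment).
first = downtime_seconds(
    "Sep 25 00:07:45 GMT: %C6KPWR-SP-4-DISABLED: power to module in slot 2 set off (Reset)",
    "Sep 25 00:13:18 GMT: %OIR-SP-6-INSCARD: Card inserted in slot 2, interfaces are now online",
)

# Power cycle quoted in the 03:11AM comment.
second = downtime_seconds(
    "Sep 25 02:03:20 GMT: %C6KPWR-SP-4-DISABLED: power to module in slot 2 set off (Reset - Module Reloaded During Download)",
    "Sep 25 02:09:08 GMT: %OIR-SP-6-INSCARD: Card inserted in slot 2, interfaces are now online",
)

print(f"first reboot: {first} s, second reboot: {second} s")
```

Under these assumptions each reload takes between five and six minutes from power-off to interfaces online.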


Comment by OVH - Sunday, 25 September 2011, 03:40AM

If we read into the error message, it would mean that
because of a customer (uspace) there is a heavy load
(big loadavg), and that as a result the watchdog
(ft, fault tolerance) triggers the failover from the
master card to the slave card. In other words: "just in
case, I am failing over to the slave card because I have
decided the master is not in good shape."
No idea whether this is true. We will see what the TAC replies.

We changed the ft values from

heartbeat interval 300
heartbeat count 20

to

heartbeat interval 1000
heartbeat count 50

We will first see whether it is more stable this way.
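
To give an idea of what this change means, here is a minimal sketch of the arithmetic. It assumes that `heartbeat interval` is expressed in milliseconds and that the peer is declared down once `heartbeat count` consecutive heartbeats have been missed; the function name is hypothetical and the exact ACE behaviour should be checked against the Cisco documentation.

```python
def detection_time_seconds(interval_ms: int, count: int) -> float:
    """Time before the peer is declared down: interval * count, in seconds.

    Assumes the interval is in milliseconds and that `count` consecutive
    missed heartbeats are required before a failover is triggered.
    """
    return interval_ms * count / 1000.0

old = detection_time_seconds(300, 20)    # values before the change
new = detection_time_seconds(1000, 50)   # values after the change

print(f"before: {old:.0f} s, after: {new:.0f} s")
```

Under these assumptions the card tolerates about 50 seconds of missed heartbeats instead of about 6, so a short load peak is less likely to trigger a failover, at the cost of slower detection of a real failure.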


Comment by OVH - Sunday, 25 September 2011, 03:41AM

And why do we only have the problem at night? Is one customer
hammering it?

s2/ace is master:

rbx-s2-ace/Admin# sh proc cpu

CPU utilization for five seconds: 68%; one minute: 66%; five minutes: 63%

s1/ace is currently slave

rbx-s1-ace/Admin# sh proc cpu

CPU utilization for five seconds: 31%; one minute: 34%; five minutes: 33%


Comment by OVH - Sunday, 25 September 2011, 04:07AM

If the situation is not stable, we will add a limit of
4 simultaneous connections for administering the ACE.
Some customers use 50 or 100 accesses!?
They are probably the cause of the problem.


Comment by OVH - Sunday, 25 September 2011, 04:12AM

We have applied it to some contexts of some customers.


Comment by OVH - Sunday, 25 September 2011, 12:03PM

rbx-s2-ace/Admin# sh proc cpu

CPU utilization for five seconds: 3%; one minute: 4%; five minutes: 5%
rbx-s1-ace/Admin# sh proc cpu

CPU utilization for five seconds: 10%; one minute: 12%; five minutes: 13%

That is much better.
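
For reference, the `sh proc cpu` lines pasted in this ticket can be compared with a short script. This is a sketch only: the regular expression is written against the exact line format shown here, and the function and dictionary names are hypothetical.

```python
import re

# Matches the summary line printed by `sh proc cpu` as quoted in this ticket.
CPU_RE = re.compile(
    r"CPU utilization for five seconds: (\d+)%; "
    r"one minute: (\d+)%; five minutes: (\d+)%"
)

def parse_cpu(line: str) -> dict:
    """Return the three CPU percentages from one `sh proc cpu` summary line."""
    m = CPU_RE.search(line)
    if m is None:
        raise ValueError(f"unrecognised line: {line!r}")
    five_sec, one_min, five_min = map(int, m.groups())
    return {"5s": five_sec, "1m": one_min, "5m": five_min}

# Samples copied from the 03:41AM and 12:03PM comments.
night = {
    "rbx-s2-ace": parse_cpu("CPU utilization for five seconds: 68%; one minute: 66%; five minutes: 63%"),
    "rbx-s1-ace": parse_cpu("CPU utilization for five seconds: 31%; one minute: 34%; five minutes: 33%"),
}
noon = {
    "rbx-s2-ace": parse_cpu("CPU utilization for five seconds: 3%; one minute: 4%; five minutes: 5%"),
    "rbx-s1-ace": parse_cpu("CPU utilization for five seconds: 10%; one minute: 12%; five minutes: 13%"),
}

# Drop in the five-minute average, in percentage points, per card.
drop = {card: night[card]["5m"] - noon[card]["5m"] for card in night}
print(drop)
```

On these samples the five-minute average falls by 58 points on rbx-s2-ace and by 20 points on rbx-s1-ace between the two measurements.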