OVHcloud Private Cloud Status

Network update
Scheduled Maintenance Report for Hosted Private Cloud
Completed
We are going to update all of the pCC switches.
Normally no outage is expected, since the switches
are updated using ISSU (with no service
interruption). However, we have already had
crashes at this level. If that happens again,
everything will fail over to the second network.

Update(s):

Date: 2011-08-05 20:20:28 UTC
We are stopping the work here for today. Enough
excitement for one short day :(

Date: 2011-08-05 19:52:23 UTC
After discussions with TAC and a few dumps on the network,
it appears that certain packets have a surprising effect on
the N5s in version (3).Nx.x. These are spanning-tree packets
with source MAC 0100.0ccc.cccd that show up on the network
from an unknown origin (probably customers are sending them
to us). This is a malformed packet that does not exist in a
perfect world: packets may have 0100.0ccc.cccd as their
destination, but not as their source. The packet therefore
ends up at the CPU.
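
To illustrate why such a frame counts as malformed: in an
Ethernet MAC address, the lowest bit of the first octet marks a
group (multicast) address, and a group address is only meaningful
as a destination. The Python sketch below checks that bit. The
function names are ours, and the second address is a made-up
unicast example, not one seen on the network.

```python
def parse_mac(mac: str) -> bytes:
    """Parse a MAC such as '0100.0ccc.cccd' into its 6 raw bytes."""
    hex_digits = mac.replace(".", "").replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return bytes.fromhex(hex_digits)


def is_group_address(mac: str) -> bool:
    """True if the I/G bit (lowest bit of the first octet) is set."""
    return bool(parse_mac(mac)[0] & 0x01)


def is_valid_source(mac: str) -> bool:
    """A source address must be an individual (unicast) address."""
    return not is_group_address(mac)


print(is_valid_source("0100.0ccc.cccd"))  # False: group address as source
print(is_valid_source("0023.04ee.be01"))  # True: hypothetical unicast MAC
```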

The first idea was to apply a MAC access list to filter
these packets.

pcc-12b-n5# sh mac access-lists

MAC access list test
10 deny 0100.0ccc.cccd ffff.ffff.ffff any
20 permit any any

That does not work. The CPU is still at 100%.

We were then asked to enable spanning tree to
see whether the spanning-tree processes could
handle these packets instead of the CPU.

We enabled spanning tree, but when we enable
the ports we hit a new limit on the number of
spanning-tree instances per port and per VLAN.
We set up MST spanning tree, which reduces the
number of instances, but that changes nothing.
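
To give a feel for how quickly that kind of limit is reached, here
is a minimal Python sketch that counts (VLAN, port) pairs. The port
names and VLAN counts are made up for the example and are not the
real pcc-12b configuration; only the limit of 14500 comes from the
switch's own log messages in this report.

```python
MST_RECOMMENDED_LIMIT = 14500  # value quoted by the switch in its logs


def vlan_port_instances(vlans_per_port: dict[str, int]) -> int:
    """Sum, over all ports, the number of VLANs each port carries."""
    return sum(vlans_per_port.values())


# Hypothetical layout: 40 host ports, each trunking 500 VLANs.
ports = {f"Eth100/1/{n}": 500 for n in range(1, 41)}
total = vlan_port_instances(ports)
print(total, total > MST_RECOMMENDED_LIMIT)  # 20000 True
```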

So we forced the test by enabling all the ports
and nervously watching the log messages that
scrolled past on our consoles.

2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (73600) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (73700) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (73800) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (73900) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74000) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74100) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74200) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74300) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74400) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74500) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74600) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74700) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74800) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74900) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (75000) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (75100) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (75200) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (75300) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (75400) exceeded [MST mode] recommended limit of 14500
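
For anyone wanting to pull the numbers out of messages like the
ones above, a small Python sketch follows. The regular expression
is written against the message text shown here and is not taken
from any Cisco tooling; the function and variable names are ours.

```python
import re

# Matches the message whether or not it is wrapped before "limit of".
LIMIT_EXCEEDED = re.compile(
    r"%STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances "
    r"\((\d+)\) exceeded \[(\w+) mode\] recommended\s+limit of (\d+)"
)


def parse_limit_events(log_text: str) -> list[tuple[int, str, int]]:
    """Return (instances, stp_mode, limit) for every matching message."""
    return [
        (int(count), mode, int(limit))
        for count, mode, limit in LIMIT_EXCEEDED.findall(log_text)
    ]


# First and last messages of the burst, one wrapped and one not.
sample = (
    "2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: "
    "The number of vlan-port instances (73600) exceeded [MST mode] recommended\n"
    "limit of 14500\n"
    "2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: "
    "The number of vlan-port instances (75400) exceeded [MST mode] recommended "
    "limit of 14500\n"
)

events = parse_limit_events(sample)
worst, mode, limit = max(events)
print(f"{worst} instances in {mode} mode, {worst / limit:.1f}x the limit")
# 75400 instances in MST mode, 5.2x the limit
```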

In the end the configuration was accepted and it appears
to be switching. The hosts are working. Spanning tree
probably is not, but since then the CPU has been fine.

pcc-12b-n5# sh processes cpu sort

PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4210 588 201530 2 2.0% gatosusd
1 1014 1305 777 0.0% init

CPU util : 0.0% user, 1.0% kernel, 99.0% idle

So apparently these packets really are what is
causing the CPU problem.
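
The same check can be made repeatable with a short script. This is
only a sketch, assuming the column layout seen in the outputs in
this report (PID, Runtime, Invoked, uSecs, 1Sec, Process); the
function names are ours.

```python
import re
from typing import Optional

# One data row: PID Runtime(ms) Invoked uSecs 1Sec% Process
ROW = re.compile(r"^\s*\d+\s+\d+\s+\d+\s+\d+\s+([\d.]+)%\s+(\S+)\s*$")


def cpu_by_process(output: str) -> dict[str, float]:
    """Map process name to its 1-second CPU percentage."""
    usage = {}
    for line in output.splitlines():
        match = ROW.match(line)
        if match:
            usage[match.group(2)] = float(match.group(1))
    return usage


def top_process(output: str) -> Optional[tuple[str, float]]:
    """Return the (name, percent) of the busiest process, if any."""
    usage = cpu_by_process(output)
    if not usage:
        return None
    name = max(usage, key=usage.get)
    return name, usage[name]


healthy = """PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4210 588 201530 2 2.0% gatosusd
1 1014 1305 777 0.0% init

CPU util : 0.0% user, 1.0% kernel, 99.0% idle
"""
loaded = """PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4382 292 100 2923 95.0% netstack
"""

print(top_process(healthy))  # ('gatosusd', 2.0)
print(top_process(loaded))   # ('netstack', 95.0)
```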

We will report this to Cisco TAC and see whether
they can deliver a patched version of NX-OS so
that we can remove spanning tree.





Date: 2011-08-05 18:06:46 UTC
kickstart: version 5.0(3)N2(1)
system: version 5.0(3)N2(1)


Date: 2011-08-05 18:04:53 UTC
pcc-12b-n5(config-if)# sh proc cpu sort

PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
1 1025 1462 701 0.0% init
pcc-12b-n5(config)# inter po 100
pcc-12b-n5(config-if)# no shutdown
2011 Aug 5 20:03:38 pcc-12b-n5 %PFMA-2-FEX_STATUS: Fex 100 is online
2011 Aug 5 20:03:38 pcc-12b-n5 %NOHMS-2-NOHMS_ENV_FEX_ONLINE: FEX-100 On-line
2011 Aug 5 20:03:38 pcc-12b-n5 %PFMA-2-FEX_STATUS: Fex 100 is online
pcc-12b-n5(config-if)# sh proc cpu sort

PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4382 292 100 2923 95.0% netstack

All that is left is to downgrade.

Date: 2011-08-05 18:02:47 UTC
pcc-12b-n5(config-if)# sh proc cpu sort

PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4382 292 100 2923 95.2% netstack
pcc-12b-n5(config-if)# inter po 111
pcc-12b-n5(config-if)# shutdown
2011 Aug 5 20:01:05 pcc-12b-n5 %PFMA-2-FEX_STATUS: Fex 111 is offline
2011 Aug 5 20:01:05 pcc-12b-n5 %NOHMS-2-NOHMS_ENV_FEX_OFFLINE: FEX-111 Off-line (Serial Number )
pcc-12b-n5(config-if)# sh proc cpu sort

PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4382 292 100 2923 2.0% netstack

We had to shut down all the FEXes to bring the CPU back to 2%.


Date: 2011-08-05 17:49:40 UTC
We are downgrading pcc-12 to n5000-uk9.5.0.3.N1.1b.bin, which does
not seem to have the netstack problem but has other bugs.
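
The image file name appears to encode the same fields as the
version string the switch reports (5.0(3)N2(1) in the update
above). The Python sketch below converts one into the other. The
mapping is our assumption based on that pattern, not a documented
rule, and the second file name is hypothetical.

```python
import re

# n5000-uk9.<major>.<minor>.<maintenance>.<train>.<release>.bin
IMAGE = re.compile(r"^n5000-uk9\.(\d+)\.(\d+)\.(\d+)\.(N\d+)\.(\w+)\.bin$")


def image_to_version(filename: str) -> str:
    """Turn an image file name into a '5.0(3)N1(1b)' style string."""
    match = IMAGE.match(filename)
    if match is None:
        raise ValueError(f"unrecognised image name: {filename!r}")
    major, minor, maintenance, train, release = match.groups()
    return f"{major}.{minor}({maintenance}){train}({release})"


print(image_to_version("n5000-uk9.5.0.3.N1.1b.bin"))  # 5.0(3)N1(1b)
print(image_to_version("n5000-uk9.5.0.3.N2.1.bin"))   # 5.0(3)N2(1)
```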

Date: 2011-08-05 17:41:24 UTC
We brought the ports back UP on the B side and the CPU is
maxed out on pcc-22.

pcc-12b-n5# sh proc cpu sort

PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4382 292 100 2923 84.0% netstack

One or more hosts must be sending packets that go straight
to the N5 in software and consume all the CPU. This is a
software bug on the N5s, but we still have to find out
exactly what is triggering it.

Date: 2011-08-05 17:22:00 UTC
The ports of both pcc-22 switches are shut down.

Date: 2011-08-05 17:18:30 UTC
We are shutting down all the ports on pcc-12 and will hard-reboot it.

Date: 2011-08-05 17:16:31 UTC
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102400
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102401
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102402
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102403
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102404
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102405
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102407
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102408
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102409
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102410
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102411
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102412


Date: 2011-08-05 17:12:04 UTC
Hot upgrades do not work every time on the Nexus 5xxx with FEXes.
We are changing strategy: we shut down the ports on one of the
two sides, which forces operation onto the second pair, then we
upgrade the isolated side. It can crash if it wants to. As soon
as it is back, we put it back into production.
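
The new strategy can be sketched as a loop. This is an
illustration only: the Switch class, the `install` callback and
the "old"/"new" version labels are placeholders we made up, and
nothing here talks to a real switch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Switch:
    name: str
    version: str
    ports_up: bool = True


def upgrade_one_side_at_a_time(
    side_a: Switch,
    side_b: Switch,
    target: str,
    install: Callable[[Switch, str], None],
) -> list[str]:
    """Isolate one side, upgrade it, restore it, then do the other."""
    log = []
    for switch, peer in ((side_a, side_b), (side_b, side_a)):
        if not peer.ports_up:
            raise RuntimeError(f"{peer.name} is down, refusing to isolate")
        switch.ports_up = False  # hosts now rely on the peer side only
        log.append(f"{switch.name}: ports shut, traffic on {peer.name}")
        install(switch, target)  # may crash freely: it carries no traffic
        switch.ports_up = True   # back into production
        log.append(f"{switch.name}: running {switch.version}, ports up")
    return log


def fake_install(switch: Switch, target: str) -> None:
    """Stand-in for the real upgrade step."""
    switch.version = target


a = Switch("side-1", "old")
b = Switch("side-2", "old")
steps = upgrade_one_side_at_a_time(a, b, "new", fake_install)
print("\n".join(steps))
```

The guard on `peer.ports_up` reflects the one rule the strategy
depends on: never isolate a side while the other one is down.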

Date: 2011-08-05 17:05:10 UTC
pcc-12a and b are back after a hard reboot; the FEXes are starting up.

Date: 2011-08-05 16:54:39 UTC
Both pcc-12 switches have crashed, but they are not releasing
the host ports. We are hard-rebooting them.

Date: 2011-08-05 16:44:34 UTC
pcc-12b-n5 has crashed. pcc-12a continues to switch the FEXes.

Date: 2011-08-05 16:43:26 UTC
pcc-15-n5: we are going to shut down all the FEX ports and
then hard-reboot the N5.

Date: 2011-08-05 16:39:55 UTC
pcc-25-n5 done

We are seeing the same problem as on pcc-22-n5, which
seems to be linked to the Nexus 5548P: netstack is eating CPU.
We already have a TAC case open with Cisco about this.

pcc-25-n5# sh processes cpu sort

PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4459 184 43 4294 49.5% netstack


Date: 2011-08-05 16:33:19 UTC
pcc-12a-n5 done
pcc-12b-n5 in progress

Compatibility check is done:
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive none
101 yes non-disruptive none
102 yes non-disruptive none
103 yes non-disruptive none
104 yes non-disruptive none
105 yes non-disruptive none
106 yes non-disruptive none
107 yes non-disruptive none
108 yes non-disruptive none
109 yes non-disruptive none
110 yes non-disruptive none
111 yes non-disruptive none
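
A pre-flight check of this table can be scripted. The Python
sketch below assumes the column order shown in this report
(Module, bootable, Impact, Install-type). The function name is
ours, and the row for module 102 is invented to show a failing
case; no such row appeared in our checks.

```python
def modules_needing_attention(table: str) -> list[int]:
    """Return modules that are not bootable or not non-disruptive."""
    flagged = []
    for line in table.splitlines():
        fields = line.split()
        # Data rows start with a numeric module id; skip headers and rules.
        if len(fields) >= 3 and fields[0].isdigit():
            bootable, impact = fields[1], fields[2]
            if bootable != "yes" or impact != "non-disruptive":
                flagged.append(int(fields[0]))
    return flagged


check = """Compatibility check is done:
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive none
101 yes non-disruptive none
"""
print(modules_needing_attention(check))  # []

# Hypothetical extra row, added only to exercise the failing path.
print(modules_needing_attention(check + "102 yes disruptive reset\n"))  # [102]
```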


Date: 2011-08-05 16:15:28 UTC
pcc-25-n5 in progress

Compatibility check is done:
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive rolling
101 yes non-disruptive rolling
102 yes non-disruptive rolling
103 yes non-disruptive rolling


Date: 2011-08-05 16:05:55 UTC
storage-s27b-n5 done

Date: 2011-08-05 16:02:55 UTC
pcc-12a-n5 in progress

Compatibility check is done:
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive rolling
101 yes non-disruptive rolling
102 yes non-disruptive rolling
103 yes non-disruptive rolling
104 yes non-disruptive rolling
105 yes non-disruptive rolling
106 yes non-disruptive rolling
107 yes non-disruptive rolling
108 yes non-disruptive rolling
109 yes non-disruptive rolling
110 yes non-disruptive rolling
111 yes non-disruptive rolling


Date: 2011-08-05 15:59:53 UTC
storage-s27a-n5 done
storage-s27b-n5 in progress

Compatibility check is done:
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive none
101 yes non-disruptive none
102 yes non-disruptive none
103 yes non-disruptive none
104 yes non-disruptive none
105 yes non-disruptive none


Date: 2011-08-05 15:55:42 UTC
pcc-11b is back. pcc-11a and b have updated the FEXes and
enabled the ports of every configured host. As soon as
each port came UP, the host sent its traffic back through
pcc-11.

Date: 2011-08-05 15:37:30 UTC
pcc-11a-n5# 2011 Aug 5 17:36:55 pcc-11a-n5 %VPC-2-VPC_ISSU_END: Peer vPC switch ISSU end, unlocking configuration
2011 Aug 5 17:37:00 pcc-11a-n5 %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 154, VPC peer keep-alive receive has failed

pcc-11b has also crashed. pcc-22 has taken over switching for the VLAN.

Date: 2011-08-05 15:35:32 UTC
storage-s27a-n5

Compatibility check is done:
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive rolling
101 yes non-disruptive rolling
102 yes non-disruptive rolling
103 yes non-disruptive rolling
104 yes non-disruptive rolling
105 yes non-disruptive rolling


Date: 2011-08-05 15:32:20 UTC
storage-s28 updated.
Moving on to storage-s27.



Date: 2011-08-05 15:28:39 UTC
storage-s28a-n5 done, along with its FEXes.
storage-s28b-n5 in progress

Compatibility check is done:
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive none
101 yes non-disruptive none


Date: 2011-08-05 15:28:13 UTC
pcc-11a-n5 crashed during the update. pcc-11b-n5 continues
to manage the FEXes. pcc-11a has come back. We are cutting its
FEXes off. We are updating pcc-11b. If that goes well, it will
update the FEXes and we can put the FEXes back on pcc-11a.


Compatibility check is done:
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive rolling
101 yes non-disruptive rolling
102 yes non-disruptive rolling
103 yes non-disruptive rolling
104 yes non-disruptive rolling
105 yes non-disruptive rolling
106 yes non-disruptive rolling
107 yes non-disruptive rolling
108 yes non-disruptive rolling
109 yes non-disruptive rolling
110 yes non-disruptive rolling
111 yes non-disruptive rolling


Date: 2011-08-05 15:21:24 UTC
2011 Aug 5 17:18:23 pcc-11b-n5 %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 154, VPC peer keep-alive receive has failed


Date: 2011-08-05 15:14:53 UTC
pcc-11b-n5# 2011 Aug 5 17:13:45 pcc-11b-n5 %VPC-2-VPC_ISSU_START: Peer vPC switch ISSU start, locking configuration
storage-s28b-n5# 2011 Aug 5 17:14:33 storage-s28b-n5 %VPC-2-VPC_ISSU_START: Peer vPC switch ISSU start, locking configuration


Date: 2011-08-05 15:14:28 UTC
storage-s28a-n5

Compatibility check is done:
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive rolling
101 yes non-disruptive rolling


Date: 2011-08-05 15:13:38 UTC
pcc-11a

Compatibility check is done:
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive rolling
101 yes non-disruptive rolling
102 yes non-disruptive rolling
103 yes non-disruptive rolling
104 yes non-disruptive rolling
105 yes non-disruptive rolling
106 yes non-disruptive rolling
107 yes non-disruptive rolling
108 yes non-disruptive rolling
109 yes non-disruptive rolling
110 yes non-disruptive rolling
111 yes non-disruptive rolling



Date: 2011-08-05 15:12:48 UTC
pcc-26-n5 in progress

Date: 2011-08-05 15:12:30 UTC
pcc-28-n5 in progress

Date: 2011-08-05 15:12:17 UTC
pcc-29-n5 in progress

Date: 2011-08-05 15:11:23 UTC
storage-s28a-n5 in progress

Date: 2011-08-05 15:10:44 UTC
pcc-10a done
pcc-10b done

pcc-11a in progress
Posted Aug 05, 2011 - 15:06 UTC