OVHcloud Network Status

rbx6-12b-n56
Incident Report for Network & Infrastructure
Resolved
Buffer utilization is high on this switch (12b shows no abnormal signs).

This is caused by the AFM process (ACL Feature Manager), which is never a good sign...

rbx6-12b-n56# sh system internal mts buffers summary
node  sapno  recv_q  pers_q  npers_q  log_q
sup   175    0       9       0        0
sup   377    0       0       0        47
sup   608    0       159     0        0
sup   284    0       4       0        0
sup   351    0       0       0        17
rbx6-12b-n56# sh system internal mts sup sap 608 description
Afm SAP

We are investigating the root cause, but this looks like it will need a reload.
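The check above (spotting the SAP with the deepest queue) can be sketched as a small parser. This is only an illustration: the function names and the threshold of 100 are assumptions made for the example, not values from Cisco or from the incident.

```python
# Sketch: flag MTS SAPs whose queues look abnormally deep.
# The threshold of 100 is an illustrative assumption, not a vendor recommendation.
SAMPLE = """\
node sapno recv_q pers_q npers_q log_q
sup 175 0 9 0 0
sup 377 0 0 0 47
sup 608 0 159 0 0
sup 284 0 4 0 0
sup 351 0 0 0 17
"""

QUEUES = ("recv_q", "pers_q", "npers_q", "log_q")


def parse_mts_summary(text):
    """Turn the whitespace-separated summary into a list of dicts."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    entries = []
    for row in rows:
        entry = dict(zip(header, row))
        # Every column except "node" is numeric.
        for key in header[1:]:
            entry[key] = int(entry[key])
        entries.append(entry)
    return entries


def busy_saps(entries, threshold=100):
    """Return the SAP numbers whose deepest queue reaches the threshold."""
    return [e["sapno"] for e in entries
            if max(e[q] for q in QUEUES) >= threshold]


if __name__ == "__main__":
    print(busy_saps(parse_mts_summary(SAMPLE)))
```

With the output shown above, only SAP 608 (the AFM SAP) exceeds the assumed threshold.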


Update(s):

Date: 2016-04-30 11:41:03 UTC
Transceiver replaced, fex116 is up, buffers and switch are OK; we can return to normal activity.

Date: 2016-04-30 11:27:23 UTC
The FEXes have been brought up; redundancy is restored on all FEXes except 116.
FEX 116 flapped on the 12B side; it has a faulty optic => being fixed by the datacentre.



Date: 2016-04-30 10:48:27 UTC
rbx6-12b-n56# sh fex
FEX     FEX          FEX     FEX                Fex
Number  Description  State   Model              Serial
------------------------------------------------------------------------
100     fex100       Online  N2K-C2248TP-E-1GE  SSI181709KY
101     fex101       Online  N2K-C2248TP-E-1GE  FOX1844G5AX
102     fex102       Online  N2K-C2248TP-E-1GE  FOX1901G31F
103     fex103       Online  N2K-C2248TP-E-1GE  FOX1901G2YS
104     fex104       Online  N2K-C2248TP-E-1GE  FOX1844G75X
105     fex105       Online  N2K-C2248TP-E-1GE  FOX1905GDWS
106     fex106       Online  N2K-C2248TP-E-1GE  FOX1844GJHP

Date: 2016-04-30 10:31:58 UTC
Switch reload done.
We shut down the port-channels to the FEXes to avoid saturating the switch again by bringing up the 1000 eth ports at once.
We are bringing the FEXes back up one by one while monitoring the buffers.

rbx6-12b-n56# sh fex
FEX     FEX          FEX     FEX                Fex
Number  Description  State   Model              Serial
------------------------------------------------------------------------
100     fex100       Online  N2K-C2248TP-E-1GE  SSI181709KY
101     fex101       Online  N2K-C2248TP-E-1GE  FOX1844G5AX
102     fex102       Online  N2K-C2248TP-E-1GE  FOX1901G31F
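The one-by-one bring-up described in this update can be sketched as a loop that enables a FEX, waits for the buffers to settle, and stops if they do not. Everything here is hypothetical: `enable_fex` and `max_queue_depth` are caller-supplied callables standing in for whatever tooling actually un-shuts the port-channel and reads the MTS buffers, and the limit, settle time and retry count are illustrative assumptions.

```python
import time


def staged_bring_up(fex_ids, enable_fex, max_queue_depth,
                    limit=100, settle_s=0.0, retries=3):
    """Enable FEXes one at a time, checking buffer depth after each.

    enable_fex(fex_id)  -- hypothetical callable that un-shuts one FEX uplink.
    max_queue_depth()   -- hypothetical callable returning the deepest queue.

    Returns (done, stalled): the FEXes brought up with buffers below
    `limit`, and the FEX after which buffers never settled (or None).
    """
    done = []
    for fex in fex_ids:
        enable_fex(fex)
        for _ in range(retries):
            time.sleep(settle_s)
            if max_queue_depth() < limit:
                break
        else:
            # Buffers stayed high after every retry: stop and report.
            return done, fex
        done.append(fex)
    return done, None


if __name__ == "__main__":
    # Dry run with fake callables: buffers always read as shallow.
    enabled = []
    print(staged_bring_up([100, 101, 102], enabled.append, lambda: 5))
```

Stopping at the first FEX that leaves the buffers high keeps the remaining port-channels shut, which is the point of not bringing everything up at once.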

Date: 2016-04-30 10:16:42 UTC
CPU before the reload: snmpd is hammering the switch; this appears to be a consequence rather than the cause.
Wild guess, to be confirmed with Cisco: ETHPM struggles => causes the AFM buffer build-up => SNMP struggles.
Together they take all the CPU and we enter a vicious circle...

rbx6-12b-n56# sh system internal processes cpu
top - 12:10:39 up 315 days, 19:11, 3 users, load average: 1.28, 1.45, 1.15
Tasks: 240 total, 3 running, 236 sleeping, 0 stopped, 1 zombie
Cpu(s): 2.9%us, 1.7%sy, 0.0%ni, 95.0%id, 0.0%wa, 0.0%hi, 0.5%si, 0.0%st
Mem: 8243352k total, 3861200k used, 4382152k free, 288k buffers
Swap: 0k total, 0k used, 0k free, 1463832k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28326 root 20 0 348m 38m 25m R 52.8 0.5 30903:01 snmpd
4458 root 20 0 321m 70m 19m R 33.9 0.9 19053:50 ethpm
3994 root 20 0 310m 41m 15m S 17.0 0.5 15171:01 stats_client
8423 nicolas. 20 0 3620 1528 1140 R 7.5 0.0 0:00.07 top
4050 root 20 0 321m 32m 20m S 3.8 0.4 5352:08 pm
4174 root 20 0 442m 73m 26m S 3.8 0.9 6659:44 netstack
4170 root 20 0 297m 49m 20m S 1.9 0.6 1567:58 satmgr
1 root 20 0 2004 664 580 S 0.0 0.0 5:19.84 init
2 root 15 -5 0 0 0 S 0.0 0.0 0:00.01 kthreadd
3 root RT -5 0 0 0 S 0.0 0.0 0:11.29 migration/0
4 root 15 -5 0 0 0 S 0.0 0.0 94:25.26 ksoftirqd/0
5 root RT -5 0 0 0 S 0.0 0.0 5:09.96 watchdog/0
6 root RT -5 0 0 0 S 0.0 0.0 0:14.36 migration/1
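To pick the heavy processes out of a capture like the one above, a short parser can rank them by %CPU. This is a sketch under assumptions: the function name and the 10% cut-off are invented for the example, and the sample is a subset of the output above.

```python
# Sketch: rank processes from a top-style table by %CPU.
# The 10% cut-off is an illustrative assumption.
SAMPLE = """\
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28326 root 20 0 348m 38m 25m R 52.8 0.5 30903:01 snmpd
4458 root 20 0 321m 70m 19m R 33.9 0.9 19053:50 ethpm
3994 root 20 0 310m 41m 15m S 17.0 0.5 15171:01 stats_client
4174 root 20 0 442m 73m 26m S 3.8 0.9 6659:44 netstack
"""


def top_consumers(text, min_cpu=10.0):
    """Return (command, cpu) pairs at or above min_cpu, highest first."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    cpu_i, cmd_i = header.index("%CPU"), header.index("COMMAND")
    found = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= cmd_i:
            continue  # skip truncated rows
        cpu = float(fields[cpu_i])
        if cpu >= min_cpu:
            found.append((fields[cmd_i], cpu))
    return sorted(found, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for command, cpu in top_consumers(SAMPLE):
        print(f"{command:15s} {cpu:5.1f}%")
```

On this sample the three processes above the cut-off are snmpd, ethpm and stats_client, matching the chain suspected in the update.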

Date: 2016-04-30 10:02:24 UTC
No spanning tree instance exists.
rbx6-12b-n56# sh platform afm info copp-tbls | diff
8,10c8,10
< 0 default 64000 6250 51700252190 4151275828
< 1 stp 2500000 4687 1214117872 0
< 2 lacp 128000 4687 574984688 0
---
> 0 default 64000 6250 51700312959 4151275828
> 1 stp 2500000 4687 1214119104 0
> 2 lacp 128000 4687 574985296 0
15c15
< 7 sat control 62500000 65535 2318965670683 0
---
> 7 sat control 62500000 65535 2318968001023 0
25c25
< 18 cdp 128000 4687 159709968 0
---
> 18 cdp 128000 4687 159710144 0
28,29c28,29
< 21 mgmt/ipv6-mgmt* 1500000 4687 139677728087 5781405
< 23 arp/ipv6-nd 8000 3515 16452102544 630004096
---
> 21 mgmt/ipv6-mgmt* 1500000 4687 139677925157 5781405
> 23 arp/ipv6-nd 8000 3515 16452118836 630004096
33c33
< 27 hsrp vrrp/ipv6-hsrp 128000 250 2987080360 85648746
---
> 27 hsrp vrrp/ipv6-hsrp 128000 250 2987083756 85648746
44c44
< 41 excp/ipv6-excp** 8000 4687 5679291770 384144830
---
> 41 excp/ipv6-excp** 8000 4687 5679301982 384144830
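The diff above is easier to read as per-class deltas between the two snapshots. The sketch below computes them from `<` / `>` lines. The report does not say what the last two columns mean, so they are treated here as two opaque counters; the function names are assumptions, and the sample is a subset of the diff above.

```python
# Sketch: per-class counter deltas from a "<" / ">" diff of two snapshots.
# The last two columns are treated as opaque counters (meaning not stated
# in the report).
DIFF = """\
8,10c8,10
< 0 default 64000 6250 51700252190 4151275828
< 7 sat control 62500000 65535 2318965670683 0
---
> 0 default 64000 6250 51700312959 4151275828
> 7 sat control 62500000 65535 2318968001023 0
"""


def parse_side(text, marker):
    """Collect rows prefixed by marker, keyed by class index."""
    rows = {}
    for line in text.splitlines():
        if not line.startswith(marker + " "):
            continue  # hunk headers and "---" separators are ignored
        tokens = line.split()[1:]
        index = int(tokens[0])
        # Class names may contain spaces; the last four tokens are numeric.
        name = " ".join(tokens[1:-4])
        rows[index] = (name, int(tokens[-2]), int(tokens[-1]))
    return rows


def counter_deltas(text):
    """Return {class name: (delta of counter 1, delta of counter 2)}."""
    before, after = parse_side(text, "<"), parse_side(text, ">")
    deltas = {}
    for index, (name, c1, c2) in after.items():
        if index in before:
            _, b1, b2 = before[index]
            deltas[name] = (c1 - b1, c2 - b2)
    return deltas


if __name__ == "__main__":
    for name, (d1, d2) in counter_deltas(DIFF).items():
        print(f"{name:15s} {d1:>10d} {d2:>10d}")
```

In the sample, only the first counter moves for both classes; the second counter is unchanged between the two snapshots.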


We are collecting some logs and reloading the box; no downtime, traffic is forwarded by 12a.
Posted Apr 30, 2016 - 09:53 UTC
This incident affected: Infrastructure || RBX (RBX6).