Problem

While running the upgrade precheck for the RoCE switches on an Exadata X8M-2, the operation failed right after the following warning, raised while patchmgr evaluated one of the RoCE switches:

[WARNING  ] Unable to find peer single-rack switch of exa01sw-roces in switch list

Example:

================PatchMgr run started ===========
With arguments: --roceswitches /home/dbmadmin/roce_group --upgrade --roceswitch-precheck --log_dir /home/dbmadmin/log_precheck
       :Working: DO: Initiate pre-upgrade validation check on 3 RoCE switch(es).
++++++++++++++++++ Logs so far begin ++++++++++

1 of 3 :Running upgrade precheck on switch exa01sw-rocea

        [INFO     ] Performing Nodes connectivity tests on exa01sw-rocea
        [SUCCESS  ] Nodes connectivity tests on exa01sw-rocea are successful
        [INFO     ] ASR will be installed/upgraded during the next switch upgrade on exa01sw-rocea
        [INFO     ] Switch exa01sw-rocea will be upgraded from nxos.7.0.3.I7.8.bin to nxos.7.0.3.I7.9.bin
        [INFO     ] Checking for free disk space on switch
        [INFO     ] disk is 95.00% free,  available: 111562809344 bytes
        [SUCCESS  ] There is enough disk space to proceed
        [INFO     ] Copying nxos.7.0.3.I7.9.bin onto exa01sw-rocea (eta: 5-30 minutes)
        [SUCCESS  ] Finished copying image to switch
        [INFO     ] Verifying sha256sum of bin file on switch
        [SUCCESS  ] sha256sum matches: 992cbb067bc72f29403f643eafd73079c57f74dd632a391201a1e31fdebba325
        [INFO     ] Performing FW install pre-check of nxos.7.0.3.I7.9.bin (eta: 2-3 minutes)
        [SUCCESS  ] FW install pre-check completed successfully

2 of 3 :Running upgrade precheck on switch exa01sw-roceb

        [INFO     ] Performing Nodes connectivity tests on exa01sw-roceb
        [SUCCESS  ] Nodes connectivity tests on exa01sw-roceb are successful
        [INFO     ] ASR will be installed/upgraded during the next switch upgrade on exa01sw-roceb
        [INFO     ] Switch exa01sw-roceb will be upgraded from nxos.7.0.3.I7.8.bin to nxos.7.0.3.I7.9.bin
        [INFO     ] Checking for free disk space on switch
        [INFO     ] disk is 95.00% free,  available: 111563468800 bytes
        [SUCCESS  ] There is enough disk space to proceed
        [INFO     ] Copying nxos.7.0.3.I7.9.bin onto exa01sw-roceb (eta: 5-30 minutes)
        [SUCCESS  ] Finished copying image to switch
        [INFO     ] Verifying sha256sum of bin file on switch
        [SUCCESS  ] sha256sum matches: 992cbb067bc72f29403f643eafd73079c57f74dd632a391201a1e31fdebba325
        [INFO     ] Performing FW install pre-check of nxos.7.0.3.I7.9.bin (eta: 2-3 minutes)
        [SUCCESS  ] FW install pre-check completed successfully

3 of 3 :Running upgrade precheck on switch exa01sw-roces

        [WARNING  ] Unable to find peer single-rack switch of exa01sw-roces in switch list
        [FAIL     ] [FirmwareUpgradeError] Fabric health check failed
++++++++++++++++++ Logs so far end ++++++++++
       :FAILED : Initiate pre-upgrade validation check on RoCE switch(es).
================PatchMgr run ended ===========

Cause

The switch on which patchmgr fails is a Spine Switch that is installed and powered on in this Exadata, available for a future expansion from Single-Rack to Multi-Rack, but it is not actually in use at the moment.

When the names of all three switches are placed together in the same group file, patchmgr assumes that every one of them is a Leaf Switch and tries to apply the same validations and handling to all the switches involved in the precheck operation.

Solution

The approach that worked in this scenario was to put the Spine Switch and the Leaf Switches in separate group files and run patchmgr twice: once for the Spine Switch on its own, and once for the two Leaf Switches, A and B.

Before (a single group file):

# cat /home/dbmadmin/roce_group
exa01sw-rocea 
exa01sw-roceb 
exa01sw-roces

After (two separate group files):

# cat /home/dbmadmin/spine_group
exa01sw-roces

# cat /home/dbmadmin/leaf_group
exa01sw-rocea 
exa01sw-roceb 
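
The split above can also be scripted. The following is only a minimal sketch, and it rests on an assumption that is not a patchmgr rule: that spine switch hostnames end in "roces", as exa01sw-roces does in this environment. The function name split_roce_group is hypothetical; adjust the pattern to your own naming convention.

```shell
# Sketch: split a combined RoCE group file into spine and leaf group files.
# ASSUMPTION: spine hostnames end in "roces" (true for exa01sw-roces here only).
# $1 = combined group file, $2 = spine group file to write, $3 = leaf group file to write
split_roce_group() {
    # Strip trailing whitespace and blank lines, then keep only the spine names.
    sed -e 's/[[:space:]]*$//' -e '/^$/d' "$1" | grep -E 'roces$' > "$2" || true
    # Same cleanup, but keep everything that is NOT a spine name (the leaf switches).
    sed -e 's/[[:space:]]*$//' -e '/^$/d' "$1" | grep -Ev 'roces$' > "$3" || true
}
```

For example, split_roce_group ~/roce_group ~/spine_group ~/leaf_group would write the two files shown above. Review both files by hand before passing them to patchmgr.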

1) Precheck

1.1) Running the precheck for the Spine Switch on its own:

./patchmgr --roceswitches ~/spine_group --upgrade --roceswitch-precheck --log_dir ~/log_precheck

1.2) Running the precheck for the two Leaf Switches:

./patchmgr --roceswitches ~/leaf_group --upgrade --roceswitch-precheck --log_dir ~/log_precheck
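
The two precheck runs can be wrapped in a small loop that stops at the first failure. This is a sketch, not part of patchmgr: the function precheck_groups and the PATCHMGR variable are hypothetical names, and the patchmgr flags are exactly the ones used in the commands above.

```shell
# Sketch: run the same patchmgr precheck once per group file, stopping on failure.
# PATCHMGR is a hypothetical override; it defaults to ./patchmgr as used in this post.
PATCHMGR="${PATCHMGR:-./patchmgr}"

# $1 = log directory, remaining arguments = group files (one patchmgr run each)
precheck_groups() {
    log_dir="$1"
    shift
    for group in "$@"; do
        echo "Precheck for group file: $group"
        "$PATCHMGR" --roceswitches "$group" --upgrade --roceswitch-precheck --log_dir "$log_dir" || {
            echo "Precheck failed for $group" >&2
            return 1
        }
    done
}
```

It could be called as precheck_groups ~/log_precheck ~/spine_group ~/leaf_group, which is equivalent to running steps 1.1 and 1.2 one after the other.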

2) Upgrade:

2.1) Running the upgrade for the Spine Switch on its own:

nohup ./patchmgr --roceswitches ~/spine_group --upgrade --log_dir ~/log_upgrade > UpgradeSpineSwitch.log 2>&1 &

2.2) Running the upgrade for the two Leaf Switches:

nohup ./patchmgr --roceswitches ~/leaf_group --upgrade --log_dir ~/log_upgrade > UpgradeLeafSwitch.log 2>&1 &

Now, both in the precheck and in the upgrade itself, the “Performing Nodes connectivity tests” step for the Spine Switch should report something like “Not needed due to isolated”.

Example (line 6):

:Working: Initiate upgrade of 1 RoCE switch(es) to 7.0(3)I7(9) Expect up to 15 minutes for each switch

1 of 1 :Running upgrade on switch exa01sw-roces

 [INFO     ] Performing Nodes connectivity tests on exa01sw-roces
 [INFO     ] Not needed due to isolated exa01sw-roces
 [INFO     ] Switch exa01sw-roces will be upgraded from nxos.7.0.3.I7.8.bin to nxos.7.0.3.I7.9.bin
 [INFO     ] Checking for free disk space on switch
 [INFO     ] disk is 94.00% free,  available: 110364278784 bytes
 [SUCCESS  ] There is enough disk space to proceed
 [INFO     ] Found  nxos.7.0.3.I7.9.bin on switch, skipping download
 [INFO     ] Verifying sha256sum of bin file on switch
 [SUCCESS  ] sha256sum matches: 992cbb067bc72f29403f643eafd73079c57f74dd632a391201a1e31fdebba325
 [INFO     ] Performing FW install pre-check of nxos.7.0.3.I7.9.bin (eta: 2-3 minutes)
 [SUCCESS  ] FW install pre-check completed successfully
 [INFO     ] Performing FW install of nxos.7.0.3.I7.9.bin on exa01sw-roces (eta: 3-7 minutes)
 [SUCCESS  ] FW install completed
 [INFO     ] Waiting for switch to come back online (eta: 5-10 minutes)
 [INFO     ] Verifying if FW install is successful
 [SUCCESS  ] exa01sw-roces has been successfully  upgraded to nxos.7.0.3.I7.9.bin!
 [INFO     ] Begin ASR Installation on the switch
 [INFO     ] Copying asr.tar.gz onto exa01sw-roces (eta: 1-2 minutes)
 [SUCCESS  ] Finished copying asr package to switch
 [INFO     ] Verifying sha256sum of bin file on switch
 [SUCCESS  ] sha256sum matches: 7414d76f74debaedeb2974dbde2b288920da1cffe390060bba47c5f898779f65
 [INFO     ] ASR Installation succeded
:Working: Initiating config verification... Expect up to 6 minutes for each switch


1 of 1 :Verifying config on switch exa01sw-roces

 [INFO     ] Dumping current running config locally as file: /home/dbmadmin/log_upgrade/run.exa01sw-roces.cfg
 [SUCCESS  ] Backed up switch config successfully
 [INFO     ] Validating running config against template [1/10]: /Patches/FabricSwitch/patch_switch_22.1.4.0.0.220929/roce_switch_templates/roce_leaf_switch.cfg
 [INFO     ] Config matches template: /Patches/FabricSwitch/patch_switch_22.1.4.0.0.220929/roce_switch_templates/roce_leaf_switch.cfg
 [SUCCESS  ] Config validation successful!

:SUCCESS: Config check on RoCE switch(es)

:SUCCESS: upgrade 1 RoCE switch(es) to 7.0(3)I7(9)

:SUCCESS: Completed run of command: ./patchmgr --roceswitches ~/spine_group --upgrade --log_dir /home/dbmadmin/log_upgrade
:INFO   : upgrade performed on switch(es) in file ~/spine_group: [exa01sw-roces]
:INFO   : For details, check the following files in /home/dbmadmin/log_upgrade:
:INFO   :  - switch_admin.log
:INFO   :  - switch_admin.trc
:INFO   :  - patchmgr.stdout
:INFO   :  - patchmgr.stderr
:INFO   :  - patchmgr.log
:INFO   :  - patchmgr.trc
:INFO   : Exit status:0
:INFO   : Exiting.
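
To confirm the "isolated" message without reading the whole output, a simple search is enough. This is a sketch under one assumption: that the file you search (for example the nohup output file UpgradeSpineSwitch.log from step 2.1) contains the same lines shown in the example above. The function name is_isolated_in_log is hypothetical.

```shell
# Sketch: check whether patchmgr output reports a switch as isolated.
# $1 = switch hostname, $2 = file holding the patchmgr output
# Returns 0 when the "Not needed due to isolated <switch>" line is present.
is_isolated_in_log() {
    # -F matches the text literally; -q suppresses output and only sets the exit status.
    grep -Fq "Not needed due to isolated $1" "$2"
}
```

Usage: is_isolated_in_log exa01sw-roces UpgradeSpineSwitch.log && echo "spine treated as isolated"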

Conclusion

With this approach, all three RoCE switches were upgraded successfully, without any errors.

In practice, I arrived at the solution by intuition while running the prechecks, before confirming the cause, but I kept researching until I found something indicating whether this was common or very specific to the environment I was upgrading. I found a MOS note that does not show exactly the same error, but describes the same scenario and suggests the same underlying cause in both cases:

Stand Alone X8M Rack Spine Switch fails patchmanager verify-config (Doc ID 2684096.1)

Finally, since I always open a Proactive SR with Oracle Support before upgrading any Exadata (including Exadata Cloud), I asked support to validate this item and to confirm that I could proceed with the upgrade.
