Problem
While running the precheck for the RoCE switch upgrade on an Exadata X8M-2, the operation failed after the following warning was raised while evaluating one of the RoCE switches:
[WARNING ] Unable to find peer single-rack switch of exa01sw-roces in switch list
Example:
================PatchMgr run started ===========
With arguments: --roceswitches /home/dbmadmin/roce_group --upgrade --roceswitch-precheck --log_dir /home/dbmadmin/log_precheck
:Working: DO: Initiate pre-upgrade validation check on 3 RoCE switch(es).
++++++++++++++++++ Logs so far begin ++++++++++
1 of 3 :Running upgrade precheck on switch exa01sw-rocea
[INFO ] Performing Nodes connectivity tests on exa01sw-rocea
[SUCCESS ] Nodes connectivity tests on exa01sw-rocea are successful
[INFO ] ASR will be installed/upgraded during the next switch upgrade on exa01sw-rocea
[INFO ] Switch exa01sw-rocea will be upgraded from nxos.7.0.3.I7.8.bin to nxos.7.0.3.I7.9.bin
[INFO ] Checking for free disk space on switch
[INFO ] disk is 95.00% free, available: 111562809344 bytes
[SUCCESS ] There is enough disk space to proceed
[INFO ] Copying nxos.7.0.3.I7.9.bin onto exa01sw-rocea (eta: 5-30 minutes)
[SUCCESS ] Finished copying image to switch
[INFO ] Verifying sha256sum of bin file on switch
[SUCCESS ] sha256sum matches: 992cbb067bc72f29403f643eafd73079c57f74dd632a391201a1e31fdebba325
[INFO ] Performing FW install pre-check of nxos.7.0.3.I7.9.bin (eta: 2-3 minutes)
[SUCCESS ] FW install pre-check completed successfully
2 of 3 :Running upgrade precheck on switch exa01sw-roceb
[INFO ] Performing Nodes connectivity tests on exa01sw-roceb
[SUCCESS ] Nodes connectivity tests on exa01sw-roceb are successful
[INFO ] ASR will be installed/upgraded during the next switch upgrade on exa01sw-roceb
[INFO ] Switch exa01sw-roceb will be upgraded from nxos.7.0.3.I7.8.bin to nxos.7.0.3.I7.9.bin
[INFO ] Checking for free disk space on switch
[INFO ] disk is 95.00% free, available: 111563468800 bytes
[SUCCESS ] There is enough disk space to proceed
[INFO ] Copying nxos.7.0.3.I7.9.bin onto exa01sw-roceb (eta: 5-30 minutes)
[SUCCESS ] Finished copying image to switch
[INFO ] Verifying sha256sum of bin file on switch
[SUCCESS ] sha256sum matches: 992cbb067bc72f29403f643eafd73079c57f74dd632a391201a1e31fdebba325
[INFO ] Performing FW install pre-check of nxos.7.0.3.I7.9.bin (eta: 2-3 minutes)
[SUCCESS ] FW install pre-check completed successfully
3 of 3 :Running upgrade precheck on switch exa01sw-roces
[WARNING ] Unable to find peer single-rack switch of exa01sw-roces in switch list
[FAIL ] [FirmwareUpgradeError] Fabric health check failed
++++++++++++++++++ Logs so far end ++++++++++
:FAILED : Initiate pre-upgrade validation check on RoCE switch(es).
================PatchMgr run ended ===========
Cause
The switch on which patchmgr fails is a Spine Switch that is installed and powered on in this Exadata, available for a future expansion from Single-Rack to Multi-Rack, but it is not actually in use at the moment.
When the names of all three switches are placed together in the same group file, patchmgr assumes that every one of them is a Leaf Switch and tries to run the same validations and handling on all switches involved in the precheck.
Solution
The approach that worked in this scenario was to place the Spine Switch and the Leaf Switches in separate group files and run patchmgr twice: once for the Spine Switch on its own, and once for the two Leaf Switches, A and B.
Before (a single group file):
# cat /home/dbmadmin/roce_group
exa01sw-rocea
exa01sw-roceb
exa01sw-roces
After (two separate group files):
# cat /home/dbmadmin/spine_group
exa01sw-roces
# cat /home/dbmadmin/leaf_group
exa01sw-rocea
exa01sw-roceb
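If the combined group file already exists, the split shown above could be scripted rather than typed by hand. The following is only a minimal sketch: the function name `split_roce_group` is hypothetical, and it assumes that spine switch hostnames end in `roces`, as `exa01sw-roces` does in this environment. Other sites may follow a different naming convention, so adjust the pattern accordingly.

```shell
# Hypothetical helper: split a combined RoCE group file into a spine group
# file and a leaf group file.
# ASSUMPTION: spine switch hostnames end in "roces" (true for exa01sw-roces
# in this environment); adapt the pattern to your own naming convention.
split_roce_group() {
    # $1 = combined group file, $2 = spine output file, $3 = leaf output file

    # Spine switches: lines ending in "roces".
    grep -E 'roces$' "$1" > "$2" || true

    # Leaf switches: every remaining non-empty line.
    grep -Ev 'roces$' "$1" | grep -v '^[[:space:]]*$' > "$3" || true
}
```

For the files used in this post, the call would be `split_roce_group /home/dbmadmin/roce_group /home/dbmadmin/spine_group /home/dbmadmin/leaf_group`. Review both output files before passing them to patchmgr.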
1) Precheck
1.1) Running the precheck for the Spine Switch on its own:
./patchmgr --roceswitches ~/spine_group --upgrade --roceswitch-precheck --log_dir ~/log_precheck
1.2) Running the precheck for the two Leaf Switches:
./patchmgr --roceswitches ~/leaf_group --upgrade --roceswitch-precheck --log_dir ~/log_precheck
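The two precheck runs above can also be chained so that the second group is only checked if the first one passes. This is a sketch under assumptions, not part of the Oracle tooling: the function `run_prechecks` and the `PATCHMGR` override variable are hypothetical names, and the sketch assumes patchmgr returns a non-zero exit code when a precheck fails. The patchmgr options themselves are the ones used in the commands above.

```shell
# Hypothetical wrapper: run the RoCE switch precheck once per group file,
# stopping at the first failure.
# ASSUMPTIONS: PATCHMGR is an illustrative override for the patchmgr path
# (defaults to ./patchmgr), and patchmgr exits non-zero on a failed precheck.
run_prechecks() {
    # $1 = log directory, remaining arguments = group files
    log_dir="$1"
    shift
    for group in "$@"; do
        echo "Running precheck for group file: $group"
        "${PATCHMGR:-./patchmgr}" --roceswitches "$group" --upgrade \
            --roceswitch-precheck --log_dir "$log_dir" || {
            echo "Precheck failed for group file: $group" >&2
            return 1
        }
    done
}
```

In this environment the call would be `run_prechecks ~/log_precheck ~/spine_group ~/leaf_group`, executed from the directory that contains patchmgr.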
2) Upgrade:
2.1) Running the upgrade for the Spine Switch on its own:
nohup ./patchmgr --roceswitches ~/spine_group --upgrade --log_dir ~/log_upgrade > UpgradeSpineSwitch.log 2>&1 &
2.2) Running the upgrade for the two Leaf Switches:
nohup ./patchmgr --roceswitches ~/leaf_group --upgrade --log_dir ~/log_upgrade > UpgradeLeafSwitch.log 2>&1 &
Now, in both the precheck and the upgrade itself, the “Performing Nodes connectivity tests” step should report something like “Not needed due to isolated”.
Example (note the “Not needed due to isolated” line):
:Working: Initiate upgrade of 1 RoCE switch(es) to 7.0(3)I7(9) Expect up to 15 minutes for each switch
1 of 1 :Running upgrade on switch exa01sw-roces
[INFO ] Performing Nodes connectivity tests on exa01sw-roces
[INFO ] Not needed due to isolated exa01sw-roces
[INFO ] Switch exa01sw-roces will be upgraded from nxos.7.0.3.I7.8.bin to nxos.7.0.3.I7.9.bin
[INFO ] Checking for free disk space on switch
[INFO ] disk is 94.00% free, available: 110364278784 bytes
[SUCCESS ] There is enough disk space to proceed
[INFO ] Found nxos.7.0.3.I7.9.bin on switch, skipping download
[INFO ] Verifying sha256sum of bin file on switch
[SUCCESS ] sha256sum matches: 992cbb067bc72f29403f643eafd73079c57f74dd632a391201a1e31fdebba325
[INFO ] Performing FW install pre-check of nxos.7.0.3.I7.9.bin (eta: 2-3 minutes)
[SUCCESS ] FW install pre-check completed successfully
[INFO ] Performing FW install of nxos.7.0.3.I7.9.bin on exa01sw-roces (eta: 3-7 minutes)
[SUCCESS ] FW install completed
[INFO ] Waiting for switch to come back online (eta: 5-10 minutes)
[INFO ] Verifying if FW install is successful
[SUCCESS ] exa01sw-roces has been successfully upgraded to nxos.7.0.3.I7.9.bin!
[INFO ] Begin ASR Installation on the switch
[INFO ] Copying asr.tar.gz onto exa01sw-roces (eta: 1-2 minutes)
[SUCCESS ] Finished copying asr package to switch
[INFO ] Verifying sha256sum of bin file on switch
[SUCCESS ] sha256sum matches: 7414d76f74debaedeb2974dbde2b288920da1cffe390060bba47c5f898779f65
[INFO ] ASR Installation succeded
:Working: Initiating config verification... Expect up to 6 minutes for each switch
1 of 1 :Verifying config on switch exa01sw-roces
[INFO ] Dumping current running config locally as file: /home/dbmadmin/log_upgrade/run.exa01sw-roces.cfg
[SUCCESS ] Backed up switch config successfully
[INFO ] Validating running config against template [1/10]: /Patches/FabricSwitch/patch_switch_22.1.4.0.0.220929/roce_switch_templates/roce_leaf_switch.cfg
[INFO ] Config matches template: /Patches/FabricSwitch/patch_switch_22.1.4.0.0.220929/roce_switch_templates/roce_leaf_switch.cfg
[SUCCESS ] Config validation successful!
:SUCCESS: Config check on RoCE switch(es)
:SUCCESS: upgrade 1 RoCE switch(es) to 7.0(3)I7(9)
:SUCCESS: Completed run of command: ./patchmgr --roceswitches ~/spine_group --upgrade --log_dir /home/dbmadmin/log_upgrade
:INFO : upgrade performed on switch(es) in file ~/spine_group: [exa01sw-roces]
:INFO : For details, check the following files in /home/dbmadmin/log_upgrade:
:INFO : - switch_admin.log
:INFO : - switch_admin.trc
:INFO : - patchmgr.stdout
:INFO : - patchmgr.stderr
:INFO : - patchmgr.log
:INFO : - patchmgr.trc
:INFO : Exit status:0
:INFO : Exiting.
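Because the upgrade runs in the background under nohup, it can be handy to check the captured output afterwards instead of reading it by eye. The helpers below are a rough sketch: the function names are hypothetical, and the matched strings are simply the ones that appear in the output above, so they may differ in other patchmgr versions. The sketch also assumes that the file passed in (for example UpgradeSpineSwitch.log) contains the full patchmgr output.

```shell
# Hypothetical helpers for inspecting a captured patchmgr output file.
# ASSUMPTION: the strings matched here are the ones seen in the output above;
# other patchmgr versions may word these messages differently.

# Succeeds if the connectivity test was skipped because the switch is isolated.
switch_was_isolated() {
    # $1 = log file, $2 = switch hostname
    grep -qF "Not needed due to isolated $2" "$1"
}

# Succeeds if the run reported a zero exit status on its own line.
run_exited_ok() {
    # $1 = log file
    grep -qE 'Exit status:0[[:space:]]*$' "$1"
}
```

For example, `switch_was_isolated UpgradeSpineSwitch.log exa01sw-roces && run_exited_ok UpgradeSpineSwitch.log` should succeed for a run like the one shown above.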
Conclusion
Following this approach, all three RoCE switches were upgraded successfully without any errors.
In practice, I arrived at the solution by intuition while running the prechecks, before confirming the cause, but I kept researching until I found something indicating whether this was a common issue or something very specific to the environment I was upgrading. I found a MOS note that does not show exactly the same error, but describes the same scenario and suggests the same underlying cause in both cases:
Stand Alone X8M Rack Spine Switch fails patchmanager verify-config (Doc ID 2684096.1)
Finally, since I always open a proactive SR with support before upgrading any Exadata (including Exadata Cloud), I asked support to validate this item and confirm that I could proceed with the upgrade.