Recover VMs with corrupt snapshots

By Ather Beg

Consulting throws up many challenges during the design and implementation stages, but none more than the actual environment integration. Being at the ‘coal face’ means things don’t always go to plan, and it’s this real-world experience that we at Xtravirt excel at.

In this, my first blog posting, I’m going to discuss VMware snapshots and the possibility that you can recover from corrupted ones.

Particular events can create situations where a VM might start rebooting or shut down completely, and during this unplanned process one or more snapshots for that machine may get corrupted.

A common scenario for this kind of corruption is when:

  •  A VM starts displaying the message in the console:

 “The redo log of <Machine Name>.vmdk is corrupted.  Power off the virtual machine.  If the problem still persists, discard the redo log.”

  • Pressing OK on that message causes it to be displayed again
  • Powering off the VM might not be possible, and the attempt may display this message in the console:

 “The attempted operation cannot be performed in the current state”

Depending on the type of failure, recovery from such a situation is possible and, at times, with all data intact.  This is especially true when a backup solution takes a snapshot as part of its process and the snapshot becomes corrupt just after it is taken, because very little data has changed at that point.  A complete recovery in this example is achievable.

I’ve recovered from such scenarios a few times and thought the process should be documented to help others.  This blog posting came about as I felt that while different KB articles document the process in parts, I couldn’t find one that guides someone through the whole recovery process.

Some of the assumptions that I am making here are:

  • The failure is occurring on VM(s) with one or more snapshots, created either manually or via an automated mechanism, e.g. a backup solution
  • The virtual machine is displaying errors about inconsistent, corrupt or invalid snapshots
  • The person working through the issue is familiar with VMware operations and can deal with minor variations in the discussed scenario
  • The commands given for forcing the shutdown of a VM are for ESXi 5.x hosts (the syntax differs on other versions, but the process remains the same)

Virtual Machine Restore Process

Step 1: Save Virtual Machine Logs

The first action is to save the logs for this VM; these can be found in the virtual machine folder on the datastore.  This avoids losing potentially valuable diagnostic data in the event of a catastrophic failure.  Due to the state the virtual machine is in, it might not be possible to copy vmware.log, but the other log files should be copied directly from the datastore to a safe location.
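If shell access to the host is already available, the copy can be scripted.  The following is a minimal sketch rather than a prescribed method: the function name save_vm_logs and the timestamped folder name are made up for illustration, and it assumes the log files sit directly in the VM folder with a .log extension.

```shell
# Copy every *.log file from a VM folder into a new timestamped folder.
# Usage: save_vm_logs <vm_dir> <dest_root>
# Both arguments are placeholders for paths on your own datastores.
save_vm_logs() (
    vm_dir=$1
    dest="$2/vm-logs-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest" || exit 1
    for f in "$vm_dir"/*.log; do
        [ -e "$f" ] || continue          # nothing matched the pattern
        # A locked vmware.log may fail to copy; warn and carry on with the rest.
        cp -p "$f" "$dest"/ || echo "could not copy $f" >&2
    done
    echo "$dest"                          # print where the logs were saved
)

# Example call (paths are placeholders):
#   save_vm_logs "/vmfs/volumes/<Datastore Name>/<Machine Name>" "/vmfs/volumes/<Other Datastore>"
```

The function prints the folder it created, so the location can be noted alongside the other details gathered during the recovery.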

Step 2: Shut Down the Virtual Machine

This is to avoid any further damage to the current snapshots before a copy of the machine is made.  It’s possible for vCenter to lose control of the virtual machine in such situations, and power operations might not work from the VI Client.  If that happens, refer to the “Force Virtual Machine Shutdown Process” section near the end of this posting for techniques to force the shutdown of the machine.

Step 3: Make a copy of the Virtual Machine folder

Once the virtual machine is shut down, make a copy of the virtual machine folder to another location on the same or another datastore.  Name the folder something appropriate, e.g. <Machine Name>-Backup.

Note: A clone is not what is required and it probably won’t work in such a situation.

Step 4: Attempt to fix the snapshots

First check whether the datastore has enough space remaining; snapshots can become corrupted when the datastore runs out of space.  As there might be other snapshots growing in the background, estimate generously, and if there isn’t enough space, use Storage vMotion to migrate machines off that datastore until a safe level of headroom is available.

Once there is enough space available, try taking another snapshot, and if successful, try committing it.  This operation might fix the snapshot chain and consolidate all data into the disks.  If this process fails, then follow the remainder of the process to manually restore the machine from remaining snapshots.
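As a rough aid to the space check, the sketch below compares the size of the VM folder with the free space on the datastore.  The function names and the safety factor of 2 are arbitrary assumptions for illustration, not a VMware recommendation; the free-space figure is expected to be read off manually, for example from the output of df.

```shell
# Size of a folder in KB, as reported by du.
folder_kb() {
    du -sk "$1" | awk '{ print $1 }'
}

# Succeeds when free space is at least <factor> times the folder size.
# $1 = folder size in KB, $2 = free space in KB, $3 = safety factor (assumed, e.g. 2)
enough_headroom() {
    [ "$2" -ge $(( $1 * $3 )) ]
}

# Example (the free-space number is a placeholder you supply yourself):
#   enough_headroom "$(folder_kb "/vmfs/volumes/<Datastore Name>/<Machine Name>")" 500000000 2 \
#       && echo "headroom looks sufficient" || echo "free up space first"
```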

Step 5: Confirmation of existing virtual disk configuration

Go into the VM settings and confirm the number and names of the existing virtual disks.  As there are snapshots present, the disk(s) will be pointing to the last-known snapshot(s).  Also, make note of the datastore the machine resides on.

Step 6: Command-Line access to ESXi server

Gain shell access to an ESXi server in the cluster which can see the datastore with the virtual machine in question.  The ESXi server should also have access to the datastore where the repair will be carried out.  As SSH is disabled by default, you may have to start the service manually.

Note: Seek approval (if security policy requires it) before this is done.

Once SSH is enabled, use PuTTY (or a similar tool) to connect and log in using “root” credentials.

Step 7: Confirmation of snapshots present

Once logged in, change directory to:

/vmfs/volumes/<Datastore Name>/<Machine Name>

and run:

ls -lrt *.vmdk

to display all virtual disk components, oldest first.

Make note of what “Flat” and “Delta” disks are present.  While it can vary in certain situations, the virtual machine’s original disks will be named the same as the virtual machine name by default.  If there is more than one virtual disk present, it should have “_1” appended to the base name and so on.  If there are snapshots present, they will have “-000001” appended to each disk name for the first snapshot and “-000002” for the second and so on, by default.  Make note of all this information.
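To make the note-taking less error-prone, the listing can be reduced to just the descriptor files, grouped by snapshot number.  This is a sketch under the default naming described above; the function name list_descriptors is made up, and it will misclassify disks that were renamed away from those defaults.

```shell
# Print "<snapshot number> <descriptor file>" for each descriptor .vmdk in a folder.
# Files ending in -flat, -delta or -ctk are skipped; base disks are reported as 000000.
list_descriptors() (
    cd "$1" || exit 1
    for f in *.vmdk; do
        [ -e "$f" ] || continue
        case $f in
            *-flat.vmdk|*-delta.vmdk|*-ctk.vmdk) continue ;;
        esac
        case $f in
            *-[0-9][0-9][0-9][0-9][0-9][0-9].vmdk)
                n=${f%.vmdk}      # strip the extension
                n=${n##*-}        # keep the six digits after the last hyphen
                ;;
            *)  n=000000 ;;
        esac
        echo "$n $f"
    done | LC_ALL=C sort
)

# Example call (path is a placeholder):
#   list_descriptors "/vmfs/volumes/<Datastore Name>/<Machine Name>"
```

The highest number in the first column is the most recent snapshot set, which is where the repair in the next step starts.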

Step 8: Repair of the virtual disks

Start with the highest-numbered (most recent) set of snapshots and, for each disk in that set, run the following command, where <Source Disk> is the source snapshot:

vmkfstools -i <Source Disk> <Destination Disk>

Please note: <Source Disk> is the base .vmdk name, i.e. not the one with -flat, -delta or -ctk in the name.  <Destination Disk> is the new disk, into which all disk changes will be consolidated.  The new name should be similar to the source but not identical.  <Machine Name>-Recovered.vmdk is one example for the first disk.  Keep the same naming convention throughout for all disk names, e.g. <Machine Name>-Recovered_1.vmdk, <Machine Name>-Recovered_2.vmdk and so on.

For example:

vmkfstools -i <Machine Name>-000003.vmdk <Machine Name>-Recovered.vmdk

for the first disk from the third snapshot set.

vmkfstools -i <Machine Name>_1-000003.vmdk <Machine Name>-Recovered_1.vmdk

for the second disk in the same set and so on.

Repeat the process for all disks in the snapshot set identified earlier in step 7.  If the process is successful, move on to step 9.
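With several disks in a set, it can help to print the commands first and review them before running anything.  The sketch below only prints; it does not call vmkfstools.  The function name print_clone_commands and the machine name web01 are hypothetical, and the default disk naming described in step 7 is assumed.

```shell
# Print (do not run) one "vmkfstools -i" line per disk in a snapshot set.
# $1 = machine name, $2 = snapshot number such as 000003, $3 = number of disks
print_clone_commands() (
    name=$1
    snap=$2
    disks=$3
    i=0
    while [ "$i" -lt "$disks" ]; do
        # The first disk has no suffix; later disks use _1, _2 and so on.
        if [ "$i" -eq 0 ]; then suffix=""; else suffix="_$i"; fi
        echo "vmkfstools -i \"${name}${suffix}-${snap}.vmdk\" \"${name}-Recovered${suffix}.vmdk\""
        i=$((i + 1))
    done
)

# Example: a hypothetical VM called web01 with two disks, third snapshot set
#   print_clone_commands web01 000003 2
```

Each printed line can then be pasted into the shell one at a time, so that a failure on one disk is noticed before moving on to the next.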

If there is failure on one or more disks in the set, the following error message may be displayed:

Failed to clone disk: Bad File descriptor (589833)

If that error occurs, skip that disk and keep running the process for the other disks, as they might still be useful.  However, an incomplete set is unlikely to be acceptable for production use, so the next most recent snapshot set should be tried.  Follow the same process until all disks in one snapshot set are successfully consolidated into a new disk set.  If this is an investigation into the events leading up to the failure, additional sets might have to be consolidated in the same way.  By the end of this step, at least one complete set should have consolidated successfully.

Step 9: Restoration of the virtual machine

Using the “Datastore Browser”, create a new folder called “<Machine Name>-Recovered”, either on the same datastore or another.  Move the newly-created “Recovered” vmdk file(s) to the new folder.  Also, copy <Machine Name>.vmx and <Machine Name>.nvram to the new folder and rename both files to become <Machine Name>-Recovered.*

Download <Machine Name>-Recovered.vmx to the local machine and edit it in Wordpad or similar.  Replace all instances of <Machine Name>-00000x (where “x” is the last snapshot the machine’s disks are pointing to) with <Machine Name>-Recovered.  Repeat for other disks if present, e.g. _1, _2, and save the file.  This should make the .vmx match all newly-consolidated disks.  Rename the unedited copy in the new datastore folder to <Machine Name>-Recovered.vmx.bak and upload the edited <Machine Name>-Recovered.vmx into the same location.  Once uploaded, go to the “Datastore Browser”, right-click the vmx file and follow the standard process of adding a virtual machine to inventory, possibly naming it “<Machine Name>-Recovered”.
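The same search-and-replace can be done with sed instead of a text editor.  This is a sketch only: the function name retarget_vmx and the machine name web01 are hypothetical, the machine name is assumed to contain no characters that are special in a regular expression, and the result is written to standard output so the input file is left untouched.

```shell
# Rewrite disk references in a .vmx file so they point at the recovered disks.
# $1 = vmx file, $2 = machine name, $3 = snapshot number (e.g. 000003)
retarget_vmx() {
    # First expression: second and later disks (name_1-000003.vmdk -> name-Recovered_1.vmdk).
    # Second expression: the first disk (name-000003.vmdk -> name-Recovered.vmdk).
    sed -e "s/$2\(_[0-9][0-9]*\)-$3\.vmdk/$2-Recovered\1.vmdk/g" \
        -e "s/$2-$3\.vmdk/$2-Recovered.vmdk/g" "$1"
}

# Example (file names are placeholders):
#   retarget_vmx web01-Recovered.vmx web01 000003 > web01-Recovered.vmx.new
```

Comparing the output with the original using diff before uploading shows exactly which lines changed, which should be the disk file name entries only.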

Once in the list, edit the VM settings and disconnect the network adapter.  It might require connecting to a valid VM network first but the main thing is that the network adapter should be disconnected.

Once done, take a snapshot of the VM and power the machine up.  At this point, a “Virtual Machine Question” will come up.  Answer it by selecting the “I copied it” answer.  If the disk consolidation operation was successful for all disks, the machine will come up successfully.  The machine can now be inspected and put into service or investigated for a problem.

Once operation of the machine has been tested and the decision has been made to bring it into service, shut down the virtual machine, reconnect the virtual network adapter to the correct network and power it back up.  After boot is complete, log in to the machine to confirm service status, network connectivity, domain membership and other operations.  If all operations are as expected, the restore process is complete and the snapshot can be deleted.

Force Virtual Machine Shutdown Process

First Technique: Using vim-cmd to identify and shutdown the VM

While connected to the ESXi shell and logged in as “root”, run the following command to get a list of all VMs registered on the target host:

vim-cmd vmsvc/getallvms

The command will return all the VMs registered on the host, whether powered on or not.  Note the Vmid of the VM in question.  Get the current state of that VM as seen by the host first, by running:

vim-cmd vmsvc/power.getstate <Vmid>

If the VM is still running, try to shut it down gracefully (this relies on VMware Tools running in the guest) using:

vim-cmd vmsvc/power.shutdown <Vmid>

If the graceful shutdown fails, try the power.off option:

vim-cmd vmsvc/power.off <Vmid>
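On a host with many VMs, picking the Vmid out of the list by eye is easy to get wrong.  The sketch below filters the getallvms output by name.  The function name vmid_for_name is made up, the output is assumed to have the Vmid in the first column and the name in the second, and VM names containing spaces are not handled.

```shell
# Read "vim-cmd vmsvc/getallvms" output on stdin and print the Vmid
# of the VM whose Name column equals $1. The header line is skipped.
vmid_for_name() {
    awk -v name="$1" 'NR > 1 && $2 == name { print $1 }'
}

# Example on the host (web01 is a hypothetical machine name):
#   vim-cmd vmsvc/getallvms | vmid_for_name web01
```

If nothing is printed, the VM is either registered on a different host or has a name that does not match exactly.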

Second Technique: Using ps to identify and kill the VM

Warning: Only use the following process as a last resort.  Terminating the wrong process could render the host non-responsive.

While connected to the ESXi shell and logged in as “root”, list the vmx processes on the current host by running:

ps | grep vmx

That will return a number of lines.  Identify the entries for the target machine, such as vmx-vcpu-0:<Machine Name>.  Make note of the number in the second column, which is the parent process ID.  For most of the lines returned for that machine, the second column should hold the same number.  One line, named just “vmx”, will contain that number in both the first and second columns.  That number is the ProcessID of the target virtual machine.
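Because killing the wrong process is costly, it may be worth letting awk do the matching.  The sketch below only prints a number and never kills anything.  The function name vmx_pid_for is made up; it assumes the ps layout described above (ID in the first column, parent ID in the second, name in the third) and a machine name without spaces.  Check the result against the raw ps output before using it.

```shell
# Read "ps" output on stdin and print the parent process ID recorded on the
# line named vmx-vcpu-0:<machine name>, which is the vmx process for that VM.
vmx_pid_for() {
    awk -v world="vmx-vcpu-0:$1" '$3 == world { print $2; exit }'
}

# Example on the host (web01 is a hypothetical machine name):
#   ps | grep vmx | vmx_pid_for web01
```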

Once identified, terminate the process using the following command:

kill <ProcessID>

Wait for a minute or so as it might take some time.  If after that, the VM hasn’t powered-off, then run the following command:

kill -9 <ProcessID>

The method in this section will not result in a graceful shutdown, but it should terminate the machine, allowing the recovery to take place.  If the machine still cannot be terminated, further investigation will be required on the host, and the only option left will be to vMotion the other virtual machines off this host and reboot it.

http://xtravirt.com/recover-vms-with-corrupt-snapshots/blog
