Snapshot removal issues with BackupExec and locked files

<Status update Nov 22nd 2012>
Check out the comments after reading this article, some interesting points there. The case at Symantec never got resolved to this very day. We’re also still using Backup Exec 2010 R3.
</Status update>

<Update Sept 12th 2013>
The issue is now finally resolved since our backup admins upgraded to Backup Exec 2010 R3 Service Pack 3.
</Update>

We are currently facing an issue with a small number of VMs, where  snapshots created by our backup software, Symantec BackupExec, can’t be removed properly because of locked files (neither through BackupExec nor manually).
In vCenter, the warning “Virtual machine disks consolidation failed.” is being logged as a simple event (I might create an alarm for this now that I think about it).

The problem

You will not see these snapshots in the snapshot manager (fix this for good, VMware!), but only on the filesystem.
Unfortunately, unlike with vSphere 4 there is no obvious, specific error. The remove snapshot task completes successfully and you’ll only notice on the VM summary page that it needs disk consolidation.

Pre-vSphere 5, the task would fail with an error about how it couldn’t consolidate the snapshot due to a locked file. This info is now only found in the vmware.log file (and surely vmkernel log files) of the VM in its datastore:

# grep -i lock vmware.log
2012-06-04T08:06:21.069Z| vcpu-0| AIOGNRC: Failed to open ‘/vmfs/volumes/4fb20a9a-1b7f7c20-0363-002481e443c1/SomeVM002/SomeVM002-flat.vmdk’ : Failed to lock the file (400000003) (0x2013).2012-06-04T08:06:21.069Z| vcpu-0| AIOMGR: AIOMgr_OpenWithRetry: Descriptor file ‘/vmfs/volumes/4fb20a9a-1b7f7c20-0363-002481e443c1/SomeVM002/SomeVM002-flat.vmdk’ locked (try 0)2012-06-04T08:06:22.580Z| vcpu-0| DISKLIB-VMFS : “/vmfs/volumes/4fb20a9a-1b7f7c20-0363-002481e443c1/SomeVM002/SomeVM002-flat.vmdk” : failed to open (Failed to lock the file): AIOMgr_Open failed. Type 3
2012-06-04T08:06:22.580Z| vcpu-0| DISKLIB-LINK : “/vmfs/volumes/4fb20a9a-1b7f7c20-0363-002481e443c1/SomeVM002/SomeVM002.vmdk” : failed to open (Failed to lock the file).
2012-06-04T08:06:22.580Z| vcpu-0| DISKLIB-CHAIN : “/vmfs/volumes/4fb20a9a-1b7f7c20-0363-002481e443c1/SomeVM002/SomeVM002.vmdk” : failed to open (Failed to lock the file).
2012-06-04T08:06:22.580Z| vcpu-0| DISKLIB-LIB : Failed to open ‘/vmfs/volumes/4fb20a9a-1b7f7c20-0363-002481e443c1/SomeVM002/SomeVM002.vmdk’ with flags 0x20a Failed to lock the file (16392).
2012-06-04T08:06:22.580Z| vcpu-0| SNAPSHOT:Failed to open disk /vmfs/volumes/4fb20a9a-1b7f7c20-0363-002481e443c1/SomeVM002/SomeVM002.vmdk : Failed to lock the file (16392)
2012-06-04T08:06:22.601Z| vcpu-0| DISK: Failed to open disk for consolidate ‘/vmfs/volumes/4fb20a9a-1b7f7c20-0363-002481e443c1/SomeVM002/SomeVM002-000004.vmdk’ : Failed to lock the file (16392) 53452012-06-04T08:06:22.657Z| vcpu-0| Vix: [675925 vigorCommands.c:577]: VigorSnapshotManagerConsolidateCallback: snapshotErr = Failed to lock the file (5:4008)

Nice, isn’t it? Creating a new snapshot and selecting delete all snapshots will not work because it’s still locked. It will only increase the number of delta files for your VM.

Digging down to the root of the issue

With vmkfstools -D [File] you can get some information about file ownership and locks:

# vmkfstools -D SomeVM002-flat.vmdk
Lock [type 10c00001 offset 243873792 v 293, hb offset 3891200gen 7, mode 2, owner 00000000-00000000-0000-000000000000 mtime 695517 nHld 1 nOvf 0]RO Owner[0] HB Offset 3891200 4fb100c0-2ef7ae6a-7412-002581b2ed3bAddr , gen 128, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 75161927680, nb 71680 tbz 6929, cow 0, newSinceEpoch 0, zla 3, bs 1048576
 # vmkfstools -D SomeVM002-000001-delta.vmdk
Lock [type 10c00001 offset 243679232 v 1242, hb offset 3891200gen 7, mode 2, owner 00000000-00000000-0000-000000000000 mtime 708559 nHld 1 nOvf 0]
RO Owner[0] HB Offset 3891200 4fb100c0-2ef7ae6a-7412-002581b2ed3bAddr , gen 1227, links 1, type reg, flags 0, uid 0, gid 0, mode 600len 84033536, nb 81 tbz 0, cow 0, newSinceEpoch 0, zla 1, bs 1048576

For the original flat-VMDK file as well as all the non-most-recent delta-VMDKs, we can ascertain that the running host is the Read-Only Owner (“RO Owner”). Notice the last portion of the hex-string in that line, (“002581b2ed3b”), which corresponds to the physical MAC address of the first physical NIC in the host:

# esxcfg-nics -l
Name PCI Driver Link Speed Duplex MAC Address MTU Description
vmnic0 0000:02:00.00 bnx2 Up 100Mbps Full 00:25:81:b2:ed:3b1500 Broadcom Corporation Broadcom NetXtreme II BCM5709 1000Base-T
[…]

For the most recent delta, so the one the VM is currently running from, the host should be normal full owner like this:

# vmkfstools -D SomeVM002-000004-delta.vmdk
Lock [type 10c00001 offset 243924992 v 1257, hb offset 3891200
gen 7, mode 1, owner 4fb100c0-2ef7ae6a-7412-002581b2ed3b mtime 708544 nHld 0 nOvf 0]
Addr , gen 1250, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 151142400, nb 145 tbz 0, cow 0, newSinceEpoch 0, zla 1, bs 1048576

In any case, finding out where exactly this lock came from seemed difficult, but you can do an esxcli-style netstat and check if your backup server (in this case 10.5.1.20) has an active TCP-connection to the ESXi host:

# esxcli network ip connection list | grep 902
tcp 0 0 10.1.10.123:902 10.5.1.20:50657 ESTABLISHED 0

What’s interesting is that the ESXi host will hold the connection open seemingly forever, even though the backup server closed it already long ago. With tcpdump-uw on the host we can observe how it keeps resending the same packet every 75 seconds from port 902 to the backupserver -without any reply. TCP-retransmissions are cool and all, but you ought to give up after a certain point:

# tcpdump-uw -nnvvvXX -i vmk0 -s1500 host 10.5.1.20
tcpdump-uw: WARNING: SIOCGIFINDEX: Invalid argument
tcpdump-uw: listening on vmk0, link-type EN10MB (Ethernet), capture size 1500 bytes
10:16:50.058661 IP (tos 0x0, ttl 64, id 50511, offset 0, flags [DF], proto TCP (6), length 40)
10.1.10.123.902 > 10.5.1.20.50657: Flags [.], cksum 0x2608 (incorrect -> 0xdb76), seq 207511068, ack 1591352116, win 128, length 0
0x0000: 0010 dbff 2180 0050 5670 831a 0800 4500 ….!..PVp….E.
0x0010: 0028 c54f 4000 4006 4f93 0a6e 0517 0afa .(.O@.@.O..n….
0x0020: 0b6f 0386 c5e1 0c5e 5e1c 5eda 1b34 5010 .o…..^^.^..4P.
0x0030: 0080 2608 0000 ..&…
10:18:05.070488 IP (tos 0x0, ttl 64, id 56228, offset 0, flags [DF], proto TCP (6), length 40)
10.1.10.123.902 > 10.5.1.20.50657: Flags [.], cksum 0x2608 (incorrect -> 0xdb76), seq 0, ack 1, win 128, length 0
0x0000: 0010 dbff 2180 0050 5670 831a 0800 4500 ….!..PVp….E.
0x0010: 0028 dba4 4000 4006 393e 0a6e 0517 0afa .(..@.@.9>.n….
0x0020: 0b6f 0386 c5e1 0c5e 5e1c 5eda 1b34 5010 .o…..^^.^..4P.
0x0030: 0080 2608 0000 ..&…
10:19:20.082332 IP (tos 0x0, ttl 64, id 63899, offset 0, flags [DF], proto TCP (6), length 40)
10.1.10.123.902 > 10.5.1.20.50657: Flags [.], cksum 0x2608 (incorrect -> 0xdb76), seq 0, ack 1, win 128, length 0
0x0000: 0010 dbff 2180 0050 5670 831a 0800 4500 ….!..PVp….E.
0x0010: 0028 f99b 4000 4006 1b47 0a6e 0517 0afa .(..@.@..G.n….
0x0020: 0b6f 0386 c5e1 0c5e 5e1c 5eda 1b34 5010 .o…..^^.^..4P.
0x0030: 0080 2608 0000 ..&…
[…]

It appears that as long as the ESXi host has that TCP-connection, the lock is active. I haven’t noticed a specific amount of time required for the connection/lock to be removed by itself, but I saw it disappearing after about 45 minutes on ESXi 4.1 and at least ~6 hours on ESXi 5. Usually, the connection isn’t listed anymore the next day and I can consolidate the snapshots successfully then, though I think I had a case where it took even longer.

Workaround

To workaround this annoying issue sooner, you can restart the management agents on the host with “services.sh restart” or from the DCUI. After this you should be able to consolidate the disks  with the new vSphere 5 consolidate option or get rid of them all with a new create snapshot and subsequent delete all snapshots run.
If the backup server still has the active TCP-connection to the host (check with “netstat -anto | find “902” “) you could try restarting the backup service holding the connection too (check the PID from netstat and look for the respective process in taskmgr by adding the PID row if you’re unsure).

Manually creating and removing snapshots is no problem (unless there is a lock already), the issue really only occurs with snapshots created by BackupExec and only if you use the file-level restore (GRT Granular Recovery Technology) functionality.
The issue was present with BackupExec 2010 R3 in vSphere 4.1 U1 and is still present after our update to vSphere 5 U1 (and all currently released patches). For some reason though, only 2 VMs are affected by the issue. It can be reproduced and researched with less “danger” by using clones of the affected VMs too.

This KB article seems to resemble issue I outlined here the closest, though you certainly don’t need to go as far for resolving it like rebooting the whole host:
KB1019849 Unable to remove locked snapshots

If you run into a similar issue around locked files, you probably want to read this article too:
KB10051 Investigating virtual machine file locks on ESX/ESXi

Current status

After a frustrating while of  arguing with the Symantec support, they finally acknowledged it’s a bug on their side. Apparently, it incorrectly issues 2 snapshot remove operations, which end up blocking each other with one terminating without releasing the lock properly. We are still waiting on a fix for this though.

Update

Advertisements

20 thoughts on “Snapshot removal issues with BackupExec and locked files

  1. Hello,

    we still had the same issue with VMware vSphere 5.0 and Backupexec 2010 SP1a all hotfixes. Do you have resiúlts from symantec?

    Your workaround works fine.

    mfg Bjoern

  2. Hi, I beleive i am having this same issue, ESX 5U1 and NBU 7.1. Did you get any further with a resolution ?

    Ive been trying to figure out if NBU 7.5 uses the latest VDDK ?

    • Hi, no nothing as of yet.
      On your backupserver in the backupsoftware’s installation directory, there should be the VDDK libraries with which the software ships by default. Search for “vmware-vdiskmanager.exe” or “vmware-mount.exe” for example. You can compare the file version information with the ones of the manual VDDK installation in your “C:\Program Files (x86)\VMware\VMware Virtual Disk Development Kit\bin” folder. There are also the 64bit libraries zipped up in vddk64.zip.

      The version of the latest 32bit vmware-vdiskmanager.exe included in the VDDK 5.0 U1 is 3.1.0.2661, Product build 5.0.0 build-614080. If your NBU installation has older versions, which I assume it will, you could try overwriting all files manually from the VDDK /bin folder. We haven’t tried that with our Backupexec installation yet though.

      • Hi,

        I have logged a call with symantec and am going over the logs currently, symantec pointed me at http://www.symantec.com/business/support/index?page=content&id=TECH167659 This fits as its a 1.5tb VM on a 2TB lun but snapshots mean the provisioned size in over 3TBs during the backup process. I enabled my VxMS logs so am waiting for them to populate.

        Regarding the VDDK symantec pointed me to http://www.symantec.com/business/support/index?page=content&id=TECH145168 which means i am running 1.2, i have asked them what version of the VDDK runs in 7.5

        As the backup server is production i wont be changing the VDDK files unless i get instruction from symantec on how and making sure its covered by support. Dont want to break all my backups :(

        Ill let you know when i get further info back from symantec

      • Judging by this article, 7.5 should use the VDDK 5.0 but not 5.0 U1 which is supposed to fix the snapshot removal problem.

        Yes I agree with you about replacing the files. What the Symantec support in our case said was basically just “update the VDDK because I somehow found these release notes” but not how to update them or what implications this could have on his backupsoftware. I don’t think he really knew what he was talking about and now we’re already waiting for weeks for a reply again.

        Also, this article here looks pretty spot on to our issue (or yours, as we’re on Backupexec only).

  3. Hi,

    Looks like my issue was caused by a bug as per TECH167659

    We upgraded to 7.1.0.4 and the virtual machine backs up successfully. Strange that historically it had functioned so i still dont know what triggered the bug.

    I plan to move to 7.5.0.3 in the next week or so and ive asked the question if the point release upgraded the VDDK as well.

    Have you had any movement in your support case?

    Thanks

    • Speaking of triggers, in our case the issue also appeared completely out of the blue on just a single VM. Btw. our problematic VM is only 30GB in size.
      There has been no progress in our case and I don’t really expect anything anymore at this point.

  4. Thanks for this blog post! It was the only one I could find that described how to troubleshoot and fix the actual problem.

  5. I hate Symantec support with a passion, we’ve had this issue as well but seems to rarely happen anymore so I just have been doing the consolidation when needed. However we have been unable to do differential VM backups at all on any VM, it gives some message about it not being configured correctly or something and I’ve had a ticket open with them since July on it with no fix in sight yet. They’ve had me do all sorts of things like “maybe you need to update the VMTools or remove/reinstall them” and such but this is a fresh new VMWare 5.01U environment that we’ve only had up for a few months so nothing should be junked up on it. I swear they their support are a bunch of idiots!

    • Oh also we are still on 2010 R3 because when 2012 was finally released I upgraded to it and it plain didn’t work, it was giving me some stupid “credentials are invalid” message when trying to backup anything even though the credentials were 100% correct. Worked with support and couldn’t figure it out, so then I installed 2012 R3 to a freshly installed Server 2008 server with nothing else on it as a new install, and setup a backup job and tested and same exact problem! Never could figure it out so stayed with 2010, 2012 looked crappy though imo, too bloated and harder to find the quick stuff you needed.

  6. Pingback: Invoke-VMScript issue: “The guest operations agent could not be contacted” | alpacapowered

  7. Quest vRanger 6.0.1 (latest at the time of writing) still has this same problem which is what led me here. Kind of gives credence to this not being a Symantec issue but rather an issue with the VDDK.

    • It’s important to note however, that if this is in fact a bug in the 5.0 VDDK and fixed in the 5.0 U1 VDDK as Symantec pointed at in our case, it’s still the vendor’s responsibility to provide an update of their backup product packaged with the updated VDDK version.

  8. We have the same problem but with vsphere 5.1 and unitrends as the backup appliance. have anyone call vmware about this? first backup is fine but the second backup fail. same error.

  9. @alpacapowered Did you ever get this resolved? I’m seeing this on my 2010 R3 as well and am really looking for any kind of resolution.

    I even tried updating the VDDK files in the Backup Exec\VMware\VixDiskLib folder and had no luck with it. Same issue, a VM that needs consolidation as BE couldn’t take care of it itself.

    So, are you dealing with this today still?

    • No, we unfortunately never got it resolved to this very day. The guy in charge of our backup (talk about silos) gave up the call with Symantec and we live with having to consolidate this single VM.

  10. hi, i really appreciated the investigations you did to find the cause.
    I don’t use Symantec but another VDDK based software..From what i understand, all vm backup software just talks to the ESXi host by relying on the VDDK API.

    Me too investigated this issue as it is one of the causes of nightmares if not carefully monitored.

    According to me, the problem occurs when the backup software request the ESXi host to consolidate the snapshots but the base disk is locked perhaps by the software itself (as it was copying the disk) or by ESXi itself (as you mentioned above).

    Before the consolidation, i think the software could have checked whether the base disk is locked or not.

    I asked same on the vmware forum here : https://communities.vmware.com/thread/494734

    Cheers,
    Vicky | http://www.techvicky.com

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s