Thursday, December 4, 2014

The quality of VMware backups

Most agentless VMware backup solutions rely on VMware Tools to  quiesce a virtual machine before the actual backup starts.   This process is invoked by taking a temporary snapshot with the quiesce option enabled.   

This step ensures data integrity. A snapshot without this important step is only crash-consistent. This means that all files that were open will still exist, but they are not guaranteed to be free of incomplete I/O operations or data corruption.

Even so-called storage-based snapshots take a VMware snapshot first to trigger VMware Tools.

If something goes wrong in this step, you get inconsistent backups.  Changing to another backup solution/storage vendor will not necessarily solve this problem if the problem is in the quiesce mechanism of the virtual machine.

VMware offers 3 mechanisms to quiesce:  
  • the sync driver
  • the vmsync module
  • Microsoft's Volume Shadow Copy (VSS) service.

The Sync Driver

Targeted VMs:  older Windows versions that don't have VSS

The sync driver holds incoming I/O writes while it flushes all dirty data to disk, which makes the file systems consistent after a while. If a lot of I/O is taking place on the virtual machine and the quiescing takes a long time, this can cause serious problems for the application causing the I/O. It is advised to put so-called pre-freeze and post-thaw scripts in place to gracefully shut down, pause, or otherwise quiet this application and so reduce the I/O during the snapshot.

  • For Windows guests on ESXi 5.5 these scripts reside in c:\windows and have the fixed names pre-freeze-script.bat and post-thaw-script.bat.
  • For Linux the files are /usr/sbin/pre-freeze-script and /usr/sbin/post-thaw-script. An interesting Linux command to run in these files is fsfreeze.
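
As an illustration of what such a pair of Linux hooks could do with fsfreeze, here is a minimal Python sketch. It is not taken from VMware's documentation: the mount point /var/lib/mysql, the function names, and the idea of sharing one helper between both hooks are assumptions made for the example. The only external command it relies on is fsfreeze with its --freeze and --unfreeze options.

```python
#!/usr/bin/env python3
"""Sketch of a helper shared by the pre-freeze and post-thaw hooks."""
import subprocess

# Hypothetical: dedicated data volumes to freeze. Do not list "/" here;
# a frozen root file system can hang the hook itself.
MOUNT_POINTS = ["/var/lib/mysql"]


def fsfreeze_commands(action, mount_points):
    """Return the fsfreeze command lines for 'freeze' or 'thaw'."""
    if action == "freeze":
        return [["fsfreeze", "--freeze", mp] for mp in mount_points]
    if action == "thaw":
        # Undo the freezes in the reverse of the order they were taken.
        return [["fsfreeze", "--unfreeze", mp] for mp in reversed(mount_points)]
    raise ValueError(f"unknown action: {action!r}")


def run_hook(action, mount_points=MOUNT_POINTS, runner=subprocess.run):
    """Run every command; 'runner' can be replaced to dry-run the logic."""
    for cmd in fsfreeze_commands(action, mount_points):
        runner(cmd, check=True)
```

In this sketch /usr/sbin/pre-freeze-script would call run_hook("freeze") and /usr/sbin/post-thaw-script would call run_hook("thaw"). Test it on a non-production VM first: if the thaw never runs, the volume stays frozen.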

The vmsync module

Targeted VMs:  Linux systems

This module must be explicitly enabled when installing VMware Tools on Linux, and it is considered experimental. Read: NOT SUPPORTED.

Microsoft's Volume Shadow Copy (VSS) service

This is probably what's used on your Windows VMs right now.

Targeted VMs:  all recent Windows versions from Windows 2003 on, but the implementation varies by version:
  • Vista and 7:  file-system-consistent quiescing
  • 2003:  application-consistent quiescing
  • 2008, 2008 R2, 2012:  application-consistent quiescing, but only if certain conditions are met.
These conditions are (taken straight from the docs):
  • Virtual machine must be running on ESXi 4.1 or later.
  • The UUID attribute must be enabled. It is enabled by default for virtual machines created on 4.1 or later.
  • The virtual machine must use SCSI disks only and have as many free SCSI slots as the number of disks. Application-consistent quiescing is not supported for virtual machines with IDE disks.
  • The virtual machine must not use dynamic disks.
When some ignorant helpdesk guy tells you to change the enableUUID attribute to False to get rid of Windows errors, you don't meet the conditions anymore and your backups will be downgraded to crash-consistent.
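The four conditions are easy to turn into a checklist you can run against your inventory. The sketch below is only that: the VmFacts record and its field names are hypothetical, and you would have to fill them from whatever inventory source you have. It does not talk to vCenter or to the guest.

```python
from dataclasses import dataclass, field


@dataclass
class VmFacts:
    """Hypothetical inventory record; the field names are made up for this sketch."""
    esxi_version: tuple                 # e.g. (5, 5)
    enable_uuid: bool                   # the UUID attribute from the list above
    disk_buses: list = field(default_factory=list)  # one entry per disk: "scsi" or "ide"
    free_scsi_slots: int = 0
    uses_dynamic_disks: bool = False


def unmet_conditions(vm):
    """Return the reasons why application-consistent quiescing is not available."""
    problems = []
    if vm.esxi_version < (4, 1):
        problems.append("host is older than ESXi 4.1")
    if not vm.enable_uuid:
        problems.append("UUID attribute is disabled")
    if any(bus != "scsi" for bus in vm.disk_buses):
        problems.append("non-SCSI disk present")
    if vm.free_scsi_slots < len(vm.disk_buses):
        problems.append("fewer free SCSI slots than disks")
    if vm.uses_dynamic_disks:
        problems.append("dynamic disks in use")
    return problems
```

An empty list means all four conditions from the docs are met; anything else tells you which one to fix before trusting the backup.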
So what happens in the background?
With the installation of VMware Tools, VMware provides the guest OS with a VSS requestor.
VMware Tools is responsible for initiating the VSS snapshot process as the VSS requestor, but the VSS mechanism itself is designed and provided by Microsoft. Note that Windows Server Backup and the Bacula Windows client are also requestors. The VSS requestor sets up the overall configuration for the backup operation, including whether the snapshot should be performed in component mode or not, whether to take a snapshot with a bootable system state, and whether the snapshot should be for a full copy or differential backup.
The VSS provider is the component that creates and maintains the shadow copies. Microsoft ships one with Windows, and this is the one VMware uses. Other VSS providers exist (from storage array vendors, for example). To check whether you are using the Microsoft VSS provider, run vssadmin list providers in the virtual machine.
Writers are application-specific software components that ensure application data is ready for shadow copy creation (e.g. an Active Directory writer, an Exchange writer, ...). To list all writers and their state on the virtual machine, run vssadmin list writers. Usually, when a writer is available, the application is considered VSS-aware.
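If you have many VMs, reading the vssadmin list writers output by eye gets old quickly. The following sketch parses that output and reports the writers that are not stable or that carry an error. It assumes the English-language output layout (Writer name, State, Last error); the sample text and the writer IDs in it are invented for the example, and the function names are hypothetical.

```python
import re

# Invented sample in the layout printed by an English-language Windows.
SAMPLE = """\
Writer name: 'System Writer'
   Writer Id: {00000000-0000-0000-0000-000000000001}
   State: [1] Stable
   Last error: No error

Writer name: 'SqlServerWriter'
   Writer Id: {00000000-0000-0000-0000-000000000002}
   State: [8] Failed
   Last error: Non-retryable error
"""


def parse_writers(text):
    """Turn 'vssadmin list writers' output into a list of dicts."""
    writers = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"Writer name: '(.*)'$", line)
        if match:
            current = {"name": match.group(1), "state": None, "last_error": None}
            writers.append(current)
        elif current is not None and line.startswith("State:"):
            current["state"] = line.split(":", 1)[1].strip()
        elif current is not None and line.startswith("Last error:"):
            current["last_error"] = line.split(":", 1)[1].strip()
    return writers


def unhealthy_writers(writers):
    """Names of writers that are not stable or that report an error."""
    return [
        w["name"]
        for w in writers
        if w["state"] != "[1] Stable" or w["last_error"] != "No error"
    ]
```

On a real guest you would feed it the captured output of the command instead of SAMPLE, for example via subprocess.run(["vssadmin", "list", "writers"], capture_output=True, text=True) from an elevated prompt.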
Depending on the OS, VMware Tools initiates VSS quiescing using one of these contexts:
  • VSS_CTX_BACKUP context (this is the standard backup context of VSS) for application quiescing capable guests with backup state set to select components, backup bootable system state with backup type VSS_BT_COPY and no partial file support.  Files on disk will be copied to a backup medium regardless of the state of each file's backup history, and the backup history will not be updated.  All writers and all components are involved by default.  A mechanism to exclude certain writers exists.
  • VSS_CTX_FILE_SHARE_BACKUP context for file system quiescing capable guests.  There is no writer involvement.
Currently there is no way to control any of these parameters.

By now you understand VSS brings together and orchestrates technologies from different parties (storage systems, OS, backup tools, VSS aware software...).  So a lot can go wrong with VSS!  And all the Linux friends need to be very quiet as there is no comparable system in Linux.
This TechNet article explains VSS in more depth.

VMware or your backup tool will tell you the snapshot was created successfully, but imho the only way to really tell is to check Event Viewer on the virtual machine. I usually make a custom filter to quickly find these messages. Look for things like timeouts, errors and warnings, and google them. The quality of your backup could depend on it!
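
The same kind of filter can be scripted. The sketch below builds an Event Log XPath query for recent critical, error and warning events and wraps it in a wevtutil command line. It is only a starting point under a few assumptions: that the events of interest are logged by the provider named VSS in the Application log, that a 24-hour window is enough, and the function names are made up for the example. Nothing is executed; the functions only return the query and the argument list.

```python
def vss_event_query(hours=24, providers=("VSS",)):
    """Build an XPath filter for recent critical/error/warning events."""
    provider_expr = " or ".join(f"@Name='{p}'" for p in providers)
    millis = hours * 60 * 60 * 1000  # timediff() works in milliseconds
    return (
        f"*[System[Provider[{provider_expr}]"
        " and (Level=1 or Level=2 or Level=3)"
        f" and TimeCreated[timediff(@SystemTime) <= {millis}]]]"
    )


def wevtutil_command(log="Application", hours=24, max_events=50):
    """Return the wevtutil invocation as an argument list (not executed here)."""
    return [
        "wevtutil", "qe", log,
        f"/q:{vss_event_query(hours)}",
        "/f:text", "/rd:true", f"/c:{max_events}",
    ]
```

Pass the result of wevtutil_command() to subprocess.run inside the guest, or paste the output of vss_event_query() into the XML tab of a custom view in Event Viewer.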

Next time I'll present you a specific case.
