Troubleshooting Data Domain Replication

Recently I was tasked with troubleshooting replication with a pair of Data Domains for a customer. I had not seen the interface for Data Domain before, let alone configured or troubleshot one. After some training by Ryan Garner (http://rgmoney.wordpress.com), I set about trying to figure out what was going on.

Background:

Data Domain storage is really one large file system on the bottom. It’s designed this way to allow for the super awesome deduplication and compression functions to do their job. You logically separate this filesystem using mTrees. Think of an mTree like a partition on a drive, but far less stringent. You can “fast copy” data between mTrees, which really only tells the destination mTree where the original data lives (e.g. you can “copy” 35 TB of data in 1.5 seconds because it’s all pointers and not a duplication of the data). Underneath the mTrees, you have directories. These operate just like directories on any other device, and you can build network shares from them.

Replication is built using what are called contexts. A replication context is a source and a destination. In the past, directory replication was the way that Data Domain configured replication. You gave the context a source directory and it created a directory on the destination and copied the data. With later versions of DD-OS, mTree replication came along and allowed you to replicate everything in that mTree. mTree replication uses snapshots to show what changed between the last sync and now. It’s a more efficient way to replicate data, but much more sensitive to network problems. Each context replicates the post-compression data (which is a great thing) from source to destination. 

I highly recommend using the CLI for troubleshooting Data Domain. The GUI is nice for configuration, but to actually see what’s going on and where errors are occurring you need to dive into the CLI.

Some show commands:

  • replication show config
    • This command shows you the configuration of the replication contexts.
  • replication show stats
    • This command shows you the replication statistics for each context. The number to keep an eye on is “Pre-Comp Remaining”. This is the number of pre-compression bytes that still need to be replicated. Remember what I said earlier about how the replication contexts move the data post-compression? That’s why this number is important. The amount of compression you’ll get per file varies, so having a real amount of data to watch for changes is helpful.
  • replication throttle show
    • If you’re replicating data, you’re likely doing it over a limited WAN link. As nice as it would be to have gigabit WAN links everywhere, the reality is that you’re going to have something in the 10-100 Mbps range. Data Domain allows you to throttle the replication traffic to prevent saturating your WAN link (which is something it’s more than happy to do). Note: Replication throttling happens on a per-link basis, so if you run your replication traffic through a port channel, divide the bandwidth you’d like to use by the number of links. If you want to limit your bandwidth to 40 Mbps and you have 4 links in your port channel, set the bandwidth throttle to 10 Mbps. You’ll also need to convert Mbps to KBps, as that’s the unit the replication throttles take. Easy way to do this: Google “40 Mbps in KBps” and the answer is right at the top of the results. You can also set throttles by time. For example, you can set it to 20 Mbps during the day and then ramp up to 40 Mbps after business hours.
  • iostat 2
    • This will show you I/O stats per link and per protocol (NFS/CIFS) updating on a 2 second interval. This plays into the previous command to see throttle changes.
    • system show stats view net interval 2
      • This is a more granular view that just shows network traffic by link on the same 2 second interval
  • log view debug/ddfs.info
    • This command shows the log file for replication. There is a lot of stuff in this file, so it can be tough to wade through. Thankfully, you can search in this log file using /. Here are some example searches:
      • /07/01 16:01
        • This will send you to the first mention of 07/01 16:01, skipping entire days worth of logs and getting you directly to the time you’re looking for.
      • /ctx 1
        • This will show you mentions of “ctx 1”, which is the first replication context shown in the “replication show stats” command
    • Pressing N will go to the next mention
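
The port channel throttle math from the “replication throttle show” note can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and it assumes decimal units (1 KBps = 1,000 bytes per second), which is what the Google conversion gives you. Check what your DD OS version expects before plugging the number in.

```python
def per_link_throttle_kbps(total_mbps: float, links: int) -> float:
    """Return the per-link throttle in KBps for a total bandwidth budget in Mbps.

    Hypothetical helper for illustration; not part of any Data Domain tooling.
    """
    if links < 1:
        raise ValueError("links must be at least 1")
    # Throttling is applied per link, so split the budget across the links.
    per_link_mbps = total_mbps / links
    # Assumption: decimal units. 1 Mbps = 1,000,000 bits/s = 125,000 bytes/s = 125 KBps.
    return per_link_mbps * 125


# 40 Mbps total across a 4-link port channel is 10 Mbps per link.
print(per_link_throttle_kbps(40, 4))  # 1250.0
```

So for the 40 Mbps example with 4 links, each link gets 10 Mbps, which works out to 1250 KBps under that assumption.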

Hope this helps when trying to troubleshoot Data Domain problems!

Creating and Mapping LUNs on XtremIO

I’ve recently had the opportunity to work with EMC’s XtremIO array as part of a customer install. It’s really simple to manage and changes how you think about storage and latency. In this post, I’ll go over creating and mapping LUNs two ways: GUI and CLI.

 

GUI Configuration

The GUI of XtremIO is a pretty nice Java console called XtremIO Management Station (XMS). I mean that in the nicest way possible. I have seen issues with disconnects over slower VPN links, but on-site on the same network it was as fast as a native app. It also handles disconnects really well. Where Unisphere and UCS Manager basically kick you out and make you re-login, XMS handles the disconnect and reconnects if the network connection is still available. I will make a quick recommendation here: if at all possible, run the XtremIO app once locally before going remote. By this I mean if you have a laptop that you use for remote support, run the XtremIO app on the same network first before taking it for remote support. It can take a long time to load over slower VPN links, and the first run caches a good bit of the app for you.

 

1. Connect to the IP address of the XMS. Click Launch.

 

2. Once in the XMS application, click Configuration at the top of the window.

 

3. This takes you to the Configuration pane. In here, you can create volumes (LUNs), register hosts and map volumes to hosts. On the left, click Add in the Volumes section.

 

4. This brings up a new window in which you can create single or multiple volumes. Give the volume a name and tell it how big it will be. Set VAAI-TP Alerts to Enabled. As XtremIO does thin provisioning exclusively, we need vSphere to be informed as to the exact sizing of the volumes and when they’re getting close to full. Click Next.

 

5. Select a parent folder or in this case select nothing (the parent folder is the root volume folder) and click Next. Click Finish to create the volume.

 

6. Now we have our volume on the left and a host to present to on the right. Click the volume on the left and the host on the right to bring them to the middle.

 

7. Click the Map All button and give the volume on the left a LUN number. XMS will automatically match it on the host side. I owe someone at EMC a hug for this. Click the Apply button at the bottom and you’re done.

 

8. Say you wanted to mount that volume on 20 hosts? Same process, only you select 20 hosts instead of just the one host. Same thing with multiple volumes. Add multiple volumes and multiple hosts, click Map All, set your LUN numbers, click Apply, and they’re mapped.

 

9. Rescan your HBAs for new storage and they should see the new volume.

 

CLI Configuration

The CLI of XMS is Linux based, but operates entirely inside the XMS Admin program. The command set can be accessed by typing a ? at any command prompt and to find a syntax guide, type a command and then a question mark (ex. add-volume ?). I highly recommend logging into EMC’s support site and downloading the configuration guide. It has fairly well written manual pages for each command and their syntax.
1. Connect to the XMS via SSH. Open PuTTY, SSH to the IP of the XMS and log in as xmsadmin. I’m not posting the password here. Google knows it. Once you’ve logged in, you’re prompted to log in again. The first login is for the host itself, the second is for the XMS Admin program.

 

2. Here are some useful commands:
show-initiator-groups – shows the hosts and what Initiator Group folder they’re in.
show-volumes – shows the current volumes
show-lun-mappings – shows the volumes and what they’re mapped to

 

3. To create a volume, run the following command. This creates a 1 TB LUN named VolumeName and enables VAAI-TP alerts.

add-volume vol-name="VolumeName" vol-size="1t" vaai-tp-alerts=enabled

 

4. To map the LUN to a host, run the following command. It presents VolumeName to Host1 as LUN 10.

map-lun vol-id="VolumeName" ig-id="Host1" lun=10

 

Note: If you’re presenting the same volume to multiple hosts (as you’re probably going to do in a virtual environment), it prompts you for a yes/no answer agreeing that you’re presenting the volume to multiple hosts. If you’re scripting the creation and mapping of a volume to multiple hosts, keep this in mind.

5. Rescan your HBAs for new storage and they should see the new volume.
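
If you’re scripting the mapping of one volume to several hosts, a rough sketch like the one below can generate the map-lun lines for you. It only builds the command text using the syntax shown in step 4. Host2 and Host3 are made-up initiator group names, the function name is hypothetical, and whatever you use to send these to the XMS still has to answer the yes/no prompt mentioned in the note above.

```python
def map_lun_commands(volume: str, initiator_groups: list[str], lun: int) -> list[str]:
    """Build one map-lun command per initiator group, all using the same LUN number.

    Hypothetical helper: it only formats text and does not talk to the XMS.
    """
    return [
        f'map-lun vol-id="{volume}" ig-id="{ig}" lun={lun}'
        for ig in initiator_groups
    ]


# Host2 and Host3 are placeholder names for illustration.
for command in map_lun_commands("VolumeName", ["Host1", "Host2", "Host3"], 10):
    print(command)
```

Keeping the LUN number the same across hosts mirrors what the GUI does when it matches the LUN number on the host side.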

Have fun with all the IOPS.

 

Failure to Configure Coredump Partition Causes Host Profile Creation to Fail

vSphere 5.5 Update 1 is out, bringing VSAN to a host near you. I could spend a blog post on the new features, but that’s what VMware’s website is for. <shameless plug>Check out my post on updating the vCenter Server Appliance for instructions.</shameless plug>

I updated the appliance with no issue at all, and then updated VUM to 5.5 Update 1. Everything is rolling. No problems. Fun fact: the C# client for vSphere doesn’t prompt you to update when it connects to the updated vCenter. You have to manually update it. This led to weird errors in Update Manager about not having enough space on the host to run the upgrade. I finally got one host upgraded and went to run the other host’s upgrade, and it continually failed with an error about being unable to run the upgrade script. I tried rebooting the host, rebooting VUM, rebooting vCenter, installing the patches, running it as an ESXi upgrade... everything. Same generic failures every time. All of the logs were unhelpful, so I decided to rebuild the host. “I have host profiles”, I thought. “This won’t be so bad”, I thought.

I got the host rebuilt and back into vCenter and went to apply the host profile and it gave me an error saying “Failed to execute command to configure or query coredump partition.”

The only way to configure the coredump partition from the GUI is via a host profile, so that option was out. Time to dig into the CLI. Here’s what I had to do to configure the coredump partition.

1. SSH to your host as root.

2. Run the command “esxcli system coredump partition get” (without the quotes). You may get different results. The results below show an unconfigured host. The errors I was receiving said the configured device was unknown. If you get that error, run “esxcli system coredump partition set -u” (without quotes) to clear that.

3. Run the command “esxcli system coredump partition set -e true -s” (without quotes) to tell the system to find a suitable partition for coredumps. 

4. Run the command “esxcli system coredump partition get” (without the quotes) to verify that a partition has been set.

You should be able to create a host profile from this host now.
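
For anyone who has to repeat this on several hosts, here is a small sketch that lays out the same esxcli commands in order. It only builds the command strings from the steps above and does not run anything. The function name and the clear_unknown_device flag are made up for illustration, and the flag stands in for the “if you get that error” condition in step 2.

```python
def coredump_commands(clear_unknown_device: bool = False) -> list[str]:
    """Return the esxcli commands from the steps above, in order.

    Hypothetical helper: clear_unknown_device adds the "set -u" step, which is
    only needed when the configured coredump device is reported as unknown.
    """
    base = "esxcli system coredump partition"
    commands = [f"{base} get"]                 # step 2: check the current state
    if clear_unknown_device:
        commands.append(f"{base} set -u")      # step 2: clear the unknown device
    commands.append(f"{base} set -e true -s")  # step 3: pick a suitable partition
    commands.append(f"{base} get")             # step 4: verify the result
    return commands


print("\n".join(coredump_commands(clear_unknown_device=True)))
```

Run the commands by hand on the first host so you can see what each one returns before trusting any script with them.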

Updating the vCenter Server 5.5 Appliance

One of the things I really appreciated about the vSphere 5.5 announcement was the increase in supported hosts and guests with the vCenter Server Appliance. Previously, the appliance could only support 5 hosts and 50 guests maximum due to the embedded DB2 database. If you had an external Oracle database, you could support more hosts and clusters. I’ve only had one customer go this route and it was due to having a lot of in-house Oracle experience.

With the 5.5 update, the appliance now uses a vPostgres embedded database that can support 100 hosts and 3000 guests. This is a huge increase that encompasses most, if not all, small to midsize environments that I work with (and even some of the larger ones). Even though Oracle is the only supported external database (update: I originally listed DB2 here as well, and I was wrong on that. Thanks @haslund for the correction!), and the lack of SQL Server support is due to SUSE Linux not having an ODBC driver for it, the internal database covers a large majority of use cases and makes it even easier to deploy vSphere.

This also changes how vCenter gets updated. The Windows version of vCenter basically had you go through the install process again and it would upgrade in place. Not so bad. On the appliance, here’s how you do it.

1. Log into the management interface for the appliance: https://[vCenter IP]:5480/

2. Log in as root. (The default password is vmware. This is your reminder that you should change it.)

3. Click the Update tab at the top.

4. Click the Check Updates button on the right. This goes out to VMware’s website and looks for patches.

5. Once an update is found, click the Install Updates button. It will ask whether you’re sure you want to install. Click OK. This downloads and installs the update, then waits for you to reboot once it’s done.

6. Go grab yourself a beverage and sit tight while it updates.

7. When it’s done, you’ll get a message to reboot the appliance to complete the update.

8. Click the System tab and then click Reboot. This will reboot the appliance and thus take vCenter down. ***You should do this during a maintenance window.***

9. When the reboot finishes, you’ll come back to a login prompt and you’re all done!
It’s a really easy process, and it makes updating far more streamlined.

Connecting to Office 365 with Adium

Instant messaging is a really nice way to stay in contact with coworkers and bug them with questions that either seem too short for e-mail or require more back and forth than e-mail is really useful for.

Office 365 allows you to use Lync as an instant messaging platform, which requires the use of the Lync client. The Lync 2010 client for OS X is, well, terrible. It crashes regularly, doesn’t handle sleep (laptop lid closing and opening) and is generally not good at what it’s been built to do.

Adium is a multi-protocol IM client that has been on the Mac since 2004 (April 6, 2014 marks 10 years). It’s built on the LibPurple foundation that runs Pidgin (the Windows counterpart). A 3rd party extension called SIPE was recently released that brings in support for the extended version of SIP/SIMPLE that Microsoft uses for Lync and Office Communicator.

Here’s how to get connected with an Office 365 account:

1. Download Adium and SIPE. Install Adium, then double click the SIPE plugin to install it into Adium.

2. Go to Adium -> Preferences and click Accounts, then click the plus in the bottom left corner of the window and select Office Communicator.

3. In the screen that appears, enter your e-mail address under Username and your password under Password, then click Options at the top.

4. Change the authentication mode to TLS-DSK and the User Agent to “UCCAPI/4.0.7577.314 OC/4.0.7577.314” without the quotes. Without this, you’ll get an error about not having the recommended version of the client. You can see other options for the user agent here. I chose this one as it uses Lync 2010 and Office 365.

5. Click OK and watch it connect. 

Enjoy an IM client that doesn’t crash and gives you more features!

Podcast List

Some readers have asked what podcasts I listen to from reading the Downcast post. Here they are:

I drive a lot, so having long form podcasts (an hour-ish plus in length) makes the drives much shorter. Hope you found some new things to listen to.

How To Backup The vCenter Server Appliance Embedded DB

VMware made a huge stride with the vCenter Server Appliance for vSphere 5.5. It now supports 100 hosts and 3000 VMs, which hits a pretty wide range of our customers. While it doesn’t support external SQL Server databases (though that could be coming soon now that Microsoft has a SUSE Linux ODBC driver), it does support Oracle as an external database source. 

Here’s where it gets fun. The embedded database is vPostgres. You may be asking yourself, “I’ve never heard of this. Does it have a backup agent for my backup application?” The answer is no, no it likely doesn’t. That just means you’ll have to do it manually/with scripts.

Option 1: Check out this VMware KB article on backing up that database.

Option 2: Take a snapshot and back up the whole VM.
