Tuesday, April 03, 2012

Windows Azure Traffic Manager to handle your public cloud DR strategy

I gave a talk last month at the Windows Azure User Group in London. To be honest, I had too much content to get through. Part of the talk was meant to cover drastic recovery, otherwise known as disaster recovery (DR), but unfortunately I ran out of time and didn't get a chance to talk about it.

What is Drastic (Disaster) Recovery?
Some people say DR is impossible or difficult to implement in a cloud platform that follows the PaaS model. This is in contrast to IaaS, where you have more control over hardware and configuration than you do in PaaS. But hopefully after you have read this article, you'll realise it is now very easy in Windows Azure.

For people not aware of DR, I'll explain it using a picture that illustrates the problem when using Windows Azure. Consider figure 1 below:
Figure 1: Basic drastic recovery fail over configuration
In the above diagram we have some roles running in Windows Azure data centres in both Europe and North Central US, and a consumer that is consuming from one data centre, Europe.

The configuration depicted above is known as a failover or active/passive configuration, which is very commonly found in drastic recovery configurations in the enterprise today. If the above were a private data centre, whether a private cloud or a traditional data centre, it would look almost identical.

When you deploy an application in an Azure data centre, you can't spread it across multiple data centres for resilience or to implement DR. Well, you can, but you'll get a different URL for each data centre you deploy your application into. For example, for the Europe data centre above, our full URL could be: http://myamazingapp-europe.cloudapp.net/

Then for the North Central US data centre, the URL has to be something different like: http://myamazingapp-us.cloudapp.net/

Having two different URLs leads us to a problem. If the Europe data centre, which in the above case is our active configuration, goes down, the users will not be able to use the application. That affects availability unless the DR environment, the USA data centre, takes over as the active configuration.

This means the Europe data centre is serving the users' requests all the time and the USA data centre is in a passive/sleep state; in other words, it is not used unless there is a failure in the Europe data centre. When/if this failure occurs, we need to switch from Europe to the USA data centre. This is not easy with the current configuration because the users/actors will have to switch URLs from Europe to the USA data centre. This is far from ideal, as the user/actor probably wouldn't know when to try the other URL. It really needs to be seamless.

So we really want to use a single URL that has the ability to reference both data centres when we need to without the users actually knowing.

A simple way to implement DR in Windows Azure
This is where DNS (Domain Name System) plays a very important role in hardware infrastructure and helps us solve this problem relatively easily.

Now consider an amendment to the above diagram (figure 1) that abstracts the user from the *.cloudapp.net domain name using an internet DNS registrar and a CNAME record that resolves to the required Azure data centre. Remember, each cloudapp subdomain represents a single data centre region. When you design a drastic recovery solution, you wouldn't normally put the DR deployment in the same data centre, as that rather defeats the purpose of having a DR strategy.

Figure 2: Basic drastic recovery fail over configuration with DNS
With the amendment above, we can give the URL http://myamazingapp.com to end users/actors. They are now completely unaware of where their application is being served from, which is how it should be.
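To make the CNAME idea concrete, here is a minimal Python sketch that models DNS records as a plain dictionary and follows aliases until it reaches a final host name. The `RECORDS` table and the `resolve_cname` helper are my own illustration, not anything Azure or a registrar provides; the host names are the example ones used in this article.

```python
# Hypothetical record table: alias -> target. Real DNS is of course a
# distributed system; a dict is enough to show the indirection.
RECORDS = {
    "myamazingapp.com": "myamazingapp-europe.cloudapp.net",
}


def resolve_cname(name, records, max_hops=10):
    """Follow CNAME records until a name with no further alias is reached."""
    for _ in range(max_hops):
        if name not in records:
            return name
        name = records[name]
    raise RuntimeError("CNAME chain too long or circular")


# Users only ever know the public name.
print(resolve_cname("myamazingapp.com", RECORDS))

# Repointing the single CNAME record moves every user to the DR data centre
# without the public name changing.
RECORDS["myamazingapp.com"] = "myamazingapp-us.cloudapp.net"
print(resolve_cname("myamazingapp.com", RECORDS))
```

The point of the design is that the only thing that changes during a failover is one record; nothing on the user's side needs to know.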

Of course they could run a trace (TRACERT) on http://myamazingapp.com and see where it resolves to. In fact, I have made the above configuration on an application I have deployed in Azure right now. If I run TRACERT on my subdomain: http://remotemedia.simonrhart.com I get the following:

Figure 3: Running TRACERT on my sample app hosted in Azure
You can see from the above trace, my subdomain resolves to Microsoft's data centre DNS http://remotemedia.cloudapp.net IP address:

We know that IP address is a real Azure data centre as it is registered to Microsoft. Here is the result from running a whois on the resolved IP address:
WHOIS information for

[Querying whois.arin.net]
[Redirected to whois.ripe.net:43]
[Querying whois.ripe.net]
% This is the RIPE Database query service.
% The objects are in RPSL format.
% The RIPE Database is subject to Terms and Conditions.
% See http://www.ripe.net/db/support/db-terms-conditions.pdf
% Note: this output has been filtered.
%       To receive output for a database update, use the "-B" flag.
% Information related to ' -'
inetnum: -
descr:           Microsoft Limited
org:             ORG-MA42-RIPE
netname:         UK-MICROSOFT-20081107
country:         GB
admin-c:         AS9763-RIPE
tech-c:          EN603-RIPE
tech-c:          BR329-ARIN
status:          ALLOCATED PA
mnt-by:          RIPE-NCC-HM-MNT
mnt-lower:       MICROSOFT-MAINT
mnt-domains:     MICROSOFT-MAINT
mnt-routes:      MICROSOFT-MAINT
source:          RIPE # Filtered
organisation:   ORG-MA42-RIPE
org-name:       Microsoft Limited
org-type:       LIR
address:        Microsoft
                Darren Norman
                One Microsoft Way
                WA 98052 Redmond
                UNITED STATES
phone:          +1 (425) 703 6647
fax-no:         +1 425 936 7329
e-mail:         danorm@microsoft.com
admin-c:        NORM1-RIPE
admin-c:        NORM1-RIPE
admin-c:        NORM1-RIPE
mnt-ref:        MICROSOFT-MAINT
mnt-ref:        RIPE-NCC-HM-MNT
mnt-by:         RIPE-NCC-HM-MNT
source:         RIPE # Filtered
person:         Allie Settlemyre
address:        Microsoft Limited
address:        One Microsoft Way,
address:        Redmond, WA 98052
address:        USA
phone:          +1 (425) 705 0516
phone:          +1 (425) 936 7329
e-mail:         iprrms@microsoft.com
nic-hdl:        AS9763-RIPE
source:         RIPE # Filtered
person:         Bharat Ranjan
address:        Microsoft Corporation
address:        Redmond, WA, 98102
address:        One Microsoft Way
address:        USA
phone:          +1 (425) 706 3230
fax-no:         +1 (425) 936 7329
nic-hdl:        BR329-ARIN
source:         RIPE # Filtered
e-mail:         bharatr@microsoft.com
person:         Edet Nkposong
address:        Microsoft, One Microsoft Way,Redmond, WA 98052
address:        USA
e-mail:         edetn@microsoft.com
phone:          +14257071045
nic-hdl:        EN603-RIPE
mnt-by:         MICROSOFT-MAINT
source:         RIPE # Filtered

So that is wonderful, isn't it? DR and failover problem sorted. Well, kind of. It's not perfect as it's very manual. If the European data centre where my application is deployed goes down, I need to know about it so I can tell my DNS registrar to change the CNAME record to point to the application that is deployed in the DR data centre, North Central US.

This means I will have to log into my DNS registrar and change the CNAME when a failure occurs like so:
Figure 4: Setting up a CNAME record
I don't really want my IT admins having to deal with this as it's expensive and adds complexity. I could automate it but then I'd have to put a load of process in place and write some custom code not to mention I'll need some infrastructure running on-premise (most probably).
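To give a feel for what that custom code might involve, here is a rough Python sketch of the decision part only: probe the primary deployment over HTTP and pick which host the CNAME should point at. The function names, the timeout, and the idea of treating any 2xx/3xx response as healthy are my assumptions. Actually updating the record depends entirely on your registrar and is not shown.

```python
import urllib.error
import urllib.request

# Example host names from this article.
PRIMARY_HOST = "myamazingapp-europe.cloudapp.net"
SECONDARY_HOST = "myamazingapp-us.cloudapp.net"


def is_healthy(host, timeout=5):
    """Return True if http://host/ answers with a 2xx/3xx status in time."""
    try:
        with urllib.request.urlopen(f"http://{host}/", timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, DNS failures, refused connections and timeouts.
        return False


def choose_cname_target(primary, secondary, probe=is_healthy):
    """Pick the host the CNAME should point at: primary if healthy, else secondary."""
    return primary if probe(primary) else secondary


# Something would have to run this on a schedule and then call the
# registrar's own (registrar-specific) update mechanism:
# target = choose_cname_target(PRIMARY_HOST, SECONDARY_HOST)
```

Even this small sketch needs somewhere to run, a schedule, and credentials for the registrar, which is exactly the overhead I would rather avoid.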

Surely there is a better way?

Windows Azure Traffic Manager
What I have talked about above will work. It's fairly simple and I have done it this way for some time. But thankfully there is a better way. Microsoft has made available in Community Technology Preview (CTP) a feature called Windows Azure Traffic Manager.

Unlike the beta programmes in Azure, you can start using Traffic Manager right away. There is no need to request access first, as there is with the beta programmes.

Windows Azure Traffic Manager can handle your failover DR strategy without you having to touch any DNS server/registrar once it's set up, and it does more besides. It supports the following:
  1. Performance – traffic is forwarded to the closest hosted service in terms of network latency
  2. Round Robin – traffic is distributed equally across all hosted services
  3. Failover – traffic is sent to a primary service and, if this service goes offline, to the next available service in a list
As we are talking about failover, the feature we need from Traffic Manager is number 3: Failover.
So Traffic Manager will solve our problem of having to manually update the DNS registrar with the new Azure data centre DNS cloudapp domain name. Great, how do I do it?
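As a way of thinking about the three methods, here is a simplified Python sketch of each selection rule. It is only an illustration of the descriptions above, not how Traffic Manager is implemented; the function names, the health dictionary and the latency figures are all made up for the example.

```python
import itertools


def failover(endpoints, healthy):
    """Return the first endpoint, in priority order, that is marked healthy."""
    for endpoint in endpoints:
        if healthy.get(endpoint, False):
            return endpoint
    return None  # nothing is available


def round_robin(endpoints):
    """Return an iterator that hands out the endpoints in turn, forever."""
    return itertools.cycle(endpoints)


def performance(latency_ms):
    """Return the endpoint with the lowest measured latency."""
    return min(latency_ms, key=latency_ms.get)


# Priority order: Europe first, North Central US as the DR deployment.
endpoints = ["remotemedia.cloudapp.net", "remotemedia-dr.cloudapp.net"]

# Normal operation: the primary is chosen.
print(failover(endpoints, {endpoints[0]: True, endpoints[1]: True}))

# Europe is down: the next service in the list is chosen.
print(failover(endpoints, {endpoints[0]: False, endpoints[1]: True}))
```

For DR, only the `failover` rule matters: the order of the list is the order of preference.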

Enabling Traffic Manager
To start using Traffic Manager you need to use the Windows Azure Management Portal to create a policy.

To do this, navigate to the Windows Azure Management Portal and sign in: http://windows.azure.com. Then click Virtual Network > Get Started With Traffic Manager.

See figure 5 below:

Figure 5: Getting started with Windows Azure Traffic Manager
Notice how this is different from using the beta programmes in Windows Azure. With Traffic Manager you can start using it straight away and right now there is no cost to using it.

Once you click the Get Started with Traffic Manager button, you'll see a dialog box similar to the following popup:

Figure 6: Creating a Traffic Manager Policy

Note that there is a lab that covers all of this Traffic Manager setup here: http://msdn.microsoft.com/en-us/gg197529. I have included the steps here anyway to give the bigger picture of what Traffic Manager is specifically designed to solve and how you would solve these problems without it.

I have filled in the policy above as per the original high-level architecture diagram in figure 1 above. Note: DNS names are different from my diagram but the concept and design are the same.

In the above, as mentioned, we select Failover as the load-balancing method. We specify Europe (remotemedia) as our primary active configuration and the North Central US (remotemedia-dr) namespace as the failover data centre. This one is our passive configuration: the application is there, deployed and waiting to be used should a failure occur.

Some of the data here is important. One piece is the DNS time to live (TTL). This is how long DNS resolvers may cache the answer they were given, so after a failure it is roughly the longest that users can keep being sent to the old address once Traffic Manager has switched over. The default is 5 minutes (300 seconds). The other important piece of information is the Traffic Manager DNS Prefix field.

Well, the Traffic Manager DNS Prefix field can be anything we want (so long as it hasn't been used already) as the users will never see it. Later we will reconfigure our DNS registrar to point to this DNS address.
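The effect of the DNS TTL setting is easier to see with a toy model. The Python sketch below is a made-up caching resolver (the `CachingResolver` class and the `current` dictionary are purely illustrative) that keeps serving a cached answer until the TTL runs out, which is why a failover is not visible to every user instantly.

```python
class CachingResolver:
    """Toy resolver that caches each answer for ttl_seconds."""

    def __init__(self, lookup, ttl_seconds=300):
        self.lookup = lookup            # function: name -> current target
        self.ttl_seconds = ttl_seconds
        self.cache = {}                 # name -> (target, expiry time)

    def resolve(self, name, now):
        cached = self.cache.get(name)
        if cached and now < cached[1]:
            return cached[0]            # still within the TTL: cached answer
        target = self.lookup(name)
        self.cache[name] = (target, now + self.ttl_seconds)
        return target


# Hypothetical state: what the policy currently answers with.
current = {"target": "remotemedia.cloudapp.net"}
resolver = CachingResolver(lambda name: current["target"], ttl_seconds=300)

# t=0: first lookup, Europe is returned and cached until t=300.
print(resolver.resolve("remotemedia.simonrhart.com", now=0))

# The policy fails over shortly afterwards.
current["target"] = "remotemedia-dr.cloudapp.net"

# t=299: the cached Europe answer is still served.
print(resolver.resolve("remotemedia.simonrhart.com", now=299))

# t=300: the cache entry has expired, so the DR target is returned.
print(resolver.resolve("remotemedia.simonrhart.com", now=300))
```

A shorter TTL narrows that window at the cost of more DNS lookups, so the 300 second default is a trade-off rather than a magic number.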

Once I click OK, the policy is then created and it is active in Traffic Manager:

Figure 7: Our policy in traffic manager
Figure 7 above shows how these policies look in Traffic Manager. There is one thing left to do though, and that is to configure our DNS registrar to point our custom domain at the Traffic Manager policy URL we chose.

Figure 8: That's it, our DNS configured and never needs to change again!
Figure 8 above shows our final DNS configuration. So what happened here?

We are simply handing the problem of failover over to Windows Azure. So in the above case, Azure will handle changing the DNS CNAME configuration should a failure occur.

Making Sure Traffic Manager is Working
What we now need to do is test that the Traffic Manager failover feature is working correctly.

If we now run a trace route on our new Traffic Manager URL, it should resolve to the Europe data centre (in my case http://remotemedia.cloudapp.net). Remember, I have two data centres: one in Europe (active) and one in North Central US (passive):

Figure 9: Tracing traffic manager configuration
So I'm happy with that, Traffic Manager's DNS configuration looks correct to me.

Now I want to force a failure so I can test the failover. This is easy: all I need to do is shut down the Europe data centre services like so:

Figure 10: Shutting down active node in Windows Azure
Now that my Europe data centre services are not running as per figure 10 above, I'll need to wait the 5 minutes (which is what I configured) before I test the failover.

Once 5 minutes have elapsed, I'll run the same trace route command via a command prompt like so:

Figure 11: Tracing now that Europe services are down
I think this is a success. Notice the trace now resolves to our North Central US data centre (my URL: http://remotemedia-dr.cloudapp.net).

Also, if I run the trace one layer out from my custom domain: http://remotemedia.simonrhart.com, I get the expected failover data centre as above [remotemedia-dr.cloudapp.net]:

Figure 12: Running a trace route from my custom domain
So now you can see how the actual Traffic Manager DNS that you pick can be anything you want, it doesn't really matter what it is.
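If you would rather check this from a script than read TRACERT output, something like the following Python sketch could do it. It relies on the standard library's `socket.gethostbyname_ex`, which returns the primary host name the system resolver reports; on most platforms that is the name at the end of the CNAME chain, but that is an assumption worth verifying on your own machine. The helper names are mine.

```python
import socket


def canonical_name(hostname, resolver=socket.gethostbyname_ex):
    """Return the primary host name the resolver reports for hostname."""
    name, _aliases, _addresses = resolver(hostname)
    return name


def is_failed_over(hostname, dr_host, resolver=socket.gethostbyname_ex):
    """Return True if hostname currently resolves to the DR deployment."""
    return canonical_name(hostname, resolver).lower() == dr_host.lower()


# Example call (performs a real DNS lookup, so it is left commented out):
# is_failed_over("remotemedia.simonrhart.com", "remotemedia-dr.cloudapp.net")
```

The `resolver` parameter is there so the check can be exercised with a fake lookup function, without touching the network.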

How does all this look? Consider the new amended high-level architecture diagram in figure 13 below:

Figure 13: Complete high-level architecture diagram using Traffic Manager for DR

So I think Windows Azure Traffic Manager is a good solution for your Windows Azure failover needs. Check out the Traffic Manager training lab for a hands-on exercise on how to use it in more detail.

In this article I have used a public DNS registrar, but if your users are within a corporate LAN and you still want to make use of a public cloud platform like Windows Azure, the same concepts apply to an internal DNS server farm.

In this blog post, I wanted to show how DR can be done in a PaaS model like Windows Azure - hopefully you can see how easy it is with Windows Azure Traffic Manager.
