I am going to “live blog” my notes from CodeMash as it goes. I’ll write a post during each session. My commentary will be in italics.
I started my day in Jimmy Bogard’s NServiceBus seminar, but I quickly realized he wasn’t going to go much beyond what I already know about messaging and service buses, so I had to bail. It looked like it was going to be a really good session if you didn’t know a lot about either of those subjects, and a genuinely in-depth, hands-on one at that, so I was disappointed to leave. But there is so much going on here that I couldn’t justify staying and hoping he would get to the things I could stand to learn more about sometime around 3 pm. The good news is that the room was full. It is good to see so many people interested in message/event based architectures and techniques.
I moved to Mike Wood’s (with Brent Stineman) Cloud Architecture with Windows Azure seminar. So far, it has been nothing but marketecture, but I am hoping it gets better once they get through the basics. I hope I didn’t miss some disclaimer about it not being hands on.
SQL Azure is not the same as SQL Server. Databases are limited to 150 GB and don’t have all of the features. One way to deal with this is to scale out instead of scaling up, which fits nicely with a service-oriented system composed of small services. I am a big believer in small (not nano) service based systems. Latency is also a potential issue you need to deal with, even within the same Azure datacenter. The sweet spot is large scale with modest performance, not workloads requiring high performance. Caching becomes critical for the things that do need to perform.
Ironically, NoSQL data is handled much better here, via Azure Tables. The limit for a “table” is 1 TB, which is not a limit of the table itself but of the underlying storage account in Azure. You can also store blobs of data, which is useful for large documents or media, and blobs can be used for streaming.
Azure also supports queues and something called HDInsight, which is a Hadoop implementation. There is an Azure Service Bus, which supports AMQP 1.0. There is also a version you can run on Windows Server, so you can probably connect them together.
Virtual Networking allows you to set up trust relationships between Azure client machines AND your own infrastructure, although the support for this is still limited. Additionally, there is Mobile Services, which facilitates building mobile apps while storing the data and handling the processing in the cloud.
You can use the AppFabric services (which handle caching) locally (they are included in the Windows Server license) and then use Dedicated Caching on Azure. The Azure Caching Service is an extra charge, so they recommend Dedicated Caching instead.
There is federated integration with Active Directory which helps you manage security across your local datacenter and Azure. I have to admit that I don’t know enough about this to know how well this might work or how useful it actually is, but it seems like it would solve some problems with security and administration.
The Content Delivery Network (CDN) sits in front of storage accounts. You pick the data center where your storage account lives; when someone requests static data from a location closer to a different data center, the data is pulled over and cached at that data center, speeding up subsequent requests. Data can be cached for up to 72 hours, and the actual lifetime varies depending on how the data is used. Staleness is detected using hashes of the files involved. Obviously this only applies to static, publicly available (i.e. not secured) content, although you can also use the CDN to cache dynamically generated data without a whole lot more work if you need to.
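To make the pull-through behavior concrete, here is a rough Python sketch of the general pattern. The class, the TTL handling, and the hash bookkeeping are my own illustration of the idea, not how the CDN is actually implemented.

```python
import hashlib
import time

MAX_TTL_SECONDS = 72 * 3600  # static content can live at the edge for up to 72 hours


class EdgeCache:
    """Toy pull-through cache: fetch from the origin on a miss, then serve locally."""

    def __init__(self, origin_fetch):
        self.origin_fetch = origin_fetch  # callable returning bytes from the origin storage account
        self.entries = {}                 # url -> (content, content_hash, fetched_at)

    def get(self, url):
        entry = self.entries.get(url)
        if entry is not None:
            content, content_hash, fetched_at = entry
            if time.time() - fetched_at < MAX_TTL_SECONDS:
                return content  # cache hit: served from the edge, no trip to the origin
        # Miss or expired: pull from the origin data center and cache at this edge.
        # The stored hash stands in for what a real edge would compare against the
        # origin to decide whether its copy has gone stale.
        content = self.origin_fetch(url)
        self.entries[url] = (content, hashlib.sha256(content).hexdigest(), time.time())
        return content
```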
Aaannnnddd we’ve dropped back into marketing again…
Traffic Manager allows you to control how traffic flows between data centers. It has three profiles: Performance, which routes each request to the data center closest to the user; Failover, which sends all traffic to the primary deployment and only directs it to another deployment (in the same or another data center) if the primary fails; and Round Robin, which spreads requests evenly. Failover is probably only useful if you have a regional site that does not make sense to host in a bunch of data centers, and Round Robin is likewise most useful within one data center, since it will not help when you are running internationally (in fact, it will hurt A LOT). Traffic Manager is a “preview item”. Lots of this stuff is preview items, which is Microsoft-speak for beta.
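Here is roughly how I think about the three profiles, as a Python sketch. The endpoint names, the health map, and the latency numbers are made up for illustration; this is not the Traffic Manager API.

```python
import random

# Hypothetical deployments of the same solution in three data centers.
ENDPOINTS = ["us-south-central", "europe-west", "asia-east"]


def route_performance(user_latency_ms):
    """Performance: send the request to the data center closest to the user."""
    # user_latency_ms is a dict like {"us-south-central": 40, "europe-west": 120, ...}
    return min(user_latency_ms, key=user_latency_ms.get)


def route_failover(healthy):
    """Failover: always use the primary unless it is down, then fall through the ordered list."""
    for endpoint in ENDPOINTS:  # ENDPOINTS doubles as the priority order here
        if healthy.get(endpoint, False):
            return endpoint
    raise RuntimeError("no healthy endpoint available")


def route_round_robin():
    """Round Robin: spread requests evenly, regardless of where the user is."""
    return random.choice(ENDPOINTS)  # fine within one region, painful across continents
```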
Data synchronization between data centers is tricky because someone may expect that things are in sync between multiple data centers. Geo-replicated storage is automatically turned on for Azure Storage and keeps a secondary somewhere else in the same “region” (US, Europe, etc.). Queue data does not get replicated – only tables and blobs. There is no guarantee about how long the delay is between data being stored at the primary and becoming available at the secondary, and only Microsoft controls the “switch” to fail over to the secondary. So geo-replicated storage does not necessarily solve the problem of data being available at multiple data centers in a timely fashion. This actually contributed to a recent outage of roughly 60 hours for some customers hosted at the South Central data center in Texas: because the failure did not affect all customers, Microsoft was reluctant to pull the trigger on the failover for fear of killing services that were still working.
There are feeds and resources that report the state of Azure; if you are using Azure, you need to watch them to find out about outages. You can also buy premier support to help mitigate this problem, but that might not be much help when you have to explain an outage to your customers. If this scenario will not work for your use case or customers, the solution is to create your own replicas, which of course means doing the work yourself and adds a lot of complexity.
SQL Azure Data Sync uses the Sync Framework, so you can plug other things into it. It will let you do some of the replication you need, and setting up SQL Data Sync is not very hard: you define one store as the master and it pushes the data out to the replicas. Schema changes DO NOT sync, so if you change the schema you will have to handle that yourself. It is data sync, not replication, and it runs at a different cadence – there is no guarantee that things will actually be the same at any given time, and the minimum interval before data is back in sync is 5 minutes. There is also currently a bug that prevents replicating more than 20 MB in a single sync operation, so for now set your parameters to stay under that.
Clearly, when you are dealing with clouds and data, eventual consistency is the norm. Eventual consistency should just be the way we think about things now, so we need to coach our users and business analysts on their expectations. If we need something better than eventual, it will come at a big cost.
Service Bus Topics and Subscriptions – if no one subscribes to a topic, a published message just goes to the bit bucket, pretty much exactly the way every other service bus works. You can define a queue as a sync queue to give you some redundancy.
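A toy model of the topic semantics, just to make the “bit bucket” behavior concrete. This is not the Service Bus API, only the idea: each subscription gets its own copy, and with no subscriptions a message is simply dropped.

```python
from collections import defaultdict, deque


class Topic:
    """Toy topic: every subscription gets its own copy of each message published
    after it was created; with no subscriptions, a published message is dropped."""

    def __init__(self):
        self.subscriptions = defaultdict(deque)  # subscription name -> pending messages

    def subscribe(self, name):
        self.subscriptions[name]  # touching the defaultdict creates the subscription's queue

    def publish(self, message):
        if not self.subscriptions:
            return  # no subscribers: straight to the bit bucket
        for queue in self.subscriptions.values():
            queue.append(message)

    def receive(self, name):
        queue = self.subscriptions[name]
        return queue.popleft() if queue else None
```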
Because things are very fluid in a cloud environment (i.e. you will not have fixed IP addresses), you need to make sure you can handle addresses and names changing without recompiling your code. Of course, this is just good practice anyway, even if you aren’t using a cloud. It is probably a good idea to build test cases around these sorts of things as well.
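Something as simple as resolving addresses from configuration at runtime covers most of this. A minimal sketch, where the file name and keys are hypothetical:

```python
import json


def load_endpoints(path="endpoints.json"):
    """Resolve service addresses from configuration at startup (or on a refresh signal),
    so a changed host name or IP never requires a recompile or redeploy of the code."""
    with open(path) as f:
        return json.load(f)  # e.g. {"orders": "https://orders.example.cloudapp.net", ...}


def orders_url(config):
    # Look the address up rather than hard-coding it; cache with a short TTL if needed.
    return config["orders"]
```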
Accessible v. Available. If a service is up but no one can reach it, it is neither.
Resiliency is about the system being able to recover without human intervention. Logging and telling the user there was a problem is not sufficient; to be resilient, your code has to seek out problems and correct them. You should have multiple avenues for things to occur so that any one failure does not cause the whole system to collapse. Blue Strike is a product that can help you determine how well your site is running and monitor whether it is operating correctly or where it is failing. There is also a Windows Azure Diagnostics tool/toolset that you can read more about on the Azure site; it aggregates various logs and counters and pushes them to a storage account on a schedule. Note that hardware failures can prevent you from getting this logging, and it is possible to overrun your storage account thresholds (both transactions per second and total storage). You should not write your diagnostics to the same storage account you use for application data; that way you can give DevOps access to diagnostics without also giving them access to production data they don’t need to see.
Logging levels are of course tricky. You don’t want to fill up storage with logs, but you don’t want to leave yourself without enough information. One solution Brent suggested was to log everything to some transient store and only persist it, along with the error, when an error occurs. The transient storage would age out, so you’d only keep recent log events, which is generally sufficient.
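Python’s standard library happens to ship a close approximation of this pattern in logging.handlers.MemoryHandler, which buffers records in memory and flushes them to a real handler when an error-level record arrives (it also flushes when the buffer fills, so a strict “age out the oldest” behavior would need a small custom handler). A sketch:

```python
import logging
import logging.handlers

# Real destination for persisted logs (in Azure this might be table storage; here, a file).
target = logging.FileHandler("persisted.log")
target.setLevel(logging.DEBUG)

# Buffer up to 1000 records in memory and flush them to the target when an ERROR
# arrives (or when the buffer fills), so the verbose context around a failure is
# persisted together with the error itself.
buffered = logging.handlers.MemoryHandler(capacity=1000,
                                          flushLevel=logging.ERROR,
                                          target=target)

logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)
logger.addHandler(buffered)

logger.debug("routine detail, held in memory only")
logger.error("something broke")  # triggers a flush of the buffered context plus this record
```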
Modern architecture and coding techniques can mitigate a lot of failures that we used to have to buy more hardware to deal with. Request buffering is one option (one I am a huge fan of – this is message queuing/event driven architectures). Many times the issue is something like a deadlock, where simply retrying the operation will succeed. Asynchronicity is pretty much a requirement for building applications now; it helps resolve front end waits, but to do it properly you should be queuing things up, not just throwing work at background threads directly.
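The retry part is only a few lines of code. Here is a generic sketch with exponential backoff, where is_transient is a predicate you would write for your particular data store (detecting a deadlock victim, a timeout, and so on):

```python
import random
import time


def retry(operation, is_transient, attempts=5, base_delay=0.2):
    """Retry an operation that fails transiently (e.g. as a deadlock victim),
    backing off exponentially with a little jitter between attempts."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:
            if attempt == attempts - 1 or not is_transient(exc):
                raise  # out of attempts, or a real failure we should not paper over
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```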
Capacity buffering allows you to queue up work you cannot currently do so that it can be handled later. Your backend processing can eventually catch up. You can also add one more machine than you need to deal with spikes, if they occur with some regularity or you need that level of performance to handle an SLA. It also gives you a buffer of time to scale out to more nodes when you see the spike coming. With an extra “node” you are also in a better position to do a rolling upgrade of your system.
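In code, capacity buffering is just a producer/consumer split around a queue. A minimal in-process sketch follows; on Azure the queue would be an Azure Storage queue or a Service Bus queue, the workers would be worker role instances, and process() stands in for your real business operation.

```python
import queue
import threading

work = queue.Queue()  # stand-in for a durable queue in the cloud


def process(payload):
    pass  # placeholder for the real business operation


def accept_request(payload):
    """Front end: acknowledge quickly and buffer the work instead of doing it inline."""
    work.put(payload)


def worker():
    """Back end: drain the buffer at a sustainable rate; after a spike it simply catches up.
    Add more workers (or nodes) to scale out when the backlog grows."""
    while True:
        payload = work.get()
        process(payload)
        work.task_done()


threading.Thread(target=worker, daemon=True).start()
```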
Always carry a spare. Put half of your load on each of two nodes/systems, sized so that each runs at about 75% capacity. If one fails, the other can handle the full load with some level of service degradation. You need to test your system to ensure that the degraded level of performance is acceptable (perhaps not ideal, but enough to satisfy the user). You may want to have features degrade in the places that are least important. For example, if you have a recommendation engine as part of your system, you can serve stale recommendations or tell the user that no recommendations are currently available; at least they can still use the primary use case of your system.
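The recommendation-engine example might look something like this, where engine and cache are stand-ins for whatever you actually use:

```python
def get_recommendations(user_id, engine, cache):
    """Degrade gracefully: live results if possible, stale ones if not, none as a last resort."""
    try:
        results = engine.recommend(user_id)   # the normal path
        cache.put(user_id, results)           # refresh the fallback copy
        return results, "fresh"
    except Exception:
        stale = cache.get(user_id)            # possibly hours old, but better than an error
        if stale is not None:
            return stale, "stale"
        return [], "unavailable"              # the primary use case of the site still works
```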
The “HI” point. When things are going bad, there should be a point where your automation finally stops and requires a human to take over. Obviously there need to be notification protocols around this as well, so that the humans aren’t brought in cold. The “Leap Year” outage last year triggered this HI protocol. Microsoft also published a post-mortem about this event, as it does for every outage, and there is probably a lot to learn in these post-mortems.
An SLA (Service Level Agreement) is just a contract that spells out terms for availability and penalties for outages. It is NOT a guarantee, and it does not prevent outages. When negotiating an SLA, you need to take into account your processes for recovery: time to detect (is your logging and monitoring any good?), time to diagnose, time to decide, and time to act. Do not blindly agree to an SLA of no outage longer than 5 minutes; you will not meet it (unless your process is amazing). Ensure that the customer demanding the SLA will pay for the investments you have to make to meet it.
Uptime. Four services, each with 99.95% uptime, compound to a combined availability of roughly 99.8%. You have to take into account that any one service can take you down, so you have to accumulate the total outage time of each service. You can mitigate this with redundancy, but redundancy has costs. SLAs are all about cost v. risk. If you require the higher level of availability, you need to pay for all of the extra costs, so you must have business justification for the expenditure: what is the cost of the outage versus the cost of the extra availability? It is an insurance policy. So it would seem we might be able to apply actuarial science to this problem…
Redundancy math is not additive, it is multiplicative: you are only down when both systems are down at the same time. So two boxes, each with 95% uptime, give 5/100 × 5/100 = 25/10,000, or 0.25% downtime (about 22 hours per year).
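Both calculations in one quick script, under the usual assumptions that failures are independent and that, in the serial case, every service has to be up:

```python
HOURS_PER_YEAR = 24 * 365

# Serial composition: you depend on all four services, so their availabilities multiply.
serial = 0.9995 ** 4
print(f"four 99.95% services in series: {serial:.4%} available")   # ~99.80%

# Parallel redundancy: you are only down when both boxes are down at the same time.
downtime_fraction = 0.05 * 0.05
print(f"two 95% boxes in parallel: {downtime_fraction:.2%} downtime, "
      f"about {downtime_fraction * HOURS_PER_YEAR:.0f} hours per year")  # 0.25%, ~22 h
```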
Try to express SLAs in terms of business value, not just hardware uptime. So: 99% of emails will be delivered in 5 minutes or less, instead of 99% email server uptime. Make it clear what is required so that you can accurately weigh the cost versus the benefit.
Perform your own root cause analysis for your failures, but just as importantly, read and learn from the various vendors’ root cause analyses. “You are not alive long enough to learn only from your own mistakes.”