CodeMash 2013 Precompiler Day One, Session One

CodeMash

I am going to “live blog” my notes from CodeMash as the conference goes. I’ll write a post during each session. My commentary will be in italics.

I started my day in Jimmy Bogard’s NServiceBus seminar, but I quickly realized he wasn’t going to go much beyond what I already know about messaging and service buses, so I had to bail. It looked like it was going to be a really good session if you didn’t know a lot about either of those subjects, and it clearly would be an in-depth, hands-on one, so I was disappointed to leave. But there is so much going on here that I couldn’t justify staying and hoping he would get to the things I could stand to learn more about around 3 pm. The good news is, the room was full. It is good to see so many people interested in message/event-based architectures and techniques.

I moved to Mike Wood’s (with Brent Stineman) Cloud Architecture with Windows Azure seminar. So far, it has been nothing but marketecture, but I am hoping it gets better once they get through the basics. I hope I didn’t miss some disclaimer about it not being hands-on.

SQL Azure is not the same as SQL Server. The databases are limited to 150 GB and don’t have all of the features. One way to deal with this is to scale out instead of scaling up, which is a perfect fit for a service-oriented system built from small services. I am a big believer in small (not nano) service-based systems. Latency is also a potential issue you need to deal with, even within the same Azure datacenter. The sweet spot is large scale with modest performance, not workloads requiring high performance. Caching becomes critical for things that need to perform.
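As a rough illustration of what scaling out can look like, here is a minimal sketch of my own (the connection strings and routing rule are invented, not something from the session):

```python
import hashlib

# Hypothetical connection strings, one per small database shard.
SHARDS = [
    "Server=shard0.example.net;Database=orders0",
    "Server=shard1.example.net;Database=orders1",
    "Server=shard2.example.net;Database=orders2",
]

def shard_for(customer_id):
    """Route a customer to one shard so no single database outgrows the 150 GB cap."""
    digest = hashlib.md5(customer_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer-42"))  # the same customer always lands on the same shard
```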

Ironically, the NoSQL option, Azure Tables, handles large volumes of data much better. The limit for a “table” is 1 TB, which is really a limit of the underlying Azure storage account rather than of the table itself. You can also store blobs of data, which are useful for large documents or media, and can be streamed.
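Conceptually (this is just an in-memory stand-in I put together, not the actual storage SDK), table entities are addressed by a partition key and a row key, and blobs are named byte streams in a container:

```python
from collections import defaultdict

# Conceptual stand-in only: table entities are looked up by a (partition key, row key)
# pair; blobs are named byte streams inside a container.
tables = defaultdict(dict)   # table name -> {(partition_key, row_key): entity}
blobs = {}                   # (container, blob name) -> bytes

def insert_entity(table, partition_key, row_key, entity):
    tables[table][(partition_key, row_key)] = entity

def upload_blob(container, name, data):
    blobs[(container, name)] = data

insert_entity("Orders", partition_key="customer-42", row_key="order-1001",
              entity={"Total": 129.95, "Status": "Shipped"})
upload_blob("documents", "invoice-1001.pdf", b"%PDF-...")
```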

Azure also supports queues and something called HDInsight, which is a Hadoop implementation. There is an Azure Service Bus, which supports AMQP 1.0. There is also a version of Service Bus you can run on Windows Server, so you can probably connect the two together.

Virtual Networking allows you to set up trust relationships between Azure client machines AND your own infrastructure, although the support for this is still limited. Additionally, there are Mobile Services, which facilitate building mobile apps while storing the data and handling the processing in the cloud.

You can use the AppFabric services (which handle caching) locally (it is included in the Windows Server license) and then use Dedicated Caching on Azure. Azure Caching Service is an extra charge, so they recommend you use Dedicated Caching.

There is federated integration with Active Directory which helps you manage security across your local datacenter and Azure. I have to admit that I don’t know enough about this to know how well this might work or how useful it actually is, but it seems like it would solve some problems with security and administration.

The Content Delivery Network (CDN) sits in front of storage accounts. You pick the data center where your storage account lives. When someone requests static data from a location closer to a different data center, the data is pulled over and cached at that data center, speeding up access for subsequent requests. The data can be cached for up to 72 hours, though the actual duration varies with how often the data is used. The CDN does test for staleness using hashes of the files involved. Obviously this only applies to static, publicly available (i.e. not secured) content. You can also use the CDN to cache dynamically generated data if you need to, without a whole lot more work.

Aaannnnddd we’ve dropped back into marketing again…

Traffic Manager allows you to control how data and processing flow between data centers. It has three profiles. Performance routes each request to the data center closest to the user. Failover sends all traffic to the primary solution in one data center and only directs it to another solution in the same or another datacenter if the primary fails; it is probably only useful if you have a regional site that does not make sense to host in a bunch of data centers. Round Robin is also most useful within one datacenter, since it will not help when you are running internationally (in fact, it will hurt A LOT). Traffic Manager is a “preview item”. Lots of this stuff is preview items, which is Microsoft-speak for beta.
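To make the three profiles concrete, here is a toy sketch of the routing decisions with made-up endpoints and latencies (this is the idea, not how Traffic Manager is actually implemented):

```python
from itertools import cycle

# Hypothetical endpoints: region, health, priority, and latency as seen by the caller.
ENDPOINTS = [
    {"name": "us-south", "healthy": True, "priority": 0, "latency_ms": 40},
    {"name": "us-north", "healthy": True, "priority": 1, "latency_ms": 55},
    {"name": "eu-west",  "healthy": True, "priority": 2, "latency_ms": 130},
]

def performance(endpoints):
    """Route to the closest (lowest latency) healthy endpoint."""
    return min((e for e in endpoints if e["healthy"]), key=lambda e: e["latency_ms"])

def failover(endpoints):
    """Always use the primary; only move down the priority list when it is unhealthy."""
    return next(e for e in sorted(endpoints, key=lambda e: e["priority"]) if e["healthy"])

_rotation = cycle(ENDPOINTS)

def round_robin():
    """Spread requests evenly; fine within one datacenter, painful across continents."""
    return next(_rotation)
```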

Data synchronization between data centers is tricky because people may expect that things are in sync across all of them. Geo-replicated storage is automatically turned on for Azure Storage and keeps a secondary somewhere else in the same “region” (so, US, Europe, etc.). Queue data does not get replicated – only tables and blobs. There is no guarantee about how much of a delay there is between the data getting stored at the primary and the data being available at the secondary. Also, only Microsoft controls the “switch” to fail over to the secondary. So, geo-replicated storage does not necessarily solve the problem of data being available at multiple data centers in a timely fashion. This actually contributed to a recent outage of about 60 hours for some people hosted at the South Central datacenter in Texas: because the problem didn’t affect all customers, Microsoft was afraid to pull the trigger on the failover for fear of killing services that were still working.

There are feeds and resources that report on the state of Azure; if you are using Azure, you need to watch them to find out about outages. You can also buy premier support to help mitigate this problem, but that might not be enough when you have to explain an outage to your customers. The solution is to create your own replicas if you find this scenario will not work for your use case or customers. This of course requires you to do the work yourself, which adds a lot of complexity.

SQL Azure Data Sync uses the Sync Framework, so you can plug other things into it. It allows you to do some of the replication you need, and setting up SQL Data Sync is not very hard. You define one store as the master and it will push the data out to the replicas. Schema changes DO NOT sync, so if you change the schema you will have to address that yourself. It is data sync, not replication, and it runs at a different speed than replication – there is no guarantee that things will actually be the same at any given time. The minimum amount of time that data will be out of sync is 5 minutes. There is also currently a bug where more than 20 MB will not replicate in a single sync operation, so for now you need to set your parameters to avoid that.

Clearly, when you are dealing with clouds and data, eventual consistency is the norm. Eventual consistency should just be the way we think about things now, so we need to coach our users and business analysts about their expectations. If we need something better than eventual consistency, there will be a big cost.

Service Bus Topics and Subscriptions – if a topic has no subscriptions, a message published to it just goes to the bit bucket. Pretty much exactly the way every other service bus works. You can define a queue as a sync queue to provide some redundancy.
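A toy illustration of that drop-if-nobody-is-listening behavior (my own sketch, not the Service Bus API):

```python
# A toy topic: if nobody has subscribed yet, a published message simply disappears.
class Topic:
    def __init__(self):
        self.subscriptions = {}          # subscription name -> list of pending messages

    def subscribe(self, name):
        self.subscriptions.setdefault(name, [])

    def publish(self, message):
        if not self.subscriptions:       # no subscriptions: straight to the bit bucket
            return
        for pending in self.subscriptions.values():
            pending.append(message)      # every subscription gets its own copy

topic = Topic()
topic.publish("lost forever")            # dropped -- nobody was listening yet
topic.subscribe("billing")
topic.publish("order-placed")            # delivered to the billing subscription
```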

Because things are very fluid in a cloud environment (i.e. you will not have fixed IP addresses) you need to make sure you can handle addresses and names changing without recompiling your code. Of course, this is just a good practice anyway, even if you aren’t using a cloud. It is probably a good idea to build test cases around these sorts of things as well.
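For example, resolving addresses from configuration rather than constants is enough to survive a name or IP change without a rebuild (the variable names here are hypothetical):

```python
import os

# Resolve service addresses from configuration at runtime instead of baking them in,
# so a new hostname or IP never forces a recompile. Names here are hypothetical.
def service_url(name, default):
    return os.environ.get(f"{name.upper()}_URL", default)

ORDERS_URL = service_url("orders", "http://localhost:8080/orders")

# A test can prove the behavior: changing configuration changes the resolved address.
os.environ["ORDERS_URL"] = "http://orders.internal.example/api"
assert service_url("orders", "") == "http://orders.internal.example/api"
```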

Accessible vs. Available. If a service is up but no one can reach it, it might as well be neither.

Resiliency is about the system being able to recover without human intervention. Logging the problem and telling the user there was one is not sufficient. In order to be resilient, your code has to seek out problems and correct them. You should have multiple paths for work to get done so that any one failure does not cause the whole system to collapse. Blue Strike is a product that can help you determine how well your site is running and lets you monitor it to ensure it is operating correctly, or see where it is failing.

There is also a Windows Azure Diagnostics tool/toolset that you can read more about on the Azure site. It aggregates various logs and counters and pushes them to a storage account on a schedule. Note that hardware failures can prevent you from getting this logging, and it is possible to overrun your storage account thresholds (both transactions per second and total storage). You should not write your diagnostics to the same storage account where you store application data; that way you can give DevOps access to the diagnostics without also giving them access to production data they don’t need to see.

Logging levels are of course tricky. You don’t want to fill up storage with logs, but you don’t want to leave yourself without enough information. One solution Brent suggested was to log everything to some transient store and only persist it, along with the error, when an error occurs. The transient store would age out, so you’d only have recent log events, which is generally sufficient.
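Here is a minimal sketch of that idea, using an in-memory ring buffer as the transient store (my interpretation, not code from the session):

```python
import logging
from collections import deque

class RecentHistoryHandler(logging.Handler):
    """Keep every record in a bounded in-memory buffer and only persist the history
    when an error arrives, so durable storage mostly holds context around real failures."""

    def __init__(self, capacity=500):
        super().__init__()
        self.buffer = deque(maxlen=capacity)   # old entries age out automatically

    def emit(self, record):
        self.buffer.append(self.format(record))
        if record.levelno >= logging.ERROR:
            self.flush_history()

    def flush_history(self):
        # Stand-in for writing to durable storage (file, table, blob, ...).
        for line in self.buffer:
            print("PERSIST:", line)
        self.buffer.clear()

log = logging.getLogger("app")
log.addHandler(RecentHistoryHandler())
log.setLevel(logging.DEBUG)
log.debug("step 1")          # stays in memory only
log.error("boom")            # persists the error plus the recent context
```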

Modern architecture and coding techniques can mitigate a lot of failures that we used to throw more hardware at. Request buffering is one option (one I am a huge fan of – this is message queuing/event-driven architecture). Many times the issue is something like a deadlock, where if you just retry the operation it will succeed. Asynchronicity is pretty much a requirement for building applications now. It helps resolve front-end waits, but to do it properly you should be queuing things up, not just throwing work onto background threads directly.
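The retry-the-deadlock case can be as simple as a backoff loop; this is a generic sketch of my own (TransientError is a placeholder for however you classify retryable failures):

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (deadlock victim, throttling)."""

def retry(operation, attempts=3, base_delay=0.2):
    """Retry a transient failure with exponential backoff and a little jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise                                   # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

In real code you would decide which specific errors count as transient (deadlock victim, throttling, a dropped connection) rather than retrying everything blindly.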

Capacity buffering allows you to queue up work you cannot currently do so that it can be handled later. Your backend processing can eventually catch up. You can also add one more machine than you need to deal with spikes, if they occur with some regularity or you need that level of performance to handle an SLA. It also gives you a buffer of time to scale out to more nodes when you see the spike coming. With an extra “node” you are also in a better position to do a rolling upgrade of your system.
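A bare-bones picture of capacity buffering, with a queue absorbing a burst that a single worker digests over time (the numbers are invented):

```python
import queue
import threading
import time

work = queue.Queue()              # the buffer: requests are accepted even during a spike

def accept_request(payload):
    work.put(payload)             # accepting work is cheap and fast

def worker():
    while True:
        payload = work.get()
        time.sleep(0.05)          # stand-in for the slower real backend processing
        work.task_done()

threading.Thread(target=worker, daemon=True).start()

for i in range(50):               # a burst arrives faster than the backend can process it
    accept_request({"request": i})
work.join()                       # ...but the backend eventually catches up
```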

Always carry a spare. Put half of your load on each of two nodes/systems and scale them so each runs at 75% capacity. If one fails, the other can handle the load with some level of service degradation. You need to test your system to ensure that the degraded level of performance is acceptable (perhaps not ideal, but enough to satisfy the user). You may want to have features degrade in places that are not as important. For example, if you have a recommendation engine as part of your system, you can serve stale recommendations or tell the user that no recommendations are currently available. At least they can still use the primary use case of your system.
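The recommendation-engine example might look something like this (the function and cache here are hypothetical):

```python
def recommendations_for(user_id, live_engine, stale_cache):
    """Prefer live recommendations, fall back to stale cached ones, and finally to
    an honest 'none available' so the primary use case keeps working."""
    try:
        return live_engine(user_id)
    except Exception:                       # deliberately broad for this sketch
        return stale_cache.get(user_id, [])

# Hypothetical usage: a plain dict stands in for the stale cache.
cached = {"user-7": ["widget-a", "widget-b"]}

def broken_engine(user_id):
    raise RuntimeError("recommendation service is down")

print(recommendations_for("user-7", broken_engine, cached))   # ['widget-a', 'widget-b']
print(recommendations_for("user-9", broken_engine, cached))   # [] -- nothing to show
```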

The “HI” (human intervention) point. When things are going badly, there should be a point where your automation finally stops and requires a human to take over. Obviously there need to be notification protocols around this as well, so that the humans aren’t brought in cold. The “Leap Year” outage last year triggered this HI protocol. Microsoft also published a post-mortem about that event, as it does for every outage. There is probably a lot to learn in these post-mortems.

An SLA (Service Level Agreement) is just a contract that spells out terms for availability and penalties for outages. It is NOT a guarantee, and it does not prevent outages. When negotiating an SLA, you need to take into account your processes for recovery: time to detect (is your logging and monitoring any good?), time to diagnose, time to decide, and time to act. Do not just blindly agree to an SLA of no outage longer than 5 minutes; you will not meet it (unless your process is amazing). Ensure that the customer demanding the SLA will pay for the investments you have to make to meet it.

Uptime. Four services, each with an uptime of 99.95%, have a combined availability of roughly 99.8%. You have to take into account that any one service can take you down, so the outage time of each service accumulates. You can mitigate this with redundancies, but they have costs. SLAs are all about cost vs. risk. If you require a high level of availability, you need to pay for all of the extra costs, so you must have business justification for the expenditure. What is the cost of the outage versus the cost of the extra availability? It is an insurance policy. So it would seem we might be able to apply actuarial science to this problem…
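The math is just multiplication of the individual availabilities (a quick check of my own, not a number from the slides):

```python
# Four services at 99.95% each, all required for a request to succeed.
services = [0.9995] * 4
combined = 1.0
for availability in services:
    combined *= availability
print(f"{combined:.4%}")   # 99.8001% -- any single service can take the request down
```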

Redundancy math is not additive, it is multiplicative: it is the probability of both systems being down at the same time. So, two boxes with 95% uptime each give 5/100 × 5/100 = 25/10,000, or 0.25% downtime (about 22 hours per year).
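The same calculation, spelled out:

```python
# Two boxes at 95% each; the pair is down only when both are down at the same time.
single_downtime = 0.05
both_down = single_downtime * single_downtime        # 0.0025, i.e. 0.25%
hours_per_year = both_down * 24 * 365                # roughly 22 hours
print(f"{both_down:.2%} downtime, about {hours_per_year:.0f} hours per year")
```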

Try to express SLAs in terms of business value, not just hardware uptime. So, “99% of emails will be sent in 5 minutes or less” instead of “99% email server uptime.” Make it clear what is required so that you can then accurately weigh the cost versus the benefit.

Perform your own root cause analysis for your failures, but just as importantly, read and learn from the various vendors’ root cause analyses. “You are not alive long enough to only learn from your mistake.”

Always Ask for the Moon

Uncategorized

We are sizing and spec’ing hardware for a new product we are building. It is still on-premise (but could go to the cloud some day, honest!), so that means real, honest-to-God servers. We were trying to decide how best to do it while, if possible, not straying from what we typically ship for our other solutions. One of the servers will be a database server, meaning it will run SQL Server. All by itself – no other software on the box (this is a big deal for us – we generally just throw everything on one box and throw it out the door, best practices be damned).

When it came time to figure out what set of spindles to put TempDB on, I remembered that you could put it in RAM. The machine has a fair amount of RAM, so it was a reasonable thing to consider. Our manager of Core Technologies, whom the DBAs report up to, went to the DBAs to finalize the configuration plan with them. When he suggested putting TempDB in RAM, they looked at him in awe as if to say “are we really allowed to do that?” They had wanted to do it for a while, but were afraid to ask. They wanted to do something to make the system better, but never tried to see if they were allowed.

The moral of this story: always ask for the Moon. Don’t expect to get what you ask for, but if you don’t ask, you’ll never get it.

Thinking About Verbs

Architecture, Distributed Systems, General Coding, Improvement, Techniques

In my last post, I said you should model verbs, not nouns. I probably exaggerated a bit (I tend to do that) – you shouldn’t completely forget about the nouns. Your verbs wouldn’t have much to do without them. Instead, you should be focused on the verbs and the details of the interactions, not the details of the nouns and adjectives (those will take care of themselves).

Your users want to use your application to get work done. As I said, gone are the days of simple data entry applications (well, not completely, but certainly anyone building new products isn’t working on solving data entry problems). They want to perform tasks using your application – ordering a book is a task, not a data entry activity. Dispensing a medication, verifying the number of 2x4s in stock, checking a book out of a library – tasks, all.

This is not a new concept. Microsoft published an article about Inductive User Interfaces back in February, 2001 (Inductive was their term for task-based). Tasks have been the focus of various writings about Domain Driven Design, CQRS, and Event Sourcing. Another similar approach is the DCI Architecture, which is a little more formal but also acknowledges the importance of the verbs (or interactions, as they are called).

Which of these approaches, architectures, or techniques you use isn’t overly important. What is important is that you consider the verbs, the events, and the interactions. My last post came out of a meeting we had at my company about how best to handle data transfer objects. We came to the realization that it was better to have a different DTO for every context, even if they contained the same properties, because as time went on, each of these objects would have its own reasons to change.

This led to trying to determine how best to name these things. How do you differentiate between 6 different OrderDTO classes? The answer is, you don’t – the order isn’t the important part. You focus on what the consumer of the DTO is trying to achieve and use that for the name. This is of course a fairly contrived starting point for a discussion of verbs – I am still starting from nouns (orders are nouns). But it led to further discussion about the way we needed to think about things – in terms of the interactions. If we had started there, we would have had an easier time of it. Fortunately the system under discussion still has a ways to go, and we have plenty of time to make use of the lesson.
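A small, invented illustration of what naming by intent might look like, with the same order data shaped per consumer:

```python
from dataclasses import dataclass

# The same underlying order, but each consumer gets a DTO named for what it is doing.
# The fields may start out identical; they will diverge for their own reasons later.

@dataclass
class PlaceOrderRequest:        # what the checkout screen submits
    customer_id: str
    items: list

@dataclass
class ShipOrderInstruction:     # what the warehouse needs in order to pick and ship
    order_id: str
    items: list

@dataclass
class OrderPlacedReceipt:       # what the confirmation email renders
    order_id: str
    items: list
```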

Special thanks to Matt Otto, who reminded me of DCI Architecture and linked me to this very amusing article that basically makes the same points, although much better.

Model the Verbs, Not the Nouns

Architecture, Distributed Systems, General Coding, Improvement, Techniques

For most of my career, the “best practice” has been to build applications from the data up. You model the database and then everything will be happy. It’s just the way you do it. There is no other way.

So what’s the problem? You end up building applications around what data you need to display and what data you will update. So you show the user all of the data they might need, because you don’t know what they need. You ask for all sorts of data, because you might need it for some scenario. You build screens that they can use to enter any changes that might occur with this data, no matter why those changes are required.

The problem with this approach is that there is no “why?”. Why are you showing the user this data? Why are they updating it? What is it that they are really trying to do? You end up with a lot of very obtuse code that is hard to follow, because its only concern is pushing data from the database to the screen, or vice versa. It flows through a lot of logic that you might need in various scenarios, but it’s impossible to know which rules apply to which scenarios, because there is nothing in your code to imply intent.

Back when I started in this field, I was building applications to let users put paper forms into databases. There really wasn’t much more logic than that. I was not alone in this – it’s what most applications did back in the early ’90s. We were trying to build the paperless office, after all.

We can do so much more now. In fact, our users expect it. In order to do so, though, we need to think about the behavior that is expected. What is the user trying to accomplish? Why? What is the intent?

If we examined the behavior, the verbs of the system, instead of just the data (the nouns), we’d have a better understanding of what it is we are trying to build. Our code would be more obvious. The user’s intent would become clear. And then we could build the system the users actually want to use, the one that helps them get their work done more efficiently. The one that they don’t constantly complain about (OK, that might be a stretch…).
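As a contrived sketch of the difference (a library example of my own, not from any particular system):

```python
# Noun-first: one generic update, with the "why" nowhere to be found.
def update_book(book_id, fields):
    ...

# Verb-first: the user's intent is right there in the name and the parameters.
def check_out_book(book_id, member_id, due_date):
    ...

def return_book(book_id, returned_on):
    ...
```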

Model behavior by thinking about the verbs. The nouns will follow.

The Wild Goose Chases of 2011

Improvement

I spent a lot of time last year trying to learn everything I could about the latest development fads (other than Agile – I think I’ve flogged that horse to pieces). I figured I needed to learn Ruby, Rails, JavaScript, Node, jQuery, Backbone,… Part of the department is rebuilding a PowerBuilder application that needs to stay a thick client using WPF, so I learned a bit about Caliburn Micro. I found a little time to dabble in CQRS, MicroORMs, RabbitMQ, and MassTransit as well, but I didn’t focus on them too much.

The net result: come the holidays in December, I was feeling quite burnt out. So much so that I pretty much ignored software for a bit. January started and I felt better, and then I went to CodeMash, and that helped a bit. But the thing that really snapped me out of my funk was the realization last week that while I should know the basics about everything I can, I can’t know everything in any real depth. I have to have something I specialize in.

I’ve always been a fan of the generalizing specialist theory espoused by Scott Ambler. The problem was, I was spending all of my time generalizing. Part of that was an occupational hazard – I am tasked with figuring out the general direction for the technology and architecture of our product suite going forward. In order to do that, I needed to evaluate the suitability of some of these things. But I certainly didn’t need to dive into all of the shiny tools and toys I did – some of it didn’t matter at all. Some of it could have been delegated to someone else.

So now, as I step away from the fire hose that I’ve spent the last year trying to drink from, I realize what I need to do. I need to focus on the things that I know well, that have gotten me to where I am today. Sure, I still need to occasionally play with other technologies. But that can’t be the focus of my software development life. By diluting myself across these things, I haven’t really helped anyone, especially myself.

Now that I have that sense of clarity, I think this year will go much better…