CodeMash 2013 Precompiler Day One, Session Two


I am going to “live blog” my notes from CodeMash as it goes. I’ll write a post during each session. My commentary will be in italics.

Sridhar Nanjundeswaran is the dev lead for MongoDB on Azure, so that makes him a useful person to know, I think. 🙂

MongoDB is a scalable, high-performance, open source, document-oriented database. Data is stored in BSON, which is optimized for fast querying. MongoDB trades off some of the speed of a key-value store for a lot of the functionality of a relational store. The big mixing features are transactions and joins.

Data is kept in memory-mapped files with b-tree indexes. It is implemented in C++ and runs on pretty much every environment. There are drivers for most development environments/languages. There is a mongo shell that you can use to access it directly. There are drivers from third parties as well for other environments. The drivers translate the BSON to native types, which allows developers to use the idioms of their language.

Table Collection
Row(s) Document
Index Index
Partition Shard
Join Embedding/Linking
Fixed Schema Flexible/Implied Schema

There is a a transactional guarantee around single document saves, and when you consider that what would have been in a relational transaction across multiple tables is generally all in a single document. So, for the use cases where you would want to use MongoDB, you have sufficient transactional integrity.

MongoDB has extensions to JSON to facilitate some additional data types. BSON spec defines these. In general, MongoDB will take more space than a relational DB for the same data due to the schema being embedded in the document.

MongoDB is very light in it’s use of CPU, but demands lots of memory and disk (relative to a similar relational DB).

_id is the “primary key” for an element in collection. ObjectId is a 12 byte value that is like a Guid but smaller (Guids are 16 bytes). ObjectIds are guaranteed unique across a cluster.

The MongoDB Shell is sort of like SQL Server Query Analyzer except it uses JavaScript instead of SQL.

Counts in MongoDB can be slow – the b-tree index used is not optimized for counting, but the next version of MongoDB is addressing this.

The JavaScript looks very much like twisted SQL, but it allows you to use Map-Reduce as well. MongoDB supports Hadoop, too.

A document in MongoDB can be up to 16 MB. GridFS is a mapping system to allow you get get around this if you need to.

You can change a document’s data or schema at any time, but you cannot do a push to an existing field that is a scalar. You would need to rewrite the entire entity.

“Normalization” gets really tricky. Do you store the child entities in the document of their parent, or so you store some sort of link/reference and store the child as it’s own collection. The trade offs are how hard is it too query, storage, performance, and maintainability. So, how do you update the entity if it changes and is used in a lot of documents? Sometimes that is acceptable and sometimes not. You have to determine what makes sense for your case.

There is a DBRef, but it is really just syntactic sugar that generally does not add an value. It is NOT a form of foreign key constraint as people often try to use it.

Indexes are the single biggest tunable performance factor in MongoDB (just like in relational databases). Too many indexes has the same problems as in relational – writes become slow, heavy resource usage (memory in this case is the bigger deal). Indexes get built during the write operation, same as in relational. Generally, the same rules and practices apply to MongoDB indexes as relational.

You can actually create unique constraints via indexes. The only database constraint you can enforce in Mongo. You can recreate the index as well, but that is a blocking operation.

You can create an index that defines a time to live which will delete the document when the TTL expires. There is a process that does this work every minute.

A collection cannot have more than 64 indexes (which is way more than practical). The size of the key values in the index cannot exceed 1024 characters and the name cannot exceed 127 bytes. You can only use one index per query. You can not sort more than 32 MB of data without there being an index on it. MongoDB indexes are case sensitive.

Append .explain() to any JavaScript query to see it’s query plan. There is also a profiler you can turn on that will profile everything or any query that exceeds a threshold. Several query plans are tried, the winning one is cached. It remains cached for some period of time, but not forever.

Replicate you database to multiple other nodes. One node is the primary. If it dies, the remaining nodes vote on who should become the new primary. In the event of a tie, there won’t be a primary and we’ll have a read-only situation. To avoid that, make one non-primary node the arbitrator who will ensure that someone wins the election to primary. You can also set priorities for the nodes to help define failover order. You can define a node as “hidden” which will prevent it from ever becoming the primary. You might do this for a reporting node.

You can set replication delays. Replication is asynchronous and pull data from the primary. You might set a node to delay replication (slaveDelay) so that you get a bit of a backup that will still have things that might be dropped or other destructive events. You would want to mark this node as hidden as well.

Replication is eventually consistent, but if you require things to be always consistent you could query the primary always. There are times when you need that and times when you don’t. There are some very complex procedures around how to configure replication and primary voting – you can probably cover yourself for most scenarios you would need. You can replicate to a node that is physically not in the same data center, which allows for disaster recovery.

You can define how you read data – Primary Only, Primary preferred, Secondary only, Secondary preferred, or nearest. If you need consistent reads, use primary only or preferred.

In order to get disaster recovery, you need to have multiple data centers (more than 2) with more than one node per data center.

Heartbeat goes out every 2 seconds. It times out after 10 seconds. A missed heartbeat means a node is down and triggers the election if it is the primary.

There is an oplog that tells what has happened in replication. It is replayable. There is no multi-operation transactions, but every single operation is transactional – it succeeds or fails all on it’s own.

Bad things happen when the clocks of a replica set get out of sync. Keep them in sync.

The slides for these presentations are generally here, I think:


CodeMash 2013 Precompiler Day One, Session One


I am going to “live blog” my notes from CodeMash as it goes. I’ll write a post during each session. My commentary will be in italics.

I started my day in Jimmy Bogard’s NServiceBus seminar, but I quickly realized he wasn’t going to go into a lot more beyond what I already know about messaging and service buses, so I had to bail. It looked like it was going to be a really good session if you didn’t know a lot about either of those subjects. It definitely would be an in-depth hands on session, so I was disappointed to bail on it, but there are so many things here that I just couldn’t justify staying and hoping he got to the things I could stand to learn more about around 3 pm. The good news is, the room was full. It is good to see so many people interested in message/event based architectures and techniques.

I moved to Mike Wood’s (with Brent Stineman) Cloud Architecture with Windows Azure seminar. So far, it has been nothing but marketecture, but I am hoping it gets better once they get through the basics. I hope I didn’t miss some disclaimer about it not being hands on.

SQL Azure is not the same as SQL Server. The databases are limited to 150 GB and don’t have all of the features. One way to deal with this is to scale out instead of scaling up, which is perfect for building a service-oriented system built from small services. I am a big believer in small (not nano) service based systems. Latency is also a potential issue you need to deal with, even within the same Azure datacenter. The sweet spot is large scale with modest performance, not things requiring high performance. Caching becomes critical for things that need to perform.

Ironically, NoSQL can handle data much better with Azure Tables. The limit of a “table” is 1 TB, which is not a limit of the table but a limit of the underlying storage accounts in Azure. You can also store blobs of data, which is useful for large documents or media. You can use them for streaming.

Azure also supports queues and something called HDInsight, which is a Hadoop implementation. There is an Azure Service Bus, which supports AMQP 1.0. There is also a version you can run on Windows Server, so you can probably connect them together.

Virtual Networking allows you to set up trust relationships between Azure client machines AND your own infrastructure, although the support for this is still limited. Additionally there is Mobile Service which facilitates building mobile apps and storing the data/handling processing in the cloud.

You can use the AppFabric services (which handle caching) locally (it is included in the Windows Server license) and then use Dedicated Caching on Azure. Azure Caching Service is an extra charge, so they recommend you use Dedicated Caching.

There is federated integration with Active Directory which helps you manage security across your local datacenter and Azure. I have to admit that I don’t know enough about this to know how well this might work or how useful it actually is, but it seems like it would solve some problems with security and administration.

The Content Delivery Network (CDN) holds storage accounts. You pick the data center where your storage account lives. When someone requests static data from a location closer to a different data center, the data will be pulled over and cached at that data center, speeding access for additional uses. The data can be cached up to 72 hours, which will vary depending upon usage of the data. It does test for staleness using hashes of the files involved. Obviously this only applies to static, publicly available (i.e. not secured) content. You can use the CDN to cache dynamically generated data if you need to without a whole lot more work.

Aaannnnddd we’ve dropped back into marketing again…

Traffic Manager allows you to control how data and process flows between data centers> It has three profiles. Performance, which routes all requests to the closest data center to the user. Failover, which sends all traffic to the primary solution in one data center and only directs it to another solution in the same or another datacenter if the primary fails. Failover is probably only useful if you have a regional site that does not make sense to have hosted in a bunch of data centers. Round Robin, which also is most useful only within one datacenter, since it will not help when you are running internationally (In fact, it will hurt A LOT). Traffic Manager is a “preview item”. Lots of this stuff is preview items, which is Microsoft-speak for beta.

Data synchronization between data centers is tricky because someone may expect that things are in sync between multiple data centers. Geo replicated storage is automatically turned on for Azure Storage and will have a secondary somewhere else in the same “region” (so, US, Europe, etc.). Queue data does not get replicated – only table and blobs. There is no guarantee about how much of a delay there is between the data getting stored at primary and the data being available at the secondary. Also, only Microsoft has control of the “switch” to fail over to the secondary. So, geo-replicated storage does not necessarily solve the problem of data being available at multiple data centers in a timely fashion. This actually caused a data outage of 60 hours recently for some people hosted at the South Central DC in Texas because it didn’t affect all customers and so they were afraid to pull the trigger on the fix so that they wouldn’t kill working services.

There are feeds and resources that will tell you about the state of Azure that you need to watch if you are using Azure to find out about outages. You can also buy premier support to help mitigate this problem, but that might not be sufficient for explaining to your customers. The solution is to create your own replicates if you find this scenario will not work for your use case or customers. This of course requires you to do the work for yourself, which adds a lot of complexity.

SQL Azure Data Sync uses the Sync Framework so you can plug other things in to it. It will allow you to do some of the replication you need and setting up SQL data sync is not very hard. You define one store as the master and then it will push the data out to the replicas. Schema changes DO NOT sync, so if you change schema you will have to address that yourself. It is data sync, not replication. It is a different speed than replication – there is no guarantee that things will actually be the same at any given time. The minimum amount of time that data will be out of sync is 5 minutes. There is also currently a bug that will not replicate more than 20 MB in a single sync operation, so you need to set your parameters to avoid that at the moment.

Clearly, when you are dealing with clouds and data, eventual consistency is the norm. Eventual consistency should just be the way we think about things now, and so we need to coach our users and business analysts about their expectations, If we need to get something better than eventual, there will be a big cost.

Service Bus Topics and Subscriptions – if no one subscribes to a message/subscription, the message just goes to the bit bucket. Pretty much exactly the way every other service bus works. You can define a queue as a sync queue to provide you some redundancy.

Because things are very fluid in a cloud environment (i.e. you will not have fixed IP addresses) you need to make sure you can handle addresses and names changing without recompiling your code. Of course, this is just a good practice anyway, even if you aren’t using a cloud. It is probably a good idea to build test cases around these sorts of things as well.

Accessible v. Available. If a service is up but no one can reach it, it is neither.

Resiliency is about the system being able to recover without human intervention. Logging and telling the user there was a problem is not sufficient. In order to be resilient, your code has to seek out problems and correct them. You should have multiple avenues for things to occur so that any one failure does not cause the whole system to collapse. Blue Strike is a product that can help you determine how well your site is running and allow you to monitor to ensure that it is operating correctly or where it is failing. There is also a Windows Azure Diagnostics tool/toolset that you can read more about on the Azure site. It aggregates various logs and counters and pushes them to a storage account on a schedule. Note that hardware failures can prevent you from getting this logging and it is possible to overrun your storage account thresholds (both transactions per second and total storage). You should not write your diagnostics to the same storage account as you are storing application data. This allows you to give access to diagnostics to DevOps without them being able to see production data they might not need to have access to.

Logging levels are of course tricky. You don’t want to fill up storage with logs, but you don’t want to leave yourself without enough information. One solution Brent suggested was to log everything to some transient store and only persist it along with an error when the error occurs. The transient storage would age out so you’d only have recent log events, which is generally sufficient.

Modern architecture and coding techniques can mitigate a lot of failures that we used to need to buy more hardware to deal with in the past. Request buffering is one option (One I am a huge fan of – this is message queuing/event driven architectures). Many times the issue is something like a deadlock where if you just retry the operation it will succeed. Asynchronicity is pretty much a requirement for building applications now. It helps resolve front end waits, but to do it properly you should be queuing things up, not just throwing things to background threads directly.

Capacity buffering allows you to queue up work you cannot currently do so that it can be handled later. Your backend processing can eventually catch up. You can also add one more machine than you need to deal with spikes, if they occur with some regularity or you need that level of performance to handle an SLA. It also gives you a buffer of time to scale out to more nodes when you see the spike coming. With an extra “node” you are also in a better position to do a rolling upgrade of your system.

Always carry a spare. Put half of your load on each of two nodes/system and scale them to be at 75% capacity. If one fails, then the other can handle it with some level of service degradation. You need to test your system to ensure that the level of performance is acceptable (perhaps not ideal, but that satisfies the user). You may want to have the features degrade in places that are not as important. For example, if you have a recommendation engine as part of your system, you can give stale recommendations or tell the user that no recommendations are currently available. At least they can still use the primary use case of your system.

The “HI” point. When things are going bad, there should be a point where your automation finally stops and requires a user to take over. Obviously there needs to be notification protocols around this as well so that the humans aren’t brought in cold. The “Leap Year” outage last year triggered this HI protocol. Microsoft also published a post-mortem about this event, and does for every outage. There is probably a lot to learn in these post-mortems.

An SLA (Service Level Agreement) is just a contract that spells out terms for availability and penalties for outages. It is NOT a guarantee. It does not prevent outages. When negotiating an SLA, you need to take into account your processes for recovery – time to detect (is your logging and monitoring any good), time to diagnose, time to decide, and time to act. Do not just blindly agree to an SLA of no outage longer than 5 minutes, you will not meet it (unless your process is amazing). Ensure that the customer demanding the SLA will pay for the investments you have to make to meet the SLA.

Uptime. Four services, each with an uptime of 99.95% uptime has a mean availability of 99.75%. You have to take into account that any service can take you down, so you have to accumulate the total outage time for each service. You can mitigate with redundancies, but they have costs. SLAs are all about cost v. risk. If you require the high level of availability, you need to pay for all of the extra costs, so you must have business justification for this expenditure. What is the cost of the outage versus the cost of the extra availability. It is an insurance policy. So it would seem we might be able to apply actuarial science to this problem…

Redundancy math is not additive, it is multiplicative. It is the intersection of both systems being down at the same time. So, 2 boxes with 95% uptime is 5/100 * 5/100 = 25/10,000 or 0.25% downtime (22 hours per year).

Try to express SLAs in terms of what the business value is, not just hardware uptime.So, 99% of emails will be sent in 5 minutes or less instead of 99% email server uptime. Make it clear what is required so that you can then accurately weigh the cost versus the benefit.

Perform your own root cause analysis for your failures, but just as importantly read and learn from the various vendor’s root cause analyses. “You are not alive long enough to only learn from your mistake.”