Codemash Thursday 9:15a – A guided tour of the BigData technologies zoo


A guided tour of the BigData technologies zoo
#Itamar Syn-Hershko

Big Data is a buzzword. “Big Data is any thing which will crash Excel” (@devopsborat)

Even if you don’t have Big Data, these tools and technologies can still be useful.

Agenda – Data at Rest, Streams, Moving data around

There are a LOT of tools and technologies around big data.

Where are we today:
* Database Schemas
* Unreliable at scale
* Expensive at scale
* Relational mindset
* Data is being moved from storage to compute

Schemas assume structured data, can be hard to set up, and are hard to adapt (lack agility)
Scaling strategy is bigger machines (scaling up) which is more expensive than scaling out (multiple simple machines)

Quote from Grace Hopper (heavily paraphrased):
You can’t grow larger oxen, so you need to get another ox to move a bigger load.

Hadoop – based on Google File System and MapReduce
Commodity hardware
Created by Doug Cutting and Mike Cafarella. Open sourced under Apache.
The original product was called Nutch in 2002. Became Hadoop in 2006.
HDFS – Hadoop Distributed File System
Basically takes a big file and store it on a lot of servers. Divide the file into partitions and store each partition on different servers. Essentially sharding. The partitions are each stored on more than one machine for protection.
There is a NameNode that manages how the data is partitioned and how to reassemble it. Losing NameNode is a problem, so it needs to have redundancy.

More DFS: S3, CephFS, GlusterFS, Lustre

Dedicated File Formats: SequenceFile, RCFile, Avro,

MapReduce – parallel computations on data, based on functional programming concepts
Map processes documents in some way (take sentences and break them in to words, for example), producing tuples.
Reduce takes the tuples and combines them (so take each word and add up the counts)

Hadoop does this in Java – you write a Mapper and a Reducer (they implement interfaces). Then you put the .jar files on Hadoop and it runs the job in place using TaskTrackers. These TaskTrackers are controlled by a JobTracker, which runs the job and spins up the TaskTrackers to do the work. There will be a TaskTracker for each partition of data. This is how parallelism is achieved.

Hadoop now has a bunch of distributions. Apache, Cloudera, Hortonworks are the key ones. Each beyond Apache adds other technologies on top (Impala; HCatalog, Tez). Various features are added by cloud vendors to make managing Hadoop in the cloud easier, too.

Apache Hive – Runs SQL over HDFS using HiveQL
HiveQL is not exactly SQL, but very similar
Compiles down to MapReduce, later versions compile down to DAG (Tez)
Think of MapReduce as assembly language, there are various abstractions you can use above it which have their own advantages (HiveQL is one of them obviously)
Apache Pig is a procedural language that expresses processes on data and compiles down to MapReduce (scripts are called Pig Latin). You can write user defined functions in your own language (Javascript, for example) and use them in Pig

Apache HCatalog – Hortonworks distro only
Defines another way to look at your data files and figure out what files you want
HBase is another one

Workflow schedulers – Apache Oozie, LinkedIn Azkaban, Spotify Luigi are examples

The bad and the ugly:
* Data is not always local
* Still too much I/O
* Slow to compute
* Hard to make JobTracker High Availability (HA)
* Poor resource utilization (you can be either a mapper or a reducer)
* NameNodes are a single point of failure

YARN and MapReduce 2.0
YARN does cluster resource management. People call it an operating system for data processing. Improves on the issues with Hadoop above.

Apache Spark
Resilient Distributed Datasets (RDD) – represented as DAG (Directed Acyclic Graph)
Combine data and actions
Take data, transforms it, performs actions on it
RDD is split to do work in parallel as possible.
Transformation: map, filter, union, distinct, join, etc.
Actions: take, count, first, reduce, foreach, etc.
Works continously instead of in batches
Out of the box – Scala and Python
Has integration with Spark R, Spark SQL, Spark GraphX, Spark Streaming
Spark runs in clusters – can self-manage, or you can run in YARN or Apache Mesos
Driver program sends work to the cluster manager. Worker nodes do the work. Worker starts after the last processed data, so somewhat crash tolerant.
Spark has a large ecosystem of it’s own, similar to Hadoop

Stream Processing
Iterative batch processing (Deterministic batch operations)

Apache Storm
Handles streams
Takes from sources (spouts)
Processes in Bolts
Define a topology of Spouts and Bolts connected together
Runs continously, not batch

Apache Samza
Similar to Storm
Handle each message as it arrives
Garantees ordering

Data Pipes – how do we stream into Hadoop?
RabbitMQ, Cassandra, Redis, Kafka, etc.
Apache Flume

Configuration Management, Synchronization

Since we are talking about distributed systems: read Aphyr’s “Call Me Maybe” blog series[]

ELK – Elastic Search, Logstash, Kafka to work with log streams

Apache Mahout – Machine learning framework

Codemash Thursday 8:00am – Building Highly Scalable Apps on the Azure Platform: Real World Guidance


I am going to be posting my notes from each session I attend. These are raw and unedited, so I apologize for any grammar or spelling mistakes.

Building Highly Scalable Apps on the Azure Platform: Real World Guidance
#Kevin Grossnicklaus

You should sign up for an Azure account.
You get an Azure portal where you can access and create any kind of Azure service.
You can use bit and pieces – he sometimes uses Azure Service Bus with his AWS applications.
He puts an Azure deployment project in every Visual Studio solution.
He does not use the Azure emulator – he sets up a console app to run the service directly for development. It is just a wrapper around his Azure service. This is because it is faster to work with since it does not have to do the deployment, even to the local emulator.
He uses the Azure services that he needs from the console application (ASB, for example). This does require a constant internet connection, but it gives you full access to the services you need with all of the features. The emulator is not at parity with the real thing.
He believes very heavily in message-driven architectures. He uses MongoDB.
He has a logging service that takes in the log events and writes to a queue, and has a log writer service that actually writes to a DB.
All of the parts of Azure have libraries to support them available via NuGet if you are using .Net. There are SDKs available for non-.Net applications.
In order to work with other people, he appends things like the machine name or the environment to the front of the queue name so that messages aren’t stolen by the wrong computer or application.
Serialize small objects to the queue using a text format like JSON.
When you create Azure assets, you select what datacenter they will go in. Typically you want all of you stuff in the same datacenter to minimize latency (but what about redundancy – guess that is a different consideration and out of scope for this talk).
Send work via an async call to a queue to be done by a “backend server” if that work doesn’t need to be immediately visible to the user. For example, you might be able to do a simple form validation then queue up the form submission for later processing.
You can have multiple instances of a service running that listen to the same queue, and only the first one to pick it up will process it. If you want to have multiple consumers of the message, you would send it to a topic and all of the services that are interested in that topic would receive the message.
Now he is talking about precalculating things like a Facebook timeline. He didn’t use the terms “read model” or “eventual consistency” but that is what he is talking about here. He would use SignalR to send the updated read model to the client when it is ready.
Integrated Caching – use the memory in the servers you are already running instead of a dedicated cache. You can define the allocation. He allocates 1/3 of the memory for each of his web servers. This can save you money if you have the memory to spare. You can spin up dedicated cache – your application does not see any difference – it is just a different connection string. Redis is also available, but he doesn’t use it. It would require some code changes.
Azure Blob Store is like an unlimited disk. You put documents in containers, which can be public or private. You can use a public container as a CDN. It is probably better to wrap your blobs in an API instead of allowing direct access. This allows you to do things like reformatting (image size, for example).
He definitely spends a lot of time trying to figure out the best ways to use Azure cheaply, which makes sone sense.
He hosts MongoDB through mongolab, which actually runs on Azure so his databases end up in the same datacenter. You can even put the database in the same Azure network as the rest of your stuff. mongolab actually runs mongo on whichever cloud provider you want – it works with AWS as well.

Optimize for Maintenance


When you are building software, you can optimize for one of two things. You can reduce the time it takes to create or you can reduce the amount of effort you will put into maintenance. Old school RAD tools (and much of Microsoft’s demo-ware) optimize for quick construction. Tools like ORMs (especially Entity Framework) and Rails also make this choice. They all allow you to build solutions quick, and there is nothing wrong with that, if that is what you need.

Most software will live for many years, so you will spend much more time fixing it and adding features. This is where the whole RAD strategy falls apart. Most of the tooling that allows you to build fast (look Ma – no code!) does not hold up well to attempts to maintain it. It allows hidden complexity to creep in while you work around the tooling to fix a bug or add a feature. Before you know it the whole stack of software is hard to debug or add to so nobody wants to do it.

This is why developers fight against having to do maintenance – because they have made it hard on themselves. Developers would like their jobs better if they built for maintenance instead of trying to meet aggressive deadlines.

Out of the funk


I have not been feeling creative for a long time. I suppose this is something that happens from time to time for people. I can recall other times when it happened to me. This time was different. It lasted a long time. But enough with that.

One of the things that helped me get back on track was watching conference talk videos. I have been trying to exercise more for the last couple years with varying degrees of success. The one thing that consistently works for me is walking. The problem with walking is I live in Pittsburgh. Once September ends the sun becomes a rare thing until the following April.

So the solution is obvious. Go to the gym to walk or use the treadmill we bought eons ago. Even though my gym is very close to my house, it is amazing how easy it is to make excuses about not having enough time. The treadmill doesn’t take that excuse. The problem there is walking on a treadmill is SO BORING. Then a realization hit me. I have a tablet. It fits nicely on the magazine holder on the treadmill. And I had been collecting links to conference talks I never find time to watch.

In the past, the number of talks I wanted to watch would have probably only lasted me a week or two. Recently it seems that most of the major conferences (especially some that I would love to attend but are just too far away) have been posting videos of pretty much every talk. Now the solution to the treadmill problem was in focus: I would watch 30-60 minutes of video every morning while I walked on the treadmill.

I know a lot of you are asking why I didn’t do this sooner. I don’t have a good answer. But I wish I had. I started seeing inspiring talks about all sorts of subjects. And then just a few days ago I saw the one that launched me into action: MythBashers: Adventures in Overlooked Technologies by Avdi Grimm.

I have been trying to think up a meaningful side project to learn some new things and scratch a few itches. That word, meaningful, has been the killer. I build distributed systems to manage medications in my day job. If I want to do something meaningful, it should help me get better at that. I was trying to figure out ways to play with message queues and streaming systems and other crazy things. That would be meaningful.

But really what I needed was something I could work on that would be fun. If I managed to learn something along the way that was applicable to my day job, great. But I started programming as a hobby. I wasn’t doing it for the money or the fame. I was doing it because I enjoyed it.

I have been dabbling in “functional” languages because I find them intriguing. The reasons are the same as anybody else. I played with Erlang, tried (and hated) Scala, fell in love with Clojure. The learning curve on all of them is steeper for me than F#, though, since in my day job I use C#. So I decided to focus on F#. I’ll pick up Clojure again later and I’m sure I’ll check out Elixir. But for now I need to focus, so F# it is.

And so we come to the project. Back in college I ended up getting very poor grades one semester. I partly blame taking two 300-level econ classes at the same time. My parents blame beer. I know the real truth. It was MUDs. To those who weren’t alive in the early 90’s, a MUD is a Multi-User Domain (or Dungeon). Think World of Warcraft, but with nothing but text. I was obsessed. I spent a lot of time in the computer lab playing. And then it got worse – my roommate showed me how to connect from my dorm room. Yeah.

To this day I have resisted games like World of Warcraft because I know what will happen. But now it is time to put that misspent youth to use. I have started building a MUD in F#. I doubt it will ever be great. I may be the only person who ever plays it. That is not what matters. What matters is it is a real thing to build. A thing I can care about and enjoy. That is really what a hobby project should be. I may come up with a better idea someday, but in the meantime I should be using my hobby time to do something more useful than reading millions of words about software development. I should be writing software.

I plan to write about what I learn along the way. I actually have set up a micro-blog of sorts in the project where I am posting my “stories” and other notes along the way. I expect that as I start I will not be writing good F# code, but by writing the code I will learn more and eventually I will have written some good F# code. Probably just a couple methods across the project. 🙂

If you would like to follow along (and tell me what I am doing wrong – this is a learning experience, after all) you can find the project here.



I’ve been spending some time learning about different languages and paradigms, which introduces me to terminology I’ve never heard before. I am not a “classically trained” computer science major – my programming and design skills are mostly self-taught. When I was staying mostly in my own space (building line of business applications in .Net, and VB (and some dBase) before that) I did not get exposed to any of this crazy terminology.

So, mostly as notes to myself, I am writing down what these terms I keep hearing mean (in my own words), instead of just glossing over them in my head and hoping the rest of the context clues will keep me going. If you get any value from this, great! 🙂

Memoization is a term I keep hearing in the various functional programming/distributed systems videos I’ve been watching. I figured it has something to do with remembering things and not put too much more thought into it. And, indeed, it does have to do with “remembering”, but a fairly specific case of it.

Memoization is keeping the results of past computations so that you can refer to them again without having to take the time to do the computation. It shows up often in recursive solutions. Basically, if you have a deterministic (a word I do know, since I did study a good amount of statistics WAY back in college) function, you can store the result in a dictionary keyed with the input value(s) (or a hash of them), and just use it in the future instead of recalculating.