I would have written a shorter letter but didn’t have time.

Improvement, Techniques

I received an email from a coworker. The email was terse and wordy at the same time. I’m not sure what the author wants from me. I think he wants me to take some action. Perhaps he wants me to help advocate for the use of the tool he is talking about. Perhaps he wants me to tell him what his next steps should be. It is a rambling email that never gets to the point.

I read my fair share of emails like this. I have written them as well. And not just email. Much of what I read could get to the point much faster. I am trying to get better at writing by writing more. I am using an excellent tool, Hemingway Editor, that helps a lot with keeping my writing succinct.

The biggest problem in my writing is long, rambling sentences. I often write sentences that should have been a paragraph. They are never as clear and understandable as they sounded in my head. That alone is reason enough to use Hemingway, but it also has other features that help reduce complexity. If you spend any amount of time writing, you should give it a try.

Responsibility

Uncategorized
From the random stack of unfinished blog posts…

There is a problem with processes and documentation. We tend to use them as shields to prevent us from having to think or make decisions.

Software development as a profession is a funny idea. We want to be treated as professionals, but most of the time we don’t seem to want to act as professionals. We want to hide behind “processes”. We love our processes. It only makes sense, I suppose, since we spend most of our time trying to figure out how to get computers to keep users stuck in processes. Processes to buy goods and services, or to fill orders, or to handle insurance claims. Things that are repetitive.

Automating processes like that is a good thing – that’s why we do it. It saves time and reduces errors caused by repeating similar actions. It takes the drudgery out of these things and lets the user get back to other activities that are more creative or fulfilling. Or it enables unskilled labor to handle tasks in a reliable fashion, freeing up skilled labor to do more profitable things.

That is the problem: we are trying to automate processes that make it possible for unskilled labor to develop software. But we’re not unskilled labor – it’s just not possible for unskilled labor to do these things. Developing software is a creative activity. Creativity doesn’t happen as well when it’s done under a long list of constraints, which is what process does.

Is all process bad for software development? Just as empowerment doesn’t mean self-management, you can’t completely avoid process. Process constrains us to be able to produce software in reliable, semi-predictable ways – to a point. The problem is that we tend to go past that point. We want to be constrained into tiny boxes, so we don’t have to think much, and so we can blame the process when we fail. We are risk averse and afraid of failure.

Agile is attractive because we know we have to do something to enable us to produce good software in a reasonable amount of time. We know the old ways just weren’t working. The problem is, we have years of experience (and beatings) telling us that failure is bad. We clutch at any process we can so that we can blame the process when things go wrong. And because we don’t truly embrace agility, things do go wrong. So we add more process. And things get worse. And so on.

The next time a problem occurs, try something else. Instead of figuring out how to add more process to prevent the problem, think about how taking responsibility could have stopped it sooner. Sometimes you just have to stand up and do what’s right. Admit that you were wrong about some aspect that is now causing a problem. We have “fear of failure” so ingrained that we will do anything to avoid being seen as failures. And that is just wrong.

Codemash Friday 2:45p – Getting Started with Machine Learning on Azure

Uncategorized

Getting Started with Machine Learning on Azure
#Seth Juarez

ML on Azure is easy…
…if you understand a few things about machine learning first

Agenda
1. Data Science
2. Prediction
3. Process
4. nuML
5. AzureML
6. Models (if time permits)

Data Science
key word – science – try something, it might work, repeat with a different trial, etc.
Science is inexact
Guess, test, repeat

Machine Learning
finding (and exploiting) patterns in data
replacing “human writing code” with “human supplying data”
the trick of ML is generalization

Supervised v. Unsupervised Learning
Supervised – prediction – we know what we are looking for
Unsupervised – lots of data, try and figure out things from the data (clusters, trends, etc.)

Kinect finding your body is Supervised (knows what it is looking for)
Netflix figuring out recommendations is Unsupervised

What kinds of decisions?
binary – yes/no, male/female
multi-class – grades, class
regression – number between 0 and 100, real value

multi-class can be done using binary (A v. something else, B v. something else, etc. – then take the best-scoring one as the value)
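A minimal F# sketch of that one-vs-rest idea (mine, not from the talk) – the per-class scoring functions are placeholders standing in for trained binary classifiers:

// One scorer per class; the class whose scorer is most confident wins.
let scorers =
    [ "A", (fun (x: float[]) -> x.[0])            // hypothetical "A vs. rest" score
      "B", (fun (x: float[]) -> x.[1])            // hypothetical "B vs. rest" score
      "C", (fun (x: float[]) -> x.[0] + x.[1]) ]  // hypothetical "C vs. rest" score

let predict (features: float[]) =
    scorers |> List.maxBy (fun (_, score) -> score features) |> fst

predict [| 0.2; 0.9 |] |> printfn "best class: %s"   // "C" here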

Process
1. Data
2. Clean and Transform the Data
3. Create a Model
4. Use the Model to predict things

“features” is the term used to describe the attributes that could influence your decision
Data cleaning and transformation is just shaving yaks (it takes a lot of time)

nuML [http://www.numl.net/]
A .Net ML library
Comes with a REPL for C#
Attributes to mark things [Feature], [Label], etc.
gets data into matrix
(Mahout does this stuff in Java)
turns it into a model
has visualizers for the model (graphical and text based)
Can use the model to create a prediction given a set of inputs (features)

How do you know it worked?
Train and Test – use some portion of your data to train the model, and then the rest to test (Seth is suggesting 80-20)
nuML has facilities to do this
It will take your data, create a model, score the model, and repeat 100 times
Then it returns the best model
You have to be careful of overfitting – if you create too fine-grained a model, you might overfit it to your data and get bad predictions.
Training v. Testing means – if you have 10 rows of data, you will train on 8 rows and then test with 2
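A rough F# sketch of the 80-20 idea (not nuML’s own API): split the rows, train on the first chunk, and score the held-out rows. trainModel is a hypothetical stand-in for whatever learner you use.

// Split labeled rows into a training set and a held-out test set.
let split trainRatio (rows: ('a * 'b)[]) =
    let cut = int (float rows.Length * trainRatio)
    rows.[.. cut - 1], rows.[cut ..]

// Fraction of test rows the model labels correctly.
let evaluate (predict: 'a -> 'b) (test: ('a * 'b)[]) =
    let correct = test |> Array.filter (fun (x, label) -> predict x = label)
    float correct.Length / float test.Length

// Usage, assuming a hypothetical trainModel : ('a * 'b)[] -> ('a -> 'b)
// let trainRows, testRows = split 0.8 rows
// let model = trainModel trainRows
// printfn "accuracy: %.2f" (evaluate model testRows)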

Limitation – limited amount of resources on your machine (CPU, RAM, Disk)

AzureML
Drag and droppy action to set up the path through the steps above
Will give you some charts of your data so you can start to get some insight into it
Has a bunch of transformation options, also draggy
If you don’t know the 4 step process, the “wizard” can be tricky, but if you know it, it’s fairly straightforward.
You just drag and drop your flow through the steps – you can have multiple transformations (shaving the yak)
You can define the training/testing ratio by specifying the percentage to use for training
Define scoring for the trained model
Evaluate the scored models
You need to define what you are learning so that it can train the model
precision – true positives over predicted positives: TP / (TP + FP)
accuracy – correct predictions over the whole set: (TP + TN) / (TP + TN + FP + FN)
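In code form (my own two functions, with made-up counts):

// Precision and accuracy from raw confusion-matrix counts.
let precision tp fp = float tp / float (tp + fp)
let accuracy tp tn fp fn = float (tp + tn) / float (tp + tn + fp + fn)

precision 40 10 |> printfn "precision: %.2f"     // 0.80
accuracy 40 45 10 5 |> printfn "accuracy: %.2f"  // 0.85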

If you are getting 95% accuracy, you probably have something wrong – that is usually too accurate (overfitting). Ideal is in the 80-90% range.

You can run multiple scoring methods – you will get a chart comparing them

You can publish as a web service to use your model for predictions

Linear Classifiers – need to define a boundary between sets
Perceptron algorithm – draws a line between two sets
Kernel Perceptron – takes data that can’t have a line drawn between it, 3D-ifies it, draws a plane between the sets
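A bare-bones perceptron sketch in F# (mine, not from the slides) – labels are +1.0/-1.0, and the weights get nudged toward any point they misclassify:

// Sign of the weighted sum decides the class; fold a bias in by appending
// a constant 1.0 feature to every input.
let classifySign (weights: float[]) (x: float[]) =
    let score = Array.map2 (fun w v -> w * v) weights x |> Array.sum
    if score >= 0.0 then 1.0 else -1.0

// Perceptron update rule: for each misclassified point, add label * point to the weights.
let trainPerceptron epochs (data: (float[] * float)[]) =
    let weights = Array.zeroCreate (Array.length (fst data.[0]))
    for _ in 1 .. epochs do
        for (x, label) in data do
            if classifySign weights x <> label then
                Array.iteri (fun i v -> weights.[i] <- weights.[i] + label * v) x
    weights

// Tiny separable example; the last feature of each point is the bias term.
let w = trainPerceptron 10 [| ([| 0.0; 0.0; 1.0 |], -1.0); ([| 2.0; 2.0; 1.0 |], 1.0) |]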

Lots of math formulas – review the slides

“Kernel Trick” – will allow you to determine a separator between two sets no matter what space by going into multiple dimensions with the data
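A toy illustration of the idea (mine): points inside [-1, 1] can’t be split from points outside by a single threshold on x, but after lifting each x to (x, x*x) a horizontal line at x*x = 1 separates them.

// Lift 1-D points into 2-D so a linear boundary exists.
let points = [ -2.0; -1.5; -0.5; 0.0; 0.5; 1.5; 2.0 ]
let lifted = points |> List.map (fun x -> x, x * x)
let classify (_, xSquared) = if xSquared < 1.0 then "inner" else "outer"
lifted |> List.map classify |> printfn "%A"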

If you cannot get a reasonable answer – neural networks
A network of perceptrons
Now we’re into calculus

Codemash Friday 11:00a – Consuming Data with F# Type Providers

Uncategorized

Consuming Data with F# Type Providers
#Rachel Reese

What are type providers?
A mechanism to provide types to the compiler – a compiler plugin.
WSDL, CSV, JSON, Other languages (R, Python, Powershell), etc.
As long as it has a schema, you can write a type provider for it

Why?
Simplify consuming data by encapsulating it in a type provider
Facilitates using data from the REPL
Allows you to play with the data in the REPL to figure things out about it

Demos
Showing us how to use existing type providers
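Not one of the demos, just a minimal sketch of the idea using FSharp.Data’s CsvProvider – the file name and its Name/Age columns are made up:

#r "nuget: FSharp.Data"     // reference the package in a script
open FSharp.Data

// The provider reads a sample file at compile time and generates typed row properties.
type People = CsvProvider<"people.csv">

for row in People.Load("people.csv").Rows do
    printfn "%s is %d" row.Name row.Age   // Name and Age come straight from the CSV header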

Codemash Friday 9:45a – F# and Machine Learning – A Winning Combination

Uncategorized

F# and Machine Learning – A Winning Combination
#Mathias Brandewinder

A fellow accidental developer with an operations research background!

Data Scientist – person who is better at statistics than any software engineer and better at software engineering than any statistician. (from a tweet)

.Net underrepresented in data science/machine learning because it is perceived (possibly correctly) as not being a very good fit. But F# can be, so we are going to talk about it.

Machine Learning
* Writing a program to perform a task
* More data, better performance
* Not explicitly programmed for this [no code change]

i.e. the program gets better all by itself

Classification and Regression
Classification – using data to classify items (spam v. ham for example)
Regression – predicting a number (price given some attributes for example)
Both are part of the concept Supervised Learning – you know what question you are trying to answer and you use data to fit a predictive model

Support Vector Machine example – classic algorithm – we need to look it up
Using the Accord.Net library

F# has a REPL, which makes it easier to iteratively tweak algorithm-type problems. Load up the data set and then keep running different variations of the application against it without reloading. When working with lots of data this can be a big time saver.

Goal is to build a model that will predict new data. Need to build it with “training data”. Take your data set and split it in half. Use half to train the algorithm. Don’t expect your model to be perfect.

Algebra
Math.Net allows you to do algebra in .Net
let A = matrix [ [ 1.; 2.; 3. ]
                 [ 4.; 5.; 6. ]
                 [ 7.; 8.; 9. ] ]
Typical in Python, not possible in C#. F# makes the matrix obvious.
Linear regression problems can be solved this way
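For example (a minimal sketch assuming the MathNet.Numerics.FSharp package, with made-up numbers), solving M * x = b directly:

#r "nuget: MathNet.Numerics.FSharp"
open MathNet.Numerics.LinearAlgebra

let M = matrix [ [ 2.0; 1.0 ]
                 [ 1.0; 3.0 ] ]
let b = vector [ 3.0; 5.0 ]

let x = M.Solve(b)   // the x that satisfies M * x = b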

F# has a timer built in to the REPL, so you can find out how long your functions take to run -> #time;;
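For example, in F# Interactive:

#time;;                              // toggle the timer on
List.sum [ 1L .. 10_000_000L ];;     // FSI prints elapsed real/CPU time and GC counts
#time;;                              // toggle it back off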

Gamers care about algebra – graphics rendering uses vectors
GPUs are optimized for algebra processing
You can use a different LinearAlgebraProvider that uses the GPU for your work, which runs MUCH faster.
Essentially compiles your F# code to native GPU code. F# is good at cross-compiling – there is of course an F# to JavaScript compiler as well

Interactive experience with a REPL is a huge advantage, and .Net does actually have some decent machine learning libraries available

Unsupervised Machine Learning
You usually have to write your own libraries because a suitable one probably doesn’t exist. As you learn your domain you may need a custom model

Most ML algos
* Read Data
* Transform into Features
* Learn a Model from the Features
* Evaluate Model Quality

Maps to functional
* Read -> Read
* Transform -> Map
* Learn -> Recursion
* Evaluate -> Fold
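A sketch of that shape in F# (names are mine; readRows, toFeatures, improve, and scoreOne are hypothetical stand-ins):

// read -> map -> recurse until the model stops changing -> fold to score it
let pipeline readRows toFeatures improve scoreOne initialModel source =
    let features = readRows source |> List.map toFeatures               // Transform -> Map

    let rec learn model =                                               // Learn -> Recursion
        let next = improve model features
        if next = model then model else learn next

    let model = learn initialModel
    let quality =
        features |> List.fold (fun acc f -> acc + scoreOne model f) 0.0 // Evaluate -> Fold
    model, quality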

Unsupervised example – Tell me something about my data
Example – Clustering – find groups of “similar” entities
K-Means
Create centroids for the number of expected groups
Move them closer to group averages
Keep going until there is no more change

Implemented the K Means algo in about 20 lines of F# code (Code is on his github repo)
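His code is on github; here is my own rough sketch of the same idea (not his implementation), in roughly that many lines:

// Assign every point to its nearest centroid, move each centroid to the mean
// of its points, and repeat until the centroids stop moving.
let distance (a: float[]) (b: float[]) =
    Array.map2 (fun x y -> (x - y) * (x - y)) a b |> Array.sum

let nearest centroids point = centroids |> Array.minBy (distance point)

let mean (points: float[][]) =
    Array.init points.[0].Length (fun d -> points |> Array.averageBy (fun p -> p.[d]))

let rec kMeans iterations (data: float[][]) (centroids: float[][]) =
    let updated =
        centroids
        |> Array.map (fun c ->
            let assigned = data |> Array.filter (fun p -> nearest centroids p = c)
            if Array.isEmpty assigned then c else mean assigned)
    if updated = centroids || iterations = 0 then updated
    else kMeans (iterations - 1) data updated

// Two obvious clusters, seeded with the first two points as starting centroids.
let data = [| [| 1.0; 1.0 |]; [| 1.2; 0.8 |]; [| 8.0; 9.0 |]; [| 9.0; 8.5 |] |]
let clusters = kMeans 100 data data.[0..1]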

Type Providers
“No data, no learning” – you need to get data into your system or the algorithm is of no use (80-90% of the job is getting and cleaning data)
“Machine learning is a data janitorial job”
Dynamic languages are nice but you don’t find out about data issues until runtime
Static prevents runtime errors but requires “heavy artillery” (ORMs, etc.)
Type Providers are a compromise between the two
Type providers for csv, JSON, SQL, etc., plus custom ones for special data sets
There is a WorldBankData provider that has various country and economic statistics, which could be useful. It essentially wraps a public API and makes calls over the Internet to obtain the data.
Type Providers can support queries (like SQL/LINQ) as well
There are even type providers for languages (R, for example)
Allows you to use functions from R in F# (even gives you a bit of intellisense-like functionality to discover what’s available)
You can use F# for what it is good at (type providers) and R for what it is good at
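For example, a minimal WorldBankData sketch (the indicator name is just illustrative, and the data is fetched over the Internet when the loop runs):

#r "nuget: FSharp.Data"
open FSharp.Data

let wb = WorldBankData.GetDataContext()

// Countries and indicators show up as typed members you can discover with intellisense.
for (year, value) in wb.Countries.France.Indicators.``GDP (current US$)`` do
    printfn "%d: %f" year value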