Codemash Friday 2:45p – Getting Started with Machine Learning on Azure


Speaker: Seth Juarez

ML on Azure is easy…
…if you understand a few things about machine learning first

1. Data Science
2. Prediction
3. Process
4. nuML
5. AzureML
6. Models (if time permits)

Data Science
key word – science – try something, it might work, repeat with a different trial, etc.
Science is inexact
Guess, test, repeat

Machine Learning
finding (and exploiting) patterns in data
replacing “human writing code” with “human supplying data”
the trick of ML is generalization

Supervised v. Unsupervised Learning
Supervised – prediction – we know what we are looking for
Unsupervised – lots of data, try to figure things out from the data (clusters, trends, etc.)

Kinect finding your body is Supervised (knows what it is looking for)
Netflix figuring out recommendations is Unsupervised

What kinds of decisions?
binary – yes/no, male/female
multi-class – one of several categories (e.g., letter grades)
regression – number between 0 and 100, real value

multi-class can be done using binary (A v. everything else, B v. everything else, etc. – then take the best-scoring one as the answer)
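A sketch of that one-vs-rest idea in plain Python (hypothetical scorers standing in for trained binary models – this is not nuML or AzureML code):

```python
# One-vs-rest: reduce a multi-class decision to several binary ones.
# Each "scorer" is a stand-in for a trained "this class v. everything
# else" binary classifier that returns a confidence score.

def one_vs_rest_predict(scorers, x):
    """scorers maps class label -> function returning a confidence score.
    Score x against every binary classifier, pick the most confident class."""
    return max(scorers, key=lambda label: scorers[label](x))

# Toy example: grade classes keyed off a numeric score.
scorers = {
    "A": lambda x: 1.0 if x >= 90 else 0.0,
    "B": lambda x: 1.0 if 80 <= x < 90 else 0.0,
    "C": lambda x: 1.0 if x < 80 else 0.0,
}
print(one_vs_rest_predict(scorers, 85))  # → B
```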

The Process
1. Data
2. Clean and Transform the Data
3. Create a Model
4. Use the Model to predict things
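The four steps above, end to end, as a minimal pure-Python illustration (the "model" here is a hypothetical midpoint threshold; real tools like nuML or AzureML do this for you):

```python
# The 4-step process on a toy problem: predict pass/fail from hours studied.

# 1. Data (raw, messy strings)
raw = [("2", "fail"), ("8", "pass"), (" 5 ", "fail"), ("9", "pass")]

# 2. Clean and transform: parse the feature, encode the label as 0/1
data = [(float(h), 1 if label.strip() == "pass" else 0) for h, label in raw]

# 3. Create a model: a single threshold halfway between the mean
#    feature value of each class (a deliberately simple stand-in)
pass_mean = sum(h for h, y in data if y == 1) / sum(1 for _, y in data if y == 1)
fail_mean = sum(h for h, y in data if y == 0) / sum(1 for _, y in data if y == 0)
threshold = (pass_mean + fail_mean) / 2

# 4. Use the model to predict things
def predict(hours):
    return "pass" if hours >= threshold else "fail"

print(predict(7))  # → pass (7 is above the learned threshold of 6.0)
```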

“features” is the term used to describe the attributes that could influence your decision
Data cleaning and transformation is just shaving yaks (it takes a lot of time)

nuML
A .NET ML library
Comes with a REPL for C#
Attributes to mark things [Feature], [Label], etc.
gets data into matrix
(Mahout does this stuff in Java)
turns it into a model
has visualizers for the model (graphical and text based)
Can use the model to create a prediction given a set of inputs (features)

How do you know it worked?
Train and Test – use some portion of your data to train the model, and then the rest to test (Seth is suggesting 80-20)
nuML has facilities to do this
It will take your data, create a model, score the model, and repeat 100 times
Then it returns the best model
You have to be careful of overfitting – if you create too fine-grained a model, you can fit it to your training data so tightly that it gives bad predictions on new data.
Training v. Testing means – if you have 10 rows of data, you will train on 8 rows and then test with 2
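A sketch of that 80-20 train/test split (a generic illustration, not nuML's built-in facility):

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle and split rows; e.g. 10 rows -> 8 to train, 2 to test."""
    rows = rows[:]                      # don't mutate the caller's list
    random.Random(seed).shuffle(rows)   # seeded so the split is repeatable
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

rows = list(range(10))
train, test = train_test_split(rows)
print(len(train), len(test))  # → 8 2
```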

Limitation – limited amount of resources on your machine (CPU, RAM, Disk)

AzureML
Drag-and-drop actions to set up the path through the steps above
Will give you some charts of your data so you can start to get some insight into it
Has a bunch of transformation options, also draggy
If you don’t know the 4 step process, the “wizard” can be tricky, but if you know it, it’s fairly straightforward.
You just drag and drop your flow through the steps – you can have multiple transformations (shaving the yak)
You can define the training/testing ratio by specifying the percentage to use for training
Define scoring for the trained model
Evaluate the scored models
You need to define what you are learning so that it can train the model
precision – tp / (tp + fp) (true positives over all predicted positives)
accuracy – (tp + tn) / the whole set (tp + tn + fp + fn)
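Those two metrics computed from hypothetical confusion-matrix counts:

```python
# Precision and accuracy from confusion-matrix counts, matching the
# definitions above. The counts here are made up for illustration.
tp, tn, fp, fn = 40, 45, 5, 10

precision = tp / (tp + fp)                   # 40 / 45
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 85 / 100

print(round(precision, 3))  # → 0.889
print(accuracy)             # → 0.85
```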

If you are getting 95% accuracy, you probably have something wrong – that is usually too accurate (overfitting). Ideal is in the 80-90% range.

You can run multiple scoring methods – you will get a chart comparing them

You can publish as a web service to use your model for predictions

Linear Classifiers – need to define a boundary between sets
Perceptron algorithm – draws a line between two sets
Kernel Perceptron – takes data that can’t be separated by a line, lifts it into a higher dimension (e.g., 3D), and draws a plane between the sets
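A sketch of the perceptron learning rule on a small linearly separable set (a generic illustration of the algorithm, not nuML's implementation; learning rate and epoch count are arbitrary):

```python
# Perceptron: iteratively nudge a weight vector until the line
# w1*x + w2*y + b = 0 separates the two labeled sets.
def train_perceptron(points, labels, epochs=20, lr=0.1):
    """points: list of (x, y); labels: +1 or -1. Returns (w1, w2, bias)."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (x, y), label in zip(points, labels):
            # Misclassified when the signed distance disagrees with the label
            if label * (w1 * x + w2 * y + b) <= 0:
                w1 += lr * label * x
                w2 += lr * label * y
                b += lr * label
    return w1, w2, b

points = [(2, 1), (3, 2), (-1, -2), (-2, -1)]
labels = [1, 1, -1, -1]
w1, w2, b = train_perceptron(points, labels)
preds = [1 if w1 * x + w2 * y + b > 0 else -1 for x, y in points]
print(preds)  # → [1, 1, -1, -1], matching the labels
```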

Lots of math formulas – review the slides

“Kernel Trick” – lets you find a separator between two sets no matter what space they are in, by lifting the data into more dimensions
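A concrete illustration of that idea (the explicit feature map here is a stand-in – a real kernel method computes inner products in the lifted space without materializing it):

```python
# XOR-style data: no straight line separates the classes in 2D, but a
# plane does after mapping each point (x, y) into 3D as (x, y, x*y).
def lift(x, y):
    """Add the product x*y as a third feature."""
    return (x, y, x * y)

pos = [(1, 1), (-1, -1)]   # class +1: x*y = +1 for both points
neg = [(1, -1), (-1, 1)]   # class -1: x*y = -1 for both points

# In 3D the plane z = 0 separates the classes perfectly.
print([lift(*p)[2] > 0 for p in pos])  # → [True, True]
print([lift(*p)[2] > 0 for p in neg])  # → [False, False]
```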

If you cannot get a reasonable answer – neural networks
A network of perceptrons
Now we’re into calculus