Codemash Friday 2:45p – Getting Started with Machine Learning on Azure


Getting Started with Machine Learning on Azure
#Seth Juarez

ML on Azure is easy…
…if you understand a few things about machine learning first

1. Data Science
2. Prediction
3. Process
4. nuML
5. AzureML
6. Models (if time permits)

Data Science
key word – science – try something, it might work, repeat with a different trial, etc.
Science is inexact
Guess, test, repeat

Machine Learning
finding (and exploiting) patterns in data
replacing “human writing code” with “human supplying data”
the trick of ML is generalization

Supervised v. Unsupervised Learning
Supervised – prediction – we know what we are looking for
Unsupervised – lots of data, try and figure out things from the data (clusters, trends, etc.)

Kinect finding your body is Supervised (knows what it is looking for)
Netflix figuring out recommendations is Unsupervised

What kinds of decisions?
binary – yes/no, male/female
multi-class – grades, classes
regression – number between 0 and 100, real value

multi-class can be done using binary (A v. everything else, B v. everything else, etc. – then take the best-scoring one as the answer)
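The one-vs-rest idea can be sketched in plain Python. The feature sets and "scorers" below are hypothetical stand-ins for trained binary classifiers, not anything from the talk:

```python
# One-vs-rest multi-class decision built from binary classifiers.
# Each binary "model" returns a score for its class; the class whose
# model scores highest wins.

target_features = {
    "A": {"high_gpa", "attends"},   # hypothetical features per class
    "B": {"attends"},
    "C": set(),
}

def make_scorer(target):
    # Stand-in for a trained binary classifier "target vs. everything else".
    # Here it just counts matching features; a real model would be learned.
    def score(features):
        return sum(1 for f in features if f in target_features[target])
    return score

scorers = {label: make_scorer(label) for label in target_features}

def classify(features):
    # Take the class of the best-scoring binary classifier as the answer
    return max(scorers, key=lambda label: scorers[label](features))

print(classify({"high_gpa", "attends"}))  # "A" scores highest here
```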

1. Data
2. Clean and Transform the Data
3. Create a Model
4. Use the Model to predict things

“features” is the term used to describe the attributes that could influence your decision
Data cleaning and transformation is just shaving yaks (it takes a lot of time)
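The four steps can be sketched end-to-end in Python. The data and the "model" (a simple threshold on the mean) are made up for illustration:

```python
# Hypothetical end-to-end sketch of the 4-step process: raw data in,
# clean/transform it into numeric features, "train" a trivially simple
# model (a threshold at the mean), then use it to predict.

raw = ["3.0", " 7.5", "bad", "9.1", "2.2"]               # 1. Data

cleaned = [float(s) for s in (r.strip() for r in raw)
           if s.replace(".", "", 1).isdigit()]           # 2. Clean and transform

threshold = sum(cleaned) / len(cleaned)                  # 3. Create a model

def predict(x):                                          # 4. Use the model
    return "high" if x > threshold else "low"

print(predict(8.0))  # "high"
```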

nuML
A .Net ML library
Comes with a REPL for C#
Attributes to mark things [Feature], [Label], etc.
gets data into matrix
(Mahout does this stuff in Java)
turns it into a model
has visualizers for the model (graphical and text based)
Can use the model to create a prediction given a set of inputs (features)

How do you know it worked?
Train and Test – use some portion of your data to train the model, and then the rest to test (Seth is suggesting 80-20)
nuML has facilities to do this
It will take your data, create a model, score the model, and repeat 100 times
Then it returns the best model
You have to be careful of overfitting – if you create too fine-grained a model, you might overfit it to your training data and get bad predictions.
Training v. Testing means – if you have 10 rows of data, you will train on 8 rows and then test with 2
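The 80/20 train/test split is easy to sketch in Python (the shuffle seed is an arbitrary choice for repeatability):

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    # Shuffle a copy so the split isn't biased by row order,
    # then cut at the requested fraction (80/20 as suggested in the talk).
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

data = list(range(10))        # 10 rows of data
train, test = train_test_split(data)
print(len(train), len(test))  # 8 2
```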

Limitation – limited amount of resources on your machine (CPU, RAM, Disk)

AzureML
Drag-and-drop interface to set up the path through the steps above
Will give you some charts of your data so you can start to get some insight into it
Has a bunch of transformation options, also draggy
If you don’t know the 4 step process, the “wizard” can be tricky, but if you know it, it’s fairly straightforward.
You just drag and drop your flow through the steps – you can have multiple transformations (shaving the yak)
You can define the training/testing ratio by specifying the percentage to use for training
Define scoring for the trained model
Evaluate the scored models
You need to define what you are learning so that it can train the model
precision – TP / (TP + FP) (true positives over true positives plus false positives)
accuracy – (TP + TN) / the whole set (TP + TN + FP + FN)
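The two metrics in Python, with made-up counts for illustration:

```python
def precision(tp, fp):
    # Of everything we predicted positive, how much really was positive?
    return tp / (tp + fp)

def accuracy(tp, tn, fp, fn):
    # Of all predictions, how many were correct?
    return (tp + tn) / (tp + tn + fp + fn)

# e.g. 40 true positives, 45 true negatives, 10 false positives, 5 false negatives
print(precision(40, 10))        # 0.8
print(accuracy(40, 45, 10, 5))  # 0.85
```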

If you are getting 95% accuracy, you probably have something wrong – that is usually too accurate (overfitting). Ideal is in the 80-90% range.

You can run multiple scoring methods – you will get a chart comparing them

You can publish as a web service to use your model for predictions

Linear Classifiers – need to define a boundary between sets
Perceptron algorithm – draws a line between two sets
Kernel Perceptron – takes data that can’t have a line drawn between it, 3D-ifies it, draws a plane between the sets
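A minimal perceptron sketch in Python, with toy points and an assumed learning rate of 1 (details not from the talk):

```python
# Learn a line (w.x + b = 0) separating two linearly separable sets,
# with labels +1 / -1.

def perceptron(points, labels, epochs=20):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified
                w[0] += y * x1                        # nudge the line toward the point
                w[1] += y * x2
                b += y
    return w, b

def predict(w, b, point):
    return 1 if w[0] * point[0] + w[1] * point[1] + b > 0 else -1

pts = [(2, 3), (3, 3), (-2, -1), (-3, -2)]
lbls = [1, 1, -1, -1]
w, b = perceptron(pts, lbls)
print([predict(w, b, p) for p in pts])  # [1, 1, -1, -1]
```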

Lots of math formulas – review the slides

“Kernel Trick” – will allow you to determine a separator between two sets, no matter the space, by going into multiple dimensions with the data
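The idea behind the trick can be shown with a toy example (the explicit lift below is the intuition; a real kernel avoids computing it):

```python
# Data that isn't linearly separable in 1-D (negatives sit between the
# positives) becomes separable after mapping each x to (x, x*x);
# a horizontal line at height 2 now splits the classes.

def lift(x):
    return (x, x * x)

xs     = [-3, -1, 0, 1, 3]
labels = [ 1, -1, -1, -1, 1]   # positives on the outside, negatives inside

lifted = [lift(x) for x in xs]
predictions = [1 if x2 > 2 else -1 for (_, x2) in lifted]
print(predictions == labels)   # True
```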

If you cannot get a reasonable answer – neural networks
A network of perceptrons
Now we’re in to calculus


Codemash Friday 11:00a – Consuming Data with F# Type Providers


Consuming Data with F# Type Providers
#Rachel Reese

What are type providers?
A mechanism to provide types to the compiler – a compiler plugin.
WSDL, CSV, JSON, Other languages (R, Python, Powershell), etc.
As long as it has a schema, you can write a type provider for it

Simplify consuming data by encapsulating it in a type provider
Facilitates using data from the REPL
Allows you to play with the data in the REPL to figure things out about it

Showing us how to use existing type providers

Codemash Friday 9:45a – F# and Machine Learning – A Winning Combination


F# and Machine Learning – A Winning Combination
#Mathias Brandewinder

A fellow accidental developer with an operations research background!

Data Scientist – person who is better at statistics than any software engineer and better at software engineering than any statistician. (from a tweet)

.Net underrepresented in data science/machine learning because it is perceived (possibly correctly) as not being a very good fit. But F# can be, so we are going to talk about it.

Machine Learning
* Writing a program to perform a task
* More data, better performance
* Not explicitly programmed for this [no code change]

i.e. the program gets better all by itself

Classification and Regression
Classification – using data to classify items (spam v. ham for example)
Regression – predicting a number (price given some attributes for example)
Both are part of the concept Supervised Learning – you know what question you are trying to answer and you use data to fit a predictive model

Support Vector Machine example – classic algorithm – we need to look it up
Using the Accord.Net library

F# has a REPL, which makes it easier to iteratively tweak algorithm-type problems. Load up the data set and then keep running different variations of the application against it without reloading. When working with lots of data this can be a big time saver.

Goal is to build a model that will predict new data. Need to build it with “training data”. Take your data set and split it in half. Use half to train the algorithm. Don’t expect your model to be perfect.

Math.Net allows you to do linear algebra in .Net
let A = matrix [ [ 1.; 2.; 3. ]
                 [ 4.; 5.; 6. ]
                 [ 7.; 8.; 9. ] ]
This style is typical in Python, not possible in C#. F# makes the matrix structure obvious.
Linear regression problems can be solved with this approach

F# has a timer built in to the REPL, so you can find out how long your functions take to run -> #time;;

Gamers care about algebra – graphics rendering uses vectors
GPUs are optimized for algebra processing
You can use a different LinearAlgebraProvider that uses the GPU for your work, which runs MUCH faster.
Essentially compiles your F# code to native GPU code. F# is good at cross compiling – there is of course an F# to Javascript compiler as well

Interactive experience with a REPL is a huge advantage, and .Net does actually have some decent machine learning libraries available

Unsupervised Machine Learning
You usually have to write your own libraries because a suitable one probably doesn’t exist. As you learn your domain you may need a custom model

Most ML algos
* Read Data
* Transform into Features
* Learn a Model from the Features
* Evaluate Model Quality

Maps to functional
* Read -> Read
* Transform -> Map
* Learn -> Recursion
* Evaluate -> Fold
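The Read/Map/Fold part of the mapping can be illustrated with Python's map/reduce (tiny made-up data; the recursion step for learning is omitted):

```python
from functools import reduce

rows = ["1", "2", "3", "4"]                            # Read
features = list(map(float, rows))                      # Transform -> Map
total = reduce(lambda acc, f: acc + f, features, 0.0)  # Evaluate -> Fold
print(total / len(features))                           # mean of the features
```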

Unsupervised example – Tell me something about my data
Example – Clustering – find groups of “similar” entities
Create centroids for the number of expected groups
Move them closer to group averages
Keep going until there is no more change

Implemented the K Means algo in about 20 lines of F# code (Code is on his github repo)
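The clustering loop described above (roughly what the ~20-line F# version does) can be sketched in pure Python; the points and seed below are made up:

```python
import random

# K-Means: pick centroids, assign each point to the nearest one, move
# centroids to their cluster's mean, repeat until nothing changes.

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def k_means(points, k, seed=0):
    centroids = random.Random(seed).sample(points, k)
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:   # no more movement: done
            return centroids, clusters
        centroids = new_centroids

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = k_means(pts, 2)
```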

Type Providers
“No data, no learning” – you need to get data into your system or the algorithm is of no use (80-90% of the job is getting and cleaning data)
“Machine learning is a data janitorial job”
Dynamic languages are nice but you don’t find out about data issues until runtime
Static prevents runtime errors but requires “heavy artillery” (ORMs, etc.)
Type Providers are a compromise between the two
Type providers for csv, JSON, SQL, etc. plus customs for special data sets
There is a WorldBankData provider that has various country and economic statistics, which could be useful. It essentially wraps a public API and makes calls over the Internet to obtain the data.
Type Providers can support queries (like SQL/LINQ) as well
There are even type providers for languages (R, for example)
Allows you to use functions from R in F# (even gives you a bit of intellisense-like functionality to discover what's available)
You can use F# for what it is good at (type providers) and R for what it is good at

Codemash Friday 8:30a – Gitting More Out of Git


Gitting More Out of Git
#Jordan Kasper

Disclaimers – not for noobs, all examples will be from the command line

Git is decentralized. We all know that, but do we know what it really means?

The entire repo is on your system – not just the branch you are working on
Including all of the history
So, if the “central” goes down, you can still work. In fact, other people can get it from you as well to start working. We don’t typically do that, though.

A remote is a repository outside of your current environment – technically it can even be another location on your system.
When you clone, you get all branches. It also creates a remote named “origin” that points to the location you cloned from.
Your local branch is “tracking” the remote branch. This is how git knows where to send your changes to when you do a push.
git remote -v will tell you where all of your remotes point
git push -u origin new-feature will set the “upstream” remote branch that your local branch tracks
Forking is a github term, not a git thing
A fork is actually just a case of setting a remote
git fetch <remotename> will pull down changes from a specific remote (default is usually origin)

fetch gets all of the changes
pull gets them AND merges into your code

git branch --no-merged will show you all of the branches that have not yet been merged
git diff master stuff is essentially how a pull request works
git diff master...stuff shows the differences between the branches from when the branch split from master

Git and Data Integrity
Git uses snapshots (not file diffs)
File diffs means differences are tracked by file
Snapshot is a picture of the entire repo when taken (all files) – marked with a hash of the entire repo at that time – even changing a single whitespace character will be noted (because you get a different hash)
Hash is actually 40 characters of hexadecimal. You usually only see 7 because that is enough to be distinct in most cases.

When things go Wrong
git commit --amend -m “corrected message” will allow you to correct a commit message, but note that it changes the hash, since you changed the metadata
Two commits with the same changes but different hashes will cause problems – that is why you should not change a commit after you have shared (pushed) it
reflog is local to you – it is never shared
git reflog shows your changes over time
You can even add a file to a commit: git commit --amend (after staging the file) (Again, don’t do this after you have pushed)
git checkout <filename> will throw away your unstaged changes. It will not remove a newly added file, however.

Three stages of a git repo
HEAD – committed code
Staging – ready to commit
Working Directory – your current work

Committed Changes (oops)
git reset --soft HEAD^ (moves HEAD back one commit (more ^’s means more commits))
git reset --mixed HEAD^ (moves HEAD AND staging back one commit)
git reset --hard HEAD^ (wipes out the change completely from all three stages)
you can still get your changes back if you find out you didn’t mean to use “hard”
use git reflog to find the hash for the deleted commit:
git reset HEAD@{1} will bring the change back

Once you have any changes pushed beyond your local repo you should consider it carved in stone. You should not make these kinds of changes to pushed commits.

Head Notation
HEAD^ (back one from current HEAD)
HEAD^^ (two places)
HEAD~n (back n places)
HEAD@{i} (back to reflog index ‘i’)

git stash
You made some changes, but you aren’t ready to commit. Now you need to do work in a different branch. Changing branches will bring along uncommitted changes. Stashing puts your changes out of your way
You actually have to stage first (most people don’t know that)
git stash will “commit” all staged (and maybe unstaged) changes to a temporary local commit and restore your working environment to the last commit
you can stash multiple times
git stash list shows your current stashes
BUT – it doesn’t have any comment
git stash save “message” lets you add a comment
git stash apply brings it back, but doesn’t get rid of it
git stash drop will remove a stash without applying it
git stash pop brings out your stash and removes it from your list of stashes

Log Options
git log --oneline shows you one line per commit
git log --oneline --graph shows you a command-line version of the pretty branch chart most GUI tools have
git log has lots of options to only show you what you want

git blame filename will show you the last change for each line (hash, author, line)

Playing Nice with Others
git checkout master
git merge feature
Fast forward if no divergent changes
Fast forward does not put the fact that there was a merge in the history
--no-ff forces a merge commit so you can see a merge happened
If there are divergent changes, you get a merge commit (which has 2 parents – only kind of commit with multiple parents)

Merge conflicts – changes git can’t sort out (both commits change the same line for example)
Fix the problem file
Stage the file
Commit (which will be the merge commit)

Rebasing
Rewrites history – essentially changes the commit you branched from. Doing this on already-pushed changes will cause problems, same as above.
You can still get conflicts in a rebase, handled the same way, except you tell rebase you want to continue (problems pause a rebase)
git rebase --abort allows you to stop a rebase if there is an issue

Cherry Picking
Allows you to bring in changes from a different branch that has some work you want to save. After you cherry pick, you should delete the branch you cherry picked from.