Codemash Friday 9:45a – F# and Machine Learning – A Winning Combination


F# and Machine Learning – A Winning Combination
Mathias Brandewinder

A fellow accidental developer with an operations research background!

Data Scientist – person who is better at statistics than any software engineer and better at software engineering than any statistician. (from a tweet)

.NET is underrepresented in data science/machine learning because it is perceived (possibly correctly) as a poor fit. F# can be a good fit, though, so that is what this talk is about.

Machine Learning
* Writing a program to perform a task
* More data, better performance
* Not explicitly programmed for this [no code change]

i.e. the program gets better all by itself

Classification and Regression
Classification – using data to classify items (spam v. ham for example)
Regression – predicting a number (price given some attributes for example)
Both are forms of supervised learning – you know what question you are trying to answer and you use data to fit a predictive model

Support Vector Machine example – classic algorithm – we need to look it up
Using the Accord.NET library

F# has a REPL, which makes it easier to iteratively tweak algorithm-type problems. Load the data set once, then keep running different variations of the code against it without reloading. When working with lots of data this can be a big time saver.

The goal is to build a model that will predict new data. You need to build it with “training data”. Take your data set and split it in half. Use one half to train the algorithm and keep the other half to check how well the model predicts data it has not seen. Don’t expect your model to be perfect.
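
The split described above can be sketched in a few lines of F#. This is only an illustration, not code from the talk: the helper name `splitInHalf`, the fixed seed, and the toy data are all assumptions made for the example.

```fsharp
// Hypothetical helper (not from the talk): shuffle the data, then cut it in half.
let splitInHalf (seed: int) (data: 'a[]) =
    let rng = System.Random(seed)
    // Sorting by a random key is a simple way to shuffle; good enough for a sketch.
    let shuffled = data |> Array.sortBy (fun _ -> rng.Next())
    // First half is used for training, second half is held back for checking the model.
    Array.splitAt (shuffled.Length / 2) shuffled

// Toy data set standing in for real observations.
let training, validation = splitInHalf 42 [| 1 .. 10 |]
printfn "training:   %A" training
printfn "validation: %A" validation
```

Shuffling before splitting matters when the original data is ordered (by date or by label, for example); otherwise the two halves may not look alike.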

Math.NET allows you to do linear algebra in .NET
open MathNet.Numerics.LinearAlgebra
let A = matrix [ [ 1.; 2.; 3. ]
                 [ 4.; 5.; 6. ]
                 [ 7.; 8.; 9. ] ]
This kind of matrix notation is typical in Python but has no direct equivalent in C#. F# makes the matrix obvious.
Linear regression problems can be solved with this kind of linear algebra
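
To make the regression idea concrete, here is a minimal sketch of ordinary least squares for a single input variable. It deliberately avoids Math.NET so that it runs with nothing but the F# core library; the function name `fitLine` and the data are made up for the example, and a real multi-variable problem would use matrices as in the notes above.

```fsharp
// Ordinary least squares for the line y = intercept + slope * x.
// Closed-form solution for one input variable; not the speaker's code.
let fitLine (xs: float[]) (ys: float[]) =
    let meanX = Array.average xs
    let meanY = Array.average ys
    // Sum of (x - meanX) * (y - meanY) over all observations.
    let covXY =
        Array.fold2 (fun acc x y -> acc + (x - meanX) * (y - meanY)) 0.0 xs ys
    // Sum of squared deviations of x from its mean.
    let varX = xs |> Array.sumBy (fun x -> (x - meanX) * (x - meanX))
    let slope = covXY / varX
    let intercept = meanY - slope * meanX
    intercept, slope

// Made-up data lying exactly on y = 1 + 2x.
let intercept, slope = fitLine [| 1.0; 2.0; 3.0; 4.0 |] [| 3.0; 5.0; 7.0; 9.0 |]
printfn "y = %.2f + %.2f * x" intercept slope
```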

F# has a timer built into the REPL, so you can find out how long your functions take to run -> #time;;

Gamers care about algebra – graphics rendering uses vectors
GPUs are optimized for algebra processing
You can use a different LinearAlgebraProvider that uses the GPU for your work, which runs MUCH faster.
Essentially compiles your F# code to native GPU code. F# is good at cross-compiling – there is of course an F# to JavaScript compiler as well

Interactive experience with a REPL is a huge advantage, and .Net does actually have some decent machine learning libraries available

Unsupervised Machine Learning
You usually have to write your own libraries because a suitable one probably doesn’t exist. As you learn your domain you may need a custom model

Most ML algos
* Read Data
* Transform into Features
* Learn a Model from the Features
* Evaluate Model Quality

Maps to functional programming
* Read -> Read
* Transform -> Map
* Learn -> Recursion
* Evaluate -> Fold
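
The mapping above can be illustrated with a tiny end-to-end sketch. Everything here is an assumption made for the example rather than something shown in the talk: the inlined data, the one-parameter model `y = w * x`, the learning rate, and the number of steps.

```fsharp
// Read: raw observations, inlined here instead of being read from a file (made-up data).
let raw = [ "1,2.1"; "2,3.9"; "3,6.2"; "4,7.8" ]

// Transform -> Map: parse each line into a (feature, target) pair.
let examples =
    raw
    |> List.map (fun line ->
        let parts = line.Split(',')
        float parts.[0], float parts.[1])

// Learn -> Recursion: gradient descent on the single weight w of the model y = w * x.
let rec learn (w: float) (stepsLeft: int) =
    if stepsLeft = 0 then w
    else
        // Gradient of the mean squared error with respect to w.
        let gradient =
            examples |> List.averageBy (fun (x, y) -> 2.0 * (w * x - y) * x)
        learn (w - 0.01 * gradient) (stepsLeft - 1)

let w = learn 0.0 1000

// Evaluate -> Fold: accumulate the squared error over all examples.
let squaredError =
    examples |> List.fold (fun acc (x, y) -> acc + (w * x - y) * (w * x - y)) 0.0

printfn "w = %.3f, squared error = %.3f" w squaredError
```

Each stage is an ordinary function over immutable data, which is what makes it easy to rerun a single stage from the REPL while tweaking it.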

Unsupervised example – Tell me something about my data
Example – Clustering – find groups of “similar” entities
Create centroids for the number of expected groups
Move them closer to group averages
Keep going until there is no more change

Implemented the k-means algorithm in about 20 lines of F# code (code is on his GitHub repo)
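
For readers who want to see the three steps above in code, here is a minimal sketch of k-means written for these notes. It is not the speaker's implementation (that one is in his repository); the helper names `distance`, `average`, and `closest`, the sample points, and the choice of starting centroids are all assumptions.

```fsharp
// Euclidean distance between two points represented as float arrays.
let distance (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> acc + (x - y) * (x - y)) 0.0 a b |> sqrt

// Average of a non-empty group of points, coordinate by coordinate.
let average (points: float[][]) =
    Array.init points.[0].Length (fun i -> points |> Array.averageBy (fun p -> p.[i]))

// Index of the centroid closest to a point.
let closest (centroids: float[][]) (point: float[]) =
    centroids
    |> Array.mapi (fun i c -> i, distance c point)
    |> Array.minBy snd
    |> fst

// Move the centroids to their group averages until nothing changes any more.
let rec kMeans (centroids: float[][]) (points: float[][]) =
    let assignment = points |> Array.map (closest centroids)
    let updated =
        centroids
        |> Array.mapi (fun i c ->
            let group =
                Array.zip assignment points
                |> Array.filter (fun (a, _) -> a = i)
                |> Array.map snd
            // A centroid with no points assigned to it stays where it is.
            if Array.isEmpty group then c else average group)
    if updated = centroids then centroids, assignment
    else kMeans updated points

// Made-up points forming two obvious groups.
let points =
    [| [| 1.0; 1.0 |]; [| 1.5; 2.0 |]; [| 1.0; 1.5 |];
       [| 8.0; 8.0 |]; [| 9.0; 8.5 |]; [| 8.5; 9.0 |] |]

// Seed with two of the points; real code would usually pick starting centroids at random.
let centroids, assignment = kMeans [| points.[0]; points.[3] |] points
printfn "centroids:  %A" centroids
printfn "assignment: %A" assignment
```

The recursion stops when an update leaves every centroid unchanged, which is the "no more change" condition from the notes.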

Type Providers
“No data, no learning” – you need to get data into your system or the algorithm is of no use (80-90% of the job is getting and cleaning data)
“Machine learning is a data janitorial job”
Dynamic languages are nice, but you don’t find out about data issues until runtime
Static languages prevent runtime errors but require “heavy artillery” (ORMs, etc.)
Type Providers are a compromise between the two
There are type providers for CSV, JSON, SQL, etc., plus custom providers for special data sets
There is a WorldBankData provider that has various country and economic statistics, which could be useful. It essentially wraps a public API and makes calls over the Internet to obtain the data.
Type Providers can support queries (like SQL/LINQ) as well
There are even type providers for languages (R, for example)
This allows you to use functions from R in F# (and even gives you a bit of IntelliSense-like functionality to discover what is available)
You can use F# for what it is good at (type providers) and R for what it is good at