We are here for
the afternoon session of the Tiny Machine Learning Workshop.
Our first speaker is Manik Varma from Microsoft Research.
He'll talk about The Edge of Machine Learning.
>> Cheers, thank you Venkatesh.
Right, so I'm Manik Varma from Microsoft Research India.
And I'll be talking about the edge of machine learning.
Where I'll be focusing on developing
machine learning models that can be trained in the cloud.
But can then make predictions locally on tiny IoT edge and
endpoint devices.
Which might have as little as two kilobytes of RAM.
So the models and algorithms that I will be presenting today
have been developed by Prateek, Praneeth, Naga and myself.
Though in reality all the work was done by our excellent
undergraduate research fellows Ashish, Chirag, Yeshwanth and
Bhargavi.
We also have a bunch of computer architecture systems and
programming languages researchers on our team.
Including Harsha, Vivek, Rahul and Raghavendra.
And they have been helping us with very
efficient implementations
of our algorithms on these tiny IoT devices.
Along with Harsha, our developer Suresh has been
writing production-level code for an open source release.
So you can download our algorithms and
play with our code as part of the edge machine learning
library that we have released on GitHub today.
And you can hopefully also find algorithms soon
in the Microsoft embedded learning library.
Which is a specialized machine learning compiler that we're
creating that can take these machine learning algorithms in
our library.
And compile them onto these heterogeneous IoT devices.
So that's also released in GitHub and hopefully you'll be
able to find our algorithms soon over there, okay?
So before I get started, I thought that some of you might
not be familiar with the Internet of Things.
So let me start by giving some context and
introduce these IoT devices that I've been talking about.
And just specify the hardware that we will be targeting.
So here is an ARM Cortex M0.
It has only 2 KB RAM, 32 KB of read only Flash,
and no support in hardware for floating point operations.
And because of this, you can see it's really, really tiny.
In fact, it's smaller than a grain of rice.
So that's a golf ball in the back, just for
a size comparison.
Here's an Arduino Uno board.
It's built around an 8 bit ATmega328P microcontroller
that's running at 16 MHz.
And it also has just 2 KB of RAM, 32 KB of read only Flash.
And again, no support in hardware for
floating point operations.
So the way that these IoT devices work is that once you've
trained your machine learning model in the cloud.
You take the trained model, the prediction code,
the feature extraction code and any associated data and
parameters that you might have.
And then you burn them onto the flash of the microcontroller
along with the boot loader, the device driver, the libraries and
any other application code that you'll have.
And the you deploy the device in the field,
at which point of time the flash becomes read only and the only
writable memory that you'll have access to is the 2 KB of RAM.
Now, billions of these devices have already been
deployed in the world today.
And there's an Internet of Things wave that is all set to
revolutionize our society.
However, it's still early days.
And so nobody's really quite sure of which application will
take off and what will be the next big thing.
So people are trying out lots of different applications in
connected cars, industrial IoT, predictive maintenance.
Fitbits, fitness wearables, smart cities, smart housing,
smart appliances, and so on.
But the one thing that is common to all these applications
is the assumption that the IoT device is dumb.
It just senses its environment and
then transmits the sensor readings to the Cloud.
Where all the machine learning algorithms reside and
where all the decision making happens.
However, we feel that there are a number of critical scenarios
where it is not possible to transmit the data to the Cloud.
And where the device itself needs to be made intelligent in
order for decision making to happen locally.
So these scenarios typically arise due to concerns around
latency, bandwidth, privacy and security and energy.
And range from the low latency brain implants for
predicting seizures, so that patients can call for
help as quickly as possible.
So for example, if they're driving a car they could pull
over, or if they're standing up they could sit down, etc.
To precision agriculture on disconnected farms,
to privacy preserving smart glasses.
And to the incredibly energy efficient, but really hilarious
smart fork that starts shouting at you if you eat too much or
you eat too quickly.
>> [LAUGH] >> So
a bunch of us at MSR Redmond and
MSR India have come together to address some of these concerns.
And our objective is to build a library of machine learning
algorithms that can run on these tiny IoT devices.
So let me be absolutely clear about this scenario.
Our models are going to be trained in the Cloud.
Or on your laptop, where we assume there are infinite
resources available for training.
But then once the model has been trained, then it has to be
squeezed onto the flash of this tiny microcontroller.
Where it's then expected to make predictions in milliseconds,
fit in a few kilobytes and
ensure that batteries last forever and ever.
So the way I have structured this talk is that I'll start by
discussing or presenting Bonsai.
Which is a completely new tree model which we've designed for
these low memory regimes.
And then after that I'll briefly present ProtoNN,
which is a new compressed k-nearest neighbor classifier that was
also presented at ICML a couple of days ago.
And then finally, I'll conclude by presenting results,
comparing Bonsai and ProtoNN to the state of the art,
and showing what happens when we deploy them on an Arduino Uno.
Okay, so let's start with trees.
Now trees are very beautiful, but
they have a number of other advantages which make
them an essential component of any machine learning library.
So they are general and can be used for classification,
regression, ranking, anomaly detection, etc.
And since we expect to tackle a diverse range of machine
learning applications in the IoT space.
It would be good to have them as part of our toolkit,
the edge machine learning library, as well.
Even more importantly, balanced trees have the great property
that their predictions can be made in time that is logarithmic
in the total number of training points.
Which is a rough measure of the complexity of the learning task.
Trees are therefore ubiquitous in many time-critical applications,
including Bing, the Kinect, the HoloLens, etc.
And this also makes them ideally suited to our IoT scenario
because they can make predictions very efficiently.
Both in terms of time and in terms of energy.
However, we can't take a standard tree algorithm,
such as a decision tree, and simply hope
to deploy it out of the box on an ATmega328P, right?
And that's because,
if we have only a few kilobytes of flash memory available,
we can only hope to learn a shallow decision tree.
Which might not be very accurate in making predictions.
As you can see in this binary classification task over here,
where the objective
is to separate the red points from the blue.
So the problem is that at each internal node in a decision
tree, we learn a horizontal or
vertical cut to partition the data.
And we keep repeating this procedure until we end up in
a leaf node, where we make a very weak constant prediction
based on the majority vote.
And this leads to poor accuracies because there is no
way you can separate the red points from the blue
based on the small number of axis-aligned cuts for
this particular dataset, okay?
So you might hope to address this problem by learning deeper
trees but then you'll run out of RAM.
And it might turn out to be the case that these trees
are still not accurate enough.
And the problem is even more severe,
because in the real world, in order to get high precision,
high accuracy predictions from most real world applications.
We almost invariably have to learn a large tree ensemble,
all right, which will take a huge amount of memory.
And where I believe we still haven't addressed,
or fundamentally addressed, the issue of accuracy.
Because we're still restricted to generating these piecewise
linear decision boundaries made up of these horizontal and
vertical cuts.
Standard decision tree ensembles cannot generate truly curved
decision boundaries, which is what's really required in
this case in order to get high accuracy with a compact model.
So rather than starting from a huge tree ensemble and
then pruning it very aggressively to get it to fit
within a few kilobytes of RAM and
suffer a large loss in accuracy as a result.
Ashish Kumar, Saurabh Goyal and
I decided to come up with a completely new tree model,
called Bonsai, based on the following three key ideas.
First we're going to design Bonsai to be a single, shallow,
sparse tree.
Which will make it incredibly compact, but
then we'll also make each node in Bonsai much more powerful
than a node in a standard decision tree.
In order for Bonsai to accurately learn these non-linear decision
boundaries.
Second, in order to further reduce model size.
We are going to take the training data and
project it into a low dimensional space.
And then learn all of Bonsai's parameters in that space,
in order to get Bonsai to fit within a few kilobytes.
And finally, we're going to learn this sparse projection
matrix and all the nodes in the tree jointly.
This ensures that the matrix is learned so as to
maximize Bonsai's prediction accuracy,
while the tree makes the best possible use of the memory budget.
So let me flesh out each of these ideas in a little bit
more detail.
So we'll start by replacing the horizontal and
vertical cut learned in a decision tree by oblique cuts.
So we're going to, at each internal node in Bonsai,
we're going to learn a full hyperplane with normal theta.
So that we can branch points left or
right when an input comes in with feature vector x.
We can determine whether to branch it left or
right by evaluating the sign of theta transpose x.
Now this will allow us to learn a slightly shallower tree.
So it'll reduce the depth of the learned tree somewhat.
Because now each node in Bonsai is much more powerful than
a node in a standard decision tree.
And will also allow us to generate these oblique
decision boundaries that you see over here.
Now this is a very common idea.
And you might have seen it many times in the past.
So think about decision jungles from MSR Cambridge,
as well as variants of the perceptron tree.
But unfortunately, it doesn't quite cut it in our case.
Because these learned trees are still too deep to fit within a few
kilobytes of flash.
And at the same time,
we still can't generate curved decision boundaries.
And we're still restricted to making constant predictions in
the leaf nodes.
So this is the point in the past where everybody had abandoned
this model and started exploring other models.
But what we realized is that if you were to take each node and
make it even more powerful.
In fact, allow each node to make a nonlinear prediction,
then the model starts to pan out.
So what we're going to do is that at each node in Bonsai,
we're going to learn two vectors, v and w.
So that each node can make a nonlinear prediction by
computing w transpose x into tan hyperbolic v transpose x.
And this is what will allow Bonsai to learn these curved
nonlinear decision boundaries.
Now, the particular form of this nonlinearity is
not very relevant.
You can choose whatever nonlinearity you like.
We chose this particular form because it worked well for
us in Berkeley.
But if you think that some other nonlinearity
will work better for you in your domain,
then you should just feel free to go ahead and do that.
Everything else in Bonsai will still hold.
So Bonsai's overall prediction is the sum of the individual
node predictions made along the path traversed by a point
from the root to the leaf.
Now, I should also point out that in trees,
there is a unique correspondence between paths and leaves.
So rather than targeting each node to make a prediction,
I could have forced all the nonlinearities
down to the leaf nodes.
And I could have gotten an equivalent model, where only the
leaf node classifiers are making these nonlinear predictions.
And so this corresponds to taking a standard decision tree
and now having much more powerful leaf node classifiers.
And again, this is an idea that has
been tried out many times in the past.
Cho-Jui, Si Si and Inderjit have these great DC-Pred++ trees,
where each leaf node is a kernelized SVM, RBF-SVM.
And people have put neural nets and other classifiers as well.
But the point is that if you were to put a powerful classifier
in each leaf node and
then let all of them be completely independent,
you'd run out of memory very quickly.
So, for instance, over here,
if you were to force all the nonlinearities to the bottom,
we would have 3 v's and 3 w's in each leaf node.
And there are 4 leaf nodes.
So that would give us a total of 12 v's and
12 w's if we were to let them be all independent.
Which is far more than what we have over here.
So the budget would get exhausted.
So path based prediction allows a lot of
parameter sharing in Bonsai.
Which is what really allows us to learn this really,
really compact model while still making nonlinear predictions.
Okay, so to recap, our first contribution,
which differentiates Bonsai from standard decision trees is
the fact that we're going to learn these three vectors,
theta, v and w.
Where theta transpose x is going to control the branching.
And w transpose x into tan hyperbolic v transpose x is
the nonlinearity that's predicted by each node.
So all this is well and good.
But for the fact that theta, v, and
w have the same dimensionality as x.
So if the input feature vector becomes very high dimensional,
then each node in Bonsai will become very bulky and
we will again run out of RAM and flash memory.
So in order to address this limitation, we add a second
ingredient, which further differentiates Bonsai from
standard decision trees, which is a Sparse Projection matrix Z.
That takes the input and
projects it into a very low dimensional feature space.
In fact, on many of our IoT experiments,
we kept only a five-dimensional feature space.
And then all of Bonsai's parameters are now learned
in this low five-dimensional space.
So what this implies is that at each node in Bonsai,
we have to store at most 15 numbers.
Five for theta, five for v and five for w,
which hardly takes any space whatsoever.
And this is what allows Bonsai to finally fit in a few
kilobytes of RAM.
Make predictions very efficiently in milliseconds
even on these slow microcontrollers.
And make sure that batteries last longer as compared to any
other algorithm out there.
While at the same time delivering classification
accuracies that are significantly higher than
the state of the art.
So one more thing that I should mention over here,
an important implementation detail, is the fact that
this sparse projection can be computed in a streaming fashion.
So this allows us to tackle lots of IoT applications
where the feature vector for
even a single point will not fit into two kilobytes of RAM.
And this is very important for
trees because they otherwise won't work with streaming data.
Okay, so I was focusing on the binary classification case all
this while, but you can also extend Bonsai fairly
straightforwardly to handle multiclass classification,
regression, ranking, other tasks in supervised machine learning.
So for instance, if you wanted to extend
Bonsai to handle multiclass classification,
all we would need to do is to replace the vectors V and
W at each node by matrices.
And now rather than predicting a scalar for
binary classification, Bonsai will start predicting a vector
of class scores which you can use for multiclass classification.
Okay, so that's it for the model.
Let me also quickly discuss how we can train the parameters in
the model from data.
So remember that we have two groups of parameters that we
need to train.
The first group is the sparse projection matrix Z.
And I'll refer to the other group which is all the tree
parameters put together in this matrix capital Theta.
So we obtained that by taking all the nodes and
then taking the theta, V and W in each of the nodes and just
stacking them all together into one big, fat matrix, okay?
So here is Bonsai's objective function.
It's really, really, simple.
It has just three terms.
The first two terms are L2 regularizers on the tree
parameters and the sparse projection matrix.
And the third term is any suitable loss function that you
might have chosen for classification, regression,
ranking and so on, right?
So for binary classification,
you could have used the hinge loss; for ranking,
you could have used NDCG gradients, alpha gradients,
sorry, lambda gradients, etc.
And then in addition to the objective function,
we place explicit memory constraints on
the projection matrix Z and the tree parameters theta.
And that'll depend on how much budget you have on your flash
of your device, right?
So notice that this is a hard, non-convex, non-smooth problem.
And it's incredibly difficult to optimize.
But if you're a guy who learns trees every day, you can say,
well what's the big deal?
And trees have always been non-convex and non-smooth.
So I'm just going to learn Bonsai in the standard greedy fashion:
first learn the root node,
then the two children, and then the
four grandchildren after that, and so on.
So, you could do that but
then the problem with that is that you won't get a very
optimal allocation of your memory budget.
So, if you are going to learn the root node at the very
beginning, you have to decide how much budget should
you allocate to the root node.
Are you going to allocate an entire kilobyte to that or
just a little bit?
I mean, it could be that the root node is the most important
because all the data flows through it.
Or you could assign an equal budget to each of your nodes.
So, rather than using a heuristic to determine how much
budget to give to each node, we thought it would be very good
if the optimal allocation of memory budget could happen
automatically as part of the optimization algorithm, right?
So in order for that to happen,
we have to learn all the nodes in the tree jointly.
I know there are many ways of doing that but
we thought we'd try and do this based on gradient descent, okay?
So trees are discontinuous.
So we have to first smooth them in order to get gradient
descent to work.
And the reason trees are discontinuous is because if you
take any point,
it follows a discrete path through the tree, all right?
So this particular point,
it might go and end up in the leftmost leaf node.
And now if you are to change the splitting function at the root,
you change the theta a little bit.
The point could flip over to the other side.
Go to the other extreme, and
now you'll make a very different prediction.
So there's a huge jump over there, right?
So in order to address this issue,
we use a very standard trick which is to smooth the tree.
Points can go both left and
right at each node with some probability.
And when we start off, we can let this probability be uniform.
So points are now going to travel to all the nodes in
the tree which is great, which makes the tree differentiable,
but then you lose all your efficient prediction properties
because now you have to visit every node.
So what we can do is that as we start training,
as our optimization progresses,
we can sharpen this probability distribution.
So that by the time we are nearing convergence,
this distribution has gone back to being an indicator function.
And points will take only a single path through the tree.
So you get the differentiability as well as
efficiency in this particular case.
Okay, so here is our algorithm for
training Bonsai based on gradient descent.
It has just two steps in each iteration.
And these are the two steps.
And we'll keep repeating these two steps until we converge, and
typically we converge very quickly, in about 300 or
400 iterations,
to a good quality solution.
So let me just outline the two steps for you.
In the first step, we start off with a feasible initialization
of the budget for the parameters;
we freeze the budget allocation to the various nodes, and
then we apply K steps of mini-batch gradient descent.
So in this we are constantly improving the model,
we're constantly lowering the loss, but
the budget is fixed across nodes, right?
>> [INAUDIBLE] >> The memory budget, right?
So you can see the root node has, I don't know, you tell me
how many red squares, that's the budget that was allocated to it,
and then the leaf nodes have a slightly lower budget, etc.
>> The number of support.
>> Yeah, so the support is fixed, the number of non-zeroes
that you will allow in each of these thetas, v's and w's.
So we take K steps of gradient descent with this fixed support,
with the fixed budget for the nodes.
And then in order to find a better allocation of the memory
budget, we take a dense gradient step in the next iteration and
this will give you a completely dense solution.
But then you can project back onto the feasible set using
iterative hard thresholding.
So you get a better allocation of the budget.
So, for example,
some budget might move from the root node to the left child.
Another piece of the budget might move from the leftmost
leaf to the next child, etc.
So we keep applying these iterations, and
we find that empirically we, on many datasets, we converge
within 300 iterations to quite a good quality solution, okay?
Let me also briefly present ProtoNN,
which is a compressed k-nearest neighbor algorithm
that has been developed by Prateek and his team.
I won't go into the details of nearest neighbor classification.
All of you know that already, but suffice to say that
k-nearest neighbor algorithms are very general, they're easy
to understand and debug, and they can work very well in
scenarios when we have little amounts of training data, right?
So for typical maker or creator scenarios where it's hard to
go out and gather more data and
label it, you might prefer using k-nearest neighbor algorithm.
But just as was the case with decision trees, you can't
take a standard nearest neighbor algorithm and then hope to
deploy it on an ARM Cortex M0 or an Atmel chip, right?
And that's because you have to lug around the entire training
set for prediction, right?
So in order to make even a single prediction,
you need the entire set around,
which means your prediction costs are going to be really,
really high.
And at the same time,
your predictions might not be very accurate, because we don't
know what is a good distance metric in feature space.
So in order to address these issues,
ProtoNN follows a very similar approach to Bonsai.
We start by learning a sparse projection matrix Z.
And in this case, the matrix not only projects the data from
a high dimensional space to this low five dimensional space, but
it also learns the distance metric for us according to
the given task, right?
Then in order to get further compression,
we start learning exemplars for the training set, right?
So rather than using the entire training set to serve as
exemplars for nearest neighbor, or even choosing a subset of
the training set to be our exemplars, we're going to learn
completely new points, which are not part of the training set,
as the exemplars for nearest neighbor classification.
So that buys us a lot of compression, but
then, in addition to that,
what we also do is share the exemplars between classes.
So let's say, if you were to learn m prototypes, the first
prototype could say, I'm an exemplar for both class one and
class ten, and the second prototype could say,
I am gonna be an exemplar for, let's say, class one and
class two, and the third prototype could be,
again, class two and class ten, and so on.
So in terms of notation,
I'm going to be learning m prototypes, b1 through bm, and
these are going to be five dimensional vectors,
if my projection space is going to be five.
And at the same time, I'm going to learn the votes with which
each prototype is going to vote for the various classes.
These are going to be w1 to wm, and these are going to be l
dimensional vectors for an l class problem, okay?
So when a new point comes in,
in order to make a prediction on a device with ProtoNN, we
are first going to again project it using the sparse matrix z,
go down to the five dimensional space.
And then we're going to get each prototype to vote for
the particular point using its class scores, right, and
then we'll take a weighted majority vote, right?
So if you have,
let's say the fifth prototype, it's going to vote for
the various classes with w5, but then it's going to be weighted
by the distance of b5 to the projected point, right?
So it's a weighted majority vote.
That's how we'll make predictions.
The training procedure for
ProtoNN is very similar to that of Bonsai.
The objective function is also very similar.
We don't have the regularization terms but
the loss function is the same.
It could be any general loss that's suitable for
classification, regression, ranking, etc.
And we have the same set of budget constraints that we had
before: explicit l0 constraints on the memory budget for the W,
B and Z.
And we're again going to learn this using accelerated
proximal gradient descent with iterative hard thresholding.
One additional thing about ProtoNN: of all
the algorithms we've seen so far today in this workshop,
very few have any theoretical guarantees.
But Prateek has managed to prove some sort of theorems for
ProtoNN, which say that it doesn't do badly in good settings,
right?
So in particular, if you had data drawn from, let's say,
2 Gaussian distributions coming from different classes which
were well separated, then you can prove that ProtoNN will
learn these cluster means, which are the truly
representative points, as the prototypes over here,
very accurately in polynomial,
in fact in linear, time in many cases.
So that's something nice about ProtoNN.
Okay, so let's finally move on to some results.
We evaluated ProtoNN and Bonsai on a bunch of benchmark machine
learning and IoT datasets for binary classification,
multi-class classification, and ranking.
And I'm going to take the number of classes in the dataset and
append it to the dataset name, so as to distinguish between
the binary and the multi-class version, okay?
So in our first experiment, we compared Bonsai and ProtoNN's
prediction accuracies using really tiny models to state of
the art uncompressed methods that are running in the cloud
with infinite resources available for prediction.
So they are not constrained at all in terms of memory,
prediction time, energy consumption, etc.
And these methods include gradient boosted decision tree
ensembles, RBF-SVMs, k-nearest neighbor classifiers
as well as neural nets with a single hidden layer.
Now there are many numbers to report over here.
So in order to avoid confusion, I'll first always mention
Bonsai's number in red and then ProtoNN's number in blue and
then all the algorithms afterwards.
Okay, so the interesting thing to note in this experiment
is that there are some cases where Bonsai and ProtoNN can
actually have higher prediction accuracies as compared
to all these uncompressed methods over here, okay?
So many people before me have mentioned that, hey,
it would be great if algorithms could be good for
not just these devices but also on the cloud.
So here is a situation where that actually does happen.
And the same is also true for the other two datasets.
So let's take the first dataset, right?
So that's, I think, the task of right whale detection.
Over there,
the best cloud-based model is gradient boosted decision trees.
But Bonsai has a 5% higher prediction accuracy as compared
to the tree ensemble.
And ProtoNN has a 3% higher prediction accuracy.
While Bonsai's model is about 900 times smaller, and
ProtoNN's is about 300 times smaller.
So the trends are very similar on the character recognition and
the whale datasets, for both the binary and
the multi-class version.
But I think results like these we'll see only once in a blue moon.
I think in a real IoT scenario, what we'd typically
expect to see are results like those for
the Berkeley wearable activity recognition dataset.
So over here, again, it turns out that by some luck, the best
cloud based model is gradient boosted decision tree ensembles.
And in this case, Bonsai's performance is about 1% lower in
terms of prediction accuracy, ProtoNN's about 2% lower.
But both Bonsai and
ProtoNN's models are 75 times smaller than the tree ensemble.
So, they can be incredibly efficient over here.
In our second experiment, we actually implemented some of
these algorithms on the Arduino Uno.
And compared the prediction costs when the model
was restricted to being no larger than two kilobytes.
So now you can see over here that Bonsai and
ProtoNN have much higher prediction accuracies.
Ignore the Cloud-GBDT bar for a moment,
we'll get to that experiment in a second.
But apart from that bar, Bonsai and
ProtoNN have much higher prediction accuracies now
while having very low prediction costs.
Just a few milliseconds per test point on a 16 MHz
microcontroller.
And they take only a few millijoules of energy per
prediction, okay?
Actually, as I had mentioned,
we have very efficient optimized implementations of Bonsai and
ProtoNN.
And this allows us to bring down our prediction cost
even further.
And now you can see over here that Bonsai and
ProtoNN can actually sometimes have prediction cost that
are even lower than that of an optimized linear classifier.
So now you have all the benefits of full blown non-linear
classification while paying less than linear costs
in some cases.
A final experiment that we did in this particular slide is we
asked what would happen if we actually had Cloud connectivity
and we could transmit the feature vector to the Cloud, and
then run a gradient boosted decision tree ensemble
over there, right?
So as you can see from the numbers over here,
just the cost of transmitting the feature vector could be 200
to 1,000 times more in terms of the prediction energy and
the prediction time.
So it's much more efficient to use Bonsai and ProtoNN and
to predict on device than it is to transmit data to the Cloud or
use any of the other algorithms.
In our third experiment, we compared the performance
of Bonsai and ProtoNN to state of the art algorithms for
resource efficient machine learning.
Including gradient boosted decision tree ensembles and
the best technique for pruning such large ensembles,
which came out of [INAUDIBLE]'s group.
We also compared Bonsai and ProtoNN to Decision Jungle
from MSR Cambridge, as well as Feature Budgeted Random Forest
and Pruned Random Forest from Venkatesh's group.
We also compared Bonsai and
ProtoNN to Stochastic Neighborhood Compression,
SNC, which is from Killian Weinberger's team.
It's one of the best methods for
compressing Nearest Neighbor models.
And also to Local Deep Kernel Learning and
neural network pruning, or Deep Compression from Song Han and
Bill Dally, okay?
So over here, I'm plotting graphs of
prediction accuracy versus model size.
And you can see that the gap between Bonsai and ProtoNN and
the second best method is as much as 6% for
the binary classification dataset, and
in fact more than 30% for the multi-class dataset.
And it turns out that some of these methods actually don't
even register on the graph over here because their prediction
accuracies are lower than the y-axis limit for
this model size range.
And the results are representative, in fact,
the trends are almost identical on all the other datasets.
Bonsai and ProtoNN's curves dominate all the other
algorithms for this entire model size range,
indicating that perhaps they are the best models for
these low-memory regimes, okay?
Then finally, I also wanted to demonstrate that our algorithms
generalize beyond the IoT setting
to other resource-constrained scenarios.
So the L3 ranker in Bing has to operate under very tight service
level agreements.
And what I'm showing you over here is the performance of one
of the classifiers that has been developed for
this task by Chris Burges, called FastRank.
So I'm showing you its performance as a function
of the model size.
And the interesting thing to note over here is that
with Bonsai and ProtoNN, we can get models that are smaller by
about 700 times but with almost no loss in ranking accuracy.
So as I was saying earlier,
even if you are not interested in IoT,
if you're just a cloud provider, you might consider using some of
these algorithms to bring down your operating costs.
Okay, so to conclude,
I have only a few take home messages for you.
I think that machine learning for the Internet of Things
will provide many high-impact opportunities for
transforming our society.
Based on this observation, our teams at MSR Redmond and
MSR India are creating a library of machine learning algorithms
that can run on these tiny IoT devices,
as well as a specialized machine learning cross compiler
to compile these algorithms onto different types of IoT devices.
As part of this Edge machine learning library,
which you can find on GitHub, we are releasing the code for
Bonsai and ProtoNN today.
It's completely open source.
You can download with it, play with it, do whatever you like.
And these algorithms are incredibly fast, accurate,
compact, and energy efficient.
So just to show you a teaser of results, here are some of
the results that are coming out of the Redmond lab.
So this is a deep neural network running on a Raspberry Pi live.
That's from Matthai Philipose.
I don't know, can you see the two frames moving?
I can't over here.
>> It's frozen.
>> It's frozen?
So maybe I have to click on it to get it to run.
>> There you go, it's running now.
>> Yeah.
So this is similar to the Apple demo we saw earlier in the day
but this is actually running on a Raspberry Pi.
So a lower class device than the smartphone,
than the Apple iPhone.
Okay, so that's it from my side.
I'm very happy to take questions.
>> [APPLAUSE] >> Question.
>> Thanks man, fascinating stuff.
One thing that I was not sure, maybe I missed,
you said in the beginning that some of these devices do not
have floating point units.
>> Yeah. >> But the models seem to
require floating point computations still.
Can you comment on that a little bit?
>> Yeah, I'm sorry, I should have mentioned that explicitly.
We convert everything to a fixed point.
>> Okay, so are you doing rounding of the weights or
are you just quantizing?
>> Yeah, we're just quantizing to 8 bits or
however many bits your architecture supports.
>> I see.
Okay, I was also curious.
So in a different context of interpretability,
one work that came out maybe about six months back or so
that we found interesting was by Sharad Goel from Stanford and
Dan Goldstein from MSR New York,
and some other people.
They took some large collection of data sets, took a bunch of,
again, linear and non-linear classifiers.
And then they just compared to what a linear classifier,
constrained to have weights in a few discrete
values on every coordinate, would do.
And they were finding that actually, they're getting close
to the best accuracies across the board.
And yeah, so
I was curious if you've taken a look at that sort of models?
>> Yeah, so actually, for all these data sets,
the y-axis was set to be the best accuracy we could get for
a linear classifier.
So it's not just an arbitrary axis.
We did L1 regularization, full linear classifier.
We tried lots of different things, and
that's how we chose this.
So in the kind of data sets we've looked at,
going non-linear actually helps a lot.
But I agree, if in your case
it's better to go with a linear classifier,
you should absolutely do that because it'll take less memory,
less time, less energy.
That should be the first thing you should try for
a particular task.
>> All right, cool, yeah, thanks.
>> Yeah.
>> Any other questions?
>> So I had a quick question on the other graph that you showed.
So what's the variability of the results?
So when you're repeating the train-
>> Yeah, so over here,
I'm just plotting the mean rather than mean and
standard deviation.
We've calculated those as well.
So their standard deviations, in most cases, are very small,
and these results are statistically significant.
Otherwise, there would be too much clutter on the graphs.
>> I was wondering if you could comment a little bit on your
size constraints as a form of regularization.
I see in several of your plots that actually, your methods
decrease in accuracy as the size goes up, which suggests to me,
there's some kind of regularization going on.
Do you have any more intuition on that, or?
>> Yeah, so in a sense, so
if you compare us to a standard decision tree, right, we have to
learn this extra projection matrix as well and stuff.
So I guess, if you don't have a lot of data,
maybe we could actually decrease in terms of accuracy.
Because I mean, if you told us a priori, which of the matrix
elements were non-zero, that would be fine.
But if you have to learn all of that from data and
you don't have enough data,
you could decrease in accuracy in some cases, I guess.
Apart from that, I've not really thought of this in terms
of providing regularization for our classifiers.
I mean, all the classifiers,
I think, tend to follow the same kind of trend.
They will reach a peak and then they will start dropping down.
I don't know whether ours will do that earlier
than others.
I mean, certainly,
it looks like that we hit our peak accuracy very quickly.
But yeah, I haven't thought about that.
So I need to think about that in more detail, [LAUGH] yeah.
>> Do you have any data [INAUDIBLE]?
>> Yeah, so the Microsoft Embedded Learning Library,
I think it's github/ELL.
So github.com/Microsoft/ELL.
And the edge machine learning library,
we'll put up a link to that.
If you just search for
edge machine learning library, you'll get it.
Or you can get it from my website as well,
though the link isn't working yet.
The link on GitHub is.
You can just get it off GitHub.
Yeah, sorry about that, yeah.
>> Yeah, I had a question, Manik.
>> Yeah. >> The ProtoNN,
>> Yes.
>> Is it some sort of sparse dimensionality
reduced SVM-type thing it resembles, right?
>> Yeah, so, Ampadi would be the best guy to answer that
question, but absolutely.
So if you look at this thing over here, right?
So this is very much like a kernel,
this expression you have for- >> Yeah.
>> It's very much like a kernel, so yes, you can.
>> So yes, there is a code, I guess, also available.
I guess this also has, like, exemplars,
that may be one difference.
>> Right, so now instead of the support vectors coming from
a subset of your training data, you're going to
learn the position of these support vectors,
and you're going to restrict yourself to this kind of kernel.
So all of that is there.
And you can also interpret this as a neural network,
as you can with Bonsai.
You can also think of Bonsai as an SVM, if you wanted to, or
as a neural net if you wanted to.
You can derive the kernel map for it as well.
>> But what I meant was, so
did you compare it with sparse SVMs,
which in some sense get sparsity with dimensionality reduction?
I think there is, I don't know about the exemplars.
>> Yeah, local deep kernel learning is,
we compare to lots of SVM methods as well.
Yes, and again, as I mentioned, these DC-Pred++ trees.
So again, the SVM compression methods come in different
varieties.
Some speed up prediction time,
some want to reduce the model size,
and others will tend to do both.
But again, compared to them, we are much better.
>> Yes, you had a question? >> Yes, so you mentioned that in
one example with deep learning, there's only one [INAUDIBLE].
>> That's right.
>> So my question is [INAUDIBLE].
>> Yeah, I think, Toby, we did those experiments as well.
Bonsai and ProtoNN are still the state-of-the-art.
We did those experiments after the ICML paper went into
publication.
But I think we'll put them in an appendix somewhere and
upload them.
So Bonsai and ProtoNN are still very much the state-of-the-art.
In some cases, I think the accuracy comes down a little
bit so the gap will come down.
So rather than being 30% it will now be 20%, 10% etc.
So these are primarily like deep learning networks that are all
fully connected.
I think if you add convolution on top of that then
the accuracy gains will come down even further.
But in a sense then these graphs are no longer fair, right?
Because convolutions they take up almost no space in memory.
They take very little RAM.
But then they're extremely expensive in terms of
computation and battery.
So this is probably not the right way to compare them.
But in terms of just absolute accuracy, yes,
there would be a fall in the gap.
But for a given budget, Bonsai and
ProtoNN would still be more accurate.
>> So a really dumb question. How do I decide whether I want
to use Bonsai, or I wanna use ProtoNN?
>> You try both, then you use both, and you pay us for
[LAUGH] both.
[LAUGH] It's a general question.
When do you use trees,
when do you use Nearest Neighbors?
You choose.
>> [INAUDIBLE] smaller amount of data.
>> In some situations, yes.
>> [INAUDIBLE] more data [INAUDIBLE].
>> No, of course.
But everything will be based optimal when you have
lots of data.
Yeah, I only have lots of data but
Nearest Neighbors can hold a candle to deep learning.
So where I've seen Nearest Neighbors to be really effective,
and Naive Bayes to be really effective, is in very
small data regimes.
There they tend to do better than discriminative methods.
Again, just take Naive Bayes, right?
If all your dimensions are really independent,
you'd probably need only 10 data points to learn a 1,000-dimensional
classifier.
But with a discriminative method you'd need 1,000 times 10.
So at least in my experience,
Nearest Neighbors has worked well over there.
Then if you want to add metric learning on top of that,
sure, you'll need a lot of data for that.
>> Okay, thank you.
>> Cool, thanks guys.
>> [APPLAUSE]