I think Ben (who I generally think is right on) in this post misapprehends the effectiveness of generalized data for machine learning services and thus the effectiveness of this approach in Google’s strategy here. Perhaps the slide makers in Mountain View have the same misapprehension.
1 - Prediction API - you provide your own data there, so no data advantage from Google there.
2 - Cloud Natural Language API - the effectiveness really depends on what type of text you want to understand. If Google’s training data includes information about my type of text application, great, but if it doesn’t, then what? How do I know that?
3 - Cloud Vision API - likewise. Can I subset the training set? Provide my own examples? If they subset, can I inspect the examples?
4 - Translation API seems like the exception here, mainly because customers of a translation service are unlikely to have collected language pairs, and this kind of collection is more highly specialized. But it’s unclear that this one API would be the deciding factor for many companies choosing which cloud vendor to use.
ML services as a differentiator have yet to be proven out. I am highly skeptical. Yes, some big general data sets will be better on some applications than others, but an enterprise’s own data about its problem will always be better than a huge, general data set. And if you’re using your own data anyway, you’re going to care about all the platformy things Amazon has already been winning with.
Barring proprietary breakthroughs in unsupervised learning, I don’t believe that this strategy as outlined will work in practice.
I don't think Google is marketing their Cloud NLP/Vision APIs to big enterprise customers that have very specific needs. Those APIs are meant for people who have common needs (= want to identify which items or people are in a photo, understand queries in commonly used languages, etc.)
If you have specific needs, then you can use TensorFlow running on the app engine (as they will soon be providing hosted and GPU-accelerated instances), which at worst makes it equal to Amazon's offering... but something tells me the vast majority of Google Cloud customers will be satisfied with pre-trained models that can be applied to a very large swath of problems.
They definitely are for most customers: it is extremely expensive to gather and label enough data for a deep learning model to work correctly. It's very unlikely that you'll manage to configure and train your models + generate input data that Google lacks to make your model work much better than what Google already provides with their "generalist" API.
Say you are an insurance company and you want to build a model that uses damage photos and metadata about the car as a backstop to make sure that your repair shops aren't ripping you off.
In this case you already have a bunch of historical labeled data, and a pre-trained model is useless to your application. It doesn't help you that the pre-trained model can recognize 10 different types of cats; you need a model trained on photos of damaged cars. Obviously the insurance company's own photo data will be more useful here because it's data about the application domain.
Google has collected a ton of photos for the purpose of image search and consumer photo organizing, and that model's utility has been tuned to those application areas.
The key question is how much overlap there is between all applications of image models and the photos Google has collected.
There will be overlap for some, but my guess is that the mission-critical, "I can only get this performance from Google Cloud" applications are few and far between.
I'm not saying there aren't any. Ben's article suggests that Google's data is somehow going to be a mission-critical asset for all application areas, which I think is a terribly naive idea when it comes to ML.
I think you've really pointed out a very important flaw in that as developers and startups we don't have terabytes of training data.
However, I'm not sure who's at the leading edge here, though Microsoft seems to present itself as the leader. In the end, machine learning and deep learning will be commoditized, with enough training already baked in.
E.g., need your web app translated flawlessly into Farsi? Just drop farsi.js in before your </head>, and so on.
Information embedded in a pre-trained model (Google isn't giving out that data, just access to an artifact) is helpful if your application lines up with the goals of the model. I think it's an open question how broad that alignment might be. It totally depends on the application.
You can control the app, but you don't have control over the model.
It will be useful to some apps but I don't think it is going to be the secret weapon in Google's fight against AWS as Ben's article suggests. It's a neat argument but I think it ignores the reality of applications and machine learning.