Correct. Powerful enough embedded devices are now de facto everywhere. We just released an open source computer vision and machine learning library, developed initially for a French conglomerate specializing in IoT devices.
The library is cross-platform and supports real-time, multi-class object detection and model training on IoT devices and embedded systems with limited computational resources: https://sod.pixlab.io (source at https://github.com/symisc/sod).
This reply is completely tangential to the focus/topic of your comment, but I wanted to say: THIS is the model of how to do open source.
The developers get financial security while they're working so they can focus, everyone is funded to sit in one place (sometimes) which makes for great communication... and then everybody (society as a whole) gets to benefit.
If we don't figure out how to make computers write our programs for us within the next 10 years, this is the development model of the future.
"If you wish to derive a commercial advantage by not releasing your application under the GPLv3 or any other compatible open source license, you must purchase a non-exclusive commercial SOD license. By purchasing a commercial license, you do not longer have to release your application's source code." --
At Snips we run all our Voice AI models on embedded devices (like a Raspberry Pi 3), and we can also target MCUs. We believe that embedded ML will be the preferred way to solve privacy and efficiency challenges in the future (disclaimer: I'm a co-founder).
If you are interested, you can start building your own Voice AI for free and make it run on embedded devices in under an hour: https://snips.ai
Fully agreed. I think the privacy angle is particularly compelling, and doing on-device analytics using models that have low memory requirements and acceptable (although not academically impressive) accuracy will be the norm.
The case for doing centralized data collection and model training seems to be increasingly related to corporate greed and moat-building rather than actually providing a good experience for users.
Technically, on-device processing is clearly the way forward (it's interesting how Apple is currently pioneering the field in a way).
The pessimist in me already sees how three letter agencies worldwide will welcome this change in order to push down their selectors to the device as well. Recording only the one percent of potentially relevant conversations will make backdoors exponentially easier to hide in the background traffic as well as being much lighter to process.
One has to distinguish between training and inference when talking about "machine learning".
Training a model is a long and resource-intensive process, even if transfer learning is used.
Inference is much less energy intensive and could be done on small chips.
Regardless, I'm not as certain as the author about the future of ML on small devices. Some ML models are huge and need to be updated frequently, so there is little sense in downloading those to small devices. In such cases, it makes much more sense to send feature data to a remote server that can generate a prediction within milliseconds, and then transmit that prediction back to the device.
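For illustration only, here's roughly what that split looks like from the device side; the endpoint URL and JSON fields below are made up, not from any particular product:

    # Device-side sketch: ship a small feature vector, get a prediction back.
    # The URL and payload shape are hypothetical.
    import requests

    features = {"accel_rms": 0.42, "temp_c": 21.5, "mfcc": [1.2, -0.3, 0.8]}
    resp = requests.post("https://example.com/predict", json=features, timeout=2.0)
    print(resp.json())  # e.g. {"label": "anomaly", "score": 0.93}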
Good point on the fit/predict difference. However, there are some models and techniques (e.g. logistic regression with hashing trick) where the fit and predict steps aren't all that different:
A big benefit for doing everything on-device is that a lot of privacy concerns can be mitigated. I also agree that sending data to a server for learning is an option, and the privacy problems can be addressed with something like client-side feature hashing as I mention in:
However, doing that in a very power-conscious environment does pose difficulties with radio usage, which is comparatively power hungry. It's probably a case-by-case situation.
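For anyone curious, a minimal sketch of the hashing-trick idea mentioned above, using scikit-learn with made-up feature strings: the hashed representation is computed client-side, and one incremental fit step costs about as much as a prediction.

    # Logistic regression with the hashing trick: an incremental "fit" (one
    # partial_fit call) and a "predict" go through the same cheap hashed features.
    from sklearn.feature_extraction import FeatureHasher
    from sklearn.linear_model import SGDClassifier

    hasher = FeatureHasher(n_features=2**10, input_type="string")  # fixed memory footprint
    clf = SGDClassifier(loss="log_loss")  # logistic regression via SGD ("log" on older sklearn)

    # One labelled observation as raw string features, hashed client-side so the
    # server never needs the original feature names or values.
    x = hasher.transform([["sensor=kitchen", "hour=23", "motion=low"]])
    clf.partial_fit(x, [0], classes=[0, 1])  # incremental fit: a single gradient step

    x_new = hasher.transform([["sensor=kitchen", "hour=07", "motion=high"]])
    print(clf.predict_proba(x_new))          # predict uses the same hashed representation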
There has been a lot of research into using very low bit depth weights in neural nets, pruning, etc. I am pretty confident that this research, combined with purpose-designed silicon, will allow us to evaluate quite powerful neural net models on embedded systems.
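To make the low-bit-depth point concrete, here's a rough sketch of post-training 8-bit quantization of a weight matrix (plain scale-and-zero-point style; the shapes and numbers are arbitrary):

    # Quantize float32 weights to 8 bits, then dequantize at inference time.
    import numpy as np

    w = np.random.randn(64, 64).astype(np.float32)           # float32 weights
    scale = (w.max() - w.min()) / 255.0                       # map the float range onto 256 levels
    zero_point = np.round(-w.min() / scale).astype(np.int32)

    w_q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)  # stored as 8-bit
    w_deq = (w_q.astype(np.float32) - zero_point) * scale     # reconstructed for inference

    print("max abs error:", np.abs(w - w_deq).max())          # small relative to w's range
    print("memory: %d -> %d bytes" % (w.nbytes, w_q.nbytes))  # 4x smaller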
> In a lot of cases, it makes much more sense to send feature data to a remote server that can generate a prediction within milliseconds, and then transmit that prediction back to the device.
It may make sense now, but not having to power up the radio for every decision is a huge gain, as laid out in the article. The current model of dumb (as in ML) devices is coming to an end; see also CoreML from Apple.
The more I learn about machine learning, the more I realize it's really all about training. Once a model is trained and available, it seems ready to be commoditized to me. ML as a service seems like the only reasonable way the industry can evolve.
I still think the defining moment for ML inference (and maybe even training!) on embedded devices will come when there are viable special-purpose, low-power ML chips.
As much as I hate to do this, I'm going to make a comparison to Bitcoin mining.
Mining is all about optimizing hashes/joule to get the best ROI. We watched it go from CPU -> GPU -> FPGA -> ASIC in the quest for efficiency.
But I think the final leap will come by going from digital execution to application-specific analog computing. If you don't need high precision, you can compute extremely quickly and efficiently using properly-configured analog circuits.
I remain unconvinced we'll see ASICs dominating inference. Part of the problem is that even if we're just talking about neural networks, there's a variety of architectures, activation functions, etc. to consider. At this stage, from my own benchmarking Nvidia is close enough to the TPU with the V100 card while allowing much more flexibility in the software stack used.
For inference, GPUs are also pretty damn efficient, since it's an embarrassingly parallel task with minimal synchronization (no gradient updates needed). In this case, FPGAs are a far better choice since you can push updates to accommodate new network architectures, activation functions, etc. The TPU instead relies on a matrix-multiply unit, which supports more use cases but won't be as performant on something like an RNN.
After some investigation, you are correct! Knowing that some of TrueNorth's creators previously worked on mixed-mode systems, I made the assumption that this one was too.
It seems the TrueNorth is indeed fully digital, but takes advantage of the event-driven architecture and peer-to-peer communication between many tiny cores to keep things low-power.
A few folks have been preaching this a lot, but my understanding is that devices/MCUs are getting more powerful over time, so the need to specialize for low-end devices should shrink, not grow. People use the argument in the article to spawn large teams who do nothing but optimize for low-end devices, assuming devices won't progress over time. I do wonder whether that is a good use of their time and talent.
Small slow processors are likely to have much lighter power requirements too. Barring a breakthrough in battery tech or wireless power, that’s going to be important for a long time for many applications (especially IoT).
Being able to run on battery or energy harvesting versus needing a power cable can be a killer feature. It typically makes deployment much easier, and opens new possibilities.
> A few years ago my priority would have been convincing people that deep learning was a real revolution, not a fad, but there have been enough examples of shipping products that that question seems answered.
Exactly which products is the author referring to? I am having a hard time thinking of one, but maybe it's just me living in my bubble...
Running a neural algo with an already-trained net is easy-peasy. Doing actual learning on an MCU for anything serious is still impossible.
Learning can be run on commodity GPUs/DSPs, and they will not be that much worse than dedicated hardware. But on the embedded side, a small, low-power ASIC is the only thing that makes 99% voice recognition a possibility.
This is why I think that learning startups will not go anywhere far in comparison to companies that use the results of that learning, which can be produced in DCs using commodity hardware.
You can give the illusion of edge learning by shipping datasets to large number-crunchers in the cloud and receiving altered nets back. That even gives one the benefit of learning from the collective experience of fellow devices.
I wonder if we could (of course we can, I'm wondering if someone already did it) split the training workload across a number of small embedded devices with their tiny NEON units and have them share the resulting trained models. Making nodes self-coordinate the shared workload and assemble the results would be interesting.
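Something along those lines exists under the name federated learning. A toy numpy sketch of the idea (everything here is made up; real systems also handle stragglers, compression, privacy, and so on):

    # Each device trains locally on its own shard; only averaged weights travel.
    import numpy as np

    def local_train(weights, local_x, local_y, lr=0.1, epochs=5):
        """One device's contribution: a few gradient steps of linear regression on its own data."""
        w = weights.copy()
        for _ in range(epochs):
            grad = local_x.T @ (local_x @ w - local_y) / len(local_y)
            w -= lr * grad
        return w

    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0, 0.5])
    devices = []
    for _ in range(8):                              # 8 small devices, each with private data
        x = rng.normal(size=(50, 3))
        y = x @ true_w + rng.normal(scale=0.1, size=50)
        devices.append((x, y))

    global_w = np.zeros(3)
    for _ in range(20):                             # coordination rounds
        updates = [local_train(global_w, x, y) for x, y in devices]
        global_w = np.mean(updates, axis=0)         # only weights are shared, never raw data

    print(global_w)                                 # approaches true_w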
That Cyber-Hans thing is already happening in industry. It was already happening 15 years ago. At least, that's the first time I saw something like it: a device that would find out whether roof shingles had a defect by tapping them acoustically. They previously had a Hans doing it, who would knock them and listen, and they replaced him with a Cyber-Hans.
In this case it wasn't a neural net, I think it was simple multiple linear regression + Fourier Transform.
I use a similar trick with bike wheels: when re-spoking or checking a wheel for integrity, I strum the spokes. A good rim and tight spokes sound different from a broken rim or loose or overstressed spokes. The difference is easily noticeable.
I think there are movies, with train platform scenes, where you can see the railway guy going by with a mallet, giving the wheels a light tap and listening to the sound.
The current state of the art in embedded/IoT ML is to train ML algos in the cloud on large datasets, then run them on gateway-class devices (usually Linux/MSFT boxes, though this can get down to RPi levels of memory and compute). Most companies today use Docker to package and deploy the models, hence the need for a larger-footprint box. Check out AWS Greengrass, Azure Edge, and Foghorn for examples.
Neurones are electrical but work mostly chemically. The slowest unmyelinated fibres conduct at roughly 0.5-2 m/s, so if you hit your toe on a chair and you are 1.8 m tall (and pardon my rough math/science here), the signal can take on the order of a second or more to reach your brain (which, of course, is why reflexes are handled close to the spine and not in the brain).
And now imagine the complex processing required to see/hear and recognize something. It is done quite quickly, and yet the basic processing unit of the brain is slow. One might think it is the massive parallelism of the brain that makes this possible so quickly, but even so, if you think about it, all the processing done in such a small amount of time cannot be more than about a thousand sequential operations deep...
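The same back-of-the-envelope argument in a few lines (all numbers are rough assumptions, just to show the orders of magnitude):

    # Rough timing argument: slow neurons, fast recognition => shallow serial depth.
    conduction_speed_m_s = 1.0        # slow unmyelinated fibre, order of magnitude
    toe_to_brain_m = 1.8
    print("toe-to-brain delay ~", toe_to_brain_m / conduction_speed_m_s, "s")

    # Recognition happens in a few hundred milliseconds, while a neuron needs on the
    # order of milliseconds to fire and pass a signal on, so any serial chain of
    # processing can only be ~100 steps deep; the rest has to come from parallelism.
    recognition_time_s = 0.3
    per_neuron_time_s = 0.003
    print("max serial depth ~", recognition_time_s / per_neuron_time_s, "steps")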
The author has some very good points. Also, modern MCUs like STM32 are powerful enough to run a whole big operating system like Linux while keeping power usage relatively low and being as cheap as 8-bit MCUs, so using them for ML tasks on different devices is a natural step forward.
Which? I'd be hard pressed to find an MCU that a) can run Linux, unless it's MMU-less Linux (e.g. uClinux) or your definition of MCU includes architectures like the Cortex A with MMUs b) has the RAM needed to run Linux, unless external SDRAM or similar is provided on the PCB c) is as cheap as an 8-bit MCU like an AVR.
If Cortex-A class, the iMX6UL from NXP comes to mind for a) and b), but there's no way it also addresses c).
I meant uClinux running on Cortex-M3/M4, but I really hope to run real Linux on STM32MP recently added to the Linux kernel - the actual hardware is not released yet though.
What are good STM32 dev kits that can run Linux? Preferably toward the cheaper end of the spectrum, like the Raspberry Pi of STM32s. (Or even other architectures.)
>"This makes deep learning applications well-suited for microcontrollers, especially when eight-bit calculations are used instead of float, since MCUs often already have DSP-like instructions that are a good fit."
Can someone shed some light on what the author means by "DSP-like instructions"? What are characteristics of DSP instructions? Is there something that makes these unique compared to general purpose CPUs or GPUs?
With utensor.ai, you can probably try this out today. We are currently working on integrating CMSIS-NN with uTensor.
CMSIS-NN is a library of exactly these MCU SIMD-optimized functions.
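For intuition, here's roughly what one of those instructions (e.g. Arm's SMLAD: two signed 16-bit multiplies plus an accumulate in a single instruction) computes, modelled in plain Python:

    # Model of a dual multiply-accumulate on two 16-bit values packed into one
    # 32-bit word -- the hardware does this in a single cycle.
    def smlad(packed_a, packed_b, acc):
        def halves(p):
            lo, hi = p & 0xFFFF, (p >> 16) & 0xFFFF
            # reinterpret each halfword as signed 16-bit
            lo = lo - 0x10000 if lo >= 0x8000 else lo
            hi = hi - 0x10000 if hi >= 0x8000 else hi
            return lo, hi
        a_lo, a_hi = halves(packed_a)
        b_lo, b_hi = halves(packed_b)
        return acc + a_lo * b_lo + a_hi * b_hi

    # A dot product over 16-bit quantized weights/activations then needs half as many
    # instructions as elements, which is why 8/16-bit networks map well to MCUs.
    print(smlad((3 << 16) | 2, (5 << 16) | 7, 0))   # 2*7 + 3*5 = 29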
Slightly tangential question. I ride my electric uniwheel on the sidewalks, but sidewalks in my city sometimes have huge potholes, so I have to constantly watch for potholes so I don't trip over and lose half of my teeth.
Is it possible for me to embed a camera on my uni that can see potholes 10 feet away and beep my headphones? I am not sure where to even start with this.
As a cyclist, I'd be interested in such a technology too.
Unfortunately, despite the lip service many US cities give to cyclists, when it comes to practical issues like road quality, cities tend not to care. Here in Austin there are quite a few bike lanes/cycletracks that are so bad that I refuse to use them. Usually it's a combination of poor visibility of cyclists in the lane (making being hit by turning drivers more likely) and poor road quality (e.g., chip seal resulting in some of these lanes basically being gravel). I've seen it claimed that the city regularly cleans out this gravel, but I can only recall a few times over the past 5 years when I thought the gravel might have been removed. I don't need machine learning to tell me to avoid these roads, but detecting the potholes would be helpful.
Start by mounting a camera on your bike. Record for a couple of months and you would have good enough data to start experimenting with. The next step would be having your friends mount cameras on their bikes.
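A minimal sketch of that data-collection step (OpenCV, default camera, arbitrary paths and frame rate; swap in whatever camera the uniwheel actually carries):

    # Grab frames and save them with timestamps so they can be labelled
    # "pothole"/"no pothole" later.
    import cv2, time, os

    os.makedirs("ride_frames", exist_ok=True)
    cap = cv2.VideoCapture(0)            # replace with the actual camera device/stream
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            fname = "ride_frames/%d.jpg" % int(time.time() * 1000)
            cv2.imwrite(fname, cv2.resize(frame, (640, 360)))   # keep files small
            time.sleep(0.5)              # ~2 fps is plenty for building a dataset
    finally:
        cap.release()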
How would the smaller units handle larger ops, convolutions, RNNs and the like? Even assuming custom chips, all the heat that is generated (which GPUs use large fans and heat sinks to dissipate) has to be removed somehow. Won't that be a problem?
There are no "larger" ops. However, things like RNNs can require more memory to execute because of the longer chain of data they need to execute the operations on.
As noted in the article you can alleviate this by halving the size of the model at the cost of accuracy.
The heat in large GPUs is because of the large number of cores they have operating simultaneously.
Wow. Deep learning would certainly not be my technique of choice on constrained architectures, but there are situations where you don't really have viable alternatives right now, so I'm glad to see that's actually doable.
It really depends on what you call "constrained". For about £5 you can get a Linux-powered RISC machine with 512MB of RAM, a GPU, and rich IO capabilities. I have worked on large multi-user environments smaller than that powering dozens of serial terminals on everyone's desktops. That's a lot of compute power.
What I wouldn't like to do is run the training part on such small devices. If there were a good way to do incremental learning after you've trained your model, so it could continuously fine-tune itself using the embedded hardware on a reasonable power budget, I'd go for it.
And while you won't run large networks, you can probably get away with many smaller, more specialized ones.
The price of compute is less and less constraining every year. However, when running on battery, the energy budget can be severely constraining.
Also, people just end up wanting to do more. Real-time video at a decent framerate is still challenging for sub-100 USD devices. When that's easy, it'll be time for real-time 3D data (LIDAR etc.).
Decision trees, random forests, logistic regression and most of the boring old statistical classifiers work on anything down to an 8-bit micro with <1k RAM. SVMs are highly effective and don't need much more RAM than that if you're careful.
Yeah, that was my take as well. I started by implementing Random Forests, really fast and compact for even the smallest of microcontrollers. Will probably add some variant of boosting trees in the future. https://github.com/jonnor/emtrees
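For scale, a deliberately tiny forest is already useful and fits comfortably in MCU flash. The sketch below just trains one in scikit-learn and estimates its size; the actual export to C (which is what emtrees handles) isn't shown:

    # A small, shallow random forest of the kind that fits on a microcontroller.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    clf = RandomForestClassifier(n_estimators=10, max_depth=5).fit(X, y)

    # Rough size estimate: each decision node is basically (feature index, threshold,
    # two child indices), so the whole forest is on the order of a few kilobytes.
    nodes = sum(est.tree_.node_count for est in clf.estimators_)
    print(nodes, "nodes, ~", nodes * 8, "bytes at 8 bytes/node")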
I'm not as familiar with the principles, but is there convergence between the principles behind these chips and the neuromorphic chips proposed by Carver Mead?
It's funny how incredibly bad news this is. And it does seem like it's correct.
> For example, the MobileNetV2 image classification network takes 22 million ops ... 110 microwatts, which a coin battery could sustain continuously for nearly a year.
So making a tiny mine that blows up if and only if it sees a particular person (or worse, a particular race or ...) is now theoretically possible and essentially a few hardware revisions away from being doable.
This isn't taking the consumption of the camera into account. But of course there could be a PIR or other motion sensor (months of battery) that would launch the camera on-demand and then evaluate the target.
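The quoted figure roughly checks out, assuming a CR2032-class coin cell (about 225 mAh at a nominal 3 V; real capacity varies with load):

    # Back-of-the-envelope check of "a coin battery could sustain 110 uW for nearly a year".
    capacity_wh = 0.225 * 3.0          # ~0.675 Wh of stored energy in the cell
    power_w = 110e-6                   # 110 microwatts of continuous draw
    hours = capacity_wh / power_w
    print(hours / 24, "days")          # ~255 days, i.e. most of a year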
That's cute. I personally would be more worried about quadrocopter drones strapped with grenades that use face recognition to act autonomously, possibly without GPS to prevent jamming attacks.
The idea is thought-provoking, but would be another useless sink of tax money.
1-If the mine targets people of some race, then it will attack your own soldiers, local allies and spies of the same race.
2-Clothes and makeup are common to all human cultures. After a few strikes, people would learn how to blend into the landscape and avoid being taken for a target.
3-The system would need a sort of eye over the soil, detectable by human eyes and software, or a sort of wifi, detectable with software.
4-This "eye" part would be vulnerable to dust, leaves and debris falling over it, something that happens very quickly at soil level in deserts, snowy areas and rainforests.
5-If the mine stays inactive until people of some colour appear, your enemies could use a disguise to take it safely and reuse the weapon in their own army.
6-Such mines could be modified to target presidents, military high commands, policemen or politicians, all easily distinguishable by their "feathers": well-known badges, official uniforms... At that point, a project for a mine aimed at VIPs would be shut down and deeply buried pretty fast.
Or just mine an entire area of someone else's country and walk away, like we do now.
The problem with most of these ideas is that if you're willing to do it, you probably are willing to just shoot/explode/ethnically cleanse an area anyway.
The question as always is better framed as "what does this enable that they couldn't do before?"