Hacker News
Machine learning can run on tiny, low-power chips (petewarden.com)
216 points by sebg on June 11, 2018 | 72 comments


Correct. Sufficiently powerful embedded devices are now de facto everywhere. We just released an open source computer vision and machine learning library, developed initially for a French conglomerate specializing in IoT devices.

The library is cross-platform and supports real-time, multi-class object detection and model training on IoT devices and embedded systems with limited computational resources.

https://sod.pixlab.io

https://github.com/symisc/sod


This reply is completely tangential to the focus/topic of your comment, but I wanted to say: THIS is the model of how to do open source.

The developers get financial security while they're working so they can focus, everyone is funded to sit in one place (sometimes), which makes for great communication... and then everybody (society as a whole) gets to benefit.

If we don't figure out how to make computers write our programs for us within the next 10 years, this is the development model of the future.


"If you wish to derive a commercial advantage by not releasing your application under the GPLv3 or any other compatible open source license, you must purchase a non-exclusive commercial SOD license. By purchasing a commercial license, you do not longer have to release your application's source code." --


I argue this is even better: anybody who plans to benefit commercially must pass on the fruits of their success to upstream.


At Snips we run all our Voice AI models on embedded devices (like a Raspberry Pi 3), and we can also target MCUs. We believe that embedded ML will be the preferred way to solve privacy and efficiency challenges in the future (disclaimer: I'm a co-founder).

If you are interested, you can start building your own Voice AI for free and make it run on embedded devices in under an hour: https://snips.ai


Fully agreed. I think the privacy angle is particularly compelling, and doing on-device analytics using models that have low memory requirements and acceptable (although not academically impressive) accuracy will be the norm.

I wrote about the privacy perspective a bit:

https://adamdrake.com/scalable-machine-learning-with-fully-a...

and I recently gave a lecture more focused on the performance aspect:

https://adamdrake.com/big-data-small-machine.html

The case for doing centralized data collection and model training seems to be increasingly related to corporate greed and moat-building rather than actually providing a good experience for users.


Technically, on-device processing is clearly the way forward (it's interesting how Apple is currently pioneering the field in a way).

The pessimist in me already sees how three letter agencies worldwide will welcome this change in order to push down their selectors to the device as well. Recording only the one percent of potentially relevant conversations will make backdoors exponentially easier to hide in the background traffic as well as being much lighter to process.


One has to distinguish between training and inference when talking about "machine learning". Training a model is a long and resource-intensive process, even if transfer learning is used.

Inference is much less energy intensive and could be done on small chips.

Regardless, I'm not as certain as the author about the future of ML on small devices. Some ML models are huge and need to be updated frequently, so there is little sense in downloading them to small devices. In such cases, it makes much more sense to send feature data to a remote server that can generate a prediction within milliseconds, and then transmit that prediction back to the device.


Good point on the fit/predict difference. However, there are some models and techniques (e.g. logistic regression with the hashing trick) where the fit and predict steps aren't all that different:

https://adamdrake.com/big-data-small-machine.html
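
For anyone curious, here is a rough C sketch of the hashing-trick idea (my own illustration, not code from the posts above): feature strings are hashed into a fixed-size weight table, and fit is just predict plus one small update per hashed feature, which is why the line between training and inference blurs for this class of model.

    #include <math.h>
    #include <stdint.h>
    #include <stddef.h>

    #define NUM_WEIGHTS 1024          /* fixed memory budget, power of two */

    static float weights[NUM_WEIGHTS];

    /* FNV-1a hash maps an arbitrary feature string to a weight slot. */
    static uint32_t feature_slot(const char *feature) {
        uint32_t h = 2166136261u;
        for (; *feature; feature++) {
            h ^= (uint8_t)*feature;
            h *= 16777619u;
        }
        return h & (NUM_WEIGHTS - 1);
    }

    /* Predict: dot product of hashed (binary) features with the weight table. */
    static float predict(const char *const *features, size_t n) {
        float z = 0.0f;
        for (size_t i = 0; i < n; i++)
            z += weights[feature_slot(features[i])];
        return 1.0f / (1.0f + expf(-z));   /* sigmoid */
    }

    /* Fit: one online SGD step -- the same walk over the same table. */
    static void fit(const char *const *features, size_t n, float label, float lr) {
        float err = predict(features, n) - label;
        for (size_t i = 0; i < n; i++)
            weights[feature_slot(features[i])] -= lr * err;
    }

The whole model is a 4 KB array, so both steps fit comfortably on an MCU-class device.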

A big benefit for doing everything on-device is that a lot of privacy concerns can be mitigated. I also agree that sending data to a server for learning is an option, and the privacy problems can be addressed with something like client-side feature hashing as I mention in:

https://adamdrake.com/scalable-machine-learning-with-fully-a...

However, doing that in a very power-conscious environment does pose difficulties with radio usage, which is comparatively power-hungry. It's probably a case-by-case situation.


There has been a lot of research into using very low bit depth weights in neural nets, pruning, etc. I am pretty confident that this research, combined with purpose-designed silicon, will allow us to evaluate quite powerful neural net models on embedded systems.
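
To make the low-bit-depth point concrete, here is a minimal sketch of plain 8-bit affine quantization (a generic illustration, not any specific paper's or chip's scheme): each weight tensor gets a single scale into int8, and the dot product then runs entirely in integer arithmetic.

    #include <stdint.h>
    #include <math.h>
    #include <stddef.h>

    /* Symmetric quantization: wq = round(w / scale), scale = max|w| / 127. */
    static float quantize(const float *w, int8_t *wq, size_t n) {
        float max_abs = 0.0f;
        for (size_t i = 0; i < n; i++)
            if (fabsf(w[i]) > max_abs) max_abs = fabsf(w[i]);
        float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
        for (size_t i = 0; i < n; i++)
            wq[i] = (int8_t)lrintf(w[i] / scale);
        return scale;             /* kept to rescale accumulator results later */
    }

    /* Integer dot product; a 32-bit accumulator is enough for typical layer sizes. */
    static int32_t dot_i8(const int8_t *a, const int8_t *b, size_t n) {
        int32_t acc = 0;
        for (size_t i = 0; i < n; i++)
            acc += (int32_t)a[i] * (int32_t)b[i];
        return acc;
    }

The accuracy loss from 8-bit weights alone is usually small, and this is exactly the kind of arithmetic that purpose-designed silicon can do very cheaply.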


> In a lot of cases, it makes much more sense to send feature data to a remote server that can generate a prediction within milliseconds, and then transmit that prediction back to the device.

It may make sense now, but not having to power up the radio for every decision is a huge gain, as laid out in the article. The current model of dumb (as in ML) devices is coming to an end; see also Core ML from Apple.


The more I learn about machine learning, the more I realize it's really all about training. Once training is done and a model is available, it seems ready to be commoditized. ML as a service seems like the only reasonable way the industry can evolve.


And it's coming: https://lobe.ai


I still think the defining moment for ML inference (and maybe even training!) on embedded devices will come when there are viable special-purpose, low-power ML chips.

As much as I hate to do this, I'm going to make a comparison to Bitcoin mining.

Mining is all about optimizing hashes/joule to get the best ROI. We watched it go from CPU -> GPU -> FPGA -> ASIC in the quest for efficiency.

In some ways, we're seeing the same thing in ML model training and inference: CPU -> GPU -> TPU. We're even seeing some special-purpose coprocessors deployed, as in the iPhone X. (https://www.wired.com/story/apples-neural-engine-infuses-the...)

But I think the final leap will come by going from digital execution to application-specific analog computing. If you don't need high precision, you can compute extremely quickly and efficiently using properly-configured analog circuits.

IBM is working on this kind of system with their TrueNorth line (https://techcrunch.com/2017/06/23/truenorth/)

It hasn't been proven yet, but I think there is huge potential.


I remain unconvinced we'll see ASICs dominating inference. Part of the problem is that even if we're just talking about neural networks, there's a variety of architectures, activation functions, etc. to consider. At this stage, from my own benchmarking Nvidia is close enough to the TPU with the V100 card while allowing much more flexibility in the software stack used.

For inference, GPUs are also pretty damn efficient since it's an embarrassingly parallel task w/ minimal synchronization (no gradient updates needed). In this case, FPGAs are a far better choice than ASICs since you can push updates to accommodate new network architectures, activation functions, etc. The TPU instead relies on a matrix-multiplier unit, which supports more use cases but won't be as performant on something like an RNN.


I think Microsoft's experience with FPGAs for inference would agree with you.

Currently, they are only allowing external customers to use ResNet-50 with their FPGA-enabled Azure ML.


TrueNorth is 100% digital.


After some investigation, you are correct! Knowing that some of TrueNorth's creators previously worked on mixed-mode systems, I made the assumption that this one was too.

It seems the TrueNorth is indeed fully digital, but takes advantage of the event-driven architecture and peer-to-peer communication between many tiny cores to keep things low-power.

( http://paulmerolla.com/merolla_main_som.pdf for some details )

Thank you for the correction!


A few folks have been preaching this a lot, but my understanding is that devices/MCUs are getting more powerful over time, so the need to specialize for low-end devices should decrease, not increase. People use the argument in the article to spawn large teams who do nothing but optimize for low-end devices, assuming devices won't progress over time. I do wonder whether this is a good use of their time and talent.


It's always a cost analysis: even if small, powerful processors are available, should you use them?

If you're shipping a million units, a 30-cent difference can pay for a year of a good dev.


Small slow processors are likely to have much lighter power requirements too. Barring a breakthrough in battery tech or wireless power, that’s going to be important for a long time for many applications (especially IoT).


Not to mention that sometimes a 30-cent difference opens up whole new markets.


Probably yes. Besides savings on the emitting side, they'll always have a faster/lighter experience advantage on the receiving side.


Being able to run on battery or energy harvesting versus needing a power cable can be a killer feature. It typically makes deployment much easier, and opens new possibilities.


> A few years ago my priority would have been convincing people that deep learning was a real revolution, not a fad, but there have been enough examples of shipping products that that question seems answered.

Exactly what products is the author referring to? I'm having a hard time thinking of one, but maybe it's just me living in my bubble...


I guess all the voice assistants use Deep Learning as it is the state of the art in voice recognition and NLP.

I'm almost certain they all offload the processing to a server though.


There is the exception of https://snips.ai/.

I think that offloading the processing is not functionally required, but having the data is valuable for the big corps.


Well, I would argue that they are not so successful yet. I don't use them, nor do any people around me.

We, eventually, will get there, but they are not my definition of a successful product yet...


Every single visual and sound classification app around. Here's a random link: https://www.scientificamerican.com/article/pogue-8-recogniti...

All the person tagging features in Apple Photos, Google Photos, Facebook.


There is a showstopper:

Running a neural algorithm using an already-trained net is easy-peasy. Doing actual learning on an MCU for anything serious is still impossible.

Learning can be run on commodity GPUs/DSPs, and they will not be that much worse than dedicated hardware. But on the embedded side, a small, low-power ASIC is the only thing that makes 99% voice recognition a possibility.

This is why I think that learning startups will not go far in comparison to companies that use the results of learning that can be done in data centers on commodity hardware.


You can give the illusion of edge learning by shipping datasets to large number-crunchers in the cloud and receiving updated nets back. That even gives you the benefit of learning from the collective experience of fellow devices.

I wonder if we could (of course we can; I'm wondering if someone has already done it) split the training workload across a number of small embedded devices with their tiny NEON units and have them share the resulting trained models. Making nodes self-coordinate the shared workload and assemble the results would be interesting.
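
The usual framing for this is federated averaging: each device trains on its own shard of data, and only the resulting weights are shared and combined. A toy sketch of just the aggregation step, purely illustrative (the names and structure are mine, not from any particular framework):

    #include <stddef.h>

    /* Average the weight vectors reported by several devices into a new
     * shared model -- the aggregation half of a federated-averaging round.
     * Each device would then download shared_out and keep training locally. */
    static void average_models(const float *const *device_weights,
                               size_t num_devices, size_t num_weights,
                               float *shared_out) {
        for (size_t w = 0; w < num_weights; w++) {
            float sum = 0.0f;
            for (size_t d = 0; d < num_devices; d++)
                sum += device_weights[d][w];
            shared_out[w] = sum / (float)num_devices;
        }
    }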


That Cyber-Hans thing is already happening in industry. It was already happening 15 years ago; at least, that's the first time I saw something like this, for a device that would find out whether roof shingles had a defect by tapping them acoustically. They previously had a Hans doing it, who would knock on them and listen, and they replaced him with a Cyber-Hans.

In this case it wasn't a neural net, I think it was simple multiple linear regression + Fourier Transform.


> find ... a defect by tapping it acoustically

Why train wheels used to be tapped with a hammer at platforms: https://en.wikipedia.org/wiki/Wheeltapper


I use a similar trick with bike wheels: when re-spoking or checking a wheel for integrity, I strum the spokes. A good rim and tight spokes sound different from a broken rim or loose or overstressed spokes. The difference is easily noticeable.


That's really cool!


I think there are movies, with train platform scenes, where you can see the railway guy going by with a mallet, giving the wheels a light tap and listening to the sound.


The current state of the art in embedded/IoT ML is to train ML algorithms in the cloud on large datasets, then run them on gateway-class devices (usually Linux/MSFT boxes, but this can get down to RPi levels of memory and compute). Most companies today use Docker to package and deploy the models, hence the need for a larger-footprint box. Check out AWS Greengrass, Azure Edge, and Foghorn for examples.


Of course it should... :P

Neurons are electrical but mostly chemical in how they work. The average speed of a connection from one neuron is 0.1-0.5 m/s. So if you hit your toe on a chair and you are 1.8 meters tall (and pardon my rough math/science here), it would take almost 1 second for the signal to reach your brain (of course, this is why reflexes are handled close to the spine and not the brain).

And now imagine the complex processing required to see/hear and recognize something. It is done quite quickly, and yet the basic processing unit of the brain is slow. One might think it is the massive parallelism of the brain that makes this possible so quickly, but even then, if you think about it, all that processing done in such a small amount of time cannot be more than a thousand sequential operations...


> The average speed of a connection from one neuron is 0.1-0.5 m/s.

You're about an order of magnitude out there, according to https://en.wikipedia.org/wiki/Nerve_conduction_velocity#Norm...

(And just for a simple cross-check, it doesn't take a second for you to perceive sensations from your toes. Not even close.)


The author has some very good points. Also, modern MCUs like STM32 are powerful enough to run a whole big operating system like Linux while keeping power usage relatively low and being as cheap as 8-bit MCUs, so using them for ML tasks on different devices is a natural step forward.


Which? I'd be hard-pressed to find an MCU that a) can run Linux, unless it's MMU-less Linux (e.g. uClinux) or your definition of MCU includes architectures like Cortex-A with an MMU; b) has the RAM needed to run Linux, unless external SDRAM or similar is provided on the PCB; and c) is as cheap as an 8-bit MCU like an AVR.

If Cortex-A class counts, the iMX6UL from NXP comes to mind for a) and b), but there's no way it also addresses c).


I meant uClinux running on a Cortex-M3/M4, but I really hope to run real Linux on the STM32MP recently added to the Linux kernel - the actual hardware is not released yet, though.


Got it - thanks for the clarification. Any idea on the price point for the STM32MP?


I have no idea and I don't want to speculate; I just hope to be pleasantly surprised with something in the Cortex-M7, sub-€10 range.


What are good STM32 dev kits that can run Linux? Preferably toward the cheaper end of the spectrum, like the Raspberry Pi of STM32s. (Or even other architectures.)



Interesting, thanks!

(Sorry for the late reply)


The larger STM32F4 and F7 parts have external memory controllers for adding DRAM, although it is difficult to get dev boards that have external DRAM.


The article states:

>"This makes deep learning applications well-suited for microcontrollers, especially when eight-bit calculations are used instead of float, since MCUs often already have DSP-like instructions that are a good fit."

Can someone shed some light on what the author means by "DSP-like instructions"? What are the characteristics of DSP instructions? Is there something that makes them unique compared to general-purpose CPUs or GPUs?


Here it means SIMD/vectorized operations. An Arm Cortex-M4F with DSP extensions can do 4 operations on 8-bit data.

For traditional signal processing (FIR/IIR filters), a dedicated opcode for multiply-accumulate is also common, since it is used so much.

Saturated add/subtract is another typical 'DSP' kind of feature.

EDIT: such opcodes exist on many CPUs/GPUs too, but we are talking about sub-milliwatt-capable devices here.
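
To make that concrete, the inner loop of a quantized layer is essentially a multiply-accumulate over 8-bit data with a saturating cast at the end, something like this plain-C sketch (illustrative only; on a Cortex-M4/M7, optimized kernels such as CMSIS-NN map loops like this onto dual-MAC instructions like SMLAD plus the saturating arithmetic mentioned above):

    #include <stdint.h>
    #include <stddef.h>

    /* One output of a quantized fully-connected layer: an int8
     * multiply-accumulate loop followed by a saturating cast back to int8. */
    static int8_t fc_output(const int8_t *input, const int8_t *weights,
                            size_t n, int32_t bias, int shift) {
        int32_t acc = bias;
        for (size_t i = 0; i < n; i++)
            acc += (int32_t)input[i] * (int32_t)weights[i];   /* MAC */
        acc >>= shift;                      /* crude requantization */
        if (acc > 127) acc = 127;           /* saturate to the int8 range */
        if (acc < -128) acc = -128;
        return (int8_t)acc;
    }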


With utensor.ai, you can probably try this out today. We are currently working on integrating CMSIS-NN with uTensor; CMSIS-NN is a library of these MCU SIMD-optimized functions.


Thanks for the detailed response. I appreciate it.


Slightly tangential question: I ride my electric uniwheel on the sidewalks, but sidewalks in my city sometimes have huge potholes, so I have to constantly watch for potholes so I don't trip and lose half of my teeth.

Is it possible for me to mount a camera on my uni that can see potholes 10 feet away and beep my headphones? I am not sure where to even start with this.


As a cyclist, I'd be interested in such a technology too.

Unfortunately, despite the lip service many US cities give to cyclists, when it comes to practical issues like road quality, cities tend not to care. Here in Austin there are quite a few bike lanes/cycletracks that are so bad that I refuse to use them. Usually it's a combination of poor visibility of cyclists in the lane (making being hit by turning drivers more likely) and poor road quality (e.g., chip seal resulting in some of these lanes basically being gravel). I've seen it claimed that the city regularly cleans out this gravel, but I can only recall a few times over the past 5 years when I thought the gravel might have been removed. I don't need machine learning to tell me to avoid these roads, but pothole detection would be helpful.


Start by mounting a camera on your bike. Record for a couple of months and you'd have good enough data to start experimenting with. The next step would be having your friends mount cameras on their bikes.


How would the smaller units handle larger ops like convolutions, RNNs and others? Even assuming custom chips, all the heat that is generated (which GPUs use large fans and heat sinks to dissipate) has to be removed somehow. Won't that be a problem?


I don't know why you were downvoted.

There are no "larger" ops. However, things like RNNs can require more memory to execute because of the longer chain of data they need to execute the operations on.

As noted in the article you can alleviate this by halving the size of the model at the cost of accuracy.

The heat in large GPUs is because of the large number of cores they have operating simultaneously.


Downvotes without explanations?


Wow. Deep learning would certainly not be my technique of choice on constrained architectures, but there are situations where you don't really have viable alternatives right now, so I'm glad to see that's actually doable.


It really depends on what you call "constrained". For about £5 you can get a Linux-powered RISC machine with 512MB of RAM, a GPU, and rich IO capabilities. I have worked on large multi-user environments smaller than that powering dozens of serial terminals on everyone's desktops. That's a lot of compute power.

What I wouldn't like to do is run the training part on such small devices. If there were a good way to do incremental learning after you've trained your model, so it could continuously fine-tune itself using the embedded hardware on a reasonable power budget, I'd go for it.
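
A minimal version of that incremental step, assuming you freeze the pretrained feature extractor and only update a small linear head on-device (a sketch with invented names, not any particular framework's API):

    #include <stddef.h>

    /* One online SGD step on a frozen-backbone + linear-head setup:
     * `features` is the embedding produced by the pretrained network;
     * `weights` and `bias` are the only parameters updated on-device. */
    static void finetune_step(float *weights, float *bias,
                              const float *features, size_t dim,
                              float target, float lr) {
        float pred = *bias;
        for (size_t i = 0; i < dim; i++)
            pred += weights[i] * features[i];
        float err = pred - target;          /* squared-error gradient */
        for (size_t i = 0; i < dim; i++)
            weights[i] -= lr * err * features[i];
        *bias -= lr * err;
    }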

And while you won't run large networks, you can probably get away with many smaller, more specialized ones.


The price of compute is less and less constraining every year. However, if running on battery, the energy budget can be severely constraining.

Also, people just end up wanting to do more. Real-time video at a decent framerate is still challenging for sub-$100 devices. When that's easy, it'll be time for real-time 3D data (LIDAR etc.).


What would be your go-to techniques? Curious, since I'm researching machine learning on microcontrollers.


Decision trees, random forests, logistic regression and most of the boring old statistical classifiers work on anything down to an 8-bit micro with <1k RAM. SVMs are highly effective and don't need much more RAM than that if you're careful.
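
For a sense of scale: a trained decision tree compiles down to nested comparisons, a few bytes of flash, and no RAM beyond the feature values. A hand-written sketch of the idea (thresholds and feature names invented for illustration, but roughly the kind of code a tree compiler would emit):

    #include <stdint.h>

    /* A tiny decision tree classifying an accelerometer window as
     * still (0), walking (1) or running (2).  Each node is one comparison
     * against a learned threshold; the thresholds here are made up. */
    static uint8_t classify_motion(int16_t accel_variance, int16_t accel_mean) {
        if (accel_variance < 120) {
            return 0;                       /* still */
        } else {
            if (accel_mean < 900)
                return 1;                   /* walking */
            else
                return 2;                   /* running */
        }
    }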


Yeah, that was my take as well. I started by implementing random forests; they're really fast and compact even on the smallest of microcontrollers. Will probably add some variant of boosted trees in the future. https://github.com/jonnor/emtrees


I'm not as familiar with the principles, but is there convergence between the principles behind these chips and the neuromorphic chips proposed by Carver Mead?


I remember reading a spec sheet on a chip-based neural network Intel had developed in the '80s. Maybe we are just facing another AI winter.


It's funny how incredibly bad news this is. And it does seem like it's correct.

> For example, the MobileNetV2 image classification network takes 22 million ops ... 110 microwatts, which a coin battery could sustain continuously for nearly a year.

So making a tiny mine that blows up if and only if it sees a particular person (or worse, a particular race or ...) is now theoretically possible and essentially a few hardware revisions away from being doable.


This isn't taking the consumption of the camera into account. But of course there could be a PIR or other motion sensor (months of battery) that would launch the camera on-demand and then evaluate the target.


That's cute. I personally would be more worried about quadrocopter drones strapped with grenades that use face recognition to act autonomously, possibly without GPS to prevent jamming attacks.


The idea is thought-provoking, but would be another useless sink of tax money.

1- If the mine targets people of some race, then it will attack your own soldiers, local allies and spies of the same race.

2- Clothes and makeup are common to all human cultures. After a few strikes, people would learn how to blend into the landscape and avoid being taken for a target.

3- The system would need a sort of eye above the soil, detectable by human eyes and software, or a sort of wifi, detectable with software.

4- This "eye" would be vulnerable to dust, leaves and debris falling over it, something that happens very quickly at soil level in deserts, snowy areas and rainforests.

5- If the mine is inactive until people of some colour appear, your enemies could use a disguise to take it safely and reuse the weapon in their own army.

6- Such mines could be modified to target presidents, military high commands, policemen or politicians, all easily distinguishable by their "feathers": well-known badges, official uniforms... At that point, the project of a mine aimed at VIPs would be closed and deeply buried pretty fast.


Or just mine an entire area of someone else's country and walk away, like we do now.

The problem with most of these ideas is that if you're willing to do it, you probably are willing to just shoot/explode/ethnically cleanse an area anyway.

The question as always is better framed as "what does this enable that they couldn't do before?"


You're missing the most horrible capability of this mine: when it sees its target, it could start chasing it.



