A Look at HP's “The Machine” (lwn.net)
114 points by jsnell on Aug 28, 2015 | 20 comments


To me "The Machine" seems like a vision which is used to explore the possible problems and possible solutions that surface when one has to rethink computing with a really big address space.

If one looks at how current computer technology has developed, which is especially evident in the x86 architecture, semiconductor processing, some popular programming languages, and the popular operating systems, it becomes obvious that we got here by many, many small incremental improvements. Rarely did someone pull a rabbit out of a hat, build a new kind of hardware, write a new kind of software, and address a new kind of market with it.

Why did these incremental improvements work? Because they solved the most pressing problems of the time with solutions that could be tested and made to work properly in the respective market scenario.

That said, there is value in testing new ideas and making revolutionary experiments that never reach the market. The knowledge gained along the way can sometimes be used to solve problems in other existing systems, or to introduce new paradigms that keep current developments from getting stuck.


In the end, I suspect they will have reinvented an AS/400 / IBM i attached via Fibre Channel to a flash array...

Except that was back when they were writing their own OS to run on it. By doing that, the possibility seemed to exist for all kinds of crazy ideas. Once they try to conform it to a POSIX-like environment, I suspect that a lot of the possible hardware advantages will start to evaporate. Either the external RAM will be kept (semi-)consistent (reinventing any number of single-system-image clustering mechanisms), or they will end up putting a filesystem or an MPI-type layer on top of it, thereby reducing its potential advantages.


I think something like this will have to be clean-slate. The trick used by SGI and Cray, though, was to have some nodes run a full OS while the compute nodes ran a tiny OS. Then there were storage or I/O nodes to handle that side of things. Most of the parallelism was in the operation of the compute nodes, the interconnect, and the storage. This worked out pretty well in practice for many MPP-style systems. Today's work is about making denser, more efficient MPPs, so they could use techniques similar to what worked in the past.

The use of ARM processors threw me, though. Usually, problems needing this much memory and I/O take similarly beefy, multi-core, multi-CPU nodes to handle them. Even that wasn't good enough, so many added GPUs and FPGAs to the nodes. So I'm kind of wondering whether their little ARM chips will cut it. Maybe something like Cavium's ThunderX could...


The first official HPQ memristor announcement was apparently in 2008, and at that time there was conjecture it would be used as both analog and digital recording media:

http://bit.ly/1fyBrWi

HPQ said in 2011 that they would have memristor products within 18 months:

http://bit.ly/rtVfno

In 2013, HP's Martin Fink was saying they would have memristor-based HDD replacements by 2018. At that time, they claimed those 'early' memristors would be faster than flash, but not speed-competitive with RAM:

http://bit.ly/1RO05ke

Then in 2014, HP's Martin Fink was telling the media that memristors would be faster than DRAM. He produced a chart showing memristors at 10 ns (no range), while DRAM technology ranged from 10 to 50 ns:

http://bit.ly/1fyBrWk

In October of 2014, HP's Martin Fink was saying they would deliver a 'clean sheet' operating system in 'The Machine' with 157 petabytes of addressable memory and 150 compute nodes by the end of 2016:

http://bit.ly/1fyBtO6

As late as December of last year, they were promising a 'Revolutionary' new operating system in 2015.

http://bit.ly/1fyBrWo

Now there is supposed to be a show-and-tell prototype of 'The Machine' demonstrated sometime in 2016. It will not have memristors, and it will run a slightly modified form of Linux. It is supposed to have 2,500 CPU cores and 320 TBytes of conventional RAM. That is about 500 times less memory than the 157 PBytes they were talking about last year. I am guessing, without evidence, that the processing will be distributed across about 1250 Xeons with 20 cores each, but I really don't know. That amount of conventional memory suggests it won't be at a trade show, but rather in a controlled environment for invited guests, where HP can better control the message and defer all the questions:

http://bit.ly/1fyBtO6
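
For what it's worth, the "about 500 times less" figure checks out as rough arithmetic; this is only a sanity check, with decimal prefixes assumed:

    # Promised addressable memory vs. what the prototype is now said to have.
    promised_bytes = 157e15   # 157 petabytes (decimal prefixes)
    prototype_bytes = 320e12  # 320 terabytes of conventional RAM

    print(promised_bytes / prototype_bytes)  # ~490, i.e. roughly 500x less memory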

Essentially, the point of all this is that HPQ has a long history of failing to deliver on announced products, and/or lowering expectations while pushing out delivery dates. There is no real reason that I can see to believe that HP Enterprise, or whatever it ends up being called, will suddenly establish the follow-through integrity that has clearly been lost at HPQ. Those fast/slow/soon/late/high-yield/low-yield memristors might never be available from any financial corporate descendant of HPQ.

I think there is a fundamental flaw in the message HP is promoting: distributing their processors far apart from one another so that each processor is closer to local memory, while at the same time claiming a seemingly contradictory flat memory model. Most of the tasks one would want to execute on such a machine would certainly be multithreaded, and in many multithreaded tasks, parent threads need to know quickly when daughter threads have completed their work so that they can use those interim results to continue.

The speed of light in glass fiber is typically about 2/3 of that in a vacuum. It is reasonable to assume a typical thread will execute (at maximum efficiency) at more than 6 billion instructions per second in the year 2020. So let's say a daughter thread is running 50 feet away from its parent (not physically, but as the optical cable bends). The daughter could have executed over 450 instructions in the time the "thread complete" message is in transit to the parent (in addition to the overhead of more conventional architectures), and it would then sit through more than 450 additional instruction times waiting for the parent thread to issue something new. Translated: 'The Machine' would lose its speed advantage in any programming environment where per-thread instruction counts are relatively low, such as the majority of in-memory database object manipulation.
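
For what it's worth, here is the back-of-the-envelope arithmetic behind those numbers; the 50-foot cable run and the 6 billion instructions per second are the assumptions stated above, not measured figures:

    # Instructions "lost" while a completion message crosses 50 ft of fiber.
    c_vacuum = 3.0e8            # speed of light in vacuum, m/s
    c_fiber = c_vacuum * 2 / 3  # ~2/3 c in glass fiber, m/s
    distance_m = 50 * 0.3048    # 50 feet of cable, in meters
    ips = 6e9                   # assumed ~6 billion instructions/second per thread

    one_way_s = distance_m / c_fiber
    print(one_way_s * 1e9)      # ~76 ns one way
    print(one_way_s * ips)      # ~457 instructions each way, ~900+ for the round trip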

In contrast to 'The Machine', I think successful architectures will increasingly move their processing cores closer together, more in the direction of the Nvidia Tesla, Intel Phi, and AMD FirePro, to name a few. Speculating further, massive amounts of memory may well be arrayed spherically outward, perhaps on a 3-D radiant structure with cooling fluid (such as helium gas) running within the fingers. If each of these limbs comes with threading, contacts, and a seal system so it can be removed and replaced, then the mean time to repair can be kept low.

My guess is that even though there has not been any disappointing news about 'The Machine' in over a month now, there will be a lot more as the dog-and-pony show planned for late next year approaches, and more still before 2020.


Minor nitpick: given their huge focus on memory size, they also have to care a lot about memory bandwidth. That means they want Xeons with fewer cores per CPU, since each CPU (socket) has just 4 memory channels. Furthermore, if they can get away with just 4 DIMMs per CPU (one per channel), they avoid the ~20% memory throughput penalty that comes with populating two DIMMs per channel.

Crunching the numbers, the configuration that makes the most sense is 128 GB per socket with 10-core CPUs, giving 2,500 CPUs, probably in dual-socket nodes, so 1,250 nodes. (I'm assuming the "2,500 cores" above is a typo for 2,500 CPUs.) Each CPU would then talk to 4x 32 GB DDR4 DIMMs, which are indeed available. This is also just speculation, but I couldn't see any other option giving performance as good.
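
The arithmetic behind that guess, as a sketch (all of it speculation, and ignoring decimal/binary prefix differences):

    # Speculative breakdown of 320 TB across sockets, assuming 4 channels x 1 DIMM per socket.
    total_mem_gb = 320 * 1000     # 320 TB
    dimm_gb = 32                  # 32 GB DDR4 DIMMs
    dimms_per_socket = 4          # one DIMM per memory channel
    mem_per_socket_gb = dimm_gb * dimms_per_socket   # 128 GB per socket

    sockets = total_mem_gb / mem_per_socket_gb       # 2500 sockets
    nodes = sockets / 2                              # 1250 dual-socket nodes
    cores = sockets * 10                             # 25,000 cores with 10-core parts,
                                                     # which is why "2,500 cores" reads as 2,500 CPUs
    print(sockets, nodes, cores)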

Assuming this speculation is correct, it's hard to see how this is much different from existing HPC hardware tailored to high-memory applications.

As for your musings on interconnect latency: I believe you are correct that you can't really run with the multithreaded paradigm once you start computing at significant physical scales. Perhaps they want to move to something MPI-like for their applications?


It would be deliciously ironic if the final shipping machine used Itanium processors.


...And it would be nice if HP finally hit some kind of payoff for investing so much in that doomed processor.


Maybe by the time this is ready for production, the Mill will also be!


Great post dissecting the (likely) grossly exaggerated promises made by HP about a technology there is little evidence of!

The only thing to improve would be to include a numbered list of the full URLs at the end, instead of (or in addition to) relying on an external URL-shortening service.


So each CPU has its own 256GB of local memory, with a many-TB shared pool that any of them can load/store into, using a new interconnect.

How is this architecturally any different, at a bird's-eye level, from current builds where compute nodes talk to a shared RAM-backed datastore? Is it fundamentally different to put it directly into the address space, with the collision problems that presents?
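
To make the distinction in the question concrete, here is a minimal sketch of the two programming models being compared; the device path and the client API are purely hypothetical stand-ins, not anything HP has published:

    import mmap, struct

    # Model A: the shared pool is mapped straight into the address space,
    # so updates are plain loads/stores (via a hypothetical memory-mappable device).
    with open("/dev/shared_pool0", "r+b") as f:           # hypothetical device node
        pool = mmap.mmap(f.fileno(), 1 << 30)             # map 1 GiB of the pool
        struct.pack_into("<Q", pool, 0x1000, 42)          # a store: 64-bit write at offset 0x1000
        (value,) = struct.unpack_from("<Q", pool, 0x1000) # a load: read it back

    # Model B: a conventional shared datastore reached over the network,
    # where every access is an explicit request/response (hypothetical client API).
    # client = DatastoreClient("ram-backed-store:6379")
    # client.put("key-0x1000", 42)
    # value = client.get("key-0x1000")

The collision problem mentioned above shows up in model A: two CPUs doing plain stores to the same offset have no request/response boundary at which to synchronize, whereas model B at least funnels everything through the datastore.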


"to allow addressing up to 32 zettabytes (ZB)" -- I think that's the first time I've seen ZB in print.

Wikipedia link: "1 ZB = 10007bytes = 1021bytes = 1000000000000000000000bytes = 1000exabytes = 1billionterabytes = 1trilliongigabytes."


ZFS was originally named the Zettabyte File System. Not sure why Sun changed it. The acronym doesn't mean anything now.


Sounds like they are reinventing z/Architecture.


>There will not be a single OS (or even distribution or kernel) running on a given instance of the The Machine—it is intended to support multiple different environments.

What does this mean? Doesn't it need some kind of base system so it knows what to do when someone loads a new OS?


This sounds just like virtualization, where some hypervisor exists to guard hardware access. That could probably be implemented in hardware, or run on a separate processor. So calling it an OS is a bit of an exaggeration.


Am I the only person who thinks this looks sort of like the ETA-10 (circa 1987 or so)? The difference being that it is load/store rather than bcopy, but if I read it correctly, they both had the coherency problem.


I'm not trusting this one bit. It sounds like many descriptions of developments that later ended in bankruptcy or acquisition. Just give me a bunch of nodes with Octeon IIIs, TOMIs, and/or Achronix FPGAs in an SGI UV-style system (especially the interconnect). That should meet almost any need I have for quite a while. Hopefully the exascale designs will be worked out by the time that's no longer true. :)

Gotta leverage what we have a bit better, even in designs pushing the envelope. We can always add better tech as it gets proven.


I remember someone telling me they had already built all this decades ago, and were bringing people interested in computer science to see the system, perhaps to see if they could have any input.

Most who were competent and not fooled by the attempted facade of this world would simply show up, kick it (manually boot it), and leave laughing at the stupidity...

That was decades ago...


It sounds like you might have a good story here, but there isn't enough information for the reader to tell. Your comment would be more informative if you added the details, which people here would likely be interested in, and took out the insult.


It's hard to tell, but it sounds to me like maybe the grandparent commenter is suffering from paranoid delusions, like the Time Cube guy, and thinks we're all "educated stupid", and that the only reason computers are still getting faster gradually is some kind of conspiracy among computer companies. But it's not a whole lot of text to base a full diagnosis on.



