A team from Davis University, California, has designed a processor with 1000* cores, boasting a throughput rate of 1.78 trillion instructions per second and containing 621 million transistors.

As opposed to a number of other attempts, some reaching 300 or so processors, the KiloCore chip has been fabricated and run; it was built by IBM (who else) using its 32-nm PD-SOI CMOS technology (what else).

The basic architecture used is MIMD (multiple instruction/multiple data) and each of the seven-stage-pipelined cores has a 72-instruction set, single instruction/cycle. None of the instructions is ‘algorithm-specific’ – setting the KiloCore apart from GPU-class devices. The terrific throughput is achieved at a clock speed of a mere 1.78 GHz, at 1.1 V. Running at 0.84 V and 1 GHz the beast consumes 13.1 W, while peak power efficiency of 5.8 pJ/Op is quoted at 0.56 V and 115 MHz.

Each core is independently powered and can shut down to leakage-only power if it has no task to perform. Rather than a cache architecture, every processor can store instructions and data in a hierarchy of locations; local memory, one or more nearby processors, on-chip independent memory modules, or off-chip memory.

The ‘wormhole’ routing employed implies, among others, that messages from an adjacent or nearby core will be routed via the ‘circuit’ network; those from further away in the processor matrix will travel via the packet network. If that’s a veritable can of worms to programmers remains to be seen. Each core has north-south-east-west comms buffers plus a fifth channel for host-processor traffic; maximum throughput is 45.5 Gbps per router and 9.1 Gbps per port at 1.1 V.

* as a niggling detail, K in my computerized editor's dictionary is for kilo = 1024. Sure, k is also for kilo, but meaning 1000 in old money, like in kHz.