Ceva adds heterogeneous AI acceleration to third-generation engine


Ceva has revamped its NeuPro AI accelerator engine IP, adding specialized co-processors for Winograd transforms and sparse operations, as well as a general-purpose vector processing unit alongside the engine’s MAC array. The new-generation engine, the NeuPro-M, can increase performance 5-15X (depending on the exact workload) compared to Ceva’s second-generation NeuPro-S core (released September 2019). For example, ResNet-50 performance improved 4.9X without the specialized co-processors, boosted to 14.3X with them, according to Ceva. Results for Yolo-v3 showed similar speedups. Power efficiency of the core is expected to be 24 TOPS/Watt at 1.25 GHz operation.

Performance of Ceva’s NeuPro-M (NPM) engine versus Ceva’s previous generation engine, the NeuPro-S (Source: Ceva)

Inside the NeuPro-M core

Ceva’s NeuPro-M architecture includes a shared local memory for the various accelerators in the engine (Source: Ceva)

The NeuPro-M engine architecture allows for parallel processing on two levels – between the engines (if multiple engines are used) and within the engines themselves. The main MAC array has 4,000 MACs capable of mixed-precision operation (2-16 bits). Alongside this, there are new, specialized co-processors for certain AI tasks. Local memory in each engine breaks the dependence on the shared core memory and on external DDR; the co-processors in each engine can work in parallel on the same memory, and can also transfer data directly from one to the other (without going through the memory). The size of this local memory can be configured based on network size, input image size, the number of engines in the design, and the customer’s DDR latency and bandwidth.

One of the specialized co-processors is a Winograd transform accelerator (the Winograd transform approximates convolution operations using less computation). Ceva has structured this to accelerate 3 × 3 convolutions – the most common in today’s neural networks. Ceva’s Winograd transform can roughly double the performance of 8-bit 3 × 3 convolution layers with only a 0.5% reduction in prediction accuracy (using the Winograd algorithm out of the box, without retraining). It can also be used with 4-, 12- and 16-bit data types. The gains are more pronounced for networks with many 3 × 3 convolutions (see the performance graph above for ResNet-50 vs. Yolo-v3).
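To illustrate why Winograd saves computation, here is a minimal sketch (not Ceva’s implementation) of the 1-D F(2,3) Winograd transform, which computes two outputs of a 3-tap correlation with 4 multiplications instead of 6. The 2-D F(2×2, 3×3) variant applied to 3 × 3 convolution layers nests the same idea along both axes.

```python
def winograd_f23(d, g):
    """Two outputs of sliding 3-tap filter g over 4-sample input d,
    using 4 multiplications (the transform adds/subtracts are cheap)."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct(d, g):
    """Reference: direct sliding-window correlation (6 multiplications)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
```

In exact arithmetic the two agree; in fixed-point hardware the transform’s intermediate values need extra dynamic range, which is the source of the small accuracy loss mentioned above.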

Ceva’s unstructured sparsity engine can take advantage of zeros present in neural network weights and data, although it works particularly well if the network is pre-trained using Ceva’s tools to promote sparsity. Gains of 3.5X can be achieved under certain conditions. Unstructured sparsity techniques also help maintain prediction accuracy relative to structured sparsity schemes.

Tool chain

The Ceva Deep Neural Network (CDNN) compiler and toolkit enable hardware-aware training. A system architecture planning tool configures parameters such as in-engine memory size and optimizes the number of NeuPro-M engines required for the application. CDNN’s compiler supports asymmetric quantization. Overall, Ceva’s stack can support neural networks of all types with many hundreds of layers. CDNN-Invite offers the ability to connect customers’ own custom accelerator IP into the design. Networks or network layers can be kept private from CDNN if required.
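For readers unfamiliar with the term, asymmetric (affine) quantization maps a real-valued range [rmin, rmax] onto integers via a scale and a zero-point, so that zero is represented exactly even when the range is not centered on it. The sketch below shows the standard scheme for 8-bit inference; CDNN’s exact formulation is not public, so treat this as an illustrative assumption.

```python
def quant_params(rmin, rmax, num_bits=8):
    """Scale and zero-point mapping [rmin, rmax] onto [0, 2^bits - 1]."""
    qmax = 2 ** num_bits - 1
    scale = (rmax - rmin) / qmax
    zero_point = round(-rmin / scale)  # integer that represents real 0.0
    return scale, zero_point

def quantize(x, scale, zero_point, num_bits=8):
    q = round(x / scale) + zero_point
    return max(0, min(2 ** num_bits - 1, q))  # clamp to the integer range

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale
```

The asymmetry matters for activations like ReLU outputs, whose ranges are one-sided; a symmetric scheme would waste half the integer codes on values that never occur.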

Security and safety

The customer’s neural network models can be closely guarded IP, so there is a need to keep weights and data secure. The NeuPro-M architecture supports secure access in the form of an optional root of trust, authentication, and cryptographic accelerators. NeuPro-M’s security IP comes from Intrinsix, a company Ceva acquired in May 2021 that develops chiplets and secure processors for aerospace and defense customers, including DARPA. Crucially, it is applicable to both SoC-level and die-to-die security.

For the automotive market, the NeuPro-M cores together with Ceva’s CDNN compilers and toolkits comply with the ISO 26262 ASIL-B standard and meet the IATF 16949 and Automotive SPICE (A-SPICE) quality assurance standards.


Two preconfigured cores are available now: the NPM11, with a single NeuPro-M engine, which can achieve up to 20 TOPS at 1.25 GHz, and the NPM18, with eight NeuPro-M engines, which can achieve up to 160 TOPS at 1.25 GHz.
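As a rough sanity check on these figures (my own back-of-envelope arithmetic, not Ceva’s stated accounting): with the usual convention that one MAC counts as two operations, a 4,000-MAC engine at 1.25 GHz delivers 10 TOPS of raw MAC throughput, so the 20 TOPS quoted per engine presumably also credits co-processor gains such as the roughly 2X Winograd speedup.

```python
# Back-of-envelope throughput check.
# Assumption: 1 MAC = 2 ops (multiply + accumulate), the common TOPS convention.
macs_per_engine = 4000
clock_hz = 1.25e9
raw_tops = macs_per_engine * 2 * clock_hz / 1e12  # raw MAC throughput per engine

# The quoted 20 TOPS (NPM11, 1 engine) and 160 TOPS (NPM18, 8 engines) are
# 2x this raw rate, which would be consistent with counting the ~2x Winograd
# speedup on 3x3 convolutions -- though Ceva does not spell this out.
npm11_ratio = 20 / raw_tops
npm18_ratio = 160 / (8 * raw_tops)
```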
