Processing-in-Memory

Demand for compute power is ever-increasing to empower emerging applications. Take machine learning, where the computing power required has been doubling roughly every 100 days over the last decade [1]. Silicon-based hardware, on the other hand, is not getting faster at a similar rate. Intel, one of the leading microprocessor manufacturers, only halved the size of its transistors, from 22 nm to 10 nm, over the nine years from 2011 to 2020 [2].

Vertical scaling, i.e., increasing the performance of a single machine, has stalled. Horizontal scaling, i.e., using multiple machines to meet computing requirements, is increasingly utilized. Using many distributed compute units certainly gets the job done, but at the expense of greater system complexity and high energy consumption. For example, the machine learning workloads that operate a self-driving car require a very large amount of power, on the order of 2,500 watts [3].

Stalled miniaturization and rising energy consumption make the search for silicon alternatives alluring. Carbon-based transistors and quantum computing are seen as the rescuers, but the carbon and quantum leap is, perhaps, many years away. In the near term, analyzing current computing architectures and eliminating performance bottlenecks for specific applications seems a more plausible path.

How Does Processing in a CPU Work?

Contemporary computers are based on the Von Neumann architecture, shown in the figure below. In the Von Neumann architecture, the instruction cycle consists of three phases. First, the instruction to be executed is fetched from the random access memory (RAM) into the CPU registers. Second, the instruction is decoded to locate its operands, and the operands are fetched from the RAM into the CPU registers. This is followed by executing the instruction in the CPU and writing its result back to the RAM. Read and write operations to the RAM are slow compared to on-chip memory. Therefore, a small amount of expensive and fast on-chip cache memory is used to temporarily store frequently used data. It turns out that for data-intensive workloads, such as machine learning, caching is not very effective, and a large number of RAM read-write operations are performed during such workloads.
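The instruction cycle above can be sketched with a toy Von Neumann machine, where instructions and data share one memory and every cycle is fetch, decode, execute, write back. The instruction names and encoding here are invented purely for illustration.

```python
# A toy Von Neumann machine: instructions and data share one memory.
# Each cycle: fetch an instruction, decode it, execute in registers,
# and write results back to RAM. Opcodes are illustrative, not real ISA.

memory = {
    0: ("LOAD", "r0", 100),    # fetch operand from RAM address 100
    1: ("LOAD", "r1", 101),
    2: ("ADD", "r0", "r1"),    # execute in the CPU registers
    3: ("STORE", "r0", 102),   # write the result back to RAM
    4: ("HALT",),
    100: 6, 101: 7,            # data lives in the same memory
}

registers, pc = {}, 0
while True:
    instr = memory[pc]                               # 1. fetch
    pc += 1
    op = instr[0]                                    # 2. decode
    if op == "LOAD":
        registers[instr[1]] = memory[instr[2]]       # operand fetch from RAM
    elif op == "ADD":
        registers[instr[1]] += registers[instr[2]]   # 3. execute
    elif op == "STORE":
        memory[instr[2]] = registers[instr[1]]       # write back to RAM
    elif op == "HALT":
        break

print(memory[102])  # -> 13
```

Note how even this three-instruction program makes four round trips to RAM (two operand loads, one instruction fetch per step, one store); data-intensive workloads multiply this traffic enormously.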

Von Neumann Architecture Image Source: William Lau, CC BY-SA 4.0, via Wikimedia Commons

Apart from the latency issues associated with off-chip memory such as static and dynamic RAM, memory operations overwhelmingly dominate the overall energy consumption of the system. As shown in the figure below, reading data from DRAM consumes at least three orders of magnitude more energy than an addition operation.
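A back-of-envelope calculation makes the gap concrete. The per-operation energies below are rough 45 nm estimates commonly cited in the computer architecture literature; they are assumptions for illustration, not values taken from the figure, and they vary by process node.

```python
# Rough per-operation energy figures (45 nm class, picojoules).
# These are illustrative estimates, not measurements from this article.
add_32bit_pj = 0.1     # 32-bit integer addition
dram_read_pj = 640.0   # 32-bit DRAM read

ratio = dram_read_pj / add_32bit_pj
print(f"A DRAM read costs roughly {ratio:.0f}x a 32-bit addition")
```

With these estimates the DRAM read is several thousand times more expensive than the addition it feeds, which is why moving data, not computing on it, dominates the energy budget.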

Image Source: How to Evaluate Efficient Deep Neural Network Approaches

Processing-in-Memory

If bringing data to the compute units is so expensive, then it is reasonable to wonder whether we could bring the compute to the data. Processing-in-memory (PIM) is a paradigm that aims to do exactly this: in PIM, the compute is brought to the data rather than the data to the compute. This is akin to edge intelligence, where the intelligence is brought to the data rather than bringing data to the intelligence. Data rules! There are two variants of PIM: processing-using-memory and processing-near-memory.

Processing-using-memory relies on the material properties of the cells comprising the memory. For example, the ubiquitous multiply-and-accumulate (MAC) operation could be computed using Ohm's law (Ii = ViGi) and Kirchhoff's current law (I = I1 + I2 + …), where the voltages Vi act as inputs, the currents as outputs, and the conductances Gi as weights. Processing-using-memory is still in the experimental phase and suffers from problems associated with the non-idealities of analog processing.
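The analog MAC can be sketched numerically: each cell's conductance stores a weight, an input voltage is applied to each cell, and the currents summing on the shared bitline yield the dot product. The values below are toy numbers, not real device parameters.

```python
# Simulated analog in-memory MAC.
# Ohm's law per cell: I_i = V_i * G_i
# Kirchhoff's current law on the bitline: I = I_1 + I_2 + ...
voltages = [0.2, 0.5, 0.1]        # inputs V_i (volts, illustrative)
conductances = [10.0, 4.0, 25.0]  # weights G_i (toy conductance values)

currents = [v * g for v, g in zip(voltages, conductances)]
mac_result = sum(currents)        # the bitline current is the dot product

print(mac_result)  # -> 6.5  (0.2*10 + 0.5*4 + 0.1*25)
```

In a real crossbar the same physics computes the full matrix-vector product in one step, but device noise, drift, and limited precision (the non-idealities mentioned above) corrupt the analog result.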

Processing-near-memory utilizes logic layers on the memory die or integrates logic in the memory controller. It is enabled by recent advances in 3D-stacked memories, such as the commercially used High Bandwidth Memory (HBM) [4] and Hybrid Memory Cube (HMC) [5]. In these 3D-stacked memories, multiple layers of DRAM are stacked, with wide-bandwidth channels realized using Through-Silicon Vias (TSVs), as shown in the figure below. The bottom layer of these memories is a logic layer, which has access to the data in the DRAM layers through the TSVs. Processing-near-memory utilizes this logic layer to perform computations on the data in the DRAM, passing only the processed data to the CPU over the narrow-bandwidth memory channel.
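A toy model shows why this helps: when a reduction is offloaded to the logic layer, only the result crosses the narrow CPU-memory channel instead of the full array. The sizes below are illustrative assumptions.

```python
# Toy data-movement model for near-memory processing.
# Task: sum an array of 1M 32-bit values stored in stacked DRAM.
n_elements = 1_000_000
bytes_per_element = 4

# CPU-only: the whole array travels over the memory channel to the CPU.
bytes_moved_cpu = n_elements * bytes_per_element

# Near-memory: the logic layer sums the array via wide TSVs; only the
# 4-byte result crosses the narrow external channel.
bytes_moved_pim = bytes_per_element

print(bytes_moved_cpu // bytes_moved_pim)  # -> 1000000
```

The model ignores the energy and time of the in-stack computation itself, but for simple reductions and filters the channel-traffic savings dominate.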

Image Source: Enabling the Adoption of Processing-in-Memory: Challenges, Mechanisms, Future Research Directions

Processing-in-Memory in Action

As an example of the use of PIM in accelerating computing, we will look at the inference acceleration of a convolutional neural network (CNN) in TensorFlow Lite [6]. TensorFlow Lite is a lightweight framework designed to run TensorFlow models on resource-limited devices, such as mobile and IoT hardware, for inference at the edge. A CNN consists of several layers; the input is fed to the first layer, the output of the first layer goes as input to the next layer, and so on. Each layer performs many MAC operations on its input. To reduce complexity on edge devices, TF Lite quantizes the inputs to each layer to 8-bit integers. This quantization is performed by reading a matrix of the input from memory, quantizing it, and then writing the 8-bit integer matrix back to memory, as shown on the left of the following figure. This results in a large amount of data movement.

Image Source: Google workloads for consumer devices: mitigating data movement bottlenecks

The MAC operations involved in each CNN layer are too complex to fit in the logic layer of the memory; the quantization, however, is a simpler operation and can be implemented in memory to avoid the high data movement between the CPU and the memory. This is shown on the right of the figure above, where the quantization is performed in memory and only the 8-bit matrix is transferred to the CPU, where it is convolved with the weights to produce the 32-bit matrix. This 32-bit matrix is again quantized in memory to generate the 8-bit input for the next layer. Such a combination of CPU and PIM acceleration yields up to a 2x speedup while using only half the energy of a CPU-only implementation.
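The CPU+PIM split can be sketched as a data-flow model: the MAC-heavy convolution stays on the CPU, while the simpler re-quantization runs in the memory's logic layer, so only 8-bit data crosses the memory channel. The function names and the simplified "convolution" below are illustrative; this models only the division of labor, not real kernels.

```python
# Sketch of the CPU+PIM split for one CNN layer.
def conv_on_cpu(acts_u8, weight):
    # stand-in for the MAC-heavy convolution: 32-bit accumulation on the CPU
    return [x * weight for x in acts_u8]

def quantize_in_memory(acts_i32):
    # performed by the DRAM logic layer in a PIM system:
    # rescale 32-bit results into the 8-bit range
    peak = max(acts_i32) or 1
    return [x * 255 // peak for x in acts_i32]

layer_input = [3, 7, 12]             # 8-bit activations already in DRAM
out32 = conv_on_cpu(layer_input, 9)  # 32-bit results produced on the CPU...
out8 = quantize_in_memory(out32)     # ...re-quantized near the data for the next layer

print(out8)  # -> [63, 148, 255]
```

Only `layer_input` and `out8` would cross the memory channel in the PIM version; the 32-bit-to-8-bit shrink happens beside the DRAM cells.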

Limitations

While processing-in-memory seems a promising technology that could be really useful for data-intensive applications, several limitations need to be overcome. The first is the limitation of the underlying technology: the logic layer in 3D-stacked DRAM is fairly small and thus allows only very simple logic to be implemented. Increasing the size of the logic layer might be challenging due to heat dissipation issues, as DRAM must operate within certain temperature limits. Furthermore, the logic layer can access data efficiently only from the cells directly connected to it through the TSVs; non-local in-memory operations might therefore not be as efficient as local ones.

Identifying the parts of an application that could benefit from PIM is expensive, as programmer effort is required to partition an application into such parts. The availability of SDKs and compilers that automatically utilize PIM will be crucial to the success of the technology. Steps in this direction are already underway, with some commercially available solutions such as the UPMEM hardware, which is programmable using the C language [7].

References

  1. AI and Compute
  2. Process Technology History – Intel
  3. Self-Driving Cars Use Crazy Amounts of Power, and It’s Becoming a Problem
  4. High Bandwidth Memory
  5. Hybrid Memory Cube
  6. Google workloads for consumer devices: mitigating data movement bottlenecks
  7. UPMEM Technology
