Neural networks are a sizable area for MAC-based hardware acceleration research. Widely employed in machine learning, neural networks abstract the neuron network of the human brain from an information-processing perspective and build various models that form different networks according to different connections[45-48]. Deeper and more complex neural networks are needed to enhance self-learning and data-processing capabilities, and neural networks are becoming more intelligent, moving from supervised to unsupervised learning, from image processing to dynamic time-series information processing, etc. Importantly, the MAC operation is always one of the most frequent computing primitives in the various neural network models. In published tools and methods for evaluating and comparing deep learning chips, such as Eyeriss's benchmarking, Baidu DeepBench, and Fathom, MAC/s and MAC/s/W are important indexes for measuring overall computing performance. Thus, highly efficient MAC operation is a major basis for the hardware acceleration of neural networks. Given the huge potential of parallel MAC computing in memristive arrays, memristive neural networks have developed rapidly.
Artificial neural network (ANN)
The fully connected multi-layer perceptron (MLP) is one of the most basic artificial neural networks (ANNs), without a biological justification. In addition to the input and output layers, it can have multiple hidden layers. The simplest two-layer MLP contains only one hidden layer and is capable of solving nonlinear function approximation problems, as shown in Fig. 3(a). For memristive neural networks, the key is the hardware mapping of the weight matrices onto the memristive array, as shown in Fig. 4(b), so that the large amount of MAC calculation can be executed in an efficient parallel manner for acceleration. Typically, a weight with a positive or negative value requires a differential connection of two memristive devices: W = G+ – G–, which means two memristive arrays are needed to load one weight matrix.
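The differential scheme W = G+ – G– can be sketched in a few lines of NumPy. Everything here is illustrative: the function names, the linear conductance mapping, and the bounds `g_min`/`g_max` are assumptions rather than a published design, and ideal, noise-free devices are assumed.

```python
import numpy as np

def map_weights(W, g_min=1e-6, g_max=1e-4):
    """Map a signed weight matrix onto two conductance arrays (W = G+ - G-).

    Hypothetical linear mapping, assuming ideal analog devices whose
    conductance can be set anywhere in [g_min, g_max] siemens.
    """
    scale = (g_max - g_min) / np.max(np.abs(W))
    G_pos = np.where(W > 0, g_min + W * scale, g_min)  # positive parts
    G_neg = np.where(W < 0, g_min - W * scale, g_min)  # negative parts
    return G_pos, G_neg, scale

def vmm(G_pos, G_neg, v_in, scale):
    """One analog VMM step: differential column currents recover v @ W."""
    i_out = v_in @ G_pos - v_in @ G_neg  # Kirchhoff current summation
    return i_out / scale                 # rescale currents to weight units

W = np.array([[0.5, -0.2], [-0.8, 0.3]])
G_pos, G_neg, s = map_weights(W)
v = np.array([1.0, 0.5])
# For an ideal mapping the differential read-out is exact.
assert np.allclose(vmm(G_pos, G_neg, v, s), v @ W)
```

Because G+ – G– equals the scaled weight exactly in this idealized model, the two column currents subtract to the signed MAC result; on real arrays, programming noise and conductance drift perturb this identity.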
Thanks to the capability of the memristive array to perform VMM operations in both forward and backward directions, it can naturally implement an on-chip error-backpropagation (BP) algorithm, the most successful learning algorithm. The forward pattern information and the backward error signal can both be encoded as voltage signals applied to the array, exploiting the MAC computing advantage in both the inference and update phases of the neural network algorithm.
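The forward/backward symmetry can be illustrated with a toy example (all names and sizes are hypothetical, and an ideal array with no IR drop or noise is assumed): driving the rows computes one matrix product, while driving the columns of the same physical array computes the product with the transpose, which is exactly what BP needs.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(4, 3))  # hypothetical conductance matrix

v_fwd = rng.uniform(0, 0.2, size=4)       # input voltages on the rows
i_cols = v_fwd @ G                        # forward VMM: currents at columns

err = rng.normal(size=3)                  # error signal applied to columns
i_rows = G @ err                          # backward VMM: currents at rows

# The backward pass is exactly multiplication by the transpose,
# which is what the BP algorithm needs to propagate errors.
assert np.allclose(i_rows, err @ G.T)
```

No second copy of the weights is needed: the same conductance matrix serves both the inference and error-propagation directions.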
In the early stages of research, many works were devoted to improving the performances of memristive devices[29, 52-57], exploring the dependence of network performance on different device properties[58-62], etc. As a result, several consensuses have also been reached on memristive ANN application:
(1) Regarding the multi-level analog property of memristors, 5–6 bits are generally required for a basic full-precision multi-layer perceptron[63-65]. However, with quantization-based algorithm optimization, the strict requirement on weight precision is relaxed (4 bits or less, except for binary or ternary neural networks)[66-68]. Hence, rather than pursuing continuous tuning of the device conductance, stable and distinguishable conductance states are more important for hardware implementations of memristive ANNs. Moreover, reducing the lower bound of the memristor conductance is important for peripheral circuit design and overall system power consumption, while still ensuring a sufficient dynamic conductance window.
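The relaxed precision requirement can be illustrated by uniform weight quantization. This is a minimal sketch assuming a symmetric weight range and ideal programming; the helper name and level scheme are illustrative, not taken from the cited works.

```python
import numpy as np

def quantize_weights(W, n_bits=4):
    """Uniformly quantize weights to about 2**n_bits discrete levels.

    Hypothetical helper showing why 4-bit (16-state) devices can suffice
    once the network is trained with quantization in mind.
    """
    levels = 2 ** n_bits
    w_max = np.max(np.abs(W))
    step = 2 * w_max / (levels - 1)  # symmetric range [-w_max, w_max]
    return np.round(W / step) * step

rng = np.random.default_rng(1)
W = rng.normal(scale=0.5, size=(8, 8))
W_q = quantize_weights(W, n_bits=4)
# Each quantized weight sits within half a step of the original value.
step = 2 * np.max(np.abs(W)) / 15
assert np.max(np.abs(W - W_q)) <= step / 2 + 1e-12
```

On hardware, each discrete level would map to one stable, distinguishable conductance state, which is why state stability matters more than continuous tunability.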
(2) The linearity and symmetry of the bidirectional conductance tuning behavior are indeed important, both in terms of network performance and peripheral circuit friendliness. Due to the existence of device imperfections, such as read/write noises, uncontrollable dynamic conductance range, poor retention, and low array yield, the analog conductance tuning behaviors still need to be improved for better reliability. For memristor-based neural network inference engines, the accurate write-in method and the retention property of multi-level states become significant.
(3) A simple crossbar array can cause many practical problems, including IR drop, leakage current, etc. These cannot be ignored in hardware design, especially the voltage sensing errors caused by IR drop.
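A strongly simplified ladder-network model conveys why IR drop matters. This sketch assumes a single row with ideal, virtually grounded columns; the function name and all parameter values are illustrative.

```python
import numpy as np

def row_voltages(v_in, g_dev, g_wire=1.0):
    """Solve node voltages along one crossbar row with finite wire resistance.

    Simplified ladder model (a sketch): the source drives node 0 through one
    wire segment, adjacent nodes are linked by wire conductance g_wire, and
    each node leaks to a virtually grounded column through g_dev[k].
    """
    n = len(g_dev)
    A = np.zeros((n, n))
    b = np.zeros(n)
    for k in range(n):
        A[k, k] += g_dev[k]                 # device branch to ground
        if k > 0:                           # wire branch between k-1 and k
            A[k, k] += g_wire
            A[k, k - 1] -= g_wire
            A[k - 1, k - 1] += g_wire
            A[k - 1, k] -= g_wire
    A[0, 0] += g_wire                       # source branch into node 0
    b[0] = g_wire * v_in
    return np.linalg.solve(A, b)            # nodal analysis

g_dev = np.full(32, 1e-4)                   # 32 identical cells on the row
v = row_voltages(0.2, g_dev, g_wire=1.0)    # 1 ohm per wire segment
# Cells far from the driver see less than the applied 0.2 V,
# so their MAC contribution is systematically underestimated.
assert v[0] > v[-1]
assert np.all(v < 0.2)
```

The voltage error grows with array size and with device conductance, which is one reason larger arrays and low-resistance states aggravate sensing errors.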
Until recently, there have been many breakthroughs in the on-chip hardware implementation of memristive ANNs. As shown in Figs. 5(a)–5(c), Bayat et al. demonstrated a mixed-signal integrated hardware chip for a one-hidden-layer perceptron classifier with a passive 0T1R 20 × 20 memristive crossbar array. The memristors in the array showed relatively low variations of I–V characteristics, as measured by the statistics of the SET and RESET thresholds, and the I–V nonlinearity provided sufficient selector functionality to limit leakage currents in the crossbar circuit. Equally important, pulse-width coding was another strategy used in this work to ensure accurate read-out and weak sneak paths. Off-chip and on-chip training of the memristive ANN were performed on simple pixel images. This work demonstrates the excellent fabrication technology of the memristive array and the great potential of memristive ANN on-chip implementation. It is worth noting that the I–V nonlinearity of a passive memristive array, while helping to cut sneak paths, also affects the accurate linear read-out of the devices, which requires a trade-off.
A memristive ANN chip for face recognition classification was also presented by Yao et al. As shown in Figs. 5(d) and 5(e), the chip consisted of 1024 1T1R cells arranged in 128 rows and 8 columns and demonstrated 88.08% learning accuracy on grey-scale face images from the Yale Face Database. The transistor of the 1T1R cell facilitates hardware implementation by acting as a selector, while also providing an efficient control line that allows precise tuning of the memristors. Compared with an Intel Xeon Phi processor, apart from the high recognition accuracy, this memristive ANN chip with analog weights consumed 1000 times less energy, which strongly exhibited the potential of memristive ANNs to run complex tasks with high efficiency. However, for complex applications, the coding of input information becomes an issue that cannot be ignored. The pulse-width coding used in this work is clearly not a good strategy, as it can cause serious delays and burden the peripheral circuitry. The commonly used pulse-amplitude coding, on the other hand, imposes stringent requirements on the linear conductance range of the devices[56, 72]. Recently, the same group further attempted to address two considerable challenges posed by the memristive array: the IR drop, which decreases the computing accuracy and further limits the parallelism, and the inefficiency due to the power overhead of the A/D and D/A converters. By designing a sign-weighted 2T2R array and a low-power interface with a resolution-adjustable LPAR-ADC, an integrated chip with 158.8 kb of 2-bit memristors, as shown in Fig. 5(f), was implemented, which demonstrated a fully connected MLP model for MNIST recognition with high recognition accuracy (94.4%), high inference speed (77 μs/image), and 78.4 TOPS/W peak energy efficiency.
Taking the functional completeness of memristive ANN chips into account, a fully integrated, functional, reprogrammable memristor chip was proposed, including a passive memristor crossbar array directly integrated with all the necessary interface circuitry, digital buses, and an OpenRISC processor. Thanks to the re-programmability of the memristor crossbar and the integrated complementary metal–oxide–semiconductor (CMOS) circuitry, the system was highly flexible and could be programmed to implement different computing models and network structures, as shown in Fig. 6, including a perceptron network, a sparse coding algorithm, and a bilayer PCA system with an unsupervised feature extraction layer and a supervised classification layer. This flexibility allows the prototypes to be scaled to larger systems and potentially offers efficient hardware solutions for different network sizes and applications.
Overall, from device array fabrication, core architecture design, and peripheral circuit solutions to whole-system functionality, the development of memristive ANN chips is maturing. Owing to the summation property of neural networks, non-ideal factors such as the unmitigated intrinsic noise of memristor arrays will not completely constrain the development of memristive ANN chips, which suggests the adaptability of memristors to low-precision computing tasks. Based on the non-volatility and natural MAC parallelism of memristive arrays, memristive ANN chips benefit from high integration, low power consumption, high computational parallelism, and high re-programmability, and hold great promise in the field of analog computing.
Convolutional neural network (CNN)
As the amount of data explodes, traditional fully connected ANNs exhibit their information-processing limitations. For example, a single fully connected neuron already requires 3 million weights when processing a low-quality 1000 × 1000 RGB image (1000 × 1000 × 3 inputs), which is very resource-intensive. The proposal of the convolutional neural network (CNN) greatly alleviates this problem. The CNN offers two main features: first, it can effectively reduce the large number of parameters, both by simplifying the input pattern and by lowering the weight volume in the network model; second, it can effectively retain the image characteristics, in line with the principles of image processing.
CNN consists of three main parts: the convolutional layer, the pooling layer, and the fully connected layer. The convolutional layer is responsible for extracting local features in the image through the filtering of the convolutional kernel; the pooling layer is used to drastically reduce the parameter magnitude (downscaling), which not only greatly reduces the amount of computation but also effectively avoids overfitting; and the fully connected layer is similar to the part of a traditional neural network and is used to output the desired results. A typical CNN is not just a three-layer structure as mentioned above, but a multi-layer structure, such as the structure of LeNet-5 as shown in Fig. 7(a). By continuously deepening the design of the basic functional layers, deeper neural networks such as VGG, ResNet, etc. can also be implemented for more complex tasks.
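The downscaling role of the pooling layer can be sketched in a few lines of NumPy. This is a generic illustration of non-overlapping max pooling, not tied to any particular chip or framework.

```python
import numpy as np

def max_pool(x, p=2):
    """Non-overlapping p x p max pooling: each block of the feature map is
    reduced to its dominant activation, shrinking the parameter magnitude."""
    h, w = x.shape
    x = x[:h - h % p, :w - w % p]  # crop so dimensions divide evenly by p
    return x.reshape(h // p, p, w // p, p).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
# A 4x4 map shrinks to 2x2; each output keeps its block's maximum.
assert np.array_equal(max_pool(x), [[5.0, 7.0], [13.0, 15.0]])
```

Each 2 × 2 pooling step reduces the downstream computation by 4× while retaining the strongest local responses, which is the overfitting-avoidance effect mentioned above.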
Based on the investigation of memristive ANNs, memristive CNNs can also be accelerated by parallel MAC operations, and similar conclusions hold for the effect of memristive devices on CNNs, such as the need for ideal linearity, symmetry, smaller variation, and better retention and endurance[77-80]. However, the difference is that the CNN structure is more complex. The convolutional layer adopts a weight-sharing approach, and the connections between neurons are not fully connected, so it cannot be mapped directly onto a 2D memristive array. This is the primary problem that needs to be solved for the implementation of a memristive CNN. Further, the device characteristics affect the convolutional layer and the fully connected layer differently. Generally, the convolutional layer places higher requirements on the device characteristics, including device variation and weight precision[67, 81-83]. Due to the cascading effect, the errors generated in an earlier layer always accumulate, causing greater disturbance to the subsequent layers. Therefore, it is further proved that for memristive CNNs, the precise mapping and implementation of convolutional layers is one of the most important parts.
Fig. 7(b) illustrates the basic principle of the image convolution operation. By sliding the convolution kernels over the image, each pixel value is multiplied by the corresponding value of the convolution kernel, and all the products are summed as the grayscale value of the corresponding pixel in the feature map, until the entire convolution process is done. The most commonly used mapping method on memristive arrays is to store the weights of the convolutional kernels in the array. Specifically, as shown in Fig. 7(c), a column of the memristive array is used to store a convolutional kernel, the two-dimensional image is unrolled into a one-dimensional input voltage signal, and the information of the convolutional feature map is obtained as the output current values of the array.
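This kernel-per-column mapping can be sketched as an im2col-style matrix product. The code is a software analogue: on a crossbar, each `patch @ G` line would be a single parallel analog MAC step. The Prewitt kernels are used here only as familiar edge-detection examples.

```python
import numpy as np

def conv_via_crossbar(image, kernels):
    """Map convolution onto a crossbar-style matrix product (im2col sketch).

    Each kernel is flattened into one 'column' of the weight matrix, and each
    image patch is unrolled into a 1-D 'input voltage' vector, so every output
    pixel of every feature map comes from one parallel MAC operation.
    """
    k = kernels.shape[-1]
    h, w = image.shape
    out_h, out_w = h - k + 1, w - k + 1
    # Weight matrix: one flattened kernel per column, as stored in the array.
    G = np.stack([kern.ravel() for kern in kernels], axis=1)
    out = np.zeros((len(kernels), out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + k, j:j + k].ravel()  # 1-D input vector
            out[:, i, j] = patch @ G                 # parallel MAC per kernel
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
prewitt_x = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)
prewitt_y = prewitt_x.T.copy()
feat = conv_via_crossbar(image, np.stack([prewitt_x, prewitt_y]))
print(feat.shape)  # (2, 3, 3): two feature maps, one per kernel column
```

Note that the sliding loop remains serial: one array read per output pixel. This serialization is exactly the speed bottleneck of the convolutional layer discussed later.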
As shown in Fig. 8(a), Gao et al. first implemented the convolution operation on a 12 × 12 memristor crossbar array in 2016. Prewitt kernels were used as a proof-of-concept demonstration to detect horizontal and vertical edges of the MNIST handwritten digits. Huang et al. have also attempted to implement convolutional operations in three-dimensional memristive arrays with a Laplace kernel for edge detection of images (Fig. 8(b)). More recently, Huo et al. preliminarily validated 3D convolution operations on a HfO2/TaOx-based eight-layer 3D VRRAM to pave the way for 3D CNNs (Fig. 8(c)).
Although the preliminary implementation of the convolution operation on 2D and 3D memristive arrays has been achieved, this mapping approach still raises significant concerns. First, the conversion of a 2D matrix to 1D vectors loses the structural information of the image, which is still important in the subsequent process, and also complicates the data processing in the back-propagation process. Second, if a one-shot MAC operation over the one-dimensional image information is required for convolution, the convolution kernels are stored sparsely in the memristive array, and the many unused cells can cause serious sneak-path issues. Conversely, packing the kernels compactly in the array without any redundant space requires more complex rearrangements of the input image and sacrifices significant time delay and peripheral circuitry for the convolution operation. In short, the convolution operation raises challenges that need to be properly addressed when training memristive CNNs.
Recently, to solve the severe speed mismatch between the memristive fully connected layer and the convolutional layer, which stems from the time consumed by the sliding process, Yao et al. proposed a promising approach of replicating the same group of weights in multiple parallel memristor arrays to recognize an input image efficiently in a memristive CNN chip. A five-layer CNN with three duplicated parallel convolvers on eight memristor PEs was successfully established in a fully hardware system, as shown in Figs. 9(a) and 9(b), which allowed three data batches to be processed at the same time for further acceleration. Moreover, a hybrid training method was designed to circumvent non-ideal device characteristics. After ex-situ training and closed-loop writing, only the last fully connected layer was trained in situ to tune the device conductance. In this way, not only could the existing device imperfections be compensated, but the complex on-chip backpropagation operations for the convolutional layers were also eliminated. Hence, the performance benchmark of the memristor-based CNN system showed 110 times better energy efficiency (11 014 GOP s−1 W−1) and 30 times better performance density (1164 GOP s−1 mm−2) compared with a Tesla V100 GPU, with only a small accuracy loss (2.92% compared to the software testing result) for MNIST recognition. However, in practice, transferring the same weights to multiple parallel memristor convolvers calls for high uniformity across the different memristive arrays; otherwise it induces unavoidable and random mapping errors that hamper the system performance. Besides, the interconnection among memristor PEs can consume a lot of peripheral circuitry.
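The batch-parallel replication idea can be sketched abstractly: each copy of the weight matrix stands in for one convolver PE, so several batches advance in the same analog cycle. This toy model deliberately ignores inter-array variation, which is exactly the uniformity concern noted above; all names and sizes are hypothetical.

```python
import numpy as np

def replicated_convolvers(G, batches):
    """Duplicate one kernel-weight matrix across parallel 'arrays' so that
    len(batches) input batches are processed in the same cycle, not serially.
    """
    copies = [G.copy() for _ in batches]  # same weights, one copy per PE
    return [batch @ g for batch, g in zip(batches, copies)]

rng = np.random.default_rng(3)
G = rng.normal(size=(9, 4))                        # flattened 3x3 kernels, 4 columns
batches = [rng.normal(size=(16, 9)) for _ in range(3)]  # 3 batches of patches
outs = replicated_convolvers(G, batches)
# Every replica produces the result the single shared array would have.
assert all(np.allclose(o, b @ G) for o, b in zip(outs, batches))
```

In hardware, the `G.copy()` step is where programming errors diverge between replicas; modelling each copy as `G + noise` would reproduce the random mapping error described above.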
A more recent work by Lin et al. demonstrated a unique 3D memristive array to break through the limitation of 2D arrays, which can only accomplish simplified interconnections. As shown in Figs. 9(c)–9(e), the unique 3D topology is implemented by a non-orthogonal alignment between the input pillar electrodes and the output staircase electrodes, which form dense but localized connections, and the different 3D row banks are physically isolated from each other. Thanks to this locally connected structure, the array can be extended horizontally with high sensing accuracy and high voltage-delivery efficiency, independent of array issues such as sneak paths and IR drop. By dividing the convolution kernels into different row banks, pixel-wise parallel convolutions can be implemented with high compactness and efficiency. The 3D design handles the spatial and temporal nature of convolution so that the feature maps can be obtained directly at the output of the array with a minimal amount of post-processing. For complex neural networks, the row banks are highly scalable and independent, so they can be flexibly programmed for different output pixels, filters, or kernels from different convolutional layers, which offers substantial benefits in simplifying and shortening the massive and complex connections between convolutional layers. Such a customized three-dimensional memristor array design is a critical avenue towards CNN accelerators with more complex functions and higher computational efficiency.
It can be seen that, to improve the efficiency of memristive CNNs, various mapping methods for memristive arrays are being actively explored, including the multiplexing and interconnection of multiple small two-dimensional arrays, and specially designed 3D stacking structures. In addition to the mapping design of the memristive array cores, the peripheral circuit implementation of a memristive CNN is another important concern, which also determines the performance and efficiency of the system to a large extent. While memristive arrays are conducive to efficient analog computing, the required ADCs and DACs come at a cost. Moreover, due to severe resistance drift, accurate readout circuits are also worthy of further investigation.
Chang et al. have focused their efforts on circuit optimization for on-chip memristive neural networks. They proposed an approach for efficient logic and MAC operations on their fabricated 1-Mb 1T1R binary memristive array. As shown in Figs. 10(a) and 10(b), the structure of the fully integrated memristive macro included a 1T1R memristor array, digital dual-mode word-line (WL) drivers (D-WLDRs), small-offset multi-level current-mode sense amplifiers (ML-CSAs), and a mode-and-input-aware reference current generator (MIA-RCG). Specifically, the D-WLDRs, which replaced DACs, were used to control the gates of the NMOS transistors of the 1T1R cells sharing the same row. Two read-out circuit techniques (ML-CSAs and MIA-RCG) were designed. Thus, the high area overhead, power consumption, and long latency caused by high-precision ADCs could be eliminated, and reliable MAC operations could be maintained despite the small sensing margin caused by device variability and pattern-dependent current leakage. Based on such circuit optimization, a 1-Mb memristor-based CIM macro with 2-bit inputs and 3-bit weights for CNN-based AI edge processors was further developed, which overcame the area-latency-energy trade-off for multibit MAC operations, the pattern-dependent degradation of the signal margin, and the small read margin. These system-level trials verified that high accuracy and high energy efficiency can be achieved using a fully CMOS-integrated memristive macro for CNNs. However, in general, the input information and weight precision are much more complex, at which point the design and optimization of peripheral circuits becomes a more problematic issue that must be addressed as memristive CNNs go deeper.
Other network models
Based on the parallel MAC computing in an array, more memristive neural network models have been investigated. One example is the generative adversarial network (GAN), a kind of unsupervised learning in which two neural networks contest with each other to learn. A GAN has two subnetworks: a discriminator (D) and a generator (G), as illustrated in Fig. 11(a). Both D and G are typically modeled as deep neural networks. In general, D is a classifier trained to distinguish real samples from generated ones, and G is optimized to produce samples that can fool the discriminator. On the one hand, the two competing networks are co-trained simultaneously, which significantly increases the need for memory and computation resources. To address this issue, Chen et al. proposed ReGAN, a memristor-based accelerator for GAN training, which achieved an average 240× performance speedup compared to a GPU platform, with an average energy saving of 94×. On the other hand, GANs suffer from mode dropping and gradient vanishing issues, for which adding continuous random noise to the inputs of the discriminator is very important and helpful; this can take advantage of the non-ideal effects of memristors. Thus, Lin et al. experimentally demonstrated a GAN based on a 1 kB analog memristor array to generate different patterns of digital numbers. The intrinsic random noise of the analog memristors was utilized as the input of the neural network to improve the diversity of the generated numbers.
Another example is the long short-term memory (LSTM) network, a special kind of recurrent neural network. LSTM was proposed to solve the gradient-vanishing problem and is suitable for processing and predicting events with relatively long intervals and delays in a time series. By connecting a fully connected network to an LSTM network, a two-layer LSTM network is illustrated in Fig. 11(b). A traditional LSTM cell consists of a memory cell that stores state information and three gate layers that control the flow of information within the cell and the network. LSTM networks, with their significantly increased complexity and large number of parameters, face a computing-power bottleneck resulting from both limited memory capacity and limited bandwidth. Hence, besides the implementation of the fully connected layer, memristive LSTM work pays more attention to storing the large number of parameters and offering in-memory computing capability for the LSTM layer, as shown in Fig. 11(c). Memristive LSTMs have been demonstrated for gait recognition, text prediction, and so on[92-97]. Experimentally, on-chip evaluations have been performed on a 2.5M analog phase-change memory (PCM) array and a 128 × 64 1T1R memristor array, strongly proving that the memristive LSTM platform is a promising low-power and low-latency hardware implementation.
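The MAC-dominated structure of an LSTM cell can be made explicit in a short sketch. This is a standard-textbook cell, not the specific design of the cited chips; all sizes and names are illustrative. Stacking the four gate weight matrices shows how one large crossbar VMM over the input and hidden state produces all gate pre-activations at once.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step, written to expose the single large MAC it relies on.

    W and U stack the input/forget/output/candidate gate weights, so one
    crossbar-style VMM over x and h yields all four pre-activations together.
    """
    z = W @ x + U @ h + b            # the dominant MAC workload per time step
    i, f, o, g = np.split(z, 4)      # slice out the four gates
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # memory cell update
    h_new = sigmoid(o) * np.tanh(c_new)               # gated hidden output
    return h_new, c_new

rng = np.random.default_rng(2)
n_in, n_hid = 6, 4
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):  # unroll over a 5-step sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (4,)
```

Because `W @ x + U @ h` is recomputed at every time step, keeping W and U resident in a non-volatile array removes the weight-movement bottleneck that limits conventional LSTM hardware.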