Vanishing of clock power consumption by using provisional pulse enhancement scheme

In this paper, a low-power pulse-triggered flip-flop (FF) design is presented. First, a simple two-transistor AND gate is designed to reduce the circuit complexity. Second, a conditional pulse-enhancement technique is devised to speed up the discharge along the critical path only when needed. As a result, transistor sizes in the delay inverter and the pulse-generation circuit can be reduced for power saving. Various post-layout simulation results based on United Microelectronics Corporation (UMC) complementary metal-oxide-semiconductor (CMOS) 50-nm technology reveal that the proposed design features the best power-delay-product performance among the FF designs under comparison. Its maximum power saving against rival designs is up to 18.2%, and the average leakage power consumption is also reduced by a factor of 1.52.


E-mail: saisudheeervlsisd@gmail.com.
Authors agree that this article remains permanently open access under the terms of the Creative Commons Attribution License 4.0 International License.

INTRODUCTION
Flip-flops (FFs) are the basic storage elements used extensively in all kinds of digital designs. In particular, digital designs nowadays often adopt intensive pipelining techniques and employ many FF-rich modules. It is also estimated that the power consumption of the clock system, which consists of clock distribution networks and storage elements, is as high as 20 to 40% of the total system power [Hwang et al., 2012]. The pulse-triggered FF (P-FF) has been considered a popular alternative to the conventional master-slave-based FF in high-speed applications [Rasouli et al., 2005]. Besides the speed advantage, its circuit simplicity is also beneficial to lowering the power consumption of the clock tree system. A P-FF consists of a pulse generator for generating strobe signals and a latch for data storage. Since the triggering pulses generated on the transition edges of the clock signal are very narrow, the latch acts like an edge-triggered FF. The circuit complexity of a P-FF is reduced since only one latch is needed, as opposed to the two used in the conventional master-slave configuration (Shu, 2006). P-FFs also allow time borrowing across clock cycle boundaries and feature a zero or even negative setup time. P-FFs are thus less sensitive to clock jitter.
Despite these advantages, pulse generation circuitry requires delicate pulse width control in the face of process variation and the configuration of the pulse clock distribution network [Shu, 2006]. Depending on the method of pulse generation, P-FF designs can be classified as implicit or explicit [Zhao et al., 2004]. In an implicit-type P-FF, the pulse generator is built-in logic of the latch design, and no explicit pulse signals are generated, as shown in Figure 1. In an explicit-type P-FF, the designs of the pulse generator and the latch are separate. Implicit pulse generation is often considered to be more power efficient than explicit pulse generation. This is because the former merely controls the discharging path while the latter needs to physically generate a pulse train. Implicit-type designs, however, face a lengthened discharging path in the latch design, which leads to inferior timing characteristics. The situation deteriorates further when low-power techniques such as conditional capture, conditional precharge, conditional discharge, or conditional data mapping are applied. As a consequence, the transistors of the pulse generation logic are often enlarged to ensure that the generated pulses are sufficiently wide to trigger the data capturing of the latch. Explicit-type P-FF designs face a similar pulse width control issue, but the problem is further complicated in the presence of a large capacitive load, for example, when one pulse generator is shared among several latches (Yin-Tsung Hwang, 2010; Mahmoodi et al., 2009; Teh et al., 2006). In this paper, we present a novel low-power implicit-type P-FF design featuring a conditional pulse-enhancement scheme.
Three additional transistors are employed to support this feature. In spite of a slight increase in total transistor count, transistors of the pulse generation logic benefit from significant size reductions and the overall layout area is even slightly reduced. This gives rise to competitive power and power-delay-product performances against other P-FF designs.

Conventional implicit-type P-FF designs
Some conventional implicit-type P-FF designs, which are used as the reference designs in later performance comparisons, are first reviewed. The pulse generator takes complementary and delay-skewed clock signals to generate a transparent window equal in size to the delay of the inverters. Two practical problems exist in this design. First, during the rising edge, n-channel metal-oxide-semiconductor (NMOS) transistors N2 and N3 are turned on. If data remains high, node X will be discharged on every rising edge of the clock. This leads to a large switching power. The other problem is that node X controls two larger MOS transistors (P2 and N5). The large capacitive load on node X degrades both speed and power performance.
Figure 1(b) shows an improved P-FF design, named MHLLF, which employs a static latch structure. Node X is no longer precharged periodically by the clock signal. A weak pull-up transistor P1 controlled by the FF output signal Q is used to maintain the node X level at high when Q is zero. This design eliminates the unnecessary discharging problem at node X. However, it encounters a longer Data-to-Q (D-to-Q) delay during "0" to "1" transitions because node X is not pre-discharged. Larger transistors N3 and N4 are required to enhance the discharging capability. Another drawback of this design is that node X becomes floating when output Q and input Data both equal "1". Extra DC power is consumed if node X drifts from a full "1".
In the SCCER design, the discharge path contains NMOS transistors N2 and N1 connected in series. In order to eliminate superfluous switching at node X, an extra NMOS transistor N3 is employed. Since N3 is controlled by Q_fdbk, no discharge occurs if input data remains high. The worst-case timing of this design occurs when input data is "1" and node X is discharged through four transistors in series, that is, N1 through N4, while contending with the pull-up transistor P1. Powerful pull-down circuitry is thus needed to ensure node X can be properly discharged. This implies wider N1 and N2 transistors and a longer delay from the delay inverter I1 to widen the discharge pulse width.
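The cost of discharging node X on every clock edge can be estimated with the standard dynamic switching power relation P = a * C * VDD^2 * f. The following is a minimal sketch rather than a figure from the reference designs: the node capacitance and the activity factors are hypothetical, while the 1.0 V supply and 500 MHz clock follow the operating condition used in the simulations.

```python
def switching_power(activity, capacitance_f, vdd_v, freq_hz):
    """Dynamic switching power P = activity * C * VDD^2 * f, in watts."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


C_NODE_X = 5e-15  # hypothetical capacitance at node X (5 fF)
VDD = 1.0         # supply voltage in volts
F_CLK = 500e6     # clock frequency in hertz

# Node X is discharged and recharged in every cycle while data stays high.
p_every_cycle = switching_power(1.0, C_NODE_X, VDD, F_CLK)

# Node X switches only when the output must change
# (hypothetical: 25% of the cycles).
p_conditional = switching_power(0.25, C_NODE_X, VDD, F_CLK)

print(f"discharge every cycle: {p_every_cycle * 1e6:.3f} uW")
print(f"conditional discharge: {p_conditional * 1e6:.3f} uW")
```

Under these assumed numbers, the power at node X scales directly with how often the node is discharged, which is why removing superfluous switching at node X matters.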

Proposed P-FF design
The proposed design, as shown in Figure 2, adopts two measures to overcome the problems associated with existing P-FF designs. The first one is reducing the number of NMOS transistors stacked in the discharging path. The second one is supporting a mechanism to conditionally enhance the pull-down strength when input data is "1". Referring to Figure 2, the latch design in the upper part is similar to the one employed in the SCCER design. As opposed to the transistor stacking design in Figure 1(a), transistor N2 is removed from the discharging path. Transistor N2, in conjunction with an additional transistor N3, forms a two-input pass transistor logic (PTL)-based AND gate to control the discharge of transistor N1. Since the two inputs to the AND logic are mostly complementary (except during the transition edges of the clock), the output node Z is kept at zero most of the time. When both input signals equal "0" (during the falling edges of the clock), temporary floating at node Z is basically harmless.
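The four input combinations of the PTL-based AND gate can be enumerated in a short behavioral sketch. The wiring assumed here is an illustration only and is not confirmed by the figure: N2 is taken to pass the clock when the delayed inverted clock is high, and N3 is taken to pass the delayed inverted clock when the clock is high. The function and signal names are hypothetical.

```python
def ptl_and_node_z(clk, clkb_delayed):
    """Return the state of node Z for one combination of the two inputs.

    Assumed wiring (illustrative): N2 passes clk when clkb_delayed is high,
    and N3 passes clkb_delayed when clk is high.
    """
    n2_on = clkb_delayed == 1
    n3_on = clk == 1
    if n2_on and n3_on:
        return "weak 1"    # both NMOS pass a high level, degraded by VTH
    if n2_on:
        return "0"         # N2 passes clk = 0
    if n3_on:
        return "0"         # N3 passes clkb_delayed = 0
    return "floating"      # both pass transistors are off


for clk in (0, 1):
    for clkb_delayed in (0, 1):
        print(clk, clkb_delayed, ptl_and_node_z(clk, clkb_delayed))
```

In this model, node Z is actively held at zero whenever the inputs are complementary and floats only when both inputs are low.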
At the rising edges of the clock, both transistors N2 and N3 are turned on and collaborate to pass a weak logic high to node Z, which then turns on transistor N1 for a time span defined by the delay inverter I1. The switching power at node Z can be reduced due to a diminished voltage swing. Unlike the MHLLF design, where the discharge control signal is driven by a single transistor, parallel conduction of two NMOS transistors (N2 and N3) speeds up the operation of pulse generation. With this design measure, the number of stacked transistors along the discharging path is reduced and the sizes of transistors N1 to N5 can be reduced as well. In this design, the longest discharging path is formed when input data is "1" while the Qbar output is "1". To enhance the discharging under this condition, transistor P3 is added. Transistor P3 is normally turned off because node X is pulled high most of the time. It steps in when node X is discharged to |VTP| below VDD. This provides an additional boost to node Z (from VDD-VTH to VDD). The generated pulse is taller, which enhances the pull-down strength of transistor N1. After the rising edge of the clock, the delay inverter I1 drives node Z back to zero through transistor N3 to shut down the discharging path. The voltage level of node X rises and eventually turns off transistor P3. With the intervention of P3, the width of the generated discharging pulse is stretched out. This means that, to create a pulse of sufficient width for correct data capturing, a bulky delay inverter design, which constitutes most of the power consumption in the pulse generation logic, is no longer needed. It should be noted that this conditional pulse-enhancement technique takes effect only when the FF output Q is subject to a data change from 0 to 1.
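The relation between the delay of inverter I1 and the width of the discharge pulse can be illustrated with an idealized discrete-time model. This is a sketch under simplifying assumptions: logic levels are ideal, time is counted in arbitrary steps, and the boosting and stretching effect of P3 is not modeled. The function name and the sample clock are hypothetical.

```python
def pulse_train(clk, delay_steps):
    """Ideal pulse at node Z: clk AND the delayed, inverted clk.

    clk is a list of 0/1 samples; delay_steps models the delay of inverter I1.
    """
    z = []
    for t, level in enumerate(clk):
        # Before enough history exists, assume the clock held its first value.
        delayed = clk[t - delay_steps] if t >= delay_steps else clk[0]
        z.append(level & (1 - delayed))
    return z


# Two clock periods, eight steps each (four low, four high).
CLK = [0] * 4 + [1] * 4 + [0] * 4 + [1] * 4

print(pulse_train(CLK, 2))
print(pulse_train(CLK, 3))
```

In this idealized model, each rising edge produces a pulse whose width equals the inverter delay, so a wider pulse requires a slower inverter unless the pulse is stretched by other means.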

Simulation results
To demonstrate the superiority of the proposed design, post-layout simulations on various P-FF designs were conducted to obtain their performance figures. These designs include the two P-FF designs shown in Figure 1 (MHLLF, SCCER). The target technology is the UMC 90-nm CMOS process. The operating condition used in simulations is 500 MHz/1.0 V. In general, the MHLLF design has the worst PDP_DQ performance due to the drawback of its latch structure. Figure 3 shows the best PDP_DQ performance of each design under different data switching activities. The proposed design takes the lead in all types of data switching activity. The SCCER and the MHLLF designs almost tie for second place. Figure 4 shows the PDP_DQ performance of these designs at different process corners under the condition of 50% data switching activity. The performance edge of the proposed design is maintained as well. Notably, the MHLLF design has the worst PDP_DQ performance, especially at the SS process corner, due to a large D-to-Q delay and the poor driving capability of its pulse generation circuit. Table 1 also summarizes some important performance indexes of these P-FF designs. These include transistor count, layout area, setup time, hold time, minimum D-to-Q delay, optimal PDP, and the clock tree power. Although the transistor count of the proposed design is not the lowest one, its actual layout area is smaller than all but the TGFF design. The MHLLF design exhibits the largest layout area because of an oversized pulse generation circuit. Following the measurement methods in [6], curves of D-to-Q delay versus setup time and C-to-Q delay versus hold time are simulated first. Setup time is defined as the point in the curve where D-to-Q delay is the minimum. Hold time is measured at the point where the slope of the curve equals -1. The proposed design features the shortest minimum D-to-Q delay.
Its hold time is longer than other designs because the transistor (P3) for the pulse enhancement requires a prolonged availability of data input. The power drawn from the clock tree is calculated to evaluate the impact of FF loading on the clock jitter. Although the proposed FF design requires the clock signal to be connected to the drain of transistor N2, the drawn current is not significant. Due to the complementary switching behavior of N2 and N3, there exists no signal path from the entry of the clock signal to either VDD or GND. As shown in Figure 5, the result is significantly better than that of other designs. The simulation results show that the clock tree power of the proposed design is close to those of the two leading designs (MHLLF and SCCER) and outperforms the designs in which clock signals are connected to the gates of the transistors only. The setup time is measured as the point where the minimum PDP value occurs. The setup times of these designs vary from -67 to +47 ps. Note that although the optimal setup time of the proposed design is -53.9 ps, its PDP value is the lowest of all designs for any setup time greater than -60 ps. The D-to-Q delay and the hold time are calculated subject to the optimal setup time.
The D-to-Q delay of the proposed design is second only to the SCCER design and outperforms the conventional TGFF design by a margin of 44.7%. The hold time requirement seems to be slightly larger due to a negative setup time. This number reduces as the setup time moves toward a positive value. Table 2 gives the leakage power consumption comparison of these FF designs in a standby mode (clock signal is gated). For a fair comparison, we assume the output Q is "0" when input data is "1" to exclude the extra power consumption coming from the discharging of the internal node X. For different clock and input data combinations, the proposed design enjoys the minimum leakage power consumption, which is mainly attributed to the reduction in the transistor sizes along the discharging path. The SAFF design experiences the worst leakage power consumption when the clock equals "0" because its two precharge PMOS transistors are always turned on. Compared to the conventional TGFF design, the average leakage power is reduced by a factor of 3.52. Finally, to show the robustness of the proposed design against process variations, Table 3 compiles the changes in the width and the height of the generated discharge pulses under different process corners. Although significant fluctuations in pulse width and height are observed, the unique conditional pulse-enhancement scheme works well in all cases.
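The two ways of reading a setup time from a simulated sweep (the point of minimum D-to-Q delay and the point of minimum PDP) can be sketched as follows. All sweep values below are hypothetical placeholders chosen only to show the procedure; they are not the simulation results reported in the tables.

```python
# Hypothetical sweep points:
# (setup time in ps, D-to-Q delay in ps, power in uW)
sweep = [
    (-80, 190.0, 21.0),
    (-60, 150.0, 20.5),
    (-40, 130.0, 19.5),
    (-20, 125.0, 20.6),
    (0, 128.0, 20.9),
    (20, 140.0, 21.3),
]


def pdp_fj(delay_ps, power_uw):
    """Power-delay product in femtojoules (1 uW * 1 ps = 1e-3 fJ)."""
    return delay_ps * power_uw * 1e-3


# Setup time read at the minimum D-to-Q delay.
min_delay_point = min(sweep, key=lambda p: p[1])

# Setup time read at the minimum power-delay product.
min_pdp_point = min(sweep, key=lambda p: pdp_fj(p[1], p[2]))

print("setup time at minimum D-to-Q delay:", min_delay_point[0], "ps")
print("setup time at minimum PDP:", min_pdp_point[0], "ps")
print("minimum PDP:", round(pdp_fj(min_pdp_point[1], min_pdp_point[2]), 3), "fJ")
```

With these placeholder numbers the two criteria select different sweep points, which illustrates why the criterion in use should be stated alongside any reported setup time.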

CONCLUSION
In this paper, we devise a novel low-power pulse-triggered FF design by employing two new design measures. The first one successfully reduces the number of transistors stacked along the discharging path by incorporating a PTL-based AND logic. The second one supports conditional enhancement of the height and width of the discharging pulse so that the sizes of the transistors in the pulse generation circuit can be kept to a minimum. Simulation results indicate that the proposed design outperforms rival designs in performance indexes such as power, D-to-Q delay, and PDP. Coupled with these design merits is a longer hold-time requirement inherent in pulse-triggered FF designs. However, hold-time violations are much easier to fix in circuit design than failures in speed or power.