Intel’s Tiger Lake Processors Are Made With 10nm SuperFin – Massive Clock Speed Boost And Other Architectural Improvements

Intel's Tiger Lake CPUs are going to be released soon but it turns out they have an ace up their sleeve when compared to Ice Lake. Unlike ICL, Tiger Lake CPUs are manufactured on the company's 10nm SuperFin process which allows 14nm-level clocks and other major architectural improvements. As it turns out, a Tiger Lake CPU can actually hit up to 5.0 GHz clocks - which is something that only Intel's most mature 14nm processes were able to do before.

Intel's 10nm SuperFin Tiger Lake CPUs can hit up to 5.0 GHz

See, Intel's original 10nm core architecture, Sunny Cove, had great IPC but could not sustain high enough clock rates. To fix this, Intel developed a new type of transistor called SuperFin (previously called 10nm+). Don't be misled by the + though, because unlike previous pluses, this iteration delivered roughly the same level of improvement as a node shrink in one go. You can read more about Intel's 10nm SuperFin transistor over here.

Intel Tiger Lake with 10nm SuperFin architectural deep dive

Tiger Lake also utilizes Willow Cove cores, which double the ring bandwidth and shift to a dual ring architecture. Willow Cove is essentially a vastly improved version of Sunny Cove, and combined with the Intel SuperFin process, it turns Tiger Lake into a truly formidable beast. Tiger Lake also ships with the company's first Xe iGPU, which can achieve up to 2.6 TFLOPS of performance - absolutely insane for such a tiny chip.

Without any further ado, here is Intel's Boyd Phelps, Vice President of Client Engineering Group, explaining the architectural improvements in the new Tiger Lake CPUs:

We added a new high-performance transistor that increases drive current with an improved gate process, enabling higher mobility while also lowering the source-drain resistance, and we did this all with lower capacitance. Not only did we add a new device for high performance, but we also took our existing high-VT devices used in our non-high-frequency-critical IPs, like Type-C, PCIe, and imaging, and made them more efficient. We were able to speed those devices up while lowering their leakage, and this gave us the ability to lower their operating voltage, returning yet more power headroom to be available for our high-performance IPs.

But as every good designer knows, it takes more than transistors. As Moore's Law continues to shrink feature sizes, the metal stack's interconnect performance is as vital as the transistor itself. We invested significant engineering focus and resources to redesign the metal stack as well. We greatly improved the resistance, the availability, and yieldability of the mid layers. We also added two additional high-performance layers at the top and dramatically enhanced the MIM capacitor (MIMcap) capabilities by greater than 4X to ensure a rapid and solid power delivery response for high-CPU-intensity workloads.

Now, the combination of the new transistor technology as well as the improved metal stack is what we call SuperFin technology. The results of these engineering investments have greatly exceeded our expectations. I would like to explain how we use it in Tiger Lake.

In designing the CPU for Tiger Lake, called Willow Cove, we were driven by three main goals: 1) build upon the foundation of the Sunny Cove architecture with all of its deeper, wider, smarter and [indistinguishable] hardware for AI, but make it significantly faster and do it at lower power; 2) redesign the caching architecture to be non-inclusive, with the mid-level cache size increased from 512 kilobytes to 1.25 megabytes, in order to handle the increased performance targets and emerging workloads of tomorrow; and 3) make it secure with features like CET technology that helps protect against control-flow-oriented attacks. To be honest, we debated whether to focus our efforts on additional IPC or to redesign the fundamental circuits to take advantage of the SuperFin process enhancements.

And in the end, we believe we made the right engineering choice, and we were able to deliver a greater than generational improvement in performance by not only dramatically lowering the voltage at which Willow Cove achieves its operating frequencies versus Sunny Cove, but also extending the range. Willow Cove is better, faster and more efficient, enabling generational CPU gains in not only TDP-limited performance but also unconstrained performance across the board. Willow Cove was designed to optimize the entire range of the V-F (voltage-frequency) curve. So let me illustrate what this dynamic range of performance looks like. At a given voltage, Willow Cove delivers a significant frequency increase. It can also operate at any fixed frequency with a significantly lower voltage. It is performance across the full V-F curve, a greater dynamic range from V-min to V-max. This is a substantial uplift. Now today, I'd love to show you what this looks like in an actual workload.
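As an aside to the talk, here is a first-order sketch of why running a fixed frequency at a lower voltage matters so much. It is not from Phelps's presentation: it uses the standard CMOS dynamic-power approximation P ≈ α·C·V²·f, and the function names and the voltage and frequency figures in the example are hypothetical placeholders, not Willow Cove or Sunny Cove measurements.

```python
# First-order CMOS dynamic power model: P = activity * C_eff * V^2 * f.
# All numeric inputs below are hypothetical; they are not Intel figures.


def dynamic_power(c_eff_farads, voltage, freq_hz, activity=1.0):
    """Return switching power in watts under the P = a*C*V^2*f approximation."""
    return activity * c_eff_farads * voltage ** 2 * freq_hz


def relative_power(v_new, v_old, f_new=1.0, f_old=1.0):
    """Ratio of new to old dynamic power when only voltage and frequency change."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


if __name__ == "__main__":
    # Same frequency, 10% lower voltage (hypothetical numbers).
    print(f"power ratio at fixed frequency: {relative_power(0.9, 1.0):.2f}")
    # Same voltage, 10% higher frequency (hypothetical numbers).
    print(f"power ratio at fixed voltage:   {relative_power(1.0, 1.0, 1.1, 1.0):.2f}")
```

Because voltage enters squared while frequency enters linearly, this simple model predicts that a voltage reduction at a fixed frequency saves more power than the same percentage change in frequency costs. Leakage and other static power are ignored here.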

At the top of your screen we show Willow Cove and Sunny Cove side-by-side running WebXPRT 3, a browser-based workload using HTML5 and JavaScript, which has high frequency sensitivity and represents the bursty nature of web browsing performance that is important in today's workloads. In the bottom part of the screen, you can see the real-time frequencies for Willow Cove and Sunny Cove while running the workload. As the workload runs, there are multiple sub-workloads running in series, so you can see the CPU bursting up as workloads run and down as workloads complete and the next one fires up. You can clearly see Willow Cove running with significantly higher frequencies at the same or lower power than Sunny Cove, which results in better performance and responsiveness.

This was our goal in designing the Willow Cove CPU, where there is no s-curve downside in the CPU at all. We are leveraging the SuperFin process enhancements to change how power is allocated to graphics. We are able to deliver more power headroom, and with that headroom and architectural improvements, we were able to increase our execution units from 64 to 96 and drive them faster within the same power envelope. We also added additional hardware and data types for increased AI capabilities, and David Blythe will discuss our groundbreaking Xe graphics architecture in the next presentation.

With the increased number of EUs came a demand for more bandwidth, and to open up that bandwidth, we had to redesign Tiger Lake and feed the Xe engine. Tiger Lake was designed for high bandwidth and to support a wide variety of memory technologies. As mentioned before, in Tiger Lake our high-speed coherent fabric, called the ring, is used to connect our high-performance cores and graphics. We doubled the ring bandwidth over Ice Lake by implementing a dual ring microarchitecture. We also leveraged our SuperFin technology to improve the voltage and frequency scaling capability. We also enlarged our last level cache by 50%, now in the range of 12 megabytes to 24 megabytes depending on the product, to capture more working sets while maintaining the same low hit latency.

Now, to exploit DRAM efficiency and better utilize memory bandwidth, we reorganized the memory into 8 by 16 bundles and added a second memory controller with deeper queues for better scheduling efficiency. The max bandwidth to memory in the Tiger Lake architecture can scale up to 86 GB/s. Initial Tiger Lake configurations will support DDR4-3200 and LPDDR4X up to 4267, and the design is future-proofed for a later version of Tiger Lake to support LPDDR5 up to 5400.
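The bandwidth figure above can be sanity-checked with simple arithmetic. The sketch below is not part of the talk; it assumes a 128-bit total memory bus (reading "8 by 16" as eight 16-bit channels, which the presentation does not spell out) and computes peak theoretical bandwidth as transfer rate times bus width. The function and constant names are made up for illustration.

```python
# Peak theoretical DRAM bandwidth = transfer rate * bus width in bytes.
# ASSUMPTION: a 128-bit total bus (eight 16-bit channels); not stated in the talk.

ASSUMED_BUS_WIDTH_BITS = 8 * 16


def peak_bandwidth_gbs(transfer_rate_mts, bus_width_bits=ASSUMED_BUS_WIDTH_BITS):
    """Return peak bandwidth in GB/s (10^9 bytes/s) for a rate given in MT/s."""
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9


if __name__ == "__main__":
    for name, rate in [("DDR4-3200", 3200), ("LPDDR4X-4267", 4267), ("LPDDR5-5400", 5400)]:
        print(f"{name}: {peak_bandwidth_gbs(rate):.1f} GB/s")
```

Under that assumption, LPDDR5 at 5400 MT/s works out to about 86.4 GB/s, which lines up with the "up to 86 GB/s" quoted in the talk. Sustained bandwidth in practice will be lower than these peak numbers.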

We also added the Total Memory Encryption (TME) engine, which applies the AES-XTS encryption/decryption algorithm to memory traffic to protect the system DIMMs. With Tiger Lake we offer a variety of AI capabilities for different workloads, from CPU AI acceleration via the VNNI instructions, to the GPU, to low-power accelerators like GNA 2.0. The power efficiency of Tiger Lake's GNA 2.0 is 1 GigaOp per milliwatt, with capability up to 38 GigaOps. Now this is targeted at algorithms like noise cancellation or applications like meeting transcription, translation, and of course context and conversation tracking.
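To put the GNA 2.0 numbers in perspective, the back-of-envelope sketch below (not from the talk) converts a throughput target into an estimated power draw. It assumes the quoted 1 GigaOp per milliwatt efficiency holds constant all the way up to the 38 GigaOps peak, which the presentation does not claim; the function name is hypothetical.

```python
# Back-of-envelope GNA power estimate from the quoted efficiency figure.
# ASSUMPTION: efficiency stays constant at 1 GigaOp/mW across the whole range.


def gna_power_mw(throughput_gops, efficiency_gops_per_mw=1.0):
    """Return estimated power in milliwatts for a given throughput in GigaOps."""
    if efficiency_gops_per_mw <= 0:
        raise ValueError("efficiency must be positive")
    return throughput_gops / efficiency_gops_per_mw


if __name__ == "__main__":
    print(f"Estimated power at 38 GigaOps: {gna_power_mw(38):.0f} mW")
```

If the assumption held, the peak 38 GigaOps would cost on the order of 38 mW, which is why this class of accelerator suits always-on tasks such as noise cancellation.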

This low-power capability is increasingly important in today's modern mobile CPUs, especially with the emphasis on high-quality distance-based collaboration. Tiger Lake also provides advancements in the areas of display and imaging. For the display, we wanted to increase not only the number of displays that could be supported but also handle the greater resolutions and quality emerging in future displays. So this translates into much higher bandwidth requirements, with the need to preserve quality of service. To do that, we re-architected Tiger Lake to handle that demand; we plugged in a 64-byte direct data path from memory to display. We call this the display ISOC port, and it bypasses all of the arbitration layers of the SoC fabric.
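To get a feel for why more and higher-resolution displays push up memory bandwidth, the sketch below (not part of the talk) estimates raw scan-out bandwidth as width × height × refresh rate × bytes per pixel. It assumes uncompressed 4-byte pixels and ignores blanking intervals; the display modes and the function name are illustrative choices, not Tiger Lake specifications.

```python
# Raw scan-out bandwidth for one display, ignoring blanking and compression.
# ASSUMPTION: 4 bytes per pixel; the modes listed are illustrative only.


def scanout_bandwidth_gbs(width, height, refresh_hz, bytes_per_pixel=4):
    """Return the pixel-data rate in GB/s (10^9 bytes/s) for one display."""
    return width * height * refresh_hz * bytes_per_pixel / 1e9


if __name__ == "__main__":
    modes = {"4K at 60 Hz": (3840, 2160, 60), "8K at 60 Hz": (7680, 4320, 60)}
    for name, (w, h, hz) in modes.items():
        print(f"{name}: {scanout_bandwidth_gbs(w, h, hz):.2f} GB/s")
```

A single 4K60 stream is roughly 2 GB/s under these assumptions and 8K60 is roughly 8 GB/s. Display traffic is also isochronous, meaning it must arrive on time or the screen glitches, which is the motivation for a dedicated path that avoids fabric arbitration.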

Now the display ISOC port easily supports up to 64 gigabytes per second, depending on product implementation. Now for imaging, with our IPU architecture, there are several new Tiger Lake camera capabilities that are brought to life in our new technologies. The image pipeline is now fully implemented in hardware for lower power and faster responsiveness. There are up to six sensors capable of supporting video up to 4K90 resolutions, with still image resolutions up to 42 megapixels. Our initial product offerings will support 4K30 and 27 megapixels, respectively.

Our IPU6 architecture also supports a host of new sensor technologies and quality enhancements. Tiger Lake has a very rich set of I/O capabilities implemented for a mobile CPU and enables a new array of platform capabilities and form factors. Tiger Lake introduces integrated Thunderbolt 4 and USB4 that are fully specification compliant. The integrated display via the Type-C system builds on the prior DP tunneling via Thunderbolt, but importantly it adds DP-in ports for discrete card DisplayPort outputs to be muxed over the integrated Type-C ports, depending on the SKU configuration. Now for PCIe, in order to increase responsiveness, we added PCIe Gen4 lanes that enable direct SSD attach to the CPU without having to go through the PCH.

Not only is this great for high-speed storage devices, for which we are seeing a hundred nanoseconds less latency versus connecting them via the PCH, but it also allows for other interesting configurations, say, for example, being able to attach graphics cards to it. The number of lanes will be dependent on core count configuration and power levels. Now, in order to improve power and performance to achieve our aggressive goals, we worked on two separate streams of power management work. 1) Tiger Lake was designed to dynamically match the frequency and voltage to the bandwidth needed by the workloads being run.

Our autonomous DVFS capabilities achieve low-latency scaling of voltage and frequency to run at the most power-efficient point based on workload bandwidth for both the SoC fabric and the memory subsystem. In addition to our DVFS capabilities, we've targeted several areas of the design for reduced power consumption, even with all of the added features and performance. Our improved HVT transistors are vital for improving power in our Type-C, PCIe and DDR I/O subsystems.
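The idea of matching voltage and frequency to bandwidth demand, known as dynamic voltage and frequency scaling (DVFS), can be illustrated with a toy selection policy. This is only a sketch and not Intel's implementation: the operating-point table, its values, and the names `OperatingPoint` and `select_operating_point` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    freq_mhz: int             # fabric/memory clock at this point
    voltage: float            # rail voltage needed for that clock
    max_bandwidth_gbs: float  # bandwidth the point can sustain


# Hypothetical operating-point table; the values are placeholders.
OPERATING_POINTS = (
    OperatingPoint(800, 0.65, 20.0),
    OperatingPoint(1600, 0.75, 45.0),
    OperatingPoint(2400, 0.90, 70.0),
)


def select_operating_point(demand_gbs, points=OPERATING_POINTS):
    """Pick the lowest-frequency point whose bandwidth covers the demand."""
    ordered = sorted(points, key=lambda p: p.freq_mhz)
    for point in ordered:
        if point.max_bandwidth_gbs >= demand_gbs:
            return point
    return ordered[-1]  # demand exceeds every point: run at the top one


if __name__ == "__main__":
    for demand in (10.0, 45.0, 60.0, 100.0):
        chosen = select_operating_point(demand)
        print(f"{demand:5.1f} GB/s -> {chosen.freq_mhz} MHz at {chosen.voltage:.2f} V")
```

Choosing the lowest point that still meets demand keeps voltage, and therefore power, as low as possible for light workloads. A real controller would also need hysteresis and transition-latency handling, which this sketch omits.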

We lowered the fixed rail voltages over Ice Lake where possible while also improving the efficiency of our fully integrated voltage regulators. We also reduced the amount of logic needed to live on our deep sleep C-state sustained rail. Now, in the end, both of these work streams resulted in greater power efficiency, which translates into higher performance at the same power envelope with decreased energy consumption.

Overall, versus Ice Lake and leveraging the great process technology improvements we've described, the Tiger Lake SoC architecture delivers significant advancements across a wide set of SoC IPs: more than a generational increase in CPU performance with Willow Cove, massive improvements in graphics power efficiency in the Xe graphics IP, scalable AI for emerging client workloads, increased memory and fabric efficiency to support high bandwidth, rich I/O capabilities, and much more. I hope you can see why we are excited about our Tiger Lake journey. Intel teams redesigned the SoC on the foundation of our SuperFin technology, delivering greater performance across CPU, GPU and AI. With the improved power efficiency, we were not only able to refresh and update all existing IP, but we were able to integrate additional functionality for a greater user experience across the spectrum. - Boyd Phelps, Vice President, Client Engineering Group, Intel, Architecture Day 2020.