Publisher Information

Published by:
EGK Microelectronic Solutions Group Sdn. Bhd.
8, Lintang Beringin 8, Diamond Valley Industrial Park, 11960 Batu Maung, Penang, Malaysia
Tel: +604-505 9700
Website: www.egkhor.com.my

© Copyright 2025 EGK Microelectronic Solutions Group Sdn. Bhd. All rights reserved. No part of this book may be reproduced, stored, or transmitted in any form or by any means—electronic, mechanical, photocopying, recording, or otherwise—without prior written permission from the publisher.

First Print: December 2025
eISBN 978-629-94581-7-3
Author

Isaac Khor Eng Gian
Founder & Chief Executive Officer
EGK Microelectronic Solutions Group Sdn. Bhd.
Company Registration No.: 20250102992 (1604405-X)
Penang, Malaysia
Foreword

For more than half a century, computing progress has been driven by a simple premise: if we make processors faster, smaller, and cheaper, intelligence will follow. That assumption has held remarkably well—from mainframes to microprocessors, from CPUs to GPUs, and from general-purpose machines to highly specialized accelerators. Yet today, we find ourselves at a transition point where scaling alone no longer delivers proportional returns. This book enters the conversation at precisely the right moment.

What limits modern artificial intelligence is not ambition, nor algorithmic ingenuity, nor even data availability. It is the physics of computation itself—energy dissipation, data movement, and the irreversible nature of most classical logic. These constraints are not bugs to be engineered away; they are fundamental properties of how we compute today.

Much public discourse frames the future as a linear race: GPUs giving way to ASICs, and ASICs eventually replaced by quantum computers. That framing is incomplete. Progress does not come from swapping engines mid-flight. It comes from rethinking the architecture of the aircraft.

The central insight of this work is that heterogeneous computing is not about faster hardware, but about better decisions upstream. GPUs, ASICs, and quantum processors are not competing paradigms. They are complementary substrates whose value depends entirely on how—and whether—they are invoked. Acceleration amplifies intent; it does not create it.

Quantum computing, in particular, is often mischaracterized as a universal successor to classical systems. In reality, its power lies in narrow, well-defined regimes: optimization, sampling, and certain classes of search. Treating quantum devices as drop-in accelerators misunderstands both their strengths and their limitations. Treating them as components within a carefully orchestrated system, however, opens a more realistic and more powerful path forward.

What distinguishes this book is its refusal to indulge in hype while still taking the future seriously. It does not promise to defeat NP-hardness, eliminate noise, or bypass thermodynamics. Instead, it embraces engineering reality: heuristic performance, probabilistic outcomes, and system-level trade-offs. By grounding its architecture in what current and near-term hardware can actually deliver, it advances the conversation from speculation to design.

Equally important is the emphasis on orchestration. The most consequential layer in any future AI system will not be silicon, but control: the logic that determines which computations are admissible, which substrates are appropriate, and which costs are worth paying before resources are committed. When that logic is sound, scale becomes stable. When it is not, no amount of acceleration can compensate.

This book does not argue that quantum computing will save artificial intelligence. It argues something more subtle—and more important: that the future of AI depends on integrated systems that respect physical limits, exploit specialization intelligently, and treat computation as a scarce resource to be governed, not an infinite one to be consumed.

For researchers, engineers, and decision-makers working at the intersection of AI, hardware, and systems architecture, this work offers a clear-eyed map of where we are—and a disciplined vision of where we might go next.

— Foreword Contributor
A Senior Researcher in Quantum Computing and Systems Architecture
Preface

Artificial intelligence did not stall because algorithms ran out of ideas. It stalled because physics began to matter again.

Over the last decade, AI progress has been driven by relentless scaling: more parameters, more data, more GPUs, more energy. This strategy worked—until it didn't. Power density, memory bandwidth, fabrication cost, and thermodynamic limits now shape what is possible more than model architecture alone. At the same time, quantum computing has been widely discussed—often inaccurately—as a replacement for classical systems. This book takes a different position.

The premise of this work is simple but precise: the future of AI acceleration is heterogeneous. GPUs, ASICs, and quantum processors will coexist—each applied where physics, economics, and computational structure justify their use.

This book does not claim that quantum computing is production-ready. It does not promise to defeat NP-hardness. It does not suggest abandoning classical hardware. Instead, it argues for a layered, pragmatic architecture that explicitly accounts for the limits of irreversible computing, the realities of current quantum devices, and the engineering challenges of integration.

This is a book written for:
• Hardware architects
• AI system designers
• Semiconductor engineers
• Applied researchers
• Technology leaders responsible for long-term infrastructure decisions

If you are looking for hype, this book will disappoint you. If you are looking for clarity, realism, and architectural direction, it was written for you.
Intended Audience

• AI infrastructure architects
• Semiconductor and hardware system engineers
• Researchers in AI optimization and quantum computing
• CTOs, founders, and technical decision-makers
• Graduate-level readers in computer engineering or applied physics
How This Book Is Structured

This book is organized into three progressive parts, moving from constraint → opportunity → architecture.

• Part I establishes why classical scaling is reaching limits
• Part II defines where quantum acceleration may apply — and where it does not
• Part III presents how a realistic GPU–ASIC–Quantum architecture can be built

Each part is designed to stand independently, while collectively forming a unified systems argument.
Table of Contents: Hybrid AI Acceleration

Part I: The Limits of Classical Scaling (pp. 10-15)

Chapter 1: The AI Acceleration Imperative (p. 10)
  The rise of specialized hardware, p. 11
  GPU dominance and parallelism, p. 13
  ASICs and efficiency-driven design, p. 14
  The scaling pressure of modern AI models, p. 15

Chapter 2: The End of Brute-Force Scaling (pp. 17-21)
  Power density and energy limits, p. 17
  Memory bandwidth and the memory wall, p. 18
  Economic feasibility and fabrication costs, p. 19
  Computational complexity and NP-hardness, p. 20

Chapter 3: Introducing the Quantum Complement (p. 22)
  Beyond classical computation, p. 22
  Quantum fundamentals (superposition, entanglement), p. 22
  The NISQ era and near-term reality, p. 24
  Augmentation, not replacement, p. 26

Part II: Quantum Acceleration and Practical Challenges (pp. 30-50)

Chapter 4: The Quantum Advantage in AI Subroutines (p. 31)
  Optimization workloads, p. 32
  Sampling and generative models, p. 34
  Search algorithms, p. 35
  Hybrid classical–quantum workflows, p. 36

Chapter 5: Hard Realities: The Challenges of Quantum Hardware (pp. 37-50)
  NP-hardness and heuristic limits, p. 37
  Connectivity constraints, p. 38
  Minor embedding overhead, p. 40
  Chain breaking effects, p. 42
  Suboptimal solution quality, p. 43

Part III: The New Integrated Architecture (pp. 51-55)

Chapter 6: The Layered Hybrid Architecture (p. 52)
  GPU → ASIC → Quantum model, p. 54
  Roles of each compute layer, p. 54
  The orchestration layer, p. 55
  Workload partitioning and data flow, p. 55

Chapter 7: The Software Stack Challenge (pp. 56-61)
  Unified programming models, p. 57
  Compilers and runtime systems, p. 58
  Hardware–software co-design, p. 61

Chapter 8: Realizing Benefits and Setting Expectations (pp. 62-64)
  Pragmatic engineering principles, p. 62
  Key Performance Indicators (KPIs), p. 63
  Statistical evaluation vs guarantees, p. 63
  Projected Impact and Limitations, p. 64

Chapter 9: The Future of AI Infrastructure (p. 65)
  Integrated heterogeneous systems, p. 65
  Development roadmap, p. 66
  Standardization challenges, p. 68

Conclusion: From brute-force scaling to intelligent integration (p. 69)

Back Matter
  References, p. 70
  Index, p. 72
  Glossary, p. 77
Part I: The Limits of Classical Scaling

Overview

This part establishes the foundation for Hybrid AI Acceleration. It recognizes the extraordinary progress enabled by classical accelerators—GPUs and ASICs—while rigorously defining the physical, economic, and computational constraints that make continued brute-force scaling untenable. The conclusion is not pessimistic; rather, it motivates a shift toward heterogeneous systems where classical and non-classical hardware are intelligently integrated.

Chapter 1: The AI Acceleration Imperative

Figure 1.1 (Conceptual): Evolution of AI Hardware Acceleration. A timeline showing the transition from CPU-centric computing to GPU-dominated training, followed by the emergence of ASICs for efficiency. The figure highlights increasing specialization as a response to scaling pressure.
Equation 1.1 (Compute Scaling):

Training Compute ∝ N_params × N_tokens × N_epochs

This empirical relationship illustrates why hardware acceleration became mandatory as model size and dataset scale grew exponentially.

1.1 The Rise of Specialized Hardware

The modern AI revolution is inseparable from hardware acceleration. While algorithmic innovations (such as the development of backpropagation, convolutional networks, and transformers) and the explosion of data availability have been essential pillars, the decisive enabler has been the ability to execute vast numbers of mathematical operations efficiently and at scale. Deep learning workloads are overwhelmingly dominated by dense linear algebra—primarily matrix multiplications (GEMMs), convolutions, and tensor contractions—that inherently demand massive parallelism to achieve practical training times.

This computational profile triggered a profound paradigm shift away from general-purpose CPUs—optimized for low-latency, branch-heavy sequential processing—toward specialized architectures prioritizing throughput (raw operations per second) over single-thread latency. Early deep learning experiments on CPUs were prohibitively slow; training even modest networks could take weeks or months.

The breakthrough came with the adoption of Graphics Processing Units (GPUs). Originally designed for rendering complex graphics through highly parallel pixel operations, GPUs proved remarkably suited to the SIMD (Single Instruction, Multiple Data) nature of neural network computations. A pivotal moment was the 2012 ImageNet competition, where Alex Krizhevsky's AlexNet—trained on two NVIDIA GTX 580 GPUs—achieved a dramatic reduction in error rate, ushering in the deep learning era and demonstrating GPU acceleration's transformative potential.

This shift ignited a hardware–algorithm co-evolution. Model architectures began to explicitly assume the availability of massively parallel compute: attention mechanisms in transformers rely on efficient matrix multiplies, low-precision formats (e.g., FP16) were adopted to exploit GPU hardware features, and frameworks like CUDA and later ROCm democratized access to this power.

As demands grew, the next wave emerged: Application-Specific Integrated Circuits (ASICs). Trading flexibility for extreme efficiency, ASICs like Google's Tensor Processing Units (TPUs, introduced in 2016) optimized fixed workloads (e.g., inference in production) with superior performance-per-watt and cost at scale. Today, this specialization continues to accelerate AI progress, but it also exposes the limits of classical approaches—setting the stage for exploring heterogeneous and non-classical paradigms.
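To make Equation 1.1 concrete, here is a minimal back-of-envelope sketch in Python. Every number in it (parameter count, token count, cluster size, sustained throughput) is a hypothetical assumption chosen only to show how the proportionality is used, not a measurement or a claim about any real system.

```python
# Illustrative use of Equation 1.1: Training Compute ∝ N_params × N_tokens × N_epochs.
# All numbers below are hypothetical assumptions for demonstration only.
n_params = 70e9          # assumed 70 billion parameters
n_tokens = 1.4e12        # assumed 1.4 trillion training tokens
n_epochs = 1             # assumed single pass over the data

# Treat the proportionality constant as 1 "unit-op" per parameter-token pair,
# so the result is a relative figure of merit rather than exact FLOPs.
training_compute = n_params * n_tokens * n_epochs

sustained_throughput = 1e15   # assumed sustained unit-ops per second per accelerator
num_accelerators = 1000       # assumed cluster size

seconds = training_compute / (sustained_throughput * num_accelerators)
print(f"relative training compute: {training_compute:.2e} unit-ops")
print(f"approximately {seconds / 86400:.1f} days on the assumed cluster")
```

Doubling the parameter count or the token budget doubles the required compute under this relationship, which is why each jump in model scale has demanded a corresponding jump in accelerator capacity.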
1.2 GPU Dominance

Graphics Processing Units (GPUs) emerged as the dominant AI accelerator due to their inherently parallel design. Originally engineered for graphics rendering—where tasks like pixel shading require applying the same operations to millions of data points simultaneously—GPUs consist of thousands of lightweight cores capable of executing the same instruction across many data elements in lockstep. This Single Instruction, Multiple Data (SIMD) paradigm, often implemented through the broader SIMT (Single Instruction, Multiple Threads) model in modern GPUs, maps almost perfectly onto the tensor operations that dominate neural networks: matrix multiplications, convolutions, and element-wise activations.

The pivotal demonstration came in 2012 with AlexNet, where training on GPUs reduced computation time from weeks on CPUs to days, enabling far larger models and sparking the deep learning boom. Since then, GPU vendors—primarily NVIDIA—have iteratively introduced increasingly AI-specific features: dedicated tensor cores for accelerated mixed-precision matrix math (introduced in the Volta architecture in 2017 and refined in subsequent generations), support for lower-precision formats like FP16 and BF16 to boost throughput while managing numerical stability, high-bandwidth memory (HBM) stacks to reduce data-movement bottlenecks, and advanced interconnects like NVLink for multi-GPU scaling.

These enhancements have further entrenched GPUs as the backbone of large-scale AI training and inference. Today's frontier models—such as those with trillions of parameters—are inseparable from massive GPU clusters operating as tightly coupled distributed systems, often comprising tens or hundreds of thousands of GPUs interconnected with high-speed fabrics to minimize communication overhead in data-parallel and model-parallel training. While alternatives like ASICs have gained traction for specific deployments, GPUs remain the flexible, programmable workhorse driving most innovation in AI research and development.
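As a rough numerical illustration of the precision trade-off mentioned above, the following sketch compares a single-precision matrix multiply against a half-precision one using NumPy on the CPU. It is only meant to show why lower precision halves storage and bandwidth at some cost in accuracy; real tensor cores execute the FP16/BF16 multiplies in hardware and typically accumulate in FP32, which this toy example does not model.

```python
# Toy illustration of the FP32-vs-FP16 trade-off behind mixed-precision GEMMs.
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((512, 512)).astype(np.float32)
b = rng.standard_normal((512, 512)).astype(np.float32)

reference = a @ b                                      # full-precision GEMM
half = a.astype(np.float16) @ b.astype(np.float16)     # half-precision GEMM
error = np.abs(reference - half.astype(np.float32)).max()

# FP16 halves the bytes moved per element, which is where much of the
# throughput gain on bandwidth-bound workloads comes from.
print(f"FP16 storage vs FP32: {a.astype(np.float16).nbytes} vs {a.nbytes} bytes")
print(f"max absolute deviation introduced by FP16: {error:.3f}")
```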
1.3 ASICs and Efficiency

As AI workloads matured and standardized—particularly for production inference and large-scale training of well-established architectures—Application-Specific Integrated Circuits (ASICs) became economically viable. By trading general programmability for deep specialization, ASICs deliver superior performance per watt and per dollar for fixed or slowly evolving workloads, often achieving 2–10× better energy efficiency than programmable GPUs on targeted tasks.

Prominent examples include Google's Tensor Processing Units (TPUs), first deployed internally in 2015 and made publicly available via Google Cloud in 2018. TPUs feature systolic array architectures optimized for matrix multiplication, large on-chip memory to mitigate data-movement costs, and domain-specific arithmetic pipelines—enabling dramatic reductions in power consumption for tasks like neural network inference in search and translation services. Other notable ASICs include inference accelerators from companies like AWS (Inferentia), Meta (MTIA), and a growing ecosystem of startups, often deployed at hyperscale in data centers where recurring workloads justify the high non-recurring engineering (NRE) costs of custom silicon.

ASICs excel where the algorithmic structure is well understood and stable, enabling aggressive optimization of data paths, memory hierarchies, and arithmetic units (e.g., support for bfloat16 or int8 quantization), and the elimination of unnecessary flexibility. Their success underscores a central theme of this book: as brute-force scaling encounters diminishing returns, efficiency gains increasingly come from specialization rather than raw transistor scaling or general-purpose improvements. This trend toward domain-specific hardware sets the foundation for exploring even more radical heterogeneity in subsequent chapters.
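To illustrate the kind of arithmetic specialization mentioned above, the sketch below implements a simple symmetric, per-tensor int8 weight quantization in Python. It is a minimal example of the general technique under stated simplifying assumptions, not the scheme used by any particular ASIC.

```python
# Minimal sketch of symmetric per-tensor int8 quantization (illustrative only).
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float32 weights to int8 values plus a single scale factor."""
    scale = np.abs(w).max() / 127.0                 # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from int8 values and the scale."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
print("max quantization error:", np.abs(w - dequantize(q, s)).max())
```

Because int8 values occupy a quarter of the memory of float32 and admit simpler multiply-accumulate units, fixing this kind of format in silicon is one of the levers behind the efficiency gains cited for inference ASICs.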
Figure 1.2 (Scaling Curve): Model Size vs. Hardware Efficiency. A log-scale plot illustrating the exponential growth in AI model parameters (and effective training compute) over time, contrasted against the slower, sublinear improvements in hardware efficiency (e.g., FLOPS per watt or per dollar). The widening gap between the two curves highlights the diminishing returns of brute-force classical scaling.
1.4 The Scaling Problem

Despite the remarkable advances in GPUs and ASICs, AI model complexity continues to grow faster than hardware efficiency improvements. Parameter counts in frontier models have increased by orders of magnitude—from millions in early networks to trillions today—while dataset sizes (measured in tokens) and training epochs contribute to an overall exponential rise in required compute, often approximated as proportional to N_params × N_tokens.

In contrast, hardware improvements—such as denser transistors, better architectures, and process node shrinks—follow sublinear trends. Metrics like FLOPS per watt or effective compute per dollar improve steadily but at a decelerating pace, constrained by physical limits (e.g., Dennard scaling breakdown) and economic factors. The result is diminishing returns: each new generation of hardware yields progressively smaller relative gains in supporting larger models, at ever-higher absolute costs in energy, capital, and infrastructure. Training state-of-the-art models now requires supercomputer-scale clusters consuming megawatts of power, raising concerns about environmental sustainability and economic feasibility.

This widening mismatch—exponential demand versus sublinear supply—sets the stage for the central question of Part I: what happens when classical acceleration can no longer scale economically, energetically, or physically to meet AI's insatiable appetite for compute? The exploration of heterogeneous and non-classical paradigms becomes not just advantageous, but necessary.
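The following sketch tabulates this mismatch under purely hypothetical growth rates: demand is assumed to grow tenfold per hardware generation while efficiency only doubles. The specific factors are stand-in assumptions for illustration, not industry measurements, but they show how the energy bill compounds even as hardware keeps improving.

```python
# Illustrative only: exponential compute demand vs. slower efficiency gains.
# Growth factors per "hardware generation" are hypothetical assumptions.
DEMAND_GROWTH = 10.0      # assumed 10x more compute demanded per generation
EFFICIENCY_GROWTH = 2.0   # assumed 2x better FLOPS-per-watt per generation

demand, efficiency = 1.0, 1.0
print(f"{'gen':>3} {'demand':>12} {'efficiency':>12} {'energy needed':>14}")
for generation in range(6):
    # Energy required scales with demand divided by efficiency,
    # so the gap compounds even though efficiency keeps improving.
    energy = demand / efficiency
    print(f"{generation:>3} {demand:>12.1f} {efficiency:>12.1f} {energy:>14.1f}")
    demand *= DEMAND_GROWTH
    efficiency *= EFFICIENCY_GROWTH
```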