
The cost constraints of memory bandwidth and capacity show up constantly in Nvidia's A100 GPUs. Without heavy optimization, the A100 tends to have very low FLOPS utilization. FLOPS utilization compares the total FLOPs required to train a model with the theoretical FLOPs the GPUs could have computed during the model's training time.
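As a rough illustration of that ratio, here is a minimal Python sketch. The function name `flops_utilization` and every number in the usage lines are assumptions made for the example, not measurements; the 312 TFLOPS peak is the commonly quoted dense FP16/BF16 tensor-core figure for the A100 and should be checked against the datasheet for the precision you actually train in.

```python
def flops_utilization(required_flops, num_gpus, peak_flops_per_gpu, training_seconds):
    """Fraction of the theoretically available FLOPs that training actually needed.

    required_flops      total floating-point operations needed to train the model
    num_gpus            number of GPUs used for the run
    peak_flops_per_gpu  theoretical peak throughput of one GPU, in FLOP/s
    training_seconds    wall-clock training time, in seconds
    """
    if num_gpus <= 0 or peak_flops_per_gpu <= 0 or training_seconds <= 0:
        raise ValueError("num_gpus, peak_flops_per_gpu and training_seconds must be positive")
    # FLOPs the hardware could have delivered in the same wall-clock time.
    theoretical_flops = num_gpus * peak_flops_per_gpu * training_seconds
    return required_flops / theoretical_flops


# Hypothetical run: 8 GPUs, assumed 312e12 FLOP/s peak each, one day of training.
A100_PEAK_FLOPS = 312e12   # assumption: dense FP16/BF16 tensor-core peak
utilization = flops_utilization(
    required_flops=6.5e19,  # made-up workload size
    num_gpus=8,
    peak_flops_per_gpu=A100_PEAK_FLOPS,
    training_seconds=24 * 3600,
)
print(f"FLOPS utilization: {utilization:.1%}")
```

A low value here does not say why the GPUs are idle; it only shows that most of the theoretical compute went unused.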


Just as with training ML models, knowing which regime you're in lets you narrow in on the optimizations that matter. For example, if you're spending all of your time doing memory transfers (i.e. you are in a memory-bandwidth-bound regime), then increasing the FLOPS of your GPU won't help. On the other hand, if you're spending all of your time performing big chonky matmuls (i.e. a compute-bound regime), then rewriting your model logic into C++ to reduce overhead won't help.
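One simple way to guess the regime is to compare the time an operation would take if it were limited only by compute, the time it would take if limited only by memory traffic, and the time it actually takes. The sketch below does that in Python. The function `classify_regime`, the `overhead_factor` threshold of 2.0, and all numbers in the usage lines are assumptions chosen for illustration; a real diagnosis should come from a profiler.

```python
def classify_regime(flops, bytes_moved, measured_seconds,
                    peak_flops, peak_bandwidth, overhead_factor=2.0):
    """Return "compute", "memory-bandwidth", or "overhead" for one operation.

    flops             floating-point operations the operation performs
    bytes_moved       bytes read from and written to GPU memory
    measured_seconds  observed wall-clock time of the operation
    peak_flops        hardware peak compute throughput, in FLOP/s
    peak_bandwidth    hardware peak memory bandwidth, in bytes/s
    overhead_factor   assumed slack before blaming overhead (hypothetical default)
    """
    compute_time = flops / peak_flops          # lower bound if only math mattered
    memory_time = bytes_moved / peak_bandwidth  # lower bound if only traffic mattered
    hardware_bound = max(compute_time, memory_time)

    # If the measured time far exceeds both lower bounds, neither compute nor
    # memory explains it, so framework or launch overhead is the likely culprit.
    if measured_seconds > overhead_factor * hardware_bound:
        return "overhead"
    return "compute" if compute_time >= memory_time else "memory-bandwidth"


# Hypothetical hardware: 312e12 FLOP/s peak, 1.5e12 bytes/s memory bandwidth.
PEAK_FLOPS = 312e12
PEAK_BANDWIDTH = 1.5e12

# Large matmul-like op: lots of math per byte moved.
print(classify_regime(2e12, 1e8, 7e-3, PEAK_FLOPS, PEAK_BANDWIDTH))
# Elementwise-like op: little math per byte moved.
print(classify_regime(1e8, 1e9, 8e-4, PEAK_FLOPS, PEAK_BANDWIDTH))
```

The same comparison explains the advice above: adding FLOPS shrinks only `compute_time`, so it cannot help when `memory_time` or overhead dominates.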

