from unmodified binaries while keeping hardware overheads
to a minimum. This is achieved by extracting backward
slices containing address-generating instructions through a
novel iterative algorithm that can be implemented efficiently
in hardware. These backward slices are then executed on a
second in-order pipeline, enabling them to bypass instructions
blocked by pending loads. The Load Slice Core design im-
proves on existing work by providing a lightweight, hardware-
based method of executing past pending loads while avoiding
re-execution.
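The iterative slice-extraction step can be sketched in software. This is a minimal sketch, not the hardware implementation: the instruction-tuple format and the `extract_address_slices` helper are hypothetical, and the producer map stands in for the register-indexed table the hardware would use. Each pass extends the slice by one dependence level, mirroring how slices can be learned incrementally over successive loop iterations rather than by a full backward walk:

```python
def extract_address_slices(instrs, max_passes=8):
    """Iteratively mark address-generating instructions in one loop body.

    instrs: list of (dst_reg, src_regs, is_load) tuples (hypothetical format).
    Returns the set of instruction indices forming the backward slice that
    produces load addresses. Each pass propagates membership one dependence
    level backward, emulating iterative (per-loop-iteration) learning.
    """
    in_slice = set()
    for _ in range(max_passes):
        producers = {}      # register -> index of its most recent producer
        new_members = set()
        for i, (dst, srcs, is_load) in enumerate(instrs):
            if is_load or i in in_slice:
                # Pull the producers of this instruction's source registers
                # (for loads: the address operands) into the slice.
                for r in srcs:
                    if r in producers:
                        new_members.add(producers[r])
            if dst is not None:
                producers[dst] = i
        new_members -= in_slice
        if not new_members:
            break               # slice has converged
        in_slice |= new_members
    return in_slice


# Hypothetical four-instruction loop body:
#   0: r1 = <base>        1: r2 = r1 + <offset>
#   2: r3 = load [r2]     3: r4 = f(r3)
body = [('r1', [], False), ('r2', ['r1'], False),
        ('r3', ['r2'], True), ('r4', ['r3'], False)]
print(extract_address_slices(body))   # -> {0, 1}
```

In this sketch only the address-producing instructions (0 and 1) end up in the slice; in a real design the loads themselves would also be dispatched to the second pipeline, and convergence would emerge over loop iterations rather than explicit passes.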
Based on detailed timing simulations and estimates of area
and power, we demonstrate that the Load Slice Core is substan-
tially more area- and energy-efficient than traditional solutions:
average performance is 53% higher than an in-order, stall-on-
use core, with an area overhead of only 15% and an increase
in power consumption of just 22%. This enables a power-
and area-constrained many-core design based on the Load
Slice Core to outperform both in-order- and out-of-order-based
alternatives by 53% and 95%, respectively. We therefore
believe that for today’s context of constrained multi-core pro-
cessors, the Load Slice Core strikes a good balance between
single-thread performance and energy efficiency.
Acknowledgments
We thank the anonymous reviewers for their valuable feed-
back. We would also like to thank Arvind for inspiring this
research and Magnus Själander for his feedback to improve
this work. This work is supported by the European Research
Council under the European Community’s Seventh Frame-
work Programme (FP7/2007-2013) / ERC Grant agreement
no. 259295.