Analysis of the efficiency of atomic operations in multi-core shared-memory computer systems | Vestnik Tomskogo gosudarstvennogo universiteta. Upravlenie, vychislitelnaja tehnika i informatika – Tomsk State University Journal of Control and Computer Science. 2020. № 51. DOI: 10.17223/19988605/51/12

Analysis of the efficiency of atomic operations in multi-core shared-memory computer systems

One of the most relevant problems of parallel programming, both for shared-memory and distributed-memory systems, is the development of effective thread synchronization tools. The main synchronization methods are locks (spinlocks, mutexes), non-blocking (lock-free, wait-free, obstruction-free) concurrent data structures and software transactional memory. Atomic operations (atomics), implemented in hardware as processor instructions, are used in all these synchronization techniques. An operation is called atomic if it completes in one indivisible step relative to other threads. None of the threads can see the modification of a variable partially completed. In other words, if two or more threads perform operations on a shared variable and at least one writes into it, then both threads should use atomics. For efficient parallel programming with atomics, we should consider the caching algorithms. In particular, we have to study the influence of the cache coherence protocol on the efficiency of atomic operations, as well as the effect of buffer size, the number of threads, the distance of data from the processor core (shared cache level). In this work we analyze the efficiency of atomic operations compare-and-swap (CAS), fetch-and-add (FAA), swap (SWP), load and store on modern multi-core processors. These operations are the most widely used and implemented as atomic instructions at the hardware level on most microarchitectures. We study the impact of the cache coherence protocol, buffer size and data locality on the efficiency of atomic operations. In particular, we study the impact of the cache line state within the cache coherence protocol (M, E, S, O, I, F) on the efficiency of atomic operations. M (Modified) - a cache line is modified and contains actual data. O (Owned) - a cache line contains actual data and is their sole owner. E (Exclusive) - the cache line contains actual data that correspond to the state of the main memory. S (Shared) - cache line contains actual data and other processor cores have actual copies of this data in a shared state. F (Forwarded) - the cache line contains correct data, while other cores in the system may have copies of the data in a shared state. I (Invalid) - the cache line contains incorrect data. At present, the efficiency of atomic operations has not been sufficiently analyzed. This paper discusses the CAS, FAA, SWP, load, store operations and experimental conditions are as close as possible to the actual parallel programs: the core frequency is not fixed, pages are not resized, and data prefetching is enabled. In addition, we consider different variants of CAS operation (successful and unsuccessful), as well as atomic operations using data from the local and remote caches (with respect to the core performing the operations). We designed the algorithms and software toolkit for evaluation efficiency of atomic operations. We experimentally shown that the operations "unsuccessful CAS", FAA and SWP have comparable latency. For operations load and CAS, the smallest and largest latencies are obtained, respectively. A "successful CAS" with any buffer size and cache line conditions runs faster on the local core. The unsuccessful CAS operation is performed faster on the local core in any states, if the buffer size does not exceed the sizes of L1 and L2 caches. SWP operation has the lowest execution latency on the local core for any size of a buffer in the Modified and Exclusive cache line states. We analyze the experimental results and give the recommendations to increase the throughput and minimize latency of atomic operations on the given processor architectures. For example, based on these recommendations, we can increase the throughput of atomics on processors with Piledriver microarchitecture from 1.1 to 3.9 times, for K10 processors from 1.1 to 1.6 times, for Nehalem-EP processors from 2.1 to 6.1 time, for Westmere-EP processors from 1.1 to 7.2 times. Thus, the obtained results show that the latency of atomic operations can vary in wide range, depending on the conditions of their execution (cache line state, locality and buffer size). These evidences should be considered in designing concurrent data structures and synchronization primitives. Keywords: atomic operations; cache coherenc

Download file

Counter downloads: 150

Keywords

атомарные операции, протокол когерентности кэша, многопоточные программы

Authors

Name	Organization	E-mail
Goncharenko Evgemy Andreevtch	Saint Petersburg Electrotech-nical University "LETI"	kesvesna@rambler.ru
Paznikov AleXey AleXandrovtch	Saint Petersburg Electrotech-nical University "LETI"	apaznikov@gmail.com

Всего: 2

References

Herlihy M., Shavit N. The art of multiprocessor programming. Morgan Kaufmann, 2012. 537 p.

Пазников А.А. Оптимизация делегирования выполнения критических секций на выделенных процессорных ядрах // Вестник Томского государственного университета. Управление, вычислительная техника и информатика. 2017. № 38. С. 52-58.

Аненков А.Д., Пазников А.А. Алгоритмы оптимизации масштабируемого потокобезопасного пула на основе распределя ющих деревьев для многоядерных вычислительных систем // Вестник Томского государственного университета. Управление, вычислительная техника и информатика. 2017. № 39. С. 73-84.

Кулагин И.И., Курносов М.Г. Оптимизация обнаружения конфликтов в параллельных программах с транзакционной памятью // Вестник Южно-Уральского государственного университета. Сер. Вычислительная математика и информатика. 2016. Т. 5, № 4. С. 46-60.

Shavit N., Touitou D. Software transactional memory // Distributed Computing. 1997. V. 10, 2. P. 99-116.

Morrison A., Afek Y. Fast Concurrent Queues for x86 Processors // PPoPP'13. 2013. P. 103-112.

Harris T.L. A Pragmatic Implementation of Non-blocking Linked-Lists // DISC'01. 2001. P. 300-314.

Elteir M., Lin H., Feng W. Performance characterization and optimization of atomic operations on amd gpus // IEEE Int. Conf. on Cluster Computing. 2011. P. 234-243.

Schweizer H., Besta M., Hoefler T. Evaluating the cost of atomic operations on modern architectures // 2015 Int. Conf. on Parallel Architecture and Compilation (PACT). 2015. P. 445-456.

Dey S., Nair M.S. Design and Implementation of a Simple Cache Simulator in Java to Investigate MESI and MOESI Coherency Protocols // Int. J. of Computer Applications. 2014. V. 87, №. 11. P. 6-13.

Molka D., Hackenberg D., Schone R., Muller M.S. Memory performance and cache coherency effects on an intel nehalem multiprocessor system // 18th Int. Conf. on Parallel Architectures and Compilation Techniques. 2009. P. 261-270.