PERFORMANCE ENHANCEMENT OF CUDA APPLICATIONS BY OVERLAPPING DATA TRANSFER AND KERNEL EXECUTION
K. Raju
rajuk@nitte.edu.in
Department of CSE, NMAM Institute of Technology, Nitte (India)
Niranjan N Chiplunkar
Department of CSE, NMAM Institute of Technology, Nitte (India)
Abstract
The CPU-GPU combination is a widely used heterogeneous computing system in which the CPU and GPU have separate address spaces. Because the GPU cannot directly access CPU memory, the input data must be available in GPU memory before a GPU function is invoked. When the GPU function completes, the results of the computation are transferred back to CPU memory. CPU-GPU data transfer takes place over the PCI-Express (PCI-E) bus, whose bandwidth is much lower than that of GPU memory. The transfer speed is limited by the PCI-E bandwidth, so the bus acts as a performance bottleneck. This paper discusses two approaches to minimizing the data transfer overhead: performing the data transfer while the GPU function is executing, and reducing the amount of data transferred to the GPU. The effect of these approaches on the execution time of a set of CUDA applications is demonstrated using CUDA streams. The results of our experiments show that the execution time of the applications can be reduced with the proposed approaches.
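To make the first approach concrete, the following is a minimal timing sketch in Python, not the implementation evaluated in the paper. It assumes the input is split into equal chunks, each issued in its own stream, and that the device can run a host-to-device copy, a kernel, and a device-to-host copy on different chunks at the same time. The function names and all timing values are hypothetical and serve only to illustrate why overlapping hides part of the transfer cost.

```python
def serial_time(n_chunks, t_h2d, t_kernel, t_d2h):
    """Total time when every copy and kernel runs back to back (no overlap)."""
    return n_chunks * (t_h2d + t_kernel + t_d2h)


def overlapped_time(n_chunks, t_h2d, t_kernel, t_d2h):
    """Makespan of an idealized three-stage pipeline, one chunk per stream.

    Assumption: the host-to-device copy engine, the compute units and the
    device-to-host copy engine are independent resources, so each can work
    on a different chunk at the same time.
    """
    stage_times = (t_h2d, t_kernel, t_d2h)
    stage_free = [0.0, 0.0, 0.0]  # time at which each resource becomes idle
    for _ in range(n_chunks):
        ready = 0.0  # time at which this chunk finished its previous stage
        for stage, duration in enumerate(stage_times):
            # A stage starts once the chunk is ready AND the resource is idle.
            start = max(ready, stage_free[stage])
            ready = start + duration
            stage_free[stage] = ready
    return stage_free[-1]


if __name__ == "__main__":
    # Hypothetical per-chunk timings in milliseconds, chosen for illustration.
    print(serial_time(4, 2.0, 3.0, 2.0))      # 28.0
    print(overlapped_time(4, 2.0, 3.0, 2.0))  # 16.0
```

With these illustrative numbers the kernel is the longest stage, so after the first copy completes the GPU stays busy and only the first host-to-device copy and the last device-to-host copy remain exposed. Real speedups depend on the hardware, the use of pinned host memory, and the ratio of transfer time to kernel time.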
Keywords:
CPU-GPU, High-performance computing, Kernel, Data transfer, CUDA streams
License
This work is licensed under a Creative Commons Attribution 4.0 International License.
All articles published in Applied Computer Science are open-access and distributed under the terms of the Creative Commons Attribution 4.0 International License.