This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2935498, IEEE Access

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.

Digital Object Identifier 10.1109/ACCESS.2017.Doi Number

OpenACC Errors Classification and Static Detection Techniques

AHMED MOHAMMED ALGHAMDI AND FATHY ELBOURAEY EASSA
Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia

Corresponding author: Ahmed Mohammed Alghamdi (e-mail: [email protected]).

This work was supported by the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah, under Grant DG-015-611-1440.

ABSTRACT With the continued increase in the usage of High-Performance Computing (HPC) in scientific fields, the need for programming models that target heterogeneous architectures with less programming effort has become important in scientific applications. OpenACC is a high-level parallel programming model used with the FORTRAN, C, and C++ programming languages to accelerate code with few changes and little effort, which reduces programmer workload and makes the model easy to use and learn. OpenACC has also been increasingly used in many top supercomputers around the world, and three of the top five HPC applications tracked by Intersect360 Research currently use OpenACC. However, when programmers use OpenACC to parallelize their code without correctly understanding OpenACC directives and their usage, or without following the OpenACC instructions, they can cause run-time errors that range from wrong results to performance issues and other undefined behaviors. In addition, building parallel systems with a high-level programming model increases the possibility of introducing errors, and the resulting parallel applications have non-determined behavior, which makes testing them and detecting their run-time errors a challenging task. Although many testing tools detect run-time errors, they are still inadequate for detecting the errors that occur in applications implemented in high-level parallel programming models, especially OpenACC-related applications. As a result, the OpenACC errors that cannot be detected by compilers should be identified, and their causes should be explained. In this paper, our contribution is introducing new static techniques for detecting OpenACC errors, as well as, for the first time, classifying the errors that can occur in OpenACC software programs.
Finally, to the best of our knowledge, there is no published work to date that identifies or classifies OpenACC-related errors, nor is there a testing tool designed to test OpenACC applications and detect their run-time errors.

INDEX TERMS OpenACC, OpenACC run-time errors, OpenACC error classifications, OpenACC testing tool, Static approach for OpenACC applications.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.

I. INTRODUCTION
Over the past few years, OpenACC has become increasingly used in many high-performance computers, including Summit, the top supercomputer in the world. OpenACC also attracts more non-computer-science specialists who want to accelerate their systems in several scientific fields, including weather forecasting and simulations. OpenACC is a high-level parallel programming model designed for heterogeneous systems, first released in 2011. This programming model supports parallelism in sequential programming languages by adding OpenACC directives, without low-level details, and is easy to learn and use. As a consequence, programmers can introduce errors when using OpenACC to parallelize their code without fully understanding OpenACC instructions.

Testing parallel applications built with programming models is a difficult task because, if badly programmed, they can have non-determined behavior, which makes it challenging to detect their errors when they occur and to determine the causes of these errors, whether in the user source code or in the programming model directives. Also, it is difficult to see whether the errors have been corrected or are still present but hidden, even after these errors have been detected and the source code modified.

Many studies have investigated applications based on several programming models in order to detect and identify run-time errors, as well as other semantic and syntax errors.
However, OpenACC has not been investigated or identified as clearly as the other programming models have. Although there are many testing tools that detect run-time errors, they are still not enough to detect the errors that occur in applications implemented in high-level parallel programming models used for heterogeneous system architectures, especially OpenACC-related applications.

In this paper, our contribution is introducing new static techniques for detecting OpenACC errors, and, for the first time, classifying the errors that can occur in OpenACC software programs. We briefly mentioned some OpenACC run-time errors in our previous study, published in [1], but in this paper we cover OpenACC run-time errors broadly and explain their causes with examples. Part of our study focuses on OpenACC errors that cannot be detected by compilers. Our experiments were conducted using C++ applications with OpenACC directives, and we used the PGI 19.4 compiler (community edition) to compile our applications. Also, we propose a solution for detecting these errors by building a static testing tool that detects run-time errors in OpenACC applications. Finally, to the best of our knowledge, we are the first to identify and classify OpenACC run-time errors, as well as to build a static testing tool designed to detect these errors in OpenACC applications.

The rest of this paper is structured as follows. Sections 2 and 3 discuss related work and briefly give an overview of OpenACC. Our classification of OpenACC run-time errors is presented in Section 4. In Section 5, we explain our testing tool's architecture design, and in Section 6 our static approach to detecting OpenACC errors is explained in detail. In Section 7, we discuss some aspects of our study, and our tool is evaluated in Section 8. Finally, conclusions and future work are discussed in Section 9.

II. RELATED WORK
Building massively parallel applications is challenging and has several difficulties that can affect a system's efficiency and accuracy. One of these difficulties is that parallel applications, if badly programmed, can have non-determined behavior, which makes it hard to detect parallel errors or to test parallel applications. Also, run-time errors vary from one programming model to another, and it is not easy to determine whether errors have been corrected or remain hidden when modifying the source code.

Many studies have investigated parallel application errors and identified them and their causes, and many programming-model-related errors have been identified, including MPI, OpenMP, CUDA, and OpenCL errors. However, OpenACC has not been investigated or identified as thoroughly as the other programming models.

Common OpenMP application errors were identified and classified in [2], where OpenMP errors were divided into defects and failures, with explanations and some examples. Also, the survey published in [3] presented 15 different mistakes and showed how to avoid them by recommending best practices for these cases. In terms of MPI errors, communication deadlock was investigated and identified in [4], while in [5] the authors introduced a classification summarizing the error types that apply to MPI's non-blocking collectives. Finally, CUDA run-time errors were identified, and ways to avoid them were published, in [6], in which CUDA issues were classified into three categories: errors in using CUDA directives, general parallel errors, and algorithmic errors.

In terms of testing parallel applications, our previous work published in [7] found that applications based on several programming models have been investigated to identify their run-time errors, and many testing tools have been designed to detect these errors. Our study covered more than 50 testing tools and several run-time errors, including deadlock, race condition, livelock, and data mismatching. These testing tools vary in their purposes and scopes, including the specific types of errors they detect, the programming models they target, and the testing techniques they use. Many tools have been created to detect a specific type of run-time error, including data race detection [8] and deadlock detection, such as UNDEAD [9]. Also, many testing tools have been designed to test a specific programming model, such as testing tools for MPI [10], [11], OpenMP [12], [13], CUDA [14], and OpenCL [15]. In terms of OpenACC testing, there is no testing tool dedicated to testing OpenACC applications or detecting their related run-time errors. However, there are some studies related to compiler evaluation: in [16], test cases were created for evaluating OpenACC 2.0; another study evaluated the CAPS, PGI, and CRAY compilers [17]; and OpenACC 2.5 was evaluated in [18] to validate and verify the compiler implementation of OpenACC's new features.

Despite the efforts made to test parallel applications, there is still more work to be done with respect to high-level programming models used in heterogeneous systems. We have noticed that OpenACC has several advantages and benefits and has been used widely in the past few years, but its errors have not been investigated or identified as clearly as those of the other programming models, nor has it been targeted by any testing tool. Finally, to the best of our knowledge, there is no parallel testing tool designed to test applications programmed with OpenACC and detect their run-time errors.

III. OVERVIEW OF OPENACC
OpenACC is a high-level parallel programming model designed for heterogeneous systems that is used to add parallelism to the FORTRAN, C, and C++ programming languages. In November 2011, OpenACC was released at the International Conference for High-Performance Computing, Networking, and Storage. OpenACC stands for "open accelerators"; it was developed by Cray, CAPS, NVIDIA, and PGI, and its latest version was released in November 2018. In addition, OpenACC has been used widely in the past few years by non-computer-science specialists because of the decreased programming effort needed to accelerate their original code, as well as the simplicity of learning and using OpenACC. Also, OpenACC has been increasingly used in many top supercomputers around the world, including in five out of 13 applications on the Summit supercomputer, the top supercomputer in the world, as published in [19].

OpenACC has several directives and clauses used to accelerate source code without many changes. OpenACC directives are divided into the data region, which is responsible for data movement between host and device, and the compute region, which is used for executing code on the device [20]–[22]. OpenACC's data region is defined by data directives that determine data lifetime on the device and is divided into structured and unstructured data regions, both used for data movement but with different aspects and behaviors. OpenACC can have multiple data and compute regions within the same source code.

OpenACC has several advantages and features that give it the ability to parallelize code, including portability. Unlike other programming models such as CUDA, which works only on NVIDIA hardware, OpenACC is portable across platforms and different types of GPU [23], [24]. Also, OpenACC works with various compilers and requires less programming effort, which gives it the ability to add parallelism to existing code with fewer changes, decreasing programmers' workloads and improving their productivity. Finally, OpenACC supports three levels of parallelism through three OpenACC clauses, Gang, Worker, and Vector, which correspond to coarse-, medium-, and fine-grained parallelism, respectively.

IV. OPENACC ERROR CLASSIFICATIONS
FIGURE 1. OpenACC Error Classifications.

After analyzing the recent OpenACC version 2.7 documents [20], consulting OpenACC-related books, and conducting several experiments, we classify and identify several run-time errors, and we built several programs that simulate these errors to discover their behavior and effects on OpenACC-related applications, as shown in Figure 1. Programmers can cause these errors when they try to parallelize their applications using OpenACC. Also, the OpenACC documents give some directions and instructions for avoiding certain errors, and when programmers do not follow these instructions, errors can result. In the following, we classify the OpenACC run-time errors that cannot be detected by the compilers, explain each class, and discuss their causes.

A. OPENACC DEVICE-BASED DATA TRANSMISSION ERRORS
Data management is one of the features supported by OpenACC, which uses data clauses to conduct data movement between the CPU and the GPU and vice versa. Programmers must be aware of the usage of each data clause and of how to use them in both structured and unstructured data regions. In this classification, we identify run-time errors resulting from the mishandling of OpenACC data clauses and directives, which in turn leads to non-deterministic behaviors or wrong results. OpenACC data regions are divided into structured and unstructured data regions, as explained in Section 3. The following identifies and explains the OpenACC data-clause-related errors that cannot be detected by the compiler.

1) ERRORS LEADING TO RUN-TIME ERROR MESSAGES AFTER EXECUTING THE PROGRAM
Sometimes when programmers misuse OpenACC data clauses, the compiler will not detect the presence of errors, but after compilation and during runtime, an error message will be issued indicating an invalid value, without giving extra information about the error type or its cause. Several scenarios explain this type of error and its causes, and it can occur in both structured and unstructured OpenACC data regions.

In the unstructured OpenACC data region, several scenarios can cause this type of error. One scenario arises when a variable appears in the copyout clause of the exit data region without the same variable appearing in any data clause of the enter data region, which will cause an error at runtime without being detected by the compiler. As shown in Figure 2a, the array "a" in the copyout clause will cause this error because the GPU is asked to copy the array "a" back to the CPU while this variable is not present on the GPU, which then results in an "invalid value" error message during runtime. Another case is when programmers write the data clause in the exit data region without writing the enter data region at all, as shown in Figure 2b.

(a) Different copyin and copyout variables
(b) Forgetting to write the enter data region variables
(c) Deleting the array before the copyout
FIGURE 2. Errors leading to run-time error messages in the unstructured data region.

The third situation that causes this error is deleting a variable before using the same variable in the copyout data clause, as in Figure 2c. Finally, there is a syntax error that can also lead to this type of error and is not detected by the compiler: the programmers write the enter data region directive without the "acc" keyword, as shown in Figure 3. In this case, the same array "a" is in both the enter and exit data regions, but the compiler will ignore the enter data region directive, which affects the application in the same way as in Figure 2b.

FIGURE 3. Syntax error causing a run-time error in the unstructured data region.

In the structured OpenACC data region, when programmers use OpenACC data clauses to manage data movement between the GPU and the CPU or vice versa, and the array is not present on the GPU, or is only partially present, for any reason, this will not be detected by the compiler, and the error message will be issued during runtime. One cause of this situation is multiple routines controlling the data movements in the same application; therefore, programmers should carefully trace the variables' movements both from and to the GPU. This error can be avoided by using the OpenACC API function that tests for the presence of a variable on the GPU before further execution, preventing the error from occurring.

2) ERRORS LEADING TO WRONG RESULTS WITHOUT PROGRAMMERS' AWARENESS
Using OpenACC data clauses without paying attention to their characteristics and features can lead to wrong results and cause the application to fail to meet the final user requirements. Moreover, programmers and compilers might not be aware that there is an error at all and therefore cannot detect it without a testing tool, which is currently unavailable, as we discussed in our survey published in [7]. Several scenarios cause this type of error, in both structured and unstructured OpenACC data regions.

In unstructured OpenACC data regions, if the programmers want to copy data from the CPU to the GPU but mistakenly use the data clause create instead of copy or copyin, this will lead to wrong results without the programmers' or compiler's awareness. Figure 4a shows an example of this error, in which the programmers use create for the array "a"; this creates space on the GPU without considering the previous values of the array "a" on the CPU, which eventually leads to wrong results. Another case of this error arises when the programmers mistakenly use the delete clause in the exit data region although the variable still needs to be used on the CPU, as shown in Figure 4b; the variable is then not copied from the GPU back to the CPU, leading to wrong results. Similarly, when the exit data region directive has not been written, or its "acc" keyword has been forgotten, this also leads to wrong results, as shown in the code in Figures 4c and 4d, respectively.
Citation information: DOI 10.1109/ACCESS.2019.2935498, IEEE Access copyin, the array will be copied to the GPU and stay there without being copied back to the CPU. Also, if the keyword "data" has not been written in the structured data region directive, as in Figure 5b, any data clause will not be activated because the compiler simply ignores the data region directive and completes the compilation, which results in incorrect values. However, when the keyword "acc" is missing, the PGI compiler 18.4 will generate implicit copy for the array "a" that only occurred in the structured data region, but in the unstructured data region the compiler will (a) Using create clause instead of copy or copyin clause do nothing, as we see in the previous examples. (a) Using copyin instead of copy (b) Delete the array without copyout (b) Missing the keyword "data" FIGURE 5. Errors leading to wrong results in the structured data region. (c) Forget the exit data region directive B. OPENACC MEMORY ERRORS In this classification, this type of error will result in wrong results in some cases, similar to the previous classification, but this error will also affect application performance and GPU memory. The main concept behind this classification is that we identify cases of some unnecessary programming operations such as keeping unused variables and matrices in the GPU. Also, they create temporary arrays in the GPU for some operations and keep them unnecessarily after finishing operations, which can affect system performance by (d) Forget "acc" keyword at the exit data region directive allocating unnecessary data to GPU memory and consuming FIGURE 4. Errors leading to wrong results in the unstructured data space and energy. In Figure 6, the arrays "a" and "b" have region. 
been copied to the GPU for calculating the array "c", but after this operation has been completed, these two arrays still In the structured OpenACC data region, there are different consume GPU memory without any further usage. In this causes of this type of error, including using the wrong data case, the programmers should delete these two arrays per the clause or forgetting to write the OpenACC data directives exit data region directive. Also, further errors can result from correctly, which are considered syntax errors, but the using the same GPU memory with another part of the code, compilers do not detect them. For example, in Figure 5a, the which conflicts with GPU memory or slows down the programmers use the copyin data clause instead of the copy execution of the other operation. This type of error only data clause, while the array "a" needs to be copied from the happens in the unstructured data region because the CPU to the GPU and then copied back to the CPU. In this programmers’ responsibility is to use the enter and exit data situation, the final result will be incorrect because by using VOLUME XX, 2017 9 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2935498, IEEE Access regions directives correctly and to determine allocation and OpenACC race condition behaves differently and has deallocation operations based on the data clause directives. In different causes as a result of the nature of OpenACC and its the structured data region, the programmers determine only execution in a heterogeneous system architecture. 
The race the data clause that they need, and the compiler deals with condition can arise from executing processes concurrently in internal operations. When exiting the data region, the exit multiple threads without considering the sequence of the data directive handles GPU memory deallocation by using execution, while the thread execution sequence is critical to either the delete or copyout data clauses. Finally, one of the the final result. Also, a race condition can occur when there most important issues that some programmers might not be is competition between several threads to access the same aware of is if they have as many exit data directives for a memory location. Additionally, when programmers use given array as enter data directives. OpenACC for parallelizing their applications, there is no guarantee of the thread execution order [22]. As a result, the programmers are responsible for examining their code to make sure there is no data dependency, and they should not make any assumptions about the thread order execution. Therefore, OpenACC is more likely to have a race condition resulting from different causes and situations. We classify the OpenACC race condition causes into the following six categories: FIGURE 6. Memory error causing memory consumption for unnecessary 1) HOST AND DEVICE SYNCHRONIZATION RACE variables In some situations, the synchronization between the CPU (host) and the GPU (device) is important to maintain data Another cause of this error is programmers’ use of coherence. Therefore, data updating operations between host OpenACC data clause variables designated by functions; this and device and vice versa should be done carefully, and designation should be determined by reference, not by value. programmers must determine when and how to update their In Figure 7, the variable "val" in the function "func" is not data; otherwise, this can cause a race condition. 
The code in the same as the variable "val" in the copyin data clause in the Figure 8 is an example of a race condition that occurs main program. When the programmers call passes "val" by because of CPU/GPU synchronization. The elements in the value, this means it will make a copy of "val" and send it to "hist" array might be updated by multiple threads the function, which will then add it to the present data clause. concurrently, which causes a data race. The OpenACC Therefore, an error message will be issued to indicate that the update directive updates the values of the "hist" array variable "val" is not present in the GPU. The solution to this between GPU and CPU in the same data region without error is to pass the variable "val" by reference, which lets the ensuring data independence between each element because compiler know that the variable "val" is on the GPU, and the of the OpenACC parallel directive, which depends on the function "func" can then use it directly. programmers to ensure data independence for each element in the array. Finally, when programmers use "kernels" to parallelize their code, they will avoid the data race, but they will lose loop parallelism because the code will execute sequentially, as the usage of "kernels" relies on the compiler to determine dependency in this case. FIGURE 8. Race condition because of synchronization between host and device. FIGURE 7. Calling by value for data clause variable causing an error. 2) LOOP PARALLELIZATION RACE C. RACE CONDITION When programmers want to parallelize their loops, they can The race condition is a common error in parallel applications, use OpenACC directives to enable the loop body to be as well as in different programming models. However, the executed in parallel using concurrent hardware execution VOLUME XX, 2017 9 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/. 
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2935498, IEEE Access threads. The loop iterations might run in parallel at the same will also not be detected by the compiler, but also will not time without considering the iteration order. Moreover, the yield wrong results. However, the loop will be executed last iteration can be executed and completed before the first sequentially without the programmers' awareness, which will iteration, which will lead to a potential race condition if the affect the overall application performance, as shown in execution order is important to the application. Also, the data Figure 10b. dependency between iterations causes a race condition. As a result, based on the application analysis and requirements, the programmers should parallelize their loops carefully to ensure there is no dependency in their loop body because if so, that will lead to a race condition, causing the application to fail to meet the user requirements. Figure 9a shows an example of a race condition resulting from data dependency. In this example, some x[i] may be read or written by two different threads simultaneously, which causes a race condition because of the data dependency in the second loop. Also, in Figure 9b, when using the keyword "kernels" instead (a) Missing the keywords "parallel" and "kernels" of "parallel" and the loop body has data dependency, this will generate an error message on runtime, indicating illegal address during kernel execution. (b) Missing the keyword "loop" FIGURE 10. Syntax error caused loop parallelism errors. (a) Data dependency race using parallel directive 3) SHARED DATA READ AND WRITE RACE Reading, writing, and updating to or from a shared array by multiple threads concurrently can be critical and error-prone, and all need to be handled carefully by programmers. 
For instance, writing data by thread "A" in a specific location and the same location is read by another thread "B"; in this case, there is a potential race condition. This shared data race can be classified into read-after-write, write-after-read, and write- after-write, but in read-only cases, there is no data race. The C and C++ programmers should use the "restrict" keyword (b) Data dependency race using kernels directive [25] whenever the pointers are not aliased and the compiler FIGURE 9. Race condition caused by data dependency. needs to be able to parallelize the code; otherwise, it will not work in parallel. Finally, we have multiple loops, each of which has a temporary array used during its own calculation, In addition, some syntax errors affect loop parallelism and as in Figure 11. If the array "tmp" is not declared to be prevent OpenACC from being executed correctly without private in each loop, then the "tmp" shared array might be compiler detection or programmer awareness. The results of accessed by different threads executing different loop these errors range from wrong results to performance issues. iterations in an unpredictable way, which will cause a race For instance, when programmers write the OpenACC loop condition and lead to the wrong results. directives without using the "parallel" or "kernels" directives, as shown in Figure 10a, it simply executes the loop, ignoring 4) ASYNCHRONOUS DIRECTIVES RACE the OpenACC directive and without compiler detection, All OpenACC directives are synchronous by default, which which leads to wrong results. Similarly, if the programmers means that when there are operations or instructions sent write the OpenACC directives using "parallel" instead of from the CPU to the GPU to be processed or calculated, the using "parallel loop" when they parallelize their loop, this CPU will wait for the GPU to complete the assigned work VOLUME XX, 2017 9 This work is licensed under a Creative Commons Attribution 4.0 License. 
For more information, see https://creativecommons.org/licenses/by/4.0/. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2935498, IEEE Access before continuing further CPU execution. By using summation, multiplication, maximum, minimum, and various synchronous operations, OpenACC ensures the operation bitwise operations. Some compilers can detect reduction of execution order when running on the GPU as run in the the summation variable and implicitly insert the reduction original program, ensuring that the program will work clause, but for other operations and other compilers, the correctly with or without an accelerator [21]. However, while programmers should always indicate the reductions in their waiting for GPU computation to be completed, the system codes. Even though the reduction clause can avoid some resources will be unused for a while, which is not an efficient data dependencies by combining the results of each copy of way to run the code. Therefore, OpenACC supports the reduction variable with the original variable at the end of asynchronous operations by assigning the work to the GPU the compute region, the absence or misuse of this clause will while the CPU can complete other operations, which allows lead to a race condition in some cases. Also, the variables the applications to be worked asynchronously and can thus involved in the reduction clause must be initialized properly enhance efficiency. based on the reduction clause operations before using them in the clause, or undefined behavior will result. Figure 13 shows an example of a race condition that occurs as a result of reduction clause absence. The variables "sum" and "multi" will cause a race condition resulting in wrong results, and they should be included in a reduction clause. 
Some compilers can detect that the variable "sum" needs to be in a reduction clause and implicitly generate one, but other compilers cannot detect it, and the variable "multi" is not detected at all.

FIGURE 11. Race condition caused by shared data read and write.

By using the OpenACC asynchronous and wait directives, the CPU can continue working while the GPU works at the same time, allowing the system to execute in a pipelined manner and thus enhancing performance. Consequently, programmers must ensure the synchronization of their applications when using the asynchronous and wait directives, to avoid causing a race condition between host and device. A race condition can be caused by misusing these directives without considering the dependencies between different parts of the code, including OpenACC compute regions and data movements. An example of a race condition caused by misuse of the OpenACC asynchronous and wait directives is shown in Figure 12: the computation of array "A" is assigned to asynchronous queue number 1, that of array "B" to queue number 2, and the CPU continues without waiting for these two queues to complete before computing array "C". The race condition occurs because of the dependency between different parts of the code, since arrays "A" and "B" are needed before calculating array "C", and it leads to wrong results.

FIGURE 12. Race condition because of the misuse of asynchronous directive.

5) REDUCTION CLAUSE RACE

FIGURE 13. Reduction clause error.

The OpenACC reduction clause generates a private copy of the variable for each loop iteration in the OpenACC compute region, then collects and reduces all these copies into one final result, based on the specified operation, which is returned from the compute region to the CPU.

6) OPENACC INDEPENDENT CLAUSE RACE
As previously discussed, data dependency can cause race conditions and run-time errors in OpenACC.
The operator applied to the scalar variable is specified in the reduction clause.

A host deadlock behaves differently when programmers use the OpenACC asynchronous and wait directives, as shown in Figure 15b. In this case, the deadlock does not occur at the end of the compute region, because the region has been assigned to an asynchronous queue; the CPU proceeds with the execution until it arrives at the wait directive, where the deadlock occurs. If there is no wait directive, the deadlock occurs at the end of the code. Finally, the interaction between the asynchronous and wait directives determines how the deadlock behaves.

Besides causing race conditions and run-time errors, data dependency can prevent code from being parallelized at all. The compiler's data-dependency analysis does not always have enough information to decide whether code can be parallelized or must work sequentially, as in the case of the OpenACC kernels directive. Therefore, programmers sometimes need to provide the compiler with this information, which can be done with the OpenACC independent clause: it tells the compiler that a specific loop is data-independent, meaning that there is no dependency or relationship between any two loop iterations, thus overriding the compiler's loop dependency analysis. The independent clause can therefore be a solution, but it can also be misused.
It causes a race condition when there is data dependency and the programmers nevertheless use this clause, which allows the compiler to generate code that computes the loop iterations with independent asynchronous threads. The code shown in Figure 14 is an example of using the independent clause on a loop that contains a data dependency, resulting in a race condition: the programmers tell the compiler that the loop iterations are data-independent with respect to each other, but the iterations still carry a dependency that causes a race condition, producing wrong and inconsistent results. Programmers must therefore be cautious when using this clause, because if any array element is written by one iteration while another iteration also writes or reads it, a race condition results, except for variables in a reduction clause.

FIGURE 14. Race condition caused by misuse of OpenACC independent clause.

D. DEADLOCK
Deadlock in OpenACC can be divided into host (CPU) deadlock and device (GPU) deadlock. Device deadlock can occur when two threads get stuck waiting for each other to release the lock on a shared resource. In addition, OpenACC is considered lock-free programming, which is more challenging than simply using locks to protect critical regions, as in OpenMP [22]. In OpenACC, host deadlock can result from a device livelock because of the hidden implicit barrier at the end of each compute region: the execution of the CPU does not proceed until all threads on the GPU have reached the end of the parallel compute region. In other words, the CPU keeps waiting for the GPU to finish its work while the GPU is continuously busy because of the livelock.

(a) Implicit barrier deadlock
(b) Wait directive deadlock
FIGURE 15. CPU deadlock because of GPU livelock.

E. LIVELOCK
Livelock is similar to deadlock, except that livelock occurs when two or more threads continuously change their state in response to each other's changes without performing any useful work.
We call this CPU deadlock an implicit barrier deadlock; as shown in Figure 15a, the second compute region, which calculates the array "B", contains an infinite loop, and the CPU is stuck waiting for all threads to reach the end of the compute region. The second case of host deadlock is similar to the previous one, but it behaves differently under the asynchronous and wait directives, as described above for Figure 15b.

In OpenACC applications, there is a relationship between deadlock and livelock, as discussed in the previous section and shown in Figure 15: the GPU livelock causes a deadlock on the CPU, while the CPU livelock keeps the process busy forever, so that none of the processes makes any progress or completes. Moreover, in a livelock the threads might not be blocked forever, and it is hard to distinguish a livelock from other long-running processes. Finally, livelock can lead to performance and power consumption problems because of its useless busy-wait cycles.

Different kinds of information are extracted from the tested source code and displayed in a log file for the programmer for further debugging; the source code's syntax and semantics are also checked to ensure its correctness. This information includes the total number of compute regions and structured and unstructured data regions, together with their starts and ends in the source code. In addition, the variables in each compute region are stored with their related information.

V.
ARCHITECTURE DESIGN

We designed a static testing tool for detecting errors in OpenACC-related applications, as discussed in the previous section; the design itself is presented in [26]. Our design has the flexibility to detect actual and potential run-time errors and to report them to the programmers together with related error information that helps the programmer correct them. Our architecture uses a static testing technique to build a new testing tool for OpenACC systems, which also benefits the tested system's execution time.

For each stored variable, the related information includes its compute region, which part of the equation it occupies and, if it lies within a loop, which loop. Any information related to loops, equations, and parallel threads is likewise displayed in the log file, with extensive details of the tested source code. In the following, we explain our static approach to detecting several types of OpenACC errors based on our previous classifications.

Our architecture is responsible for analyzing the input source code to detect static errors before compilation. While discussing OpenACC errors above, we noted that some OpenACC run-time errors can be detected from the source code; these errors should be resolved, because they will definitely occur at run time. After compilation and during run time, potential run-time errors might or might not occur, depending on the execution behavior.

A. OPENACC DATA CLAUSE DETECTION
We assume that any OpenACC data clause is potentially error-prone, because developers tend to use the clauses inefficiently and compilers cannot detect errors related to OpenACC data clause directives. Our approach uses static testing to statically detect OpenACC data clause related run-time errors, covering the copy, copyin, and copyout clauses as well as the other OpenACC data clauses.
The data clause related variables are monitored and tested to ensure their correctness and to detect any unsafe behavior, because these errors cannot be detected by any compiler and will lead to wrong results. These monitored data clause variables can also be instrumented for further dynamic testing during run time.

We can detect the causes of these potential errors from the source code before compilation by using static testing. However, these potential errors become run-time errors if they are not detected, so programmers should be warned so that they can consider them.

Finally, testing parallel applications is difficult because of the many factors and complicated scenarios that can cause run-time errors, as well as the nature and behavior of parallel applications. These reasons demand more effort in building the testing tool, in terms of covering every possible scenario of the test cases and the data.

VI. STATIC TESTING APPROACH FOR DETECTING OPENACC ERRORS
In this approach, we check and examine different OpenACC directives and clauses to identify actual and potential run-time errors.

In more detail, when OpenACC directives are found in the targeted source code, our algorithm determines the areas that have OpenACC data clauses, their types, and the region each belongs to, whether a compute, structured, or unstructured data region. This mechanism helps our algorithm call the related function to test that part of the code and statically check for run-time errors. In the case of testing the unstructured data region, there is an additional test: counting the number of variables that appear at the enter data region (in copyin or create clauses), and counting the number of variables that appear in copyout or delete clauses at the exit region.
Then these two numbers are compared; they should be equal, and if they are not, an error message with the related information is sent to the user, as we explain with Listing 1.

Since there is a wide range of errors and directives to be covered, we also classify our testing approach into several classes, which include OpenACC data clause checking, reduction checking, and asynchronous checking, as well as instrumenting data races and deadlocks for further checking in the dynamic phase of our approach. The targeted OpenACC source code is classified into potential-error data region code, free data region code, potential-error compute region code, free compute region code, and serial code. In detail, potential-error regions are regions with actual or potential errors, while free regions are regions without errors.

Our static tool understands tested source code that includes C++ and OpenACC, analyzing the source code's syntax and semantics.

The following algorithm in Listing 1 shows our main data clause checking algorithm. It starts by exploring the targeted source code to find OpenACC directives that include data clauses and by saving some OpenACC data clause related information for further use in our algorithm. When a compute region or a structured or unstructured data region is found, we first check its OpenACC syntax using (Chk_ACC_Syntax) to find related errors, because not all syntax errors are detected by the compiler, and we cover those errors that cannot be detected.
After completing the OpenACC syntax checking, our main checking algorithm maps each directive type to its appropriate checking algorithm, (Chk_Struct) or (Chk_Unstruct), to test structured data regions, unstructured data regions, and compute regions, as discussed with the algorithms in Listings 2 and 3. In the case of checking unstructured data regions, our approach counts the number of variables that appear at the enter and exit data regions and then compares the two counters; if they are not equal, an error message is issued to the developers, because in an unstructured data region the variables at the enter data region (in create and/or copyin data clauses) must appear at the exit data region (in copyout and/or delete data clauses).

For each tested variable, the algorithm determines the places in the source code to be examined in three locations: inside the data region, before this region, and after it. The data clause variable is tested differently in these three locations to ensure correctness and to report any errors that cannot be detected by any compiler. In addition, different data clause types are tested differently in our algorithm, based on their purpose and behavior. We studied and examined these data clauses to ensure that our approach can detect any potential run-time error related to OpenACC data clause directives, which, to the best of our knowledge, have not been detected before by any compiler or testing tool. The developers receive an error message whenever an incorrect situation arises.

In the case of data region testing, different situations can cause run-time errors and can be detected during the static testing phase of our algorithm, which displays an error message indicating the error place and type, with additional information for the developers to consider when correcting these errors. These situations include, for example, a copyin data clause variable that is part of an equation: this is indicated as an error because the copyin data clause does not return values to the CPU. Any results stored in this variable therefore remain on the GPU, and the CPU completes execution without considering the newest value of this variable, so the developer gets a wrong result without the compiler knowing or detecting anything.

The second area tested by our algorithm is before the structured data region or the compute region. The variables' initialization and assignment are tested in this area, detecting potential error situations that vary from one data clause type to another.

LISTING 1. The Main Data Clause Algorithm

Chk_ACC_Code Algorithm START
1: SEARCH for OpenACC directives containing data clauses in the source code
2: STORE Data_Clause_Var, Data_Clause_Type, Directive_Type, Directive_Line_No, ACC_File_Name IN Directive_Data_Clause_List
3: CALL Chk_ACC_Syntax(ACC_File_Name)
4: WHILE Data_Clause_Var IN Directive_Data_Clause_List
5:   CASE Directive_Type:
6:     Struct_Data_Region:
7:       CALL Chk_Struct(Data_Clause_Var, Data_Clause_Type)
8:     Unstruct_Data_Region:
9:       CALL Chk_Unstruct(Data_Clause_Var, Data_Clause_Type)
10:    Compute_Region:
11:      CALL Chk_Struct(Data_Clause_Var, Data_Clause_Type)
12:    IF Data_Clause_Type = create or Data_Clause_Type = copyin
13:      STORE No. of Data_Clause_Var IN Enter_Ustruct_Data_Clause
14:    ELSEIF Data_Clause_Type = delete or Data_Clause_Type = copyout
15:      STORE No. of Data_Clause_Var IN Exit_Ustruct_Data_Clause
16:    ENDIF
17:   ENDCASE
18: ENDWHILE
19: IF Enter_Ustruct_Data_Clause = Exit_Ustruct_Data_Clause
20:   PASS
21: ELSE
22:   Display Error Message
23:   ENDIF
Chk_ACC_Code Algorithm END

One error situation can occur in the area before the region, caused by a copyin data clause variable. Because the copyin data clause takes the variable from the CPU to the GPU, we need to determine whether such variables are initialized or hold a value before going to the GPU. Accordingly, if a copyin variable is not initialized, or is defined as part of an equation, an error message is displayed to the developer that includes the error type and the consequences of this error.

The algorithm in Listing 2 is responsible for checking data clauses found in the structured data region or the compute region, because they have the same behavior and roles. The targeted OpenACC data clauses in this algorithm are copy, copyin, copyout, create, and present. However, the present data clause is only instrumented for further dynamic testing, because it cannot be checked by our static approach. A structured data region is defined as a part of the code with explicit start and end points, where the data lifetime both begins and ends and the memory exists only within the data region. A compute region is defined as a region in which the computation is executed on GPUs, whether a parallel, kernels, or serial region. Our static approach targets data clause variables, checking their initialization, assignments, and usages to detect any error related to any data clause.

Finally, the last part of the code to be covered is after the region. Here, we focus on whether a data clause variable appears or is used after the region, which causes a run-time error in some situations; the developer then receives an error message indicating that. These error situations include the appearance of copyin and create data clause variables, because these data clauses do not move data out of the GPU, so their results are not seen by the CPU and the final result is wrong.
Some of these tests are similar for all variables, and some differ depending on the data clause of the tested variable. The algorithm in Listing 2 begins by receiving the tested data clause variable together with its data clause type, and then examines the variable at each location recorded for it.

LISTING 2. Check Structured Data Clause Algorithm

Chk_Struct Algorithm START
1: RECEIVE Data_Clause_Var, Data_Clause_Type
2: SEARCH for Data_Clause_Var locations in source code
3: STORE Data_Clause_Var, Data_Clause_Type, Data_Clause_Var_Line_No, Data_Clause_Var_File_Name, Data_Clause_Var_Location IN Data_Clause_Var_List
4: WHILE Data_Clause_Var in Data_Clause_Var_List
5:  CASE Data_Clause_Var_Location:
6:   In_Data_Region:
7:    IF Data_Clause_Var found
8:     IF Data_Clause_Var is part of an equation
9:      IF Data_Clause_Var is define && Data_Clause_Type = copyin
10:      Display Error Message
11:     ELSEIF Data_Clause_Var is reused && Data_Clause_Type = copyout
12:      Display Error Message
13:     ELSEIF Data_Clause_Var is reused && Data_Clause_Type = copyin && the define_Var is not in copy or copyin
14:      Display Error Message
15:     ELSEIF Data_Clause_Var is define && Data_Clause_Var is reused && (Data_Clause_Type = copyin or Data_Clause_Type = copyout)
16:      Display Error Message
17:     ENDIF
18:    ELSEIF Data_Clause_Var is called by function
19:     IF Data_Clause_Var is called by value
20:      Display Error Message
21:     ELSEIF Data_Clause_Type = copyout
22:      Display Error Message
23:     ENDIF
24:    ENDIF
25:   ELSE
26:    Display Error Message
27:   ENDIF
28:  Before_Data_Region:
29:   IF Data_Clause_Var found
30:    IF Data_Clause_Var initialized
31:     PASS
32:    ELSEIF Data_Clause_Var used in the equation as define && (Data_Clause_Type = create or Data_Clause_Type = copyout)
33:     Display Error Message
34:    ELSE
35:     Display Error Message
36:    ENDIF
37:   ELSE
38:    Display Error Message
39:   ENDIF
40:  After_Data_Region:
41:   IF Data_Clause_Var found
42:    IF Data_Clause_Type = create or Data_Clause_Type = copyin
43:     Display Error Message
44:    ELSE
45:     PASS
46:    ENDIF
47:   ELSE
48:    Display Error Message
49:   ENDIF
50:  ENDCASE
51: ENDWHILE
Chk_Struct Algorithm END

Regarding the unstructured data region, the algorithm in Listing 3 detects run-time errors that occur in data clauses, including create and copyin at the enter data region and copyout and delete at the exit data region. In addition, a data clause variable that appears at the enter data region must appear in some data clause at the exit data region; otherwise, the result is undetermined behavior or, in the worst case, run-time errors. The unstructured data region differs from the structured data region in several aspects, including the facts that it can have multiple starting and ending points, that it can branch across multiple functions, and that its memory exists until explicitly deallocated. Furthermore, the enter data directive handles device memory allocation, where the developer can use only the create or copyin data clauses, while the exit data directive handles device deallocation, where developers can use only the copyout or delete data clauses.

Similar to the previous algorithm, this checking process first receives the data clause variable and type from the main algorithm and then scans the source code to locate the variable's occurrences in the data region, before it, and after it. An additional check tests each variable's appearance at the enter and exit of the region: when a variable appears at only one of them, an error message is displayed to the developer. The second step checks the data clause variable at the three locations, based on its data clause type, to detect any potential errors and report them to the developers.

One example of the checking process is checking the copyin clause in the unstructured data region, which is used to allocate memory on the GPU and copy data from the host (CPU) to the device (GPU) when entering the data region. In this case, the copyin variables at the enter data directive should be deleted or copied out when exiting the data region; however, the compiler does not detect the related run-time errors that can occur. When the algorithm in Listing 3 deals with a copyin clause in an unstructured data region, there are some similarities to and some differences from the structured case. Before the unstructured data region, our algorithm uses the same mechanism to detect any related errors that can be caused by the copyin variables. In the area after the data region, however, the variable is tested for whether it appears after the region; if the same variable is not part of a copyout clause at the exit data directive, there is an error, and the developer receives an error message to address this problem.

Finally, when the variable is inside the unstructured data region, there are several cases that the algorithm in Listing 3 detects as errors, sending error messages to the developers. For example, the variables in the copyin clause must appear in either the copyout or the delete clause, but not in both; appearing in both is detected and reported as an error.

LISTING 3. Check Unstructured Data Clause Algorithm

Chk_Unstruct Algorithm START
1: RECEIVE Data_Clause_Var, Data_Clause_Type
2: SEARCH for Data_Clause_Var locations in source code
3: STORE Data_Clause_Var, Data_Clause_Type, Data_Clause_Var_Line_No, Data_Clause_Var_File_Name, Data_Clause_Var_Location IN Data_Clause_Var_List
5: WHILE Data_Clause_Var in Data_Clause_Var_List
6:  IF (Data_Clause_Type = copyin or Data_Clause_Type = create) && (Data_Clause_Var not found at exit data region directive)
7:   Display Error Message
8:  ELSEIF (Data_Clause_Type = copyout or Data_Clause_Type = delete) && (Data_Clause_Var not found at enter data region directive)
9:   Display Error Message
10: ENDIF
11: CASE Data_Clause_Var_Location:
12:  In_Data_Region:
13:   IF Data_Clause_Var found
14:    IF Data_Clause_Var is part of an equation
15:     IF Data_Clause_Var is define
16:      IF Data_Clause_Type = delete
17:       Display Error Message
18:      ELSEIF (Data_Clause_Type = create or Data_Clause_Type = copyin) && Data_Clause_Var not found in copyout at exit
19:       Display Error Message
20:      ENDIF
21:     ELSEIF Data_Clause_Var is reused
22:      IF Data_Clause_Type = copyout
23:       Display Error Message
24:      ELSEIF Data_Clause_Type = create or Data_Clause_Type = copyin
25:       IF defined_var not in copyout at exit
26:        Display Error Message
27:       ELSEIF Data_Clause_Var not in delete at exit
28:        Display Error Message
29:       ENDIF
30:      ENDIF
31:     ELSEIF Data_Clause_Var is define and reused
32:      IF Data_Clause_Type = delete
33:       Display Error Message
34:      ELSEIF Data_Clause_Type = copyout && Data_Clause_Var not at enter region
35:       Display Error Message
36:      ELSEIF (Data_Clause_Type = copyin or Data_Clause_Type = create) && Data_Clause_Var is not in copyout at exit region
37:       Display Error Message
38:      ENDIF
39:     ENDIF
40:    ELSEIF Data_Clause_Var is called by function
41:     IF Data_Clause_Var is called by value
42:      Display Error Message
43:     ELSEIF Data_Clause_Var is called by reference
44:      IF (Data_Clause_Type = copyin or Data_Clause_Type = create) && Data_Clause_Var not in delete at exit
45:       Display Error Message
46:      ELSEIF Data_Clause_Type = copyout
47:       Display Error Message
48:      ENDIF
49:     ENDIF
50:    ENDIF
51:   ELSE
52:    Display Error Message
53:   ENDIF
54:  Before_Data_Region:
55:   IF Data_Clause_Var found
56:    IF Data_Clause_Var initialized
57:     PASS
58:    ELSEIF Data_Clause_Var used in the equation as define
59:     IF Data_Clause_Type = create
60:      Display Error Message
61:     ELSEIF (Data_Clause_Type = copyout or Data_Clause_Type = delete) && Data_Clause_Var not in copyin or create at enter
62:      Display Error Message
63:     ENDIF
64:    ELSE
65:     Display Error Message
66:    ENDIF
67:   ELSE
68:    Display Error Message
69:   ENDIF
70:  After_Data_Region:
71:   IF Data_Clause_Var found
72:    IF Data_Clause_Type = delete
73:     Display Error Message
74:    ELSEIF (Data_Clause_Type = copyin or Data_Clause_Type = create) && Data_Clause_Var not in copyout at exit
75:     Display Error Message
76:    ENDIF
77:   ELSEIF Data_Clause_Var not found
78:    IF Data_Clause_Type = copyout
79:     Display Error Message
80:    ELSEIF (Data_Clause_Type = copyin or Data_Clause_Type = create) && Data_Clause_Var not in delete at exit
81:     Display Error Message
82:    ELSEIF Data_Clause_Var in delete and copyout at exit
83:     Display Error Message
84:    ENDIF
85:   ENDIF
86:  ENDCASE
87: ENDWHILE
Chk_Unstruct Algorithm END

Finally, regarding data clause detection, the present data clause indicates that the listed variables are already present on the device, so no data movement takes place; it is most frequently used when a data region exists in a higher-level routine. Because there is no data movement, we do not check the variable behavior; we check only the syntax with the previously mentioned algorithms. We cannot detect run-time errors resulting from the present clause using our static approach, and as a result we mark this data clause for further testing, because it needs to be tested using dynamic testing techniques.

B. OPENACC RACE CONDITION DETECTION
A race condition can have different causes, including processes executed by multiple threads where the execution sequence, timing, and order make a difference in the result. It also happens when two memory accesses in the program are performed concurrently by two threads and target the same location. In addition, when using OpenACC, developers cannot guarantee the threads' execution order, and because it is the developers' responsibility to ensure that there is no data dependency, OpenACC code is more prone to race conditions. Therefore, our static approach focuses on covering the different causes of OpenACC race conditions to help ensure race-free OpenACC code.

Our tool analyzes the user source code to extract the information used in detecting OpenACC run-time errors. From the user source code, our tool obtains the following:
1) Counting the number of compute regions and structured and unstructured data regions, and determining the start and end of each of them.
2) Detecting whether there is an independent clause and in which compute region it is located.
3) Creating a variable list for all variables located in each compute region and storing them in the data structure.
4) Detecting loops in each compute region for further investigation in the data race test, including the following information:
- the total number of for loops in the user code, counting loops in compute regions only (any loop outside a compute region is not considered);
- the starting and ending value of each for loop, together with its increment or decrement; even if the ending value is a variable, our tool scans the user source code to identify the value of the ending variable of the loop;
- the compute region of each loop.
5) Detecting all equations located in the compute regions.
6) Storing each array variable in each equation with its status (read/write) and its index.

Our approach has several techniques for addressing race conditions based on their causes, as discussed earlier. The following is our solution for each class of race condition.

1) HOST AND DEVICE SYNCHRONIZATION RACE DETECTION
The programmer should be careful when updating data between host and device and should know when and how to do this updating; otherwise it can cause a race condition. Our static approach therefore investigates every update between host and device to ensure data coherence and warns the developers of any potential error. Data-dependency analysis can be used in some cases if it involves the updating operation. Also, our static approach inserts instrumentation statements into the code.

The following algorithm in Listing 4 shows the process of detecting race conditions in loops:

LISTING 4.
Data Dependency Race Detection Algorithm

Chk_Data_Dep_Race_Algorithm START
1: SEARCH for Loop inside an OpenACC Compute Region
2: STORE Loop_ID, Loop_Line_No, Loop_Iteration_Var, Loop_Start_Value, Loop_End_Value, Loop_Compute_Region, Loop_Has_Independent
3: SEARCH for Equation inside the loop
4: STORE Equation_ID, Equation_Compute_Region, Equation_Loop_ID, Equation_Array_Var_Name, Equation_Array_Var_Status, Equation_Array_Var_Index, Equation_Array_Var_Type, Equation_Array_Var_Place
5: IF (Equation_Array_Var_Name_1 = Equation_Array_Var_Name_2) && (Equation_Compute_Region_1 = Equation_Compute_Region_2)
6:  IF Equation_Array_Var_Type = Array
7:   IF (Equation_Array_Var_Index_1 != Equation_Array_Var_Index_2) && (Equation_Array_Var_Place_1 != Equation_Array_Var_Place_2)
8:    Display Error Message: Report Loop Parallelization Race Condition
9:   ELSEIF Equation in OpenACC Independent Clause
10:   Display Error Message: Report Independent Clause Race Condition
11:  ENDIF
12: ENDIF
13: ENDIF
Chk_Data_Dep_Race_Algorithm END

The instrumentation statements mentioned above check the values between host and device to ensure correctness: they capture the updated variables on one side (GPU or CPU) before the updating process and compare them with the developer's updated variable on the other side (CPU or GPU). If the two values differ, this is reported to the developers as an error. These instrumentation statements are executed during the dynamic phase of our approach.

2) LOOP PARALLELIZATION RACE DETECTION
When a for loop is parallelized, the threads use different values of the loop iteration variable and may run in parallel at the same time. OpenACC makes no guarantee regarding the execution order of the threads; it is even possible that the last iteration of the loop completes before iteration zero. This leads to a potential race condition.

3) SHARED DATA READ AND WRITE RACE DETECTION
For detecting read/write race conditions, our tool builds a Therefore, our static approach will instrument the source table for each equation that has more than one array variable code for data dependency analysis as well as execution order because it could have data dependency, using the related loop analysis. In our situation, we will focus on the OpenACC information to find the index values. Then, compare their parallel region because when it has been used, the developer values to find if there are two or more threads with writing is responsible for ensuring that the code has enough and reading in the same variable. We store this information parallelism, while the kernel directives depend on the in the data structure, as shown in Figure 16: compiler to detect whether the code has enough parallelism or should be executed sequentially. Our static approach will also mark a parallel clause for further dynamic investigation during run-time. Our tool will investigate each compute region, loop, and FIGURE 16. Data structure to store information for each equation equation in detail, analyzing the variables in these compute regions and their behavior and values, reporting any possible race condition to the programmer. Dependency analysis also Our tool computes the values for each equation in a given will be conducted for any equation in any given loop to compute region and compares to test if there is a thread read detect any actual or potential race condition. Also, the errors and another write in the same place; this indicates a race in loops that prevent execution will be detected in the case of condition called read/write race condition. Also, if there is writing wrong conditions or any other mistakes in the loop one thread writing to place and another thread writing to the statement. same place, this will cause a write/read race condition. 
Also, For detecting code parallelism issues, our tool will test in the case of two threads writing in the same variable, this the generation of threads, which includes gang and vectors. causes a write/write race condition. However, in the case of Our static phase is based on the analysis of the user source two threads reading from the same variable, this will not code and will generate gang and vectors for each compute cause a race condition. The data structure in Figure 16 will be regions to compare with the actual gang and vectors from the used to detect read and write from multiple threads’ race execution of the user code. Our static phase will insert condition in our static testing tool. This will be shown in the statements in the user code to extract the actual gang and following algorithm in Listing 5, which can detect read after vectors from the user code to compare them with our write, write after read, and write after write race conditions. generation of gang and vectors. This will help us to detect LISTING 5. Read Write Race Detection Algorithm any related errors to include user code work sequentially Chk_R_W_Race_ Algorithm START rather than work in parallel because of not using OpenACC 1: SEARCH for Equation in OpenACC Compute Region directives properly. This could help the user to enhance their 2: STORE Equation_ID, Equation_Compute_Region, Equation_Loop_ID, Equation_Array_Var_Name, Equation_Array_Var_Status, code’s performance because the user assumes that some Equation_Array_Var_Index, Equation_Array_Var_Type compute regions work in parallel, but they actually work 3: IF (Equation _ID_1 = Equation_ID_2) && (Equation _Array_Var_Name_1 = Equation_Array_Var_Name_2) 4: IF (Equation_Array_Var_Index_1 != Equation_Array_Var_Index_2) VOLUME XX, 2017 9 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/. 
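As a concrete illustration of what the dependency checks above flag, consider the following minimal C sketch (a hypothetical example, not taken from the paper's benchmarks; the OpenACC pragmas appear as comments so the fragment also compiles without an accelerator compiler):

```c
#include <stddef.h>

/* Loop-carried dependency: iteration i READS a[i - 1], which iteration
 * i - 1 WRITES. Under "#pragma acc parallel loop" the execution order of
 * iterations is not guaranteed, so this is a read/write race. */
void prefix_sum(size_t n, int *a) {
    /* #pragma acc parallel loop   <- would be flagged: a[i] depends on a[i-1] */
    for (size_t i = 1; i < n; ++i)
        a[i] += a[i - 1];
}

/* Safe counterpart: iteration i touches only x[i] and y[i], so the
 * loop can be parallelized without a race. */
void saxpy(size_t n, float alpha, const float *x, float *y) {
    /* #pragma acc parallel loop   <- independent iterations, no race */
    for (size_t i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```

A dependency analysis in the style of Listing 4 would report the first loop because the same array appears with two different indices, one read and one written, while the second loop passes the check.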
4) ASYNCHRONOUS DIRECTIVES RACE DETECTION
OpenACC supports both asynchronous and wait directives. When using asynchronous computation or data movement, developers are responsible for ensuring that the program has enough synchronization to resolve any data races between the host and the GPU. In OpenACC, there is an implicit default barrier at the end of the parallel accelerator region, and the execution of the local thread will not proceed until all threads have reached the end of the parallel region. By default, all OpenACC directives are synchronous: the CPU thread sends the required data and instructions to the GPU and then waits for the GPU to complete its work before continuing execution. Using asynchronous and wait directives allows the CPU to continue working while the GPU works at the same time, which allows pipelined execution and enhances performance. However, when developers use these directives without considering their system requirements, a race condition can result.

In this case, we focus on testing two main aspects. First, we test that there is no dependency between the asynchronous directives and that they can work in a pipelined fashion without errors; in other words, an asynchronous operation region should not have variables needed by another region, otherwise a race condition will occur because the data will arrive before or after the time at which it is needed. Second, our approach tests the completion of the asynchronous directives at the end of the region; otherwise, a deadlock situation can arise. The first situation can be tested during our static phase, with some instrumentation statements needed for further run-time investigation. The second case will be instrumented by our static approach and will need further dynamic testing to detect any errors that may occur. Finally, an asynchronous operation can cause deadlock and race conditions and needs to be marked by our static analyzer for further testing with a dynamic approach.

5) REDUCTION CLAUSE RACE DETECTION
One potential cause of a race condition in OpenACC is the misuse of the reduction clause. Therefore, our static approach conducts extra testing for this situation to ensure that there is no misuse of this clause, as the compiler will not detect such errors if there are any. With an OpenACC reduction clause, copies of the reduction variable are generated for the loop iterations, and all of these private copies from the different threads are reduced into one final result that returns to the CPU and is available at the end of the compute region. Reduction clause operators can be specified on scalar variables and include several operations; we focus here on the four main operations of summation, multiplication, maximum, and minimum, and leave the others for our future work. Some compilers will detect a reduction of the reduction variable and implicitly insert the reduction clause, but in other cases it is the programmer's responsibility to indicate reductions in OpenACC code.

Although some data dependencies can be avoided by using a reduction clause, misusing the reduction clause in some cases will lead to a race condition. A race condition can also be caused by the absence of the reduction clause. The reduction clause combines the result of each copy of the reduction variable from the different threads with the original variable at the end of the OpenACC compute region. Therefore, the variable should be initialized to a value based on the operator used before the reduction clause is entered; otherwise, undefined behavior will result and lead to a wrong final result that cannot be detected by compilers. The PGI compiler can detect the summation reduction in some cases and add it as an implicit reduction clause, but it cannot detect the other reduction operations, and the other compilers cannot detect any reduction operations.

Our static approach considers these problems, and the following algorithm in Listing 6 checks the potential errors related to a reduction clause, including the initialization of the reduction variables based on their operator types. Concerning reduction variables involved in multiple nested loops where two or more of these loops have associated loop directives, the reduction clause containing these variables must appear in each of those loop directives; otherwise, a race condition will occur.

LISTING 6. Check Reduction Clause Algorithm
Chk_Reduction Algorithm START
1: SEARCH for Red_Clause and Red_Clause_Var locations in source code
2: STORE Red_Clause_Var, Red_Clause_Operator, Red_Clause_Line_No, Red_Clause_Var_Value, Red_Clause_Var_Location IN Red_Clause_List
3: WHILE Red_Clause_Var in Red_Clause_List
4:   CASE Red_Clause_Var_Location:
5:     Before_Red_Clause:
6:       IF Red_Clause_Var initialized
7:         IF Red_Clause_Operator is "+" && Red_Clause_Var_Value != 0
8:           Display Error Message
9:         ELSEIF Red_Clause_Operator is "*" && Red_Clause_Var_Value != 1
10:          Display Error Message
11:        ELSEIF Red_Clause_Operator is "max" && Red_Clause_Var_Value != 0
12:          Display Error Message
13:        ELSEIF Red_Clause_Operator is "min" && Red_Clause_Var_Value < 100
14:          Display Error Message
15:        ENDIF
16:      ELSE
17:        Display Error Message
18:      ENDIF
19:    After_Red_Clause:
20:      IF Red_Clause part of parallel or kernels or loop directives
21:        IF Red_Clause_Var is part of an equation
22:          IF Red_Clause_Var is not a define
23:            Display Error Message
24:          ELSEIF Red_Clause_Var is a define && Red_Clause_Var span on nested loop
25:            IF Red_Clause_Var in Red_Clause not appear in multiple loop directives
26:              Display Error Message
27:            ENDIF
28:          ENDIF
29:        ELSE
30:          Display Error Message
31:        ENDIF
32:      ELSE
33:        Display Error Message
34:      ENDIF
35:  ENDCASE
36: ENDWHILE
37: WHILE Var in Source code
38:   IF Var is part of equation in OpenACC compute region
39:     IF Var IN "Var +=" or "Var = Var +" operations
40:       Display Error Message
41:     ELSEIF Var IN "Var *=" or "Var = Var *" operations
42:       Display Error Message
43:     ELSEIF Var IN MAX function
44:       Display Error Message
45:     ELSEIF Var IN MIN function
46:       Display Error Message
47:     ENDIF
48:   ENDIF
49: ENDWHILE
Chk_Reduction Algorithm END

Finally, our algorithm checks whether any of the variables included in an OpenACC compute region is subject to one of the reduction operations without a reduction clause being used; this will be reported as an error to the developers, as the absence of a reduction clause can lead to a race condition. In addition, our static approach inserts instrumentation statements for the parts that cannot be detected during the static phase, for further dynamic checking during run-time; this instrumentation will be used in the case of an absent reduction clause, when our static approach determines that the compute region has a reduction operation but the developers forgot to use the reduction clause. Our static approach will instrument this region by using our reduction-inserted statement.
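The initialization rule that Listing 6 enforces can be illustrated with a small C sketch (hypothetical code, not from the paper; without an OpenACC compiler the pragmas are simply ignored and the loops run sequentially):

```c
#include <stddef.h>

/* With "reduction(+:sum)" each thread gets a private copy of sum; at the
 * end of the compute region the private copies are combined with the
 * ORIGINAL value of sum. The variable must therefore start at the
 * identity of its operator (0 for "+", 1 for "*"), or the final result
 * is wrong -- and no compiler reports it. */
long sum_reduce(size_t n, const int *v) {
    long sum = 0;                      /* identity for "+" */
    #pragma acc parallel loop reduction(+:sum)
    for (size_t i = 0; i < n; ++i)
        sum += v[i];
    return sum;
}

long prod_reduce(size_t n, const int *v) {
    long prod = 1;                     /* identity for "*" */
    #pragma acc parallel loop reduction(*:prod)
    for (size_t i = 0; i < n; ++i)
        prod *= v[i];
    return prod;
}
```

Omitting the reduction clause entirely would turn the accumulation into concurrent unsynchronized updates of a shared scalar, which is exactly the write/write race described in the previous subsection.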
6) INDEPENDENT CLAUSE RACE DETECTION
When developers use the independent clause, which tells the compiler that a loop is data-independent, problems can arise if there actually is a dependency. It is the developers' responsibility not to use this clause when a data dependency exists, because the independent clause allows developers to assert that the iterations of the loop are data-independent of each other. As a result, our static approach will investigate whether the developer's decision is correct by conducting data-dependency analysis in this case. Our static tool will detect every OpenACC independent clause and determine its place in the source code and the compute region in which it is located. Then, the source code will be analyzed to detect any dependency in each equation of the independent compute region; if there is a dependency, our tool will detect the race condition that could result from it and report it to the user. This examination is needed because of the nature of the OpenACC independent clause: it tells the compiler that the programmer is responsible for making sure this area is independent, so the compiler cannot detect the dependency, and a race condition results. The algorithm in Listing 4 shows our static analysis for detecting data dependencies, including the independent clause analysis.

C. OPENACC DEADLOCK DETECTION
Because OpenACC hides some details, OpenACC compute regions have an implicit barrier at the end of each compute region, and we cannot be sure about the thread executions, or about the threads' arrival at the end of each compute region at a certain time. Therefore, we can consider the end of each compute region a potential deadlock point that might or might not occur. As a result, our static tool will mark the end of each compute region for further checking during run-time.

Also, one of the causes of an OpenACC deadlock in the CPU is a livelock in the GPU. This happens because of the nature of the implicit barrier at the end of each compute region: the GPU is busy with the livelock while the CPU waits for the GPU to finish its operation. With the asynchronous directive, a GPU livelock also causes a CPU deadlock, but with different behavior than with the wait directive, which will cause the CPU to deadlock at that statement. In this case, the deadlock behaviors depend on the interactions of the asynchronous and wait directives. To detect these situations, our static analysis will investigate the source code to ensure that there are no infinite loops in any compute region. Any situation that cannot be detected during our static phase will be marked for further checking.

Our static analyzer will thus partially detect OpenACC deadlock and mark some places in the code for further dynamic testing. The main purpose of our deadlock detection is to ensure there is no livelock in any compute region, because a livelock will cause a deadlock in the host. As a result, we analyze each loop in the compute region and examine its conditions to rule out any "always true" situation. However, our static analysis is limited to loops and the detection of their conditions; more testing will be needed during run-time for any undetected situations, which will be marked by our static analyzer, and a warning will then be sent to the programmer.
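The loop-condition analysis above can be sketched as a tiny predicate over counted loops (a hypothetical, simplified reconstruction, not the paper's actual implementation). For a loop of the form `for (i = start; i < bound; i += step)` inside a compute region, the condition can never become false when the step does not move the iterator toward the bound:

```c
#include <stdbool.h>

/* "Always true" loop-condition check for a counted loop
 *     for (i = start; i < bound; i += step)
 * inside an OpenACC compute region, e.g.
 *     #pragma acc kernels
 *     for (int i = 0; i < n; i--) { ... }   // i only decreases: livelock
 * Such a loop spins forever on the GPU, and the host then deadlocks at
 * the implicit barrier (or at a later wait directive). */
bool loop_never_terminates(long start, long step, long bound) {
    if (start >= bound)
        return false;   /* condition false at entry: loop body never runs */
    return step <= 0;   /* i never reaches bound: condition always true */
}
```

A real analyzer must of course also handle other comparison operators and iterator updates inside the loop body; the point is only that the termination test is decidable for simple counted loops, which is why the paper's static phase restricts itself to them and defers everything else to run-time checks.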
VII. DISCUSSION
In Section 4, we identified and classified run-time errors that can occur in OpenACC applications and explained them with examples. These errors can happen because of programmers' lack of understanding of OpenACC and misuse of some OpenACC directives and clauses. Some of these errors occur when programmers try to parallelize their code without fully understanding it and the regions that can be parallelized, considering the different aspects involved as well as the data movements. In addition, some errors are related to the nature of OpenACC and how it behaves, as well as to the high-level programming style used in OpenACC, where programmers only use the directives without considering the operations behind them, which is one of the OpenACC characteristics of which programmers should be aware.

We noticed that some syntax or logical errors caused some of the OpenACC run-time errors, but compilers do not detect these syntax errors and simply ignore a directive that is not written correctly. Some of the run-time errors might have similar names but behave differently in OpenACC and have different causes. In our study, we included only the run-time errors in OpenACC applications that cannot be detected by the compilers. Finally, it is the programmers' responsibility to understand their code and the different OpenACC data regions and compute regions, so as to maximize the parallelization of their code and benefit from OpenACC's capabilities and features.

In addition, we briefly explained our testing tool architecture design in Section 5, and we discussed our static approach to detecting OpenACC errors in Section 6. We used several techniques and algorithms to detect the different types of OpenACC errors based on their causes and behavior. Since we are dealing with large as well as error-prone applications, we chose the static approach to resolve as many errors as possible before compilation and without executing the code. This gives us the ability to analyze the code in detail and obtain full coverage of the source code. Static analysis lets us detect actual run-time errors as well as potential errors in the source code, which helps reduce the testing execution time by minimizing the errors that need further testing during run-time. Finally, our static approach will mark the code that has potential errors or needs further testing during run-time, as well as determining the code parts that need inserted statements for further testing.

VIII. EVALUATION
In this section, we present the evaluation of our static testing tool and discuss its capability to detect several types of OpenACC errors based on our classifications, as discussed before, as well as measuring the performance of our testing tool. The experiments were performed on an Intel(R) Core(TM) i7-7700HQ CPU (2.80 GHz) with 16 GB main memory and an NVIDIA GeForce GTX 1050 Mobile GPU. We used 50 benchmarks from five different benchmark suites, including NAS [27], SHOC [28], PolyBench-ACC [29], TORMENT OpenACC2016 [30], and EPCC [31], to conduct the experiments and measure our static approach's error coverage and testing performance.

Table 1 shows the OpenACC errors and our static tool's ability to detect them. Our evaluation takes each type of OpenACC error and assesses whether our tool detects it fully or partially, where full detection means that our static tool can detect this error, while partial detection means that our tool can detect some cases, while other cases need to be tested during run-time or investigated beyond static testing. In some cases, our tool cannot detect the errors by static testing at all, and they need dynamic testing due to the nature of the OpenACC error and the causes behind it.

TABLE 1. OUR STATIC TOOL EVALUATION

OpenACC Errors                        Detected by Our Static Approach*
Data Clause Related Errors
  Create Clause                       Y
  Copy Clause                         Y
  Copyin Clause                       Y
  Copyout Clause                      Y
  Delete Clause                       Y
  Present Clause                      X
Memory Errors                         Y
Data Transmission Errors              Y
Race Conditions
  Host/Device Synchronization Race    X
  Loop Parallelization Race           P
  Data Read and Write Race            Y
  Asynchronous Directives Race        X
  Reduction Clause Race               Y
  Independent Clause Race             Y
Deadlock
  Device Deadlock                     X
  Host Deadlock                       P
  Livelock                            P

* Y: Fully detected; P: Partially detected, needing further testing during run-time; X: Cannot be detected by our static tool, due to the nature of the error and its cause, and needs to be tested by dynamic analysis.

As Table 1 shows, the majority of the OpenACC data clause-related errors can be detected and resolved by our static approach, while race conditions and deadlock can be detected only partially and need further testing during run-time. However, our tool will minimize the number of errors that need to be detected during run-time, as well as the parts that need further testing, as marked by our static tool. Finally, our static testing tool has successfully detected OpenACC errors including data clause-related errors, data transmission errors, and memory errors, as well as some race condition and deadlock cases.

Table 2 shows a selective comparative study for our testing approach: from the 50 benchmarks we experimented with, we chose ten, two from each of the benchmark suites discussed before, as a sample to be displayed in the table. The table shows some statistics about the chosen benchmarks, including some OpenACC-related information that we used in our static testing approach. Finally, Figure 17 shows the testing time for each of the tested benchmarks, which will also be used in our future overhead measurements when we apply our insertion statements for our future dynamic testing.

IX. CONCLUSION
OpenACC has many advantages and features that have led to its increased use for building highly parallel systems working on heterogeneous architectures. OpenACC is designed for performance and portability: it maintains existing sequential code and parallelizes it with high-level directives, without requiring many details to be considered. This attracts more non-computer-science specialists to use OpenACC to accelerate their systems and build powerful simulations with minimal investment of effort or time in learning how to program GPUs. At the same time, misunderstanding OpenACC directives and clauses can lead programmers to misuse them or to fail to follow the OpenACC instructions correctly.
In this case, several run-time errors can occur due to OpenACC's nature and behavior, and programmers can cause these errors when trying to parallelize applications without following the directions that avoid them.

We conducted many experiments and built several applications to test and simulate the run-time errors that can occur in OpenACC, and we studied their behavior to better understand them and discover their causes and effects on the applications. Our contribution is to identify OpenACC errors that cannot be detected by compilers and that can happen without the programmers' awareness. We also classify these errors into five main types based on error similarity and their causes, as well as how they behave and their effects on the system. We explain these errors with examples and determine their causes with respect to the application output when they occur.

Another main contribution is the design and construction of a new static testing tool capable of detecting OpenACC errors by using static testing techniques. We built a testing tool that detects as many OpenACC errors as possible statically. Our tool has successfully detected OpenACC errors, including data transmission errors and memory errors, as well as some race condition and deadlock cases. Our tool has also detected potential OpenACC errors and marked them for further testing during run-time, which will be part of our future work.

Finally, to the best of our knowledge, there is no published work that identifies or classifies OpenACC-related errors, nor a testing tool designated for detecting errors in OpenACC applications. In our future work, we will build a dynamic testing tool for detecting the OpenACC run-time errors that we cannot detect, or can only partially detect, with our static testing tool.

TABLE 2. OPENACC STATISTICS FROM THE CHOSEN BENCHMARKS

Benchmark*       No. of OpenACC  No. of Parallel  No. of Kernels  No. of OpenACC
                 Data Clauses    Constructs       Constructs      Directives
StringMatch (1)  2               1                0               2
3dstencil (1)    2               1                0               4
27stencil (2)    3               2                0               8
himeno (2)       3               2                0               4
atax (3)         5               2                0               7
bicg (3)         4               2                0               7
BT (4)           1               0                0               2
SP (4)           2               0                0               7
Reduction (5)    2               0                2               4
Stencil (5)      8               4                0               6

* Benchmark suites: 1: TORMENT, 2: EPCC, 3: PolyBench-ACC, 4: NAS, 5: SHOC

FIGURE 17. Testing time for our OpenACC static testing approach.

ACKNOWLEDGMENT
This project was funded by the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah, under grant No. (DG-015-611-1440). The authors, therefore, gratefully acknowledge the DSR technical and financial support.

REFERENCES
[1] A. M. Alghamdi and F. E. Eassa, "Parallel Hybrid Testing Tool for Applications Developed by Using MPI + OpenACC Dual-Programming Model," Adv. Sci. Technol. Eng. Syst. J., vol. 4, no. 2, pp. 203-210, 2019.
[2] J. F. Münchhalfen, T. Hilbrich, J. Protze, C. Terboven, and M. S. Müller, "Classification of common errors in OpenMP applications," Lect. Notes Comput. Sci., vol. 8766, pp. 58-72, 2014.
[3] M. Süß and C. Leopold, "Common Mistakes in OpenMP and How to Avoid Them," in OpenMP Shared Memory Parallel Programming, 2008, pp. 312-323.
[4] V. Forejt, S. Joshi, D. Kroening, G. Narayanaswamy, and S. Sharma, "Precise Predictive Analysis for Discovering Communication Deadlocks in MPI Programs," ACM Trans. Program. Lang. Syst., vol. 39, no. 4, pp. 1-27, Aug. 2017.
[5] T. Hilbrich, M. Weber, J. Protze, B. R. de Supinski, and W. E. Nagel, "Runtime Correctness Analysis of MPI-3 Nonblocking Collectives," in Proceedings of the 23rd European MPI Users' Group Meeting (EuroMPI 2016), 2016, pp. 188-197.
[6] S. Cook, Common Problems, Causes and Solutions, 2012.
[7] A. M. Alghamdi and F. E. Eassa, "Software Testing Techniques for Parallel Systems: A Survey," IJCSNS Int. J. Comput. Sci. Netw. Secur., vol. 19, no. 4, pp. 176-186, 2019.
[8] R. Surendran, "Debugging, Repair, and Synthesis of Task-Parallel Programs," Rice University, 2017.
[9] J. Zhou, S. Silvestro, H. Liu, Y. Cai, and T. Liu, "UNDEAD: Detecting and Preventing Deadlocks in Production Software," in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, 2017, pp. 729-740.
[10] The Open MPI Organization, "Open MPI: Open Source High Performance Computing," 2018. [Online]. Available: https://www.open-mpi.org/.
[11] E. Saillard, P. Carribault, and D. Barthou, "PARCOACH: Combining static and dynamic validation of MPI collective communications," Int. J. High Perform. Comput. Appl., vol. 28, no. 4, pp. 425-434, 2014.
[12] H. Ma, S. R. Diersen, L. Wang, C. Liao, D. Quinlan, and Z. Yang, "Symbolic Analysis of Concurrency Errors in OpenMP Programs," in 2013 42nd International Conference on Parallel Processing, 2013, pp. 510-516.
[13] P. Chatarasi, J. Shirako, M. Kong, and V. Sarkar, "An Extended Polyhedral Model for SPMD Programs and Its Use in Static Data Race Detection," in 23rd International Workshop on Languages and Compilers for Parallel Computing, vol. 9519, 2017, pp. 106-120.
[14] R. Sharma, M. Bauer, and A. Aiken, "Verification of producer-consumer synchronization in GPU programs," ACM SIGPLAN Not., vol. 50, no. 6, pp. 88-98, 2015.
[15] P. Collingbourne, C. Cadar, and P. H. J. Kelly, "Symbolic Testing of OpenCL Code," in Hardware and Software: Verification and Testing, 2012, pp. 203-218.
[16] J. Yang, "A Validation Suite for High-Level Directive-Based Programming Model for Accelerators," University of Houston, 2015.
[17] C. Wang, R. Xu, S. Chandrasekaran, B. Chapman, and O. Hernandez, "A validation testsuite for OpenACC 1.0," in Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2014, pp. 1407-1416.
[18] K. Friedline, S. Chandrasekaran, M. G. Lopez, and O. Hernandez, "OpenACC 2.5 Validation Testsuite Targeting Multiple Architectures," 2017, pp. 557-575.
[19] M. McCorkle, "ORNL Launches Summit Supercomputer," The U.S. Department of Energy's Oak Ridge National Laboratory, 2018. [Online]. Available: https://www.ornl.gov/news/ornl-launches-summit-supercomputer.
[20] OpenACC Standards, "The OpenACC Application Programming Interface version 2.7," 2018.
[21] S. Chandrasekaran and G. Juckeland, OpenACC for Programmers: Concepts and Strategies, 1st ed. Addison-Wesley Professional, 2017.
[22] R. Farber, "From serial to parallel programming using OpenACC," in Parallel Programming with OpenACC, Elsevier, 2017, pp. 1-28.
[23] M. Daga, Z. S. Tschirhart, and C. Freitag, "Exploring Parallel Programming Models for Heterogeneous Computing Systems," in 2015 IEEE International Symposium on Workload Characterization, 2015, pp. 98-107.
[24] J. A. Herdman, W. P. Gaudin, O. Perks, D. A. Beckingsale, A. C. Mallinson, and S. A. Jarvis, "Achieving portability and performance through OpenACC," in Proc. WACCPD 2014: 1st Workshop on Accelerator Programming Using Directives (held in conjunction with SC 2014), 2015, pp. 19-26.
[25] OpenACC Organization, "OpenACC Programming and Best Practices Guide," 2015.
[26] A. M. Alghamdi and F. Elbouraey, "A Parallel Hybrid-Testing Tool Architecture for a Dual-Programming Model," Int. J. Adv. Comput. Sci. Appl., vol. 10, no. 4, pp. 394-400, 2019.
[27] R. Xu, X. Tian, S. Chandrasekaran, Y. Yan, and B. Chapman, "OpenACC Parallelization and optimization of NAS parallel benchmarks," in GPU Technology Conference, 2014.
[28] A. Danalis et al., "The Scalable Heterogeneous Computing (SHOC) benchmark suite," in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU '10), 2010, p. 63.
[29] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, "Auto-tuning a High-Level Language Targeted to GPU Codes," in Proceedings of Innovative Parallel Computing (InPar), 2012.
[30] D. Barba, A. Gonzalez-Escribano, and D. R. Llanos, "TORMENT OpenACC2016: A Benchmarking Tool for OpenACC Compilers," in 2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), 2017, pp. 246-250.
[31] EPCC, "EPCC OpenACC benchmark suite," The University of Edinburgh, 2013. [Online]. Available: https://www.epcc.ed.ac.uk/research/computing/performance-characterisation-and-benchmarking/epcc-openacc-benchmark-suite.

AHMED MOHAMMED ALGHAMDI was born in Jeddah, Saudi Arabia, in 1983. He received the B.Sc. degree in computer science from King Abdulaziz University, Jeddah, Saudi Arabia, in 2005, and the first M.Sc. degree in business administration from King Abdulaziz University, Jeddah, Saudi Arabia, in 2010. He received a second master's degree in internet computing and network security from Loughborough University, UK, in 2013. He is currently pursuing the Ph.D. degree in computer science at King Abdulaziz University, Jeddah, Saudi Arabia. His research interests include high performance computing, parallel computing, programming models, software engineering, and software testing.

FATHY ELBOURAEY EASSA received the B.Sc. degree in electronics and electrical communication engineering from Cairo University, Egypt, in 1978, and the M.Sc. and Ph.D. degrees in computers and systems engineering from Al-Azhar University, Cairo, Egypt, in 1984 and 1989, respectively, with joint supervision with the University of Colorado, USA, in 1989. He is currently a Full Professor with the Computer Science Department, Faculty of Computing and Information Technology, King Abdulaziz University, Saudi Arabia. His research interests include agent-based software engineering, cloud computing, software engineering, big data, distributed systems, and Exascale system testing.