I tried to use CUDA-Aware MPI together with ArrayFire and encountered a problem: my code produced different results with CUDA-Aware MPI than when the data is first copied to host memory and transferred using normal MPI communication.
To investigate, I wanted to compare the data that was sent over CUDA-Aware MPI with the data that was sent using normal MPI, so I copied the data and transferred it both ways to compare what was communicated.
Using `.host<>()` or `cudaMemcpy` to copy the data before the CUDA-Aware MPI call made the problem disappear, and the correct result was calculated.
From this I figured it might be caused by the asynchronous style of ArrayFire and GPU programming in general: a computation might not be finished when CUDA-Aware MPI grabs the data from the GPU's memory.
Adding an `af::sync()` call before the MPI call solved the problem.
This issue is somewhat similar to #1316 but more specific, so you would have to consider whether you want to add the `af::sync()` call to the `device<>()` method. I don't know whether you can reproduce this problem with ArrayFire alone (maybe you can?), and you probably don't want `device<>()` to be blocking.
So if you don't fix it, a future reader might at least find this and be aware of the problem.