[BUG] [CUDA] freeze with reduce by key #2955

@ebarjou

Description

Under circumstances that I do not fully understand, the CUDA backend can freeze when executing a reduce-by-key operation. I narrowed the issue down to the values in the key array; it appears to be linked to the number of consecutive equal keys.
With the OpenCL backend, the same call does not exhibit this behavior.

I use the ArrayFire v3.7.1 release binaries with the CUDA backend, tested on two computers, under Windows and Linux.
The freeze happened every time at exactly the same spot in my program.
I first saved the key and value arrays to files so that I could test them in a separate program, and it behaves the same way there.
I then tried to replicate the issue by constructing the arrays myself, and got the same freeze once the key array contained two long sequences of equal keys.
It gives no error log and no exception; the only information I have is that it freezes in the sumByKey operation (it behaves the same with all ___ByKey operations).

Reproducible Code

Here's the test program that I came up with. Under OpenCL it runs flawlessly, but under CUDA it freezes every time at i=73.

#include <arrayfire.h>
#include <iostream>

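int main(int argc, char *argv[]) {
    int N = 1280*1280;  // number of key/value pairs
    int count = 200;    // number of iterations, and offset of the second run of equal keys
    try{
        af::sync();
        af::array val = af::randu(N);
        // Keys start as 0, 1, 2, ..., N-1, so every key is initially unique.
        af::array key = af::range(af::dim4(N), 0, af::dtype::s32);
        af::array res1, res2;
        for(int i = 0; i < count; ++i) {
            // Flush so the current index is visible even if the next call hangs.
            std::cout << i << " consecutive key : " << std::flush;
            // Grow two runs of equal keys: zeros from index 0, ones from index count.
            key(i) = 0;
            key(count+i) = 1;
            af::sumByKey(res1, res2, key, val);
            res2.eval();
            res1.eval();
            af::sync();
            std::cout << "Ok !" << std::endl;
        }

        std::cout << "Finished" << std::endl;
    } catch (const af::exception& e) {
        std::cout << e.what() << std::endl;
        return -1;
    }
    return 0;
}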

System Information

ArrayFire version 3.7.1
Intel Core i7-9750H, 16 GB RAM, GTX 1650 4 GB
CUDA info: https://pastebin.com/kLGvdUA0
Output of nvidia-smi :

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 446.14       Driver Version: 446.14       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1650   WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   45C    P8     6W /  N/A |    132MiB /  4096MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
