Home CUDA Mandelbrot Fractal
Post
Cancel

CUDA Mandelbrot Fractal

Project Idea:

Mandelbrot fraction using CUDA and C++.

See the project on GitHub at the GitHub repo

Problem Statement:

Using CUDA with the C++ API I want to draw a Mandelbrot fractal. And I want to be able to display that to the user in the easiest way possible.

  • Use CUDA
  • Use C++
  • Use X11 to draw since Linux is the easiest dev env.

Results:

The code can be seen at the previous link or at the C++ file and the offending portions for the single and multithreaded code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
__global__ void mandelbrot_single_thread_gpu(int n, int* mbr) {
    for(int i = 0; i < n; i++) {
        // Index from the top left, using row-major order/x-major order
        uint32_t x_index = (i % WIDTH);
        uint32_t y_index = (i / WIDTH);

        float calculated_position_x = (x_index / (0.5 * WIDTH)) - 1.5;
        float calculated_position_y = (y_index / (0.5 * HEIGHT)) - 1.0;

        cuFloatComplex z_0 = make_cuFloatComplex(calculated_position_x, calculated_position_y);
        cuFloatComplex z_n = make_cuFloatComplex(calculated_position_x, calculated_position_y);

        int m_i = 0;
        for(m_i = 0; m_i < ITER_MAX; m_i++) {
            z_n = cuCaddf(cuCmulf(z_n, z_n), z_0);
            if(cuCabsf(z_n) > 2.0) {
                break;
            }
        }
        mbr[i] = m_i;
    }
}

__global__ void
mandelbrot_multi_thread_gpu(int n, int* mbr, uint16_t thread_blocks, uint16_t threads) {
    int tid          = blockIdx.x * blockDim.x + threadIdx.x;
    int section_size = (n) / (thread_blocks * threads);
    for(int i = tid * section_size; i < (tid + 1) * section_size; i++) {
        // Index from the top left, using row-major order/x-major order
        uint32_t x_index = (i % WIDTH);
        uint32_t y_index = (i / WIDTH);

        float calculated_position_x = (x_index / (0.5 * WIDTH)) - 1.5;
        float calculated_position_y = (y_index / (0.5 * HEIGHT)) - 1.0;

        cuFloatComplex z_0 = make_cuFloatComplex(calculated_position_x, calculated_position_y);
        cuFloatComplex z_n = make_cuFloatComplex(calculated_position_x, calculated_position_y);

        int m_i = 0;
        for(m_i = 0; m_i < ITER_MAX; m_i++) {
            z_n = cuCaddf(cuCmulf(z_n, z_n), z_0);
            if(cuCabsf(z_n) > 2.0) {
                break;
            }
        }
        mbr[i] = m_i;
    }
}
// ...
int main(void) {
    // ...
    mandelbrot_single_thread_gpu<<<1, 1>>>(N, mbr);

    uint16_t thread_blocks = 4;
    uint16_t threads       = 256;
    mandelbrot_multi_thread_gpu<<<thread_blocks, threads>>>(N, mbr, thread_blocks, threads);
    // ...
}

And the most salient section of the multithreaded code where I actually break up the work is:

1
2
3
4
5
6
7
8
9
10
__global__ void
mandelbrot_multi_thread_gpu(int n, int* mbr, uint16_t thread_blocks, uint16_t threads) {
    int tid          = blockIdx.x * blockDim.x + threadIdx.x;
    int section_size = (n) / (thread_blocks * threads);
    for(int i = tid * section_size; i < (tid + 1) * section_size; i++) {
        // Mandelbrot logic
        ...
    }
}

The previous section is where the work is divvied up between the threads based on the allotted thread index. This would look like a number of horizontal stripes along the entire render.

The rendered result:

Rendered result

Link to the raw image

On a 2048x2048px grid using a basic single threaded approach on the GPU.

1
2
3
4
5
6
7
==42795== Profiling application: ./build/mandelFrac
==42795== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  208.700s         1  208.700s  208.700s  208.700s  mandelbrot_single_thread_gpu(int, int*)
      API calls:   99.75%  208.701s         1  208.701s  208.701s  208.701s  cudaDeviceSynchronize
                    0.25%  519.95ms         1  519.95ms  519.95ms  519.95ms  cudaMallocManaged
                    ...

Now for the multithreaded version with 4 thread blocks and 256 threads.

1
2
3
4
5
6
7
==43499== Profiling application: ./build/mandelFrac
==43499== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  371.59ms         1  371.59ms  371.59ms  371.59ms  mandelbrot_multi_thread_gpu(int, int*, unsigned short, unsigned short)
      API calls:   56.89%  493.62ms         1  493.62ms  493.62ms  493.62ms  cudaMallocManaged
                   42.84%  371.70ms         1  371.70ms  371.70ms  371.70ms  cudaDeviceSynchronize
                   ...

The speedup is wild. I get a 56,160% speedup!!

Work Division:

Since I decided to divide the work up in the way that I did, we can also have a look at how the work was divided by rendering out the thread ID number in some color scheme.

This would look like this in an exaggerated way with fewer workers/threads.

exaggerated

Link to the raw image

Whereas the actual worker tested looked more like:

real

Link to the raw image

The number of workers is physically unrepresentable as this is an 8-bit image and I had $4 \times 256 = 1024$ thread IDs

This post is licensed under CC BY 4.0 by the author.