We discuss the benefits, difficulties, and performance of a GPU implementation of the Chambolle–Pock algorithm for TGV (total generalised variation) denoising of medical diffusion tensor images. Whereas we have previously studied the denoising of 2D slices of 2×2 and 3×3 tensors, attaining satisfactory performance on a normal CPU, here we concentrate on full 3D volumes of data, where each 3D voxel consists of a symmetric 3×3 tensor. One of the major computational bottlenecks in the Chambolle–Pock algorithm for these problems is that, on each iteration and at each voxel of the data set, a tensor potentially needs to be projected onto the positive semi-definite cone. In practice this demands the QR algorithm, as explicit solutions are not numerically stable. For a 128×128×128 data set, for example, the count is roughly 2 megavoxels (128³ = 2,097,152), a workload that lends itself to massively parallel GPU implementation. Further performance enhancements are obtained by parallelising basic arithmetic operations and differentiation. Since we use the relatively recent OpenACC standard for the GPU implementation, the article includes a study and critique of its applicability.
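
To make the per-voxel projection step concrete, the following is a minimal CPU sketch in Python, not the GPU code discussed in the article. It projects one symmetric 3×3 tensor onto the positive semi-definite cone by clamping negative eigenvalues to zero. As a simplifying assumption it computes the eigendecomposition with cyclic Jacobi rotations rather than the QR algorithm mentioned above; the function names `jacobi_eigh_3x3` and `project_psd_3x3`, the sweep limit, and the tolerance are all hypothetical choices made for illustration.

```python
import math


def jacobi_eigh_3x3(a, sweeps=20, tol=1e-28):
    """Eigendecompose a symmetric 3x3 matrix with cyclic Jacobi rotations.

    Illustrative stand-in for the QR algorithm (an assumption of this sketch).
    Returns (eigenvalues, v), where the columns of v are the eigenvectors.
    """
    a = [row[:] for row in a]  # work on a copy of the input
    v = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    for _ in range(sweeps):
        total = sum(x * x for row in a for x in row)
        off = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
        if off <= tol * total:  # off-diagonal part is negligible
            break
        for p, q in ((0, 1), (0, 2), (1, 2)):
            if a[p][q] == 0.0:
                continue
            # Angle in [-pi/4, pi/4] that zeroes the entry a[p][q].
            d = a[q][q] - a[p][p]
            sign = -1.0 if d < 0 else 1.0
            theta = 0.5 * math.atan2(sign * 2.0 * a[p][q], sign * d)
            c, s = math.cos(theta), math.sin(theta)
            for k in range(3):  # A <- A J (update columns p and q)
                akp, akq = a[k][p], a[k][q]
                a[k][p] = c * akp - s * akq
                a[k][q] = s * akp + c * akq
            for k in range(3):  # A <- J^T A (update rows p and q)
                apk, aqk = a[p][k], a[q][k]
                a[p][k] = c * apk - s * aqk
                a[q][k] = s * apk + c * aqk
            for k in range(3):  # V <- V J (accumulate eigenvectors)
                vkp, vkq = v[k][p], v[k][q]
                v[k][p] = c * vkp - s * vkq
                v[k][q] = s * vkp + c * vkq
    return [a[0][0], a[1][1], a[2][2]], v


def project_psd_3x3(a):
    """Project a symmetric 3x3 matrix onto the positive semi-definite cone.

    Negative eigenvalues are clamped to zero and the matrix is reassembled
    as V diag(max(lambda, 0)) V^T.
    """
    lam, v = jacobi_eigh_3x3(a)
    lam = [max(x, 0.0) for x in lam]
    return [[sum(v[i][k] * lam[k] * v[j][k] for k in range(3))
             for j in range(3)] for i in range(3)]


if __name__ == "__main__":
    tensor = [[1.0, 2.0, 0.0], [2.0, 1.0, 0.0], [0.0, 0.0, -5.0]]
    for row in project_psd_3x3(tensor):
        print(row)
```

In a full denoising run, a projection of this kind would be applied at each of the 128³ voxels on every iteration. Each call reads and writes only its own tensor, which is why the step maps naturally onto one GPU thread per voxel.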