It will be easier to use the Arm NEON compute, as it's closely coupled with the Arm code and has convenient access to memory.
If NEON can do 4 multiply-accumulates per clock (8 floating-point operations), it would be 2.4 GHz * 8 = 19.2 GFlops per core. With 4 cores this would be 76.8 GFlops.
To my knowledge the VideoCore VII has 12 QPUs, each with 4 ALUs that can do two operations per clock. At 800 MHz this gives 0.8 * 12 * 4 * 2 = 76.8 GFlops!
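For anyone who wants to check the arithmetic, here is a minimal sketch of the two peak estimates. The helper name `peak_gflops` and its parameter names are made up for illustration; the clock speeds and unit counts are just the figures quoted in this post, and a multiply-accumulate is counted as 2 floating-point operations. These are theoretical peaks, not measured throughput.

```python
def peak_gflops(clock_mhz, units, lanes, ops_per_lane):
    """Theoretical peak in GFlops: clock * units * lanes * ops per lane per clock.

    Hypothetical helper for illustration only; it ignores memory bandwidth,
    instruction issue limits and anything else that reduces real throughput.
    """
    return clock_mhz * units * lanes * ops_per_lane / 1000.0


# Arm NEON, as assumed in the post: 4 cores at 2.4 GHz,
# 4 lanes per core, each doing a multiply and an accumulate per clock.
neon_per_core = peak_gflops(2400, 1, 4, 2)
neon_total = peak_gflops(2400, 4, 4, 2)

# VideoCore VII, as assumed in the post: 12 QPUs at 800 MHz,
# 4 ALUs per QPU, 2 operations per ALU per clock.
qpu_total = peak_gflops(800, 12, 4, 2)

print(f"NEON per core: {neon_per_core:.1f} GFlops")
print(f"NEON 4 cores:  {neon_total:.1f} GFlops")
print(f"VideoCore VII: {qpu_total:.1f} GFlops")
```

Both totals come out the same under these assumptions, so the paper numbers alone don't favour either side.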
Is it that the practical Flops for the CPUs are closer to the theoretical Flops than they are for the GPU?
It's more awkward to access memory from the GPU (DMA-style operations, rather than random access).
But which works best will depend on the algorithm. Using both would obviously be best (assuming the Arm cores and the GPU are not otherwise needed).
Statistics: Posted by dom — Fri Dec 20, 2024 11:14 am