The problem is that the vc4 drivers set framebuffers to write-combine cache mode. Sequential write performance is fine, but read performance is abysmal. Am guessing the other SBCs you reference don't do this. Software codec decoding on the Pi typically requires decoding to CPU-allocated memory (normal caching) and then a memcpy to the VC4-allocated framebuffer (write-combine). Was hoping this had changed for the Pi5 given its dependence on software decoding, but it remains the same as on the Pi4.
buffer memory access isn't "throttled" by anything extra, as it is in the same SDRAM chip. The concerning data point here is that dumb drm buffers are definitely way too slow on the pi4 and some other SBCs, but on some SBCs it's several times faster. Not sure what's up with that, probably some sync/caching page settings?
To work around this, I wrote a kernel module to change the cache mode of VC4-allocated memory back to normal. That enables true zero-copy software decoding on the Pi4/Pi5. My original Pi4 approach was fairly painful and actually updated the page tables directly. That failed on the Pi5 because framebuffers are no longer CMA allocated. My updated Pi4+Pi5 approach changes the cache mode in vm_area_struct->vm_page_prot and then calls mprotect(), which conveniently applies the change. Unfortunately, am unaware of any way to alter the cache mode without a kernel module.
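For reference, the copy path that the module avoids (decode into normally cached memory, then memcpy each row into the write-combined framebuffer) looks roughly like the sketch below. This is only an illustration, not code from the module or from any decoder: the function name copy_frame_to_fb and its parameters are made up here, and dst is assumed to be a userspace mapping of the framebuffer (any writable buffer works for trying it out). The point is that the framebuffer is only ever written, sequentially, and never read back.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: copy one decoded plane from normally cached CPU
 * memory (src) into a framebuffer mapping (dst).
 *
 * dst_pitch and src_pitch are the row strides in bytes; row_bytes is the
 * number of visible bytes per row. The framebuffer is written front to
 * back and never read, which is the access pattern write-combine memory
 * handles well.
 */
static void copy_frame_to_fb(uint8_t *dst, size_t dst_pitch,
                             const uint8_t *src, size_t src_pitch,
                             size_t row_bytes, size_t rows)
{
    /* Fast path: both buffers are tightly packed, so one memcpy does it. */
    if (dst_pitch == row_bytes && src_pitch == row_bytes) {
        memcpy(dst, src, row_bytes * rows);
        return;
    }

    /* General path: strides differ, so copy row by row. */
    for (size_t y = 0; y < rows; y++)
        memcpy(dst + y * dst_pitch, src + y * src_pitch, row_bytes);
}
```

With the cache mode switched back to normal, the decoder can write (and read reference frames) directly in the framebuffer, so this extra pass over every frame goes away.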
Statistics: Posted by Vraz — Tue Oct 01, 2024 8:12 pm