I've written a custom f_read function for FatFS that allows me to read directly from the disk (SD Card) to the USB controller, which has almost doubled my read performance (to 5.3 MB/s). This is good, but I'm still only at about 1/3 of the maximum raw read performance (13+ MB/s), so I know there's something going on that's "wasting" cycles.
FatFS does a lot of arithmetic to calculate where a particular sector is located: lots of X / 512 or X % 512 (or other X / 2^Y and X % 2^Y). For unsigned X, both can easily be reduced to a simple shift or a bitwise AND:
    X / 512 == X >> 9
    X % 512 == X & (512 - 1)
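To make the identities concrete, here is a minimal sketch in C. The names `SECTOR_SIZE`, `sector_index_div`, `sector_index_shift`, `sector_offset_mod`, `sector_offset_mask` and `sector_math_agrees` are hypothetical and are not part of FatFS. It assumes unsigned 32-bit operands; for signed operands the shift and the division can disagree on negative values, so the equivalence is not guaranteed there.

```c
#include <stdint.h>

/* Hypothetical compile-time constant standing in for a fixed sector size. */
#define SECTOR_SIZE 512u

/* Division/modulus form, as written when the sector size might vary. */
static uint32_t sector_index_div(uint32_t ofs)  { return ofs / SECTOR_SIZE; }
static uint32_t sector_offset_mod(uint32_t ofs) { return ofs % SECTOR_SIZE; }

/* Shift/mask form, valid only because 512 == 2^9 and ofs is unsigned. */
static uint32_t sector_index_shift(uint32_t ofs) { return ofs >> 9; }
static uint32_t sector_offset_mask(uint32_t ofs) { return ofs & (SECTOR_SIZE - 1u); }

/* Returns 1 if both forms give the same sector index and byte offset. */
static int sector_math_agrees(uint32_t ofs)
{
    return sector_index_div(ofs) == sector_index_shift(ofs)
        && sector_offset_mod(ofs) == sector_offset_mask(ofs);
}
```

A compiler can only make this substitution itself when the divisor is visible as a compile-time constant at the point of use. If the sector size is read from a variable or structure field at run time, it has to emit a real divide, so comparing the generated assembly for both forms is probably the quickest way to check.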
FatFS uses divide/modulus because those numbers may not always be 512, but in my case they always will be. Can I assume the compiler, using -Os, is smart enough to convert these to bitwise operations? Or does the UC3A3 have an arithmetic unit that is powerful enough to handle the divides and moduli without a significant penalty?