I have a big program for my AP7000. Development has taken 2 years and I have learned a lot. The executable is ~650 kB.
To speed up graphics calculations I have used assembler, and I beat the compiler-generated code (96 clk for A8R8G8B color interpolation versus 22 clk with my code). I notice that GCC is capable of parallelisation on x86. I understand that this is hard to do for different architectures. But I notice strongly that the compiler-generated code, even when compiled with -O3, is not pipeline-optimized.
There would be much more potential.
->Branch prediction: GCC emits mcalls! They are not predictable and take 6 clk. Why not use icall and load the needed register while other instructions would stall anyway due to a dependency on an ld.w result (latency of 3 clk), or on similar instructions with equal effects?
->AP7000 has parallel pipelines! Use them!
Example: I have written a highly optimized fixed-point template class. Compiler-generated code for a 4D dot product "fp1*f + fp2*f + fp3*f + fp4*f" (fp4 is fixed-point, like f):
//r12 stores f seems optimized ;-)
10000bda: 6c c9          ld.w r9,r6[0x30]
10000bdc: 6c da          ld.w r10,r6[0x34]
10000bde: 6c fe          ld.w lr,r6[0x3c]
10000be0: 6c eb          ld.w r11,r6[0x38]
10000be2: f8 09 04 48    muls.d r8,r12,r9
10000be6: f8 0a 05 48    macs.d r8,r12,r10
10000bea: f8 0b 05 48    macs.d r8,r12,r11
10000bee: f8 0e 05 48    macs.d r8,r12,lr
10000bf2: f0 0a 16 08    lsr r10,r8,0x8
10000bf6: f5 e9 11 8a    or r10,r10,r9<<0x18
10000bfa: 81 3a          st.w r0[0xc],r10
Heeey! macs.d has a result dependency of 4-5 clk!! Why load all the registers beforehand? Let the multiply pipeline be busy while the ld.w executes in the load/store pipeline. :-( The compiler also did not know the MAC instructions. Emitting them automatically should be possible, especially when the code is quite clear, like
00:02.71 Info: Profiler: VectorTransform 10k [4 * 4x4]: 2320436clk -> 19.336ms -> 51.71Hz -> cpu @ 120000kHz
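The figures in that profiler line can be cross-checked with a few lines of C++. This is only a sketch: the names `Timing` and `cyclesToTiming` are made up for illustration, and it assumes that "120000kHz" means a 120 MHz core clock and that the cycle count covers the whole run.

```cpp
#include <cstdint>

// Hypothetical helper type, not part of the original profiler.
struct Timing {
    double ms;  // elapsed time in milliseconds
    double hz;  // how many such runs fit into one second
};

// Convert a raw cycle count into time and rate.
// cpuHz is an assumption taken from the log line (120000 kHz = 120e6 Hz).
inline Timing cyclesToTiming(std::uint64_t cycles, double cpuHz) {
    const double seconds = static_cast<double>(cycles) / cpuHz;
    return Timing{seconds * 1e3, 1.0 / seconds};
}
```

Calling `cyclesToTiming(2320436, 120e6)` gives roughly 19.34 ms and 51.71 Hz, which matches the log apart from rounding. If "10k" stands for 10,000 loop iterations (an assumption), that is about 232 clk per iteration.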
Think of the performance if the AP7k were programmed in a pipeline-optimized way!
The test code is a 4D vector multiplication with a 4x4 matrix.
What is the trick? Do I see potential where there is none?
Are there some cool compiler / linker / assembler switches that I should use?
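For readers without the attachment, the shape of such a test can be sketched in plain C++ without the template class. This is not the original code: the names `dot4` and `transform` are invented, the matrix is assumed to be row-major, and the 8 fractional bits are inferred from the "lsr r10,r8,0x8" in the listing above.

```cpp
#include <cstdint>

// Assumption: 8 fractional bits, so 1.0 is stored as 256.
constexpr int kFracBits = 8;

// Dot product of one matrix row with the vector, using a single 64-bit
// accumulator. This is the source pattern that a muls.d followed by three
// macs.d could implement. The right shift of a negative value is arithmetic
// on GCC (guaranteed by the standard only since C++20).
inline std::int32_t dot4(const std::int32_t* row, const std::int32_t* v) {
    std::int64_t acc = static_cast<std::int64_t>(row[0]) * v[0];
    acc += static_cast<std::int64_t>(row[1]) * v[1];
    acc += static_cast<std::int64_t>(row[2]) * v[2];
    acc += static_cast<std::int64_t>(row[3]) * v[3];
    return static_cast<std::int32_t>(acc >> kFracBits);
}

// out = m * in, with m stored row-major as 16 raw fixed-point values.
inline void transform(const std::int32_t m[16], const std::int32_t in[4],
                      std::int32_t out[4]) {
    for (int r = 0; r < 4; ++r) {
        out[r] = dot4(m + 4 * r, in);
    }
}
```

Shifting only once per row, after all four products are summed, keeps the full 64-bit precision until the end and leaves the four multiplies free of any dependency except the accumulator itself.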
I am missing __builtin_macs_d and similar builtin functions. They do exist for the saturation instructions... I cannot understand this. I know that inline assembly is hard on the optimizer, so normally builtins are a perfect replacement.
DSP instructions are a cool feature... especially without complete support. :evil:
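As a stand-in for the missing builtin, the multiply-accumulate can at least be written as a small inline helper in portable C++. To be clear, `mac_s64` and `dot4_scaled` are hypothetical names, not GCC builtins, and whether the backend turns the helper into a single macs.d is entirely up to the compiler. The 8 fractional bits are again an assumption taken from the disassembly.

```cpp
#include <cstdint>

// Hypothetical helper, NOT a real GCC builtin: returns acc + a * b with a
// signed 64-bit accumulator, which is the arithmetic macs.d performs.
static inline std::int64_t mac_s64(std::int64_t acc, std::int32_t a,
                                   std::int32_t b) {
    return acc + static_cast<std::int64_t>(a) * b;
}

// fp1*f + fp2*f + fp3*f + fp4*f on raw fixed-point values.
// Assumption: 8 fractional bits, so the sum is shifted right by 8 once.
static inline std::int32_t dot4_scaled(const std::int32_t fp[4],
                                       std::int32_t f) {
    std::int64_t acc = 0;
    for (int i = 0; i < 4; ++i) {
        acc = mac_s64(acc, fp[i], f);
    }
    return static_cast<std::int32_t>(acc >> 8);
}
```

With raw inputs {256, 512, 768, 1024} (1.0 to 4.0) and f = 512 (2.0), `dot4_scaled` returns 5120, which is 20.0 in this format. Unlike inline assembly, the helper stays visible to the optimizer, so it can still be inlined and scheduled.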
As far as I can see, my linker script contains the line
Does this have anything to do with this topic? UC does not sound like AP ;-)
The attachment shows the original code for my vector-transform test, including the complete relevant disassembly. Fixed-point calculations are done by operator overloading. The 750 lines of template code are not included. If they are relevant, please request the relevant part.