Disappointed about Compiler Optimization [Pipeline Ap7k]

Hi
I have a big program for my AP7000. Development has taken 2 years and I have learned a lot. The executable is ~650 KB.
To speed up the graphics calculations I used assembler, and I beat the compiler-generated code (96 clk for A8R8G8B color interpolation versus 22 clk with my code). I notice that GCC is capable of parallelisation for x86, and I understand that this is hard to do for every architecture. But I clearly notice that the compiler-generated code, even when compiled with -O3, is not pipeline optimized.
There would be much more potential.
-> Branch prediction: GCC emits mcalls! An mcall is not predictable and takes 6 clk. Why not use an icall and load the needed register while other instructions would otherwise stall on a dependency on an ld.w result (latency of 3 clk), or on similar instructions with the same effect?
-> The AP7000 has parallel pipelines! Use them!
Example: I have written a highly optimized fixed-point template class. This is the compiler-generated code for a 4D dot product "fp1*f + fp2*f + fp3*f + fp4*f" (fp4 is fixed point, like f):

// r12 holds f, that part seems optimized ;-)
10000bda:	6c c9       	ld.w	r9,r6[0x30]
10000bdc:	6c da       	ld.w	r10,r6[0x34]
10000bde:	6c fe       	ld.w	lr,r6[0x3c]
10000be0:	6c eb       	ld.w	r11,r6[0x38]
10000be2:	f8 09 04 48 	muls.d	r8,r12,r9
10000be6:	f8 0a 05 48 	macs.d	r8,r12,r10
10000bea:	f8 0b 05 48 	macs.d	r8,r12,r11
10000bee:	f8 0e 05 48 	macs.d	r8,r12,lr
10000bf2:	f0 0a 16 08 	lsr	r10,r8,0x8
10000bf6:	f5 e9 11 8a 	or	r10,r10,r9<<0x18
10000bfa:	81 3a       	st.w	r0[0xc],r10

Hey! macs.d has a result dependency of 4-5 clk!! Why load all the registers beforehand? Let the multiply pipeline stay busy while the ld.w executes in the load/store pipeline. :-( The compiler also did not know the MAC instructions. Emitting them automatically should be possible, especially when the code is as clear as this:

(long long)a_int32 * (long long)b_int32 + (long long)c_int32 * (long long)d_int32 ...
(pseudocode)
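
In plain C the pattern meant here could look roughly like the sketch below. The names fix_t, FIX_FRAC_BITS and fix_dot4_scaled are invented for illustration, and the 8 fractional bits are an assumption read off the "lsr r10,r8,0x8" in the disassembly above. Whether avr32-gcc maps the widening multiply-adds to muls.d/macs.d on its own is exactly the open question.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 8 fractional bits, inferred from the 8-bit shift in the
   disassembly. The real template class may use a different format. */
#define FIX_FRAC_BITS 8

typedef int32_t fix_t;   /* hypothetical fixed-point type */

/* Computes v[0]*f + v[1]*f + v[2]*f + v[3]*f with a 64-bit accumulator. */
fix_t fix_dot4_scaled(const fix_t v[4], fix_t f)
{
    assert(v != 0);

    /* 32x32 -> 64 bit multiply: the candidate for muls.d */
    int64_t acc = (int64_t)v[0] * f;

    /* 32x32 -> 64 bit multiply-accumulate: the candidates for macs.d */
    acc += (int64_t)v[1] * f;
    acc += (int64_t)v[2] * f;
    acc += (int64_t)v[3] * f;

    /* Scale back once at the end. Right-shifting a negative value is
       implementation-defined in C; GCC uses an arithmetic shift. */
    return (fix_t)(acc >> FIX_FRAC_BITS);
}
```

With the values 1.0, 2.0, 3.0, 4.0 (256, 512, 768, 1024) and f = 0.5 (128) this returns 1280, i.e. 5.0 in that format.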

Results:

Compiler standard code for Fixpoint O3 wrote:
00:02.71 Info: Profiler: VectorTransform 10k [4 * 4x4]: 2320436clk -> 19.336ms -> 51.71Hz -> cpu @ 120000kHz
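
For reference, the time and rate in that profiler line follow directly from the cycle count and the 120 MHz clock. A minimal sketch of the conversion; the function names are made up for illustration and are not taken from the profiler:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not part of the original profiler. */

/* Cycle count to milliseconds. The clock is given in kHz, which is the
   same as cycles per millisecond, as in the log line. */
double cycles_to_ms(uint64_t cycles, uint32_t cpu_khz)
{
    assert(cpu_khz != 0);
    return (double)cycles / (double)cpu_khz;
}

/* How many runs of that length fit into one second. */
double cycles_to_hz(uint64_t cycles, uint32_t cpu_khz)
{
    assert(cycles != 0);
    return (double)cpu_khz * 1000.0 / (double)cycles;
}
```

For 2320436 clk at 120000 kHz this gives about 19.34 ms and 51.71 Hz, in line with the log.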

Think of the performance if the AP7k were programmed with pipeline optimization!
The test code is a multiplication of a 4D vector with a 4x4 matrix.
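
As a rough idea of what such a test kernel computes, here is a 4D vector times 4x4 matrix in fixed point. This is not the attached code, which uses the template class. The row-major layout, the 8 fractional bits and all names are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 8   /* assumption: 8 fractional bits */

typedef struct { int32_t c[4]; } vec4_t;   /* hypothetical vector type */

/* out = M * v, with M stored row-major as 16 fixed-point values. */
vec4_t mat4_transform(const int32_t m[16], vec4_t v)
{
    vec4_t out;
    assert(m != 0);

    for (int row = 0; row < 4; ++row) {
        /* One 64-bit accumulator per row: one multiply followed by
           three multiply-accumulates, then a single scaling shift. */
        int64_t acc = 0;
        for (int col = 0; col < 4; ++col)
            acc += (int64_t)m[4 * row + col] * v.c[col];

        /* Arithmetic shift for negative values is what GCC does;
           the C standard leaves it implementation-defined. */
        out.c[row] = (int32_t)(acc >> FRAC_BITS);
    }
    return out;
}
```

Each output component is one 4D dot product, so the four rows are independent of each other. That independence is what would let loads for the next row overlap with the multiplies of the current one.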

:-(

What is the trick? Do I see potential where there is none?
Are there some cool compiler / linker / assembler switches that I should use?

I am missing __builtin_macs_d and similar builtin functions. They do exist for the saturation instructions... I cannot understand this. I know that inline assembly is hard on the optimizer, so normally builtins are the perfect replacement.
DSP instructions are a cool feature... especially without complete support. :evil:

As far as I can see, my linker script contains the line
OUTPUT_ARCH(avr32:uc)
Does this have anything to do with this topic? UC does not sound like AP ;-)

Greetings :-)

The attachment shows the original code for my vector transform test, including the complete relevant disassembly. The fixed-point calculations are done by operator overloading. The 750 lines of template code are not included. If they are relevant, please request the part you need.

I've also checked that avr32-gcc does an incredibly poor job when compiling dsp code...

I agree that there should be built-in functions to take advantage of the cool DSP features that the AVR32 CPU has. Let's hope Atmel solves that in the near future...

Daniel Campora http://www.wipy.io

By the way, in my code example above:

10000bda:   6c c9          ld.w   r9,r6[0x30]
10000bdc:   6c da          ld.w   r10,r6[0x34]
10000bde:   6c fe          ld.w   lr,r6[0x3c]
10000be0:   6c eb          ld.w   r11,r6[0x38] 

Here, consecutive registers are loaded from consecutive addresses (it would only need lr and r12 to be swapped).
Why no "ldm", or at least two ld.d?

It's interesting: loading a complete aligned struct into registers is a common operation that is done relatively often. Why is it so unoptimized? Every instruction that could be eliminated is a waste of cache.