I am working in C++ ( gcc 4.3.3 ) on a big, ugly, somewhat ridiculous source library; that is going well. At its very core, however, is one template class that wraps access to virtual registers ( and, as part of its job, physical registers ). The class is, again, working well and doing what it is expected to do; no problem there either. The problem, such as it is, appears in the optimization of the accesses. I have more or less put to rest my fear that the missed optimizations are a result of C++; they appear, instead, to be internal to gcc and its optimizer.
To wit, I have pulled the basic functionality out of the class to create this test code ( with # for percent ):
    #include <avr/io.h>
    #include <stdint.h>

    void main(void) __attribute__((OS_main));

    // Write _value into r18 through inline asm.
    static inline uint8_t set_r18( const uint8_t& _value, const bool _volatile = false )
    {
        register uint8_t return_value asm("r18");
        if( _volatile )
            asm volatile("mov #0, #1": "=r" (return_value) : "r" (_value) );
        else
            asm("mov #0, #1" : "=r" (return_value) : "r" (_value) );
        return return_value;
    }

    // Read r18 through inline asm.
    static inline uint8_t get_r18( const bool _volatile = false )
    {
        register uint8_t reg_value asm("r18");
        uint8_t return_value;
        if( _volatile )
            asm volatile("mov #0, #1": "=r" (return_value) : "r" (reg_value) );
        else
            asm("mov #0, #1" : "=r" (return_value) : "r" (reg_value) );
        return return_value;
    }

    // Write _value into r18 through its data-space address ( 0x12 ).
    static inline const uint8_t& set_r18_mem( const uint8_t& _value, const bool _volatile = false )
    {
        if ( _volatile ){
            *(( volatile uint8_t * const ) 0x12) = _value;
        }
        else{
            *(( uint8_t * const ) 0x12) = _value;
        }
        return _value;
    }

    // Read r18 through its data-space address ( 0x12 ).
    static inline const uint8_t& get_r18_mem( const bool _volatile = false )
    {
        if( _volatile )
            return ( const uint8_t & ) (*(( uint8_t const * const ) 0x12));
        else
            return ( const uint8_t & ) (*(( uint8_t const * const ) 0x12));
    }

    void main(){
        DDRB = 0xFF;

        set_r18( 0x03, false );
        PORTB = get_r18( false );

        set_r18_mem( 0x03, false );
        PORTB = get_r18_mem( false );

        while( true );
    }
The command line options:
    CPP_FLAGS="-O2 -ffunction-sections -fno-exceptions -std=c++0x -fno-inline-small-functions -funsigned-char -funsigned-bitfields -fshort-enums -fno-split-wide-types -fno-tree-scev-cprop -ffreestanding"
    LINK_FLAGS="-Wl,--gc-sections,--relax"
The code has two ways to get and set a value in r18, tested in main. Now, at some point I will have to be able to control the volatility of the access, but for now I'd be happy if I can just get them optimized as non-volatile. This is what I am getting in the lss file ( after quite a bit of cleanup ).
    0000005e:
    // DDRB = 0xFF
      5e: 8f ef        ldi r24, 0xFF ; 255
      60: 87 bb        out 0x17, r24 ; 23
    // set_r18( 0x03 );
      62: 83 e0        ldi r24, 0x03 ; 3
      64: 28 2f        mov r18, r24
    // PORTB = get_r18();
      66: 22 2f        mov r18, r18
      68: 28 bb        out 0x18, r18 ; 24
    // set_r18_mem( 0x03 );
    // * ldi r24, 0x03 - this is required here too
      6a: 80 93 12 00  sts 0x0012, r24
    // PORTB = get_r18_mem();
      6e: 88 bb        out 0x18, r24 ; 24

    // Desired set_r18( 0x03 );
                       ldi r18, 0x03
    // Desired PORTB = get_r18();
                       out 0x18, r18
The listing ends with what I would like to see as the optimized implementation. Ideally, this would all work with the gp register instructions. Using inline asm I am able to use the register instructions but, as expected, the optimizer is stalled and there are some duplicate ( or at least redundant, from an optimization point of view ) instructions: total cost four clock cycles. The pointer-based version uses direct SRAM access, with a total cost of four cycles as well. Because there is no inline asm, the optimizer is able to see that the second read ( equivalent to mov r18, r18 ) is redundant and remove it. The optimal code would be a mere two cycles. To reach that, however, the optimizer would need to recognize the address as being within gp register space and compatible with a mov in place of sts.
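To make the cycle totals explicit, here is a small tally of the three instruction sequences. It is only a sketch of the arithmetic: the function names are mine, and the per-instruction costs are the ones I am assuming for the classic AVR core ( ldi, mov and out at one cycle each, sts at two ).

```cpp
// Assumed per-instruction cycle costs on the classic AVR core.
const int LDI = 1;
const int MOV = 1;
const int OUT = 1;
const int STS = 2;

// Inline asm version: ldi r24 / mov r18,r24 / mov r18,r18 / out
int asm_version_cycles()     { return LDI + MOV + MOV + OUT; }

// Pointer version: ldi r24 ( required here too ) / sts 0x0012,r24 / out
int pointer_version_cycles() { return LDI + STS + OUT; }

// Desired version: ldi r18 / out
int desired_cycles()         { return LDI + OUT; }
```

Both existing versions come to four cycles and the desired one to two, which is where the factor of two comes from.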
Does anyone have any ideas about how to convince the optimizer to do something with this, or a different way to achieve the same thing? The one thing I can't lose is flexibility in the wrapper. If a change negatively affects that, speed and size will just have to take second place to other things, but if I can do better... well. This is the heart of a monster; a small improvement ( if you can call a 50% reduction small ) here would be nice.
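On the flexibility side, one direction I could imagine ( a sketch only, not what the real class does ) is to select volatility with a template parameter instead of a run-time bool, so the choice is fixed at compile time. The names reg_type and reg_access are hypothetical, and the address is passed as a plain integer so the same code can be tried on a desktop compiler against an ordinary variable. This does nothing, by itself, about the sts versus mov question.

```cpp
#include <stdint.h>

// Hypothetical helper: choose the pointee type for the access.
template <bool Volatile> struct reg_type       { typedef uint8_t type; };
template <>              struct reg_type<true> { typedef volatile uint8_t type; };

// Hypothetical wrapper: volatility is a compile-time property of the type.
template <bool Volatile>
struct reg_access {
    typedef typename reg_type<Volatile>::type cell_t;

    // addr is the data-space address ( 0x12 would be r18 on the AVR ).
    static void set( uintptr_t addr, uint8_t value ) {
        *reinterpret_cast<cell_t*>( addr ) = value;
    }
    static uint8_t get( uintptr_t addr ) {
        return *reinterpret_cast<cell_t*>( addr );
    }
};
```

On the target, usage would be something like reg_access<false>::set( 0x12, 0x03 ); followed by PORTB = reg_access<false>::get( 0x12 );, with reg_access<true> wherever the access must not be optimized away.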
Martin Jay McKee