GCC vs ASM code size - Help me close the gap


As you are fussing with this construct and cycle-counting, have you tried 8 bits times 8 bits to get a 16-bit result -- and then just force taking the high byte with a cast or union?  Perhaps you can trick the compiler this way.  (Actually, I call it "giving the compiler a hint" and applying your ASM mindset.)
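A minimal sketch of that hint, written as portable C so it can be tried on a PC as well as with avr-gcc. The function names are made up for illustration, and both versions assume a GCC-style compiler (arithmetic right shift of negative values, little-endian byte order as on AVR). Whether a given avr-gcc version turns either form into a single mulsu is something to check in the listing; the sketch does not guarantee it.

```c
#include <stdint.h>

/* Signed 8-bit value times unsigned 8-bit scale, keeping only the high
 * byte of the 16-bit product (an 8.8 fixed-point scale).
 * The product always fits in int16_t: |value * scale| <= 128 * 255. */
static inline int8_t scale_shift(int8_t value, uint8_t scale)
{
    int16_t product = (int16_t)value * (int16_t)scale;
    /* Assumption: >> on a negative int16_t is an arithmetic shift (true for GCC). */
    return (int8_t)(product >> 8);
}

/* Same result, but the high byte is picked out through a union
 * instead of a shift. Assumption: little-endian byte order. */
static inline int8_t scale_union(int8_t value, uint8_t scale)
{
    union {
        int16_t word;
        uint8_t bytes[2];
    } u;

    u.word = (int16_t)value * (int16_t)scale;
    return (int8_t)u.bytes[1];   /* bytes[1] is the high byte on little-endian */
}
```

Both forms give the same answer; the only question is which one the compiler digests best.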

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


sparrow2 wrote:
And a side question to the experts: is there a way to inline an ASM function?

You can accomplish that either by using a #define or by defining a static inline function.

 

#define zbDsblInt()                               \
({                                                \
    zb_CPU_stat_t zb_flag;                        \
    __asm__ __volatile__                          \
    (                                             \
        "in %0, 0x3f"       "\n\t"                \
        "cli"               "\n\t"                \
        : "=r" (zb_flag)                          \
        :                                         \
        : "memory"                                \
    );                                            \
    zb_flag;                                      \
})

 

or

 

static inline uint8_t zbDsblInt(void) __attribute__((always_inline));
static inline uint8_t zbDsblInt(void)
{
    uint8_t zb_flag;
    __asm__ __volatile__
    (
        "in %0, 0x3f"       "\n\t"
        "cli"               "\n\t"
        : "=r" (zb_flag)
        :
        : "memory"
    );
    return zb_flag;
}

 

Don Kinzer
ZBasic Microcontrollers
http://www.zbasic.net


Just for fun, I poked mul() into a CodeVision test program as an inline function.

 

The results came fairly close -- except for the call to the utility routine to do the signed/unsigned mixed multiply adjustment.

 

[I had to use CV's "Promote char to int?" option. ;)  ]

 

Now, if you go through the possibilities and don't force it to do the adjustment, then it might come out "clean"?

 

 



I copied your latest code to a standalone C source file (with a few missing definitions added to make it compile). That code is at http://pastebin.com/DKin9rVS.

 

Try as I might, I couldn't get avr-gcc to produce the longer multiplication code sequence with anything but -O0. -Os/-O1/-O2/-O3 all produce only two mulsu instructions, one for each multiplication.

 

Can you copy and paste the exact avr-gcc command that is used to compile your code? I suspect that this source file is not getting the right optimization flag for some reason.


Theusch, it only does the crazy long thing when one of the arguments is unsigned.

Christop, thanks - I will download your example from pastebin and investigate my makefile to see if I am breaking it there somehow.


Thanks Christop, it's definitely something in the makefile or environment for that project.

 

In a totally clean environment I get this

xAdd = (Z * xAdd)>>8;									// Work out how far the object is from the start of the lane
  mulsu   r21, r19
  movw    r24, r0
  eor     r1, r1
    
yAdd = (Z * yAdd)>>8;
  mulsu   r20, r19
  movw    r30, r0
  eor     r1, r1

Which is only one clock behind the inline ASM for the same thing - close enough for this part of the code that it will stay as C.

 

I have looked further and nowhere in my code that is doing similar fixed-point math is FMUL being used.  However, in the kernel code FMUL is being used.  They are both being compiled with the same flags:

 

CFLAGS = -mmcu=$(MCU) -nostartfiles
CFLAGS += -Wall -gdwarf-2 -std=gnu99 -DF_CPU=28636360UL -Os -fsigned-char -ffunction-sections -fno-toplevel-reorder
CFLAGS += -MD -MP -MT $(*F).o -MF dep/$(@F).d 
CFLAGS += -g3 -gdwarf-2

In fact in a clean environment even -O0 gives the almost optimal code.

 

The only difference I can see with my code and the kernel is that mine is chock full of things that look like this.

 

 __attribute__ ((section (".renderlinesa")))

 __attribute__ ((section (".freeflash")))

 __attribute__ ((section (".renderlinesb")))

 

They are both using

avr-gcc (GCC) 4.8.2 20131010

that came with MHVTools.

 

I'll get back here once I solve the mystery.
 


Straight into another question.

 

I have a table in flash that is aligned to a 256 byte boundary for speed of access in my ASM

 

LDFLAGS += -Wl,--section-start=.trigtableflash=0x00002100
const uint8_t SinCosTable[] __attribute__ ((section (".trigtableflash"))) =  {

and the ASM for sin() and cos() looks like this

 

; int8_t CosFastC(uint8_t angle)
;
; Returns the cosine of the angle
;
; Inputs
;		Angle (0..255) in r24
; Returns
;		cos(angle) as signed 8 bit value in r24
; Trashes
;		R24
;		R26:27

CosFastC:

     subi    r24, (-(64))			; COS is 90 degrees out of phase with SIN

SinFastC:

     mov     r30, r24				; Get the offset in the SIN table
     ldi     r31, hi8(SinCosTable)
     lpm     r24, Z				; Read value from table into r24
     ret

I don't think this one is likely - but is there a way to make the C code not do

 

R = pgm_read_byte(&SinCosTable[objData->r]);
     adiw    r26, 0x05	; 5
     ld      r30, X
     sbiw    r26, 0x05	; 5
     ldi     r31, 0x00	; 0
     subi    r30, 0x00	; 0
     sbci    r31, 0xDF	; 223
     lpm     r30, Z

It's the LDI/SUBI/SBCI that the alignment was intended to get rid of.

 

The compiler KNOWS the table is on a 256-byte boundary; can you convince it that it does not need to subtract ZERO from the address?
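One thing worth keeping in mind: the section address comes from the linker (--section-start), so the compiler itself never sees that the low byte of the table address is zero. A sketch of spelling the trick out by hand is below, written so it runs on a PC. The table contents (a crude triangle wave) and all names are stand-ins, not the real SinCosTable. On the AVR the final read would go through pgm_read_byte() on the combined address, and whether avr-gcc then drops the LDI/SUBI/SBCI is something to check in the listing.

```c
#include <stdint.h>

/* Hypothetical stand-in for the flash table: 256 entries on a
 * 256-byte boundary, so the low byte of its address is zero. */
static _Alignas(256) int8_t sin_table[256];

/* Fill the table with a triangle wave as placeholder data. */
static void sin_table_init(void)
{
    for (int i = 0; i < 256; i++) {
        int v;
        if (i < 64)       v = 2 * i;          /* rising:    0 .. 126  */
        else if (i < 192) v = 254 - 2 * i;    /* falling: 126 .. -128 */
        else              v = 2 * (i - 256);  /* rising: -128 .. -2   */
        sin_table[i] = (int8_t)v;
    }
}

/* Build the entry address by dropping the angle into the (zero) low
 * byte of the table address: no 16-bit add is needed. */
static int8_t sin_fast(uint8_t angle)
{
    uintptr_t base = (uintptr_t)sin_table;   /* low 8 bits are zero */
    const int8_t *entry = (const int8_t *)(base | angle);
    return *entry;
}

/* Cosine is the sine table read a quarter turn (64 steps) ahead. */
static int8_t cos_fast(uint8_t angle)
{
    return sin_fast((uint8_t)(angle + 64));
}
```

The uint8_t wrap-around in cos_fast() does the same job as the SUBI r24, -64 in the ASM version.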

 

Last Edited: Fri. Apr 17, 2015 - 02:49 AM

Don't expect C to make much better code. 

But perhaps try a pointer that points to SinCosTable and add the index (perhaps place it in a register pair).

 

But why do you want to change it to C if it's already running in ASM ?

 

If you are really hunting for speed, then inline the code where speed really matters to save the call/ret.

In ASM, change to use IJMP (I know that might move the structure away from what C uses, and I guess your ASM does as well).

 

Last Edited: Fri. Apr 17, 2015 - 07:37 AM

sparrow2 wrote:

But why do you want to change it to C if it's already running in ASM ?

 

It's a fairly big game project for the Uzebox, I am trying to get as much of the non video rendering stuff into C as possible for other people to be able to read and learn from it.

 

The section of the game that renders all the video to screen and draws all the lines is in ASM as C won't be able to do that.

 

However the sections of the game that do

 

Object management

Object movement

Collision detection

Game logic

 

Are not so time critical and are mostly in C already.

 

The Sin/Cos thing is called often enough that I will code it as inline ASM though.  The inline ASM thing is not as intimidating as it first looked now I have bothered to work it out.

 


It's a fairly big game project for the Uzebox, I am trying to get as much of the non video rendering stuff into C as possible for other people to be able to read and learn from it.

 

If you document well how it works, I would prefer clean ASM to C full of all kinds of compiler "hacks".


Yes - anything that looks uglier in C to make it optimal won't be used.

 

For example the above code where #defining an inline ASM macro only saved one clock over

 

xAdd = (xAdd * Z) >> 8;
yAdd = (yAdd * Z) >> 8;

 

is out.  But the inline ASM to replace

 

xAdd = (SinFastC(thisLaneAngle + edgeAngle) * laneWidth) >> 8;
yAdd = (CosFastC(thisLaneAngle + edgeAngle) * laneWidth) >> 8;

is in.  Because not only is it a lot of clocks, it also is called quite often.

 

Also any ASM code is pretty well explained when people do want to look at it - as an example

 

; fast_line_convert_x0_y0_into_VRAM_address
;
; converts the X0 and Y0 address passed in r24 and r22 into a VRAM memory
; location and leaves this result in R26:27 (VRAM_Address)
;
; Inputs
;            Y0 address = R22
;            X0 address = R24
; Outputs
;            VRAM_Address = R26:27 (X)
;
; Requires that the constants 4 and 32 are pre-loaded in R10, R11
;
; Trashes R0:1

.macro fast_line_convert_x0_y0_into_VRAM_address
     mul     r22, r10                               ; Multiply Y0 by 4   y7y6y5y4y3y2y1y0      => .0.0.0.0.0.0y7y6:y5y4y3y2y1y0.0.0
     movw    r26, r0                                ; move the 16 bit result into VRAM_Address
     andi    r26, 0b11100000                        ; clear out the bits that are used for Xn     .0.0.0.0.0.0y7y6:y5y4y3.0.0.0.0.0
     mul     r24, r11                               ; Multiply X0 by 32  x7x6x5x4x3x2x1x0      => .0.0.0x7x6x5x4x3:x2x1x0.0.0.0.0.0
     or      r26, r1                                ; OR X7..3 into low byte of VRAM_Address      .0.0.0.0.0.0y7y6:y5y4y3x7x6x5x4x3
     subi    r27, hi8(-(vram))                      ; Add the base address of VRAM                .0.0.0.0.1.1y7y6:y5y4y3x7x6x5x4x3
.endm

And any code that is almost a direct equivalent of something from C has the C code above it in comments.

 

/*
void renderObjects(void){
     uint8_t i;
     ObjectDescStruct *Current;

     drawFunctionPointer_t drawFunction;

     for(i = 0; i < MAX_OBJS; i++) {
          Current = (ObjectDescStruct*)&ObjectStore[i];
          if(Current->obType != OBJ_EMPTY) {
               drawFunction = (drawFunctionPointer_t)pgm_read_word(&drawFunctionPointers[Current->obType]);
               drawFunction(Current);
          }
     }
}
*/

renderObjects:
     fast_line_entry_C                               ; save all the registers C does not want trashed
     fast_line_entry                                 ; set up all the registers for the draw line routines

     ldi		r28, lo8(ObjectStore)        ; Get the base address of the ObjectStore[] array
     ldi		r29, hi8(ObjectStore)
renderObjectsLoop:
     ld      r30, Y                                  ; Get the Object Type of the current object (r30 = ObjectStore[i].ObjType)
     cpi     r30, 0x00                               ; If the object type is <zero>
     breq    renderObjectsSkip                       ; we don't need to draw anything
     add     r30, r30                                ; Multiply the object number by 2 (flash addressing)
     clr     r31                                     ; and clear ZH (NB: Object Type can not be > 128 for this to work)
     subi    r30, lo8(-(drawFunctionPointers))       ; Add the base address to the function pointer table in PROGMEM
     sbci    r31, hi8(-(drawFunctionPointers))
     lpm     r26, Z+                                 ; Get the address of the code/routine that draws the given
     lpm     r27, Z+                                 ; object type into Register X
     movw    r30, r26                                ; move that value into Z ready for the ICALL
     icall                                           ; Call the routine that draws the object

renderObjectsSkip:
     adiw    r28, 0x08                               ; add 8 to Y to point to the next element of ObjectStore[]
     cpi     r29, 0x04                               ; if the address of Y hits 0x0400 we are at the end of the object store
     brne    renderObjectsLoop                       ; else loop back

     fast_line_exit_C                                ; restore the registers C needs to not be trashed
     ret

It is a fairly famous game I am porting to the Uzebox and I expect a few people will be interested in how I managed it.
 


  xAdd = (Z * xAdd)>>8;                                    // Work out how far the object is from the start of the lane
  mulsu   r21, r19
  movw    r24, r0
  eor     r1, r1

 I'm actually pretty impressed.  I thought that mixed-size arithmetic (multiplying two bytes to get 16 bits) was one of the last holdout advantages for assembler. (that, and direct use of the carry bit.)

 


But the biggest problem: you never know when the compiler will change the code to something bigger!

And remember, for the compiler you haven't done the >>8 yet! (It should have been something like mov r24,r1.) I guess it will add: mov r24,r25


sparrow2,

 

straight after the

movw r24, r0

it does

add    r25, r19
std    Y+3, r25

so there is no extra move to get the >>8.

 

The only real waste in those few lines is the extra unnecessary eor r1,r1 it has to do again a few clocks later.

 

BTW - I reinstalled 4.8.2, cleaned up my project and removed a lot of orphaned code and files and I could get the compiler to come up with the mulsu code above.

 

HOWEVER the compiler completely screwed up a pointer in another part of the code and had me chasing a phantom bug for 2 days.

 

Installed 4.9.2 20140912 and the compiler now gets the ASM correct.  I suspect it was this bug, which was fixed as of 4.9.0

 

https://gcc.gnu.org/bugzilla/sho...

 

In 4.8.2 the code loaded a memory address for a pointer into X, did some additions to X to access members of the struct, then called another routine with the trashed version of X.

 

In 4.9.2 the address is loaded into r16, copied to X, access the members trashing X, reload X from r16, then call other routine.
 


I'm not sure that I get the whole thing. But as I see it you don't need the movw r24, r0!

In ASM it should be done directly by storing r1 (and I don't get the add r25,r19; it's not a +=).

 

 


sparrow2,

 

As shown in post #50 the next C statement straight after them is an =

	xAdd = (Z * xAdd)>>8;		// Work out how far the object is from the start of the lane
	yAdd = (Z * yAdd)>>8;

	//xAdd = MulSU(Z, xAdd);
	//yAdd = MulSU(Z, yAdd);

	//MulSU2(Z, xAdd, yAdd);

	objData->x = x + xAdd;		// Add the delta X/Y to the lane start position
	objData->y = y + yAdd;
}

Now if I was optimizing the group of statements in ASM I just would have added R1 to R25.  I was only looking at the single

xAdd = (xAdd * Z)>>8;

here though.  My inline ASM version moved R1 somewhere first because it was an optimization of a single line.  The C version using MOVW is the same speed as MOV.  Just trashes R24 needlessly.

 

I know that if I did everything in ASM I would make faster code, but in this section of the game, where speed is not as critical and readability is, I accept a clock here and a clock there slower.

 

In the bits of code that need to be fast

 

  • renderObjects
  • drawXXXX
  • bresenhamLine
  • setPixel

 

which are called hierarchically - I have been able to globally optimize register usage and very aggressively inline stuff with assembly macros.  Having the carry flag, and not having to comply with the ABI requirement that R24 always be used for arguments, has allowed me to make that stuff very fast.


I think this will be the last one.  I have picked most of the low hanging fruit.  In fact even with this one I am probably going to leave it in C.  I am curious if there is a way to get C to use MUL or ignore the MSB.

 

I have a 16 bit pointer that is the address of a structure in an array.  I need to go back to the index of the array

 

In effect I am doing

 

i = ( (uint16_t)(objectPointer) - (uint16_t)(&objectArray[0]) ) / sizeof(object);

 

Now I know

  • the result of this answer can be at most 59 (a 6 bit number)
  • The sizeof() the structure is 8 bytes
  • The base address of the array is 0x0220 (fixed address from a section)
  • Therefore the highest address the pointer could be is 0x03F8 (0x220 + 59*8)
  • Therefore the highest value the result of the /8 can be is 127 (a 7 bit number)

 

So I know if I do the divide by 8 first, I can just use the lowest 8 bits and subtract 0x44 (0x0220 / 8 = 0x44); hence the most optimal C I can make is

 

i = ((uint8_t)((uint16_t)(ob2) / 8)) - 0x44;

which gets compiled to

 

ldi   r24, 0x03	; 3          Do the lsr/ror loop 3 times
lsr   r23
ror   r22
dec   r24
brne  .-8
subi  r22, 0x44	; 68

Which is 16 clocks.

 

In ASM I can

 

ldi   r21, 32        ; value to multiply by
mul   r23, r21
mov   r22, r1        ; get the lowest 5 bits of result into destination
mul   r24, r21
or    r22, r0        ; or in the highest 2 useful bits into destination
eor   r1,  r1
subi  r22, 0x44 ; 68 

 

that keeps the full 8 bits of the result in 9 clocks OR because I know the answer is only 7 bits I can

 

lsr   r23
ror   r22
lsr   r23
ror   r22
lsr   r22    ; last iteration does not need ror as result is 7 bits
subi  r22, 0x44

which is the answer I need in this specific case in 6 clocks.

 

Any way of getting C to understand I only have 7 bits and the short LSR/ROR trick can be taken?  Or even make it do the full 3 LSR/ROR (no BRNE.-8) and blow the code size out by 1 word and still get the answer in 7 clocks?
 


Can any one please tell me code vision avr is compatible with which usb programmer?its urgent


Does -O3 unroll it?

If yes, find which compiler flag makes the change.

Last Edited: Thu. Apr 23, 2015 - 08:05 AM

Perhaps do only /4 and then /2 (but it's dangerous: the next compiler version may see that it can be done as one operation).

Because you only need 7 bits, you know the result of the /4 still fits in a byte.

 

 

Last Edited: Thu. Apr 23, 2015 - 09:36 AM

And I don't understand why the compiler builds the loop that way instead of using the carry flag for the branch; it would be the same size but save a clock per iteration, and would not need a register.


Sparrow2, -O3 does unroll it but I have to use -Os to fit all my game into the mega644.  The difference between -O3 and -Os is about 4K in size saving.  When I include the music assets in the game I have about 300 bytes left.

 

I have just been reading some more about GCC and optimizing.  Looks like I have to put some of these functions in a separate C file (fast.c) so I can run -O3 on them, but leave the non-critical stuff in slow.c and then link them.  There seems to be no way to just ask the compiler to switch optimize levels around bits of code.
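For what it is worth, GCC does document a per-function optimize attribute (and a matching #pragma GCC optimize), so splitting files is not the only route. The sketch below is untested on avr-gcc 4.8/4.9 and the function name is hypothetical. The GCC manual itself cautions that the attribute is meant for debugging rather than production use, so check the generated listing before relying on it.

```c
#include <stdint.h>

/* GCC-specific: ask for -O3 on this one function while the rest of
 * the file is built with whatever is on the command line (e.g. -Os).
 * Assumption: the object store starts at 0x0220 with 8-byte entries,
 * as described elsewhere in this thread. */
__attribute__((optimize("O3")))
uint8_t object_index(uint16_t addr)
{
    /* Divide by 8 first, keep the low byte, then remove the base (0x0220 / 8). */
    return (uint8_t)((uint8_t)(addr >> 3) - 0x44u);
}
```

If the attribute turns out not to change the shift loop on your compiler version, the separate fast.c / slow.c split remains the dependable option.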

 

But your idea of a /4 and /2 does work

 

ob1->animation = ((uint8_t)((uint8_t)((uint16_t)(ob2) / 4)) / 2) - 68;

becomes

 

    96e8:	76 95       	lsr	r23
    96ea:	67 95       	ror	r22
    96ec:	76 95       	lsr	r23
    96ee:	67 95       	ror	r22
    96f0:	66 95       	lsr	r22
    96f2:	64 54       	subi	r22, 0x44	; 68
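A quick sanity check of the arithmetic, as a sketch that runs on a PC. The constants come from the posts above (base 0x0220, 8-byte objects, 60 entries); the macro and function names are made up for illustration. It compares the plain subtract-then-divide form with the divide-first forms over every valid object address.

```c
#include <stdint.h>

#define OBJECT_BASE   0x0220u   /* fixed section address from the post */
#define OBJECT_SIZE   8u        /* size of one object in bytes */
#define OBJECT_COUNT  60u       /* indices 0..59 */

/* Reference version: subtract the base, then divide by the object size. */
static uint8_t index_plain(uint16_t addr)
{
    return (uint8_t)((addr - OBJECT_BASE) / OBJECT_SIZE);
}

/* Divide by 8 first, truncate to a byte, then subtract base / 8 (0x44). */
static uint8_t index_div8(uint16_t addr)
{
    return (uint8_t)((uint8_t)(addr / 8u) - 0x44u);
}

/* The /4 then /2 variant: addr / 4 is at most 0xFE for valid addresses,
 * so it still fits in a byte and the last halving is an 8-bit shift. */
static uint8_t index_div4_div2(uint16_t addr)
{
    return (uint8_t)((uint8_t)((uint8_t)(addr / 4u) / 2u) - 0x44u);
}
```

All three agree for addresses 0x0220 through 0x03F8 in steps of 8; outside that range the divide-first forms are not meant to be valid.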

 


sparrow2 wrote:

And I don't understand why the compiler builds the loop that way instead of using the carry flag for the branch; it would be the same size but save a clock per iteration, and would not need a register.

 

you mean to

 

andi   r22, 0x03
ori    r22, 0x04
lsr    r23
ror    r22
brcc   .-6
subi   r22, 0x44

 


correct instructions wrong numbers! (AND to keep data and then the OR the "counter")


OK - another one.

extern uint8_t spibyte_ff(void);

int mmcGetInt(void){
    return( spibyte_ff() | spibyte_ff()<<8 );
}

comes out as

int mmcGetInt(void){
    push    r28

return( spibyte_ff() | spibyte_ff()<<8 );

    call    0xa9c4      ; 0xa9c4 <spibyte_ff>
    mov     r28, r24
    call    0xa9c4      ; 0xa9c4 <spibyte_ff>
    mov     r18, r28
    ldi     r19, 0x00   ; 0

}
    movw    r20, r18
    or      r21, r24
    movw    r24, r20
    pop     r28
    ret

and I would like it to look more like

 

    call    spibyte_ff
    push    r24
    call    spibyte_ff
    pop     r25
    ret

Technically in the ASM version I don't need to push/pop as I know R25 is untouched by spibyte_ff, but I am keeping it C/ABI compatible and assuming R25 could get trashed.

 

 

I'm using GCC 4.9.2 with -Os.  I have an almost full Mega644 and am going through C code and trying to save an extra few K to fit in another song now.


I don't expect that it can come all the way down to that.

But first try (I don't have your compiler here) to replace the | with + (with no overlap it's the same)

And try to make a local U16 you assign to.


Had already tried

 

uint16_t i;

((char*)&i)[0] = mmc_get_byte();
((char*)&i)[1] = mmc_get_byte();
return i;

and

 

return( mmc_get_byte() + mmc_get_byte()<<8 );

and

 

return( mmc_get_byte() + mmc_get_byte()*256 );

The one with OR ended up the best.

 

Just wanted to check here that there was not something else I didn't know about to help.  I have learned a LOT about how the C compiler thinks here, but I still know very little.


I would still try with a local variable, and spread it over 2 lines.

that way you should at least avoid the push and pop.  
 


I have learned a LOT about how the C compiler thinks here

That's the problem: a compiler doesn't think ;)

 


I'm using GCC 4.9.2 with -Os.  I have an almost full Mega644 and am going through C code and trying to save an extra few K to fit in another song now.

That seems pretty sensible.   You can always change to an ATmega1284 if you run out of space.

 

I have no idea what your "song" might be.    Any .WAV or .MP3 is going to use more flash than any AVR could supply.

Adding a microSD or even an external Flash memory will provide extra memory.

 

I can quite understand your desire to minimise program size.    I was expecting you to be struggling with a 2kB or 8kB device that had no bigger brothers.

 

Yes,  you can save a few bytes by altering some function usage.    The general approach is:

1. list functions by size.   look at the greedy ones first.

2. use appropriate width of variables and scope.

3. avoid f-p maths and printf() functions.

4. use smaller algorithms.

 

I suspect that those 4 points would give you a significant saving.   Minor tweaks to one function call will only save tens of bytes.

 

David.


Note that code like this has unspecified behavior:

return( mmc_get_byte() + mmc_get_byte()<<8 );

 

C doesn't specify the order in which the functions are called so the compiler might call the second function first.
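A sketch of the usual way around this: read each byte into its own local, one statement at a time, so the order is fixed by the statement boundaries. The byte source here is a hypothetical stub so the example runs on a PC; on the target it would be mmc_get_byte(). Note also that + binds tighter than <<, so a + b<<8 parses as (a + b) << 8; the explicit parentheses below avoid that trap as well.

```c
#include <stdint.h>

/* Hypothetical stand-in for the SD card byte reader. */
static const uint8_t *stream_pos;

static void stream_open(const uint8_t *data)
{
    stream_pos = data;
}

static uint8_t stream_get_byte(void)
{
    return *stream_pos++;
}

/* Read a little-endian 16-bit value. Each call sits in its own
 * statement, so the first byte read is always the low byte. */
static uint16_t stream_get_u16(void)
{
    uint8_t low  = stream_get_byte();
    uint8_t high = stream_get_byte();

    return (uint16_t)(low | ((uint16_t)high << 8));
}
```

This fixes the ordering question only; it says nothing about whether the compiler will still insert the push/pop around the calls.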


christop wrote:

Note that code like this has unspecified behavior:

return( mmc_get_byte() + mmc_get_byte()<<8 );

C doesn't specify the order in which the functions are called so the compiler might call the second function first.

 

Thanks for the reminder.  I have read this before.

 

sparrow2 wrote:

I would still try with a local variable, and spread it over 2 lines.

that way you should at least avoid the push and pop.  

 

OK - just tried splitting it to two lines.  Slightly different result, but same size.

 

david.prentice wrote:
You can always change to an ATmega1284 if you run out of space.

 

Sadly no.  It is a game for the Uzebox.  The user base already all have Mega644s.  Squeezing the last drop of blood from the stone is the only option.

 

david.prentice wrote:
I have no idea what your "song" might be.    Any .WAV or .MP3 is going to use more flash than any AVR could supply.

Adding a microSD or even an external Flash memory will provide extra memory.

 

It is a "MOD" like song with the notes being synthesised rather than being WAVs. There is only one WAV which is sampled speech of the word "Play"

 

I think I have moved all the "graphics assets" I can to SD card.  Some of them can not be moved to SD card - for example the message saying "SD card not found" :)

 

david.prentice wrote:
1. list functions by size.   look at the greedy ones first.

 

Done.  Partially at least.  The greediest functions are giant 12,000 line long unrolled ASM functions, but they exist so the game can be done at all.

 

david.prentice wrote:
2. use appropriate width of variables and scope.

 

I think I have been programming space constrained systems long enough to have this one down pat :)

 

david.prentice wrote:
avoid f-p maths and printf() functions.

 

All fixed point and integer, using as few bits as will get the task done.

 

No printf or atoi() stuff.  My font table is even non-ascii to save time/space printing my BCD and HEX values to screen.

 

david.prentice wrote:
 Minor tweaks to one function call will only save tens of bytes.

 

Have all the usual major tweaks you mention in hand.

 

So far with the minor tweaks like this one - I have saved 400ish bytes.  That is one third of the way to being able to include a 3rd song in the game.

 

Uze is trying to see if he can modify the kernel to make the sampled speech 4 bits rather than 8 bits.  That would save another 800 bytes.

 

So if I keep plodding along I might get there.

 

For reference my memory budget is

 

  • 4096 Bootloader (I can't touch)
  • 16436 the 12 thousand lines of ASM code that draws pixels to the screen
  • 7137 Music assets including the sampled speech
  • 4918 Fast line drawing routines and vector objects
  • 1280 3D point and vector data
  • 780 Object behaviour data
  • 768 trig, scaling and zoom tables
  • 512 font data that can not be stored on SD card
  • 226 Text that can not be stored on SD card
  • 474 free space (and counting)

 


What do you have in the eeprom?

And why does the bootloader need to be that big?

Last Edited: Sat. Aug 1, 2015 - 10:36 PM

Bootloader is written by someone else and contains

 

Video and sound generation (NTSC + mono 15 kHz)

SD Card and FatFS reading of games stored on SD card

Menu system to select which new game to play

Flash programming stuff as normal for a bootloader

 

The EEPROM is shared space that all games written for the uzebox have to share.  It is "formatted" and arbitrated by the "kernel" code you have to include/link to your own game.

 

I myself use 64 bytes of the EEPROM for a highscore table.  It's generally considered poor form to use more than 32 bytes, but my game is impressive enough I think it warrants 2 "blocks" :)


OK - this one has just moved to a (dot)s file.  Using the knowledge that mmc_get_byte only touches r24, r30, r31 AND the fact that GetInt is a subset of GetLong let me shrink it a bit.

 

.section .text.mmcGetLong
mmcGetLong:
    rcall   mmc_get_byte     ; First byte from SD card straight to R22
    mov     r22, r24
    rcall   mmc_get_byte     ; Second byte from SD card straight to R23
    mov     r23, r24
                             ; Fall through to GetInt: to receive 3rd and 4th bytes
.section .text.mmcGetInt
mmcGetInt:
    rcall   mmc_get_byte     ; First byte from SD card to temp location in R20 (3rd byte of GetLong)
    mov     r20, r24
    rcall   mmc_get_byte     ; Second byte from SD card to temp location in R21 (4th byte of GetLong)
    mov     r21, r24    
    movw    r24, r20         ; Move R20:21 to the R24:25 location C is expecting it
    ret

.section .text.mmcGetChar    ; Simple fall through to get_byte
mmcGetChar:

.section .text.mmc_get_byte
mmc_get_byte:
    rcall   spibyte_ff
    .
    .
    .

 


It is still going to stay in ASM now, because I have made the effort and the ASM is 1/2 the size when considering getInt and getLong together.

 

But it was bugging the hell out of me that I could not get it to just OR two bytes together into a i16 without all the mess.

 

This is the best I have come up with.

 

uint16_t mmcGetInt(void){

union{
    struct {
        uint8_t i1;
        uint8_t i2;
    };
    uint16_t i16;
} ii;

ii.i1 = getbyte();
ii.i2 = getbyte();

return(ii.i16);

}

And that comes out as

 

push    r28
call    0xd900   ; 0xd900 <getbyte>
mov     r28, r24

call    0xd900   ; 0xd900 <getbyte>
mov     r25, r24

mov     r24, r28
pop     r28
ret

It is only one push/pop behind the ASM (considering getInt alone)

 

Does anyone know if there is a way to hint the compiler that "getByte" only touches r24 so it can avoid the push/pop and just use some register pair like R22:23 ?


Looking at your "budget",   I would start by optimising the 16kB of ASM.    That is clearly the greediest part.

 

Quite honestly,  there is little point in optimising mmcGetInt(void).    It is the calling sequence that matters.     e.g. RCALL mmcGetInt / MOV r24 / MOV r25 i.e. 3 words.

Even that could be reduced to 1 word if the subsequent code "knows" it is going to use r24, r25.

Even if the actual mmcGetInt is in an external file (so the linker wants CALL) you can put in a local trampoline.

 

You are an experienced ASM programmer.    You must know all the standard techniques for reducing code.    You also know that 95% of code space is not speed critical.   It is only the 5% that is used for 90% of the time e.g. inside loops,  common subroutines.

 

David.


david.prentice wrote:
I would start by optimising the 16kB of ASM.    That is clearly the greediest part.

 

The 16K is not optimizable at all and the whole game hinges on it.  Without it there is no game and the rest of the exercise is pointless.  I am willing to be proven wrong here if anyone thinks they can save any space on it, but I think that 16K can't have a single BYTE saved from it.

 

david.prentice wrote:
Quite honestly,  there is little point in optimising mmcGetInt(void).

 

I know any individual routine is never going to save heaps of space.  But when you have done ALL the rest you can think of you have to try them.

 

I have done a lot of optimizing so far that I have not posted questions here about because I figured them out myself.  So far I have now saved 600 bytes from optimizing all the little things like GetInt.  They ADD UP.

 

I managed to reduce getInt and getLong from 106 bytes (first un optimized C) down to 20 bytes in ASM. (My best C got down to 46 bytes I think)

 

The other nice thing about my "optimizing the little things" like the SD card reading functions here is that when I push all my little things back into the "kernel" I will save space on everyone else's Uzebox games that are written in the future.

 

david.prentice wrote:
You also know that 95% of code space is not speed critical.

 

This is actually an unusual case in that more than 50% of my flash space is taken up with speed critical things.


And no one picked me up on the obvious saving of another 4 bytes I missed. (I noticed it myself when I was cleaning up comments)

 

.section .text.mmcGetLong
mmcGetLong:
  rcall  mmcGetInt       ; First two bytes from SD card and move to R22:23
  movw   r22, r24
                         ; Fall through to GetInt to receive 3rd and 4th bytes
.section .text.mmcGetInt
mmcGetInt:
  rcall  mmc_get_byte    ; First byte from SD card to temp location in R20 (3rd byte of GetLong)
  mov    r20, r24
  rcall  mmc_get_byte    ; Second byte from SD card to temp location in R21 (4th byte of GetLong)
  mov    r21, r24
  movw   r24, r20        ; Move R20:21 to the R24:25 location C is expecting it
  ret
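
For comparison, here is a plain C sketch of the same two routines. It is only an illustration: mmc_get_byte() is replaced by a hypothetical stub that reads from a test buffer so the logic can be checked on a PC, and the byte order (low byte first) is taken from the assembler above. How close GCC gets to the 20-byte ASM version with this will depend on the compiler version and flags.

```c
#include <stdint.h>

/* Hypothetical stand-in for the real SD card read: returns successive
   bytes from a small test buffer (wrapping at the end). */
static const uint8_t fake_card[] = { 0x78, 0x56, 0x34, 0x12, 0xCD, 0xAB };
static uint8_t fake_pos = 0;

static uint8_t mmc_get_byte(void)
{
    uint8_t b = fake_card[fake_pos];
    fake_pos = (uint8_t)((fake_pos + 1) % sizeof fake_card);
    return b;
}

/* Read two bytes, low byte first, as the assembler version does. */
uint16_t mmcGetInt(void)
{
    uint8_t lo = mmc_get_byte();
    uint8_t hi = mmc_get_byte();
    return (uint16_t)(lo | ((uint16_t)hi << 8));
}

/* Read four bytes by reusing mmcGetInt for each 16-bit half. */
uint32_t mmcGetLong(void)
{
    uint16_t lo = mmcGetInt();
    uint16_t hi = mmcGetInt();
    return (uint32_t)lo | ((uint32_t)hi << 16);
}
```

The ASM version wins because it can fall through from one routine into the other, which C has no way to express.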

 


andrewm1973 wrote:
Uze is trying to see if he can modify the kernel to make the sampled speech 4 bits rather than 8 bits.  That would save another 800 bytes.
I've done that for this:

https://www.avrfreaks.net/comment/1628051#comment-1628051

Also supports 2-bit, but it sounds awful ;-)

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 


Thanks joey,

 

I am pretty sure that 4 bit will sound fine for the sample as it is played at the same time as 3 other channels of 8 bit generated sound in the music.

 

The problem for Uze will be whether he can fit the 3 note channels, the one noise channel and the 4 bit WAV channel into the 130 clocks that HSync runs for.


Should be easily done in assembler.  Currently you have:

8-bit:

LPM: 3 cycles

 

You'll want:

4-bit:

low nibble:

LPM: 3 cycles

MOV: 1 cycle

AND: 1 cycle

high nibble:

SWAP: 1 cycle

AND: 1 cycle

You'd also need to track which nibble you're working on, so some extra cycles would be spent setting, testing, and clearing a flag somewhere.  I assume all the low-hanging fruit like GPIOR0 are already spoken for, but perhaps there's something else available.
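
The nibble selection above can be sketched in portable C as follows. This is only an illustration under my own assumptions: the table name and contents are made up, the low nibble is taken to be played first, and on the AVR the table would sit in flash and be fetched with LPM rather than a plain array read. Here the low bit of the sample index stands in for the "which nibble" flag.

```c
#include <stdint.h>

/* Hypothetical packed sample data: two 4-bit samples per byte,
   low nibble first. */
static const uint8_t packed[] = { 0x21, 0x43, 0xF0 };

/* Return 4-bit sample number `index` as a value in 0..15. */
uint8_t sample4(const uint8_t *data, uint16_t index)
{
    uint8_t byte = data[index >> 1];   /* one fetch serves two samples */
    if (index & 1) {
        byte >>= 4;                    /* high nibble: SWAP + AND on the AVR */
    }
    return (uint8_t)(byte & 0x0F);     /* low nibble needs only the AND */
}
```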

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Thu. Aug 6, 2015 - 12:00 AM

<Thread Necromancy>

 

Hi again all,

 

I need help again with giving the compiler hints that are still human friendly.

 

The following code compiles and works fine.  Just a bit slow.  As I have said previously I like to keep code in C rather than moving it to ASM if I can keep performance within 5% or so.  Unless of course the C is more confusing and less readable than the ASM version.

 

typedef struct {
	uint8_t next;
	uint8_t col;
	uint8_t y1;
	uint8_t x1Lo;
	uint8_t x1;
	int16_t m;
	uint8_t y2;
} GradLineStruct;

void Add_Line(uint8_t col, uint8_t y1, uint8_t x1, int16_t m, uint8_t y2){

	if (linkNext == 255) { return; }                                        // If we have run out of space to add new lines then fail

	if (linkNext != 0) { gradLines[linkNext].next = linkNext + 1;	};      // If we are pointing to array index[0] then don't update "next"
                                                                            //   index[1] is the first valid entry that can be used.
																			
	linkNext++;                                                             // Update the "Next" to point to the current index (it was pointing to last)
	
	gradLines[linkNext].next = 0;                                           // Insert the values
	gradLines[linkNext].col  = col;
	gradLines[linkNext].y1   = y1;
	gradLines[linkNext].x1   = x1;
	gradLines[linkNext].x1Lo = 0;
	gradLines[linkNext].m    = m;
	gradLines[linkNext].y2   = y2;
}

And the LSS is

 

void Add_Line(uint8_t col, uint8_t y1, uint8_t x1, int16_t m, uint8_t y2){
    2a18:	0f 93       	push	r16

	if (linkNext == 255) { return; }                                        // If we have run out of space to add new lines then fail
    2a1a:	a0 91 c7 01 	lds	r26, 0x01C7	; 0x8001c7 <linkNext>
    2a1e:	af 3f       	cpi	r26, 0xFF	; 255
    2a20:	d9 f0       	breq	.+54     	; 0x2a58 <Add_Line+0x40>
    2a22:	e1 e0       	ldi	r30, 0x01	; 1
    2a24:	ea 0f       	add	r30, r26

	if (linkNext != 0) { gradLines[linkNext].next = linkNext + 1;	};      // If we are pointing to array index[0] then don't update "next"
    2a26:	aa 23       	and	r26, r26
    2a28:	39 f0       	breq	.+14     	; 0x2a38 <Add_Line+0x20>
    2a2a:	98 e0       	ldi	r25, 0x08	; 8
    2a2c:	a9 9f       	mul	r26, r25
    2a2e:	d0 01       	movw	r26, r0
    2a30:	11 24       	eor	r1, r1
    2a32:	a0 50       	subi	r26, 0x00	; 0
    2a34:	bc 4f       	sbci	r27, 0xFC	; 252
    2a36:	ec 93       	st	X, r30
                                                                            //   index[1] is the first valid entry that can be used.
																			
	linkNext++;                                                             // Update the "Next" to point to the current index (it was pointing to last)
    2a38:	e0 93 c7 01 	sts	0x01C7, r30	; 0x8001c7 <linkNext>
	
	gradLines[linkNext].next = 0;                                           // Insert the values
    2a3c:	98 e0       	ldi	r25, 0x08	; 8
    2a3e:	e9 9f       	mul	r30, r25
    2a40:	f0 01       	movw	r30, r0
    2a42:	11 24       	eor	r1, r1
    2a44:	e0 50       	subi	r30, 0x00	; 0
    2a46:	fc 4f       	sbci	r31, 0xFC	; 252
    2a48:	10 82       	st	Z, r1
	gradLines[linkNext].col  = col;
    2a4a:	81 83       	std	Z+1, r24	; 0x01
	gradLines[linkNext].y1   = y1;
    2a4c:	62 83       	std	Z+2, r22	; 0x02
	gradLines[linkNext].x1   = x1;
    2a4e:	44 83       	std	Z+4, r20	; 0x04
	gradLines[linkNext].x1Lo = 0;
    2a50:	13 82       	std	Z+3, r1	; 0x03
	gradLines[linkNext].m    = m;
    2a52:	36 83       	std	Z+6, r19	; 0x06
    2a54:	25 83       	std	Z+5, r18	; 0x05
	gradLines[linkNext].y2   = y2;
    2a56:	07 83       	std	Z+7, r16	; 0x07
}
    2a58:	0f 91       	pop	r16
    2a5a:	08 95       	ret

 

I have no idea WHY there is a PUSH R16 at the start of the routine and a POP R16 at the end, OR why there is the superfluous SUBI R30, 0x00 and EOR R1,R1, but that's not the question here.

 

If you look at the C code, you can see that the two array elements accessed are adjacent.  I have been able to trick the compiler into reaching the second one through larger offsets from the same pointer, like this

 

typedef struct {
	uint8_t next;
	uint8_t col;
	uint8_t y1;
	uint8_t x1Lo;
	uint8_t x1;
	int16_t m;
	uint8_t y2;
} GradLineStruct;

typedef struct {
	uint8_t last;
	uint8_t dummy1;
	uint8_t dummy2;
	uint8_t dummy3;
	uint8_t dummy4;
	int16_t dummy5;
	uint8_t dummy6;
	uint8_t next;
	uint8_t col;
	uint8_t y1;
	uint8_t x1Lo;
	uint8_t x1;
	int16_t m;
	uint8_t y2;
} GradLineStructX2;

void Add_Line(uint8_t col, uint8_t y1, uint8_t x1, int16_t m, uint8_t y2){

	GradLineStructX2 *KludgePointer;
	
	if (linkNext == 255) { return; }                                        // If we have run out of space to add new lines then fail

	KludgePointer = (GradLineStructX2*)&gradLines[linkNext];
	
	if (linkNext != 0) { KludgePointer->last = linkNext + 1;	};          // If we are pointing to array index[0] then don't update "next"
                                                                            //   index[1] is the first valid entry that can be used.
																			
	linkNext++;                                                             // Update the "Next" to point to the current index (it was pointing to last)
	
	KludgePointer->next = 0;                                                // Insert the values
	KludgePointer->col  = col;
	KludgePointer->y1   = y1;
	KludgePointer->x1   = x1;
	KludgePointer->x1Lo = 0;
	KludgePointer->m    = m;
	KludgePointer->y2   = y2;
}

Which compiles to this

 

void Add_Line(uint8_t col, uint8_t y1, uint8_t x1, int16_t m, uint8_t y2){
    2a18:	0f 93       	push	r16
    2a1a:	d9 01       	movw	r26, r18

	GradLineStructX2 *KludgePointer;
	
	if (linkNext == 255) { return; }                                        // If we have run out of space to add new lines then fail
    2a1c:	90 91 c7 01 	lds	r25, 0x01C7	; 0x8001c7 <linkNext>
    2a20:	9f 3f       	cpi	r25, 0xFF	; 255
    2a22:	a1 f0       	breq	.+40     	; 0x2a4c <Add_Line+0x34>

	KludgePointer = (GradLineStructX2*)&gradLines[linkNext];
    2a24:	28 e0       	ldi	r18, 0x08	; 8
    2a26:	92 9f       	mul	r25, r18
    2a28:	f0 01       	movw	r30, r0
    2a2a:	11 24       	eor	r1, r1
    2a2c:	e0 50       	subi	r30, 0x00	; 0
    2a2e:	fc 4f       	sbci	r31, 0xFC	; 252
    2a30:	31 e0       	ldi	r19, 0x01	; 1
    2a32:	39 0f       	add	r19, r25
	
	if (linkNext != 0) { KludgePointer->last = linkNext + 1;	};              // If we are pointing to array index[0] then don't update "next"
    2a34:	91 11       	cpse	r25, r1
    2a36:	30 83       	st	Z, r19
                                                                            //   index[1] is the first valid entry that can be used.
																			
	linkNext++;                                                             // Update the "Next" to point to the current index (it was pointing to last)
    2a38:	30 93 c7 01 	sts	0x01C7, r19	; 0x8001c7 <linkNext>
	
	KludgePointer->next = 0;                                           // Insert the values
    2a3c:	10 86       	std	Z+8, r1	; 0x08
	KludgePointer->col  = col;
    2a3e:	81 87       	std	Z+9, r24	; 0x09
	KludgePointer->y1   = y1;
    2a40:	62 87       	std	Z+10, r22	; 0x0a
	KludgePointer->x1   = x1;
    2a42:	44 87       	std	Z+12, r20	; 0x0c
	KludgePointer->x1Lo = 0;
    2a44:	13 86       	std	Z+11, r1	; 0x0b
	KludgePointer->m    = m;
    2a46:	b6 87       	std	Z+14, r27	; 0x0e
    2a48:	a5 87       	std	Z+13, r26	; 0x0d
	KludgePointer->y2   = y2;
    2a4a:	07 87       	std	Z+15, r16	; 0x0f
}
    2a4c:	0f 91       	pop	r16
    2a4e:	08 95       	ret

Now it still has the superfluous PUSH/POP/SUBI, but going by the addresses in the two listings it has shrunk from 68 bytes to 56.

 

But it does look kludgy.

 

Can a Compiler+ASM guru show me a better way to make the 2nd array access skip the expensive recalculation of the array address?
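
One less kludgy thing to try is sketched below: take a pointer to the current element once and reach the following element as p[1], so no second struct type is needed. This is only a sketch. The declarations of gradLines and linkNext are my assumptions (the real ones are not shown in the post), and I have not checked what avr-gcc actually emits for it, so the LSS needs inspecting before counting on the saving.

```c
#include <stdint.h>

typedef struct {
    uint8_t next;
    uint8_t col;
    uint8_t y1;
    uint8_t x1Lo;
    uint8_t x1;
    int16_t m;
    uint8_t y2;
} GradLineStruct;

/* Assumed declarations; the originals are not shown. */
GradLineStruct gradLines[256];
uint8_t linkNext = 0;

void Add_Line(uint8_t col, uint8_t y1, uint8_t x1, int16_t m, uint8_t y2)
{
    uint8_t idx = linkNext;                 /* read the global once */

    if (idx == 255) { return; }             /* out of space: fail */

    GradLineStruct *p = &gradLines[idx];    /* address computed once */

    if (idx != 0) { p[0].next = (uint8_t)(idx + 1); }  /* index[0] is never linked */

    linkNext = (uint8_t)(idx + 1);

    p[1].next = 0;                          /* p[1] is gradLines[idx + 1] */
    p[1].col  = col;
    p[1].y1   = y1;
    p[1].x1   = x1;
    p[1].x1Lo = 0;
    p[1].m    = m;
    p[1].y2   = y2;
}
```

Copying linkNext into a local is a guess at what helps the optimizer; if linkNext has to be volatile in the real code, that changes what the compiler is allowed to merge.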


BTW - here is my ASM version

; void Add_Line(uint8_t col, uint8_t y1, uint8_t x1, int16_t m, uint8_t y2){
;
;   if (linkNext == 255) { return; }                                        // If we have run out of space to add new lines then fail
;
;   if (linkNext != 0) { gradLines[linkNext].next = linkNext + 1;	};      // If we are pointing to array index[0] then don't update "next"
;                                                                           //   index[1] is the first valid entry that can be used.
;
;   linkNext++;                                                             // Update the "Next" to point to the current index (it was pointing to last)
;
;   gradLines[linkNext].next = 0;                                           // Insert the values
;   gradLines[linkNext].col  = col;
;   gradLines[linkNext].y1   = y1;
;   gradLines[linkNext].x1   = x1;
;   gradLines[linkNext].x1Lo = 0;
;   gradLines[linkNext].m    = m;
;   gradLines[linkNext].y2   = y2;
; }
;
; Notes:
;   The  "if (linkNext == 255) { return; }"  test is done quite late.  After address conversion and linkNext++.  This makes that
;   case slower than could be if the comparison was done early.  This however would make the more common and important case
;   when linkNext != 255 one clock slower.  You don't really care how slow the ==255 case is as you have run out of memory to
;   add more lines at this stage anyway.
	
AddLine:	
	lds     r23, linkNext              ; Fetch the global variable "linkNext" that is the current head pointer to gradient_lines list
	ldi     r21, 0x08
	mul     r23, r21                   ; Multiply "linkNext" by sizeof() the structure for the address conversion step
	movw    R30, R0                    ; Move the result of the MUL into Z early so we can restore r1 to <zero>
	eor     r1, r1                     ; r1 = 0
	inc     r23                        ; INC "linkNext".  The result of this is to be stored AND used for the ==255 check
	breq    Add_Line_Exit              ; If "linkNext" was ==255 prior to the above INC then the result would be ZERO so we return()
	sts     linkNext, r23              ; Save the value of the global variable "linkNext"
	subi    r31, 0xFC                  ; Finish of the array address conversion by adding the base address of the array
	                                   ;   only the high byte needs to be added as the array is 256 byte aligned.
	ldi     r21, 0x01
	cpse    r23, r21                   ; Compare "linkNext + 1" to 0x01.  If this is true then the value of "linkNext" before the
	                                   ;   inc was 0x00 and we should skip over the next instruction
	st      Z, r23                     ; array[linkNext].next = linkNext + 1
	std     Z+8, r1                    ; array[linkNext + 1].next = 0
	std     Z+9, r24                   ; array[linkNext + 1].col = col
	std     Z+10, r22                  ; array[linkNext + 1].y1 = y1
	std     Z+11, r1                   ; array[linkNext + 1].x1lo = 0
	std     Z+12, r20                  ; array[linkNext + 1].x1 = x1
	std     Z+13, r18                  ; array[linkNext + 1].m = m
	std     Z+14, r19                  ;   2nd byte of 16 bit value m
	std     Z+15, r16                  ; array[linkNext + 1].y2 = y2
Add_Line_Exit:
	ret

 


For the whole picture:

You pass something very close to the <GradLineStruct> structure as separate arguments, and that must generate a lot of code before the call!

 

So perhaps pass a pointer to a GradLineStruct variable that holds the values you want!

 


sparrow2 - thanks for the reply.

 

This function is the most likely delineation between what would be considered "user code" and "Kernel code"

 

At a lower level than this I have code that sorts the polygons into a Y sorted list, converts that list of lines into the form Y = mX + c, and uses that Y sorted list to create a Run Length Encoded RAM buffer while doing some interrupt ended code that "races the beam".

 

I want to hide all that incredibly complex stuff from anyone wanting to use this RLE mode for writing their own game. 

 

So "AddPolyLine" is the point at which I stop being responsible for data structures and how the code works and how the user wants to write their code.  Think of it as an API call.

 

I have no control, nor do I want to impose upon the user, how they are going to store/treat col, y1, x1, m and y2.

 

The video mode itself is not as amazing as T2K, but has the potential to be the 2nd most amazing video output code done on the AVR if I have not made any mistakes on my spreadsheet.

 

(Help with optimizing stuff for T2K was probably how this thread started BTW)


The reason for the PUSH/POP of R16 is because of this..

 

https://gcc.gnu.org/wiki/avr-gcc...

 

The ABI dictates that R2-R17 must be saved. Usually the compiler sticks to the ...

 

https://gcc.gnu.org/wiki/avr-gcc...

 

But if the code requires it to use SO many registers it will start to use call saved and then, as you see, it has to save them.

 

If you pass in a struct rather than all the individual dimensions you will reduce the register usage.
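
A minimal sketch of what passing a struct could look like. LineParams and Add_Line_P are hypothetical names made up for illustration, and the gradLines and linkNext declarations are assumed. With avr-gcc a single pointer argument arrives in r24:r25, so none of the call-saved registers are needed for parameters; the cost moves to the caller, which has to fill the block in RAM first.

```c
#include <stdint.h>

typedef struct {
    uint8_t next;
    uint8_t col;
    uint8_t y1;
    uint8_t x1Lo;
    uint8_t x1;
    int16_t m;
    uint8_t y2;
} GradLineStruct;

/* Hypothetical parameter block filled in by the caller. */
typedef struct {
    uint8_t col;
    uint8_t y1;
    uint8_t x1;
    int16_t m;
    uint8_t y2;
} LineParams;

/* Assumed declarations; the originals are not shown. */
GradLineStruct gradLines[256];
uint8_t linkNext = 0;

void Add_Line_P(const LineParams *in)
{
    uint8_t idx = linkNext;

    if (idx == 255) { return; }             /* out of space: fail */

    if (idx != 0) { gradLines[idx].next = (uint8_t)(idx + 1); }

    idx++;
    linkNext = idx;

    GradLineStruct *out = &gradLines[idx];
    out->next = 0;
    out->col  = in->col;
    out->y1   = in->y1;
    out->x1   = in->x1;
    out->x1Lo = 0;
    out->m    = in->m;
    out->y2   = in->y2;
}
```

Whether this is a net win depends on how the caller already holds its data; if the values live in registers at the call site, storing them to RAM first may cost more than the PUSH/POP it removes.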


Clawson,

 

I am well aware of call-saved and call-clobbered.

 

R16 is not clobbered at all.  It is passed in, pushed to the stack, used in a store and then popped off the stack.  A pointless waste of a push and pop, because it is not altered between the push and the pop.

 

Kind of like the first EOR r1,r1 is pointless because r1 is not used as <zero> between then and its next trashing by the second MUL.


Yes but the compiler will just follow internal rules such as "if R2-R17 used then PUSH/POP"

 

BTW the compiler is an open project, all improvements/fixes welcome from all who paid to use it.

 

(Oh, wait a minute, no one actually pays do they?)

Last Edited: Sun. Apr 22, 2018 - 11:08 AM

So, r16 gets the y2 variable due to the calling convention. This variable is not written to, but still the compiler sees fit to save it.

Sure, the compiler is being dumb here, but it's nothing I haven't seen before.

 

I don't see any elegant way to solve this. Just an inelegant way, which is to fuse 2 uint8_t variables into a uint16_t, so that r16 is not used; y2 will go to r18 instead, which is not call-saved.

 

void Add_Line(uint8_t col, uint16_t y1_x1, int16_t m, uint8_t y2){
    ...
    gradLines[linkNext].y1   = y1_x1 >> 8;
    gradLines[linkNext].x1   = y1_x1 & 0xFF;
    ...
}

 

 

Then you would have to call it with

Add_Line(col, (uint16_t)y1<<8 | x1, m, y2);

 

Ugly, right? But maybe in this case the compiler will be able to avoid the push/pop.

 

edit: you can make it less ugly with a macro, I guess.

 

#define ADD_LINE(col,y1,x1,m,y2) Add_Line((col), (uint16_t)(y1)<<8 | (x1), (m), (y2))
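
As an alternative to the macro, the packing can be sketched with small static inline helpers. The helper names are made up for illustration. The cast keeps the shift in unsigned 16-bit arithmetic (int is only 16 bits on the AVR, so shifting a promoted uint8_t left by 8 can overflow a signed int), and the low byte comes back out with a mask (& 0xFF).

```c
#include <stdint.h>

/* Pack y1 into the high byte and x1 into the low byte. */
static inline uint16_t pack_y1_x1(uint8_t y1, uint8_t x1)
{
    return (uint16_t)(((uint16_t)y1 << 8) | x1);
}

/* Recover the two halves inside the callee. */
static inline uint8_t unpack_y1(uint16_t y1_x1)
{
    return (uint8_t)(y1_x1 >> 8);
}

static inline uint8_t unpack_x1(uint16_t y1_x1)
{
    return (uint8_t)(y1_x1 & 0xFF);
}
```

A call would then read Add_Line(col, pack_y1_x1(y1, x1), m, y2), assuming the four-argument Add_Line suggested above.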

 

edit 2: and when I say the compiler is being dumb, I don't intend to aim it at the developers. GCC is immensely complex and sometimes, there is no easy way to fix things like this. In fact, those guys are heroes in my opinion, I fear the day when SprinterSB can/will no longer develop avr-gcc...

Last Edited: Sun. Apr 22, 2018 - 11:32 AM

Pages