As you are fussing with this construct and cycle-counting, have you tried 8 bits times 8 bits to get a 16-bit result -- and then just force taking the high byte with a cast or union? Perhaps you can trick the compiler this way. (Actually, I call it "giving the compiler a hint" and imply your ASM mindset.)
You can put lipstick on a pig, but it is still a pig.
I've never met a pig I didn't like, as long as you have some salt and pepper.
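A minimal sketch of the suggestion above, assuming Z is an unsigned 8-bit scale factor and the delta is a signed 8-bit value (the pairing that maps onto mulsu). The helper name scale_hi is made up for illustration, and whether avr-gcc emits a single multiply for it depends on the compiler version and flags.

```c
#include <stdint.h>

/* Hypothetical helper, not from the original project.
 * Multiplies an unsigned 8-bit scale factor by a signed 8-bit delta,
 * keeps the full 16-bit product, then returns only its high byte. */
static int8_t scale_hi(uint8_t z, int8_t delta)
{
    int16_t product = (int16_t)z * delta;  /* 8 x 8 -> 16 bits, magnitude at most 255 * 128 */

    /* Right-shifting a negative value is implementation-defined in C;
     * GCC performs an arithmetic shift, which is what this sketch assumes. */
    return (int8_t)(product >> 8);
}
```

A union of an int16_t and two bytes would be the other way to pick out the high byte; the cast-and-shift form is shown here because it does not depend on byte order.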
I copied your latest code to a standalone C source file (with a few missing definitions added to make it compile). That code is at http://pastebin.com/DKin9rVS.
Try as I might, I couldn't get avr-gcc to produce the longer multiplication code sequence with anything but -O0. -Os/-O1/-O2/-O3 all produce only two mulsu instructions, one for each multiplication.
Can you copy and paste the exact avr-gcc command that is used to compile your code? I suspect that this source file is not getting the right optimization flag for some reason.
Posted by andrewm1973: Fri. Apr 17, 2015 - 02:07 AM
Thanks Christop, it's definitely something in the makefile or environment for that project.
In a totally clean environment I get this
xAdd = (Z * xAdd)>>8; // Work out how far the object is from the start of the lane
mulsu r21, r19
movw r24, r0
eor r1, r1
yAdd = (Z * yAdd)>>8;
mulsu r20, r19
movw r30, r0
eor r1, r1
Which is only one clock behind the inline ASM for the same thing - close enough for this part of the code that it will stay as C.
I have looked further, and nowhere in my code that is doing similar fixed math is FMUL being used. However, in the kernel code FMUL is being used. They are both being compiled with the same flags.
; int8_t CosFastC(uint8_t angle)
;
; Returns the cosine of the angle
;
; Inputs
; Angle (0..255) in r24
; Returns
; cos(angle) as signed 8 bit value in r24
; Trashes
; R24
; R26:27
CosFastC:
subi r24, (-(64)) ; COS is 90 degrees out of phase with SIN
SinFastC:
mov r30, r24 ; Get the offset in the SIN table
ldi r31, hi8(SinCosTable)
lpm r24, Z ; Read value from table into r24
ret
I don't think this one is likely - but is there a way to make the C code not do
Posted by andrewm1973: Fri. Apr 17, 2015 - 08:33 AM(Reply to #59)
sparrow2 wrote:
But why do you want to change it to C if it's already running in ASM ?
It's a fairly big game project for the Uzebox, I am trying to get as much of the non video rendering stuff into C as possible for other people to be able to read and learn from it.
The section of the game that renders all the video to screen and draws all the lines is in ASM as C won't be able to do that.
However the sections of the game that do
Object management
Object movement
Collision detection
Game logic
Are not so time critical and are mostly in C already.
The Sin/Cos thing is called often enough that I will code it as inline ASM though. The inline ASM thing is not as intimidating as it first looked now I have bothered to work it out.
It's a fairly big game project for the Uzebox, I am trying to get as much of the non video rendering stuff into C as possible for other people to be able to read and learn from it.
If you make a good documentation of how it works, I would prefer clean ASM, than some C with all kind of compiler "hacks".
is in. Because not only is it a lot of clocks, it also is called quite often.
Also any ASM code is pretty well explained when people do want to look at it - as an example
; fast_line_convert_x0_y0_into_VRAM_address
;
; converts the X0 and Y0 address passed in r24 and r22 into a VRAM memory
; location and leaves this result in R26:27 (VRAM_Address)
;
; Inputs
; Y0 address = R22
; X0 address = R24
; Outputs
; VRAM_Address = R26:27 (X)
;
; Requires that the constants 4 and 32 are pre-loaded in R10, R11
;
; Trashes R0:1
.macro fast_line_convert_x0_y0_into_VRAM_address
mul r22, r10 ; Multiply Y0 by 4 y7y6y5y4y3y2y1y0 => .0.0.0.0.0.0y7y6:y5y4y3y2y1y0.0.0
movw r26, r0 ; move the 16 bit result into VRAM_Address
andi r26, 0b11100000 ; clear out the bits that are used for Xn .0.0.0.0.0.0y7y6:y5y4y3.0.0.0.0.0
mul r24, r11 ; Multiply X0 by 32 x7x6x5x4x3x2x1x0 => .0.0.0x7x6x5x4x3:x2x1x0.0.0.0.0.0
or r26, r1 ; OR X7..3 into low byte of VRAM_Address .0.0.0.0.0.0y7y6:y5y4y3x7x6x5x4x3
subi r27, hi8(-(vram)) ; Add the base address of VRAM .0.0.0.0.1.1y7y6:y5y4y3x7x6x5x4x3
.endm
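For readers following along in C, the offset arithmetic in the macro above can be written out roughly as follows. This is only a host-side sketch: vram_offset is an invented name, and the vram base address is left out because the macro adds only its high byte in the final subi.

```c
#include <stdint.h>

/* Hypothetical C rendering of the bit shuffling done by the macro.
 * Returns the offset into VRAM; the caller would add the vram base. */
static uint16_t vram_offset(uint8_t x0, uint8_t y0)
{
    /* mul by 4, then clear the low 5 bits (the andi on the low byte) */
    uint16_t y_part = (uint16_t)(((uint16_t)y0 * 4u) & 0xFFE0u);

    /* mul by 32 and keep only the high byte of the product: x0 / 8 */
    uint16_t x_part = (uint16_t)(((uint16_t)x0 * 32u) >> 8);

    /* Equivalent to (y0 / 8) * 32 + (x0 / 8) */
    return (uint16_t)(y_part | x_part);
}
```

The two multiplies look odd in C, but on the AVR each one replaces several single-bit shifts, which is why the macro keeps the constants 4 and 32 parked in registers.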
And any code that is almost a direct equivalent of something from C has the C code above it in comments.
/*
void renderObjects(void){
uint8_t i;
ObjectDescStruct *Current;
drawFunctionPointer_t drawFunction;
for(i = 0; i < MAX_OBJS; i++) {
Current = (ObjectDescStruct*)&ObjectStore[i];
if(Current->obType != OBJ_EMPTY) {
drawFunction = (drawFunctionPointer_t)pgm_read_word(&drawFunctionPointers[Current->obType]);
drawFunction(Current);
}
}
}
*/
renderObjects:
fast_line_entry_C ; save all the registers C does not want trashed
fast_line_entry ; set up all the registers for the draw line routines
ldi r28, lo8(ObjectStore) ; Get the base address of the ObjectStore[] array
ldi r29, hi8(ObjectStore)
renderObjectsLoop:
ld r30, Y ; Get the Object Type of the current object (r30 = ObjectStore[i].ObjType)
cpi r30, 0x00 ; If the object type is <zero>
breq renderObjectsSkip ; we don't need to draw anything
add r30, r30 ; Multiply the object number by 2 (flash addressing)
clr r31 ; and clear ZH (NB: Object Type can not be > 128 for this to work)
subi r30, lo8(-(drawFunctionPointers)) ; Add the base address to the function pointer table in PROGMEM
sbci r31, hi8(-(drawFunctionPointers))
lpm r26, Z+ ; Get the address of the code/routine that draws the given
lpm r27, Z+ ; object type into Register X
movw r30, r26 ; move that value into Z ready for the ICALL
icall ; Call the routine that draws the object
renderObjectsSkip:
adiw r28, 0x08 ; add 8 to Y to point to the next element of ObjectStore[]
cpi r29, 0x04 ; if the address of Y hits 0x0400 we are at the end of the object store
brne renderObjectsLoop ; else loop back
fast_line_exit_C ; restore the registers C needs to not be trashed
ret
It is a fairly famous game I am porting to the Uzebox and I suspect a few people to be interested in how I managed it.
xAdd = (Z * xAdd)>>8; // Work out how far the object is from the start of the lane
mulsu r21, r19
movw r24, r0
eor r1, r1
I'm actually pretty impressed. I thought that mixed-size arithmetic (multiplying two bytes to get 16 bits) was one of the last holdout advantages for assembler. (that, and direct use of the carry bit.)
Posted by andrewm1973: Sat. Apr 18, 2015 - 10:57 AM
sparrow2,
straight after the
movw r24, r0
it does
add r25, r19
std Y+3, r25
so there is no extra move to get the >>8.
The only real waste in those few lines is the extra unnecessary eor r1,r1 it has to do again a few clocks later.
BTW - I reinstalled 4.8.2, cleaned up my project and removed a lot of orphaned code and files and I could get the compiler to come up with the mulsu code above.
HOWEVER the compiler completely screwed up a pointer in another part of the code and had me chasing a phantom bug for 2 days.
Installed 4.9.2 20140912 and the compiler now gets the ASM correct. I suspect it was this bug that has been fixed as at 4.9.0
In 4.8.2 the code loaded a memory address for a pointer into X, did some additions to X to access members of the struct, then called another routine with the trashed version of X.
In 4.9.2 the address is loaded into r16, copied to X, access the members trashing X, reload X from r16, then call other routine.
Posted by andrewm1973: Sat. Apr 18, 2015 - 07:12 PM
sparrow2,
As shown in post #50 the next C statement straight after them is an =
xAdd = (Z * xAdd)>>8; // Work out how far the object is from the start of the lane
yAdd = (Z * yAdd)>>8;
//xAdd = MulSU(Z, xAdd);
//yAdd = MulSU(Z, yAdd);
//MulSU2(Z, xAdd, yAdd);
objData->x = x + xAdd; // Add the delta X/Y to the lane start position
objData->y = y + yAdd;
}
Now if I was optimizing the group of statements in ASM I just would have added R1 to R25. I was only looking at the single
xAdd = (xAdd * Z)>>8;
here though. My inline ASM version moved R1 somewhere first because it was an optimization of a single line. The C version using MOVW is the same speed as MOV. Just trashes R24 needlessly.
I know that if I did everything in ASM I would make faster code, but in this section of the game, where speed is not as critical and readability is, I accept a clock here and a clock there slower.
In the bits of code that need to be fast
renderObjects
drawXXXX
bresenhamLine
setPixel
which are called hierarchically - I have been able to globally optimize register usage and very aggressively inline stuff with assembly macros. Having the carry flag, and not having to comply with the ABI requirement that R24 always be used for arguments, has allowed me to make that stuff very fast.
Posted by andrewm1973: Thu. Apr 23, 2015 - 03:10 AM
I think this will be the last one. I have picked most of the low hanging fruit. In fact even with this one I am probably going to leave it in C. I am curious if there is a way to get C to use MUL or ignore the MSB.
I have a 16 bit pointer that is the address of a structure in an array. I need to go back to the index of the array
In effect I am doing
i = ( (uint16_t)(objectPointer) - (uint16_t)(&objectArray[0]) ) / sizeof(object);
Now I know
the result of this answer can be at most 59 (a 6 bit number)
The sizeof() the structure is 8 bytes
The base address of the array is 0x0220 (fixed address from a section)
Therefore the highest address the pointer could be is 0x03F8 (0x220 + 59*8)
Therefore the highest value the result of the /8 can be is 127 (a 7 bit number)
So I know if I do the divide by 8 first, I can just use the lowest 8 bits and minus 0x44 (0x0220 / 8 = 0x44) hence the most optimal C I can make is
i = ((uint8_t)((uint16_t)(ob2) / 8)) - 0x44;
which gets compiled to
ldi r24, 0x03 ; 3 Do the lsr/ror loop 3 times
lsr r23
ror r22
dec r24
brne .-8
subi r22, 0x44 ; 68
Which is 16 clocks.
In ASM I can
ldi r21, 32 ; value to multiply by
mul r23, r21
mov r22, r1 ; get the lowest 5 bits of result into destination
mul r24, r21
or r22, r0 ; or in the highest 2 useful bits into destination
eor r1, r1
subi r22, 0x44 ; 68
that keeps the full 8 bits of the result in 9 clocks OR because I know the answer is only 7 bits I can
lsr r23
ror r22
lsr r23
ror r22
lsr r22 ; last iteration does not need ror as result is 7 bits
subi r22, 0x44
which is the answer I need in this specific case in 6 clocks.
Any way of getting C to understand I only have 7 bits and the short LSR/ROR trick can be taken? Or even make it do the full 3 LSR/ROR (no BRNE.-8) and blow the code size out by 1 word and still get the answer in 7 clocks?
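The arithmetic behind the shortcut above can be checked on a host compiler with a sketch like the one below. It uses only the numbers quoted in the post (base 0x0220, 8-byte elements, highest index 59); the function names are invented, and it says nothing about what code avr-gcc will generate for either form.

```c
#include <stdint.h>

#define ARRAY_BASE 0x0220u  /* fixed section address quoted in the post */
#define OBJ_SIZE   8u       /* size of one structure in bytes */
#define MAX_INDEX  59u      /* highest valid index quoted in the post */

/* Straightforward form: subtract the base, then divide. */
static uint8_t index_full(uint16_t addr)
{
    return (uint8_t)((addr - ARRAY_BASE) / OBJ_SIZE);
}

/* Shortcut form: divide first, truncate to 8 bits, then subtract 0x44. */
static uint8_t index_short(uint16_t addr)
{
    return (uint8_t)((uint8_t)(addr / OBJ_SIZE) - (ARRAY_BASE / OBJ_SIZE));
}

/* Returns 1 if both forms agree for every valid element address. */
static int shortcut_matches(void)
{
    for (uint16_t i = 0; i <= MAX_INDEX; i++) {
        uint16_t addr = (uint16_t)(ARRAY_BASE + i * OBJ_SIZE);
        if (index_full(addr) != i || index_short(addr) != i) {
            return 0;
        }
    }
    return 1;
}
```

The shortcut holds because the base address is itself a multiple of 8, so dividing before subtracting loses nothing, and the quotient never exceeds 127.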
And I don't understand why the compiler makes the loop that way. If it used the carry flag for the branch instead, it would be the same size but save a clock for each loop, and use no register.
Posted by andrewm1973: Thu. Apr 23, 2015 - 10:59 PM
Sparrow2, -O3 does unroll it but I have to use -Os to fit all my game into the mega644. The difference between -O3 and -Os is about 4K in size saving. When I include the music assets in the game I have about 300 bytes left.
I have just been reading some more about GCC and optimizing. Looks like I have to put some of these functions in a separate C file (fast.c) that I can run -O3 on, but leave the non critical stuff in slow.c and then link them. There seems to be no way to just ask the compiler to switch optimize levels around bits of code.
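For what it is worth, GCC does document a per-function optimize attribute (and a matching #pragma GCC optimize) that can raise the level for individual functions inside an -Os file. The GCC manual cautions that the attribute is meant for debugging rather than production code, so treat the following as a sketch to experiment with, not a replacement for the separate-file approach. sum_bytes is a made-up stand-in for a hot function, and the HOT_O3 macro name is likewise invented.

```c
#include <stdint.h>

/* The attribute is GCC-specific, so it is hidden from other compilers. */
#if defined(__GNUC__) && !defined(__clang__)
#define HOT_O3 __attribute__((optimize("O3")))
#else
#define HOT_O3
#endif

/* Hypothetical hot function: asks GCC to optimise just this function
 * at -O3 even when the rest of the file is built with -Os. */
HOT_O3
static uint16_t sum_bytes(const uint8_t *data, uint8_t count)
{
    uint16_t total = 0;
    for (uint8_t i = 0; i < count; i++) {
        total = (uint16_t)(total + data[i]);
    }
    return total;
}
```

Checking the .lss output is still the only way to know whether the attribute changed anything for a given function.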
Posted by andrewm1973: Thu. Apr 23, 2015 - 11:04 PM(Reply to #72)
sparrow2 wrote:
And I don't understand why the compiler makes the loop that way. If it used the carry flag for the branch instead, it would be the same size but save a clock for each loop, and use no register.
call spibyte_ff
push r24
call spibyte_ff
pop r25
ret
Technically in the ASM version I don't need to push/pop as I know R25 is untouched by spibyte_ff, but I am keeping it C/ABI compatible and assuming R25 could get trashed.
I'm using GCC 4.9.2 with -Os. I have an almost full Mega644 and am going through C code and trying to save an extra few K to fit in another song now.
Just wanted to check here that there was not something else I didn't know about to help. I have learned a LOT about how the C compiler thinks here, but I still know very little.
Posted by david.prentice: Sat. Aug 1, 2015 - 02:05 PM
I'm using GCC 4.9.2 with -Os. I have an almost full Mega644 and am going through C code and trying to save an extra few K to fit in another song now.
That seems pretty sensible. You can always change to an ATmega1284 if you run out of space.
I have no idea what your "song" might be. Any .WAV or .MP3 is going to use more flash than any AVR could supply.
Adding a microSD or even an external Flash memory will provide extra memory.
I can quite understand your desire to minimise program size. I was expecting you to be struggling with a 2kB or 8kB device that had no bigger brothers.
Yes, you can save a few bytes by altering some function usage. The general approach is:
1. list functions by size. look at the greedy ones first.
2. use appropriate width of variables and scope.
3. avoid f-p maths and printf() functions.
4. use smaller algorithms.
I suspect that those 4 points would give you a significant saving. Minor tweaks to one function call will only save tens of bytes.
Posted by andrewm1973: Sat. Aug 1, 2015 - 10:09 PM(Reply to #79)
christop wrote:
Note that code like this has undefined (or implementation defined at best) behavior:
return( mmc_get_byte() + mmc_get_byte()<<8 );
C doesn't specify the order in which the functions are called so the compiler might call the second function first.
Thanks for the reminder. I have read this before.
sparrow2 wrote:
I would still try with a local variable, and spread it over 2 lines.
that way you should at least avoid the push and pop.
OK - just tried splitting it to two lines. Slightly different result, but same size.
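A sketch of the two-statement form, for reference. The mmc_get_byte stub below is a host-side stand-in that replays two fixed bytes, and the low-byte-first order is an assumption. Note also that + binds tighter than <<, so the one-line expression quoted above shifts the sum of both bytes rather than just the second byte.

```c
#include <stdint.h>

/* Stand-in for the real SD card read: replays a fixed byte sequence. */
static const uint8_t fake_card[] = { 0x12, 0x34 };
static uint8_t fake_pos;

static uint8_t mmc_get_byte(void)
{
    return fake_card[fake_pos++ % sizeof fake_card];
}

/* Each read is its own statement, so the call order is fixed by the
 * language rather than left to the compiler. */
static uint16_t mmc_get_int(void)
{
    uint8_t lo = mmc_get_byte();   /* first byte from the card  */
    uint8_t hi = mmc_get_byte();   /* second byte from the card */

    return (uint16_t)(lo | ((uint16_t)hi << 8));
}
```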
david.prentice wrote:
You can always change to an ATmega1284 if you run out of space.
Sadly no. It is a game for the Uzebox. The user base already all have Mega644s. Squeezing the last drop of blood from the stone is the only option.
david.prentice wrote:
I have no idea what your "song" might be. Any .WAV or .MP3 is going to use more flash than any AVR could supply.
Adding a microSD or even an external Flash memory will provide extra memory.
It is a "MOD" like song with the notes being synthesised rather than being WAVs. There is only one WAV which is sampled speech of the word "Play"
I think I have moved all the "graphics assets" I can to SD card. Some of them can not be moved to SD card - for example the message saying "SD card not found" :)
david.prentice wrote:
1. list functions by size. look at the greedy ones first.
Done. Partially at least. The greediest functions are giant 12,000 line long unrolled ASM functions, but they exist so the game can be done at all.
david.prentice wrote:
2. use appropriate width of variables and scope.
I think I have been programming space constrained systems long enough to have this one down pat :)
david.prentice wrote:
avoid f-p maths and printf() functions.
All fixed point and integer, using as few bits as can get the task done.
No printf or atoi() stuff. My font table is even non-ascii to save time/space printing my BCD and HEX values to screen.
david.prentice wrote:
Minor tweaks to one function call will only save tens of bytes.
Have all the usual major tweaks you mention in hand.
So far with the minor tweaks like this one - I have saved 400ish bytes. That is one third of the way to being able to include a 3rd song in the game.
Uze is trying to see if he can modify the kernel to make the sampled speech 4 bits rather than 8 bits. That would save another 800 bytes.
So if I keep plodding along I might get there.
For reference my memory budget is
4096 Bootloader (I can't touch)
16436 the 12 thousand lines of ASM code that draws pixels to the screen
7137 Music assets including the sampled speech
4918 Fast line drawing routines and vector objects
Posted by andrewm1973: Sat. Aug 1, 2015 - 10:58 PM(Reply to #84)
Bootloader is written by someone else and contains
Video and sound generation (NTSC + mono 15 kHz)
SD Card and FatFS reading of games stored on SD card
Menu system to select which new game to play
Flash programming stuff as normal for a bootloader
The EEPROM is shared space that all games written for the Uzebox have to share. It is "formatted" and arbitrated by the "kernel" code you have to include/link to your own game.
I myself use 64 bytes of the EEPROM for a highscore table. It's generally considered poor form to use more than 32 bytes, but my game is impressive enough I think it warrants 2 "blocks" :)
Posted by andrewm1973: Sat. Aug 1, 2015 - 11:09 PM
OK - this one has just moved to a (dot)s file. Using the knowledge that mmc_get_byte only touches r24, r30, r31 AND the fact that GetInt is a subset of GetLong let me shrink it a bit.
.section .text.mmcGetLong
mmcGetLong:
rcall mmc_get_byte ; First byte from SD card straight to R22
mov r22, r24
rcall mmc_get_byte ; Second byte from SD card straight to R23
mov r23, r24
; Fall through to GetInt: to receive 3rd and 4th bytes
.section .text.mmcGetInt
mmcGetInt:
rcall mmc_get_byte ; First byte from SD card to temp location in R20 (3rd byte of GetLong)
mov r20, r24
rcall mmc_get_byte ; Second byte from SD card to temp location in R21 (4th byte of GetLong)
mov r21, r24
movw r24, r20 ; Move R20:21 to the R24:25 location C is expecting it
ret
.section .text.mmcGetChar ; Simple fall through to get_byte
mmcGetChar:
.section .text.mmc_get_byte
mmc_get_byte:
rcall spibyte_ff
.
.
.
It is only one push/pop behind the ASM (considering getInt alone)
Does anyone know if there is a way to hint the compiler that "getByte" only touches r24 so it can avoid the push/pop and just use some register pair like R22:23 ?
Posted by david.prentice: Sun. Aug 2, 2015 - 10:07 PM
Looking at your "budget", I would start by optimising the 16kB of ASM. That is clearly the greediest part.
Quite honestly, there is little point in optimising mmcGetInt(void). It is the calling sequence that matters. e.g. RCALL mmcGetInt / MOV r24 / MOV r25 i.e. 3 words.
Even that could be reduced to 1 word if the subsequent code "knows" it is going to use r24, r25.
Even if the actual mmcGetInt is in an external file (so the linker wants CALL) you can put in a local trampoline.
You are an experienced ASM programmer. You must know all the standard techniques for reducing code. You also know that 95% of code space is not speed critical. It is only the 5% that is used for 90% of the time e.g. inside loops, common subroutines.
Posted by andrewm1973: Sun. Aug 2, 2015 - 10:48 PM
david.prentice wrote:
I would start by optimising the 16kB of ASM. That is clearly the greediest part.
The 16K is not optimizable at all and the whole game hinges on it. Without it there is no game and the rest of the exercise is pointless. I am willing to be proven wrong here if anyone thinks they can save any space on it, but I think that 16K can't have a single BYTE saved from it.
david.prentice wrote:
Quite honestly, there is little point in optimising mmcGetInt(void).
I know any individual routine is never going to save heaps of space. But when you have done ALL the rest you can think of you have to try them.
I have done a lot of optimizing so far that I have not posted questions here about because I figured them out myself. So far I have now saved 600 bytes from optimizing all the little things like GetInt. They ADD UP.
I managed to reduce getInt and getLong from 106 bytes (first unoptimized C) down to 20 bytes in ASM. (My best C got down to 46 bytes I think)
The other nice thing about my "optimizing the little things" like the SD card reading functions here, is that when I push all my little things back into the "kernel" I will save space on everyone else's Uzebox game that is written in the future.
david.prentice wrote:
You also know that 95% of code space is not speed critical.
This is actually an unusual case in that more than 50% of my flash space is taken up with speed critical things.
Posted by andrewm1973: Wed. Aug 5, 2015 - 11:09 PM
And no one picked me up on the obvious saving of another 4 bytes I missed. (I noticed it myself when I was cleaning up comments)
.section .text.mmcGetLong
mmcGetLong:
rcall mmcGetInt ; First two bytes from SD card and move to R22:23
movw r22, r24
; Fall through to GetInt to receive 3rd and 4th bytes
.section .text.mmcGetInt
mmcGetInt:
rcall mmc_get_byte ; First byte from SD card to temp location in R20 (3rd byte of GetLong)
mov r20, r24
rcall mmc_get_byte ; Second byte from SD card to temp location in R21 (4th byte of GetLong)
mov r21, r24
movw r24, r20 ; Move R20:21 to the R24:25 location C is expecting it
ret
Posted by andrewm1973: Wed. Aug 5, 2015 - 11:39 PM
Thanks joey,
I am pretty sure that 4 bit will sound fine for the sample as it is played at the same time as 3 other channels of 8 bit generated sound in the music.
The problem for Uze will be if he can fit the 3 note channels, the one noise channel and the 4 bit WAV channel into the 130 clocks that HSync runs for.
Should be easily done in assembler. Currently you have:
8-bit:
LPM: 3 cycles
You'll want:
4-bit:
low nibble:
LPM: 3 cycles
MOV: 1 cycle
AND: 1 cycle
high nibble:
SWAP: 1 cycle
AND: 1 cycle
You'd also need to track which nibble you're working on, so some extra cycles would be spent setting, testing, and clearing a flag somewhere. I assume all the low-hanging fruit like GPIOR0 are already spoken for, but perhaps there's something else available.
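In C terms the unpacking step above looks something like the following. This is a host-side sketch only: sample_4bit is an invented name, the low-nibble-first order follows the cycle listing above, and a real kernel version would read from flash with LPM and keep the nibble flag in a register or GPIOR.

```c
#include <stdint.h>

/* Hypothetical accessor: returns 4-bit sample number n from a buffer
 * that packs two samples per byte, low nibble first. */
static uint8_t sample_4bit(const uint8_t *packed, uint16_t n)
{
    uint8_t byte = packed[n >> 1];   /* one fetch serves two samples */

    if (n & 1u) {
        byte >>= 4;                  /* odd sample: high nibble (SWAP on AVR) */
    }
    return (uint8_t)(byte & 0x0Fu);  /* mask down to 4 bits (AND on AVR) */
}
```

The low bit of n plays the role of the "which nibble" flag mentioned above.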
"Experience is what enables you to recognise a mistake the second time you make it."
"Good judgement comes from experience. Experience comes from bad judgement."
"Wisdom is always wont to arrive late, and to be a little approximate on first possession."
"When you hear hoofbeats, think horses, not unicorns."
"Fast. Cheap. Good. Pick two."
"We see a lot of arses on handlebars around here." - [J Ekdahl]
Posted by andrewm1973: Sat. Apr 21, 2018 - 09:53 PM
<Thread Necromancy>
Hi again all,
Need help giving the compiler hints again that are human friendly.
The following code compiles and works fine. Just a bit slow. As I have said previously I like to keep code in C rather than moving it to ASM if I can keep performance within 5% or so. Unless of course the C is more confusing and less readable than the ASM version.
typedef struct {
uint8_t next;
uint8_t col;
uint8_t y1;
uint8_t x1Lo;
uint8_t x1;
int16_t m;
uint8_t y2;
} GradLineStruct;
void Add_Line(uint8_t col, uint8_t y1, uint8_t x1, int16_t m, uint8_t y2){
if (linkNext == 255) { return; } // If we have run out of space to add new lines then fail
if (linkNext != 0) { gradLines[linkNext].next = linkNext + 1; }; // If we are pointing to array index[0] then don't update "next"
// index[1] is the first valid entry that can be used.
linkNext++; // Update the "Next" to point to the current index (it was pointing to last)
gradLines[linkNext].next = 0; // Insert the values
gradLines[linkNext].col = col;
gradLines[linkNext].y1 = y1;
gradLines[linkNext].x1 = x1;
gradLines[linkNext].x1Lo = 0;
gradLines[linkNext].m = m;
gradLines[linkNext].y2 = y2;
}
And the LSS is
void Add_Line(uint8_t col, uint8_t y1, uint8_t x1, int16_t m, uint8_t y2){
2a18: 0f 93 push r16
if (linkNext == 255) { return; } // If we have run out of space to add new lines then fail
2a1a: a0 91 c7 01 lds r26, 0x01C7 ; 0x8001c7 <linkNext>
2a1e: af 3f cpi r26, 0xFF ; 255
2a20: d9 f0 breq .+54 ; 0x2a58 <Add_Line+0x40>
2a22: e1 e0 ldi r30, 0x01 ; 1
2a24: ea 0f add r30, r26
if (linkNext != 0) { gradLines[linkNext].next = linkNext + 1; }; // If we are pointing to array index[0] then don't update "next"
2a26: aa 23 and r26, r26
2a28: 39 f0 breq .+14 ; 0x2a38 <Add_Line+0x20>
2a2a: 98 e0 ldi r25, 0x08 ; 8
2a2c: a9 9f mul r26, r25
2a2e: d0 01 movw r26, r0
2a30: 11 24 eor r1, r1
2a32: a0 50 subi r26, 0x00 ; 0
2a34: bc 4f sbci r27, 0xFC ; 252
2a36: ec 93 st X, r30
// index[1] is the first valid entry that can be used.
linkNext++; // Update the "Next" to point to the current index (it was pointing to last)
2a38: e0 93 c7 01 sts 0x01C7, r30 ; 0x8001c7 <linkNext>
gradLines[linkNext].next = 0; // Insert the values
2a3c: 98 e0 ldi r25, 0x08 ; 8
2a3e: e9 9f mul r30, r25
2a40: f0 01 movw r30, r0
2a42: 11 24 eor r1, r1
2a44: e0 50 subi r30, 0x00 ; 0
2a46: fc 4f sbci r31, 0xFC ; 252
2a48: 10 82 st Z, r1
gradLines[linkNext].col = col;
2a4a: 81 83 std Z+1, r24 ; 0x01
gradLines[linkNext].y1 = y1;
2a4c: 62 83 std Z+2, r22 ; 0x02
gradLines[linkNext].x1 = x1;
2a4e: 44 83 std Z+4, r20 ; 0x04
gradLines[linkNext].x1Lo = 0;
2a50: 13 82 std Z+3, r1 ; 0x03
gradLines[linkNext].m = m;
2a52: 36 83 std Z+6, r19 ; 0x06
2a54: 25 83 std Z+5, r18 ; 0x05
gradLines[linkNext].y2 = y2;
2a56: 07 83 std Z+7, r16 ; 0x07
}
2a58: 0f 91 pop r16
2a5a: 08 95 ret
I have no idea WHY the PUSH R16 at the start of the routine and the POP R16 at the end OR the superfluous SUBI R30, 0x00 and EOR R1,R1 but that's not the question here.
If you look at the C code - you can see the two array access locations are sequential. I have been able to trick the compiler into using the larger indexes to access the second one like this
typedef struct {
uint8_t next;
uint8_t col;
uint8_t y1;
uint8_t x1Lo;
uint8_t x1;
int16_t m;
uint8_t y2;
} GradLineStruct;
typedef struct {
uint8_t last;
uint8_t dummy1;
uint8_t dummy2;
uint8_t dummy3;
uint8_t dummy4;
int16_t dummy5;
uint8_t dummy6;
uint8_t next;
uint8_t col;
uint8_t y1;
uint8_t x1Lo;
uint8_t x1;
int16_t m;
uint8_t y2;
} GradLineStructX2;
void Add_Line(uint8_t col, uint8_t y1, uint8_t x1, int16_t m, uint8_t y2){
GradLineStructX2 *KludgePointer;
if (linkNext == 255) { return; } // If we have run out of space to add new lines then fail
KludgePointer = (GradLineStructX2*)&gradLines[linkNext];
if (linkNext != 0) { KludgePointer->last = linkNext + 1; }; // If we are pointing to array index[0] then don't update "next"
// index[1] is the first valid entry that can be used.
linkNext++; // Update the "Next" to point to the current index (it was pointing to last)
KludgePointer->next = 0; // Insert the values
KludgePointer->col = col;
KludgePointer->y1 = y1;
KludgePointer->x1 = x1;
KludgePointer->x1Lo = 0;
KludgePointer->m = m;
KludgePointer->y2 = y2;
}
Which compiles to this
void Add_Line(uint8_t col, uint8_t y1, uint8_t x1, int16_t m, uint8_t y2){
2a18: 0f 93 push r16
2a1a: d9 01 movw r26, r18
GradLineStructX2 *KludgePointer;
if (linkNext == 255) { return; } // If we have run out of space to add new lines then fail
2a1c: 90 91 c7 01 lds r25, 0x01C7 ; 0x8001c7 <linkNext>
2a20: 9f 3f cpi r25, 0xFF ; 255
2a22: a1 f0 breq .+40 ; 0x2a4c <Add_Line+0x34>
KludgePointer = (GradLineStructX2*)&gradLines[linkNext];
2a24: 28 e0 ldi r18, 0x08 ; 8
2a26: 92 9f mul r25, r18
2a28: f0 01 movw r30, r0
2a2a: 11 24 eor r1, r1
2a2c: e0 50 subi r30, 0x00 ; 0
2a2e: fc 4f sbci r31, 0xFC ; 252
2a30: 31 e0 ldi r19, 0x01 ; 1
2a32: 39 0f add r19, r25
if (linkNext != 0) { KludgePointer->last = linkNext + 1; }; // If we are pointing to array index[0] then don't update "next"
2a34: 91 11 cpse r25, r1
2a36: 30 83 st Z, r19
// index[1] is the first valid entry that can be used.
linkNext++; // Update the "Next" to point to the current index (it was pointing to last)
2a38: 30 93 c7 01 sts 0x01C7, r19 ; 0x8001c7 <linkNext>
KludgePointer->next = 0; // Insert the values
2a3c: 10 86 std Z+8, r1 ; 0x08
KludgePointer->col = col;
2a3e: 81 87 std Z+9, r24 ; 0x09
KludgePointer->y1 = y1;
2a40: 62 87 std Z+10, r22 ; 0x0a
KludgePointer->x1 = x1;
2a42: 44 87 std Z+12, r20 ; 0x0c
KludgePointer->x1Lo = 0;
2a44: 13 86 std Z+11, r1 ; 0x0b
KludgePointer->m = m;
2a46: b6 87 std Z+14, r27 ; 0x0e
2a48: a5 87 std Z+13, r26 ; 0x0d
KludgePointer->y2 = y2;
2a4a: 07 87 std Z+15, r16 ; 0x0f
}
2a4c: 0f 91 pop r16
2a4e: 08 95 ret
Now it still has the superfluous PUSH/POP/SUBI - but has gotten down in size by 10%.
But it does look kludgy.
Can a Compiler+ASM guru show me a better way to make the 2nd array access not do the expensive re-calculate of the array
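One less kludgy variant to try is to take the element address once and step the pointer, instead of indexing the array twice. The sketch below assumes gradLines has 256 entries and that linkNext is a plain uint8_t; both declarations are guesses. Whether avr-gcc then reuses Z with the larger displacements is not guaranteed, so the .lss still needs checking.

```c
#include <stdint.h>

typedef struct {
    uint8_t next;
    uint8_t col;
    uint8_t y1;
    uint8_t x1Lo;
    uint8_t x1;
    int16_t m;
    uint8_t y2;
} GradLineStruct;

/* Assumed declarations; the real ones live elsewhere in the project. */
static GradLineStruct gradLines[256];
static uint8_t linkNext;

void Add_Line(uint8_t col, uint8_t y1, uint8_t x1, int16_t m, uint8_t y2)
{
    uint8_t cur = linkNext;
    if (cur == 255) { return; }           /* out of space */

    GradLineStruct *p = &gradLines[cur];  /* address computed once */
    uint8_t next = (uint8_t)(cur + 1);

    if (cur != 0) { p->next = next; }     /* index[0] is never linked */
    linkNext = next;

    p++;                                  /* now points at gradLines[cur + 1] */
    p->next = 0;
    p->col  = col;
    p->y1   = y1;
    p->x1   = x1;
    p->x1Lo = 0;
    p->m    = m;
    p->y2   = y2;
}
```

This keeps the single struct definition and expresses "the next element" as ordinary pointer arithmetic, so no second overlapping struct is needed.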
Posted by andrewm1973: Sun. Apr 22, 2018 - 01:29 AM
BTW - here is my ASM version
; void Add_Line(uint8_t col, uint8_t y1, uint8_t x1, int16_t m, uint8_t y2){
;
; if (linkNext == 255) { return; } // If we have run out of space to add new lines then fail
;
; if (linkNext != 0) { gradLines[linkNext].next = linkNext + 1; }; // If we are pointing to array index[0] then don't update "next"
; // index[1] is the first valid entry that can be used.
;
; linkNext++; // Update the "Next" to point to the current index (it was pointing to last)
;
; gradLines[linkNext].next = 0; // Insert the values
; gradLines[linkNext].col = col;
; gradLines[linkNext].y1 = y1;
; gradLines[linkNext].x1 = x1;
; gradLines[linkNext].x1Lo = 0;
; gradLines[linkNext].m = m;
; gradLines[linkNext].y2 = y2;
; }
;
; Notes:
; The "if (linkNext == 255) { return; }" test is done quite late. After address conversion and linkNext++. This makes that
; case slower than could be if the comparison was done early. This however would make the more common and important case
; when linkNext != 255 one clock slower. You don't really care how slow the ==255 case is as you have run out of memory to
; add more lines at this stage anyway.
AddLine:
lds r23, linkNext ; Fetch the global variable "linkNext" that is the current head pointer to gradient_lines list
ldi r21, 0x08
mul r23, r21 ; Multiply "linkNext" by sizeof() the structure for the address conversion step
movw R30, R0 ; Move the result of the MUL into Z early so we can restore r1 to <zero>
eor r1, r1 ; r1 = 0
inc r23 ; INC "linkNext". The result of this is to be stored AND used for the ==255 check
breq Add_Line_Exit ; If "linkNext" was ==255 prior to the above INC then the result would be ZERO so we return()
sts linkNext, r23 ; Save the value of the global variable "linkNext"
subi r31, 0xFC ; Finish off the array address conversion by adding the base address of the array
; only the high byte needs to be added as the array is 256 byte aligned.
ldi r21, 0x01
cpse r23, r21 ; Compare "linkNext + 1" to 0x01. If this is true then the value of "linkNext" before the
; inc was 0x00 and we should skip over the next instruction
st Z, r23 ; array[linkNext].next = linkNext + 1
std Z+8, r1 ; array[linkNext + 1].next = 0
std Z+9, r24 ; array[linkNext + 1].col = col
std Z+10, r22 ; array[linkNext + 1].y1 = y1
std Z+11, r1 ; array[linkNext + 1].x1lo = 0
std Z+12, r20 ; array[linkNext + 1].x1 = x1
std Z+13, r18 ; array[linkNext + 1].m = m
std Z+14, r19 ; 2nd byte of 16 bit value m
std Z+15, r16 ; array[linkNext + 1].y2 = y2
Add_Line_Exit:
ret
Posted by andrewm1973: Sun. Apr 22, 2018 - 08:23 AM
sparrow2 - thanks for the reply.
This function is the most likely delineation between what would be considered "user code" and "Kernel code"
At a lower level than this I have code that sorts the polygons into a Y sorted list, converts that list of lines in the form Y= mX+c, uses that Y sorted list to create a Run Length Encoded RAM buffer while doing some interrupt ended code that "races the beam"
I want to hide all that incredibly complex stuff from anyone wanting to use this RLE mode for writing their own game.
So "AddPolyLine" is the point at which I stop being responsible for data structures and how the code works and how the user wants to write their code. Think of it as an API call.
I have no control, nor do I want to impose upon the user, how they are going to store/treat col, y1, x1, m and y2.
The video mode itself is not as amazing as T2K, but has the potential to be the 2nd most amazing video output code done on the AVR if I have not made any mistakes on my spreadsheet.
(Help with optimizing stuff for T2K was probably how this thread started BTW)
Posted by andrewm1973: Sun. Apr 22, 2018 - 10:41 AM
Clawson,
I am well aware of call-saved and call-clobbered.
R16 is not clobbered at all. It is passed in. Pushed to the stack. Used in a store and then popped off the stack. A pointless waste of a push and pop because it is not altered between the push and pop.
Kind of like the first EOR r1,r1 is pointless because r1 is not used as <zero> between then and its next trashing by the second MUL.
So, r16 gets the y2 variable due to the calling convention. This variable is not written to, but still the compiler sees fit to save it.
Sure, the compiler is being dumb here; nothing I haven't seen before.
I don't see any elegant way to solve this. Just an inelegant way, which is to fuse 2 uint8_t variables into a uint16_t, so that r16 is not used, y2 will go to r18 instead which is not call saved.
Ugly, right? But maybe in this case the compiler will be able to avoid the push/pop.
edit: you can make it less ugly with a macro, I guess.
#define ADD_LINE(col,y1,x1,m,y2) Add_Line((col), ((uint16_t)(y1) << 8) | (x1), (m), (y2))
edit 2: and when I say the compiler is being dumb, I don't intend to aim it at the developers. GCC is immensely complex and sometimes, there is no easy way to fix things like this. In fact, those guys are heroes in my opinion, I fear the day when SprinterSB can/will no longer develop avr-gcc...
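A minimal sketch of the packing idea, assuming a hypothetical `Add_Line_Packed` that receives `y1` and `x1` fused into one `uint16_t`. The `line_args_t` record and the function body are placeholders (only `Add_Line` and the argument names come from the thread), and `m` is assumed to be a signed 16-bit value. Whether avr-gcc then drops the push/pop of r16 has not been checked here.

```c
#include <stdint.h>

/* Hypothetical record; stands in for whatever Add_Line really stores. */
typedef struct {
    uint8_t col, y1, x1, y2;
    int16_t m;
} line_args_t;

static line_args_t last_line;   /* placeholder destination */

/* y1 travels in the high byte of y1x1, x1 in the low byte. */
void Add_Line_Packed(uint8_t col, uint16_t y1x1, int16_t m, uint8_t y2)
{
    last_line.col = col;
    last_line.y1  = (uint8_t)(y1x1 >> 8);
    last_line.x1  = (uint8_t)(y1x1 & 0xFF);
    last_line.m   = m;
    last_line.y2  = y2;
}

/* Fully parenthesised so arguments such as a+1 still pack correctly. */
#define ADD_LINE(col, y1, x1, m, y2) \
    Add_Line_Packed((col), \
        (uint16_t)(((uint16_t)(uint8_t)(y1) << 8) | (uint8_t)(x1)), \
        (m), (y2))
```

With four arguments of this shape, the last one should land in r18 under the avr-gcc calling convention, which is the point of the exercise.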
As you are fussing with this construct and cycle-counting, have you tried 8 bits times 8 bits to get a 16-bit result -- and then just force taking the high byte with a cast or union? Perhaps you can trick the compiler this way. (Actually, I call it "giving the compiler a hint" and imply your ASM mindset.)
You can put lipstick on a pig, but it is still a pig.
I've never met a pig I didn't like, as long as you have some salt and pepper.
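For illustration, the "8 bits times 8 bits, keep the high byte" suggestion could be written either with a shift or with a union. This is only a sketch: the function names are made up, a signed-by-unsigned multiply is assumed (matching the mulsu discussion), and the union version assumes a little-endian target such as AVR. Whether a given avr-gcc version turns either form into a single mulsu is something to check in the listing.

```c
#include <stdint.h>

/* Signed 8-bit times unsigned 8-bit, keeping only the high byte of the
   16-bit product (the product divided by 256, rounded down). */
static inline int8_t mul_hi_shift(int8_t a, uint8_t b)
{
    int16_t product = (int16_t)a * b;
    /* Right-shifting a negative value is implementation-defined in C;
       GCC performs an arithmetic shift. */
    return (int8_t)(product >> 8);
}

/* Same result via a union: read the high byte directly.
   Assumes a little-endian target (true for AVR). */
static inline int8_t mul_hi_union(int8_t a, uint8_t b)
{
    union {
        int16_t word;
        struct { uint8_t lo; int8_t hi; } bytes;
    } u;
    u.word = (int16_t)a * b;
    return u.bytes.hi;
}
```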
- Log in or register to post comments
#define zbDsblInt() \
({ \
    zb_CPU_stat_t zb_flag; \
    __asm__ __volatile__ \
    ( \
        "in %0, 0x3f" "\n\t" \
        "cli" "\n\t" \
        : "=r" (zb_flag) \
        : \
        : "memory" \
    ); \
    zb_flag; \
})
or
static inline uint8_t zbDsblInt(void) __attribute__((always_inline));
static inline uint8_t zbDsblInt(void)
{
    uint8_t zb_flag;
    __asm__ __volatile__
    (
        "in %0, 0x3f" "\n\t" /* read SREG */
        "cli" "\n\t" /* disable interrupts */
        : "=r" (zb_flag)
        :
        : "memory"
    );
    return zb_flag; /* saved SREG, so the caller can restore it later */
}
Don Kinzer
ZBasic Microcontrollers
http://www.zbasic.net
- Log in or register to post comments
Just for fun, I poked mul() into a CodeVision test program as an inline function.
The results came fairly close -- except for the call to the utility routine to do the signed/unsigned mixed multiply adjustment.
[I had to use CV's "Promote char to int?" option. ;) ]
Now, if you go through the possibilities and don't force it to do the adjustment, then it might come out "clean"?
- Log in or register to post comments
I copied your latest code to a standalone C source file (with a few missing definitions added to make it compile). That code is at http://pastebin.com/DKin9rVS.
Try as I might, I couldn't get avr-gcc to produce the longer multiplication code sequence with anything but -O0. -Os/-O1/-O2/-O3 all produce only two mulsu instructions, one for each multiplication.
Can you copy and paste the exact avr-gcc command that is used to compile your code? I suspect that this source file is not getting the right optimization flag for some reason.
- Log in or register to post comments
Theusch, it only does the crazy long thing when one of the arguments is unsigned.
Christop, thanks - I will download your example from pastebin and investigate my makefile to see if I am breaking it there somehow.
- Log in or register to post comments
Thanks Christop, it's definitely something in the makefile or environment for that project.
In a totally clean environment I get this
Which is only one clock behind the inline ASM for the same thing - close enough for this part of the code that it will stay as C.
I have looked further and nowhere in my code that is doing similar fixed math is FMUL being used. However in the kernel code FMUL is being used. They are both being compiled with the same flags.
In fact in a clean environment even -O0 gives the almost optimal code.
The only difference I can see with my code and the kernel is that mine is chock full of things that look like this.
__attribute__ ((section (".renderlinesa")))
__attribute__ ((section (".freeflash")))
__attribute__ ((section (".renderlinesb")))
They are both using
that came with MHVTools.
I'll get back here once I solve the mystery.
- Log in or register to post comments
Straight into another question.
I have a table in flash that is aligned to a 256 byte boundary for speed of access in my ASM
and the ASM for sin() and cos() looks like this
I don't think this one is likely - but is there a way to make the C code not do
It's the LDI/SUBI/SBCI that the alignment was intended to get rid of.
The compiler KNOWS the table is on a 256-byte boundary; can you convince it that it does not need to subtract ZERO from the address?
- Log in or register to post comments
Don't expect C to make much better code.
But perhaps try a pointer that points to SinCosTable and add the index (perhaps place it in a register pair).
But why do you want to change it to C if it's already running in ASM?
If you are really hunting speed, then make some inline code where speed really matters, to save the call/ret.
In ASM, change to use IJMP (I know that might change the structure away from what C uses, and I guess that your ASM does as well).
- Log in or register to post comments
It's a fairly big game project for the Uzebox. I am trying to get as much of the non-video-rendering stuff into C as possible so that other people can read and learn from it.
The section of the game that renders all the video to screen and draws all the lines is in ASM as C won't be able to do that.
However the sections of the game that do
Object management
Object movement
Collision detection
Game logic
Are not so time critical and are mostly in C already.
The Sin/Cos thing is called often enough that I will code it as inline ASM though. The inline ASM thing is not as intimidating as it first looked now I have bothered to work it out.
- Log in or register to post comments
If you write good documentation of how it works, I would prefer clean ASM to some C with all kinds of compiler "hacks".
- Log in or register to post comments
Yes - anything that looks uglier in C to make it optimal won't be used.
For example the above code where #defining an inline ASM macro only saved one clock over
is out. But the inline ASM to replace
is in. Because not only is it a lot of clocks, it also is called quite often.
Also any ASM code is pretty well explained when people do want to look at it - as an example
And any code that is almost a direct equivalent of something from C has the C code above it in comments.
It is a fairly famous game I am porting to the Uzebox and I suspect a few people to be interested in how I managed it.
- Log in or register to post comments
I'm actually pretty impressed. I thought that mixed-size arithmetic (multiplying two bytes to get 16 bits) was one of the last holdout advantages for assembler. (That, and direct use of the carry bit.)
- Log in or register to post comments
But the biggest problem: you never know when the compiler will change the code to something bigger!
And remember, for the compiler you haven't done the >>8 yet! (It should have been something like mov r24,r1.) I guess it will add: mov r24,r25
- Log in or register to post comments
sparrow2,
straight after the
it does
so there is no extra move to get the >>8.
The only real waste in those few lines is the extra unnecessary eor r1,r1 it has to do again a few clocks later.
BTW - I reinstalled 4.8.2, cleaned up my project and removed a lot of orphaned code and files and I could get the compiler to come up with the mulsu code above.
HOWEVER the compiler completely screwed up a pointer in another part of the code and had me chasing a phantom bug for 2 days.
Installed 4.9.2 20140912 and the compiler now gets the ASM correct. I suspect it was this bug, which has been fixed as of 4.9.0:
https://gcc.gnu.org/bugzilla/sho...
In 4.8.2 the code loaded a memory address for a pointer into X, did some additions to X to access members of the struct, then called another routine with the trashed version of X.
In 4.9.2 the address is loaded into r16, copied to X, access the members trashing X, reload X from r16, then call other routine.
- Log in or register to post comments
I'm not sure that I get the whole thing, but as I see it you don't need the movw r24, r0!
In ASM it should be done directly by storing r1 (and I don't get the add r25,r19; it's not a +=).
- Log in or register to post comments
sparrow2,
As shown in post #50 the next C statement straight after them is an =
Now if I was optimizing the group of statements in ASM I just would have added R1 to R25. I was only looking at the single
here though. My inline ASM version moved R1 somewhere first because it was an optimization of a single line. The C version using MOVW is the same speed as MOV. Just trashes R24 needlessly.
I know that if I did everything in ASM I would make faster code, but in this section of the game, where speed is not as critical and readability is, I accept a clock here and a clock there slower.
In the bits of code that need to be fast
which are called hierarchically - I have been able to globally optimize register usage and very aggressively inline stuff with assembly macros. Having the carry flag, and not having to comply with the ABI requirement that R24 always be used for arguments, has allowed me to make that stuff very fast.
- Log in or register to post comments
I think this will be the last one. I have picked most of the low-hanging fruit. In fact even with this one I am probably going to leave it in C. I am curious if there is a way to get C to use MUL or ignore the MSB.
I have a 16 bit pointer that is the address of a structure in an array. I need to go back to the index of the array
In effect I am doing
i = ( (uint16_t)(objectPointer) - (uint16_t)(&objectArray[0]) ) / sizeof(object);
Now I know
So I know if I do the divide by 8 first, I can just use the lowest 8 bits and minus 0x44 (0x0220 / 8 = 0x44) hence the most optimal C I can make is
which gets compiled to
Which is 16 clocks.
In ASM I can
that keeps the full 8 bits of the result in 9 clocks OR because I know the answer is only 7 bits I can
which is the answer I need in this specific case in 6 clocks.
Any way of getting C to understand I only have 7 bits and the short LSR/ROR trick can be taken? Or even make it do the full 3 LSR/ROR (no BRNE.-8) and blow the code size out by 1 word and still get the answer in 7 clocks?
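As a sketch of the arithmetic being discussed: the two forms below compute the same index, assuming the array base of 0x0220 and the 8-byte element size implied by "0x0220 / 8 = 0x44". Addresses are passed as plain `uint16_t` so the example also runs on a host; the function names are illustrative only, and the actual C from the post is not reproduced here.

```c
#include <stdint.h>

#define ARRAY_BASE  0x0220u   /* from "0x0220 / 8 = 0x44" in the post */
#define OBJECT_SIZE 8u        /* bytes per element (assumed) */

/* Straightforward form: 16-bit subtract, then divide. */
static inline uint8_t index_naive(uint16_t addr)
{
    return (uint8_t)((addr - ARRAY_BASE) / OBJECT_SIZE);
}

/* Shift first, then subtract in 8 bits: (addr >> 3) is 0x44 + index,
   so only the low byte of the shifted value is needed. */
static inline uint8_t index_shift_first(uint16_t addr)
{
    return (uint8_t)((uint8_t)(addr >> 3) - (ARRAY_BASE >> 3));
}
```

Both forms agree for every element address; what differs is how much 16-bit work the compiler is asked to do.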
- Log in or register to post comments
Can anyone please tell me which USB programmer CodeVisionAVR is compatible with? It's urgent.
- Log in or register to post comments
Does -O3 unroll?
If yes, find which compiler flag makes the change.
- Log in or register to post comments
Perhaps only /4 and then /2 (but it's dangerous: the next compiler version may see that it can be done as one operation).
Because you only need 7 bits, you know that the result of the /4 can still be held in a byte.
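One way to read the /4-then-/2 idea in C is sketched below. The assumptions are mine and not confirmed by the thread: 8-byte elements, an array base of 0x0220, and an index below 128. Truncating to a byte after the /4 can drop a high bit, so this version masks the result to 7 bits, which restores the index modulo 128.

```c
#include <stdint.h>

/* Assumes: 8-byte elements, array base 0x0220, index below 128. */
static inline uint8_t index_two_step(uint16_t addr)
{
    uint8_t q = (uint8_t)(addr >> 2);   /* /4, truncated to one byte */
    q >>= 1;                            /* /2, now a single-byte shift */
    /* q is (0x44 + index) modulo 128, so subtract and mask to 7 bits. */
    return (uint8_t)((q - 0x44u) & 0x7Fu);
}
```

Whether the compiler keeps the two shifts separate, as hoped, would have to be confirmed in the generated listing.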
- Log in or register to post comments
And I don't understand why the compiler makes the loop that way instead of using the carry flag for the branch; it would be the same size but save a clock for each loop, and use no register.
- Log in or register to post comments
Sparrow2, -O3 does unroll it but I have to use -Os to fit all my game into the mega644. The difference between -O3 and -Os is about 4K in size saving. When I include the music assets in the game I have about 300 bytes left.
I have just been reading some more about GCC and optimizing. Looks like I have to put some of these functions in a separate C file (fast.c) so I can run -O3 on them, but leave the non-critical stuff in slow.c and then link them. There seems to be no way to just ask the compiler to switch optimize levels around bits of code.
But your idea of a /4 and /2 does work
becomes
- Log in or register to post comments
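On the question of switching optimization levels around individual functions: GCC does have a per-function `optimize` attribute (and a matching `#pragma GCC optimize`). The sketch below uses a made-up `sum_bytes` function purely to show the syntax. GCC's own documentation describes the attribute as intended for debugging rather than production use, so a separate fast.c may still be the safer route, and the effect on a particular avr-gcc version would need to be checked in the listing.

```c
#include <stdint.h>

/* GCC-specific: request -O3 for this one function even when the rest of
   the file is built with -Os. Illustrative function, not from the game. */
__attribute__((optimize("O3")))
uint8_t sum_bytes(const uint8_t *data, uint8_t count)
{
    uint8_t sum = 0;
    for (uint8_t i = 0; i < count; i++) {
        sum = (uint8_t)(sum + data[i]);   /* wraps modulo 256 */
    }
    return sum;
}
```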
You mean to
- Log in or register to post comments
Correct instructions, wrong numbers! (AND to keep the data and then OR in the "counter")
- Log in or register to post comments
OK - another one.
comes out as
and I would like it to look more like
Technically in the ASM version I don't need to push/pop as I know R25 is untouched by spibyte_ff, but I am keeping it C/ABI compatible and assuming R25 could get trashed.
I'm using GCC 4.9.2 with -Os. I have an almost full Mega644 and am going through C code and trying to save an extra few K to fit in another song now.
- Log in or register to post comments
I don't expect that it can come all the way down to that.
But first try (I don't have your compiler here) to replace the | with + (with no overlap it's the same).
And try to make a local U16 that you assign to.
- Log in or register to post comments
Had already tried
and
and
The one with OR ended up the best.
Just wanted to check here that there was not something else I didn't know about to help. I have learned a LOT about how the C compiler thinks here, but I still know very little.
- Log in or register to post comments
I would still try with a local variable, and spread it over 2 lines.
That way you should at least avoid the push and pop.
- Log in or register to post comments
That's the problem: a compiler doesn't think.
- Log in or register to post comments
That seems pretty sensible. You can always change to an ATmega1284 if you run out of space.
I have no idea what your "song" might be. Any .WAV or .MP3 is going to use more flash than any AVR could supply.
Adding a microSD or even an external Flash memory will provide extra memory.
I can quite understand your desire to minimise program size. I was expecting you to be struggling with a 2kB or 8kB device that had no bigger brothers.
Yes, you can save a few bytes by altering some function usage. The general approach is:
1. list functions by size. look at the greedy ones first.
2. use appropriate width of variables and scope.
3. avoid f-p maths and printf() functions.
4. use smaller algorithms.
I suspect that those 4 points would give you a significant saving. Minor tweaks to one function call will only save tens of bytes.
David.
- Log in or register to post comments
Note that code like this has unspecified behavior (the result depends on the evaluation order the compiler happens to pick):
C doesn't specify the order in which the functions are called so the compiler might call the second function first.
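A minimal sketch of the usual fix, assuming a hypothetical `get_byte()` that stands in for the SD card byte reader and assuming the low byte arrives first: giving each call its own statement makes the order explicit.

```c
#include <stdint.h>

/* Hypothetical byte source standing in for the SD card reader. */
static const uint8_t *stream;
static uint8_t get_byte(void) { return *stream++; }

/* Risky: the two calls may be evaluated in either order.
   uint16_t v = get_byte() | (get_byte() << 8);            */

/* Sequenced: each call is its own statement, low byte first. */
static uint16_t get_u16_le(void)
{
    uint8_t lo = get_byte();
    uint8_t hi = get_byte();
    return (uint16_t)(((uint16_t)hi << 8) | lo);
}
```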
- Log in or register to post comments
Thanks for the reminder. I have read this before.
OK - just tried splitting it to two lines. Slightly different result, but same size.
Sadly no. It is a game for the Uzebox. The user base already all have Mega644s. Squeezing the last drop of blood from the stone is the only option.
It is a "MOD" like song with the notes being synthesised rather than being WAVs. There is only one WAV which is sampled speech of the word "Play"
I think I have moved all the "graphics assets" I can to SD card. Some of them can not be moved to SD card - for example the message saying "SD card not found" :)
Done. Partially at least. The greediest functions are giant 12,000-line unrolled ASM functions, but they exist so the game can be done at all.
I think I have been programming space constrained systems long enough to have this one down pat :)
All fixed point and integer, using as few bits as will get the task done.
No printf or atoi() stuff. My font table is even non-ascii to save time/space printing my BCD and HEX values to screen.
Have all the usual major tweaks you mention in hand.
So far with the minor tweaks like this one - I have saved 400ish bytes. That is one third of the way to being able to include a 3rd song in the game.
Uze is trying to see if he can modify the kernel to make the sampled speech 4 bits rather than 8 bits. That would save another 800 bytes.
So if I keep plodding along I might get there.
For reference my memory budget is
- Log in or register to post comments
What do you have in the EEPROM?
And why does the bootloader need to be that big?
- Log in or register to post comments
Bootloader is written by someone else and contains
Video and sound generation (NTSC + mono 15 kHz)
SD Card and FatFS reading of games stored on SD card
Menu system to select which new game to play
Flash programming stuff as normal for a bootloader
The EEPROM is shared space that all games written for the Uzebox have to share. It is "formatted" and arbitrated by the "kernel" code you have to include/link to your own game.
I myself use 64 bytes of the EEPROM for a highscore table. It's generally considered poor form to use more than 32 bytes, but my game is impressive enough I think it warrants 2 "blocks" :)
- Log in or register to post comments
OK - this one has just moved to a (dot)s file. Using the knowledge that mmc_get_byte only touches r24, r30, r31 AND the fact that GetInt is a subset of GetLong let me shrink it a bit.
- Log in or register to post comments
It is still going to stay in ASM now, because I have made the effort and the ASM is 1/2 the size when considering getInt and getLong together.
But it was bugging the hell out of me that I could not get it to just OR two bytes together into an i16 without all the mess.
This is the best I have come up with.
And that comes out as
It is only one push/pop behind the ASM (considering getInt alone)
Does anyone know if there is a way to hint the compiler that "getByte" only touches r24 so it can avoid the push/pop and just use some register pair like R22:23 ?
- Log in or register to post comments
Looking at your "budget", I would start by optimising the 16kB of ASM. That is clearly the greediest part.
Quite honestly, there is little point in optimising mmcGetInt(void). It is the calling sequence that matters. e.g. RCALL mmcGetInt / MOV r24 / MOV r25 i.e. 3 words.
Even that could be reduced to 1 word if the subsequent code "knows" it is going to use r24, r25.
Even if the actual mmcGetInt is in an external file (so the linker wants CALL) you can put in a local trampoline.
You are an experienced ASM programmer. You must know all the standard techniques for reducing code. You also know that 95% of code space is not speed critical. It is only the 5% that is used for 90% of the time e.g. inside loops, common subroutines.
David.
- Log in or register to post comments
The 16K is not optimizable at all and the whole game hinges on it. Without it there is no game and the rest of the exercise is pointless. I am willing to be proven wrong here if anyone thinks they can save any space on it but I think that 16K can't have a single BYTE saved from it.
I know any individual routine is never going to save heaps of space. But when you have done ALL the rest you can think of you have to try them.
I have done a lot of optimizing so far that I have not posted questions here about because I figured them out myself. So far I have now saved 600 bytes from optimizing all the little things like GetInt. They ADD UP.
I managed to reduce getInt and getLong from 106 bytes (first unoptimized C) down to 20 bytes in ASM. (My best C got down to 46 bytes I think)
The other nice thing about my "optimizing the little things" like the SD card reading functions here, is that when I push all my little things back into the "kernel" I will save space on everyone else's Uzebox games that are written in the future.
This is actually an unusual case in that more than 50% of my flash space is taken up with speed critical things.
- Log in or register to post comments
And no one picked me up on the obvious saving of another 4 bytes I missed. (I noticed it myself when I was cleaning up comments)
- Log in or register to post comments
https://www.avrfreaks.net/comment/1628051#comment-1628051
Also supports 2-bit, but it sounds awful ;-)
"Experience is what enables you to recognise a mistake the second time you make it."
"Good judgement comes from experience. Experience comes from bad judgement."
"Wisdom is always wont to arrive late, and to be a little approximate on first possession."
"When you hear hoofbeats, think horses, not unicorns."
"Fast. Cheap. Good. Pick two."
"We see a lot of arses on handlebars around here." - [J Ekdahl]
- Log in or register to post comments
Thanks joey,
I am pretty sure that 4 bit will sound fine for the sample as it is played at the same time as 3 other channels of 8 bit generated sound in the music.
The problem for Uze will be whether he can fit the 3 note channels, the one noise channel and the 4 bit WAV channel into the 130 clocks that HSync runs for.
- Log in or register to post comments
Should be easily done in assembler. Currently you have:
You'll want:
You'd also need to track which nibble you're working on, so some extra cycles would be spent setting, testing, and clearing a flag somewhere. I assume all the low-hanging fruit like GPIOR0 are already spoken for, but perhaps there's something else available.
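The nibble bookkeeping can be sketched in C as below. This is only an illustration of the "track which nibble you're on" idea; the struct, the function name and the low-nibble-first order are assumptions, not the kernel's actual code, which would be in assembler and might keep the flag in a spare I/O register bit instead.

```c
#include <stdint.h>

/* Hypothetical player state; names are illustrative only. */
typedef struct {
    const uint8_t *data;   /* packed samples, two per byte */
    uint8_t high_next;     /* flag: 0 = low nibble next, 1 = high nibble next */
} nibble_stream_t;

/* Returns the next 4-bit sample (0..15). Low nibble first is an assumption. */
static inline uint8_t next_sample4(nibble_stream_t *s)
{
    uint8_t byte = *s->data;
    if (s->high_next) {
        s->high_next = 0;
        s->data++;                  /* both nibbles consumed, advance */
        return (uint8_t)(byte >> 4);
    }
    s->high_next = 1;
    return (uint8_t)(byte & 0x0F);
}
```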
- Log in or register to post comments
<Thread Necromancy>
Hi again all,
Need help giving the compiler hints again that are human friendly.
The following code compiles and works fine. Just a bit slow. As I have said previously I like to keep code in C rather than moving it to ASM if I can keep performance within 5% or so. Unless of course the C is more confusing and less readable than the ASM version.
And the LSS is
I have no idea WHY the PUSH R16 at the start of the routine and the POP R16 at the end OR the superfluous SUBI R31, 0x00 and EOR R1,R1 but that's not the question here.
If you look at the C code - you can see the two array access locations are sequential. I have been able to trick the compiler into using the larger indexes to access the second one like this
Which compiles to this
Now it still has the superfluous PUSH/POP/SUBI - but has gotten down in size by 10%.
But it does look kludgy.
Can a Compiler+ASM guru show me a better way to make the 2nd array access not do the expensive re-calculate of the array
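One shape this could take in C is to compute the element address once and reach the following element as `p[1]`. The sketch below is reconstructed from the comments in the assembly listing, not from the original C, so the field layout, the 256-element array and the reason for skipping the write when the old index is 0 are all assumptions. `m` is split into two bytes only to keep the struct at 8 bytes on any host. Whether avr-gcc then emits `std Z+8` style stores would need checking in the listing.

```c
#include <stdint.h>

/* Layout inferred from the store offsets in the assembly (8 bytes). */
typedef struct {
    uint8_t next;
    uint8_t col;
    uint8_t y1;
    uint8_t x1lo;
    uint8_t x1;
    uint8_t m_lo;   /* low byte of the 16-bit m */
    uint8_t m_hi;   /* high byte of the 16-bit m */
    uint8_t y2;
} GradLineStruct;

static GradLineStruct array[256];
static uint8_t linkNext;

void Add_Line(uint8_t col, uint8_t y1, uint8_t x1, int16_t m, uint8_t y2)
{
    uint8_t old  = linkNext;
    uint8_t next = (uint8_t)(old + 1);
    if (next == 0) return;              /* index would wrap: list is full */
    linkNext = next;

    GradLineStruct *p = &array[old];    /* address computed once */
    if (old != 0) p[0].next = next;     /* mirrors the CPSE skip in the assembly */

    p[1].next = 0;                      /* new element ends the list */
    p[1].col  = col;
    p[1].y1   = y1;
    p[1].x1lo = 0;
    p[1].x1   = x1;
    p[1].m_lo = (uint8_t)((uint16_t)m & 0xFF);
    p[1].m_hi = (uint8_t)((uint16_t)m >> 8);
    p[1].y2   = y2;
}
```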
- Log in or register to post comments
BTW - here is my ASM version
- Log in or register to post comments
For the whole picture:
You pass something close to the structure <GradLineStruct>, which must generate a lot of code before the call!
So perhaps pass a pointer to a GradLineStruct variable that holds the values you want!
- Log in or register to post comments
The reason for the PUSH/POP of R16 is because of this..
https://gcc.gnu.org/wiki/avr-gcc...
The ABI dictates that R2-R17 must be saved. Usually the compiler sticks to the ...
https://gcc.gnu.org/wiki/avr-gcc...
But if the code requires it to use SO many registers it will start to use call saved and then, as you see, it has to save them.
If you pass in a struct rather than all the individual dimensions you will reduce the register usage.
- Log in or register to post comments
Yes, but the compiler will just follow internal rules such as "if R2-R17 used then PUSH/POP"
BTW the compiler is an open project, all improvements/fixes welcome from all who paid to use it.
(Oh, wait a minute, no one actually pays do they?)
- Log in or register to post comments