## ADD with Saturation [need inline Asm help]

36 posts / 0 new
Author
Message

I'm looking to optimize this function:

```// Add signed to unsigned using saturation, return unsigned
uint8_t addwsat(uint8_t a, int8_t b) {
if(b>=0) {
if(a>255-b) return 255;
}
else {
if(a<(-b)) return 0;
}
return a+b;
}```

which the compilers converts to:

```uint8_t addwsat(uint8_t a, int8_t b) {
MOV       R23,R24        Copy register
MOV       R20,R24        Copy register
MOV       R18,R22        Copy register
CLR       R19            Clear Register
SBRC      R18,7          Skip if bit in register cleared
if(b>=0) {
SBRC      R22,7          Skip if bit in register cleared
RJMP      PC+0x000A      Relative jump
if(a>255-b) return 255;
SER       R24            Set Register
SUB       R24,R18        Subtract without carry
SBC       R25,R19        Subtract with carry
CP        R24,R20        Compare
CPC       R25,R21        Compare with carry
BRGE      PC+0x0C        Branch if greater or equal, signed
SER       R24            Set Register
RET                      Subroutine return
if(a<(-b)) return 0;
CLR       R24            Clear Register
CLR       R25            Clear Register
SUB       R24,R18        Subtract without carry
SBC       R25,R19        Subtract with carry
CP        R20,R24        Compare
CPC       R21,R25        Compare with carry
BRGE      PC+0x03        Branch if greater or equal, signed
RET                      Subroutine return
return a+b;
MOV       R24,R22        Copy register
}
RET                      Subroutine return```

I'm convinced that there is a more optimum way to achieve the same result. Any suggestions?

Why not use an int16_t intermediate register - just do the add.

```If high-byte == 0x01 return 0xFF
If high-byte == 0xFF return 0x00
if high-byte == 0x00 return low-byte```

Not sure if it's going to be "tighter" but maybe worth a look.

EDIT: haven't counted but it "looks" tighter:

```00000080 :
union {
int16_t i;
uint8_t b[2];
} join;

join.i = a + b;
80:	77 27       	eor	r23, r23
82:	67 fd       	sbrc	r22, 7
84:	70 95       	com	r23
86:	68 0f       	add	r22, r24
88:	71 1d       	adc	r23, r1
if (join.b[1] == 1) {
8a:	71 30       	cpi	r23, 0x01	; 1
8c:	11 f4       	brne	.+4      	; 0x92
8e:	8f ef       	ldi	r24, 0xFF	; 255
90:	08 95       	ret
return 0xFF;
}
else if (join.b[1] == 0xFF) {
92:	7f 3f       	cpi	r23, 0xFF	; 255
94:	11 f4       	brne	.+4      	; 0x9a
96:	80 e0       	ldi	r24, 0x00	; 0
98:	08 95       	ret
return 0x00;
}
else {
return join.b[0];
9a:	86 2f       	mov	r24, r22
}
}
9c:	08 95       	ret```

Clean copy of source:

```uint8_t addwsat(uint8_t a, int8_t b) __attribute__((noinline));
uint8_t addwsat(uint8_t a, int8_t b) {
union {
int16_t i;
uint8_t b[2];
} join;

join.i = a + b;
if (join.b[1] == 1) {
return 0xFF;
}
else if (join.b[1] == 0xFF) {
return 0x00;
}
else {
return join.b[0];
}
}```

(EDIT2: except it doesn't work - I think I have a sign promotion issue somewhere)

EDIT3: no, it does work (as far as I can tell). My initial test code was:

`	DDRB = addwsat((uint8_t)240, (int8_t)237)`

Only thing is that 237 is not a valid int8_t value ;-)

I've now bounds checked it at both ends and it seems to work.

Yes, thanks! It is working.
Although my assembly output is different:

```@00000A0D: addwsat
286:        join.i = a + b;
CLR       R23            Clear Register
SBRC      R22,7          Skip if bit in register cleared
287:        if (join.b[1] == 1) {
CPI       R23,0x01       Compare with immediate
BRNE      PC+0x03        Branch if not equal
SER       R24            Set Register
RET                      Subroutine return
290:        else if (join.b[1] == 0xFF) {
CPI       R23,0xFF       Compare with immediate
BRNE      PC+0x03        Branch if not equal
RET                      Subroutine return
294:           return join.b[0];
MOV       R24,R22        Copy register
296:      }
RET                      Subroutine return```

I am using AVR-GCC 4.3.3 with -Os optimization. Is this the latest version?

I bet it can be optimized further by using inline assembly, by checking directly the carry bits...

What is lat? XMEGA?

No RSTDISBL, no fun!

Yes, the target is an xmega, although the instruction set is mostly the same.

I got LAT when I built for mega168. It's the Atmel's debugger name for some other more normal opcode. (in fact it is COM)

Quote:
I bet it can be optimized further by using inline assembly, by checking directly the carry bits...

Sure it can be. AVRs have hardware unsigned overunderflow(both SREG_C and SREG_H), signed overunderflow(which is called SREG_V) and if signed, then it can distinguish overflow from underflow (which is called SREG_S). These are calculated automatically. I suppose you could make this code about 6-8 opcodes +ret long.
But since that is -Os, I do not see the point.

No RSTDISBL, no fun!

clawson wrote:
I got LAT when I built for mega168. It's the Atmel's debugger name for some other more normal opcode. (in fact it is COM)
The latest instruction manual appears to have some sort of a LAT instruction. It is probably used in one of those tiny tiny MCUs? If that is the case, then this must be a bug in disassembler?

Have you seen that?

Warning: Grumpy Old Chuff. Reading this post may severely damage your mental health.

I wanted to optimize the assembly output, but the compiler gives me an error at the LAT instruction:

`../asmutil.S:23: Error: unknown opcode `lat'`

Any ideas on how to fix it?

LAT=COM

Shirley, LAT is a pseudonym for COM. i.e. flip every bit.

You simply use your pre-processor to change the mnemonic. Or personally, I would create a conditional macro.

David.

No.

Description:
Operation:
(Z) â† Rd âŠ• (Z), Rd â† (Z)

Syntax: Operands: Program Counter:
LAT Z,Rd 0 â‰¤ d â‰¤ 31 PC â† PC + 1

16-bit Opcode: 1001 001r rrrr 0111
Words: 1 (2 bytes)
Cycles: 1

Warning: Grumpy Old Chuff. Reading this post may severely damage your mental health.

Observe:

```uint8_t addwsat(uint8_t a, int8_t b) __attribute__((noinline));
uint8_t addwsat(uint8_t a, int8_t b) {
union {
int16_t i;
uint8_t b[2];
} join;

join.i = a + b;
if (join.b[1] == 1) {
return 0xFF;
}
else if (join.b[1] == 0xFF) {
return 0x00;
}
else {
return join.b[0];
}
} ```
```// test.s
.LFB2:
.LM1:
.LVL0:
/* prologue: function */
/* frame size = 0 */
.LM2:
clr r23
sbrc r22,7
com r23
```
```// test.lss
00000080 :
union {
int16_t i;
uint8_t b[2];
} join;

join.i = a + b;
80:	77 27       	eor	r23, r23
82:	67 fd       	sbrc	r22, 7
84:	70 95       	com	r23
```
```// Atmel's weird and whacky disassembler in Studio
---- test.c ---------------------------------------------------------------------------------------
6:        uint8_t addwsat(uint8_t a, int8_t b) {
+00000040:   2777        CLR       R23            Clear Register
+00000041:   FD67        SBRC      R22,7          Skip if bit in register cleared
+00000042:   9570        LAT       R23            Load and Toggle```

You pays yer money and you takes yer choice as to how it's annotated.

Clearly this IS a fault in 4.18SP3's disassembler. Why you'd use this when you have access to avr-as source of avr-obdump's own disassembly I don't know?

Why is the assembly in Studio weird and wacky if it generates code that works?
And like MBedder says, the LAT and COM are not the same instruction...
Well, while we figure that out, I optimized the assembly a little bit, 2 instructions less:

```; join.i = a + b;
eor   r23, r23
sbrc  r22, 7
com   r23
; if (join.b[1] == 1) {
cpi   r23, 0x01   ; == 1?
brne   .+4        ; no=>jump
ldi   r24, 0xFF   ; return 255
ret
; else if (join.b[1] == 0xFF) {
cpi   r23, 0xFF   ; == 255 ?
brne  .+2         ; no=>return join.b[0]
ldi   r24, 0x00   ; return 0
ret ```

Quote:

Why is the assembly in Studio weird and wacky if it generates code that works?

It doesn't generate the code?!? It takes code that has been generated by a C compiler and tries to disassemble it and on this occasion gets it wrong.

BTW do not use PC relative addressing in branches - labels cost nothing!

In fact if you are going to use GCC generated Asm why not use the .s files?

Yes, I am placing the assembly function in a .s file.
But I'm still wondering why I can't use the LAT instruction if I wanted to...

Quote:

But I still wondering why I can't use the LAT instruction if I wanted to..

Because it's the WRONG instruction. The instruction that inverts the contents of a register to make 0x00 into 0xFF is COM. Atmel's stupid disassembler made a mistake when it disassembled the code that he C compiler had created.

And when I talk about .s I don't mean the ones you are creating but the one that the C compiler generates for each .c file you build. You simply give the option --save-temps and for every foo.c that is built you get a foo.i (the pre-proc file) and foo.s (the avr-as source output by the compiler and fed into the assembler).

This is the .s file for that routine (including an empty main):

```	.file	"test.c"
__SREG__ = 0x3f
__SP_H__ = 0x3e
__SP_L__ = 0x3d
__CCP__  = 0x34
__tmp_reg__ = 0
__zero_reg__ = 1
;  GNU C (WinAVR 20100110) version 4.3.3 (avr)
; 	compiled by GNU C version 3.4.5 (mingw-vista special r3), GMP version 4.2.3, MPFR version 2.4.1.
;  GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
;  options passed:  -fpreprocessed test.i -mmcu=atmega168 -Os -Wall
;  -std=gnu99 -funsigned-char -funsigned-bitfields -fpack-struct
;  -fshort-enums -fverbose-asm
;  options enabled:  -falign-loops -fargument-alias -fauto-inc-dec
;  -fbranch-count-reg -fcaller-saves -fcommon -fcprop-registers
;  -fcrossjumping -fcse-follow-jumps -fdefer-pop -fearly-inlining
;  -feliminate-unused-debug-types -fexpensive-optimizations
;  -fforward-propagate -ffunction-cse -fgcse -fgcse-lm
;  -fguess-branch-probability -fident -fif-conversion -fif-conversion2
;  -finline-functions -finline-functions-called-once
;  -finline-small-functions -fipa-pure-const -fipa-reference -fivopts
;  -fmerge-constants -fmerge-debug-strings -fmove-loop-invariants
;  -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls
;  -fpack-struct -fpeephole -fpeephole2 -freg-struct-return -fregmove
;  -freorder-functions -frerun-cse-after-loop -fsched-interblock
;  -fsched-spec -fsched-stalled-insns-dep -fsigned-zeros
;  -fsplit-ivs-in-unroller -fsplit-wide-types -fstrict-aliasing
;  -ftree-ccp -ftree-copy-prop -ftree-copyrename -ftree-dce
;  -ftree-dominator-opts -ftree-dse -ftree-fre -ftree-loop-im
;  -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops=
;  -ftree-reassoc -ftree-salias -ftree-scev-cprop -ftree-sink -ftree-sra
;  -ftree-store-ccp -ftree-ter -ftree-vect-loop-version -ftree-vrp
;  -funit-at-a-time -fverbose-asm -fzero-initialized-in-bss

;  Compiler executable checksum: 61d68a374065d489330774d2533cbbdf

.text
/* prologue: function */
/* frame size = 0 */
clr r23	 ;  tmp49
sbrc r22,7	 ;  tmp49
com r23	 ;  tmp49
cpi r23,lo8(1)	 ;  join,
brne .L2	 ; ,
ldi r24,lo8(-1)	 ;  D.1336,
ret
.L2:
cpi r23,lo8(-1)	 ;  join,
brne .L4	 ; ,
ldi r24,lo8(0)	 ;  D.1336,
ret
.L4:
mov r24,r22	 ;  D.1336, tmp49
ret
.global	main
.type	main, @function
main:
/* prologue: function */
/* frame size = 0 */
.L7:
rjmp .L7	 ;
.size	main, .-main
```

clawson wrote:
Because it's the WRONG instruction
What I meant is what if I wanted to use the LAT instruction in general (not just in this ADDWSAT function). Neither the editor or the assembler recognize it.

These instructions (along with LAC, LAS and XCH) are valid only for new XMega CPU core (V3XJ). They can be successfully assembled after selecting an XMega as a target or after issuing the following #pragma directive:

```#pragma AVRPART CORE CORE_VERSION V3XJ

xch     z,r1            ; Exchange r1 with RAM[Z}

lac     z,r2            ; Load And Clear R2 bits of RAM{Z}
las     z,r3            ; Load And Set R3 bits of RAM{Z}
lat     z,r4            ; Load And Toggle R4 bits of RAM{Z}
```

Warning: Grumpy Old Chuff. Reading this post may severely damage your mental health.

Ok, I have optimized much further:

```; ADD with Saturation
; uint8_t addwsat(uint8_t a, int8_t b);
sbrs  r22, 7    ; skip if b is negative
rjmp  checkoverflow
checkunderflow:
brcs  exitadd1  ; carry not set-> underflow
ldi   r24, 0x00 ; return 0
ret
checkoverflow:
brcc  exitadd2  ; carry set-> overflow
ldi   r24, 0xFF ; return 255
ret
.endfunc```

Now I think the code is small enough that it might be efficient to make it inline. Can someone help with this? I don't know much about register clobbering and such...

Ok, I am reading the inline assembler cookbook and I am trying to write the inline assembly code, I still need some help, so far I have this:

```#define addwsat(_a,_b)                  \
({                                  \
asm (                           \
"sbrs  %[b], 7"       "\n\t"    \
"rjmp  2f"            "\n\t"    \
"brcs  1f"            "\n\t"    \
"ldi   %[a], 0x00"    "\n\t"    \
"1: "   "ret"                 "\n\t"    \
"2: "   "brcc  3f"            "\n\t"    \
"ldi   r24, 0xFF"     "\n\t"    \
"3: "   "ret"                           \
: [a] "=r" (_a)                 \
: [a] "r" ((uint8_t)(_a)),  [b] "r" ((int8_t)(_b)));\
}) ```

But the jumps to the RET instruction, and the RET instruction itself, need to be replaced by a "jump outside of this section of code". Also, can someone check the overall code?

Why would you have "RET" in inline code? It's not CALL'd - it's inline - that's the point. You do want one common exit point that all code paths ultimately jump to and that falls through to the next Asm or C statements.

Shall I move this to the GCC forum now as you are more likely to be seen by the inline Asm experts there?

Cliff

Exactly, that is what I meant about the ret. Yes, please move to the gcc forum.

I do not get your intention. You are trying to squeeze a 12w code into the one which takes only 8w. You explicitly point out your aim is to make the code smaller (-Os). And now you want to make it inlined. Can you explain what is the purpose of it?

No RSTDISBL, no fun!

This particular function is used in a loop that needs to be executed very fast. But regardless, Smaller code and Faster code are not mutually exclusive...

Here is the code I use to add-saturate resp. sub-saturate unsigned char + char, 7 Instructions each:

```;; unsigned R16
;; signed   R0
;; R16 := R16 + R0 with Saturation
; Transform [0x00, 0xff] -> [0x80, 0x7f]
subi R16, 0x80
brvc 0f
; Signed overflow -> MAX
ldi  R16, 0x7f
sbrc R0, 7
; R0 is negative -> load MIN
ldi  R16, 0x80
0:
; Back-Transform [0x80, 0x7f] -> [0x00, 0xff]
subi R16, 0x80```

Subtraktion

```;; unsigned R16
;; signed   R0
;; R16 := R16 - R0 with Saturation
; Transform [0x00, 0xff] -> [0x80, 0x7f]
subi R16, 0x80
sub  R16, R0
brvc 0f
; Signed Overflow -> MAX
ldi  R16, 0x80
sbrc R0, 7
; R0 ist negativs -> MIN
ldi  R16, 0x7f
0:
; Back-Trafo [0x80, 0x7f] -> [0x00, 0xff]
subi R16, 0x80```

avrfreaks does not support Opera. Profile inactive.

And here wrapped into Inline Assembler for you :-)

```#include

static inline uint8_t
{
asm ("subi #0, 0x80" "\n\t"
"brvc 0f"       "\n\t"
"ldi  #0, 0x7f" "\n\t"
"sbrc #1, 7"    "\n\t"
"ldi  #0, 0x80" "\n"
"0:"            "\n\t"
"subi #0, 0x80"
: "+d" (a)
: "r" (b));

return a;
}```

You have to replace all # by % because this f***orum software crashed with %.

avrfreaks does not support Opera. Profile inactive.

Thanks for your code! That is just what I needed...

Use % (all FIVE chars) in the places where you need to display a % that's within quotes and it then works.

`asm ("subi %0, 0x80" "\n\t"`

1) Studio 4.18 build 716 (SP3)
2) WinAvr 20100110
3) PN, all on Doze XP... For Now
A) Avr Dragon ver. 1
B) Avr MKII ISP, 2009 model
C) MKII JTAGICE ver. 1

Quote:

Use %

Welcome to the "inner circle" - those of us who know how to type % into a post so it isn't turned into % sign :lol:

Yeah, thanks, you and another 'freaks veteran schooled me on that one . I'm a straightup "crime-fighter" ( dissipating posting frustrations wherever I go ) with that one. :)

1) Studio 4.18 build 716 (SP3)
2) WinAvr 20100110
3) PN, all on Doze XP... For Now
A) Avr Dragon ver. 1
B) Avr MKII ISP, 2009 model
C) MKII JTAGICE ver. 1

indianajones11 wrote:
UseÂ %

That works just once. If someone quotes that code he will run in the same error
www.avrfreaks.net wrote:

Your browser sent a request that this server could not understand.
Apache Server atÂ www.avrfreaks.net Port 80

and wonder what's wrong with his browser.

Why is it not possible to fix that bug in the first place?

avrfreaks does not support Opera. Profile inactive.

Quote:

Why is it not possible to fix that bug in the first place?

Atmel don't have two sheckels to rub together :-(

Quote:

Why is it not possible to fix that bug in the first place?

Haven't we been here before, SprinterSB?

Atmels budget for AVRfreaks is asymptotically close to zero.

Do us freaks like it? Nah!
Will that change? Prolly not in the near future.
Is it the only 'freaks site PITA? Far from it.

As of January 15, 2018, Site fix-up work has begun! Now do your part and report any bugs or deficiencies here

No guarantees, but if we don't report problems they won't get much of  a chance to be fixed! Details/discussions at link given just above.

"Some questions have no answers."[C Baird] "There comes a point where the spoon-feeding has to stop and the independent thinking has to start." [C Lawson] "There are always ways to disagree, without being disagreeable."[E Weddington] "Words represent concepts. Use the wrong words, communicate the wrong concept." [J Morin] "Persistence only goes so far if you set yourself up for failure." [Kartman]