Inline assembly to use FMUL


So far, I have been calling an assembly function to use the FMUL instruction:

; 1:7 Fixed Point Multiplication
.global fmuls_8
.func fmuls_8
fmuls_8:
    push r23
    mov r23, r24
    fmuls R22, R23
    mov r24, r1
    clr r1
    pop r23
    ret
.endfunc

Works fine, but I want better performance by putting the assembly inline with my C code. This will save time by avoiding a function call, and the code will be smaller since I won't need to place the data in specific registers for the call and the return.

Basically, what I need is to FMULS two 8-bit variables and take only the high byte of the 16-bit result. Can someone help?


So use inline assembly. Which compiler do you use?

Regards,
Steve A.

The Board helps those that help themselves.


I'm using AVR GCC.
What I'm not familiar with is the clobber list.
How do I know which register the compiler is using for a variable?


ganzziani wrote:
So far, I have been calling an assembly function to use the FMUL instruction:
Why do you play games with r23?
Why not fmuls r24, r23?
Quote:

; 1:7 Fixed Point Multiplication
.global fmuls_8
.func fmuls_8
fmuls_8:
    push r23
    mov r23, r24
    fmuls R22, R23
    mov r24, r1
    clr r1
    pop r23
    ret
.endfunc

Works fine, but I want better performance by putting the assembly inline with my C code. This will save time by avoiding a function call, and the code will be smaller since I won't need to place the data in specific registers for the call and the return.

Basically, what I need is to FMULS two 8-bit variables and take only the high byte of the 16-bit result. Can someone help?

Are you asking how to do inline assembly?
Unless the compiler introduces a function call, the following should be faster than a function call but slower than inline assembly.
static inline unsigned char fmuls(unsigned char arg1, unsigned char arg2)
{
    return (arg1*arg2)>>7;
}

If you go the inline assembly route, wrap it in a static inline function to be sure the arguments are a byte apiece.
WinAVR->avr-libc->user manual->inline assembler cookbook

Moderation in all things. -- ancient proverb


skeeve wrote:
Why do you play games with r23?
Why not fmuls r24, r23?
The operands of FMUL need to be registers R16 through R23.
skeeve wrote:
the following should work faster than a function call but slower than inline assembly.
static inline unsigned char fmuls(unsigned char arg1, unsigned char arg2)
{
return (arg1*arg2)>>7;
}

Thanks, I will check it out. I suspect this will generate a MUL and then a shift, which is better than calling a function. The FMUL however does the MUL and the shift in one instruction.

The inline assembler cookbook looks like TFM I need to R. I would appreciate any help, though.
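
The point that FMULS folds the multiply and the shift into one instruction can be checked off-target. The following is only a host-side sketch, not AVR code: `fmuls_model`, `fmuls8_hi`, `mul_shift7` and `count_mismatches` are hypothetical names, and the model assumes the documented FMULS behaviour (signed 8x8 multiply, shifted left by one bit, low 16 bits kept in r1:r0).

```c
#include <stdint.h>

/* Hypothetical host-side model of AVR FMULS: signed 8x8 multiply, shifted
 * left by one bit, truncated to the 16 bits that end up in r1:r0. */
static uint16_t fmuls_model(int8_t a, int8_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;  /* at most 16384 in magnitude */
    return (uint16_t)(2 * product);             /* wraps modulo 2^16 */
}

/* High byte of the model result, reinterpreted as a signed Q1.7 value. */
static int8_t fmuls8_hi(int8_t a, int8_t b)
{
    uint8_t hi = (uint8_t)(fmuls_model(a, b) >> 8);
    return (hi < 128) ? (int8_t)hi : (int8_t)((int)hi - 256);
}

/* The plain C variant from the thread. Right-shifting a negative int and
 * narrowing an out-of-range value are implementation-defined in C; this
 * sketch assumes GCC's behaviour (arithmetic shift, wrap on narrowing). */
static int8_t mul_shift7(int8_t a, int8_t b)
{
    return (int8_t)((a * b) >> 7);
}

/* Number of input pairs on which the two versions disagree. */
static long count_mismatches(void)
{
    long mismatches = 0;
    for (int a = -128; a <= 127; a++)
        for (int b = -128; b <= 127; b++)
            if (fmuls8_hi((int8_t)a, (int8_t)b) != mul_shift7((int8_t)a, (int8_t)b))
                mismatches++;
    return mismatches;
}
```

Under those assumptions `count_mismatches()` returns 0, so the two approaches differ in generated code rather than in results.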


Here is some inline assembly code (replace $ with percent) and a variation of skeeve's C code. I haven't spent time to determine if the results are the same in both cases.

#define FMULS8(v1, v2)                \
({                                    \
    uint8_t res;                      \
    uint8_t val1 = v1;                \
    uint8_t val2 = v2;                \
    __asm__ __volatile__              \
    (                                 \
        "fmuls $1, $2"         "\n\t" \
        "mov $0, r1"           "\n\t" \
        "clr r1"               "\n\t" \
        : "=&d" (res)                 \
        : "a" (val1), "a" (val2)      \
    );                                \
    res;                              \
})

static inline signed char fmuls8(signed char arg1, signed char arg2)
{
    return((arg1 * arg2) >> 7);
}

int8_t a;
int8_t b;
int8_t r;
int main(void)
{
#if 0
    r = FMULS8(a, b);
#else
    r = fmuls8(a, b);
#endif
    for (;;);
}

The resulting code produced for each by WinAVR 20100110 is:

    r = FMULS8(a, b);
    18a4:	20 91 02 01 	lds	r18, 0x0102
    18a8:	30 91 00 01 	lds	r19, 0x0100
    18ac:	a3 03       	fmuls	r18, r19
    18ae:	81 2d       	mov	r24, r1
    18b0:	11 24       	eor	r1, r1
    18b2:	80 93 01 01 	sts	0x0101, r24

    r = fmuls8(a, b);
    18a4:	80 91 00 01 	lds	r24, 0x0100
    18a8:	20 91 02 01 	lds	r18, 0x0102
    18ac:	82 02       	muls	r24, r18
    18ae:	c0 01       	movw	r24, r0
    18b0:	11 24       	eor	r1, r1
    18b2:	88 0f       	add	r24, r24
    18b4:	89 2f       	mov	r24, r25
    18b6:	88 1f       	adc	r24, r24
    18b8:	99 0b       	sbc	r25, r25
    18ba:	80 93 01 01 	sts	0x0101, r24

Don Kinzer
ZBasic Microcontrollers
http://www.zbasic.net

"=&d" (res)
.
.
res;

So variables in the output list are the return value? What if you have more than one value in the output list?

1) Studio 4.18 build 716 (SP3)
2) WinAvr 20100110
3) PN, all on Doze XP... For Now
A) Avr Dragon ver. 1
B) Avr MKII ISP, 2009 model
C) MKII JTAGICE ver. 1


indianajones11 wrote:
So vars. in output list means return value?
Yes. It's all fairly well explained in the Inline Assembly Cookbook, part of the avr-libc FAQ installed when you installed WinAVR and also available online.
indianajones11 wrote:
What if you have more than 1 value in the output list ?
Multiple entries in the output list mean you have multiple outputs. You can also have the same variable be both an input and an output.

The macro that I proposed was written to be used as a function; that's why the "res" variable appears on a line by itself at the end. (That's the idiom for making a macro behave like a function.) Of course, if you have multiple outputs you'll have to arrange for them to be conveyed to the caller, e.g. via passed pointers, a structure, global variables, etc.
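
As a rough illustration of multiple outputs conveyed through passed pointers, here is a sketch in which the asm has two entries in its output list. `fmuls_split` is a hypothetical helper, not something from the cookbook, and the `#else` branch is only a host fallback (an assumption added so the sketch can be compiled and tested off-target).

```c
#include <stdint.h>

/* Hypothetical helper: returns both bytes of the FMULS result via pointers. */
static inline void fmuls_split(int8_t a, int8_t b, int8_t *hi, uint8_t *lo)
{
#if defined(__AVR_HAVE_MUL__)
    int8_t  h;
    uint8_t l;
    __asm__ (
        "fmuls %[a], %[b]"   "\n\t"
        "mov   %[lo], r0"    "\n\t"
        "mov   %[hi], r1"    "\n\t"
        "clr   __zero_reg__"
        : [hi] "=r" (h), [lo] "=r" (l)   /* two entries, two outputs */
        : [a] "a" (a), [b] "a" (b)       /* "a": r16..r23, as FMULS requires */
    );
    *hi = h;
    *lo = l;
#else
    /* Host fallback: same arithmetic in portable C. */
    uint16_t p = (uint16_t)(2 * ((int32_t)a * (int32_t)b));
    uint8_t  h = (uint8_t)(p >> 8);
    *hi = (h < 128) ? (int8_t)h : (int8_t)((int)h - 256);
    *lo = (uint8_t)(p & 0xFF);
#endif
}
```

Each output gets its own register, and the caller decides where the two bytes go.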

Don Kinzer
ZBasic Microcontrollers
http://www.zbasic.net


I tried using the assembly macro and I get these errors:

mso.s:2577: Error: constant value required
mso.s:2577: Error: register r16-r23 required
mso.s:2577: Error: `,' required
mso.s:2577: Error: register r16-r23 required
mso.s:2577: Error: garbage at end of line
mso.s:2577: Error: junk at end of line, first unrecognized character is `1'
mso.s:2577: Error: junk at end of line, first unrecognized character is `2'
mso.s:2578: Error: constant value required
mso.s:2578: Error: `,' required
mso.s:2578: Error: garbage at end of line
mso.s:2578: Error: junk at end of line, first unrecognized character is `0'

Lines 2577 and 2578 are:

	fmul $1, $2
	mov $0, r1

dkinzer wrote:
Here is some inline assembly code (replace $ with percent)
Please follow the directions. The substitution was made because the forum software chokes on a post with the percent sign in its proper place. There may be other ways to avoid this problem, but I and others have chosen this one.

Don Kinzer
ZBasic Microcontrollers
http://www.zbasic.net


:oops: thanks


Thanks Don, it works great!
I use FMUL inside a loop, and the loop now takes 12 fewer cycles. My program memory increased by just a couple of bytes.


Clawson's workaround for percent sign posting:

Clawson wrote:
Replace % signs with &#37
(that's the HTML code for a % sign)

I can't make it work, personally. I tried angle brackets. How come I can just put it in % and it works ?!!



indianajones11 wrote:
Clawson's workaround for percent sign posting:
Another alternative, which I've seen before but hadn't yet tried, is to use symbolic labels (identifiers enclosed in square brackets) as shown below. The identifiers needn't be the same as the corresponding variable names (enclosed in parentheses) as I've done.

The other nice thing about using symbolic labels instead of ordinal references is that it makes it simpler to add/remove elements of the input and output lists. With the ordinal references you must adjust some/all of the references but with symbolic references you don't.

Note that older versions of avr-gcc don't support symbolic labels. I don't know when support was added but it works fine in WinAVR_20100110. It would be nice if you could omit the variable reference if a symbolic label is present. I tried it - it doesn't work.

#define FMULS8(v1, v2)                         \
({                                             \
    uint8_t res;                               \
    uint8_t val1 = v1;                         \
    uint8_t val2 = v2;                         \
    __asm__ __volatile__                       \
    (                                          \
        "fmuls %[val1], %[val2]"   "\n"        \
        "mov %[res], r1"           "\n"        \
        "clr r1"                   "\n"        \
        : [res] "=&d" (res)                    \
        : [val1] "a" (val1), [val2] "a" (val2) \
    );                                         \
    res;                                       \
})

Don Kinzer
ZBasic Microcontrollers
http://www.zbasic.net


ganzziani wrote:
it works great!
The question is, however, do you understand it well enough to be able to do something similar in the future? If not, you might want to spend some time reading the Inline Assembly Cookbook.

Don Kinzer
ZBasic Microcontrollers
http://www.zbasic.net


Quote:
Quote:
Replace % signs with &#37
I can't make it work, personally.
It needs a semicolon at the end:
%
See, it works :)

Regards,
Steve A.

The Board helps those that help themselves.


My macro reads:

// c = a*b in Q1.7
#define fmul8_8(_a,_b)                    \
    ({                                    \
        int8_t _c;                        \
        asm (                             \
        "fmuls %[a], %[b]"    "\n\t"      \
        "mov   %[c], R1"      "\n\t"      \
        "clr    __zero_reg__"             \
        : [c] "=r" (_c)                   \
        : [a] "a" ((uint8_t)(_a)),  [b] "a" ((uint8_t)(_b))); \
        _c;})

  • the output register need not be early-clobbered
  • the output register need not be in class "d"; "r" is just fine
  • the asm mentions all side effects, so there is no need for it to be volatile. If the compiler can prove that the result is not needed, it can throw away the whole asm.

avrfreaks does not support Opera. Profile inactive.


As Steve says, that HTML sequence needs a terminating semicolon to be complete. But as Steve also showed, when you do that all you get is a % sign. To make the sequence itself visible in a post, I actually type &#38X#37; (with X replaced by a semicolon). This works because 38 is the code for an ampersand.


Thank you, Steve and Cliff.



I want a variation on the macro that returns the 16-bit result. This seems to be working, but can someone else verify?

// Multiply two 8bit signed numbers, return 16bit
#define FMULS(_a,_b)                      \
    ({                                    \
        int16_t _c;                       \
        asm (                             \
        "fmuls %[a], %[b]"    "\n\t"      \
        "movw  %[c], R0"     "\n\t"    \
        "clr    __zero_reg__"             \
        : [c] "=r" (_c)                   \
        : [a] "a" ((uint8_t)(_a)),  [b] "a" ((uint8_t)(_b))); \
        _c;}) 

Looks reasonable. An inline function would be easier to read and write.
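
For comparison, the 16-bit macro might look like this as an inline function. This is a sketch under assumptions: `fmuls16` is a hypothetical name, `%A[c]` is used to name the low register of the output pair explicitly, and the `#else` branch is a host fallback added only so the function can be tried off-target.

```c
#include <stdint.h>

/* Hypothetical inline-function form of the 16-bit FMULS macro. */
static inline int16_t fmuls16(int8_t a, int8_t b)
{
#if defined(__AVR_HAVE_MUL__)
    int16_t c;
    __asm__ (
        "fmuls %[a], %[b]"   "\n\t"
        "movw  %A[c], r0"    "\n\t"   /* copy r1:r0 into the output pair */
        "clr   __zero_reg__"
        : [c] "=r" (c)
        : [a] "a" (a), [b] "a" (b)
    );
    return c;
#else
    /* Host fallback: 2*a*b, wrapped to 16 bits, then made signed again. */
    uint16_t p = (uint16_t)(2 * ((int32_t)a * (int32_t)b));
    return (p < 0x8000u) ? (int16_t)p : (int16_t)((int32_t)p - 65536);
#endif
}
```

The function form gives the arguments fixed types, so the casts inside the macro are no longer needed.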

avrfreaks does not support Opera. Profile inactive.


I have been using the macro proposed by SprinterSB and dkinzer, and it has been working great. But I recently found a subtle error: since I am discarding the lower 8 bits of the result, there is a rounding error. I found that FMULS8(_a,_b) != -FMULS8(-_a,_b). The way to fix it is to add 0.5 to the result. I tried modifying the macro myself, without luck:

// Multiply two 8bit signed numbers, return 8bit, fix rounding
#define FMULS8R(_a,_b)              \
({                                  \
    int8_t _c;                      \
    asm (                           \
    "fmuls %[a], %[b]"    "\n\t"    \
    "movw R24,R0"         "\n\t"    \
    "adiw R24,63"         "\n\t"    \
    "mov   %[c], R25"     "\n\t"    \
    "clr    __zero_reg__"           \
    : [c] "=r" (_c)                 \
    : [a] "a" ((uint8_t)(_a)),  [b] "a" ((uint8_t)(_b))); \
        _c;}) 

Any help would be appreciated!


Isn't 63 more like 0.246?

JW

[Oh, and this get captchad! :-( ]


What numbers are you using as an example, and what answer did it give you? To round in Q1.7, you add 64, not 63.

Edit: Since the LSB of the 2-byte result can use all 8 bits, wouldn't it be 128 (the halfway point), and not 64?



The 8-bit numbers I am multiplying are in Q7 format; when multiplied, they give a Q15 result (16 bits). These are signed numbers.

I add 63 to the 16bit number for rounding (using ADIW, which can only add an immediate up to 63).

I mistakenly said that I add 0.5 for rounding, what I meant is that I add half of the max value of the lower byte for rounding.

One example where I found the error was with these numbers (showing the decimal representation instead of the fractional): FMULS8(8,50) and FMULS8(8,-50)

FMULS(8, 50) = 0x08 fmul 0x32 = 0x0320
taking only the most significant byte, the result is 3.

FMULS(8,-50) = 0x08 fmul 0xCE = 0xFCE0
taking only the most significant byte, the result is -4.

but using the rounding, the results are:
0x0320 + 0x3F = 0x035F --> High byte = 3
0xFCE0 + 0x3F = 0xFD1F --> High byte = -3

Which is the result I want (a*(-b) = -(a*b))

If there is an error with my approach, please let me know... I'll have to check some more numbers, now I am having doubts if I need to add 128 instead of 64 (or 63). Thanks for looking indianajones11...


Sure, no problem. I know adiw only takes 63 as its maximum immediate, so you'd have to code it differently to add 64. But 64 would only be right IF you had 7 bits in the LSB, and you have all 8. So:

Quote:
...what I meant is that I add half of the max value of the lower byte for rounding.
Exactly why the number to use for rounding is 128: the LSB of the result spans the full EIGHT bits. It's the MSB that only has 7 bits.

Your example using 8 and 50 is correct: 8/128 = 0.0625, 50/128 = 0.3906. Their product = 0.0244. I should keep it all to 3 significant digits, but... I didn't. Your macro's answer = 3 => 3/128 = 0.0234. 0.0234 / 0.0244 * 100% = 96% (using the unrounded values)... a 4% error isn't too bad, and hey, it's ONLY 7 bits of resolution! The error falls to below 0.1% for some higher-valued numbers I used earlier today.

I'm not so good at handling the negative varieties, and leave that for someone else! Someday, though...

But your negative-number error just comes from using SEVEN-bit, not 8-bit, numbers, no rounding, and lowly integer math, which together lack resolution/precision. Q1.15 using your same approach and NO rounding would give about 0% error, I'll bet. The DSP books I've studied always state that rounding first is more accurate than just truncating, so you've got it (after using the correct rounding number :mrgreen:).

Inline function version:

static inline int8_t fmuls8(int8_t _a, int8_t _b)
{
    int8_t _c;

    asm volatile
    (
        "fmuls %[a], %[b]"    "\n\t"
        "movw  R24, R0"       "\n\t"
        "adiw  R24, 63"       "\n\t"
        "mov   %[c], R25"     "\n\t"
        "clr   __zero_reg__"  "\n\t"

        : [c] "=r" (_c)
        : [a] "a" (_a), [b] "a" (_b)
    );

    return _c;
}

I think r24 and r25 need to be on the clobber list, though? I could write it so the compiler picks the pair, but I don't know how to mov the upper register to the output in such a general form.
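
One possible way to let the compiler pick the pair, sketched under assumptions: a 16-bit scratch output with the "w" constraint (register pairs usable with ADIW) and the `%A`/`%B` operand modifiers to name its low and high registers. `fmuls8_bias63` and `tmp` are hypothetical names, and the `#else` branch is a host fallback added so the arithmetic can be checked off-target. The simpler alternative would be to keep R24/R25 hard-coded and list "r24", "r25" as clobbers.

```c
#include <stdint.h>

/* Hypothetical variant: FMULS, add 63 to the 16-bit result, keep high byte. */
static inline int8_t fmuls8_bias63(int8_t a, int8_t b)
{
#if defined(__AVR_HAVE_MUL__)
    int8_t   c;
    uint16_t tmp;                        /* scratch pair chosen by the compiler */
    __asm__ (
        "fmuls %[a], %[b]"    "\n\t"
        "movw  %A[t], r0"     "\n\t"
        "adiw  %A[t], 63"     "\n\t"
        "mov   %[c], %B[t]"   "\n\t"     /* %B selects the upper register */
        "clr   __zero_reg__"
        : [c] "=r" (c), [t] "=w" (tmp)   /* "w": pairs usable with ADIW */
        : [a] "a" (a), [b] "a" (b)
    );
    return c;
#else
    /* Host fallback: same arithmetic in portable C. */
    uint16_t p = (uint16_t)(2 * ((int32_t)a * (int32_t)b) + 63);
    uint8_t  h = (uint8_t)(p >> 8);
    return (h < 128) ? (int8_t)h : (int8_t)((int)h - 256);
#endif
}
```

Because the scratch pair is declared as an output, no clobber list is needed for it.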


Last Edited: Mon. Oct 7, 2013 - 07:08 PM

indianajones11 wrote:
Exactly why the # to use for rounding is 128 since the LSB of the result spans the full EIGHT bits.

That's what I was sayin'. And that also allows us to get rid of the 16-bit addition neatly:

static inline int8_t fmuls8(int8_t _a, int8_t _b)
{
    int8_t _c;

    asm volatile
    (
        "fmuls %[a], %[b]"    "\n\t"
        "sbrc  r0, 7"         "\n\t"
        "inc   r1"            "\n\t"
        "mov   %[c], r1"      "\n\t"
        "clr   __zero_reg__"  "\n\t"

        : [c] "=r" (_c)
        : [a] "a" (_a), [b] "a" (_b)
    );

    return _c;
}

I'm not sure about the corner cases, but it's easy to run it through all possible cases by brute force: with two 8-bit inputs there are only 2^16 of them, so even the AVR should handle it fast enough.

JW


Thanks JW!

Your solution improves the result; this is something I can use!

However, there are a few cases where there is still an error. I looped over the entire range and counted 1792 errors (compared to 63232 without rounding). For example, fmuls(1,64):

FMULS(1, 64) = 0x01 fmuls 0x40 = 0x0080
bit 7 set-> result is 1

FMULS(1,-64) = 0x01 fmuls 0xC0 = 0xFF80
bit 7 set-> result is 0

I guess that this is as good as it gets without adding too much complexity.
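
The two error counts can be reproduced on a host with a small brute-force model. This is a sketch under assumptions: the function names are hypothetical, the models mimic the truncating macro and the sbrc/inc sequence in portable C, and the loop covers -127..127 (the thread does not state the range; -128 is left out because it has no 8-bit negation).

```c
#include <stdint.h>

/* Reinterpret a byte as a signed 8-bit value without relying on narrowing. */
static int8_t to_s8(uint8_t u)
{
    return (u < 128) ? (int8_t)u : (int8_t)((int)u - 256);
}

/* Truncating version: high byte of the FMULS result. */
static int8_t fmuls8_trunc(int8_t a, int8_t b)
{
    uint16_t p = (uint16_t)(2 * ((int32_t)a * (int32_t)b));
    return to_s8((uint8_t)(p >> 8));
}

/* Model of the sbrc/inc sequence: bump the high byte when bit 7 of the
 * low byte is set (the increment wraps at 8 bits, like INC does). */
static int8_t fmuls8_round(int8_t a, int8_t b)
{
    uint16_t p  = (uint16_t)(2 * ((int32_t)a * (int32_t)b));
    uint8_t  hi = (uint8_t)(p >> 8);
    if (p & 0x0080u)
        hi = (uint8_t)(hi + 1);
    return to_s8(hi);
}

typedef int8_t (*mul_fn)(int8_t, int8_t);

/* Counts pairs in -127..127 for which f(a, b) != -f(-a, b). */
static long count_asymmetric(mul_fn f)
{
    long count = 0;
    for (int a = -127; a <= 127; a++)
        for (int b = -127; b <= 127; b++)
            if (f((int8_t)a, (int8_t)b) != -f((int8_t)-a, (int8_t)b))
                count++;
    return count;
}
```

With that range, `count_asymmetric(fmuls8_trunc)` gives 63232 and `count_asymmetric(fmuls8_round)` gives 1792. The remaining cases are the exact ties, where the low byte of the 16-bit result is 0x80.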


There is no need for inline assembler. You can use __builtin_avr_fmuls which is available in all supported versions of avr-gcc, cf. the documentation.

These built-ins are also available for devices without a MUL unit, where they are implemented in software as a call to libgcc.
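
A minimal sketch of the built-in route, assuming the documented signature `int __builtin_avr_fmuls (char, char)`. The wrapper name `fmuls8_builtin` is hypothetical, and the `#else` branch is a host fallback added only so the wrapper can be compiled off-target.

```c
#include <stdint.h>

/* Hypothetical wrapper: Q1.7 product using the avr-gcc built-in. */
static inline int8_t fmuls8_builtin(int8_t a, int8_t b)
{
#if defined(__AVR__)
    /* The built-in returns the 16-bit FMULS result as an int. */
    int p = __builtin_avr_fmuls(a, b);
    return (int8_t)(p >> 8);            /* keep the high byte */
#else
    /* Host fallback: same arithmetic in portable C. */
    uint16_t p = (uint16_t)(2 * ((int32_t)a * (int32_t)b));
    uint8_t  h = (uint8_t)(p >> 8);
    return (h < 128) ? (int8_t)h : (int8_t)((int)h - 256);
#endif
}
```

For example, `fmuls8_builtin(64, 64)` is 0.5 * 0.5 in Q1.7 and yields 32, i.e. 0.25.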

avrfreaks does not support Opera. Profile inactive.


Quote:
I guess that this is as good as it gets without adding too much complexity.
There's nothing you or anybody else can do to get better results when you only have 7 bits to represent numbers, including the final result (of course, something could be done to deal with overflow/underflow). Either the app can live with the resulting error, or you step up to Q1.15 or an FP processor.

A chain is ONLY as strong as its weakest link.



Quote:
I counted 1792 errors

You have to decide which way you want to round!
round up, or round away from zero, or !
And make the program do what you want!
JW's code looks OK, but for what you seem to want, you need to look at the sign of the result as well.
(I think I have not had any coffee yet )
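
One reading of "look at the sign of the result" is to round half away from zero. The sketch below shows only that option in portable C, not an AVR implementation; `fmuls8_symmetric` and `count_asymmetric_symmetric` are hypothetical names, and the inputs are assumed to lie in -127..127 so that neither the negation nor the result overflows.

```c
#include <stdint.h>

/* Round the Q1.15 product to Q1.7, with ties going away from zero.
 * Assumes both inputs are in -127..127. */
static int8_t fmuls8_symmetric(int8_t a, int8_t b)
{
    int32_t p   = 2 * ((int32_t)a * (int32_t)b);  /* Q1.15 product */
    int32_t mag = (p < 0) ? -p : p;               /* work on the magnitude */
    int32_t q   = (mag + 128) / 256;              /* nearest, ties up */
    return (int8_t)((p < 0) ? -q : q);            /* restore the sign */
}

/* Counts pairs in -127..127 for which f(a, b) != -f(-a, b). */
static long count_asymmetric_symmetric(void)
{
    long count = 0;
    for (int a = -127; a <= 127; a++)
        for (int b = -127; b <= 127; b++)
            if (fmuls8_symmetric((int8_t)a, (int8_t)b) !=
                -fmuls8_symmetric((int8_t)-a, (int8_t)b))
                count++;
    return count;
}
```

Rounding the magnitude and then reapplying the sign makes a*(-b) equal to -(a*b) by construction, at the cost of a sign test on the AVR.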