Forum Menu




 


Log in Problems?
New User? Sign Up!
AVR Freaks Forum Index

Post new topic   Reply to topic
View previous topic Printable version Log in to check your private messages View next topic
Author Message
SprinterSB
PostPosted: Aug 02, 2012 - 09:25 PM
Posting Freak


Joined: Dec 21, 2006
Posts: 1483
Location: Saar-Lor-Lux

joeymorin wrote:
SprinterSB wrote:
The register foorprint can be made a bit smaller and some instructions in the prologue/epiloge be saved (assuming the C ABI).
Also no registers are ever used that might be fixed (R2...R7).
Impressive!
    Cycle count: 102
    Word count: 77
Please notice that this code is completely untested and might contain typos or thinkos. It just passed my brain v0.9beta simulator.

The code served to estimate wether the code size is small enough to make it a reasonable extention for avr-gcc. Because the code size is big — 150 bytes compared to the 200 bytes of the vanilla 64-bit multiply — I don't think it's a good idea to add it to (lib)gcc: Without that widening multiply, code can share the plain __muldi3 and the widening mul just costs two zero-extends.

Maybe the code size can be further reduced at the expense of some additional ticks: Counting in words, a 64 = 32×32 widening multiplication can be built of four 32 = 16×16 widening multiplies, whose code is already contained in libgcc and could be shared.

Also notice that there is already a multiply-addumulate function in libgcc that only consumes 18 bytes and can be reused to perform
Code:
uint64_t __muldi3_6 (uint64_t c, uint16_t a, uint16_t b)
{
    return c + 0x10000ULL * ((uint32_t) a * b);
}


Writing the two factors R and S in base B = 2^16, we have
    P = r·s = (r0 + r1·B) · (s0 + s1·B)
Using the distributive law this results in 4 widening multiplications and some additions:
    P = r0·s0 + r0·s1·B + r1·s0·B + r1·s1·B²
Following an idea of Karatsuba, we can write this as
    b = (r0 + r1) · (s0 + s1)
    a = r0 · s0
    c = r1 · s1
    P = a + (b-a-c)·B + c·B²
leaving us with only 3 multiplications (multiplications with B effectively cost nothing). Thus, we now have 3 MUL + 6 ADD instead of 4 MUL + 3 ADD.
 
 View user's profile Send private message Visit poster's website 
Reply with quote Back to top
skeeve
PostPosted: Aug 03, 2012 - 12:04 AM
Raving lunatic


Joined: Oct 29, 2006
Posts: 2648


SprinterSB wrote:
Writing the two factors R and S in base B = 2^16, we have
    P = r·s = (r0 + r1·B) · (s0 + s1·B)
Using the distributive law this results in 4 widening multiplications and some additions:
    P = r0·s0 + r0·s1·B + r1·s0·B + r1·s1·B²
Following an idea of Karatsuba, we can write this as
    b = (r0 + r1) · (s0 + s1)
    a = r0 · s0
    c = r1 · s1
    P = a + (b-a-c)·B + c·B²
leaving us with only 3 multiplications (multiplications with B effectively cost nothing). Thus, we now have 3 MUL + 6 ADD instead of 4 MUL + 3 ADD.
A problem with this is that the first two sums can overflow 16 bits.
The matter can be dealt with.
I expect that whether the Karatsuba formula reduces the
cycle count depends on the details of how it is dealt with.

_________________
Michael Hennebry
Iluvatar is the better part of Valar.
 
 View user's profile Send private message  
Reply with quote Back to top
nick.parker
PostPosted: Aug 03, 2012 - 12:17 AM
Wannabe


Joined: Jul 05, 2002
Posts: 61
Location: Perth, Australia

Shocked SprinterSB, thanks for your excellent input.
 
 View user's profile Send private message Visit poster's website 
Reply with quote Back to top
joeymorin
PostPosted: Aug 03, 2012 - 01:14 AM
Hangaround


Joined: Jul 17, 2012
Posts: 428


SprinterSB wrote:
Please notice that this code is completely untested and might contain typos or thinkos. It just passed my brain v0.9beta simulator.

Below please find some code to square a single uint32_t into a uint64_t result. It is basically a rewrite of the code provided by nick.parker and later optimised by SprinterSB. I have not done exhaustive testing, but the testing I have done suggests that it is sound, and I believe it to be so. Neverthelss, use at your own risk Wink

I realise this is a pretty specific piece of code, but it will serve my purposes. Currently, under my build environment, GCC 64-bit mult costs 870 ticks (and 502 bytes, not counting mulsi3). The below code costs 79 cycles and 128 bytes. Since the time-critical parts of my application require only 64bit = 32bit^2, this is an excellent solution for me.

Of course, I invite comments and criticism.

Many thanks to nick.parker, SprinterSB, et al.

Cheers,
jj

Code:
#define __zero_reg__ r1
#define __tmp_reg__  r0

.macro DEFUN name
.global \name
.func \name
\name:
.endm

.macro ENDF name
.size \name, .-\name
.endfunc
.endm


.text

;; A[0..3]: In: Operand;
;; Out: Result.High (R25:R22)
#define A0  22
#define A1  A0+1
#define A2  A0+2
#define A3  A0+3

;; P[0..7]: Product
#define P0  18
#define P1  P0+1
#define P2  20
#define P3  P2+1
#define P4  26
#define P5  P4+1
#define P6  30
#define P7  P6+1

;; T[0..7]: Result
;; note T[0..3] are never referenced, as they are aliases for P[0..3]
#define T0  18
#define T1  T0+1
#define T2  20
#define T3  T2+1
#define T4  22
#define T5  T4+1
#define T6  24
#define T7  T6+1

#define zero    17

;; R25:R18 = (uint64_t) R25:R22 ^ 2
;; Ordinary ABI-Function
;; 79 cycles, 64 words (128 bytes)

DEFUN __umulsidi3
    ;; prologue
    push    zero

    ;; need a local zero register
    clr     zero
   
    ;; zero the 6 MSB of the product
    clr     P2
    clr     P3
    movw    P4, P2
    movw    P6, P2

    ;; 0 bytes offset
    mul A0,A0  $  movw P0,r0
    ;; P0 is now complete

    ;; 1 byte offset
    ;; R1 is <= 0xfe = 255*255/256  -> first ADC don't set Carry
    mul A0,A1  $  add P1,r0  $  adc P2,r1
                  add P1,r0  $  adc P2,r1  $ adc P3,zero
    ;; P1, P0 are now complete

    ;; 2 bytes offset
    ;; R1 is <= 0xfe and P3 <= 1  -> first ADC don't set Carry
    mul A0,A2  $  add P2,r0  $  adc P3,r1
               $  add P2,r0  $  adc P3,r1  $ adc P4,zero
    mul A1,A1  $  add P2,r0  $  adc P3,r1  $ adc P4,zero
    ;; P2, P1, P0 are now complete

    ;; 3 bytes offset
    mul A0,A3  $  add P3,r0  $  adc P4,r1  $  adc P5,zero
               $  add P3,r0  $  adc P4,r1  $  adc P5,zero
    mul A1,A2  $  add P3,r0  $  adc P4,r1  $  adc P5,zero
               $  add P3,r0  $  adc P4,r1  $  adc P5,zero
    ;; A0 is done with
    ;; P3, P2, P1, P0 are now complete

    ;; 4 bytes offset
    mul A1,A3  $  add P4,r0  $  adc P5,r1  $  adc P6,zero
               $  add P4,r0  $  adc P5,r1  $  adc P6,zero
    ;; A1, A0 are done with
    mul A2,A2  $  add P4,r0  $  adc P5,r1  $  adc P6,zero
    ;; P4, P3, P2, P1, P0 are now complete

    ;; 5 bytes offset
    mul A2,A3  $  add P5,r0  $  adc P6,r1  $  adc P7,zero
               $  add P5,r0  $  adc P6,r1  $  adc P7,zero
    ;; A2, A1, A0 are done with
    ;; P5, P4, P3, P2, P1, P0 are now complete

    ;; 6 bytes offset
    mul A3,A3  $  add P6,r0  $  adc P7,r1

    ;; A3, A2, A1, A0 are done with
    ;; P7, P6, P5, P4, P3, P2, P1, P0 are now complete

    ;; Move Result P[7..0] to T[7..0] (R25:R18) according to ABI
    ;; P[3..0] already in place
    movw    T4, P4
    movw    T6, P6

    ;; Epilogue
    pop     zero
    clr __zero_reg__
    ret
ENDF __umulsidi3
 
 View user's profile Send private message  
Reply with quote Back to top
nick.parker
PostPosted: Aug 03, 2012 - 01:38 AM
Wannabe


Joined: Jul 05, 2002
Posts: 61
Location: Perth, Australia

gregd99 wrote:
out of interest.... what sort of application are you working on that requires the space and time efficient multiplication.

My hobby, engine management. I wanted faster division and to give a new lease of life to my software by clawing back CPU time, and speeding up some calculation rates. There is lots of table look-up type code (various index calcs require division), and I like to use 'real' physics too.

I traded off 16Kb of flash (4096 x 4B) for speed. The table of reciprocals is scaled by 2^32 to implement division quickly in my application. I've tried other 'pseudo floating' point ideas with a table of reciprocals from say 1/2 to 1, but listing below is faster for my app where i typically divide by less than 4096.

MulU4U4SH32RND is the 32x32 mutliply with an add of 0x00008000 and then shift (ie rounding). Ideally I should incorporate the ">>s" into the mult and also round this too.

Cheers, Nick

This code is way faster (2-3x) than GCCs 'proper division'. I save 100's of cycles per division, and there are many division per 100Hz control loop in my code.

Code:

uint32_t divU4U2(uint32_t n, uint16_t d)
{
  uint8_t s=0;
  uint32_t temp;

  if (d==0) { return MAX_UINT4; }
  if (d==1) { return n; }

  // 'normalise' denominator
  while (d > 4096) { d >>= 1; s++; }

  temp = pgm_read_dword(&reciprocal[d]);  // 32-bit reciprocal
  temp = MulU4U4SH32RND(n, temp) >> s;
  return(temp);
}


if you sacrifice some more precision its faster still to use mulu3u3 and top 24 bits from the table. A 16-bit reciprocal table is too imprecise to be useful for 'larger' denominators.
 
 View user's profile Send private message Visit poster's website 
Reply with quote Back to top
gregd99
PostPosted: Aug 03, 2012 - 04:01 AM
Hangaround


Joined: Oct 10, 2011
Posts: 238
Location: Sydney, Australia

Thanks for the info Nick. That sounds interesting.

My hobby is model railway and I use atmegas in that. the consequence of me getting something wrong is probably less spectacular than in engine management Very Happy

you have probably long since taken this action but.... is there an option for a faster processor in the same footprint that can help with time constraints?

_________________
regards
Greg
 
 View user's profile Send private message  
Reply with quote Back to top
nick.parker
PostPosted: Aug 03, 2012 - 04:14 AM
Wannabe


Joined: Jul 05, 2002
Posts: 61
Location: Perth, Australia

No problems Greg.

I'm still on my first engine Wink
If an AVR cant do it, what are you doing?
I'd probably go Cortex M4. The Cortex M4 kind of invalidates all the stuff discussed in this thread. No fun problems to solve, anymore now you're dealing with DMA issues and more complex peripherals. The M4 is more capable than the Texas DSPs I've used at work - a sign of the times.

For the STMicro STM32F4:
-168MHz clock
-Native floating point
-32 x 32 = 64 in one cycle
-32-bit Division in 2-12 cycles (wow)

AVR's are efficient and highly capable in the right hands. And frankly I enjoy trying to squeeze more out of the devices and learning GCC/AvrStudio etc.

Nick.
 
 View user's profile Send private message Visit poster's website 
Reply with quote Back to top
Display posts from previous:     
Jump to:  
All times are GMT + 1 Hour
Post new topic   Reply to topic
View previous topic Printable version Log in to check your private messages View next topic
Powered by PNphpBB2 © 2003-2006 The PNphpBB Group
Credits