I took one look at the original program and was heartbroken.
I figured that if that's what passed as good Assembly Code, then this
old dinosaur was ready for the boneyard!

It honestly took me an entire day going through it line by line, because this OLD F-RT
did not want to give up the farm just yet.

I'm glad to see that there are others not following the same confusing coding conventions.
Not that there is anything wrong with them; I'm just not used to seeing such bizarre-looking ASM code. Bizarre, that is, to an old-timer.

I found another optimization for your version Steve.

;ORIGINAL ROUTINE
SUBANYWAY:
SUB RC,X2
ROL ANS
ROL ANSH
EOR ANS,ONE ;AVR mnemonic for XOR
DEC N
BRNE DIVIDE
;
;ALTERNATE:
SUBANYWAY:
SUB RC,X2
LSL ANS ;CHANGED!
ROL ANSH
DEC N
BRNE DIVIDE

In the Original you were ROLing in a one and inverting it.
In the alternate version we LSL in a zero.

This should save one instruction inside the loop (when it takes this route),
for an estimated 8 clocks, and since we no longer need the ONE/X1 register, we
save by NOT initializing it, which saves one more instruction outside the loop.
That's an average savings of 9 clocks and two program statements, LDI X1,1 and EOR ANS,X1
(4 bytes?)
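The equivalence is easy to check off-chip. Here's a quick Python model (my own helper names, not AVR code) of the two instruction sequences, on the path where the carry going into the ROL is known to be 1:

```python
# Python model (not AVR code) of the two sequences, assuming the carry
# into ROL is 1 on this path.
def rol_then_xor(ans, carry):
    """ROL ANS (shift carry into bit 0), then EOR ANS,ONE to flip bit 0."""
    return (((ans << 1) | carry) & 0xFF) ^ 0x01

def lsl(ans):
    """LSL ANS: shift a 0 into bit 0 directly."""
    return (ans << 1) & 0xFF

# With carry = 1 the XOR flips bit 0 back to 0, so LSL gives the same result:
assert all(rol_then_xor(a, 1) == lsl(a) for a in range(256))
```

If the incoming carry could ever be 0 on this path, the two sequences would differ, so the saving depends on the SUBANYWAY path always shifting in a one.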

There's a speedup that can apply to both of our functions: share storage between the Dividend and the low-word of the Quotient. That way, the ROLs can do dual-duty.

Here's my new division loop:

; *** do {
ip_DivLoop:
; Shift the Quotient AND roll the Dividend bit into the LSB
rol QuotientB0 ;On the first loop, the Dividend bit is garbage
rol QuotientB1
rol QuotientB2 ;High bit is shifted into carry
brcs ip_BigQuotient ;if C=1, Quotient is guaranteed to be larger than Divisor
;Determine if Quotient is smaller than Divisor
cp QuotientB2, DivisorB2
;if( QuotientB2 < DivisorB2 ) then don't do subtraction
brcs ip_DontSub
ip_BigQuotient:
sub QuotientB2, DivisorB2 ;Carry is unknown after this
clc ;Ensure that Carry=0 after a successful subtraction
ip_DontSub: ;if this branch _was_ taken, carry is set
;The Carry now holds the calculated Dividend Bit
; *** } while (--DivCount)
dec DivCount
brne ip_DivLoop
; Roll the final Dividend bit in, and shift the first garbage bit out.
rol QuotientB0
rol QuotientB1
; QuotientB0/B1 now holds the DividendB0/B1

This saves 30 cycles and 2 regs without changing the program size.
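For readers following along, here's a small Python model (not the AVR code itself; the function name and widths are just for illustration) of what the shared-storage loop computes: one 24-bit value holds the running remainder at the top while complemented quotient bits accumulate at the bottom, so each ROL chain serves both purposes.

```python
# Python model (not AVR code) of the shared-storage divide: the remainder
# occupies the top of 'value' while complemented quotient bits
# (1 = "no subtract", 0 = "subtracted") accumulate at the bottom.
def divide24by8(numerator, divisor, steps=16):
    assert 0 < divisor < 256 and numerator < (divisor << steps)
    value = numerator
    for _ in range(steps):
        value <<= 1                      # the ROL QuotientB0/B1/B2 chain
        if (value >> steps) >= divisor:  # carry-out set, or high byte >= Divisor
            value -= divisor << steps    # subtract; quotient bit stays 0
        else:
            value |= 1                   # no subtract; roll a 1 in
    quotient = ~value & ((1 << steps) - 1)   # un-complement the low word
    remainder = value >> steps
    return quotient, remainder

# e.g. divide24by8(4210688, 129) returns (32640, 128)
```

The `(value >> steps) >= divisor` test is the model's version of the BRCS-then-CP pair: a carry out of the top byte makes the 9-bit value automatically larger than any 8-bit divisor.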

Wow, way to go Brian! That's an excellent algorithm!
Great work!

I've even found a way to optimize one step further.

DIVLUPE: ROL DXY
ROL DXYH
ROL DXY2
CP DXY2,DX2
BRCS NOSUB
DOSUB: SUB DXY2,DX2
CLC
NOSUB: DEC N
BRNE DIVLUPE
ROL DXY ;GRAB FINAL [c]
ROL DXYH

I skip the carry branch after the ROLs and go straight ahead to the compare.
This eliminates another program step.

Here is how I handle the Negative Slope at the end of the Division.
Rather than convert to a negative and then ADD, I just go right to a SUBtraction,
saving the conversion process.

BRTS SLPNEG ;CHECK NEG-SLOPE FLAG
ADD Y1,DXY
ADC Y1H,DXYH
RET
SLPNEG: SUB Y1,DXY
SBC Y1H,DXYH
RET

Dan, without the BRCS DOSUB, I think your routine can fail when (x2-x1) > 128.

DIVLUPE: ROL DXY
ROL DXYH
ROL DXY2
; BRCS DOSUB ;Force a Sub, if a '1' was shifted out of quotient
CP DXY2,DX2
BRCS NOSUB
DOSUB: SUB DXY2,DX2
CLC
NOSUB: DEC N
BRNE DIVLUPE
ROL DXY ;GRAB FINAL [c]
ROL DXYH

Here's an example in binary:

Given:
Quotient = 01000000 01000000 00000000 (dec 4210688)
Divisor = 10000001 (dec 129)

Consider the following division steps:

ITERATION 1
Step 1: LSL Quotient
Q=10000000 10000000 00000000, C=0
Step 2: Is the Quotient high byte [10000000] < Divisor [10000001]?
Yes, skip the subtraction. Dividend LSB = 0.
Step 3: Subtraction skipped.
Step 4: ROL in the Dividend LSB.

ITERATION 2
Step 1: LSL Quotient
Q=00000001 00000000 00000000, C=1
Step 2: Is the Quotient high byte [00000001] < Divisor [10000001]?
Yes, skip the subtraction. Dividend LSB = 0.
Step 3: Subtraction skipped.
Step 4: ROL in the Dividend LSB.

===============
This fails because the Divisor needs to be compared against the combined 9-bit value of the carry and the Quotient high byte. So Step 2 of the second iteration should have been:

Step 2: Is [1 00000001] < Divisor [10000001]?
No, do the subtraction. Dividend LSB = 1.

Does that make sense?

Could be(?)
I only tested it on a limited amount of data and the results came back good.
You predict it will fail when Delta-X exceeds 128?
Let me try it...

Wow, you are right, it failed miserably!!!
(Kids don't attempt this at home!)

I guess we're trying to squeeze blood from a stone at this point.
I have another idea that might cut execution time quite a bit...

Let me see what I can do and throw another bug-ridden example at you.

Quote:

In the Original you were ROLing in a one and inverting it. In the alternate version we LSL in a zero.

I should have caught that one! I think it was pretty late when I wrote it.

Your other optimization of subtraction instead of negation is great, but I can do one better. My original optimization with rounding and negation had to do with the fact that the processing after the divide involves a complement, adding one for rounding, and negation. But negation is complement and adding one, so some things canceled. By adding your subtraction idea, things simplify even more. Things cancel out if you do the opposite of what you expect. You add 1 if you don't round, and subtract if the answer is positive, and there is no complement involved at all!

Quote:

Having several people refine the same code can produce many insights.

Yes, we have been doing quite well! I have one more that I am working on. I had mentioned earlier that we could pre-shift if we knew the max difference of the data, reducing the number of times through the loop. But the data itself can tell us if we can pre-shift. I now run another loop before the multiply to shift the y difference until the high bit is one. Each shift reduces the number of times through the divide loop by one. At the moment the shift loop is 2 cycles shorter than the divide loop, which makes it worthwhile, especially with smaller y differences. I'm still trying to see if I can refine it though. I'll try and post the code tomorrow, once I've worked out the kinks.
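The payoff can be sanity-checked in Python. Steve's routine shifts yDif before the multiply and pre-loads the skipped quotient bits; the sketch below (hypothetical input numbers, not AVR code) just demonstrates the underlying fact that a shift-subtract divide only needs as many passes as the quotient has bits:

```python
# Plain restoring divide, valid when the quotient fits in `steps` bits
# (i.e. numerator < divisor << steps).
def divide(numerator, divisor, steps):
    value = numerator
    for _ in range(steps):
        value <<= 1
        if (value >> steps) >= divisor:
            value -= divisor << steps
            value |= 1                  # true (uncomplemented) quotient bit
    return value & ((1 << steps) - 1)

y_dif, x_dif, divisor = 300, 70, 100    # hypothetical interp inputs, x_dif < divisor
k = 16 - y_dif.bit_length()             # leading zeros = shifts the normalizer would do
product = y_dif * x_dif
# 16 - k passes give the same quotient as the full 16:
assert divide(product, divisor, 16) == divide(product, divisor, 16 - k) == product // divisor
```

The shorter pass count is legal because x_dif < divisor guarantees the quotient is smaller than y_dif, so it fits in 16 - k bits whenever y_dif does.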

I uploaded a quick rough draft of my idea for a Division Routine with only 8 iterations, which theoretically could be much faster than the "Compare and Subtract" routine we are working on now.

I stuck it in a new thread because I thought it might get confusing talking about two totally different routines in the same thread.

Any help would be greatly appreciated. It still fails sometimes, and when it doesn't, the accuracy sometimes leaves something to be desired. Thanks for any help, this is fun!

Quote:

I have one more that I am working on. I had mentioned earlier that we could pre-shift if we knew the max difference of the data, reducing the number of times through the loop. But the data itself can tell us if we can pre-shift. I now run another loop before the multiply to shift the y difference until the high bit is one. Each shift reduces the number of times through the divide loop by one.

Hey, that's a fantastic idea! You're making the division routine "scalable": if size warrants it, it becomes a 10x8 Divide; later it might be just a 6x8 Divide, etc.

If it's not too much bother, could you walk me through the logic of your rounding routine?
I've got a print-out of your code right here and I still don't get it.
I've been up all night; perhaps it's just fatigue.

If absolutely necessary, could you separate the instructions over a period of interrupt calls to get the same job done periodically while other events are allowed to happen?

In other words...what's more important, this calculation or the other events? Or are they equally important?

You guys have been working real hard at that code, which has improved amazingly over time. However, if you can put off needing the result of the calculation for some time, you should be able to use a timer to spread out your calculation over time, thus buying your other instructions the time they need to run.

This is kind of a shot in the dark. It would only work if you don't need to calculate the numbers immediately :-/.

Quote:

However, if you can put off needing the result of the calculation for some time, you should be able to use a timer to spread out your calculation over time

We don't really need it for anything. We started out just trying to help the OP come up with something. Since then we've just been playing with it to see what we could do with it. Personally I just like trying to squeeze out every last clock cycle from a routine, even if there is really no reason to do so.

Here is my latest version. It still assumes that y2 - y1 is less than 32768 (including in the normalization routine that I added), but this could certainly be changed if needed. The normalization saves 2 cycles for every pre-shift that it does. You can save 3 more clock cycles per shift if you don't shift 1's into the dividend but instead set them with a routine after the shift that gets the proper values from a look-up table. The look-up costs extra cycles, so you wouldn't see savings until the 3rd or 4th shift. But for small y differences the savings would be significant. It would also cost 32 bytes of memory for the table, plus you would need to use the z register, which means saving it somewhere and restoring it.

Interp:
clr RA
clr RB
clt ;clear neg flag
;find (x - x1)
sub x0, x1 ;subtract x1 from x
breq EarlyOut
;find (x2 - x1)
sub x2, x1 ;subtract x1 from x2
;find (y2 - y1)
sub y2L, y1L ;subtract y1 from y2
sbc y2H, y1H
breq EarlyOut
brpl NotNeg ;branch if we are not negative
;y dif is negative, so negate and set flag
neg y2L ;negate yDif
adc y2H, RA ;adjust high byte for carry
neg y2H
set ;set neg flag
NotNeg:
ldi x1, 16
Normalize:
dec x1
sec
rol RA
rol RB
lsl y2L
rol y2H
brpl Normalize
Multiply:
;find (x dif * y dif) - 24 bit result
mul y2L, x0 ;multiply low byte
or RA, r0 ;or so we don't lose the partial dividend
or RB, r1
mul y2H, x0 ;multiply high byte
clr x0
add RB, r0 ;add mid byte
adc RC, x0 ;adjust high byte for overflow
sec
Divide: ;24bit by 8bit divide, unsigned
rol RA ;roll the quotient and dividend left
DividePlus1:
rol RB
rol RC
brcs SubAnyways
cp RC, x2 ;compare divisor with high byte
brcs NotLess ;branch if we can't subtract
sub RC, x2 ;do the subtraction
NotLess:
dec x1 ;decrement the loop count
brne Divide ;branch if not done
rol RA ;catch the last bit
rol RB
;Since the output of the divide routine is the complement of
;what we want, and the fact that rounding is adding one and
;negation is complement plus one, we can simplify to adding
;1 if we don't round, and subtracting when we have a positive
;and adding when we have a negative
RoundAndNegate:
inc x2
lsr x2 ;halve the divisor
cpc x2, RC ;compare
brcs Round ;branch if we need to round
adiw RA, 1 ;add when we don't want to round
Round:
brtc SubY1
AddY1:
add RA, y1L ;add if we're negative
adc RB, y1H
ret
SubY1:
sub y1L, RA ;subtract if we're positive
sbc y1H, RB
movw RA, y1L
ret
;take care of 9 bit subtraction
SubAnyways:
sub RC, x2 ;do the subtraction
lsl RA ;put 0 into the answer
dec x1 ;decrement the loop count
brne DividePlus1 ;branch if not done
rjmp RoundAndNegate
;y difference is 0 or x - x1 is 0, so answer is y1
EarlyOut:
movw RA, y1L
ret

Total clocks - min 163, max 199
61 commands, 122 bytes

My round and negate algorithm goes something like this:

There are 4 possibilities for what we need to do for the answer after the divide:
1) no negate, no round, add to y1
2) no negate, round, add to y1
3) negate, no round, add to y1
4) negate, round, add to y1

Negating is complement and add 1, and rounding is add 1 (or sub 1 if the number is negative). But the answer after the divide is the complement of the real answer. So the actions we need to take for the cases are:
1) cmp, add to y1
2) cmp, add 1 (or sub 1, cmp), add to y1
3) cmp, cmp, add 1, add to y1
4) cmp, add 1, sub 1, add to y1

But for the last two, things cancel, leaving:
3) add 1, add to y1
4) add to y1

This should explain my first variation. Now if we take Dan's idea of subtracting, we can use it on the first two cases. That is, we can subtract the negative of the actual answer. To make these two negative, we need these actions (including the previous actions):
1) cmp, cmp, add 1, sub from y1
2) cmp, add 1, cmp, add 1, sub from y1

In the first, the complements cancel, in the second, it is two negations that also cancel. The final result is:
1) add 1, sub from y1
2) sub from y1
3) add 1, add to y1
4) add to y1

I hope that this explains everything.
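Since all four identities are pure two's-complement algebra, they can be verified mechanically. A quick Python check (not AVR code) with 16-bit wraparound, where c is the complemented divide output and m the true quotient magnitude:

```python
# Verify the four simplified cases with 16-bit wraparound arithmetic.
MASK = 0xFFFF
for m in (0, 1, 1234, 32767):
    c = ~m & MASK                  # the divide loop returns the complement of m
    for y1 in (0, 500, 40000):
        # positive slope: want y1 + m (no round) or y1 + m + 1 (round)
        assert (y1 - (c + 1)) & MASK == (y1 + m) & MASK   # 1) add 1, sub from y1
        assert (y1 - c) & MASK == (y1 + m + 1) & MASK     # 2) sub from y1
        # negative slope: want y1 - m (no round) or y1 - (m + 1) (round)
        assert (y1 + c + 1) & MASK == (y1 - m) & MASK     # 3) add 1, add to y1
        assert (y1 + c) & MASK == (y1 - m - 1) & MASK     # 4) add to y1
```

Each identity falls out of c = 65535 - m, e.g. y1 - c = y1 + m + 1 - 65536, which is y1 + m + 1 modulo 2^16.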

What do you think is the optimal way to implement rounding into the following?

LDI N,16 ;CALC (Y2-Y1)(IX-X1)/(X2-X1)
LUPIE: ROL Y2
ROL Y2H
ROL Y2HH
CP Y2HH,X2
BRCS SKPSUB
SUB Y2HH,X2
CLC
SKPSUB: DEC N
BRNE LUPIE
ROL Y2
ROL Y2H
COM Y2
COM Y2H
BRTS SLPNEG ;NEG-SLOPE?
ADD Y1,Y2 ;ANS=Y1+(Y2-Y1)(IX-X1)/(X2-X1)
ADC Y1H,Y2H
RJMP DONE
SLPNEG: SUB Y1,Y2 ;ANS=Y1-(Y2-Y1)(IX-X1)/(X2-X1)
SBC Y1H,Y2H
RJMP DONE

Ok, Steve. I think we're bordering on extreme navel-gazing here.

My arbitrary priorities for my original implementation were, in order: size, clarity, and speed.

But in the relentless pursuit of speed, I have some changes for your divide loop. 8)

Insert this code where your "Divide:" label was originally located:

rol RA ;roll the quotient and dividend left
rjmp Divide ;skip over subtraction code and enter Divide loop
;take care of 9 bit and 8 bit subtraction
SubAnyways:
sub RC, x2 ;do the subtraction
lsl RA ;Shift a 0 into dividend
dec x1 ;decrement the loop count
breq DivideDone
Divide: ;24bit by 8bit divide, unsigned
rol RB
rol RC
brcs SubAnyways
cp RC, x2 ;compare divisor with high byte
brcc SubAnyways ;branch if we can subtract
rol RA ;roll the quotient and dividend left
dec x1 ;decrement the loop count
brne Divide ;branch if not done
DivideDone:
rol RB

The goal of this loop is to have _exactly_ one followed-branch per loop.

It saves ~6.8 cycles per function call in my testing. (and one word!)

Quote:

It saves ~6.8 cycles per function call in my testing. (and one word!)

--Brian

Brian, how did you get these figures? Is there a way to get them from the Studio4 Software -- I've been counting clocks by hand and it's tedious and prone to errors.

Not bad Brian! There are three paths through the loop.
1) 8 bit subtraction, 10 clocks (which is the same as my loop)
2) no subtraction, 9 clocks
3) 9 bit subtraction, 8 clocks

If the divisor is less than 128, the 9-bit subtraction will never be hit. For >= 128, the higher the number, the more often the 9-bit subtraction will occur. So I would guess that it would save on average a little less than one clock per loop over my code. You do lose 2 or 3 clocks (the rjmp to start, and the possible branch out if the last loop is a subtraction). So the 6.8 sounds about right (it might even be a hair low).

On the rounding, the way I did it is probably optimal (except with one change that I just noticed from Brian's original code).

COM Y2
COM Y2H
LSR X2 ;divide divisor by 2
CPC Y2HH, X2 ;compare to remainder
BRCS TESTNEG ;branch if not rounding
ADIW Y2, 1 ;round
TESTNEG:
BRTS SLPNEG ;NEG-SLOPE?
ADD Y1,Y2 ;ANS=Y1+(Y2-Y1)(IX-X1)/(X2-X1)
ADC YH,Y2H
RJMP DONE
SLPNEG: SUB Y1,Y2 ;ANS=Y1-(Y2-Y1)(IX-X1)/(X2-X1)
SBC Y1H,Y2H
RJMP DONE

And if you skip the complement, reverse both branches to clear instead of set.

Okay, thanks so much Steve. Now I understand! I got lost in all this bit shuffling.

The idea is that if the Remainder is more than 1/2 of the original Divisor, then we Round Upward. That makes perfect sense to me now. Thanks for clearing that up for me.
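In plain arithmetic the rule is just round-half-up: bump the quotient when twice the remainder reaches the divisor. A small Python illustration (not AVR code; the exact half-way behavior in the assembly depends on how the INC/LSR pair forms the halved divisor and its carry):

```python
# Round-half-up division: round up when the remainder is at least
# half of the divisor (2*r >= d).
def div_round(n, d):
    q, r = divmod(n, d)
    return q + (1 if 2 * r >= d else 0)

print(div_round(10, 4))  # 3  (2.5 rounds up)
print(div_round(9, 4))   # 2  (2.25 rounds down)
print(div_round(7, 2))   # 4  (3.5 rounds up)
```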

Did you get a chance to look at the amateur Division-by-Inverse-Multiply routine I tried? The accuracy is terrible; perhaps your rounding modification will improve the results.

Since it only requires 8 iterations to invert the Dividend, it should save a lot of
time on larger divides like 24x8, 32x8 and beyond.

To Martin, the original poster (OP):

I have done four different versions of the Bilinear Interpolation program that you requested. Each Bilinear Interpolation requires 3 internal Linear Interpolations. The slowest version uses the standard "Compare and Subtract" method for long division. The fastest uses Binary Division, but the Data Points for your BIN map must be at multiples of 2. The others use two different versions of a "Division by Inversion" method that I have been experimenting with.

Here are the results:

METHOD: Size in Bytes @ Speed: Best Case - Worst Case

BINARY METHOD: 100 @ 3.8 uS to 10 uS
DIV by INV (a): 130 @ 3.8 uS to 23 uS
DIV by INV (b): 120 @ 3.8 uS to 24 uS
STANDARD: 110 @ 3.8 uS to 39 uS

The RPM and MAP readings are Binary Unsigned Inputs, and the Output is an Ignition Timing Rate in the range 0 to 65,535 (2 Bytes), based on a Bilinear Interpolation of the Four Nearest Data Points in your 12x12 BIN map.

I hope this helps your project along, as I am taking a lot of "heat" for asking too many questions. Although I have never programmed the AVRs before (my first one arrived about a week and a half ago), I think the AVR has an excellent "core" with not too many "quirks," and even though it was quite frustrating at times being under a time constraint, it was an excellent learning experience.

PS. Are you glad you didn't have to recode your entire project in C just for a few complex math routines?

This looks like the one! A reasonable amount of testing done and it
seems to work great :-) Same results in a real AVR, via my VB2005
simulator, and in Studio4.

As I've said before, the program really is great and I'm very IMPRESSED. Thanks
very much for your help.

Quick Q: If the table entries were only 8-bit, would there be a
reduction in code size/speed? I only ask because I *think* I need
BINTERP with both 16-bit and 8-bit contents - I can obviously use the
16-bit version with 8-bit contents, but I was just wondering....

You might want to publish this code in the AVRFreaks Academy section.

Thanks again for everything

You are very welcome. I am sorry it took so long to write, but trying to learn everything I needed to know about Assembly Programming for AVRs in a week was a bit frustrating. Still, I am extremely delighted that you are happy with the results.

If the Bilinear Interpolation (BINTERP) speed of 4 uS to 39 uS is not fast enough, I have some new ideas to speed it up even more. BTW - YES!!!! Switching to 8-bit table entries would drastically speed the program up!

I'm not sure what the AVR Academy is, perhaps I'll look into making BINTERP available there.

Glad to hear my BINTERP will be used in real-life control application. It was an excellent learning experience for me. Have a great day!

Just to reiterate, Dan's done a great job for me.

- Martin

Thank you very much for the recommendation!

Please, pass my name around, I'd like to find a "Paid Gig" doing this someday.

However, I don't imagine there's much call for Super-Fast, Super-Small AVR code is there?

BTW - I took the original Butterfly BootLoader (over 1500 words) and did an ASM version in under 255 words (1/6 the size) for practice.

It has been a wonderful learning experience.

MartinM57 wrote:

If you have the spare time, a BINTERP8 would be very nice!

- Martin

I have completed BINTERP8 (the 8-bit version of BINTERP), and from initial tests the best to slowest speeds were 4 uS to 20 uS - so about twice the speed for a worst-case scenario over the 16-Bit Version at 39 uS. Not bad for three internal linear interpolations, and I'm not excluding any of the overhead of loading all the registers, variables, stack, et al. from the above calculations. If it's possible for you to get me figures for just the routine's execution time, that would be great.

Hope you find BINTERP and BINTERP8 useful; please let me know where my progeny ends up. Thanks.

Great! Have a wonderful trip and when you get back please let me know.

One thing I'd still like to know is how important speed is in your application. Is there any point in trying to increase the speed of the Bilinear Interpolations?

BTW - Martin, if you can manage to get my program to fail or produce any errors, please let me know right away.

Dan, in your negation optimization, you can change JMP R248XT to BRTC R248XT. It doesn't save time, but it does save 2 bytes.

I think that it will fail only when the difference between Y1 and Y2 is greater than 32768.

Regards,

Steve A.

The Board helps those that help themselves.


Sorry, that should have been an RJMP. Would that still save two bytes?

I'm not that familiar with the opcode set yet.


BTW, thanks Steve for restoring my faith!


Thanks again!


Looks good Steve!

I see that you're treating Y,Y1,Y2, and the result as signed. Otherwise we're pretty much on the same page.

Although, I have to admit that I have not yet grokked your rounding algorithm. :)

--Brian


Thanks Dan!

Having several people refine the same code can produce many insights.

--Brian




In the "I/O View" ==> "Processor" section, there's a "Cycle Counter" item that's available while debugging.

--Brian


Wow, how Cool!! And a Stop-Watch too!!

Ya' know, I like to rib people about the software for fun, but the debugger within Studio4 really is excellent!


How do I stop the processor?

BREAK and SLEEP don't work, and I can't find a HALT instruction.


TopNot bad Brian! There are three paths through the loop.

1) 8 bit subtraction, 10 clocks (which is the same as my loop)

2) no subtraction, 9 clocks

3) 9 bit subtraction, 8 clocks

If the divisor is less that 128, the 9 bit subtraction will never be hit. For >= 128, the higher the number, the more often 9 bit subtraction will occur. So I would guess that it would save on average a little less than one clock per loop over my code. You do lose 2 or 3 clocks (the rjmp to start, and the possible branch out if the last loop is a subtraction). So the 6.8 sound about right (it might even be a hair low).

On the rounding, the way I did it is probably optimal (except with one change that I just noticed from Brian's original code).

And if you skip the complement, reverse both branches to clear instead of set.

Regards,

Steve A.

The Board helps those that help themselves.

- Log in or register to post comments

TopOkay, thanks so much Steve. Now I understand! I got lost in all this bit shuffling.

The idea is if the Remainder is more than 1/2 of original Divisor then we Round Upward, makes prefect sense to me now. Thanks for clearing that up for me.

Did you get a chance to look at my amateur attempt at a Division-by-Inverse-Multiply routine? The accuracy is terrible; perhaps your rounding modification will improve the results.

Since it only requires 8 iterations to compute the inverse, it should save a lot of time on larger divides like 24x8, 32x8 and beyond.
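The general shape of a Division-by-Inverse-Multiply is worth sketching (Python; this is the textbook fixed-point reciprocal idea, not the poster's actual routine, and the names are illustrative). The truncated reciprocal is exactly where the accuracy problems come from:

```python
def div_by_inverse(dividend, divisor, frac_bits=16):
    # One-time cost: a fixed-point reciprocal, here in 0.16 format.
    recip = (1 << frac_bits) // divisor      # truncated, so slightly low
    # Per-divide cost: one multiply and one shift instead of a loop.
    return (dividend * recip) >> frac_bits
```

Because `recip` is truncated, the product can land just under a multiple of 2^16 and the result comes out one low; for example 255/5 yields 50 instead of 51. Adding a rounding term before the shift is one standard fix.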


To Martin, the original poster (OP):

I have done four different versions of the Bilinear Interpolation program that you requested. Each Bilinear Interpolation requires three internal Linear Interpolations. The slowest version uses the standard "Compare and Divide" method for long division. The fastest uses Binary Division, but the Data Points for your BIN map must be at multiples of 2. The others use two different versions of a "Division by Inversion" method that I have been experimenting with.

Here are the results:

METHOD: Size in Bytes @ Speed (Best Case to Worst Case)

- BINARY METHOD: 100 @ 104, 3.8uS to 10uS
- DIV by INV (a): 130 @ 3.8uS to 23uS
- DIV by INV (b): 120 @ 3.8uS to 24uS
- STANDARD: 110 @ 3.8uS to 39uS

The RPM and MAP readings are Binary Unsigned inputs, and the Output is an Ignition Timing Rate in the range 0 to 65,535 (2 Bytes), based on a Bilinear Interpolation of the Four Nearest Data Points in your 12x12 BIN map.
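The "three internal Linear Interpolations" structure mentioned above works like this: interpolate along one axis twice (between the two pairs of neighboring map points), then once along the other axis between those two results. A minimal integer sketch assuming 8-bit fractions (names are mine for illustration, not taken from BINTERP):

```python
def lerp(a, b, frac):
    # frac is an 8-bit fraction: 0 means all a, 255 means almost all b.
    return a + (((b - a) * frac) >> 8)

def bilinear(p00, p10, p01, p11, fx, fy):
    # Two lerps along x for the lower and upper rows of the map cell,
    # then one lerp along y between those results: 3 lerps total.
    lower = lerp(p00, p10, fx)
    upper = lerp(p01, p11, fx)
    return lerp(lower, upper, fy)
```

The same three-lerp shape holds whether the fractions come from the low bits of the RPM/MAP readings or from a divide.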

I hope this helps your project along, as I am taking a lot of "heat" for asking too many questions. Although I have never programmed for the AVRs before (my first one arrived about a week-and-a-half ago), I think the AVR has an excellent "core" without too many "quirks." Even though it was quite frustrating at times being under a time constraint, it was an excellent learning experience.

PS. Are you glad you didn't have to recode your entire project in C just for a few complex math routines?

TTYL


MartinM57 wrote: You are very welcome. I am sorry it took so long to write, but trying to learn everything I needed to know about Assembly Programming for AVRs in a week was a bit frustrating. I am extremely delighted that you are happy with the results.

If the Bilinear Interpolation (BINTERP) speed of 4uS to 39uS is not fast enough, I have some new ideas to speed it up even more. BTW - YES!!!! Switching your table entries to 8 bits would drastically speed the program up!

I'm not sure what the AVR Academy is, perhaps I'll look into making BINTERP available there.

Glad to hear my BINTERP will be used in real-life control application. It was an excellent learning experience for me. Have a great day!


Thanks Dan.

Just to re-iterate, Dan's done a great job for me.

I'm hoping for a few great days as I'm off to Colorado tonight for two weeks skiing - but how will I get on without access to AVRFreaks???? 8)

If you have the spare time, a BINTERP8 would be very nice!

Martin


MartinM57 wrote: Thank you very much for the recommendation!

Please, pass my name around, I'd like to find a "Paid Gig" doing this someday.

However, I don't imagine there's much call for Super-Fast, Super-Small AVR code is there?

BTW - I took the original ButterFly BootLoader (over 1500 words) and did an ASM version in under 255 words (1/6 the size) for practice.

It has been a wonderful learning experience.

MartinM57 wrote: I have completed BINTERP8 (the 8 bit version of BINTERP), and from initial tests the best to slowest speeds were 4uS to 20uS - about twice the speed in the worst-case scenario over the 16-Bit version at 39uS. Not bad for three internal linear interpolations, and I'm not excluding the overhead of loading all the registers, variables, stack, et al. from the above figures. If it's possible for you to get me figures for just the routine's execution time, that would be great.

Hope you find BINTERP and BINTERP8 useful; please let me know where my progeny ends up. Thanks.


BINTERP8 received - no time to do the testing, got to pack the bags right now...

I'll be back in just over 2 weeks ... 8)


Great! Have a wonderful trip, and when you get back please let me know.

One thing I'd still like to know is how important speed is in your application. Is there any point in trying to increase the speed of the Bilinear Interpolations?

BTW - Martin, if you can manage to get my program to fail or produce any errors, please let me know right away.

TTYL


Does anyone else have a need for an ASM program to do Bilinear Interpolations?

