Rotate Memory Data in 90 degrees

Go To Last Post
18 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hello Folks,

Summer burning here in Florida. Lots of rain - and electrical storm.

 

Some application requires to rotate a block of data bits (mostly a 8 bytes, 64 bits), in 90 degrees.

 

One of them is, for example, using the Max7219 LED driver chip with 8 x 7segments LED display, with Common ANODE, instead of Common Catode, as it was designed for.

 

With common Catode, the Max7218 scans catode by catode (to zeroV), changing the row pins level, that drive the segments, easy.

 

With common Anode, the story changes completely, as the Max7219 always will scan the "Catode" pins to Ground, then you need to connect the Common Anode displays Segments to the Max7219 Catode pins.

 

This way, the Max will scan segments instead of catodes.  All the same segments of all displays will be selected, and then the Max need to send its "segment" rows that will be connected to the Displays Anode pins.

It works upside down.  The problem is, the binary combination to lit segments is completely transversed.

 

Max doesn't know that, but it thinks it selected display #1 common catode, in real it selected segment "A" for all displays. Now it is your turn to select which displays shall lit that segment, and it goes to the Max as its "segments", in real it is driving Anodes of the Displays.  Instead of scanning display by display, and each display will lit exactly its ON segments, with this transverse all displays that need to lit "A" segment, will lit at the same time.

 

In the past I did it, at extra cost of clock cycles, the AVR becomes almost hot red... ;)

 

Here is the trick:

Suppose you have this sequence of bytes:   0xF1, 0x39, 0x27, 0x0E, 0x7B, 0x16, 0x4B, 0x0F.

Let me print this side by side vertically, MSB on top, the first one at the left, column, is 0xF1 vertical.

 

1  0  0  0  0  0  0  0

1  0  0  0  1  0  1  0

1  1  1  0  1  0  0  0

1  1  0  0  1  1  0  0

0  1  0  1  1  0  1  1

0  0  1  1  0  1  0  1

0  0  1  1  1  1  1  1

1  1  1  0  1  0  1  1

 

How nice would be a uC with a special register that you pump 8 bytes, then read 8 bytes rotated in 90 degrees...

 

The trick here is to read the 8 bytes at 90 degrees, horizontally, resulting in:

 

0x80, 0x8A,0xE8,0xCC,0x5B,0x35,0x3F,0xEB

 

The fastest way to do that (according to what I did), is:

 

Suppose the input 8 bytes,  0xF1, 0x39, 0x27, 0x0E, 0x7B, 0x16, 0x4B, 0x0F, are in registers R1-R8, and the resulting transversed bytes will be at R11-R18.

 

    LSL   R1
    ROL   R11
    LSL   R1
    ROL   R12
    LSL   R1
    ROL   R13
    LSL   R1
    ROL   R14
    LSL   R1
    ROL   R15
    LSL   R1
    ROL   R16
    LSL   R1
    ROL   R17
    LSL   R1
    ROL   R18

The above sequence works on R1, it must be repeated more 7 times, for R2-R8.

It will be a total of 128 lines of code (256 bytes of Flash), 128 clock cycles, an AVR running at 8MHz, will take 16 microseconds to do it.

In some cases, it is to much.

 

The indirect addressing using X, Y and Z registers can access regular R1-R31 registers, but it takes more cycles, and unfortunately there is no ASL or ROL indirect addressing.

So, to make things shorter, one could do things like this:

 

     CLR    ZH
     LDI    ZL,1
     LDI    R18,1
LP1:
     LD     R0,Z
     INC    ZL
     LSL    R0
     ROL    R11
     LSL    R0
     ROL    R12
     LSL    R0
     ROL    R13
     LSL    R0
     ROL    R14
     LSL    R0
     ROL    R15
     LSL    R0
     ROL    R16
     LSL    R0
     ROL    R17
     LSL    R0
     ROL    R18
     BRCC   LP1
     RET

 

As trick above, R18 starts with 0x01 and after 8 loops of "ROL" and at the bottom of LP1, this "on" bit will fall into "Carry" bit, exiting the procedure by the BRCC before RET.

 

The above uses less flash bytes, at cost of extra clock cycles for the Rcall and Ret, as well the LD R0,Z that consumes 2 cycles, as well the control loop instructions.

 

Below, a very shorter (flash bytes) version would use two indirect addressing, XH:XL and ZH:ZL to address the registers, will be small, but the 64 loops on LP2 are clock cycles hungry.

LP0:
     CLR    ZH
     LDI    ZL,1
     CLR    XH
     LDI    XL,11
LP1:
     LD     R0,Z
LP2:
     LD     R10,X
     LSL    R0
     ROL    R10
     ST     X,R10
     INC    XL
     CPI    XL,19
     BRCS   LP2
     INC    ZL
     CPI    ZL,9
     BRCS   LP1
     RET

 

Somebody has some other ideas?  other than using external parallel/serial logic registers... :)

 

Another nasty procedure is to transpose bits in a byte... ABCD EFGH to become HGFE DCBA... uhggg!!  I hate bitbanging...

 

Wagner Lipnharski

Orlando Florida USA

Wagner Lipnharski
Orlando Florida USA

Last Edited: Wed. Aug 10, 2016 - 02:21 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hello,   I had the same situation with the MAX7219.  I considered the 16 register solution outlined first, but using SRAM instead of registers. 

In the end, I mounted the LED 8x8 modules at a 90 degree angle.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

You can do this non-destructively:

        bst     r1, 7
        bld     r11,7
        bst     r1, 6
        bld     r12,7
        bst     r1, 5
        bld     r13,7
        bst     r1, 4
        bld     r14,7
        bst     r1, 3
        bld     r15,7
        bst     r1, 2
        bld     r16,7
        bst     r1, 1
        bld     r17,7
        bst     r1, 0
        bld     r18,7

Etc.  No faster, no shorter, but preserves r1-r8, if that's needed.

 

By the way, there is no asl instruction.  I suppose you mean lsl.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Wed. Aug 10, 2016 - 01:38 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I like the idea to copy bits via T flag, non destructively.  Still 128 instructions.

Thanks for the alert for the ASL...  

I am becoming old, today I fought for 10 minutes with this 3 instructions below that didn't work.

It was necessary to open and read the assembler listing to believe what I already knew and use all the time, but for some reason could not ring a bell today.

   LDS    R17,FLAG5      ; Waiting for Zero
   TST    R17
   BRNE   PC-2

Somedays it is better stay in bed.

 

Oh yeah, yesterday I was showing a friend about the nice thing of using "konstants" in RAM.

When using low 16 registers you can not use immediate values, like:  

 

LDI R4,0x15

 

It is always possible to do

 

.
.
.
    .dseg
    .org sram_start

    N01  .byte 1
    N07  .byte 1
    NF4  .byte 1
.
.
.
    .cseg
    .org $0
.
.
.

start:
    ldi  temp,01
    sts  N01,temp
    ldi  temp,07
    sts  N07,temp
    ldi  temp,0xF4
    sts  NF4,temp
.
.
.
.
.
.
    LDS R4,N07

 

if you have SRAM variables with number's names.

Still the same byte count of 

 

LDI   R16,0x15

MOV  R4,R16

 

but the LDS save you of using a temp high register, at the cost of bytes on SRAM.

 

Wagner Lipnharski

Orlando Florida

Wagner Lipnharski
Orlando Florida USA

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Handy.  I'll put that into my bag of tricks.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Why use a register for the loop counter instead of Z?  You do for the nested loop version.

 

For that matter, why not use auto increment?

     clr    ZH          ; 1
     ldi    ZL,1        ; 1
lp1:
     ld     r0,Z+       ;       2
     lsl    r0          ;       1
     rol    r11         ;       1
     lsl    r0          ;       1
     rol    r12         ;       1
     lsl    r0          ;       1
     rol    r13         ;       1
     lsl    r0          ;       1
     rol    r14         ;       1
     lsl    r0          ;       1
     rol    r15         ;       1
     lsl    r0          ;       1
     rol    r16         ;       1
     lsl    r0          ;       1
     rol    r17         ;       1
     lsl    r0          ;       1
     cpi    ZL,9        ;       1
     brcs   lp1         ;       1/2
                        ;       = 20*8-1
                        ;       = 159
     ret                ; 4
                        ; = 165

Saves a register, two words, and nine cycles.

 

You can do the same to help the second version even more:

lp0:
     clr    ZH          ; 1
     ldi    ZL,1        ; 1
     clr    XH          ; 1
     ldi    XL,11       ; 1
lp1:
     ld     r0,Z+       ;    2
lp2:
     ld     r10,X       ;       2
     lsl    r0          ;       1
     rol    r10         ;       1
     st     X+,r10      ;       2
     cpi    XL,19       ;       1
     brcs   lp2         ;       1/2
                        ;       = 9*8-1
                        ;       = 71
     cpi    ZL,9        ;    1
     brcs   lp1         ;    1/2
                        ;    = 76*8-1
                        ;    = 607
     ret                ; 4
                        ; = 615 cycles

It's two words shorter, and 72 cycles faster.

 

I'd say the nested loop version is hardly worth the 7 words of flash you save.  165 cycles vs. 615 cycles.  No contest.

 

It's a bit slower than the 128 cycle flat loop-less approach, but it's also 107 words shorter.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Wed. Aug 10, 2016 - 05:19 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Another nasty procedure is to transpose bits in a byte... ABCD EFGH to become HGFE DCBA... uhggg!!  I hate bitbanging...

This is how __builtin_avr_insert_bits() does it:

        add     r24,    r24
        adc     r24,    r1
        mov     r0,     r24
        bst     r0,     1
        bld     r24,    7
        bst     r0,     2
        bld     r24,    6
        bst     r0,     3
        bld     r24,    5
        bst     r0,     5
        bld     r24,    3
        bst     r0,     6
        bld     r24,    2
        bst     r0,     7
        bld     r24,    1

Starts with input in r24.  Result ends up in r24.  AVR GCC uses r1 as an 'always zero' register.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I don't have time now to play with it, but have you checked if you can make 2 bit's at the time, so both the bit you shift in and out are used (Im thinking on something like a manual mul where the result build up "behind" you.)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

wagnerlip wrote:

Another nasty procedure is to transpose bits in a byte... ABCD EFGH to become HGFE DCBA... uhggg!!  I hate bitbanging...

 

Probably quicker to do this by brute force for 8-bit values but a neat way to do 32-bit values is...

 

</p>
<pre>
x = (x & 0x55555555) <<  1 | (x & 0xAAAAAAAA) >>  1; </pre><pre>
x = (x & 0x33333333) <<  2 | (x & 0xCCCCCCCC) >>  2; </pre><pre>
x = (x & 0x0F0F0F0F) <<  4 | (x & 0xF0F0F0F0) >>  4; </pre><pre>
x = (x & 0x00FF00FF) <<  8 | (x & 0xFF00FF00) >>  8; </pre><pre>
x = (x & 0x0000FFFF) << 16 | (x & 0xFFFF0000) >> 16;</pre><p>

 

#1 This forum helps those that help themselves

#2 All grounds are not created equal

#3 How have you proved that your chip is running at xxMHz?

#4 "If you think you need floating point to solve the problem then you don't understand the problem. If you really do need floating point then you have a problem you do not understand." - Heater's ex-boss

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

wagnerlip wrote:

Some application requires to rotate a block of data bits (mostly a 8 bytes, 64 bits), in 90 degrees.

 

</p>
<p> </p>
<p>A[] is the source</p>
<p>B[] is the destination</p>
<p>m is the number of columns in multiples of 8 (so 1 in this case)</p>
<p>n is the number of rows in multiples of 8 (again, 1 in this case)</p>
<p>void transpose8(unsigned char A[8], int m, int n, unsigned char B[8]) {<br />
   unsigned x, y, t;<br />
 <br />
   // Load the array and pack it into x and y.<br />
 <br />
   x = (A[0]<<24)   | (A[m]<<16)   | (A[2*m]<<8) | A[3*m];<br />
   y = (A[4*m]<<24) | (A[5*m]<<16) | (A[6*m]<<8) | A[7*m];<br />
 <br />
   t = (x ^ (x >> 7)) & 0x00AA00AA;  x = x ^ t ^ (t << 7);<br />
   t = (y ^ (y >> 7)) & 0x00AA00AA;  y = y ^ t ^ (t << 7);<br />
 <br />
   t = (x ^ (x >>14)) & 0x0000CCCC;  x = x ^ t ^ (t <<14);<br />
   t = (y ^ (y >>14)) & 0x0000CCCC;  y = y ^ t ^ (t <<14);<br />
 <br />
   t = (x & 0xF0F0F0F0) | ((y >> 4) & 0x0F0F0F0F);<br />
   y = ((x << 4) & 0xF0F0F0F0) | (y & 0x0F0F0F0F);<br />
   x = t;<br />
 <br />
   B[0]=x>>24;    B[n]=x>>16;    B[2*n]=x>>8;  B[3*n]=x;<br />
   B[4*n]=y>>24;  B[5*n]=y>>16;  B[6*n]=y>>8;  B[7*n]=y;<br />
}</p>
<p>

#1 This forum helps those that help themselves

#2 All grounds are not created equal

#3 How have you proved that your chip is running at xxMHz?

#4 "If you think you need floating point to solve the problem then you don't understand the problem. If you really do need floating point then you have a problem you do not understand." - Heater's ex-boss

Last Edited: Wed. Aug 10, 2016 - 10:36 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

For speed use a LUT for: 

 

ABCD EFGH to become HGFE DCBA

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Brian Fairchild wrote:

 

Probably quicker to do this by brute force for 8-bit values but a neat way to do 32-bit values is...

Interesting.  So for 8 bit that becomes:

x = (x & 0x55) <<  1 | (x & 0xAA) >>  1; 
x = (x & 0x33) <<  2 | (x & 0xCC) >>  2; 
x = (x & 0x0F) <<  4 | (x & 0xF0) >>  4; 

... or:

        mov     r0,     r24
        andi    r24,    0x55
        andi    r0,     0xAA
        or      r24,    r0
        mov     r0,     r24
        andi    r24,    0x33
        andi    r0,     0xCC
        or      r24,    r0
        mov     r0,     r24
        andi    r24,    0x0F
        andi    r0,     0xF0
        or      r24,    r0

At 12 cycles, that's 4 cycles shorter than what __builtin_avr_insert_cycles comes up with.  This should be posted to the AVR GCC list.  I imagine that, even though I'd not seen it before, this is an old trick.  I'm surprised they aren't using it as a special case.

 

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

That code is old news, it has been here at least a couple of times.  

 

And I don't think that the code you show is correct :( 

 

perhaps some () will solve it :)

Last Edited: Wed. Aug 10, 2016 - 02:57 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

@joeymorin,

I believe you left some shifts out of your asm code...

David (aka frog_jr)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

joeymorin wrote:
So for 8 bit that becomes:
joeymorin wrote:
andi r0, 0xAA
joeymorin wrote:
At 12 cycles, that's 4 cycles shorter

 

;)  Which AVR model are you using that has ANDI on R0?

 

 

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:

That code is old news, it has been here at least a couple of times.  

As I said:

I imagine that, even though I'd not seen it before, this is an old trick.

 

frog_jr wrote:

I believe you left some shifts out of your asm code...

I sure did!  Copy/pasted the wrong bit in a hurry...

        mov     r16,    r24
        andi    r24,    0x55
        lsh     r24
        andi    r16,    0xAA
        rsh     r16
        or      r24,    r16

        mov     r16,    r24
        andi    r24,    0x33
        lsh     r24
        lsh     r24
        andi    r16,    0xCC
        rsh     r16
        rsh     r16
        or      r24,    r16

        mov     r16,    r24
        andi    r24,    0x0F
        swap    r24
        andi    r16,    0xF0
        swap    r16
        or      r24,    r16

20 cycles (assuming I got it right this time!).  Boo.  Explains why AVR GCC doesn't do it this way :)

 

A better asm programmer than I will likely see ways to improve this naive approach.

 

theusch wrote:

;)  Which AVR model are you using that has ANDI on R0?

Oy.  It was a long night ;-)

 

OK, I'm having coffee now... I'll be better in a bit ;-)

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Wed. Aug 10, 2016 - 03:16 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

joeymorin wrote:

I imagine that, even though I'd not seen it before, this is an old trick.

 

Look out a copy of 'Hacker's Delight' published by Addison Wesley.

#1 This forum helps those that help themselves

#2 All grounds are not created equal

#3 How have you proved that your chip is running at xxMHz?

#4 "If you think you need floating point to solve the problem then you don't understand the problem. If you really do need floating point then you have a problem you do not understand." - Heater's ex-boss

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thanks Brian, I see it's available on Amazon and elsewhere.

 

I've noodled around a bit with the algorithm in #10, trying to express it in AVR assembler.  Expressed naively, it's 230 cycles/words.  I guess I'm just not that good an assembler programmer.  I can't get it shorter/faster than the 128-cycle/word brute-force approach by the OP.  I gave up after I got it down to 190 160 134 131 127 cycles/words.   I suppose that's to be expected.  The algorithm in #10 seems suited to languages which lack the notion of a carry bit, like C.  I may pick it up again some day.

 

EDIT:  OK, I got it down to 127 cycles.  That's... wait for it... 1 cycle shorter than the pure, flat, shift/rotate approach.  And actually, I cheated.  That leaves the results slightly out of order.  Result is in contiguous registers, but look like R23:22:21:20:27:26:25:24.  Getting it in the right order would cost 4 more cycles.  Not likely an issue if it will be assembler code using the results to feed an LED array.  If it is, the input can be maintained out of order so that the result ends up in order.

 

I'd bet the likes of @sparrow2 could best that without much effort, though ;-)

 

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Thu. Aug 11, 2016 - 01:38 AM