Is this the tightest method for 328P in assembler?


Please note that this algorithm is designed to accommodate this hardware configuration, but by no means am I advocating it as the most practical method, especially considering that all 8 LEDs lit would gobble 50% of the board's potential. It is simply to address a temporary need, but efficient code design in any context is of interest. The discerning eye may have noticed that some pins are reverse biased. I know I was surprised that it worked. I don't know if this is characteristic of other bar graph displays, and I haven't been able to find a datasheet for this particular model.

 

Of particular interest is the need to re-write the state of an input pin (pulled high or not). In the simulator I had set PORTB7, but because it wasn't configured as an output I masked it out with DDRB. This changed the state of PORTB7, which I hadn't expected.
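
The effect described here can be reproduced off-target. What follows is a minimal sketch in C, assuming plain variables as stand-ins for DDRB and PORTB (the struct and function names are hypothetical, not part of avr/io.h). On the ATmega328P the PORTx bit of a pin configured as input controls that pin's pull-up, so AND-ing the value with DDRB before the write clears the bit and switches the pull-up off.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one I/O port; not the real avr/io.h registers. */
struct port_model {
    uint8_t ddr;   /* 1 = output, 0 = input */
    uint8_t port;  /* output level, or pull-up enable for input pins */
};

/* Write `value` after masking it with the direction register,
   as the commented-out in/and pair in the routine would do. */
void write_masked_by_ddr(struct port_model *p, uint8_t value)
{
    p->port = value & p->ddr;   /* bits of input pins are forced to 0 */
}

/* Write only the pins selected by `mask`; all other PORT bits are kept. */
void write_selected(struct port_model *p, uint8_t value, uint8_t mask)
{
    p->port = (uint8_t)((p->port & ~mask) | (value & mask));
}

/* Returns 1 if the pin is an input with its pull-up enabled. */
int pullup_enabled(const struct port_model *p, unsigned bit)
{
    uint8_t m = (uint8_t)(1u << bit);
    return !(p->ddr & m) && (p->port & m);
}

void demo_pullup(void)
{
    /* PB3:0 are outputs, PB7 is an input with its pull-up on. */
    struct port_model b = { .ddr = 0x0F, .port = 0x80 };

    write_selected(&b, 0x05, 0x0F);
    assert(pullup_enabled(&b, 7));      /* PB7 untouched */

    write_masked_by_ddr(&b, 0x85);
    assert(!pullup_enabled(&b, 7));     /* pull-up silently switched off */
}
```

The read-modify-write form leaves PORTB7 alone, which matches the behaviour of the routine with the DDRB masking commented out.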

 

00 930f   	push	TMP  		; Preserve callers value
02 934f   	push   	R20  		; Scratch register
        
            ; Isolate low nibble and write new value to pins 3:0 on port "B".
        
04 b145   	in 	R20, PORTB 	; Get current state of pins
06 7f40   	cbr 	R20, 15  	; Strip 3:0 will become low nibble of TMP
08 930f   	push 	TMP  		; We're going to need high nibble later
0a 700f   	cbr 	TMP, 0xF0 	; Strip high nibble
0c 2b04   	or 	TMP, R20 	; and replace with existing pin states

         ; 	in 	R20, DDRB
         ; 	and 	TMP, R20

0e b905   	out 	PORTB, TMP 	; Write updated register to port
10 910f   	pop 	TMP  		; Retrieve original value
        
           ; Isolate high nibble and update value to pins 7:4 on Port "D"
        
12 b14b   	in 	R20, PORTD 	; Get current state
14 704f   	cbr 	R20, 0xF0 	; Strip 7:4 replaced by high nibble of TMP
16 7f00   	cbr 	TMP, 15
18 2b04   	or 	TMP, R20 	; Combine high nibble with low nibble "D"

         ; 	in 	R20, DDRD
         ; 	and 	TMP, R20

1a b90b   	out 	PORTD, TMP 	; Write updated value
        
1c 914f   	pop 	R20
1e 910f   	pop 	TMP  	; Restore callers value
20 9508   	ret

 

NOTE: The solution has a minor change to it that can be seen in #24

This topic has a solution.

Last Edited: Mon. Dec 11, 2017 - 12:19 AM

What difference would it make if it used less cycles? The display only needs to be updated 100 times/second - any faster and it is faster than a human can make sense of it.
You’ve not told us your register usage rules.


What difference would it make if it used less cycles?

 

Considering the hundreds of code segments that comprise an entire working application, paying attention to minute details is essential, especially when dealing with the limited resources typical of many embedded systems.

 

The display only needs to be updated 100 times/second.

 

In this case, that doesn't apply as segments are not being strobed. Up to all 8 LEDs can be lit statically, dependent only upon the state of the appropriate pins in ports "B" & "D".

 

You’ve not told us your register usage rules.

 

In this particular case TMP is defined as R16, but as this snippet is non-destructive, I'm not sure what you mean by usage rules.

	in 	R20, PORTB 	; Get current state of pins
	cbr 	R20, 15  	; Strip 3:0 will become low nibble of TMP
	push 	TMP  		; We're going to need high nibble later
	cbr 	TMP, 0xF0 	; Strip high nibble
	or 	TMP, R20 	; and replace with existing pin states
	out 	PORTB, TMP 	; Write updated register to port
	pop 	TMP  		; Retrieve original value

I don't see a reason to push/pop TMP here.  Do the OR the other way around, into R20 (leaving TMP with the original contents.)

 

	in 	R20, PORTD 	; Get current state
	cbr 	R20, 0xF0 	; Strip 7:4 replaced by high nibble of TMP
	cbr 	TMP, 15
	or 	TMP, R20 	; Combine high nibble with low nibble "D"

Don't you need to swap nybbles here; your wiring diagram shows the "high" bits of the LED connected to the "low" bits of PORTB (Digital Pins D8..D11)

 

 

I'm not sure what you mean by usage rules.

TMP has to be a "high" register for CBR to be usable.

 


I don't see a reason to push/pop TMP here.

Could you provide an example please. I'm having a little trouble visualizing how to OR without stripping conflicting bits.
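
A minimal sketch in C of the conflict in question, assuming plain bytes as stand-ins for PORTB and TMP (the function names are hypothetical). OR-ing the unmasked value into the port copy lets the high nibble of TMP leak into PB7:4; stripping it in a scratch copy first keeps those pins intact and leaves TMP itself untouched, at the cost of one more scratch register.

```c
#include <assert.h>
#include <stdint.h>

/* OR without masking: bits 7:4 of tmp are OR-ed into the port's upper pins. */
uint8_t merge_low_unmasked(uint8_t port, uint8_t tmp)
{
    return (uint8_t)((port & 0xF0) | tmp);
}

/* Mask a copy of tmp first: only bits 3:0 of tmp reach the port. */
uint8_t merge_low_masked(uint8_t port, uint8_t tmp)
{
    uint8_t low = tmp & 0x0F;          /* scratch copy, tmp itself unchanged */
    return (uint8_t)((port & 0xF0) | low);
}

void demo_merge(void)
{
    /* Port reads 0x20 (bit 5 high), TMP = 0xD7. */
    assert(merge_low_unmasked(0x20, 0xD7) == 0xF7);  /* bits 7, 6, 4 disturbed */
    assert(merge_low_masked(0x20, 0xD7) == 0x27);    /* only bits 3:0 changed */
}
```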

 

Don't you need to swap nybbles here; your wiring diagram shows the "high" bits of the LED connected to the "low" bits of PORTB (Digital Pins D8..D11)

Might you be referring to a different MPU? The 328P's PORTD only has 8 pins, PD7:0. D11:D8 are actually the low nibble of PORTB, PB3:0 (Arduino 17:14)

 


I think you’re trying to teach granny how to suck eggs. I used to write code by counting cycles but the concept doesn’t scale well. You end up writing ‘brittle’ code that is hard to maintain. The secret to optimisation is to only optimise the code that requires it. There would have to be a compelling reason why I would ‘optimise’ code running LEDs - if the project was going to sell a million units and I needed to squeeze the code into a smaller micro to save some cents, then that would be an economic reason that would justify my effort and cost. Writing code that works and is maintainable is a better and more economic use of my time.

If the display only rarely gets updated, then less reason to count cycles or code size.

In the olden days, processors had small numbers of registers. Nevertheless, you would devise rules as to how you passed and returned parameters to functions. With RISC processors, you have a large number of registers, so in order to write sensible and maintainable code, you need to define some rules as to how the registers are used to pass parameters. For the AVR you might use r16..19 to pass up to a 32 bit var and maybe X as the pointer. You define registers the called function can trash, as pushing and popping are expensive - the whole idea of RISC was to decrease the amount of memory accesses by keeping stuff in regs.

Thus rather than pushing regs you trash, use other regs. Think about how you can use the xor function for your use if you REALLY want to trim cycles.

After you’ve optimised your code, consider the time it took and whether there was any payoff. Did you measure the energy savings? How about the performance? Did the app work better from all that effort?

Last Edited: Sun. Dec 10, 2017 - 07:56 AM

Ah.   You may be right.   I think I forgot about maintaining the high bits of the PORTB.

(and it looks like I got PORTB/PORTD mixed up too, although you'll still need a SWAP somewhere...)

 


I would agree 100% with Kartman.    Concentrate on maintainable code.    Learn how to identify hot spots.    Optimise in ASM if worthwhile.

 

Yes,  your static LEDs will not care whether you take 1us or 50ms to update.

However I write to a TFT several million times in a second.    This is worth investigating as a hot spot.

 

This is a typical macro for writing to a Uno with an OPEN-SMART shield:

#define BMASK         0b00101111
#define DMASK         0b11010000

#define write_8(x) {                          \
        PORTD = (PORTD & ~DMASK) | ((x) & DMASK); \
        PORTB = (PORTB & ~BMASK) | ((x) & BMASK);} // STROBEs are defined later

I can re-write it for your hardware as:

#define BMASK         0b00001111
#define DMASK         0b11110000

Or as simply:

#define BMASK         0b00001111
#define DMASK         (~BMASK)

I can write a trivial program to use the macro.   Then inspect the generated ASM.   e.g.

/*
 * optimise_C_write.cpp
 *
 * Created: 10-Dec-17 08:31:26
 * Author : David Prentice
 */ 

#include <avr/io.h>

#define BMASK         0b00101111
#define DMASK         0b11010000

#define write_8(x) {                          \
        PORTD = (PORTD & ~DMASK) | ((x) & DMASK); \
        PORTB = (PORTB & ~BMASK) | ((x) & BMASK);} // STROBEs are defined later

int main(void)
{
    /* Replace with your application code */
    uint8_t val = 0;
	DDRB |= BMASK;
	DDRD |= DMASK;
	while (1)
    {
		val++;
		write_8(val);
    }
}

Inspecting the ASM in AS7:

    uint8_t val = 0;
00000046  LDI R24,0x00		Load immediate
		val++;
00000047  SUBI R24,0xFF		Subtract immediate
		write_8(val);
00000048  IN R25,0x0B		In from I/O location
00000049  ANDI R25,0x2F		Logical AND with immediate
0000004A  MOV R18,R24		Copy register
0000004B  ANDI R18,0xD0		Logical AND with immediate
0000004C  OR R25,R18		Logical OR
0000004D  OUT 0x0B,R25		Out to I/O location
0000004E  IN R25,0x05		In from I/O location
0000004F  ANDI R25,0xD0		Logical AND with immediate
00000050  MOV R18,R24		Copy register
00000051  ANDI R18,0x2F		Logical AND with immediate
00000052  OR R25,R18		Logical OR
00000053  OUT 0x05,R25		Out to I/O location
00000054  RJMP PC-0x000D		Relative jump

I can use the simulator to count cycles.   Or just do it by hand.   i.e. 12 cycles

 

As a macro it avoids any CALL and RET.   Quite honestly,  an inline 12 instructions is nothing much to worry about.    Ok  a subroutine would only use a LD, RCALL for each invocation.   The Flash use is smaller (for multiple use) but the CALL overhead is expensive in cycles.

 

I can't see much to improve on what the C compiler has done.    But at least you start with a "working system".    Any tweaks you make can be instantly compared to the known behaviour.

 

In short.    Learn how to write clear C or C++ code.    Learn how to "inspect" the generated code in the AS7 Simulator.

 

I have virtually no knowledge of ARM assembler.   I would never consider writing a whole ARM program in ASM.

I don't know much AVR assembler.   For example,   I have never used CBR instruction.

 

However,   I have no qualms at "inspecting" a short ASM sequence in an unknown processor.    And spotting a possible tweak.    Then reading the processor documentation to study the syntax of the instructions that I want to tweak.    Then do the Simulation to test it.

 

Many people can read Shakespeare.   You are unlikely to be able to write Shakespeare.    Reading ASM is a lot easier than writing (from scratch).

 

David.

Last Edited: Sun. Dec 10, 2017 - 09:01 AM

Your points are valid, but they are not relevant to the question. Anyone with a modicum of experience would immediately recognize that this is probably the least practical use of hardware ever (which I did allude to somewhat in the OP), which pretty much renders the code useless. The point is, this is what I plumbed together, and I concocted this snippet so that it works. Now I'm interested in input from those possibly more experienced: "is there a better way", which by association might equate to tighter code. I could have condensed this down to one instruction;

 

      out PORTD, TMP

 

but that would have meant delving into superfluous rhetoric about wanting to use an edge triggered button on INT0 and the Arduino LED for something else. It's all about taking an abstract paradigm, conceptualizing the most reasonable methodology based on hardware constraints and then tweaking wherever possible. Is it time consuming? Yes it is, but I believe the payoff will prove itself in time, even at the assembly level.

 

 


What a load of bollox. If the example was how to calculate a crc or aes128 etc, then there could have been useful discussion.

I gave you a hint to use a xor function - have you investigated that?

Last Edited: Sun. Dec 10, 2017 - 09:41 AM

 Learn how to write clear C or C++ code.

Although I have a reasonable amount of experience with C & C++, it is of no interest to me. However, I do appreciate examples of corresponding high level code, even if I have to disassemble it myself. Fact is, implementers have a comprehensive knowledge of hardware and instruction set, so by studying these examples I can glean valuable insight. Unfortunately, far too many HLL coders are unable to give examples that have a 1 to 1 correlation, probably due to lack of knowledge of the instruction set, and I refuse to burden myself with the idiosyncratic nature of AS7, especially not generating code indiscriminately.

 

 


Ok,   I was trying to offer some practical help,    e.g. by showing that a C compiler produces code that runs 2.5 times faster than your example.
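
That ratio can be checked by hand. Here is a rough sketch in C that adds up per-instruction cycle counts for both sequences. The counts used (1 for IN, OUT, ANDI, OR and MOV, 2 for PUSH and POP, 3 for RCALL, 4 for RET) are what I believe apply to the ATmega328P and should be confirmed against the AVR instruction set manual; the enum and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum op { OP_IN, OP_OUT, OP_ANDI, OP_OR, OP_MOV, OP_PUSH, OP_POP, OP_RET, OP_RCALL };

/* Assumed cycle counts for an AVR with a 16-bit program counter. */
static unsigned cycles_of(enum op o)
{
    switch (o) {
    case OP_PUSH: case OP_POP: return 2;
    case OP_RCALL:             return 3;
    case OP_RET:               return 4;
    default:                   return 1;
    }
}

unsigned total_cycles(const enum op *seq, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += cycles_of(seq[i]);
    return sum;
}

/* The original subroutine (CBR assembles to ANDI), plus the RCALL to reach it. */
static const enum op asm_routine[] = {
    OP_RCALL, OP_PUSH, OP_PUSH,
    OP_IN, OP_ANDI, OP_PUSH, OP_ANDI, OP_OR, OP_OUT, OP_POP,
    OP_IN, OP_ANDI, OP_ANDI, OP_OR, OP_OUT,
    OP_POP, OP_POP, OP_RET
};

/* The inline sequence the compiler generated for write_8(). */
static const enum op c_inline[] = {
    OP_IN, OP_ANDI, OP_MOV, OP_ANDI, OP_OR, OP_OUT,
    OP_IN, OP_ANDI, OP_MOV, OP_ANDI, OP_OR, OP_OUT
};

void demo_cycles(void)
{
    assert(total_cycles(asm_routine, sizeof asm_routine / sizeof asm_routine[0]) == 29);
    assert(total_cycles(c_inline, sizeof c_inline / sizeof c_inline[0]) == 12);
}
```

Under those assumptions the subroutine costs 29 cycles per call against 12 for the inline macro, a ratio of about 2.4; with a 4-cycle CALL instead of RCALL it comes to 30 cycles, or 2.5.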

 

If we post AS7 views,   it does not mean that you have to use AS7.    AS4 would have been perfectly adequate.

You could achieve the same result from inspecting the LSS file.

 

		write_8(val);
  90:	9b b1       	in	r25, 0x0b	; 11
  92:	9f 72       	andi	r25, 0x2F	; 47
  94:	28 2f       	mov	r18, r24
  96:	20 7d       	andi	r18, 0xD0	; 208
  98:	92 2b       	or	r25, r18
  9a:	9b b9       	out	0x0b, r25	; 11
  9c:	95 b1       	in	r25, 0x05	; 5
  9e:	90 7d       	andi	r25, 0xD0	; 208
  a0:	28 2f       	mov	r18, r24
  a2:	2f 72       	andi	r18, 0x2F	; 47
  a4:	92 2b       	or	r25, r18
  a6:	95 b9       	out	0x05, r25	; 5

I started with 6502 ASM.     I would write whole programs in 6502 ASM.     I was very fond of the 6502.

 

Yes,  writing efficient ASM code is satisfying.    Especially if you produce dramatic results.    e.g. 6000% compared to the Basic interpreter of the day

 

Modern processors and modern compilers do a pretty good job.    You have to trade off your effort with the result obtained.    Typically  10% to 50%.

You probably use more sophisticated algorithms with a HLL than you would do with ASM.   In which case the HLL beats ASM hands down.

 

You say that you have C and C++ experience.    So you should understand the principle of an application spending "95% of its time in 5% of the code"

Even if you write 100% of the code in ASM, you only need to worry about the efficiency of the 5% hot spots.

It is more practical to write 95% in HLL and only 5% in ASM.

 

David.

Last Edited: Sun. Dec 10, 2017 - 10:34 AM

Hilarious.

 

Quebracho seems to be the hardest wood.


AVR_Coder_1 wrote:

...implementers have a comprehensive knowledge of hardware and instruction set, so by studying these examples I can glean valuable insight.

 

I wouldn't bother too much with using the output from a compiler as being a good example of good assembly code.

 

Very few, if any compilers, directly target the chip being used. The usual route is to target an abstract 'ideal' processor, hold the output of that stage in some sort of internal representation, and to then translate that onto the final processor. Optimisation takes place on both the internal representation and the final processor. As such, whilst the final code can be 'efficient', by which I mean fast and/or small, it may not be 'obvious' as the compiler, given lots of registers to play with, may play games with the persistence of values over large areas of code and other such tricks. A good assembly programmer on the other hand will tend to write code which is to all intents and purposes as good as the compiler, often a smidge or so faster and/or smaller, but with more obvious and visible structure. I know which I'd rather maintain.

 

 

'This forum helps those who help themselves.'

 

pragmatic  adjective dealing with things sensibly and realistically in a way that is based on practical rather than theoretical consideration.

Last Edited: Sun. Dec 10, 2017 - 11:53 AM

I bet in 20 years time he'll grow up to be a HLL programmer like the rest of us. Some of us did all this 40+ years ago when you didn't get a choice and Asm was the only option. But we grew up to realise that HLL with small sections of hand optimised Asm if really needed is actually the "sensible" solution. 


Brian Fairchild wrote:
I wouldn't bother too much with using the output from a compiler as being a good example of good assembly code.

 

Rubbish.   You can pick up several tips and tricks from inspecting the Compiler output.   Mostly from the final translation of internal to processor stage.

 

Yes,  the dramatic optimisations come from the machine independent analysis.    These tend to be too big for my head to contain.

The final machine specific stage is possible to follow and occasionally improve.

 

The ARM C and C++ Compilers do an excellent job.

GCC and G++ for ARM and for AVR make a pretty good job too.

 

Early compilers were not so clever.

 

David.


---

Last Edited: Mon. Dec 11, 2017 - 09:05 PM

Ok, as there doesn't seem to be an example that duplicates the functionality exactly, and this thread like so many others has spun off into C & C++ evangelistic oblivion, here is what I've concocted with my limited experience with HLL.

 

#include <avr/io.h>

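void ShowBin (unsigned short Dev) {
        // Low nibble of Dev goes to PB3:0, high nibble to PD7:4;
        // the other pins of each port keep their current state.
        PORTB = (Dev & 0x0f) | (PORTB & 0xf0);
        PORTD = (Dev & 0xf0) | (PORTD & 0x0f);
}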

int main () {
        unsigned short DevNum = 120;
       
        SMCR = 1;     // Select idle
        PRR = 0x6f;     // Turn off all clocks except PRTWI
        DDRB = 0x4f;    // Set Arduino LED and LSB of port to output
        DDRD = 0xf0;    // Set MSB of port to output
        PORTD = PORTB = 0;   // Guarantee all LEDs are off.
       
        while (DevNum--) {
                ShowBin (DevNum);
        }
}

I will concede, that C is far more maintainable, as the same functionality was produced with near 90% less code, or at least as far as lines go. However the object code came in at 208 bytes vs 188 for the assembly equivalent. I would assume something like -O3 could be applied to this, but don't know where that would be done. All this talk about how much better C is at doing things, well I'm not feeling it based on this example. However, I'm not a proficient HLL coder and don't know all the nuances that could be applied to this, especially as it applies to optimization flags. Credibility and tangibility are synonymous.

 

PS: I'm using AS7

 


Last Edited: Sun. Dec 10, 2017 - 03:21 PM
This reply has been marked as the solution. 

---

Last Edited: Mon. Dec 11, 2017 - 09:06 PM

sparrow wrote:

 

Don't mix hex and decimal that is very confusing

 Yes, I have developed a few inconsistent habits, and I agree that from a reader's point of view, and even for myself, consistent representations should be maintained.

 

Now on to why XOR doesn't work. The objective is to replace the MSB of PORTD with MSB of TMP and LSB of PORTD with LSB without changing opposing bits.

 

Consider;      TMP = HHLH LHHH = D7

 

                PORTB = HLHH LHHL = B9 -> B7

                                B9 xor D7 = 6E

                                6E and 0F = 0E

                                0E xor D7 = D9 != B7

 

                 PORTD = HLLL HHLH = 8B -> DB

                                 8B xor D7 = 5C

                                 5C and F0 = 50

                                 50 xor D7 =  87 != DB

 

  Although this preserves TMP and eliminates stack usage, it also negates the intent of the algorithm. In all fairness though, I did not specifically allude to these criteria and obviously shouldn't have assumed readers would glean them from the code.


---

Last Edited: Mon. Dec 11, 2017 - 09:07 PM

AVR_Coder_1 wrote:
The objective is to replace the MSB of PORTD with MSB of TMP and LSB of PORTD with LSB without changing opposing bits.

 

Consider;      TMP = HHLH LHHH = D7

 

                PORTB = HLHH LHHL = B9 -> B7

 

                 PORTD = HLLL HHLH = 8B -> DB

 

You mention PORTD twice in the first sentence, then you show PORTB and PORTD, and you appear to be showing the initial values of PORTB and PORTD but they are incorrect as PORTB is B6 and PORTD is 8D.

 

Your post is very confusing, so carefully rewrite it to begin with actual values and comments on each line so that it makes sense to everyone including yourself.
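
Redoing the arithmetic with the bit patterns as drawn (TMP = 0xD7, PORTB = 0xB6, PORTD = 0x8D) gives the results below. This is a minimal sketch in C, assuming plain bytes in place of the port registers and hypothetical function names. The `keep` argument marks the bits retained from the port; the failing attempt used the complement of that mask, which is why it returned the wrong value.

```c
#include <assert.h>
#include <stdint.h>

/* AND/OR merge: keep the `keep` bits of port, take the rest from tmp. */
uint8_t merge_and_or(uint8_t port, uint8_t tmp, uint8_t keep)
{
    return (uint8_t)((port & keep) | (tmp & (uint8_t)~keep));
}

/* XOR merge: after the AND, kept bits hold port^tmp and the others hold 0,
   so the second XOR yields port in the kept bits and tmp elsewhere. */
uint8_t merge_xor(uint8_t port, uint8_t tmp, uint8_t keep)
{
    return (uint8_t)(((port ^ tmp) & keep) ^ tmp);
}

void worked_example(void)
{
    const uint8_t tmp = 0xD7;                          /* HHLH LHHH */

    /* PORTB = 0xB6: 0xB6 ^ 0xD7 = 0x61, & 0xF0 = 0x60, ^ 0xD7 = 0xB7 */
    assert(merge_xor(0xB6, tmp, 0xF0) == 0xB7);
    assert(merge_and_or(0xB6, tmp, 0xF0) == 0xB7);

    /* PORTD = 0x8D: 0x8D ^ 0xD7 = 0x5A, & 0x0F = 0x0A, ^ 0xD7 = 0xDD */
    assert(merge_xor(0x8D, tmp, 0x0F) == 0xDD);
    assert(merge_and_or(0x8D, tmp, 0x0F) == 0xDD);
}
```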

 

 

As of January 15, 2018, Site fix-up work has begun! Now do your part and report any bugs or deficiencies here

No guarantees, but if we don't report problems they won't get much of  a chance to be fixed! Details/discussions at link given just above.

 

"void transmigratus(void) {transmigratus();} // recursio infinitus" - larryvc

"It's much more practical to rely on the processing powers of the real debugger, i.e. the one between the keyboard and chair." - JW wek3

"When you arise in the morning think of what a privilege it is to be alive: to breathe, to think, to enjoy, to love." -  Marcus Aurelius

Last Edited: Sun. Dec 10, 2017 - 08:11 PM

I think we’ve seen this OP before. Last time it didn’t end well and I can’t see this thread ending well either.


larryvc wrote: Your post is very confusing, so carefully rewrite it to begin with actual values and comments on each line so that it makes sense to everyone including yourself.

Very valid point, but after 18 some odd posts of "bollocks", "hilarious", "one day I'll grow up and program in C" and so on, I was kind of losing interest.

 

@sparrow2  not sure why exclusive oring a number twice in this fashion slipped by me, but it did. Notice where you used ANDI, is replaced with CBR in order it works as required.

 

ShowBin:
        push 	TMP  		; Preserve callers value
        push 	R20  		; Scratch register

   ; Replace LSB of port with LSB of TMP.

        in 	R20, PORTB 	; Read current contents
        eor 	R20, TMP 	; So second EOR will revert unmasked to original
        cbr 	R20, 0xF 	; Strip bits to be replaced by TMP
        eor 	R20, TMP 	; Any bits not masked by CBR will revert to
        out 	PORTB, R20 	; original value.

   ; Replace MSB of port with MSB of TMP.

        in 	R20, PORTD
        eor 	R20, TMP
        cbr 	R20, 0xF0 	; Same as above except masking MSB
        eor 	R20, TMP
        out 	PORTD, R20

        pop 	R20
        pop 	TMP  		; Restore callers value
        ret
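
One way to gain confidence in the routine above is to model it on a host machine and compare it with the AND/OR version for every possible input. This is a minimal sketch in C, assuming plain variables in place of the ports and registers (all names are hypothetical). CBR Rd,K clears the bits set in K, so it is written here as an AND with the complement.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model of the EOR/CBR/EOR sequence for one port.
   clear_mask is the CBR operand: the port bits to be replaced by tmp. */
uint8_t eor_cbr_eor(uint8_t port, uint8_t tmp, uint8_t clear_mask)
{
    uint8_t r20 = port;               /* in   R20, PORTx */
    r20 ^= tmp;                       /* eor  R20, TMP   */
    r20 &= (uint8_t)~clear_mask;      /* cbr  R20, K     */
    r20 ^= tmp;                       /* eor  R20, TMP   */
    return r20;                       /* out  PORTx, R20 */
}

/* Compare against the plain AND/OR merge for all 65536 port/tmp pairs.
   Returns the number of mismatches found. */
unsigned count_mismatches(uint8_t clear_mask)
{
    unsigned bad = 0;
    for (unsigned port = 0; port < 256; port++) {
        for (unsigned tmp = 0; tmp < 256; tmp++) {
            uint8_t expect = (uint8_t)((port & ~clear_mask) | (tmp & clear_mask));
            if (eor_cbr_eor((uint8_t)port, (uint8_t)tmp, clear_mask) != expect)
                bad++;
        }
    }
    return bad;
}

void self_test(void)
{
    assert(count_mismatches(0x0F) == 0);   /* PORTB: replace bits 3:0 */
    assert(count_mismatches(0xF0) == 0);   /* PORTD: replace bits 7:4 */
}
```

Since TMP is only read in this sequence, the model also shows why it no longer needs to be pushed and popped.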

 


On one hand you give us a diatribe on the importance of efficient code in an embedded environment, then you persist with pushing and popping regs like you would running Forth on a Z80!
I’m glad you’ve grasped the use of xor, next is to understand the use of registers on a RISC machine to minimise push/pop unless you really want to do RPN.
Then you can bask in the glory of highly efficient code.


 

And the winner is, wait for it, the modern day HLL tools...

#include <avr/io.h>

#define F_CPU 16000000UL

int main(void)
{

	DDRB = 0xFF;
	DDRD = 0xFF;

	PORTB = 0xB6;
	PORTD = 0x8D;

	uint8_t  tmp = 0xD7;

	uint8_t tmplow = tmp & 0x0F;
	uint8_t tmphigh = tmp & 0xF0;

	PORTB = tmplow | (PORTB & 0xF0);
	PORTD = tmphigh | (PORTD & 0x0F);

	while (1);

}

	DDRB = 0xFF;
  80:	8f ef       	ldi	r24, 0xFF	; 255
  82:	84 b9       	out	0x04, r24	; 4
	DDRD = 0xFF;
  84:	8a b9       	out	0x0a, r24	; 10

	PORTB = 0xB6;
  86:	86 eb       	ldi	r24, 0xB6	; 182
  88:	85 b9       	out	0x05, r24	; 5
	PORTD = 0x8D;
  8a:	8d e8       	ldi	r24, 0x8D	; 141
  8c:	8b b9       	out	0x0b, r24	; 11
	uint8_t  tmp = 0xD7;

	uint8_t tmplow = tmp & 0x0F;
	uint8_t tmphigh = tmp & 0xF0;

	PORTB = tmplow | (PORTB & 0xF0);
  8e:	85 b1       	in	r24, 0x05	; 5
  90:	80 7f       	andi	r24, 0xF0	; 240
  92:	87 60       	ori	r24, 0x07	; 7
  94:	85 b9       	out	0x05, r24	; 5
	PORTD = tmphigh | (PORTD & 0x0F);
  96:	8b b1       	in	r24, 0x0b	; 11
  98:	8f 70       	andi	r24, 0x0F	; 15
  9a:	80 6d       	ori	r24, 0xD0	; 208
  9c:	8b b9       	out	0x0b, r24	; 11
  9e:	ff cf       	rjmp	.-2      	; 0x9e <main+0x1e>

All kidding aside, the code in your post above does the job just fine.

 

EDIT: Before anyone else makes a comment about this post read post #30 first.  You are all missing the point.


Last Edited: Mon. Dec 11, 2017 - 09:19 AM

AVR_Coder_1 wrote:
@sparrow2  not sure why exclusive oring a number twice in this fashion slipped by me, but it did. Notice where you used ANDI, is replaced with CBR in order it works as required.

Note that in sparrow2's post #21 the masks were corrected and they are appropriate for ANDI.  Both CBR and ANDI work fine, only the nibbles of the masks are swapped accordingly.


Last Edited: Mon. Dec 11, 2017 - 02:36 AM

Note that CBR and ANDI are the same machine instruction (with the assembler complementing the argument for you, or not.)
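
This can be checked against the opcodes in the first listing of the thread. Below is a minimal sketch in C of the ANDI encoding (0111 KKKK dddd KKKK, with d restricted to R16..R31) as I understand it from the AVR instruction set manual. The function names are hypothetical, and CBR is modelled as ANDI with the complemented constant.

```c
#include <assert.h>
#include <stdint.h>

/* Encode ANDI Rd,K. Only R16..R31 are encodable (4-bit register field). */
uint16_t encode_andi(unsigned rd, uint8_t k)
{
    assert(rd >= 16 && rd <= 31);
    return (uint16_t)(0x7000u | ((k & 0xF0u) << 4) | ((rd - 16u) << 4) | (k & 0x0Fu));
}

/* CBR Rd,K is ANDI Rd,(0xFF - K). */
uint16_t encode_cbr(unsigned rd, uint8_t k)
{
    return encode_andi(rd, (uint8_t)(0xFFu - k));
}

void check_listing(void)
{
    /* Opcodes taken from the original listing; TMP is R16. */
    assert(encode_cbr(20, 15)   == 0x7F40);  /* cbr R20, 15   */
    assert(encode_cbr(16, 0xF0) == 0x700F);  /* cbr TMP, 0xF0 */
    assert(encode_cbr(20, 0xF0) == 0x704F);  /* cbr R20, 0xF0 */
    assert(encode_cbr(16, 15)   == 0x7F00);  /* cbr TMP, 15   */
}
```

The 4-bit register field is also the reason TMP has to be a "high" register for CBR to be usable.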

 

 

 8e:	85 b1       	in	r24, 0x05	; 5
  90:	80 7f       	andi	r24, 0xF0	; 240
  92:	87 60       	ori	r24, 0x07	; 7
  94:	85 b9       	out	0x05, r24	; 5

Your C code is producing code optimized for constant values;  inappropriate to the original task

And it's not saving registers that the original code saw fit to save.  Most of the "improvements" suggested (including the lovely EOR method) mostly just avoid those save/restores, so the C code isn't quite fair...

 


After a bit of brain scratching and examples from respondents, I finally came up with a C equivalent to @sparrow2's example. Although @larryvc's example uses a constant, it was enough to start me on a train of thought. However, from what I can glean from the disassembly, R25 has been modified, and I require the routine to be non-destructive.

 

void ShowBin (uint8_t Dev) {

        PORTB = ((PORTB ^ Dev) & 0xf0) ^ Dev;
                
  80: 95 b1        in 	r25, 0x05
  82: 98 27        eor 	r25, r24
  84: 90 7f        andi r25, 0xF0
  86: 98 27        eor 	r25, r24
  88: 95 b9        out 	0x05, r25
  
        PORTD = ((PORTD ^ Dev) & 0xf) ^ Dev;
                
  8a: 9b b1        in 	r25, 0x0b
  8c: 98 27        eor 	r25, r24
  8e: 9f 70        andi r25, 0x0F
  90: 89 27        eor 	r24, r25
  92: 8b b9        out 	0x0b, r24
  94: 08 95        ret
  }

As you can see, this object code is identical to that of post #19, except R25 isn't preserved.


westfw wrote:

Note that CBR and ANDI are the same machine instruction (with the assembler complementing the argument for you, or not.)

Note that I was pointing out that sparrow2's ANDI with a mask of 0xF0 was achieving the same goal of clearing the lower nibble as the OP's CBR with a mask of 0X0F. 

westfw wrote:

Your C code is producing code optimized for constant values;  inappropriate to the original task

And it's not saving registers that the original code saw fit to save.  Most of the "improvements" suggested (including the lovely EOR method) mostly just avoid those save/restores, so the C code isn't quite fair...

Oh give me a break,  the point being made here is that with a little creativity on the part of the OP, the code produced by the HLL tools can easily be made to work for his case.  His choice not to use the many tools at his disposal for experimentation and learning purposes and not heeding the advice given by many here is the real mystery.

 

EDIT:  This post was written before seeing the OP's last post.

 

EDIT2: BTW, I'm out.

 


Last Edited: Mon. Dec 11, 2017 - 05:20 AM

Kartman wrote:
... then you persist with pushing and popping regs like you would running Forth on a Z80!

...next is to understand the use of registers on a RISC machine

Played with this code a bit more and realized just how restricted and odd the OP's criteria are and how right on the money the first comment is in light of the fact that I used Forth to program Z80 derivatives used in avionics glass panels for 757s.

 

The second comment is one that the OP definitely needs to understand and put to good use.

 

Now I'm out, until I'm not again.

 

 

 



---

Last Edited: Mon. Dec 11, 2017 - 09:08 PM

AVR_Coder_1 wrote:
except R25 isn't preserved.
See:

 

https://gcc.gnu.org/wiki/avr-gcc...

 

So R25 is a "Call-Used" register - so there's no requirement (by the C compiler) for it to be preserved. When the compiler makes a (R)CALL to code it expects that the call-clobbered registers may be changed so, if it knows it is holding something important in R25 it will arrange to preserve it before calling to the function itself and restoring after the return. In the ABI note also the "call-saved" registers - if you use any of those then it is YOU that must preserve (PUSH/POP) them so they are not changed.

 

As you may have spotted it is therefore "better" to start your own register allocation (as the C compiler does) by first making use of the clobbered registers. Just switch to also using the call-saved ones when you have to (and then arrange to preserve them).
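
As a quick reference for the registers that came up in this thread, here is a small sketch in C of the avr-gcc register classes as I read them from the ABI page linked above: R18..R27, R30 and R31 are call-used, R2..R17, R28 and R29 are call-saved, and R0 and R1 are fixed as the temporary and zero registers. The enum and function are made up for illustration, so check the page itself before relying on them.

```c
#include <assert.h>

enum reg_class { REG_FIXED, REG_CALL_SAVED, REG_CALL_USED };

/* Classify AVR register Rn (0..31) according to the avr-gcc ABI. */
enum reg_class avr_gcc_reg_class(unsigned n)
{
    assert(n <= 31);
    if (n <= 1)
        return REG_FIXED;            /* R0 = temp, R1 = zero register */
    if (n <= 17 || n == 28 || n == 29)
        return REG_CALL_SAVED;       /* callee must push/pop if it uses them */
    return REG_CALL_USED;            /* R18..R27, R30, R31: free to clobber */
}
```

By that convention R20 and R25 may be clobbered freely, while R16 (TMP in this thread) would have to be preserved by any routine called from C.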

Last Edited: Mon. Dec 11, 2017 - 09:40 AM

---

 

Last Edited: Mon. Dec 11, 2017 - 09:09 PM

sparrow2 wrote:

In ASM you can make/use the model you want/need.

Just make sure that you have a plan!!!

This is one of the fundamental reasons I use pure assembly, but the operative phrase is "have a plan".  @clawson provided a link to https://gcc.gnu.org/wiki/avr-gcc#ABI and, even though I don't contemplate any mixed language paradigms, I believe it may be worthwhile taking a closer look.


I am fascinated by this thread.   Sorry if I am digressing from the OP's original purpose.

 

Sparrow2 as always offers excellent advice.   So I compared timing of XOR against AND/OR.   Woo-hoo.  10 cycles versus 12 cycles.

This would make a noticeable difference to the performance of my TFT library on a Uno.

 

However the standard "Adafruit Tests" took longer.   drawPixel() was quicker but fillRect() was slower.

 

I compared the fillRect hot spot in the LSS with both styles.   And saw that GCC had made a significant optimisation when it realised that fillRect() was doing multiple loops.

 

Ok,  Larry spotted the same behaviour.   But I am still amazed by how well GCC performs.

I did some comparison timing in the AS7 Simulator.

/*
* optimise_C_write.cpp
*
* Created: 10-Dec-17 08:31:26
* Author : David Prentice
*/

#include <avr/io.h>

#define BMASK         0b00001111
#define DMASK         (~BMASK)

#define write_8(x) {                          \
    PORTB = (PORTB & ~BMASK) | ((x) & BMASK); \
    PORTD = (PORTD & ~DMASK) | ((x) & DMASK); \
}

#define writexor_8(x) {                          \
    PORTB = ((PORTB ^ (x)) & ~BMASK) ^ (x); \
    PORTD = ((PORTD ^ (x)) & ~DMASK) ^ (x); \
}

#define NOP() asm("nop")

void write_call(uint8_t x)
{
    write_8(x);              //use MACRO in a function
}

inline void write_inline(uint8_t x)
{
    write_8(x);              //use MACRO in a function
}

void writexor_call(uint8_t x)
{
    writexor_8(x);           //use MACRO in a function
}

inline void writexor_inline(uint8_t x)
{
    writexor_8(x);           //use MACRO in a function
}

int main(void)
{
    /* Replace with your application code */
    uint8_t val = 0, cnt;
    uint16_t val16 = 0;
    DDRB |= BMASK;
    DDRD |= DMASK;
    while (1)
    {
        NOP();                //handy for Breakpoint
        val++;                //+0 clever how it avoids instruction
        // write a single value
        NOP();
        write_call(val);      //+20 CALL and RET
        NOP();
        writexor_call(val);   //+19 CALL and RET
        NOP();
        write_inline(val);    //+14
        NOP();
        writexor_inline(val); //+10
        NOP();
        val++;                //+1  stops GCC remembering value
        NOP();
        write_8(val);         //+13 straight macro
        NOP();
        writexor_8(val);      //+10 straight macro
        NOP();
        // now write the same value but multiple times in a loop
        for (cnt = 100; cnt--; ) write_8(val);    //+1100
        NOP();
        for (cnt = 100; cnt--; ) writexor_8(val); //+1300
        NOP();
        // real life situation.  writing 16-bit value on 8-bit data bus
        val16++;
        uint8_t hi = val16 >> 8, lo = val16 & 0xFF;
        NOP();
        for (cnt = 100; cnt--; ) { //+2109 i.e. +21 per iteration
            write_8(hi);
            write_8(lo);
        }
        NOP();
        for (cnt = 100; cnt--; ) { //+2299 i.e. +23 per iteration
            writexor_8(hi);
            writexor_8(lo);
        }
        NOP();
    }
}

The moral of the story is that the Simulator is an excellent tool for timing different ideas.   Oh,  and I am sticking with my AND/OR macro because any program is more sensitive to any "inner loop" than single statements.
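For anyone who wants to convince themselves that the two macros really write the same bits, here is a minimal sketch that runs on a PC. It assumes nothing about the AVR: the port is just a byte passed in, and the names `merge_andor`, `merge_xor` and `merges_agree` are made up for this illustration.

```c
#include <stdint.h>

/* AND/OR merge: clear the masked bits, then OR in the new data (as write_8 does). */
uint8_t merge_andor(uint8_t port, uint8_t x, uint8_t mask)
{
    return (uint8_t)((port & (uint8_t)~mask) | (x & mask));
}

/* XOR merge: same result, written the way writexor_8 does it. */
uint8_t merge_xor(uint8_t port, uint8_t x, uint8_t mask)
{
    return (uint8_t)(((port ^ x) & (uint8_t)~mask) ^ x);
}

/* Returns 1 if both merges agree for every port/data combination, else 0. */
int merges_agree(uint8_t mask)
{
    for (unsigned port = 0; port < 256; port++)
        for (unsigned x = 0; x < 256; x++)
            if (merge_andor((uint8_t)port, (uint8_t)x, mask) !=
                merge_xor((uint8_t)port, (uint8_t)x, mask))
                return 0;
    return 1;
}
```

Where a mask bit is 1 the XOR form leaves `0 ^ x`, i.e. the data bit; where it is 0 it leaves `port ^ x ^ x`, i.e. the old port bit. So the choice between the two is purely about cycle counts, not behaviour.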

 

Yes,  the obvious speed up of a loop is to unroll it to a certain degree.  e.g. 8 (or 4).  

This would make 100 write16 calls cost 12x(3+8x14) + 4x(3+1x14) = 1448 cycles, i.e. 12 passes through a loop unrolled by 8 plus 4 single passes.

Unrolling is dramatic for short loop bodies.   The 3-cycle loop-control is significant.
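The arithmetic above can be sketched as a small helper. This is only a cost model under the figures quoted in the post (3 cycles of loop control, 14 cycles per body); `loop_cycles` and its parameter names are invented for the illustration.

```c
#include <stdint.h>

/* Cycles for n bodies: full passes of `unroll` bodies each, then the
 * remainder as single-body passes. Every pass pays `overhead` once. */
uint32_t loop_cycles(uint32_t n, uint32_t unroll, uint32_t body, uint32_t overhead)
{
    uint32_t full = n / unroll;   /* passes through the unrolled loop */
    uint32_t rest = n % unroll;   /* leftover single passes           */
    return full * (overhead + unroll * body) + rest * (overhead + body);
}
```

With these numbers `loop_cycles(100, 8, 14, 3)` comes to 1448, against 1700 for the plain loop (`unroll` of 1), and unrolling by 4 already gets to 1475.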

 

David.

 

Edit.   I forced the write16 loop to use 16-bit values i.e. val16

Last Edited: Mon. Dec 11, 2017 - 04:51 PM

---

Last Edited: Mon. Dec 11, 2017 - 09:10 PM

My apologies.   I was following my own agenda.  i.e. writing 16-bit pixels to an 8-bit mixed address bus.

I gave up with the OP a long time ago.

 

My last point was that GCC can reduce the 12 cycles to 8 cycles in a multiple loop with the AND/OR macro.

Whereas the XOR method is 10 cycles regardless.
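One plausible reading of that difference (an assumption, since the LSS is not shown here): in the AND/OR form the terms `x & BMASK` and `x & DMASK` do not depend on the port, so they can be computed once before the loop. The sketch below does that hoisting by hand on a PC, with `sim_portb` and `sim_portd` as made-up stand-ins for the real I/O registers.

```c
#include <stdint.h>

#define BMASK 0x0Fu
#define DMASK 0xF0u

/* Stand-ins for PORTB/PORTD so the sketch runs on a PC (hypothetical names). */
uint8_t sim_portb, sim_portd;

/* Straight AND/OR merge, everything recomputed on each pass. */
void fill_naive(uint8_t x, uint8_t count)
{
    while (count--) {
        sim_portb = (uint8_t)((sim_portb & (uint8_t)~BMASK) | (x & BMASK));
        sim_portd = (uint8_t)((sim_portd & (uint8_t)~DMASK) | (x & DMASK));
    }
}

/* Same loop with the data-dependent terms computed once up front. */
void fill_hoisted(uint8_t x, uint8_t count)
{
    const uint8_t lo = (uint8_t)(x & BMASK);
    const uint8_t hi = (uint8_t)(x & DMASK);
    while (count--) {
        sim_portb = (uint8_t)((sim_portb & (uint8_t)~BMASK) | lo);
        sim_portd = (uint8_t)((sim_portd & (uint8_t)~DMASK) | hi);
    }
}
```

Per port the hoisted body would need only IN, ANDI, OR, OUT, which is consistent with the 8 cycles quoted. The XOR form as written uses `x` on both sides of the port read, so there is less to lift out of the loop.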

 

In practice,   XOR is better for drawPixel() and AND/OR is better for fillRect().    So my strategy would be to use XOR as default but use AND/OR in a moderately unrolled loop for fillRect().     Bear in mind that "most" TFT operations involve a filled rectangle e.g. fillRect(), drawHLine(), drawVLine(), ...

 

Regarding Hot Spots.    Graphics performance on a TFT comes down to the  behaviour of one macro (or 12 ASM instructions).

In a 16kB library, about 0.1% of the code occupies about 99% of the execution time.

So yes,  it is worth making some effort to optimise that 0.1%

 

Since my library runs on ARM, Mega, XMega and Espressif, I have no intention of writing 100% in Assembler.

Incidentally,   writing to a mixed bus on XMega, ARM and Espressif takes advantage of "GPIO steering" hardware,   which is effectively AND/OR.

 

This example has bits that "line up" with the PORT bits.    In real life mixed ports often have random bit positions.   At least ARM and Espressif have a barrel shifter,    which means that there is not too much of a penalty with shifts.    The AVR has the BLD/BST instructions.
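As a rough picture of what "random bit positions" means, here is a sketch that moves each data bit to its pin one bit at a time, which is the job a BST/BLD pair does on the AVR. The wiring table `PIN_OF_BIT` is entirely hypothetical, and the loop is only there to keep the example short; real code would be unrolled since the mapping is fixed.

```c
#include <stdint.h>

/* Hypothetical wiring: data bit i is connected to port bit PIN_OF_BIT[i]. */
static const uint8_t PIN_OF_BIT[8] = { 2, 0, 5, 7, 1, 6, 3, 4 };

/* Copy each data bit to its pin position, one bit at a time. */
uint8_t scatter_bits(uint8_t data)
{
    uint8_t out = 0;
    for (uint8_t i = 0; i < 8; i++) {
        if (data & (1u << i))                      /* like BST: pick up data bit i  */
            out |= (uint8_t)(1u << PIN_OF_BIT[i]); /* like BLD: drop it on its pin  */
    }
    return out;
}
```

With this table `scatter_bits(0x01)` gives 0x04 and `scatter_bits(0x80)` gives 0x10.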

 

David.

Last Edited: Mon. Dec 11, 2017 - 01:39 PM

---

Last Edited: Mon. Dec 11, 2017 - 09:12 PM

sparrow2 wrote:
use MUL as a simple barrel shifter
???

I can see using it as a shifter, but not a "barrel" shifter.

Please enlighten me...

David (aka frog_jr)


Shift one is 1-cycle

Shift two is 2-cycle

Shift three is 3-cycle e.g. three LSL

Shift four is 1-cycle SWAP

You can probably do Shift three with a SWAP and a LSR
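A quick PC-side check of the SWAP plus LSR idea, as a sketch only. `swap_nibbles` models what the SWAP instruction does, and the function names are made up here.

```c
#include <stdint.h>

/* What the AVR SWAP instruction does: exchange the two nibbles. */
uint8_t swap_nibbles(uint8_t x)
{
    return (uint8_t)((x << 4) | (x >> 4));
}

/* Candidate "shift left by three": SWAP followed by one LSR (2 cycles on AVR). */
uint8_t shl3_swap_lsr(uint8_t x)
{
    return (uint8_t)(swap_nibbles(x) >> 1);
}
```

This matches `x << 3` whenever the high nibble of `x` is zero, e.g. 0x0F becomes 0x78. With high-nibble bits set it does not: bit 4 is lost (0x10 gives 0x00, not 0x80) and bits 5 to 7 leak into the bottom (0x20 gives 0x01), so the general case would need an extra mask and still drops bit 4.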

 

In practice,   I only worry about the Uno.    The Mega2560 is just painful.   If people want to use a Mega2560 with a Uno Shield they have to suffer the consequences.

 

MUL might come in handy one day.   But you still have to load the multiplicand in 1-cycle and multiply in 2-cycles.    No real improvement on LSL, LSR and SWAP.   The pin mapping is known at compile time.

 

I have never really thought about this.   But BLD, BST is always 2-cycles per bit and easy to maintain.    Just difficult to persuade GCC to use it.    I have to resort to gobbledygook.

 

David.

Last Edited: Mon. Dec 11, 2017 - 04:48 PM

frog_jr wrote:

sparrow2 wrote:
use MUL as a simple barrel shifter
???

I can see using it as a shifter, but not a "barrel" shifter.

Please enlighten me...

 

He means a shifter that can shift an arbitrary number of bits always taking the same time. I remember, many years ago, when Intel introduced a one cycle barrel shifter with the 80386. It was a great speedup for multibit shifts.
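To make the MUL idea concrete, here is a sketch in portable C of what multiplying by a power of two gives you. It models the 16-bit product that MUL leaves in R1:R0; the function names are invented for the illustration and `n` is assumed to be 0..7.

```c
#include <stdint.h>

/* 8x8 -> 16 multiply by 2^n, as MUL would leave it in R1:R0 (n = 0..7). */
uint16_t mul_shift(uint8_t x, uint8_t n)
{
    return (uint16_t)(x * (1u << n));
}

/* Low byte of the product: logical shift left with zero fill. */
uint8_t shl_via_mul(uint8_t x, uint8_t n)
{
    return (uint8_t)(mul_shift(x, n) & 0xFFu);
}

/* OR of both product bytes: the bits shifted out come back in at the bottom. */
uint8_t rotl_via_mul(uint8_t x, uint8_t n)
{
    uint16_t p = mul_shift(x, n);
    return (uint8_t)((p & 0xFFu) | (p >> 8));
}
```

The cost is the same for every `n`, which is the "barrel" property being described. Taking only the low byte gives a zero-fill shift (0x81 shifted by 1 is 0x02), while ORing in the high byte turns it into a rotate (0x81 rotated by 1 is 0x03).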


---

Last Edited: Mon. Dec 11, 2017 - 09:16 PM

I can't get enough nipples myself :-) 


sparrow2 wrote:
I would say that it's a "simple barrel shifter"
This is probably just semantics...

I would call that a logical shift left (with zero fill).

 

I consider a barrel shifter one that can rotate bits

e.g. MSB(s) rotated to LSB(s)

David (aka frog_jr)


sparrow2 wrote:
let say that OP wanted to low and high nipple

 

Low and high nipple? You mean left and right :P


why are Sparrow2's replies all "---"?  Or is something wrong with my browser??

 

@Kartman - I think I know who you are referring to.  :(


dksmall wrote:
why are Sparrow2's replies all "---"?  Or is something wrong with my browser??

Your browser is most likely just fine. It seems sparrow2 has edited all his posts, eradicating their contents.

 

"Some questions have no answers."[C Baird] "There comes a point where the spoon-feeding has to stop and the independent thinking has to start." [C Lawson] "There are always ways to disagree, without being disagreeable."[E Weddington] "Words represent concepts. Use the wrong words, communicate the wrong concept." [J Morin] "Persistence only goes so far if you set yourself up for failure." [Kartman]