C Style and Efficiency


This problem came up in another thread recently on AVRFreaks. Rather than hijack it, I have started a new one on C coding style and perceived efficiencies. Please treat this post and any discussion it generates as a learning exercise.

The OP in the original thread asked for code to convert an array of 8 bits to a byte value - quite a simple task at first glance. I'm going to present three different solutions (two from the thread plus my own) and discuss each one. Here are the 3 sets of code in a single test case.

#include <stdio.h>

int main(void) {

    unsigned char my_str[] = {0,0,0,0,0,1,1,1};
    unsigned char my_char;
    volatile unsigned char i;

    /* first solution */
    for (i = 0, my_char = 0; i < sizeof(my_str); my_char <<= 1)
        my_char += my_str[i++];
    //printf("%x\n", my_char);

    /* second solution */
    /* note: my_char is not reset here; eight shifts push out any old bits */
    for (i = 0; i < 8; i++) {
        my_char <<= 1;
        if (my_str[i]) my_char |= 0x01;
    }
    //printf("%x\n", my_char);

    /* my solution */
    my_char = 0;
    for (i = 0; i < sizeof(my_str); i++) {
        my_char = (my_char * 2) + my_str[i];
    }
    //printf("%x\n", my_char);
}

If you compile and run this code with the MinGW C compiler on a PC, the first solution gives the answer 0xe instead of 0x7. Why? Because the shift operation in the for loop's increment expression is executed one extra time, after the last bit has been added. This code uses "side-effects" in a for loop, and they can bite you.
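For comparison, here is a minimal side-effect-free rewrite of that first solution (the function wrapper and name are mine, not from the thread). Shifting before adding, inside the loop body, means no extra shift runs after the last bit:

```c
/* sketch: shift first, then add, all inside the loop body,
   so the for loop's increment expression carries no hidden work */
static unsigned char bits_to_byte(const unsigned char *bits, unsigned char len)
{
    unsigned char result = 0;
    unsigned char i;
    for (i = 0; i < len; i++)
        result = (unsigned char)((result << 1) + bits[i]);
    return result;
}
```

With the test array {0,0,0,0,0,1,1,1} this returns 0x07 rather than 0xe.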

If you compile this code with WinAVR you get different results. Note that I use WinAVR 20060124 because it usually has the best optimization. I have not yet had a chance to execute this code on an AVR. Here are the 3 sets of generated code, each 9 words long.

first solution:
  e8:	98 2f       	mov	r25, r24
  ea:	f9 01       	movw	r30, r18
  ec:	27 e0       	ldi	r18, 0x07	; 7
  ee:	81 91       	ld	r24, Z+
  f0:	98 0f       	add	r25, r24
  f2:	21 50       	subi	r18, 0x01	; 1
  f4:	99 0f       	add	r25, r25
  f6:	27 ff       	sbrs	r18, 7
  f8:	fa cf       	rjmp	.-12     	; 0xee 
second solution:
  e8:	f9 01       	movw	r30, r18
  ea:	27 e0       	ldi	r18, 0x07	; 7
  ec:	99 0f       	add	r25, r25
  ee:	81 91       	ld	r24, Z+
  f0:	81 11       	cpse	r24, r1
  f2:	91 60       	ori	r25, 0x01	; 1
  f4:	21 50       	subi	r18, 0x01	; 1
  f6:	27 ff       	sbrs	r18, 7
  f8:	f9 cf       	rjmp	.-14     	; 0xec 
my solution:
  e8:	98 2f       	mov	r25, r24
  ea:	f9 01       	movw	r30, r18
  ec:	27 e0       	ldi	r18, 0x07	; 7
  ee:	99 0f       	add	r25, r25
  f0:	81 91       	ld	r24, Z+
  f2:	98 0f       	add	r25, r24
  f4:	21 50       	subi	r18, 0x01	; 1
  f6:	27 ff       	sbrs	r18, 7
  f8:	fa cf       	rjmp	.-12     	; 0xee 

So given that code size and performance are equal, what separates these 3 solutions? Correctness of the answer, of course, but also code readability and maintainability.

In my view the first solution is hard to read: you have to track what is happening to both the loop variable and the loop body at the same time, because i and my_char are each changed in both places. The answer on my PC is incorrect as well. I haven't tested on an AVR yet, but I suspect it is also incorrect there.

The second solution uses a loop without side effects but then relies on bit manipulation to build the result. This only works if the input array uses base 2, and in my view it is less readable because you have to work out what the bits are doing. I know bit twiddling is part of AVR programming, but there is no need to make the code harder to read than it has to be.

My solution uses a loop without side effects and a simple arithmetic expression to calculate my_char. The same code could be used with a different number base (e.g. 8 or 10) by changing the number 2 to the correct base (or using a constant).
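That generalization might be sketched like this (the function and parameter names are mine, purely illustrative): the accumulate loop converts a digit array in any base once the literal 2 becomes a parameter.

```c
/* sketch: the accumulate loop generalized from base 2 to any base */
static unsigned int digits_to_value(const unsigned char *digits,
                                    unsigned char len, unsigned int base)
{
    unsigned int value = 0;
    unsigned char i;
    for (i = 0; i < len; i++)
        value = value * base + digits[i];
    return value;
}
```

For example, base 2 on {0,0,0,0,0,1,1,1} gives 7, and base 10 on {1,2,3} gives 123.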

Some people try to "optimize" their C code thinking it is going to help. In many cases it actually doesn't. Modern compilers have very good optimization techniques and you should let the compiler do its job rather than try to outguess it. I would humbly suggest that your "job" as a programmer is to produce correct, readable and maintainable code.

If something doesn't perform well, the first thing to look at is your data structures and algorithm. Usually a different choice results in faster code (c.f. bubblesort versus quicksort). This is optimization "in the large".

If you are tight on space then it might be time to go look at the code generated by the compiler and see if you can optimize something. This should be the path of last resort and is optimization "in the small".

--Mike


try the optimized 2nd solution:

 /* second solution - optimized */
   for (i = 0; i < 8; i++) {
      my_char <<= 1;
      my_char |= my_str[i];
   }
   //printf("%x\n", my_char);

Although it looks like solution #1, it has no side effects. It is also equivalent to your solution (a shift left by 1 is the same as a multiply by 2). However, we are talking about "loading bits" here, so I feel the shift-and-add or shift-and-OR operation makes more sense in the context.

[edit]
Also, while the code size may be the same, execution speed is not.

Solutions 1 & 3 both take 66 cycles, while solution 2 takes 73-81 cycles.
Based on the code emitted above, I expect that the "optimized" version of #2 would run at a constant 65 cycles.

I'd be curious to know how this code fares:

   char *ptr = my_str;
   i=8;
   do
   {
      my_char <<= 1;
      my_char |= *ptr++;
      i--;
   }
   while(i);
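Wrapped into a self-contained function for testing (the wrapper and name are mine, not part of the original fragment), the do/while version reads:

```c
/* sketch: the do/while pointer version as a function;
   assumes count > 0, since the body runs before the test */
static unsigned char load_bits(const unsigned char *ptr, unsigned char count)
{
    unsigned char my_char = 0;
    do {
        my_char <<= 1;
        my_char |= *ptr++;
        count--;
    } while (count);
    return my_char;
}
```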

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

Last Edited: Fri. Feb 1, 2008 - 08:16 PM

glitch wrote:
try the optimized 2nd solution:

 /* second solution - optimized */
   for (i = 0; i < 8; i++) {
      my_char <<= 1;
      my_char |= my_str[i];
   }
   //printf("%x\n", my_char);

This generates 8 words of code, which is the best result so far, and it does not have a loop expression with side-effects. I think most people can live with the extra word (and if you use the latest compilers they tend to generate larger code anyhow).

I know we could get into a religious debate here, but I'm a firm believer in writing readable code first and then applying optimizations only when absolutely needed. Because bit twiddling of registers is inherent in controlling an AVR, people tend to use it everywhere else in their code even when it is unnecessary. So which is better:

      my_char <<= 1;
      my_char |= my_str[i];
or
      my_char = my_char * 2 + my_str[i];

or is it just a matter of perspective?

This is just one area of coding style. There are much worse ones, such as assignments in a condition, or using the prefix or postfix increment/decrement operators in conditions or complex statements. These result in confusing and often erroneous code.

--Mike


Check my edit above, I posted another version.

In general I agree with you: code should be written to be readable, and therefore maintainable. However, when you are optimizing, sometimes you have to throw readability out the window in order to achieve better speed, or more compact size.

It's also about context. I actually find the optimized version of #2 more readable/obvious than yours, because what we are doing is a logical operation of loading bits, not an algebraic one of adding values. Had the operation been, say, to take decimal input and store it in a variable, then I would have used your method, showing an algebraic operation instead of a logical one.

[edit] For my "new" version, there is no need to obfuscate with the pointer... I just noticed that the optimizer picked up on the incrementing index and simply incremented the pointer instead of adding an offset each time. The savings in that version should come from how the loop is managed. I would expect the code to look something like:

  movw r30, r18
  ldi  r18, 8
loop:
  add  r25, r25
  ld   r24, Z+
  or   r25, r24
  dec  r18
  brne loop
; total 57 cycles

In the end, when optimizing you need to look at the generated code first, and then make changes to your C code to try and improve on the output.


Mike,

I thought I would actually write some real code instead of trying to post obtuse code.

#include <stdio.h>

char my_str[] = {0,0,0,0,1,0,1,1};
char my_str2[] = {0,1,1,1};
char my_str3[] = {1,0,0,0,1,0,1,1};

unsigned char atob(char *s, char len)
{
	unsigned char result;
	for (result = 0; len--; )
		result += result + *s++;
	return result;
}

unsigned char mike(char *s, char len)
{
	unsigned char i, result;
	for (i = 0, result = 0; i < len; i++)
		result = result * 2 + s[i];
	return result;
}

int main()
{
	printf("atob(\"00001011\", 8) = %02x\n", atob(my_str, sizeof(my_str)));
	printf("atob(\"0111\", 4) = %02x\n", atob(my_str2, sizeof(my_str2)));
	printf("atob(\"10001011\", 8) = %02x\n", atob(my_str3, sizeof(my_str3)));
	printf("mike(\"00001011\", 8) = %02x\n", mike(my_str, sizeof(my_str)));
	printf("mike(\"0111\", 4) = %02x\n", mike(my_str2, sizeof(my_str2)));
	printf("mike(\"10001011\", 8) = %02x\n", mike(my_str3, sizeof(my_str3)));
} 

I ran it with Borland C. Compiled with avr-gcc -Os, atob() was 11 words and mike() was 25 words. Obviously both functions assume the char[] contains 1 or 0, not ASCII characters.

David.


If there is one of these sequences in the program, and it runs once a second, it doesn't really matter whether it is 8 or 12 or 16 words/cycles.

If it is run repeatedly, like over a buffer every second, then one would take care to write it efficiently, and that might vary a lot with the app and the compiler's code generation model. In this particular case, for example, I would take care to ensure that i and my_char ended up in registers, either implicitly due to the compiler's code generation model or explicitly. I use very little local ("automatic") data in my AVR apps, so I wouldn't do that. [That's my heresy; you asked about style.]

No-one used a pointer as the "best" solution? If one could just keep pointing to the source array, then it skinnies down the source quite a bit, though i may still be needed to count.

Speaking of counting, why not just unroll the loop? A little bigger, but certainly faster, and no counting.

Everyone thinks conditional decision on 1/0 is "best"? Hmmmm--if the data is known, then a simple OR will suffice. Or a mask. Is that really always worse than the conditional?

Now, if this is a logic analyzer and has to scream through this stuff, then see if you can entice your compiler to use an index register pair for the pointer and not refresh it every pass. Or simply drop into ASM and unroll the loop. Lessee--Clear result. Set the pointer. 2-cycle LD with post-increment. AND mask the low bit [optional]. OR into result. If no mask is needed then it is about 30 cycles total, unrolled. Alternative method: BLD/BST.
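In C, the unrolled idea just described might look like the sketch below (the function name is mine; the `& 1` mask is the optional step for data that isn't known to be pure 0/1):

```c
/* sketch of the fully unrolled loop: no counter, no branch,
   one optional mask and one OR per bit */
static unsigned char load_bits_unrolled(const unsigned char *p)
{
    unsigned char r = 0;
    r = (unsigned char)((r << 1) | (p[0] & 1));
    r = (unsigned char)((r << 1) | (p[1] & 1));
    r = (unsigned char)((r << 1) | (p[2] & 1));
    r = (unsigned char)((r << 1) | (p[3] & 1));
    r = (unsigned char)((r << 1) | (p[4] & 1));
    r = (unsigned char)((r << 1) | (p[5] & 1));
    r = (unsigned char)((r << 1) | (p[6] & 1));
    r = (unsigned char)((r << 1) | (p[7] & 1));
    return r;
}
```

The obvious trade-off: it only handles exactly 8 bits.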

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


theusch wrote:
No-one used a pointer as the "best" solution? If one could just keep pointing to the source array, then it skinnies down the source quite a bit, though i may still be needed to count.

Actually I did, however it won't make a difference since the emitted code shows that the compiler already did that optimization. (there is no add offset operation, LD Z+ is used instead)

theusch wrote:

Speaking of counting, why not just unroll the loop? A little bigger, but certainly faster, and no counting.

Indeed, and this can be done by telling the compiler to unroll loops in its optimization settings.

theusch wrote:

Everyone thinks conditional decision on 1/0 is "best"? Hmmmm--if the data is known, then a simple OR will suffice. Or a mask. Is that really always worse than the conditional?

Who's everyone? Again, look at my final solution, as well as the optimized #2 version. There is no "if" anywhere to be seen. Though a conditional would be required if you're mapping any value => 1 vs. zero => 0. (Though in assembly you could get rid of the conditional by adding 0xff to the byte and then rotating through carry [I doubt the compiler would perform this optimization], resulting in the same length as the optimized 1/0 solutions.)

theusch wrote:

Now, if this is a logic analyzer and has to scream through this stuff, then see if you can entice your compiler to use an index register pair for the pointer and not refresh it every pass.

The code/compiler is already doing that; no coaxing required.

theusch wrote:

Or simply drop into ASM and unroll the loop. Lessee--Clear result. Set the pointer. 2-cycle LD with post-increment. AND mask the low bit [optional]. OR into result. If no mask is needed then it is about 30 cycles total, unrolled. Alternative method: BLD/BST.

Not sure BLD/BST would be any quicker. You still need the LD Z+ at 2 cycles, plus one each of BST and BLD, making it 4 cycles per iteration. Unrolled, my final solution should run in 32 cycles (4 cycles per iteration [SHL, LD Z+, OR]). Unrolling, however, loses the ability to work with variable bit lengths. (That wasn't a requirement; I'm just pointing out a limitation.)


Quote:

Who's everyone? Again look at my final solution, as well as the 'optimized #2 version.

Had my response in the window for hours, writing a bit whenever I got time. ;)

Quote:

Unrolling however, loses the a bility to work with variable bit lengths. (wasn't a requirement, was just pointing out a limitation)

Well, I guess the title is >>C<< style and efficiency. Style is always going to be "beauty in the eye of the beholder". While >>I<< might think that my style is acceptable-to-good, I might get roundly criticized by the right-thinkers.

The same certainly applies to my "model" of how I structure AVR apps. If it is needed to get the job done fastest, then an entirely different "feel" or "style" might be used. Including dropping into ASM if it yields significant savings.

Anyway, getting back to the quote about variable length. Interesting that we are counting cycles, yet implementing this as a function. Now we are back to compiler-brand "models" of code generation: inlining would reduce the overhead of "calling" the fragment, and as you mentioned, one could ask for unrolling. Do we know that the data is pure, with all bits but the low bit clear? [I can't help thinking that spending one of these loop-guts--a few cycles/words--each time the data was >>created<< would make most of this go away. The bad "style" was storing the data in such a manner in the first place, some could speculate. So now we are trying to make a silk purse out of a sow's ear?] If one relaxes the style-points requirements, one can create multiple entry points into an unrolled solution.

The BLD/BST approach gains when there is a masking requirement (the data bytes are not "pure"), and it does in fact gain because neither the initialization of the result nor the shift is needed, along with eliminating the mask.


Well, if working with impure data but only concerned with bit 0, you could use an LSR/ROL pair to do the work. Same effect/time as BLD/BST. As long as we're working on a byte boundary, there is no need to initialize the result to 0, as any garbage bits will have been shifted out by the end. If working on less than a byte boundary, clearing or masking will need to be performed for both techniques.

bld/bst example

  ld r16, Z+
  bst r16, 0
  bld r17, 7
.
. repeat 6 times, with decrementing bit number
.
  ld r16, Z+
  bst r16, 0
  bld r17, 0

lsr/rol example

  ld r16, Z+
  lsr r16 ; load bit 0 into C
  rol r17 ; shift left, and load C into bit 0
.
. repeat 6 times
.
  ld r16, Z+
  lsr r16
  rol r17
; at this point any data that was in r17 has 
; been shifted out and replaced by the bits
; we shifted in

Now, if we were working with a source bit other than bit 0 or bit 7, BLD/BST would be the better option, just as the shift option is better when working with arbitrary lengths. Each method has its strengths.


theusch wrote:
Well, I guess the title is >>C<< style and efficiency. Style is always going to be "beauty in the eye of the beholder". While >>I<< might think that my style is acceptable-to-good, I might get roundly criticized by the right-thinkers.
Yes, I was trying to have a debate about style and which styles are good, etc. Very quickly we dived into code generation and optimization.
glitch wrote:
In general I agree with you: code should be written to be readable, and therefore maintainable. However, when you are optimizing, sometimes you have to throw readability out the window in order to achieve better speed, or more compact size.
My point was that you don't necessarily have to. We have several examples in this thread where the compiler optimizer did its job and writing extra code like conditionals or using pointers actually didn't make any difference to the generated code and in my opinion obfuscated the source code.
glitch wrote:
It's also about context.. I actually find the optimized version of #2 to be more readable/obvious, than yours. Because what we are doing is a logical operation of loading bits, not an algebraic one of adding values. Had the operation been to say take decimal value input, and store that in a variable, then I would have used your method, showing an algebraic operation, instead of a logical one.
This is a valid point.

I guess it can be hard to discuss this type of thing in a forum because everyone simply expresses an opinion or tries to defend a position. Instead I have added some pointers to well-known references that have stood the test of time.

  • Code Complete by Steve McConnell
  • Obfuscated C and Other Mysteries by Don Libes
  • C Traps and Pitfalls by Andrew Koenig
  • Effective C++ by Scott Meyers (out of print I think)
There are a number of articles on the Internet that use examples from one or more of these books e.g. Best practices for programming in C.

--Mike


If we're listing books, don't forget:

http://www.amazon.com/Writing-So...

(OK, OK, M$ - but it's very good)


mikeperks wrote:

glitch wrote:
In general I agree with you: code should be written to be readable, and therefore maintainable. However, when you are optimizing, sometimes you have to throw readability out the window in order to achieve better speed, or more compact size.
My point was that you don't necessarily have to. We have several examples in this thread where the compiler optimizer did its job and writing extra code like conditionals or using pointers actually didn't make any difference to the generated code and in my opinion obfuscated the source code.

True, the compiler often does a good job, and it did so with the simplistic example we were working with here. But this is by no means guaranteed.

Also, you seemed to have skipped my follow-up comment to that, which hopefully adds more context.

glitch wrote:
In the end, when optimizing you need to look at the generated code first, and then make changes to your C code to try and improve on the output.

So yes, optimizing can, and usually does, obfuscate the code, but in certain cases it is necessary. However, you cannot assume a certain construction will result from a certain source sequence (unless you are very familiar with your compiler, and even then...). As such, you need to look at the generated code to guide you in where and what to change in the source to achieve better output.

I'm a firm believer that one should only optimize where absolutely necessary, because the compiler usually does a very good job on its own, unless the initial code is very poorly constructed.

Having said that, to say that changes in the code make no difference is not true, and I showed that by counting the actual run-time cycles of each method. While they all generated the same number of words, they had quite different execution times. So if your goal is speed, making some of those subtle changes could result in a substantial increase in performance if the routine is called often.

In the end it really depends on what your optimization goal is: speed or size. I was always looking at it as a speed issue, while you seemed to be looking at it as a size one, since you only ever looked at word count. You quite often have to trade one for the other, though in this case the size remains relatively constant and only speed is affected.


It drives me crazy; why on earth would anybody do such screwy things? :shock:

Why are the bits stored as one byte per bit in the first place?

I would ask first about the practical usage before solving a task.

So I wonder why nobody asked this important question.

E.g. if you build a software UART or another serial interface, it is much more clever to shift each bit in as soon as it arrives, instead of wasting memory, code space and time (and the brains of the forum). :wink:
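A sketch of that shift-as-it-arrives idea (the function and names are mine; in a real soft UART this would be called from the bit-sample interrupt):

```c
/* sketch: accumulate each sampled bit into the byte immediately,
   MSB first, instead of storing one byte per bit and converting later */
static unsigned char shift_in(unsigned char acc, unsigned char sampled_bit)
{
    return (unsigned char)((acc << 1) | (sampled_bit & 1));
}
```

After eight calls the byte is complete, with no array, no conversion loop, and no extra RAM.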

Peter


Quote:

Why are the bits stored as one byte per bit in the first place?

I would ask first about the practical usage before solving a task.

So I wonder why nobody asked this important question.


The Emperor has no clothes, eh, Peter? Well, great minds do think alike--if you look at one of my replies above...
Quote:

[I can't help thinking that one of these loop-guts--a few cycles/words-- each time when the data was >>created<< would make most of this go away. The bad "style" was to store the data in such a manner in the first place, some could speculate. So now we are trying to make a silk purse out of a sow's ear?]


Quote:
True the compiler often does a good job, and it did so with the simplistic example we were working with here. But this is by no means guaranteed.

Indeed. I tried the following code (as a way to handle "unpure" data, where anything not 0 is treated as 1):

unsigned char my_str[] ={0,0,0,0,0,1,1,1}; 
unsigned char my_char = 0; 
unsigned char *pchar;
unsigned char mask; 
pchar = my_str;
for (mask = 0b10000000; mask != 0; mask >>= 1)
{
    if (*pchar++)
    {
        my_char |= mask;
    }
}

But instead of testing for mask being equal to 0, the compiler saw that the loop would execute 8 times, so it tested for a final condition of pchar being 8 greater than my_str. This resulted in:

lsr mask
cp pcharLow, endLow
cpc pcharHigh, endHigh
brne next

As you can see, the cp and cpc are totally unnecessary: on the same pass that the pointer compare terminates the loop, the lsr would also have shifted mask down to 0.

Regards,
Steve A.

The Board helps those that help themselves.


Quote:
Why the bits are stored as one byte per bit at first?

I would ask at first for the practical usage, before I solve a task.

I would imagine that the original thread started as a homework question.

A practical use would be for a function:

unsigned int atob(char binary_string[]);

However the only time that you tend to want an atob() function is in an expression evaluator for an assembler. In which case you commonly parse the radix and use a universal function like strtoul().

I would be very surprised if the efficiency of this sort of function would ever affect a real application. It does keep the forum busy for a while though.
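For ASCII input, strtoul() with radix 2 already does the whole job, so the hand-rolled loop is only needed for the 0/1 byte-array form. A minimal sketch (the wrapper name is mine):

```c
#include <stdlib.h>

/* strtoul() parses an ASCII binary string directly when given radix 2 */
static unsigned long parse_binary(const char *s)
{
    return strtoul(s, NULL, 2);
}
```

For instance, parse_binary("00000111") yields 7.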

David.


In the real world I likely wouldn't optimize such a function (actually, I'd likely never have such a construct). But it is fun, sometimes, to optimize code posted in the forums to generate the smallest or fastest output. It's like a mini vacation from all the usual questions ;) It also usually spawns a good conversation, which hopefully everyone can benefit from.


glitch wrote:
In the end, when optimizing you need to look at the generated code first, and then make changes to your C code to try and improve on the output.
I agree with this previous comment. Sometimes the compiler surprises me and generates bigger code than I would have thought, and other times it optimizes so well (like combining return clauses, not just statements).

glitch wrote:
In the real world, I likely wouldn't optimize such a function (actually I'd likely never have such a construct) But it is fun, sometimes, to optimize the code posted in the forums to generate the smallest, or fastest output. It's like a mini vacation from all the usual questions ;) It also usually spawns a good conversation, which hopefully everyone can benefit from.
Absolutely agree again. I was using this example as a case study only. I wouldn't write code this way either, or even name the variables so poorly - I just kept the original names to make the transition easy. I think we all probably learned something about writing code naturally first, rather than trying to outguess the compiler optimizer, and then examining the generated code if needed when performance or space becomes an issue. Otherwise leave well alone.

Lately I have been working on a C-based stub bootloader that must fit into 128 words, so every instruction counts. Reading a lss file with inlined and optimized functions is "interesting" to say the least - but that is a different story.

--Mike