Optimizing Flexible I/O


I have run into an interesting issue ( entirely of my own making of course ). My most recent project has a number of identical modules that each tie to different output pins, but should be processed in the same way. Now for the problem. Being sometimes overly "correct" in my software designs, I have decided to try to completely separate the module data from the processing logic - including the I/O logic. That leaves me in the untenable situation of being required to pass the I/O register data into the processing logic. I do this in the form of a structure:

typedef uint8_t io_register_t;

typedef struct {
	volatile io_register_t * const out;
	volatile io_register_t * const in;
	volatile io_register_t * const dir;
	const uint8_t mask;
	const bool inverted;
} xio_general_io_t;

When the processing logic needs to access the I/O, it sends the structure to a function:

xio_general_io_t output = CreateXIO( D, 1 );
xio_output( &output );
xio_write( &output, true );

This works beautifully and is quite flexible, as it gives full control over the I/O. It even optimizes quite nicely when used directly. It is when it is passed in as part of a larger structure that things get less than optimal. Certainly, it boils down to the fact that the compiler can rarely optimize through multiple levels of pointer indirection; but, unfortunately, optimization is simply not a word I would be inclined to use for the results I am getting.
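For readers following along, here is a minimal sketch of what accessors such as xio_output and xio_write could look like for the structure above. The function bodies are assumptions (they are not shown in this post), and fake_port_d, fake_pin_d and fake_ddr_d are hypothetical ordinary variables standing in for PORTD, PIND and DDRD so that the sketch also compiles on a host.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint8_t io_register_t;

typedef struct {
	volatile io_register_t * const out;
	volatile io_register_t * const in;
	volatile io_register_t * const dir;
	const uint8_t mask;
	const bool inverted;
} xio_general_io_t;

/* Hypothetical stand-ins for PORTD, PIND and DDRD. */
static volatile io_register_t fake_port_d, fake_pin_d, fake_ddr_d;

/* Assumed behaviour: configure the pin(s) selected by mask as outputs. */
static inline void xio_output( const xio_general_io_t *io )
{
	*io->dir |= io->mask;
}

/* Assumed behaviour: drive the pin to a logical level, honouring inversion. */
static inline void xio_write( const xio_general_io_t *io, bool level )
{
	if ( level != io->inverted )
		*io->out |= io->mask;
	else
		*io->out &= ( io_register_t ) ~io->mask;
}
```

When the structure is a compile-time constant visible at the call site, the compiler can fold all of this into a single bit set or clear; once the structure is reached through another pointer, that information is no longer available to it.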

I'm not really looking for a way to trick the compiler into producing better code - I was asking for it by choosing the most flexible ( if ridiculously indirect ) method I could. Really I am wondering if other people have run into similar situations and, if so, how they dealt with them. Is there a clean way to follow good software engineering practices without excessive overhead?

Martin Jay McKee

As with most things in engineering, the answer is an unabashed, "It depends."


Can you read the IO port pointer from the struct into temp variables and access it via that?
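A rough sketch of that idea, assuming a hypothetical camera_t that nests a pin descriptor (these names are illustrative and not from the original code). Because the register pointer is a volatile uint8_t pointer, a store through it may alias the structure itself, so the compiler generally has to reload the fields on every pass unless they are copied into locals first.

```c
#include <stdint.h>

typedef struct {
	volatile uint8_t *out;   /* output register, e.g. &PORTD on an AVR */
	uint8_t mask;
} pin_ref_t;

typedef struct {
	pin_ref_t shutter;       /* nested descriptor, reached through cam */
	uint8_t pulses;
} camera_t;

/* Pulse the shutter line cam->pulses times; returns the number of pulses. */
static uint8_t camera_pulse( const camera_t *cam )
{
	/* Copy once into locals so the loop need not reload them through cam. */
	volatile uint8_t * const out = cam->shutter.out;
	const uint8_t mask = cam->shutter.mask;
	const uint8_t n = cam->pulses;

	for ( uint8_t i = 0; i < n; i++ ) {
		*out |= mask;
		*out &= ( uint8_t ) ~mask;
	}
	return n;
}
```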

I have run into something similar, where I have single code to use multiple buses, but I am lazy and just have globals for IO port pointer and bus bits, not a pointer to struct that defines ports and bits. Still, it is a pain to run the buses at different speeds, so I may have to duplicate code for two different speeds, although it is simple to just have slow code for all buses.


millwood wrote:
I didn't find the code creating any excessive overhead but thought that it wasn't that friendly to use, for me.

I'll see if I can't construct a compromise application to show the overhead - as I say, it optimizes to direct sbi/cbi accesses if I use it too directly...

I can definitely see advantages to choosing a different interface for the peripherals, however, in this case I am controlling still cameras, I just need two I/O lines ( focus and shutter ), so I don't have any reasonable recourse to a different interface - it's just plain digital I/O. As the code is now I am controlling two, but I would like it to be expandable up to, say, a dozen without having to touch the main processing logic.
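To make the goal concrete, here is a minimal sketch of the kind of layout I mean, with simplified, hypothetical types ( line_t, camera_io_t ) and ordinary variables standing in for the port registers. The point is that adding a camera should only mean adding a row to the table.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
	volatile uint8_t *out;
	uint8_t mask;
} line_t;

typedef struct {
	line_t focus;
	line_t shutter;
} camera_io_t;

/* Hypothetical stand-ins for two AVR output registers. */
static volatile uint8_t fake_port_b, fake_port_d;

/* One row per camera; adding a camera means adding a row here. */
static const camera_io_t cameras[] = {
	{ { &fake_port_b, 1U << 0 }, { &fake_port_b, 1U << 1 } },
	{ { &fake_port_d, 1U << 5 }, { &fake_port_d, 1U << 6 } },
};

#define CAMERA_COUNT ( sizeof cameras / sizeof cameras[0] )

static void line_set( const line_t *line )
{
	*line->out |= line->mask;
}

/* Processing logic: assert focus, then shutter, on every camera. */
static void cameras_trigger_all( void )
{
	for ( size_t i = 0; i < CAMERA_COUNT; i++ ) {
		line_set( &cameras[i].focus );
		line_set( &cameras[i].shutter );
	}
}
```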

Martin Jay McKee


Well, here are a couple of tests that show both the overhead and ( sometimes ) excellent optimization. xio_test.c does simple direct I/O access. The reference version compiles to 76 bytes and the structure-based version to 78 bytes. I haven't looked at the .lss files as I don't really care about a single word of memory, but they function the same. The more complicated xio_test_2.c is more interesting. The compiler is not able to optimize the accesses as it is in the direct case. The reference version compiles to 130 bytes of program space. The structure-based version, however, compiles to 372 bytes of flash and 42 bytes of SRAM ( because it cannot optimize out the created structure variables ). So in this case the structure-based approach is nearly three times the size and has a fairly hefty SRAM footprint. Again, function is identical.

This was compiling with avr-gcc 4.3.4 ( bingo's script ) and binutils 2.20. Certainly it would be possible to reduce memory overhead by creating special purpose structures for input and output, but that would also reduce the flexibility without solving the difficulty with the compiler being unable to optimize the accesses in the first place.

Martin Jay McKee


So here's how it stands now. I changed the structure to hold a base register, and the functions calculate the register address from that with an offset. The result is a slight increase in flash usage ( 372 bytes -> 378 bytes ) on xio_test_2.c, but a very reasonable reduction in SRAM usage ( 42 bytes -> 26 bytes ). I was also thinking that generalizing to be able to use multiple bits in each structure might be a good idea, so I took care of that while I was making the other change - the code size change is the result of the two. I guess the last major question involves the use of inline; if I have a chance tomorrow I'll move the functions into a .c file and see what happens if I remove inline. As far as CreateXIO and its use of pins instead of masks, I converted CreateXIO to use masks and added a CreateXIOPin macro that takes the port, the pin, and whether the pin is inverted, and calculates the appropriate masks - so both options are available.

Martin Jay McKee


I reuse source code rather than object code:

play_with_b1.c:
#include "play_with_b1.h"
#define PLAYMATE_LET B
#define PLAYMATE_BIT 1
#define NAME play_with_b1
#include "play_with_template.c"

play_with_d6.c:
#include "play_with_d6.h"
#define PLAYMATE_LET D
#define PLAYMATE_BIT 6
#define NAME play_with_d6
#include "play_with_template.c"

play_with_template.c:
// #define pval, pclr, etc.
#include "port_bits.h"

static unsigned char count;

void NAME(void)
{
    // clear PLAYMATE pin after it is hi 100 times
    if(pval(PLAYMATE) && ++count > 100)
        pclr(PLAYMATE);
}
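
For anyone wondering what port_bits.h might contain, here is one possible sketch. The real header is not shown, so the macro bodies and the PB_CAT/PB_XCAT helper names are assumptions, and PINB/PORTB are declared as plain variables only so that this compiles on a host; on an AVR they would come from <avr/io.h>.

```c
#include <stdint.h>

/* Host stand-ins; on an AVR these names come from <avr/io.h>. */
static volatile uint8_t PINB, PORTB;

/* Two-level paste so PLAYMATE_LET expands to B before it is glued to PIN/PORT. */
#define PB_CAT( a, b )   a##b
#define PB_XCAT( a, b )  PB_CAT( a, b )

#define pval( name )  ( PB_XCAT( PIN,  name##_LET ) & ( 1U << name##_BIT ) )
#define pset( name )  ( PB_XCAT( PORT, name##_LET ) |= ( uint8_t )( 1U << name##_BIT ) )
#define pclr( name )  ( PB_XCAT( PORT, name##_LET ) &= ( uint8_t ) ~( 1U << name##_BIT ) )

#define PLAYMATE_LET B
#define PLAYMATE_BIT 1

static unsigned char count;

void play_with_b1( void )
{
	/* clear the PLAYMATE pin once it has been seen high more than 100 times */
	if ( pval( PLAYMATE ) && ++count > 100 )
		pclr( PLAYMATE );
}
```

Since the port letter and bit number are known when each copy is compiled, every access can reduce to a fixed-address instruction.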

"SCSI is NOT magic. There are *fundamental technical
reasons* why it is necessary to sacrifice a young
goat to your SCSI chain now and then." -- John Woods


millwood wrote:
the issue with using offsets is that you may run into a case where the offsets may not be the same.

Yes, I thought of that... Considering the reduction in SRAM usage, I felt it was too good to pass up, but I restructured the code slightly as well.

typedef struct {
	volatile xio_register_t * port_base;
	xio_register_t mask;
	xio_register_t invert_mask;
} xio_general_io_t;

#define CreateXIOPin( port, pin, inverted )\
{\
	&PIN##port,\
	1U << ( pin ),\
	( 1U << ( pin ) ) & ( ( inverted ) ? ( ( xio_register_t ) -1L ) : ( ( xio_register_t ) 0L ) )\
}

#define CreateXIO( port, mask, invert_mask )\
{\
	&PIN##port,\
	( mask ),\
	( invert_mask ) & ( mask )\
}

//
// Hardware Register Access
//
#define XIO_OUTPUT_REGISTER_OFFSET 2
#define XIO_DIRECTION_REGISTER_OFFSET 1
#define XIO_INPUT_REGISTER_OFFSET 0

#define XIO_OUTPUT_REGISTER( io ) ( ( io )->port_base + XIO_OUTPUT_REGISTER_OFFSET )
#define XIO_DIRECTION_REGISTER( io ) ( ( io )->port_base + XIO_DIRECTION_REGISTER_OFFSET )
#define XIO_INPUT_REGISTER( io ) ( ( io )->port_base + XIO_INPUT_REGISTER_OFFSET )

By wrapping the accesses to the registers in macros it is possible to write all the actual access code without worrying about the definition of the registers in the structure. Though there is no conditional compilation framework at the moment, it would be easy enough to modify either the offsets or the whole access structure. By simply redefining the structure, the register access macros and the CreateXIO/CreateXIOPin macros, it would be easy to deal with register addresses that don't follow a pattern.
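
As an illustration, access functions written against those macros could look roughly like the sketch below. The function bodies are a simplified assumption rather than the actual code, and fake_port is a hypothetical three-byte array standing in for one port's PINx, DDRx and PORTx registers ( in that order ) so that it can be tried on a host.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint8_t xio_register_t;

typedef struct {
	volatile xio_register_t * port_base;
	xio_register_t mask;
	xio_register_t invert_mask;
} xio_general_io_t;

#define XIO_OUTPUT_REGISTER_OFFSET 2
#define XIO_DIRECTION_REGISTER_OFFSET 1
#define XIO_OUTPUT_REGISTER( io ) ( ( io )->port_base + XIO_OUTPUT_REGISTER_OFFSET )
#define XIO_DIRECTION_REGISTER( io ) ( ( io )->port_base + XIO_DIRECTION_REGISTER_OFFSET )

/* Hypothetical stand-in for PINx, DDRx, PORTx at consecutive addresses. */
static volatile xio_register_t fake_port[3];

/* Configure every bit in mask as an output. */
static inline void xio_output( const xio_general_io_t *io )
{
	*XIO_DIRECTION_REGISTER( io ) |= io->mask;
}

/* Write one logical level to every bit in mask; inverted bits get the opposite. */
static inline void xio_write( const xio_general_io_t *io, bool level )
{
	volatile xio_register_t * const out = XIO_OUTPUT_REGISTER( io );
	const xio_register_t logical = level ? io->mask : 0;
	const xio_register_t bits = ( xio_register_t )( logical ^ io->invert_mask );

	/* Leave bits outside mask untouched. */
	*out = ( xio_register_t )( ( *out & ~io->mask ) | bits );
}
```

Only the macros know where the registers live, so a part with a different register layout would need new macro definitions but no change to these functions.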

skeeve wrote:
I reuse source code rather than object code:

Somehow I've just never been comfortable doing it that way. I suppose I am too attached to the compiler's ability to check everything through the static type system to depend on the preprocessor for much. If so, it is probably a direct result of the hours I have spent ( not so unenjoyable as you might think ) elbow deep in template-heavy C++ code. Still, there is absolutely no doubt that the code inclusion route will end up with code much less costly than what I'm seeing at the moment.

Martin Jay McKee


mckeemj wrote:
skeeve wrote:
I reuse source code rather than object code:

Somehow I've just never been comfortable doing it that way. I suppose I am too attached to the compiler's ability to check everything through the static type system to depend on the preprocessor for much. If so, it is probably a direct result of the hours I have spent ( not so unenjoyable as you might think ) elbow deep in template-heavy C++ code. Still, there is absolutely no doubt that the code inclusion route will end up with code much less costly than what I'm seeing at the moment.
Before I trusted the preprocessor, I often used code generation to accomplish the same thing. The code generators were written in Python.

Heavy-duty template code can be very interesting.
May you live in interesting times.
