[C,ASM] check CARRY flag from C

Last post
55 posts / 0 new

Pages

Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hi all : )

I'm looking how to read CARRY flag from SREG after certain operation - LEFT SHIFT.

Sample code is following:

char carry;
char data;

data = data<<1;
// There I need to read bit that was shifted out into carry

It is clear to me this is job for inline ASM, but I never work with ASM in C.

Can you give me some advice?

David

David Sedláček

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Do not even dream of accessing the Carry flag in C. It is not guaranteed to be valid in C.

If you want to know whether a carry is shifted out, you simply:

char data;
// just check bit 7 before the shift.
char carry = (data & 0x80) != 0;
data = data<<1;

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

David's reply covers the vast majority of possible uses, but if you believe you need better and you tell us why do you think you need the carry bit, we might come up with alternative solutions.

Jan

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Generally, I'm trying to write MFM coder as simple as it is possible. Input is byte array, output is MFM-coded stream.

If it helps you, there is whole procedure - I'm not saying that is the best solution, but until now i haven't wrote better solution.

Base idea, if I understood it OK, is that data bits (x,y,z,...) are coded to (x,x NOR y, y NOR z, z, z NOR .......)

So I came with this solution. David, your code will work of course, but I think variant with use of CARRY flag will save some instructions.

void FDDIO_encodeMfm() {
    char carry = 0;
    char prvek = FDD_sector[0];
    prvek = (prvek << 1);
    //!!! carry = ?; !!!
    // Data x
    FDD_sectorMfm[0] |= carry;
    
    for(int i = 0; i < 658;i++) {
        prvek = FDD_sector[i];
        for(c = 0; c < 9; c++) {
            // CLOCK  x NOR y
            FDD_sectorMfm[i] = (short) (FDD_sectorMfm[i] << 1);
            FDD_sectorMfm[i] |= carry;
                if(c==8 && i!= (658-1)) prvek = FDD_sector[i+1];
            prvek = (prvek<<1);
            //!!! carry = ?; !!!
            FDD_sectorMfm[i] |= carry;
            FDD_sectorMfm[i] ^= (1<<0); 
            
            // DATA y
            if(c < 8) {
                FDD_sectorMfm[i] = (FDD_sectorMfm[i] << 1);
                FDD_sectorMfm[i] |= carry;	// bit 14 16bit slova, 15. je pro ulozeni clocku ktery uz souvisi s dalsim prvkem
            }
        }
    }
}

David Sedláček

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Forgotten global defs:
short FDD_sectorMfm[658];
char FDD_sector[512];

.. I have tested almost the same code on PC (java) and it works OK..

David Sedláček

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

You never assign carry apart from initialising to zero.

I should just write the algorithm to be accurate. Do not worry about the efficiency.

Once you are working, by all means optimise it.

I notice that you have multiple array references. Some compilers may spot this. OTOH, you might just as well introduce a temporary variable.

Note that you can equally well use a 16-bit int. Then the 'carry' is in bit 8.

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
            prvek = (prvek<<1); 
            carry = SREG & (1<<7) != 0; 

This could give you the current state of the carry bit, but you must realize what "current" means. It is certainly not the state of the carry bit immediately after the shift since, at the very least, the asignment to prvek has happened since then. There could also be several opcodes run in preparation of reading SREG that might have changed its value. And worst case, an interrupt might have occurred in between which means hundreds or even thousands of opcodes could have been run.

Davids solution of using a 16 bit int sounds easiest to me (however, you need to change the declaration of prvek to unsigned char, which is probably advisable anyways):

unsigned int temp = prvek << 1;;
prvek = temp;
carry = temp >> 8;

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
And worst case, an interrupt might have occurred in between which means hundreds or even thousands of opcodes could have been run.
If you use a compiler that does not preserve SREG in interrupts, then you should consider to use a different one. ;-)

(of course the central point "reading Carry from SREG in C is pointless" is true nevertheless)

Stefan Ernst

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Well, you can read SREG as any other IO register, but this will not bring you any closer with your

Quote:
I'm trying to write MFM coder as simple as it is possible (..)with GCC

idea because GCC does not understand SREG flags and their brbx functionality.

No RSTDISBL, no fun!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hey guys I dont want to work with 16bit integer unless it is not efficient. There must be way to do it in inline ASM this is what I'm asking for : ).
I'm not saying it must be only SREG bit saving line, but what if whole left shift and then SREG bit save section, or more, will be in inline ASM?

Koshchi it is as you wrote - there is some operation(s) after LSHIFT and it is not guaranteed that in CARRY will be correct val.

Again, see this in pseudocode and compare efficiency:

prvek <<= 1;
prvekMfm |= (SREG&1);

... which I need to convert into inline ASM or generic way:

carry = (prvek >>7);
prvek <<= 1;
prvekMfm |= carry;

:?: 8)

David Sedláček

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Have you actually looked at what code the compiler generates for an implementation in C? Do that, and then tell what needs further optimization.

"Some questions have no answers."[C Baird] "There comes a point where the spoon-feeding has to stop and the independent thinking has to start." [C Lawson] "There are always ways to disagree, without being disagreeable."[E Weddington]

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
which I need to convert into inline ASM
Why you "need to"? What is wrong with a C solution? With all the other stuff in that function do you really think a few cycles matter?

Stefan Ernst

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

You do not need to do all calculations in 16-bit. Simply do a cast whenever you want to 'find' the carry.

Forget about efficiency. Get your algorithm working correctly.

Quite honestly, the C compiler will translate straightforward code very well. Only if it needs extra performance do you ever look at the ASM.

A human can often spot the odd peephole optimisation that will make a significant improvement. But believe me, the Compiler has done most of them already.

The main advantage of the 'tweak generated code' approach is that you have already got a working algorithm. So your testing regime only has to compare against the original.

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I need this it running as fast as it is possible. There is working AVR 8bit algorithm.

void FDDIO_encodeMfm() {
    
    unsigned int prvekMfm = FDD_sectorMfm[0];
    unsigned char prvek = FDD_sector[0];
    unsigned char carry = 0;
    prvek <<= 1;
    carry = ((prvek&128)>>7);
    // Data x
    prvekMfm |= carry;
    
    for(int i = 0; i < 524;i++) {
        prvek = FDD_sector[i];
        prvekMfm = FDD_sectorMfm[i];
        for(c = 0; c < 9; c++) {
            // CLOCK  x NOR y
            prvekMfm <<= 1;
            prvekMfm |= (uint8_t) carry;
            if((c==8) && (i < (524-1))) {
                prvek = FDD_sector[i+1];
            }
            carry = ((prvek&128)>>7);
            prvek <<= 1;
            prvekMfm |= carry;
            prvekMfm ^= (1<<0);
            
            // DATA y
            if(c < 8) {
                prvekMfm <<= 1;
                prvekMfm |= carry;	// bit 14 16bit slova, 15. je pro ulozeni clocku ktery uz souvisi s dalsim prvkem
            }
        }
        FDD_sectorMfm[i] = prvekMfm;
    }
}

It counts about 24ops to code 1 data bit with -O3... Data bits are streamed from FDD_sector[] to FDD_sectorMfm[]. I believe there is faster way.

David Sedláček

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
   unsigned int prvekMfm = FDD_sectorMfm[0]; 
    unsigned char prvek = FDD_sector[0]; 

Exactly how do you think you are going to use arrays that have no elements?

 carry = ((prvek&128)>>7); 

You say that you want to be as fast as possible, then you choose the most inefficient method suggested in this thread. Besides that, you get the bit after that bit has already been lost.

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Koshchi wrote:

   unsigned int prvekMfm = FDD_sectorMfm[0]; 
    unsigned char prvek = FDD_sector[0]; 

Exactly how do you think you are going to use arrays that have no elements?

Steve, you should rethink that. ;-)

Stefan Ernst

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

Exactly how do you think you are going to use arrays that have no elements?

Steve? he's just assigning element 0 from those two global arrays to each of prvekMfm and prvek surely?

BTW:

 carry = ((prvek&128)>>7);

Haven't checked but what does:

 carry = !!(prvek&128);

yield?

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

clawson wrote:

 carry = ((prvek&128)>>7);

Haven't checked but what does:

 carry = !!(prvek&128);

yield?

gcc supports C99, so also worth a check is:
 carry = (_Bool)(prvek&128);

Stefan Ernst

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I know nothing about MFM. But I am sure there will be algorithms in the public domain.

I would also guess that the whole operation can be done with one or two XORs.

You just need to ask Uncle Google.

I have not looked at the generated ASM. But most compilers will do this in about three instructions.

    if (val & 128) carry = 1;
    else carry = 0;

Personally, I think you have written your algorithm fairly clearly (for the compiler). I have made no attempt to understand it myself (human). I just guess that there are better algorithms.

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I have tried all variants mentioned there but
carry = ((prvek&128)>>7); is really the fastest.

    47:     carry = ((prvek&128)>>7);
MOV R26,R18		Copy register 
LSL R26		Logical Shift Left 
MOV R19,R26		Copy register 
ROL R19		Rotate Left Through Carry 
CLR R19		Clear Register 
ROL R19		Rotate Left Through Carry 

In short it seems in 6 ops R19 contains CARRY, which will be ORed with desired var later (7ops). What I want is to OR this var straight with CARRY (1op)
>>> the same as "ROL" op works.

I believe this is temporary solution until I find how to save CARRY straight to register with desired var. In better some GCC/ASM geek there will tell me how to do it : )

Unfortunatelly MFM coding is very obsolete and was realized very often only by hardware controllers, so there are not any codes nor avr sample codes. Anyway it is really old thing and is described in datasheets from ~70's....

David Sedláček

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If it's inevitable for you: here are the lines that save your life:

unsigned char carry (unsigned char c)
{
    static unsigned char cy;
    asm ("lsl %0 $ clr %1 $ rol %1"
         : "+r" (c), "=r" (cy));
         
    return c;
}

avr-gcc compiles that to 5 instructions, including the STS and the RET. And if you just need a value != 0 for the carry, you can do

unsigned char carry (unsigned char c)
{
    static unsigned char cy;
    asm ("lsl %0 $ sbc %1,%1"
         : "+r" (c), "=r" (cy));
         
    return c;
}

You then can subtract the carry because it is always 0 or −1.

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Why not do the whole bit loop in ASM ? It's not long, or overly complicated in ways that make it easier to write in C than in assembler.

Is this for one of those projects that connects modern storage (flash card) to an old computer's floppy interface? I find several references to such projects on the net (none particularly complete, alas.)

Are you delivering the bits in real time to an IO port, or just making an intermediate copy? If latter, isn't that wasting a lot of time unpacking and repacking bits only to unpack them again later? (assuming that they'll be transmitted eventually...)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

There's not much more in that function than the bit loop, so it might be easier to write the whole function in asm as a separate asm source. There is a chapter in avr-libc manual on how to do that - read also the links contained there, especially on register usage. It could help to avoid the tedious syntax of inline assembler (which has its own chapter in the avr-libc manual, too).

Then, I would unroll the bit loop, which would help to avoid both the tests and also the 16-bit shifts.

Jan

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Guys I have optimized my code above to not use 16bit operands and achieved decrease from 24ops/bit to 14ops/bit which is great : ))

Thank you all for your time, it seems my idea (carry f) is not real.

David Sedláček

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I missed recent posts .. will think about them later. 8-)

David Sedláček

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I started to look at things from an optimized ASM PoV, and noticed that on AVR, "carry" is not so special as it is in some architectures. The fact is, the CPU can test any bit in a register pretty quickly. And then you can stop doing explicit math when conditional jumps are just as quick. So getting fixated on dealing with the carry bit might not make a lot of sense. I got the following C code that is nice and obvious:

#define LEN sizeof(FDD_sector)

void MFM_enc_new() {

  uint16_t *mfm_outptr = &FDD_MFMnew[0];
  uint8_t *inptr = &FDD_sector[0];
  uint8_t thisbyte;

  for(uint16_t i = 0; i < LEN; i++) {
    uint16_t ow;
    thisbyte = *inptr++;  /* get byte, point to next byte */
    /*
     * 7 bits are easy
     */
    for (uint8_t b=0; b<7; b++) {
	if (thisbyte & (0x80)) {  // 1 --> 10 (next bit don't-care
	    ow += ow + 1; // shift in Data 1
	    ow += ow;     // shift in CLK 0
	} else {
	    ow += ow;    // shift in Data 0
	    if (thisbyte & (0x40)) {  // CLK depends on next bit
		ow += ow;   // Shift in 0 for 01
	    } else {
		ow += ow + 1; // Shift in 1 for 00
	    }
	}
	thisbyte <<= 1;
    }
    /*
     * 8th bit may have to look at the next byte
     */
    if (thisbyte & 0x80) {  // Still shift in 10 for a 1 bit
	ow += ow + 1; // shift in 1
	ow += ow;     // shift in 0
    } else {
	ow += ow;    // shift in 0
	if (*inptr & 0x80) {
	    ow += ow;
	} else {
	    ow += ow + 1;
	}
    }
    *mfm_outptr++ = ow;
  }
}

Analysis is a bit harder, since the code path isn't constant with different data. I think it averages out to less than 14 ops/bit, and there may be some opportunities for additional optimization, cycle-wise.

I ran it over some "lazy dog" data and it produced the same results as your sampled code...

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Wow it looks very interesting. Yes I am working on something that you have mentioned..
I let run your code on 514 elements 8bit array and it took about 59000 ops. This code is about 57000 but there are no your magic yet. Yes, analysis tracking in disasm is very hard on optimized code.

void FDDIO_encodeMfm() {
    
uint8_t prvekMfm;
    uint8_t prvek = FDD_sector[0];
    uint8_t carry;
    
    
    
    for(int i = 0; i < 524;i++) {
		carry = ((prvek&128)>>7);
		prvek <<= 1;
		// Data x
		prvekMfm = carry;
		
		//prvekMfm = FDD_sectorMfm[i];
		for(c = 0; c < 8; c++) {
            // CLOCK  x NOR y
            prvekMfm <<= 1;
            prvekMfm |= carry;
            if((c==7) && (i < (524-1))) {
                prvek = FDD_sector[i+1];
            }
            carry = ((prvek&128)>>7);
            prvek <<= 1;
            prvekMfm |= carry;
            prvekMfm ^= (1<<0);
			if(c == 3) {
				FDD_sectorMfm[i*2] = prvekMfm;
				prvekMfm = carry;
				continue;
			}
            // DATA y
            if(c == 7) {
                 break;
            }
			prvekMfm <<= 1;
			prvekMfm |= carry;	// bit 14 16bit slova, 15. je pro ulozeni clocku ktery uz souvisi s dalsim prvkem
        } // for
		FDD_sectorMfm[i*2+1] = prvekMfm;
		if(i<(524-1)) {
			prvek = FDD_sector[i+1];
		}
    } // for
}

Jan, I tried unroll whole loop but suprisingly it led to op count increase (about 7-10%).

David Sedláček

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Unrolling loops or inlining too much wil lead to run-out of registers and more variables will live in the frame (on stack). That's expensive.

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Sometimes maybe, but this is a textbook example of an algorithm where unrolling brings only benefits. It's not just that you spare the loop counting. but you can also:
- avoid the conditionals
- avoid the 16-bit shifts
- perform the NOR bits' negation at once for all bits in one output byte
- I did not think too much about it, but maybe it would be faster to use sbrs and ori rather than the shifts - this would need to be tried

if (prvek & 128) prvekMFM |= 0b11000000;
if (prvek & 64) prvekMFM |= 0b01110000;
[...]
prvekMFM ^= 0b01010101;

And as I said, this is such a small and simple routine that I see little reason not to write it all in assembler.

Table lookup may be even faster, but I leave that as an excercise to Sith to try.

Jan

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Note that the MFM output is 16bits, which will interfere with straightforward ori/xor-for-nor-negation implementations.

Although, it might be really interesting to do a byte-at-a-time output (4bits input) unrolling to avoid all the 16bit math entirely (at the expense of a extra special-case.) Which would help in either C or assembler, which is usually a good sign... (beware endianness, though!)

(Hmm. And at 3bits plus a special case for each output byte, a lookup table is looking more and more attractive. Although beware that extracting a bitfield is not a very efficient operation in AVR C.)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Here it is unrolled, handling the output bytes separately, and only setting the one bits. Looks like less than 6 instructions per bit.
The compiler does some impressive optimization here, omitting the initialization of the output bytes by noticing the first operation can load the whole byte instead of just doing an "or."

This has been an interesting example of how you don't need to write assembler to get efficient code, but ... you sure need to THINK like you're writing in assembler. I expect that an experienced AVR assembler programmer would have come up with something like this algorithm sooner than we did; it would have been more immediately obvious that looking at explicit bits used fewer instructions than shifting the bits to a common location.

I wonder what the best implementation on one of the 32bit RISC CPUs (ARM, PIC32) would look like. (Hmm. PIC32 looks pretty nice with this source. The compiler apparently notices that it can do destructive "andi" operations on the input byte (in its register), so the lack of a separate bittest instruction doesn't end up hurting.)

/*
 * Compute MFM encoding of one bit ("inbit") in "thisbyte", putting the
 * two output bits in the "output" byte bit positins outbith/outbitl
 * It's assumed that the output was initialized to zero, so we only
 * set the bits that need to be 1.  The code will NOT be small and efficient
 * if the bit numbers are not constants!
 * We could have computed the output bit positions, but it's clearer to
 * just have them specified explicitly
 */
#define dobit(output, inbit, outbith, outbitl)          \
    if (thisbyte & (1<<(inbit))) {                      \
	output |= 1<<(outbith);				\
    } else {                                            \
        if (0 == (thisbyte & (1<<((inbit)-1)))) {	\
            output |= 1<<(outbitl);                     \
        }                                               \
    }

void MFM_enc_new() {

    uint8_t *mfm_outptr = (uint8_t*)&FDD_MFMnew[0];
    uint8_t *inptr = &FDD_sector[0];
    uint8_t thisbyte;

    for(uint16_t i = 0; i < LEN; i++) {
        uint8_t hb=0, lb=0;
        thisbyte = *inptr++;  /* get byte, point to next byte */
        dobit(hb, 7, 7, 6);
        dobit(hb, 6, 5, 4);
        dobit(hb, 5, 3, 2);
        dobit(hb, 4, 1, 0);  

        dobit(lb, 3, 7, 6);
        dobit(lb, 2, 5, 4);
        dobit(lb, 1, 3, 2);
        /*
         * 8th bit (bit0)  may have to look at the next byte
         */
        if (thisbyte & 1) {  // Still shift in 10 for a 1 bit
            lb |= 1<<1;
        } else {
            if (0 == (*inptr & 0x80)) {
                lb |= 1;
            }
        }
        *mfm_outptr++ = lb;             /* output in correct byte order */
        *mfm_outptr++ = hb;
    }
}
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hmmm--I don't see any mentions of BLD/BST pairs, whether in the C generated code or ASM suggestions. But I have to study the MFM thing to see if applicable.

You can put lipstick on a pig, but it is still a pig.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I don't think BLD/BST are useful. The thing is, the AVR architecture can access/test any bit in a register variable in a single instruction. Moving the bits around to carry or the T-bit doesn't make anything faster, it just wastes time moving the bits around. (OK, it might be slightly faster. You can branch based on carry/T setting, while SBRC/SBRS are skips and require one extra instruction (and probably two cycles) to act on a decision.) (they might be more useful for generating SMALL code, which would be a different problem.)
(I would expect most C compilers to have a hard time making use of T, though. Some microcontroller compilers have single-bit variables, but standard C (and gcc in particular) isn't one of them!)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I just compiled westw's algorithm with avr-gcc. It takes 33933 cycles for LEN=514 (and zeroed buffer). 29821 cycles for buffer full of 0xFF, 30849 cycles for buffer of 0xAAs.

I really cannot be bothered to invent any algorithm. But I bet that Uncle Google may find something better for you.

Be realistic. That 30k cycles is some 4ms for an 8MHz AVR. If you want to achieve 40us, you need a 100x improvement on the algorithm. However if you are doing a floppy disk, presumably you get 9 x 5 sectors in a second @ 300 RPM. So it takes 22ms to read a sector.

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

westfw wrote:
This has been an interesting example of how you don't need to write assembler to get efficient code,
I keep repeating - you need to write assembler not to get efficient code, but to gain control over how efficient the code is (the bonus is what you mentioned: that in simple algorithms like this you start to think slightly differently and will see the opportunities to optimize).

uint8_t FDD_sector[] = "lazy dog";

#define LEN sizeof(FDD_sector)

uint16_t FDD_MFMnew[LEN];
uint16_t FDD_MFM_1[LEN];

void MFM_encode_1(uint16_t nrdata, uint8_t * inbuffer, uint8_t * outbuffer);  // in tmp1.S

int main(void) {
  MFM_enc_new();  // 1319 cyc - westfw's first routine
  MFM_enc_new2();  // 574 cyc - westfw's second routine
  MFM_encode_1(LEN, FDD_sector, (uint8_t *)FDD_MFM_1);  // 329 cyc
}

tmp1.S:

	.text
.global	MFM_encode_1
	.type	MFM_encode_1, @function
MFM_encode_1:
     movw   r30, r22   ;Z - input pointer
	 movw   r26, r20   ;X - output pointer
                       ;r25:r24 - input data counter
     ldi    r21, 0b01010101
     ld     r20, Z+
MEX01:
     clr    r18
	 clr    r19
     sbrc   r20, 7
     ori    r18, 0b11000000
     sbrc   r20, 6
	 ori    r18, 0b01110000
	 sbrc   r20, 5
	 ori    r18, 0b00011100
	 sbrc   r20, 4
	 ori    r18, 0b00000111
     sbrc   r20, 3
     ori    r18, 0b00000001
	 sbrc   r20, 3
	 ori    r19, 0b11000000
     sbrc   r20, 2
	 ori    r19, 0b01110000
	 sbrc   r20, 1
	 ori    r19, 0b00011100
	 sbrc   r20, 0
	 ori    r19, 0b00000111
     sbiw   r24, 1
	 breq   MEX02      // last round?
	 ld     r20, Z+  
	 sbrc   r20, 7
	 ori    r19, 0b00000001
	 eor    r18, r21
	 eor    r19, r21
	 st     X+,  r19   // little endian
	 st     X+,  r18
	 rjmp   MEX01
MEX02:
	 eor    r18, r21
	 eor    r19, r21
	 st     X+,  r19   // little endian
	 st     X+,  r18
     ret

(sorry for formatting, it's damn' AS4's editor's maniacal use of tabs)

As I said, table lookup may be faster, but that's still left for Sith as a homework ;-)

Jan

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Nice!
So that's what you meant by saving the NOR inversion...

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

westfw wrote:
Nice!
Thanks.

Tleskejte mi, mám to rád! ;-)

westfw wrote:
So that's what you meant by saving the NOR inversion...
On the other hand, it took me some time to get your deMorganisation of that NOR into the conditionals... :-P

Jan

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

However if you are doing a floppy disk, presumably you get 9 x 5 sectors in a second @ 300 RPM. So it takes 22ms to read a sector.

Since this is 2011, I'm guessing OP is not really dealing with floppy disks. In poking at this, my guess is that it is a magnetic stripe on a card. If so, the number of "sectors" is probably quite limited.
http://www.aimglobal.org/technologies/card/mfmencoding.asp
"Modified Frequency Modulation (MFM) -- Encoding for Magnetic Stripes"

Lee

You can put lipstick on a pig, but it is still a pig.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

It would be interesting to know what the actual application is.

If I have done my sums correctly, you read a sector, decode it, put it in a buffer, ...
If you interleave and stagger the sectors, you have finished your housekeeping before the next sector comes by.
If you get the staggering wrong, you have to wait a whole new revolution (200ms) to catch the sector.

A hard disk rotates faster and has more physical sectors to the track. Your CPU power needs to be better.

If this is a card reader, I doubt if speed is ever going to be relevant. Mind you, these youngsters can read a 16x2 LCD at 600 times a second.

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I was curious as to whether the compiler would produce code that was close to wek's, given the same algorithm.
The answer is "close, but not quite" (sorta what you'd expect!):

/*
 * MFM encoding of x,y,z,... is x, x nor y, y. y nor z, ...
 * That means that each bit affects three output bits (but only creates
 * two output bits.)  ORing in a 111 in the proper bit position sets the
 * x/y/z (data) bits, and will create the OR'ed bits in between them (clock
 * bits)  When we're done, we'll invert the clock bits to get the NOR vals.
 */
#define dobit6(output, inbit, bits)                     \
  if (thisbyte & (1<<(inbit))) {                        \
    output |= bits;                                     \
  }

void MFM_enc_new() {

    uint8_t *mfm_outptr = (uint8_t*)&FDD_MFMnew[0];
    uint8_t *inptr = &FDD_sector[0];
    uint8_t thisbyte;

    for(uint16_t i = LEN; i > 0; i--) {
        uint8_t hb=0, lb=0;
        thisbyte = *inptr++;  /* get byte, point to next byte */

        dobit6(hb, 7, 0b11000000);      /* first bit only does two bits */
        dobit6(hb, 6, 0b01110000);
        dobit6(hb, 5, 0b00011100);
        dobit6(hb, 4, 0b00000111);

        dobit6(hb, 3, 0b00000001);      /* bit 3 affects hb and lb */
        dobit6(lb, 3, 0b11000000);      /* bit 3 affects hb and lb */

        dobit6(lb, 2, 0b01110000);
        dobit6(lb, 1, 0b00011100);
        dobit6(lb, 0, 0b00000111);
        if (i != 0) {
          /*
           * 8th bit (bit0)  may have to look at the next byte
           */
          if (*inptr & 0x80) {
            lb |= 1;
          }
        }
        lb ^= 0b01010101;               /* Convert OR'ed clocks to NORs */
        hb ^= 0b01010101;
        *mfm_outptr++ = lb;             /* output in correct byte order */
        *mfm_outptr++ = hb;
    }
}
    for(uint16_t i = LEN; i > 0; i--) {
        uint8_t hb=0, lb=0;
        thisbyte = *inptr++;  /* get byte, point to next byte */
  c8:	81 91       	ld	r24, Z+

        dobit6(hb, 7, 0b11000000);	/* first bit only does two bits */
  ca:	87 ff       	sbrs	r24, 7
  cc:	02 c0       	rjmp	.+4      	; 0xd2 <_Z11MFM_enc_newv+0x14>
  ce:	20 ec       	ldi	r18, 0xC0	; 192
  d0:	01 c0       	rjmp	.+2      	; 0xd4 <_Z11MFM_enc_newv+0x16>
  d2:	20 e0       	ldi	r18, 0x00	; 0
	dobit6(hb, 6, 0b01110000);
  d4:	86 fd       	sbrc	r24, 6
  d6:	20 67       	ori	r18, 0x70	; 112
	dobit6(hb, 5, 0b00011100);
  d8:	85 fd       	sbrc	r24, 5
  da:	2c 61       	ori	r18, 0x1C	; 28
	dobit6(hb, 4, 0b00000111);
  dc:	84 fd       	sbrc	r24, 4
  de:	27 60       	ori	r18, 0x07	; 7

	dobit6(hb, 3, 0b00000001);	/* bit 3 affects hb and lb */
  e0:	83 fd       	sbrc	r24, 3
  e2:	02 c0       	rjmp	.+4      	; 0xe8 <_Z11MFM_enc_newv+0x2a>
  e4:	90 e0       	ldi	r25, 0x00	; 0
  e6:	02 c0       	rjmp	.+4      	; 0xec <_Z11MFM_enc_newv+0x2e>
  e8:	21 60       	ori	r18, 0x01	; 1
  ea:	90 ec       	ldi	r25, 0xC0	; 192
	dobit6(lb, 3, 0b11000000);	/* bit 3 affects hb and lb */
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hello again : )

Guys you have helped me a lot, this is very juicy discussion, isn't it? : ) Currently I don't have much time to be there, but keep rolling : )

westfw, wek: ~ 5-8 ops per bit is much better than I was expecting. Amazing job! :!:
Until now I have not any clue that op like SBRC exists.

For now I have learned that focusing on the one angle of view and try to optimize it is not proper way.

Lookup table is great idea, surely I will think about it.

David Sedláček

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Is it possible to say what the source of this MFM data is? This would give an idea of whether you really need "real time" decode or is it sufficient to "chew on a buffer at a time"

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

westfw wrote:

	dobit6(hb, 3, 0b00000001);	/* bit 3 affects hb and lb */
  e0:	83 fd       	sbrc	r24, 3
  e2:	02 c0       	rjmp	.+4      	; 0xe8 <_Z11MFM_enc_newv+0x2a>
  e4:	90 e0       	ldi	r25, 0x00	; 0
  e6:	02 c0       	rjmp	.+4      	; 0xec <_Z11MFM_enc_newv+0x2e>
  e8:	21 60       	ori	r18, 0x01	; 1
  ea:	90 ec       	ldi	r25, 0xC0	; 192
	dobit6(lb, 3, 0b11000000);	/* bit 3 affects hb and lb */

The compiler is trying to be smart, but in this particular case it did not pay off. It would be interesting to see how would this compile with different vintages of the compiler (I am not quite sure but I believe I used WinAVR20071227 and my results matched David's, who I believe used the 20100110 vintage).

This is, partially, the reason why I preach "gain control through asm".

When I was younger, I used to believe that the compiler constructs several possible asm solutions for a given C source, and then performs a profiling step to chose the optimal one... Naive, am I... ;-)

Jan

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Sure, I'm working on FDD emu - it speaks for itself.

edit: sorry I misunderstood your question - source of the data is the memory card and in simple --- this data would be reproduced in MFM code by MCU straight to FDD controller. But I believe you got it.

edit2: I did not say I i need decode the data in real time, but it is good to have efficient code. 8-)

David Sedláček

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Given the recent confessions re posting subject matter paranoia:

Quote:
Is it possible to say what the source of this MFM data is?

Not only that - it might rule out that this is actually an IAD (Involuntary Auxiliary Device) :wink: for an ATM (Automatic Teller Machine), or some such.

"Some questions have no answers."[C Baird] "There comes a point where the spoon-feeding has to stop and the independent thinking has to start." [C Lawson] "There are always ways to disagree, without being disagreeable."[E Weddington]

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I had to try the lookup table that Jan adviced, otherwise I wouldn't fall asleep. :))

It is really BIG speed up - it running 3.2ops per bit :O.

Used code:

const uint16_t MFM_lookupTable[256] = {
	0x5555, 0x5552, 0x5549, 0x554A, 0x5525, 0x5522, 0x5529, 0x552A, 0x5495, 0x5492, 0x5489, 0x548A, 0x54A5, 0x54A2, 0x54A9, 0x54AA, 0x5255, 0x5252, 0x5249, 0x524A, 0x5225, 0x5222, 0x5229, 0x522A, 0x5295, 0x5292, 0x5289, 0x528A, 0x52A5, 0x52A2, 0x52A9, 0x52AA, 0x4955, 0x4952, 0x4949, 0x494A, 0x4925, 0x4922, 0x4929, 0x492A, 0x4895, 0x4892, 0x4889, 0x488A, 0x48A5, 0x48A2, 0x48A9, 0x48AA, 0x4A55, 0x4A52, 0x4A49, 0x4A4A, 0x4A25, 0x4A22, 0x4A29, 0x4A2A, 0x4A95, 0x4A92, 0x4A89, 0x4A8A, 0x4AA5, 0x4AA2, 0x4AA9, 0x4AAA, 0x2555, 0x2552, 0x2549, 0x254A, 0x2525, 0x2522, 0x2529, 0x252A, 0x2495, 0x2492, 0x2489, 0x248A, 0x24A5, 0x24A2, 0x24A9, 0x24AA, 0x2255, 0x2252, 0x2249, 0x224A, 0x2225, 0x2222, 0x2229, 0x222A, 0x2295, 0x2292, 0x2289, 0x228A, 0x22A5, 0x22A2, 0x22A9, 0x22AA, 0x2955, 0x2952, 0x2949, 0x294A, 0x2925, 0x2922, 0x2929, 0x292A, 0x2895, 0x2892, 0x2889, 0x288A, 0x28A5, 0x28A2, 0x28A9, 0x28AA, 0x2A55, 0x2A52, 0x2A49, 0x2A4A, 0x2A25, 0x2A22, 0x2A29, 0x2A2A, 0x2A95, 0x2A92, 0x2A89, 0x2A8A, 0x2AA5, 0x2AA2, 0x2AA9, 0x2AAA, 0x9554, 0x9552, 0x9548, 0x954A, 0x9524, 0x9522, 0x9528, 0x952A, 0x9494, 0x9492, 0x9488, 0x948A, 0x94A4, 0x94A2, 0x94A8, 0x94AA, 0x9254, 0x9252, 0x9248, 0x924A, 0x9224, 0x9222, 0x9228, 0x922A, 0x9294, 0x9292, 0x9288, 0x928A, 0x92A4, 0x92A2, 0x92A8, 0x92AA, 0x8954, 0x8952, 0x8948, 0x894A, 0x8924, 0x8922, 0x8928, 0x892A, 0x8894, 0x8892, 0x8888, 0x888A, 0x88A4, 0x88A2, 0x88A8, 0x88AA, 0x8A54, 0x8A52, 0x8A48, 0x8A4A, 0x8A24, 0x8A22, 0x8A28, 0x8A2A, 0x8A94, 0x8A92, 0x8A88, 0x8A8A, 0x8AA4, 0x8AA2, 0x8AA8, 0x8AAA, 0xA554, 0xA552, 0xA548, 0xA54A, 0xA524, 0xA522, 0xA528, 0xA52A, 0xA494, 0xA492, 0xA488, 0xA48A, 0xA4A4, 0xA4A2, 0xA4A8, 0xA4AA, 0xA254, 0xA252, 0xA248, 0xA24A, 0xA224, 0xA222, 0xA228, 0xA22A, 0xA294, 0xA292, 0xA288, 0xA28A, 0xA2A4, 0xA2A2, 0xA2A8, 0xA2AA, 0xA954, 0xA952, 0xA948, 0xA94A, 0xA924, 0xA922, 0xA928, 0xA92A, 0xA894, 0xA892, 0xA888, 0xA88A, 0xA8A4, 0xA8A2, 0xA8A8, 0xA8AA, 0xAA54, 0xAA52, 0xAA48, 0xAA4A, 0xAA24, 0xAA22, 0xAA28, 0xAA2A, 0xAA94, 0xAA92, 0xAA88, 0xAA8A, 0xAAA4, 0xAAA2, 0xAAA8, 0xAAAA
};
void MFM_enc_new3() {
    uint8_t *inptr = &FDD_sector[0]; 
    uint8_t thisbyte; 

    for(uint16_t i = 0; i < LEN; i++) { 
        uint8_t hb=0, lb=0; 
		uint16_t thisByteShort = MFM_lookupTable[FDD_sector[i]];
		lb = thisByteShort;
		hb = (thisByteShort>>8);
        /* 
         * 8th bit (bit0)  may have to look at the next byte 
         */ 
		if ((~thisbyte & 1) && (~(*inptr) & 0x80 )) { 
            lb |= 1;
        }
		FDD_sectorMfm[i*2] = hb;
		FDD_sectorMfm[i*2+1] = lb;
    } 
}

David

David Sedláček

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Okay, homework done, now the marking.

First, if I am not mistaken (I don't have a compiler at hand now to try), you have a bug there. Two points down for not having run it on a non-trivial testcase.

Second, you already have the lowermost bit of the input byte in the lowermost bit of the output byte (inverted). You don't need to include that into the "correction" conditional (but it needs to be rethought).

Third, read the input value only once. I doubt the compiler can optimize two reads so far apart - although I may be wrong, gcc is surprisingly smart in these things.

Fourth, except in extreme cases you want to make the table in FLASH.

Five, what's wrong with asm, anyway?

I believe the execution time can be cut to half with a bit of thinking.

JW

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
this is very juicy discussion, isn't it?

Yes, it was fun. Thank you (as the original poster) for sticking around, answering questions, and allowing the topic to be derailed from the original "carry bit" discussion into the (much more interesting) general MFM encoder discussion...

(The question has always been about ENCODING to MFM, rather than decoding. That makes it a bit safer.)

(There are a bunch of "floppy emulator" projects out there, to be used in putting modern storage technology on legacy products (that expect to talk to a floppy drive.) After all, a $5 1GB SDCard will hold about 1000 floppy images!)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hello, wek, you were right, there is corrected code:

void MFM_enc_new3() {
	uint8_t *mfm_outptr = &FDD_sectorMfm[0];
    uint8_t *inptr = &FDD_sector[0];
	uint16_t thisByteShort = pgm_read_word(&MFM_lookupTable[*inptr++]);

    for(uint16_t i = 0; i < LEN-1; i++) { 
		uint16_t nextByteShort = pgm_read_word(&MFM_lookupTable[*inptr++]);
		if (~thisByteShort & 2) { 
			if(~nextByteShort & 0x8000 ) {
				thisByteShort |= 1;
			}
        }
		*mfm_outptr++ = (thisByteShort>>8);
		*mfm_outptr++ = thisByteShort;
		thisByteShort = nextByteShort;
    }
}

& elmnts of the lookup table must be masked w 0xFFFE!

I am sure that condition is necessary to mod "inter-byte" clock bit - but am I missing something? Your comments are welcome again.

There is nothing wrong with ASM, I like it's explicity, but I'm just not familiar with it and don't have time to learn it to some sufficient level.

David Sedláček

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I compiled your new function. It takes 17984 cycles with avr-gcc.

Out of interest, I counted CodeVision cycles. MFM_enc() takes 64838 cycles. MFM_enc3() takes 51398 cycles. (Just selecting maximal speed optimisation)

I quite agree. There is nothing wrong with ASM. OTOH, getting the best algorithm in a HLL means that you can be up and running on any platform.

Then you can tweak the generated code. For example, the avr-gcc code has some obvious improvements. CodeVision needs a little more work because the compilation model is less efficient in the first place.

Yes. This has been a fun thread. It is nice to have some efficient code. All the same, I doubt that it will be of any practical advantage unless your MFM is generated very fast.

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Sith wrote:
I am sure that condition is necessary to mod "inter-byte" clock bit - but am I missing something?

void MFM_enc_new3b() {
   uint8_t *mfm_outptr = (uint8_t *)&FDD_MFM_1[0];
   uint8_t *inptr = &FDD_sector[0];
   uint8_t inb;
   uint16_t thisByteShort;
   
   inb = *inptr++;

   for(uint16_t i = 0; i < LEN-1; i++) {
     thisByteShort = pgm_read_word(&MFM_lookupTable[inb]);
     inb = *inptr++;
     if (inb & 0x80) {
       thisByteShort &= 0xFFFE;
     }
     *mfm_outptr++ = thisByteShort & 0xFF;
     *mfm_outptr++ = (thisByteShort>>8);
   }
   thisByteShort = pgm_read_word(&MFM_lookupTable[inb]);
   *mfm_outptr++ = thisByteShort & 0xFF;
   *mfm_outptr++ = (thisByteShort>>8);
} 

Use with the original table.

Jan

Pages