Delaying - but not what you think!


joeymorin wrote:

Rookie mistake :(

 

When I was applying the mask to X and Y in the main loop, I was clearing the higher bit representing the base address of the buffer.  The result is that both X and Y were pointing at the GP register file.  I was clobbering state.  The machine was running amok!
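The slip is easy to reproduce on a host. A minimal C sketch, assuming a 512-byte buffer aligned at 0x0200 as on an m168; the function names are illustrative and do not appear in the listing:

```c
#include <assert.h>
#include <stdint.h>

/* Buggy wrap: masking alone also strips the buffer's base address bits. */
uint16_t wrap_mask_only(uint16_t ptr, uint16_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);   /* power of two */
    return (uint16_t)(ptr & (size - 1));
}

/* Fixed wrap: mask the offset, then OR the size-aligned base back in. */
uint16_t wrap_with_base(uint16_t ptr, uint16_t base, uint16_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);   /* power of two */
    assert((base & (size - 1)) == 0);                /* base aligned to size */
    return (uint16_t)((ptr & (size - 1)) | base);
}
```

With the mask-only version, a pointer at 0x0205 comes back as 0x0005, which is inside the register file. With the base OR-ed back in, it stays at 0x0205, and a pointer that has run off the end (0x0400) wraps to 0x0200.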

 

New code below.  Tested, and 'works', although I haven't confirmed if the delay is correct.  But the input is in fact now duplicated on the output.

 

#ifndef F_CPU
  #define F_CPU 20000000
#endif

#define __SFR_OFFSET 0
#include <avr/io.h>

; Set to desired delay
#define DELAY_NS 20000

; Must be a power-of-two no greater than half of the available SRAM
#define BUF_SZ_BYTES (((RAMEND + 1) - RAMSTART) / 2)

; 2 <= DELAY_BYTES < BUF_SZ_BYTES
; Since two bytes go  out from the tail of the delay line  before any are added
; to the head, the delay line must be at least 2 bytes long, or 16 samples, for
; a minimum delay of 1.6 us.
#define BITS_PER_US (F_CPU / 2000000)
#define DELAY_BYTES ((DELAY_NS * BITS_PER_US) / 8000)
#if (DELAY_BYTES >= BUF_SZ_BYTES)
  #warning DELAY_NS is too long
  #undef DELAY_BYTES
  #define DELAY_BYTES (BUF_SZ_BYTES - 1)
#endif
#if (DELAY_BYTES < 2)
  #warning DELAY_NS is too short
  #undef DELAY_BYTES
  #define DELAY_BYTES 2
#endif

; Create a symbol reflecting the real delay.  Examine it with avr-objdump -t
; or similar.
.equ real_delay_ns, (DELAY_BYTES * 8000) / BITS_PER_US

; Index registers by name
#define XL r26
#define XH r27
#define YL r28
#define YH r29

; 10 MHz sample rate  will generate 10 samples per us.  50  us will require 500
; 1-bit samples.  512 samples would fit in  64 bytes.  Although the m168 has 1K
; of SRAM, it is mapped starting at 0x100.  In order to keep the buffer aligned
; to a  power-of-two equal  to its size,  it cannot be  larger than  512 bytes.
; That would permit a 4096 sample delay line.  At 10 MHz, that's 409.6 us.
.section  .bss
.balign   BUF_SZ_BYTES
.comm    dl, BUF_SZ_BYTES

; Determine  address  linker  will  select  for  buffer  (used  for  conditional
; compilation below).   Can't think  of a  way to extract  this from  the .comm
; declaration of dl above.   I expect it can't be done.   Rather, would need to
; specify a custom section and use --section-start= when building.  Meh.
#if BUF_SZ_BYTES > RAMSTART
  #define DL BUF_SZ_BYTES
#else
  #define DL RAMSTART
#endif

.section .text

.global __vector_default
        rjmp    reset
.global __vector_default

reset:

.global __do_clear_bss

.global main

main:

; configure SPI for F_OSC/2 = 10 MHz
        eor     r1,     r1
        sts     UBRR0H, r1
        sts     UBRR0L, r1
        sbi     DDRD,   4
        ldi     r16,    (1<<UMSEL01)|(1<<UMSEL00)
        sts     UCSR0C, r16
        ldi     r16,    (1<<RXEN0)|(1<<TXEN0)
        sts     UCSR0B, r16
        sts     UBRR0H, r1
        sts     UBRR0L, r1

; Since  it's implemented  as a  circular  buffer of  bytes, the  delay can  be
; configured with a granularity of 8 bits, or 0.8 us.

; X is  used to point to  the head of  the delay line, where  incoming samples
; will be deposited
        ldi     XH,     hi8(dl+DELAY_BYTES)
        ldi     XL,     lo8(dl+DELAY_BYTES)
; Y is used to point to the tail of the delay line, where outgoing samples will
; be withdrawn
        ldi     YH,     hi8(dl)
        ldi     YL,     lo8(dl)

; Fill the MSPI TX buffer and wait for  the first RX byte to be ready, i.e. the
; first read must  occur at least 16  cycles after the first  write.  The third
; write must  occur no more  than 32  cycles after the  first write, or  the TX
; buffer will be empty, and there will be a gap.
        ld      r16,    Y+                        ;
        sts     UDR0,   r16                       ;         1st write
        ld      r16,    Y+                        ;                   2
        sts     UDR0,   r16                       ;         2nd write 2

; 3 cycles per pass, total 15 cycle wait.
        ldi     r16,    5                         ;
wait:
        dec     r16                               ;
        brne    wait                              ;                  15

; Run the delay line.  Loop must be exactly 16 cycles
loop:
        lds     r16,    UDR0                      ; 2       1st read  2 = 21
        st      X+,     r16                       ; 2                 2
        ld      r16,    Y+                        ; 2                 2
        sts     UDR0,   r16                       ; 2       3rd write 2 = 27
        andi    XL,     lo8(BUF_SZ_BYTES-1)       ; 1
        andi    XH,     hi8(BUF_SZ_BYTES-1)       ; 1
        andi    YL,     lo8(BUF_SZ_BYTES-1)       ; 1
        andi    YH,     hi8(BUF_SZ_BYTES-1)       ; 1
#if DL < 0x100
        ori     XL,     lo8(dl)                   ; 1
        ori     YL,     lo8(dl)                   ; 1
#else
        ori     XH,     hi8(dl)                   ; 1
        ori     YH,     hi8(dl)                   ; 1
#endif
        rjmp    loop                              ; 2
                                                  ; = 16

; Catch-all
__vector_default:
        reti

 

New .hex file for m168 attached.  Built for 20 MHz, and a 20 us delay.  Also, built without using -nostartfiles, so the full CRT is linked.  This clears the buffer to zero since it is in .bss.  It's not necessary, but ensures no spurious output at the beginning with the random contents of SRAM after a power-up.  I've built and tested it both with and without the CRT and it works either way.

 

EDIT:  Whoops, had the wrong hex file attached.  It was built for a 1 ms delay.  Replaced with new file built for 20 us.
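The delay arithmetic in the listing's macros can be checked on a host before building. A small C sketch that mirrors BITS_PER_US, DELAY_BYTES (including the clamps) and real_delay_ns; the function names are mine, and the 512-byte buffer size in the example is the m168 value implied by the listing:

```c
#include <assert.h>
#include <stdint.h>

/* Samples (bits) shifted per microsecond: the MSPI clock is F_CPU/2. */
uint32_t bits_per_us(uint32_t f_cpu_hz)
{
    assert(f_cpu_hz >= 2000000UL);   /* otherwise the result would be 0 */
    return f_cpu_hz / 2000000UL;
}

/* Mirror of the DELAY_BYTES macro, including the clamp to 2 .. buf_sz-1. */
uint32_t delay_bytes(uint32_t delay_ns, uint32_t f_cpu_hz, uint32_t buf_sz)
{
    uint32_t n = (delay_ns * bits_per_us(f_cpu_hz)) / 8000UL;
    if (n >= buf_sz) n = buf_sz - 1;   /* "too long" branch */
    if (n < 2) n = 2;                  /* "too short" branch */
    return n;
}

/* Mirror of the real_delay_ns symbol. */
uint32_t real_delay_ns(uint32_t bytes, uint32_t f_cpu_hz)
{
    return (bytes * 8000UL) / bits_per_us(f_cpu_hz);
}
```

For F_CPU = 20 MHz and DELAY_NS = 20000 this gives 25 bytes and a real delay of 20000 ns. A request shorter than two bytes clamps to 2 bytes, i.e. the 1.6 us minimum mentioned in the comments.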

 

I must say thanks to everyone for all the (what looks to me like considerable) effort.

 

I should get chance to try this tomorrow.

 

I'll rev up the ol' frequency generator to see how it fares.

 

 


theusch wrote:

"The AVR's UART doesn't work right" gets old after a while.

You're not kidding.  I feel as though my patience and tolerance have gone down rapidly over the last year ;-)  ... I'm always impressed when the likes of you or Cliff keep at it patiently.  I'm trying.  At least, that's what my wife says.

 

El Tangas wrote:

Luckily there were 2 clocks to spare ;-) now there is

If there were no slack cycles, it could still have to be made like that, but like this is quite beautiful, at the limit of the MCU :-D

Well actually, I was being lazy, since I had 16 cycles to play with.   This is a little uglier, but it saves two cycles for future use, and still handles most any device.

; Fill the MSPI TX buffer and wait for  the first RX byte to be ready, i.e. the
; first read must  occur at least 16  cycles after the first  write.  The third
; write must  occur no more  than 32  cycles after the  first write, or  the TX
; buffer will be empty, and there will be a gap.
        ld      r16,    Y+                        ;
        sts     UDR0,   r16                       ;         1st write
        ld      r16,    Y+                        ;                   2
        sts     UDR0,   r16                       ;         2nd write 2

; 3 cycles per pass, total 15 cycle wait.
        ldi     r16,    5                         ;
wait:
        dec     r16                               ;
        brne    wait                              ;                  15

; Run the delay line.  Loop must be exactly 16 cycles
loop:
        lds     r16,    UDR0                      ; 2       1st read  2 = 21
        st      X+,     r16                       ; 2                 2
        ld      r16,    Y+                        ; 2                 2
        sts     UDR0,   r16                       ; 2       3rd write 2 = 27
#if BUF_SZ_BYTES <= 0x100
        andi    XL,     lo8(BUF_SZ_BYTES-1)       ; 1
        andi    YL,     lo8(BUF_SZ_BYTES-1)       ; 1
#else
        andi    XH,     hi8(BUF_SZ_BYTES-1)       ;     1
        andi    YH,     hi8(BUF_SZ_BYTES-1)       ;     1
#endif
#if DL < 0x100
        ori     XL,     lo8(dl)                   ; 1
        ori     YL,     lo8(dl)                   ; 1
#else
        ori     XH,     hi8(dl)                   ;     1
        ori     YH,     hi8(dl)                   ;     1
#endif
        rjmp    .+0                               ; 2
        rjmp    loop                              ; 2
                                                  ; = 16

I hope it doesn't offend your sense of limitless beauty ;-)

 

I've tidied up a few silly things, but I'm not going to post yet another copy of this thing unless someone asks ;-)

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 


joeymorin wrote:

 

I hope it doesn't offend your sense of limitless beauty ;-)
 

 

Well, I guess I can live with that, lol.

 

After all, we could save 2 more cycles if BUF_SZ_BYTES is set to 0x100 by incrementing only XL and YL like theusch did in his code. With 4 cycles we could do all kinds of stuff :-P

Better store this code snippet for future use.


Yes, but then the maximum delay would be 256*8*0.1 = 204.8 us. Good enough for the OP, but by using 1 kB on, say, a 328P, that goes up to 819.2 us ;-)
If I ever think of something useful to do with those 4 cycles, I can always change it. I suppose we could toggle a pin, or increment a whole port, so that the MSB would signal at F_CPU/4096. Is that useful? ;-)
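Both figures are quick to check. A host-side C sketch, assuming the 20 MHz clock used throughout the thread; the function names are made up for the example:

```c
#include <assert.h>
#include <stdint.h>

/* Longest delay a buffer can hold, in ns, at a sample rate of F_CPU/2. */
uint32_t max_delay_ns(uint32_t buf_bytes, uint32_t f_cpu_hz)
{
    uint32_t bits_per_us = f_cpu_hz / 2000000UL;
    assert(bits_per_us != 0);
    return (uint32_t)(((uint64_t)buf_bytes * 8000u) / bits_per_us);
}

/* CPU cycles per full period of the top bit of an n-bit counter that is
   incremented once per loop pass. */
uint32_t msb_period_cycles(unsigned counter_bits, uint32_t cycles_per_pass)
{
    assert(counter_bits >= 1 && counter_bits <= 16);
    return ((uint32_t)1 << counter_bits) * cycles_per_pass;
}
```

max_delay_ns(256, 20000000) is 204800 and max_delay_ns(1024, 20000000) is 819200, matching the 204.8 us and 819.2 us above. msb_period_cycles(8, 16) is 4096, which is where F_CPU/4096 for the MSB of an incremented port comes from.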


 

Last Edited: Thu. Aug 4, 2016 - 12:03 AM

joeymorin wrote:
Yes, but then the maximum delay would be 256*8*0.1 = 204.8 us. Good enough for the OP, but by using 1 kB on, say, a 328P, that goes up to 819.2 us ;-) If I ever think of something useful to do with those 4 cycles, I can always change it. I suppose we could toggle a pin, or increment a whole port, so that the MSB would signal at F_CPU/4096. Is that useful? ;-)

 

Well, please don't get bored with it on my account :)

 

an LED that says "I am processing input" (as opposed to just waiting for input). You know how people love a flashing LED. But only if it's blue.

 

Or a way to attach an analog pot to adjust delay...

 

Or a couple pins to divide down FAST input (like 5Megs+)... and mul it back up again after processing

 

I'm not really taking the p...

 

 


Well, please don't get bored with it on my account :)

On the contrary, this was an interesting nut to crack :)

 

 

an LED that says "I am processing input" (as opposed to just waiting for input). You know how people love a flashing LED. But only if it's blue.

Hmmm.  We can just about do it:

; once for init
        eor     r30,    r30
        mov     r31,    r30
        sbi     DDRD,   7
; in the loop
        adiw    r30,    1                         ; 2
        sbrc    r31,    7                         ; 1/2
        out     PIND,   r31                       ; 1

The second part takes 4 cycles total, regardless.  If this takes our four extra cycles in the loop, the effect is that Z is incremented every 16 cpu cycles.  Z will overflow every 65,536*16 = 1,048,576 cycles.  Bit 7 of r31 will be low for 524,288 cycles, and then high for 524,288 cycles.  When it is low, the out is skipped, and PORTD remains unchanged.  When it is high, the out is executed, and PD7 is toggled.  The first toggle will turn PD7 high, the next will turn it low, etc.  It will be toggled at F_CPU/16, for a frequency of F_CPU/32 = 625 kHz.  Since bit 7 of r31 will be high an even number of times, the last toggle before Z overflows (and bit 7 of r31 is again 0) will leave PD7 low.

 

The overall effect is that an LED across PD7 and GND will appear to flash at a frequency of F_CPU / (16 * 65536) = 19.07 Hz, since one full off/on cycle takes one Z overflow period.  That's pretty fast, but it's slow enough to perceive.  The off half of the flash will indeed be off, while the 'on' half will actually have the LED pulsing a square wave at 625 kHz.
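The counting argument can be replayed on a host. A C sketch that steps Z through one full wrap and models only PD7; the struct and function names are invented for the example, and nothing here touches real hardware:

```c
#include <assert.h>
#include <stdint.h>

struct blink_result {
    uint32_t toggles;        /* times the OUT executed in one Z period */
    uint32_t period_cycles;  /* CPU cycles for Z to wrap once */
    int      pd7_final;      /* PD7 level when Z wraps back to 0 */
};

/* Simulate one full wrap of Z (r31:r30), one loop pass per increment. */
struct blink_result simulate_blink(uint32_t cycles_per_pass)
{
    struct blink_result r = {0, 0, 0};
    uint16_t z = 0;

    assert(cycles_per_pass > 0);
    do {
        z++;                     /* adiw r30, 1 */
        if (z & 0x8000u) {       /* sbrc r31, 7: OUT is skipped when clear */
            r.pd7_final ^= 1;    /* out PIND, r31 toggles PD7 */
            r.toggles++;
        }
        r.period_cycles += cycles_per_pass;
    } while (z != 0);
    return r;
}
```

For a 16-cycle pass this reports 32,768 toggles, a period of 1,048,576 cycles, and PD7 left low at the wrap. Dividing F_CPU by the period gives the visible flash rate.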

 

Is that useful? ;-)

 

Note that PD2, PD3, PD5, and PD6 (and PD4 if you don't enable XCK) will also toggle, but since they aren't outputs, only the pull-ups will be enabled/disabled.

 

As has been noted, there is always XCK (PD4), which outputs a 10 MHz square wave while the loop is running.  As @El Tangas has mentioned, it isn't necessary for MSPI to function in the manner we are using it here, so it need not be made an output.

 

Or a way to attach an analog pot to adjust delay...

That wouldn't need any cycles from the loop.  It can be done in much the same way as setting a new delay over SPI/TWI/whatever, as I suggested in #29.  To set a new delay with a POT, you'd also need a switch or button to 'latch' the new value.  The switch would trigger an interrupt to read the POT, compute new offsets for X and Y, then restart the delay line.  For that matter, the button could just be on /RESET, with the POT position read only once on startup.  To set a new delay, change the POT, and just reset the AVR.
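The "compute new offsets" step might look something like the following on the C side. This is purely a sketch under assumptions: a 10-bit ADC reading, a linear mapping onto the legal delay range of 2 to buf_sz-1 bytes, and hypothetical function names. None of it comes from the posted code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mapping: 10-bit ADC reading -> delay length in bytes,
   spread linearly over the legal range 2 .. buf_sz-1. */
uint16_t adc_to_delay_bytes(uint16_t adc, uint16_t buf_sz)
{
    assert(adc <= 1023 && buf_sz >= 4);
    uint32_t span = (uint32_t)buf_sz - 3u;        /* (buf_sz-1) - 2 */
    return (uint16_t)(2u + ((uint32_t)adc * span) / 1023u);
}

/* Head offset within the buffer for a given tail offset and delay. */
uint16_t head_offset(uint16_t tail, uint16_t delay_bytes, uint16_t buf_sz)
{
    assert(buf_sz != 0 && (buf_sz & (buf_sz - 1)) == 0);   /* power of two */
    return (uint16_t)((tail + delay_bytes) & (buf_sz - 1));
}
```

With a 512-byte buffer, an ADC reading of 0 maps to the 2-byte minimum and 1023 maps to 511 bytes. The interrupt (or startup code) would then reload X from head_offset() plus the buffer base before restarting the delay line.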

 

You could constantly poll the POT for a new value, but that would continually break the timing of the main loop.

 

If this basic solution works for you, I might take a crack at adding the variable delay.  I've got some unanticipated projects coming up, so it may not happen fast.

 

Or a couple pins to divide down FAST input (like 5Megs+)... and mul it back up again after processing

Are you proposing a kind of digital heterodyne?

I'm not really taking the p...

Go on.  Of course you are ;-)


 

Last Edited: Thu. Aug 4, 2016 - 03:28 AM

Hey, it just occurred to me that we can unroll the loop and get more or less all the extra cycles we need to do anything :-)

 

Using the 256 byte buffer technique, for example, unrolling by 8, the first 7 unrolled passes could use auto address increment instructions, while only the last needs manual increment. This saves even more cycles and leaves the flags alone, so we can interleave any code we want and only need to be careful with the timings. We would have nearly 50% CPU time available.

 

I think it would actually be possible to control the delay with a pot without even interrupting the program flow.

 

Naturally, I leave any actual coding as an exercise to the interested reader ;-) I wouldn't be able to test it anyway, I really need to buy an oscilloscope.

Last Edited: Thu. Aug 4, 2016 - 10:39 AM

The trouble there is that there are >>two<< pointers.  One for the head of the circular buffer, and one for the tail.  They won't wrap at the same time, so it's not easy to predetermine where to unroll the loop.  It would be harder still if the delay were variable.

 

However, just by moving to a 256 byte buffer, we can avoid auto-increment altogether, as Lee has done with his software-only solution, and we save a few cycles permanently:

; Run the delay line.  Loop must be exactly 16 cycles
loop:
        lds     r16,    UDR0                      ; 2
        st      X,      r16                       ; 2
        ld      r16,    Y                         ; 2
        sts     UDR0,   r16                       ; 2
        inc     XL                                ; 1
        inc     YL                                ; 1
        ; 4 cycles available
        rjmp    loop                              ; 2

This is a loop we >>can<< unroll indefinitely and with impunity, giving us 6 free cycles instead of 8.  Mind you, there is the minor wrinkle that the inc instructions will affect the S, V, N, and Z flags, but careful coding could manage that restriction.  For example, since the C flag is unaffected by inc, it could be used for inquiry whenever possible.
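For anyone who doesn't read AVR assembly, the 256-byte version behaves like the C model below. It is a host-side sketch only: it ignores the USART's own buffering, the struct and function names are mine, and the uint8_t indices stand in for XL and YL wrapping on their own.

```c
#include <assert.h>
#include <stdint.h>

struct delay_line {
    uint8_t buf[256];
    uint8_t head;   /* plays the role of XL: where incoming bytes land */
    uint8_t tail;   /* plays the role of YL: where outgoing bytes leave */
};

void delay_line_init(struct delay_line *d, uint8_t delay_bytes)
{
    assert(delay_bytes >= 1);
    for (int i = 0; i < 256; i++)
        d->buf[i] = 0;              /* same effect as a cleared .bss */
    d->head = delay_bytes;
    d->tail = 0;
}

/* One loop pass: store the received byte, fetch the byte to transmit. */
uint8_t delay_line_step(struct delay_line *d, uint8_t rx)
{
    d->buf[d->head] = rx;           /* st  X, r16 */
    uint8_t tx = d->buf[d->tail];   /* ld  r16, Y */
    d->head++;                      /* inc XL: wraps 255 -> 0 by itself */
    d->tail++;                      /* inc YL */
    return tx;
}
```

Feeding it a byte stream returns zeros for the first delay_bytes passes and then the same stream, delay_bytes passes late. No masking is needed because an 8-bit index cannot leave the 256-byte page.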

 

If we unrolled to the end of flash, an m168 would have up to 2K words of flash available in 6-cycle chunks to do other work, with the remaining 6K words consumed by the unrolled loop. (!!)

 

Now, since Y is preloaded with the address of the start of the circular buffer (while X is loaded with an offset representing the delay), we could use incremental constant offsets in the indexing for each pass through the unrolled loop:

; Run the delay line.  Loop must be exactly 16 cycles
loop:
        lds     r16,    UDR0                      ; 2
        st      X,      r16                       ; 2
        ld      r16,    Y                         ; 2
        sts     UDR0,   r16                       ; 2
        ; 7 cycles available

        inc     XL                                ; 1
        lds     r16,    UDR0                      ; 2
        st      X,      r16                       ; 2
        ldd     r16,    Y+1                       ; 2
        sts     UDR0,   r16                       ; 2
        ; 7 cycles available
.
.
.
        inc     XL                                ; 1
        lds     r16,    UDR0                      ; 2
        st      X,      r16                       ; 2
        ldd     r16,    Y+62                      ; 2
        sts     UDR0,   r16                       ; 2
        ; 7 cycles available

        inc     XL                                ; 1
        lds     r16,    UDR0                      ; 2
        st      X,      r16                       ; 2
        ldd     r16,    Y+63                      ; 2
        sts     UDR0,   r16                       ; 2
        ; 4 cycles available

        inc     XL                                ; 1
        subi    YL,     -64                       ; 1
        rjmp    loop                              ; 2

This buys another cycle per pass, and we can unroll by 64.  In this way, we'd have 63*7+4 = 445 cycles to play with (We can unroll by more, in chunks of 64, offering 447*chunks-2 cycles to play with, since only the last chunk needs the rjmp).  However, it doesn't prevent the condition flags from being modified, since X is still incremented every pass.  Again, since X is offset from Y and not 'synchronised' with the unrolled loop, we can't use the same offset technique for it.  Some of those cycles would be spent preserving flags across the unrolled passes to avoid the impact of the inc.

 

Then again, if we constrained the granularity with which the delay could be set, we could take advantage of the same offset technique for the head pointer.  Since X doesn't support index operations with an offset, we'd have to switch to using Z for the head, but no matter.  Then we'd have 8 cycles to play with for every pass, with no need to preserve flags across passes.  For example:

 

; Run the delay line.  Loop must be exactly 16 cycles
loop:
        lds     r16,    UDR0                      ; 2
        st      Z,      r16                       ; 2
        ld      r16,    Y                         ; 2
        sts     UDR0,   r16                       ; 2
        ; 8 cycles available

        lds     r16,    UDR0                      ; 2
        std     Z+1,    r16                       ; 2
        ldd     r16,    Y+1                       ; 2
        sts     UDR0,   r16                       ; 2
        ; 8 cycles available
.
.
.
        lds     r16,    UDR0                      ; 2
        std     Z+62,   r16                       ; 2
        ldd     r16,    Y+62                      ; 2
        sts     UDR0,   r16                       ; 2
        ; 8 cycles available

        lds     r16,    UDR0                      ; 2
        std     Z+63,   r16                       ; 2
        ldd     r16,    Y+63                      ; 2
        sts     UDR0,   r16                       ; 2
        ; 4 cycles available

        subi    ZL,     -64                       ; 1
        subi    YL,     -64                       ; 1
        rjmp    loop                              ; 2

This would give us 63*8+4 = 508 cycles, and no need to preserve flags.  However, it would mean that the delay could only be set to multiples of 64 bytes, or 512 bits.  At 10 MHz sampling, that's a granularity of 51.2 us.  Not especially useful to the OP.

 

We could reduce the granularity from 64 bytes to 32 bytes, or 25.6 us, leaving us 31*8+4 = 252 cycles to play with, and no need to preserve flags.  So it's a trade-off between available cycles and granularity of delay.  As is, the fully rolled-up loop has the best possible granularity at 0.8 us, which is still greater than the 0.5 us the OP had specified.
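The trade-off is easy to tabulate. A host-side C sketch of the two quantities being weighed, with made-up function names and the 10 MHz sample rate used in this thread:

```c
#include <assert.h>
#include <stdint.h>

/* Free cycles in one unrolled chunk: every pass but the last has `free_mid`
   spare cycles; the last pass pays for the pointer fix-up and the rjmp. */
uint32_t free_cycles(uint32_t unroll, uint32_t free_mid, uint32_t free_last)
{
    assert(unroll >= 1);
    return (unroll - 1) * free_mid + free_last;
}

/* Step size of the delay, in ns, when the head offset is fixed per chunk
   of `unroll_bytes` bytes. */
uint32_t granularity_ns(uint32_t unroll_bytes, uint32_t sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return (uint32_t)(((uint64_t)unroll_bytes * 8u * 1000000000u)
                      / sample_rate_hz);
}
```

free_cycles(64, 7, 4) is 445 for the X-with-inc variant, free_cycles(64, 8, 4) is 508 and free_cycles(32, 8, 4) is 252 for the Z-with-offset variant. granularity_ns() gives 51200, 25600 and 800 ns for 64, 32 and 1 byte at 10 MHz.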

 

Personally I don't see an advantage to trading granularity for cpu cycles.  I would stick with the by-64 unrolled loop and the restriction on preserving flags across passes within the unrolled loop.  445 cycles should be enough to, as you suggested, handle the POT directly in the loop, even with the restriction on flags.  I'd still want an input to 'latch' a new delay value from the POT, since even the 445 cycles might not be enough to do proper filtering on the ADC input, so the >>length<< of the delay line would fluctuate with any noise on the ADC input.

 

Note that for all of these unrolling options, there is a bit of wiggle room w.r.t. the cycles used between passes.  That is, while there are nominally 7 cycles between passes in the first example, since the MSPI is buffered, we could use, say, 11 in one gap, and 3 in the next gap.  So long as we don't allow the buffers to empty or overflow, all will be well.  Careful cycle counting will be required, and the whole unrolled loop must be cycle-accurate overall.

 

Naturally, I leave any actual coding as an exercise to the interested reader ;-)

Well, @El Tangas, I kind of hate you now ... This is going to tend to draw my attention away from a multitude of projects and responsibilities ;-)


 

Last Edited: Thu. Aug 4, 2016 - 02:59 PM

joeymorin wrote:
If we unrolled to the end of flash, an m168 would have up to 2K words of flash available in 6-cycle chunks to do other work, with the remaining 6K words consumed by the unrolled loop. (!!)

Depending on which loop you use it is about 10 words or so.  Or maybe 20, without counting.  256 full unroll x 20 is 5000 words, 10000 bytes.

 

(so we've gone from less-than-perfect C solutions; to my brute-force with 256 entries and full-port manipulation; to USART-as-SPI to get a bit-level manipulation and only 2/3 pins; to freeing up more cycles to e.g. read setpoint by unrolling)

 

 

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


Lol, what a thorough analysis. I'd say this one is more than solved now, leave something for the OP to do and get back to work :-P


My proposed loop is 5 words, 10 bytes, 64 long, 2 extra words at the bottom.  644 bytes.  Plus as many as 7 more words per pass, x64 = 448 more bytes.  1,092 bytes per by-64 unrolled loop.  Multiple loops can be serialised up to the limit of flash.  15 by-64 unrolled loops would be as long as 16,326 bytes (if all the extra cycles are 1-to-1 with words), leaving 58 bytes (29 words) for init (on an m168).  The up side is that there are about 6,660 cycles which can be used in this massive collection of unrolled loops for other tasks.

 

I'm dizzy.


joeymorin wrote:
My proposed loop is 5 words, 10 bytes, 64 long, 2 extra words at the bottom.

I count 6 words in the base loop (LDS, STS, LD, ST).  And you need words to use the other 8 cycles, right?  So more like 10...  [probably missed something...]


FEEDBACK FROM OP:

 

Thanks, Assembler dudes!

 

Tried it today with a 1.9 MHz signal, and it does exactly what's required.

 

Speculatively tried it on a 3.6 MHz signal, also good. Mr. Nyquist, I think, forbids my trying it on 7.1 MHz, but it's no problem to divide it down.

 

As far as developing the ideas further, I might need to read a book or two first, since I don't speak that lingo.

 

Bugger, they always told me C was close to the metal!

 


nobba wrote:
Bugger, they always told me C was close to the metal!

[don't let sparrow2 see this]  Well, it can be.  But one needs to be pretty familiar with the code generation model of the particular C toolchain.

 

And as I think I mentioned earlier

I've often posted in "you can't do this in C" threads; [almost?] always on the C side.

 

But C has no concept of buffer pointer wrapping.

It is indeed hard to get to min machine cycles in C where a better algorithm/code sequence with a particular micro's instruction set uses concepts not part of the C language.  The mentioned wrapping; the concept of Carry; unsupported operand widths; and others.

 

You've put a sow's ear to the task of making a silk purse.  So to get best results the proposed solutions reserve two of three AVR pointer registers exclusively to the task.  GCC is pretty good about that in "memcpy" type loops so we might have gotten that far.  But the wrapping to mod 256 or whatever isn't part of C.  Thus the ASM-only (or implemented as inline fragments) solutions above will give you better performance in this case than C only. 

 

Now, I don't think anyone has explored the SPI version in C, which is a bit more forgiving (with "extra" cycles) than the straight polling approaches.

 

[If one starts with a buffer of exactly 256 bytes on a mod 256 boundary, then the pointer wrapping could be:

 

-- if starting address is 0x100, then wrap by assigning 0x01 to the _H register.  LDI one cycle one word.

--  same as above for "odd" pages 0x300 0x500 ...

-- if starting address is 0x200 or other "even" pages, then wrap by clearing the low bit of _H register.  CBR one cycle one word

 

Hmmm--odd or even one could use method 1 I guess.]

 

[[ Wouldn't that come out the same as ST X, rrr INC _L and use  ST X+,rrr LDI XH, HIGH(buffer) ?  ]]

 

[edit]  As reloading _H takes same words/cycles as INC _L, the latter is probably better because buffer on mod 256 isn't necessarily needed (although the width needs to be 256).  That would be at the expense of a little more one-time work at setup to take care of wrap of the input pointer offset from output pointer.
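The two wrapping methods compared above can be modelled in C as follows. This is a host-side sketch that treats the pointer as a plain 16-bit value; the function names are invented, and the "page" argument stands for HIGH(buffer).

```c
#include <assert.h>
#include <stdint.h>

/* Post-increment the 16-bit pointer, then force its high byte back to the
   buffer's page: the analogue of ST X+ followed by LDI XH, HIGH(buffer). */
uint16_t step_reload_high(uint16_t ptr, uint8_t page)
{
    assert((ptr >> 8) == page);      /* pointer starts inside the page */
    ptr++;
    return (uint16_t)(((uint16_t)page << 8) | (ptr & 0x00FFu));
}

/* Increment only the low byte: the analogue of ST X followed by INC XL. */
uint16_t step_inc_low(uint16_t ptr)
{
    return (uint16_t)((ptr & 0xFF00u) | ((ptr + 1u) & 0x00FFu));
}
```

For a pointer inside page 0x01, both functions return the same next address, including the wrap from 0x01FF back to 0x0100.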


Last Edited: Thu. Aug 4, 2016 - 07:27 PM

theusch wrote:

I count 6 words in the base loop (LDS, STS, LD, ST).    And you need words to use the other 8 cycles, right?  So more like 10...  [probably missed something...]

Whoops.  Actually, 7 words:

00000040 <loop>:
  40:	00 91 c6 00 	lds	r16, 0x00C6                     ; 2
  44:	0c 93       	st	X, r16                          ; 2
  46:	08 81       	ld	r16, Y                          ; 2
  48:	00 93 c6 00 	sts	0x00C6, r16                     ; 2
  4c:	a3 95       	inc	r26                             ; 1
                                                     
  4e:	00 91 c6 00 	lds	r16, 0x00C6                     ; 2
  52:	0c 93       	st	X, r16                          ; 2
  54:	09 81       	ldd	r16, Y+1	; 0x01          ; 2
  56:	00 93 c6 00 	sts	0x00C6, r16                     ; 2
  5a:	a3 95       	inc	r26                             ; 1 
.
.
.
  5c:	00 91 c6 00 	lds	r16, 0x00C6                     ; 2
  60:	0c 93       	st	X, r16                          ; 2
  62:	0e ad       	ldd	r16, Y+62	; 0x3e          ; 2
  64:	00 93 c6 00 	sts	0x00C6, r16                     ; 2
  68:	a3 95       	inc	r26                             ; 1

  6a:	00 91 c6 00 	lds	r16, 0x00C6                     ; 2
  6e:	0c 93       	st	X, r16                          ; 2
  70:	0f ad       	ldd	r16, Y+63	; 0x3f          ; 2
  72:	00 93 c6 00 	sts	0x00C6, r16                     ; 2
  76:	a3 95       	inc	r26                             ; 1
  78:	c0 5c       	subi	r28, 0xC0	; 192           ; 1
  7a:	e2 cf       	rjmp	.-60     	; 0x40 <loop>   ; 2

... because I still use the inc for X.

 

So 9/18 words/bytes for the last bit, and 7/14 words/bytes for the rest.  4 cycles free in the last, 7 cycles free in the rest.  Assuming worst case 1:1 ratio of words to cycles for the interleaved code, that's 13/26 words/bytes total for the last, and 14/28 words/bytes for the rest.  With a full boat of 64, that's 28*63+26 = 1,790 bytes.

 

If we serialise several, only the last one needs the rjmp.  The rest would have an extra 2 cycles to play with, which could take up to 2/4 words/bytes, so 1,792 bytes.  An m168 could hold 9 of these by-64 unrolled loops, totalling a worst case of 16,126 bytes, leaving 258/129 bytes/words for init.  In those 16,126 bytes, there would be a total of 4,021 cpu cycles to play with, with as many as 4,021/8,042 words/bytes with which to do so.
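For anyone else getting dizzy, the bookkeeping can be handed to the host compiler. A C sketch of the worst-case budget, assuming (as above) that every free cycle is filled with a one-word instruction; the names and the enum are mine:

```c
#include <assert.h>
#include <stdint.h>

enum {
    PASS_CODE_BYTES  = 14,  /* lds, st, ldd, sts, inc: 7 words */
    LAST_CODE_BYTES  = 18,  /* the same plus subi and rjmp: 9 words */
    PASS_FREE_CYCLES = 7,
    LAST_FREE_CYCLES = 4,
    UNROLL           = 64
};

/* Worst-case flash bytes for `chunks` serialised by-64 unrolled loops. */
uint32_t worst_case_bytes(uint32_t chunks)
{
    assert(chunks >= 1);
    uint32_t pass = PASS_CODE_BYTES + 2 * PASS_FREE_CYCLES;   /* 28 */
    uint32_t last = LAST_CODE_BYTES + 2 * LAST_FREE_CYCLES;   /* 26 */
    uint32_t with_rjmp = (UNROLL - 1) * pass + last;          /* 1,790 */
    uint32_t no_rjmp   = with_rjmp - 2 + 2 * 2;               /* 1,792 */
    return (chunks - 1) * no_rjmp + with_rjmp;
}

/* Free cycles across all chunks; each dropped rjmp frees 2 more cycles. */
uint32_t free_cycles_total(uint32_t chunks)
{
    assert(chunks >= 1);
    uint32_t per_chunk = (UNROLL - 1) * PASS_FREE_CYCLES + LAST_FREE_CYCLES;
    return chunks * per_chunk + (chunks - 1) * 2;
}
```

worst_case_bytes(9) comes to 16,126, which leaves 258 bytes of a 16 KB part, and free_cycles_total(9) comes to 4,021.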

 

I'm still dizzy.

 

The OP doesn't need the AVR to do anything else, except perhaps runtime adjustment of the delay.  That doesn't require any loop unrolling at all, as it can be handled by an interrupt, or even just a device reset.

 

nobba wrote:

Mr. Nyquist, I think, forbids my trying it on 7.1 MHz, but it's no problem to divide it down.

Overclock to 30 MHz. 

 

Glad it's working out though.

/*
connect pin XCK to SCK
use MISO MOSI for input, TXD for output

connect SS to another port pin
*/

uint8_t buffer[0x100];
uint8_t jin, jout;

// set up left as an exercise for the reader

while(1) {
    if(UCSR0A & _BV(UDRE0)) {
        UDR0=buffer[jout++];
        //  UCSR0A=_BV(UDRE0);  useless, possibly counterproductive
    }
    if(SPSR & _BV(SPIF)) {
        buffer[jin++]=SPDR;
        // SPIF cleared automatically
    }
}
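
A host-side model of the index arithmetic may make the idea easier to follow. This is only a sketch: `delay_line`, `delay_line_init` and `delay_line_step` are hypothetical names, the SPI and USART data registers are replaced by a function argument and a return value, and it assumes exactly one byte is received for every byte sent. The delay in bytes is the distance between the two `uint8_t` indices, which wrap at 256 on their own.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t buffer[0x100];
    uint8_t jin;   /* head: where the next received byte is stored */
    uint8_t jout;  /* tail: where the next transmitted byte is fetched */
} delay_line;

/* Clear the buffer and place the head delay_bytes ahead of the tail. */
void delay_line_init(delay_line *dl, uint8_t delay_bytes)
{
    memset(dl->buffer, 0, sizeof dl->buffer);
    dl->jout = 0;
    dl->jin  = delay_bytes;
}

/* One pass of the loop body: send the oldest byte, store the newest. */
uint8_t delay_line_step(delay_line *dl, uint8_t received)
{
    uint8_t to_send = dl->buffer[dl->jout++];  /* stands in for UDR0 = ...  */
    dl->buffer[dl->jin++] = received;          /* stands in for ... = SPDR  */
    return to_send;
}
```

After `delay_line_init(&dl, 5)`, each byte passed to `delay_line_step` comes back out five calls later, with zeros until the line has filled.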

 

Iluvatar is the better part of Valar.

Last Edited: Tue. Aug 9, 2016 - 04:48 AM

Neat.

 

Would preclude the use of SPI to receive new delay values from a master, but still neat.

 

But this:

        USART0A=_BV(UDRE0);

... is unnecessary isn't it?  UDREn is self-clearing.  Writing it to '1' has no effect.

 

And I assume you meant UCSR0A rather than USART0A?

 

It's going to need some coaxing to hit 16 cycles average.  Best results seem to be with -O1:

  while(1) {
    if(UCSR0A & (1<<UDRE0)) {
  98:   80 81           ld      r24, Z                                  ; 2
  9a:   85 ff           sbrs    r24, 5                                  ; 1/2
  9c:   0b c0           rjmp    .+22            ; 0xb4 <main+0x24>      ; 2
      UDR0 = buffer[jout++];
  9e:   a0 91 00 01     lds     r26, 0x0100                             ; 2
  a2:   81 e0           ldi     r24, 0x01       ; 1                     ; 1
  a4:   8a 0f           add     r24, r26                                ; 1
  a6:   80 93 00 01     sts     0x0100, r24                             ; 2
  aa:   b0 e0           ldi     r27, 0x00       ; 0                     ; 1
  ac:   af 5f           subi    r26, 0xFF       ; 255                   ; 1
  ae:   be 4f           sbci    r27, 0xFE       ; 254                   ; 1
  b0:   8c 91           ld      r24, X                                  ; 2
  b2:   88 83           st      Y, r24                                  ; 2
    }
    if(SPSR & (1<<SPIF)) {
  b4:   0d b4           in      r0, 0x2d        ; 45                    ; 1
  b6:   07 fe           sbrs    r0, 7                                   ; 1/2
  b8:   ef cf           rjmp    .-34            ; 0x98 <main+0x8>       ; 2
      buffer[jin++] = SPDR;
  ba:   a0 91 01 02     lds     r26, 0x0201                             ; 2
  be:   81 e0           ldi     r24, 0x01       ; 1                     ; 1
  c0:   8a 0f           add     r24, r26                                ; 1
  c2:   80 93 01 02     sts     0x0201, r24                             ; 2
  c6:   8e b5           in      r24, 0x2e       ; 46                    ; 1
  c8:   b0 e0           ldi     r27, 0x00       ; 0                     ; 1
  ca:   af 5f           subi    r26, 0xFF       ; 255                   ; 1
  cc:   be 4f           sbci    r27, 0xFE       ; 254                   ; 1
  ce:   8c 93           st      X, r24                                  ; 2
  d0:   e3 cf           rjmp    .-58            ; 0x98 <main+0x8>       ; 2

The no-action case takes 9 cycles.  The UDR0 only case takes 20 cycles.  The SPDR only case takes 22 cycles.  The UDR0+SPDR case takes 34 cycles.

 

Since MOSI and TXD will effectively be synchronised, you'd get better results by only checking one of the flags and doing the read and write together.  I've tried it, and it is better, but not yet fast enough.  Checking SPSR instead of UCSR0A is a bit faster as well, since it is within range of in/out.

 

The real problem is the pointer and index arithmetic which AVR GCC seems determined to perform.

 

EDIT:  MISO to MOSI

 

Last Edited: Tue. Aug 9, 2016 - 03:49 AM

joeymorin wrote:
But this:

        USART0A=_BV(UDRE0);

... is unnecessary isn't it?  UDREn is self-clearing.  Writing it to '1' has no effect.

 

And I assume you meant UCSR0A rather than USART0A?

Correct.  Maybe.  Correct.

Also, documentation seems to imply that UDREn is not cleared by an ISR.

Iluvatar is the better part of Valar.


Also, documentation seems to imply that UDREn is not cleared by an ISR.

That is true.  Writing UDRn i.e. placing something into the TX buffer can clear UDREn.  If there is no TX operation under way, the write to UDRn goes directly to the transmitter's shift register, so the TX buffer is still empty, and UDREn remains set.  A second write to UDRn during the TX operation in progress will go to the TX buffer, thus clearing UDREn.
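
A toy state model of the transmitter's double buffering, for anyone who wants to trace the flag by hand. It is a simplification and not a description of the actual silicon: the type and function names are hypothetical, timing is reduced to a single "frame done" event, and the rule that a write to UDRn is ignored while UDREn is clear is taken from my reading of the datasheet.

```c
#include <stdbool.h>

typedef struct {
    bool shift_busy;   /* a frame is currently being shifted out */
    bool buffer_full;  /* a byte is waiting in the TX buffer */
} usart_tx_model;

/* UDREn reads as set whenever the TX buffer is empty. */
bool udre(const usart_tx_model *u)
{
    return !u->buffer_full;
}

/* Model of a write to UDRn. */
void write_udr(usart_tx_model *u)
{
    if (u->buffer_full)
        return;                 /* assumed: write is ignored while UDREn is clear */
    if (!u->shift_busy)
        u->shift_busy = true;   /* idle: byte goes straight to the shift register */
    else
        u->buffer_full = true;  /* busy: byte waits in the buffer, UDREn clears */
}

/* Model of the end of a transmitted frame. */
void frame_done(usart_tx_model *u)
{
    if (u->buffer_full)
        u->buffer_full = false; /* buffered byte moves to the shift register */
    else
        u->shift_busy = false;  /* nothing queued: transmitter goes idle */
}
```

Starting from idle, the first `write_udr` leaves `udre` true and only the second one clears it, which matches the behaviour described above.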

 

Note my edits in post #68.

joeymorin wrote:
The no-action case takes 9 cycles.  The UDR0 only case takes 20 cycles.  The SPDR only case takes 22 cycles.  The UDR0+SPDR case takes 34 cycles.

 

Since the MISO and TXD will effectively be synchronised, you'd get better results by only checking on of the flags and doing the read and write together.  I've tried it, and it is better, but not yet fast enough.  Checking SPSR instead of UCSR0A is a bit faster as well, since it is within range of in/out.

 

The real problem is the pointer and index arithmetic which AVR GCC seems determined to perform.

An else would eliminate the 34-cycle case.

If 22 cycles is not fast enough, I suppose four lines of in-line assembly (two each for load and store) would do the trick for an aligned 256-byte buffer.

 

Note that though both SPIs have to run at the same frequency, they do not have to have the same phase.

By activating SS at the right time, the delay can be selected to within a bit time.

Note that since SS is an input, one will need another wire to control it.

Iluvatar is the better part of Valar.


Yes, I would just do this in assembler:

loop:
        in      r17,    SPSR    ; 1
        sbrs    r17,    7       ; 1/2
        rjmp    skip            ; 2
        in      r16,    SPDR    ; 1
        st      Y,      r16     ; 2
        inc     YL              ; 1
skip:
        lds     r17,    UCSR0A  ; 2
        sbrs    r17,    5       ; 1/2
        rjmp    loop            ; 2
        ld      r16,    Z       ; 2
        sts     UDR0,   r16     ; 2
        inc     ZL              ; 1
        rjmp    loop            ; 2

Cycle counts are 9, 12, 15, and 18 for the four cases.  This will still be too long.  Needs to be <= 16.
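
The 16-cycle figure follows from 8 bits per byte at a bit clock of F_CPU/2. A minimal C sketch of that check, with hypothetical helper names, comparing the longest path through a loop against the time one byte takes to arrive:

```c
#include <stddef.h>

/* CPU cycles available per byte when the bit clock is F_CPU / clock_divisor. */
unsigned byte_budget_cycles(unsigned clock_divisor)
{
    return 8u * clock_divisor;
}

/* Longest path through a loop, given the cycle count of each path. */
unsigned worst_path(const unsigned *path_cycles, size_t n)
{
    unsigned worst = 0;
    for (size_t i = 0; i < n; i++)
        if (path_cycles[i] > worst)
            worst = path_cycles[i];
    return worst;
}

/* Non-zero if every path fits in the time it takes one byte to arrive. */
int fits_budget(const unsigned *path_cycles, size_t n, unsigned clock_divisor)
{
    return worst_path(path_cycles, n) <= byte_budget_cycles(clock_divisor);
}
```

With the four counts above, `fits_budget` fails for a divisor of 2 because of the 18-cycle path, and passes for a divisor of 4.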

 

I don't see an advantage to testing the bit first.  In C, yes, since it would in theory obviate the need to be exactly 16 cycles, but that seems not to be possible.  Even in assembler it seems difficult to remain under 16 cycles.

 

Maybe:

loop:
        in      r17,    SPSR    ; 1
        sbrs    r17,    7       ; 1/2
        rjmp    loop            ; 2
        in      r16,    SPDR    ; 1
        st      Y,      r16     ; 2
        inc     YL              ; 1
        ld      r16,    Z       ; 2
        sts     UDR0,   r16     ; 2
        inc     ZL              ; 1
        rjmp    loop            ; 2

Counts are 4 and 14, so works fine, and doesn't need to pad with nops since the flag is tested.  But I'd just as soon do it the same as the MSPI-only code posted earlier:

loop:
        in      r16,    SPDR    ; 1
        st      Y,      r16     ; 2
        ld      r16,    Z       ; 2
        sts     UDR0,   r16     ; 2
        inc     YL              ; 1
        inc     ZL              ; 1
        rjmp    .+0             ; 2
        rjmp    .+0             ; 2
        nop                     ; 1
        rjmp    loop            ; 2

This will still allow a granularity of 1 cpu cycle, or 50 ns @ 20 MHz, with your suggestion of the careful selection of SPI phase, and a well timed driving low of /SS.  Very smart, by the way.

I think the granularity is 2 CPU cycles.

That said, 100ns is only 0.005 of 20 microseconds.

Note that 256*8 bits is enough for a 200 microsecond delay.

 

According to documentation, /SS is an input in slave mode.

It needs a wire to something.

An internal pull-up would be counter-productive.

Iluvatar is the better part of Valar.


According to documentation, /SS is an input in slave mode.

Yes, I meant to drive it low with another pin, as you suggest in the code you posted in #67.

 

I think the granularity is 2 CPU cycles.

I suppose so.

On further thought, I suspect that playing with the SPI mode can get down to single-cycle resolution.

For subcycle resolution, one might need to adjust Fcpu.
 

Iluvatar is the better part of Valar.


When you declared you thought the granularity was 2 cycles, I first thought "No it's 1", but when I took a few moments to explain how, I couldn't.  So I'm curious what you've come up with.

 

My back-of-the-napkin analysis gets stuck on the fact that the MSPI clock itself has a minimum period of 2 cpu clock cycles.

 

While we can control the phase of the MSPI clock on XCK w.r.t. TXD, that will adjust the phase relationship to either +1 or -1 cycles, so again a granularity of 2.

 

The same is true of the SPI doing the listening.  Phase relationship between SCK and MOSI can be +1 or -1 cycles.

 

In truth, though, I haven't sat down with pencil and paper and the timing diagrams and seriously tried to work it out.

 

I suppose if there were a way to insert an additional 1 cycle of delay between XCK and SCK, or between TXD and MOSI, it could be done.  I'm struggling to think of a way to do that.  The AC has its own synchroniser which inserts 0.5-1.5 cycles of latency.  The variability in that figure is there to account for asynchronous input, but in our case the input would be synchronous so we can probably count on a fixed latency of 1 cycle.  The trouble is that the m168 and friends do not have a way to tie ACO to an output.  I can't think of another trick to get that one extra cycle.

 

However this may all be moot:

 

 

I read that to mean that we can't reliably clock the SPI slave at F_OSC/2.

 

EDIT:  MISO to MOSI

 

Last Edited: Tue. Aug 9, 2016 - 03:49 AM

I'd forgotten about the SPI speed limit.

The next MSPI speed available is Fcpu/4 .

Iluvatar is the better part of Valar.


I feel as though that limit is given with the assumption that the master would be a separate device, likely operating from a different clock.  Even if a separate master and slave are running on clocks which are nominally the same speed, mutual jitter and drift between the two clocks would be inevitable, making the master and slave effectively asynchronous.  Thus, I figure, the stipulation of < 2 cpu clock cycles, rather than <= 2 cpu clock cycles, which I would expect of a synchronous system.

 

Now, since in this case the master and slave are two separate peripherals on the same device, driven by the same underlying system clock, it may in fact be that the real limit would be <= 2 cpu clock cycles.  If that is so, it would seem to be possible to achieve both a high sample rate of F_OSC/2, and a granularity of 2 cpu clock cycles.  Whether or not this proves to be true would, I imagine, depend upon the precise phase relationship between TXD and MOSI.  That is, if TXD switches at the same instant as MOSI is latched, the effect would be metastability, despite the synchroniser.  However, if they are out-of-phase by any significant fraction of a cpu clock cycle, it might work consistently and reliably.  Only a bench test will reveal which is true.  I have not performed a bench test.

 

Notwithstanding an answer to the above, I'm still curious how you proposed to get 1-cycle granularity.

 

 

For my part, @skeeve, you got me thinking.  I figured it might be possible to get 4-bit granularity without involving SPI, by using swap:

loop:
        lds     r17,    UDR0    ; 2
        swap    r17             ; 1
        mov     r18,    r17     ; 1
        andi    r18,    0xF0    ; 1
        andi    r17,    0x0F    ; 1
        or      r17,    r16     ; 1
        st      Y,      r17     ; 2
        ld      r17,    Z       ; 2
        sts     UDR0,   r17     ; 2
        inc     YL              ; 1
        inc     ZL              ; 1
        mov     r16,    r18     ; 1
        rjmp    loop            ; 2 = 18

18 cycles.  Not quite.  Although on an AVR where UDRn is within reach of in/out, it would be perfect.  The t2313/4313 is such a beast, but it doesn't have enough page-aligned SRAM for our purposes, although with auto-increment, the cycles used by the inc instructions could be used by andi instructions instead, allowing the use of a smaller buffer.  The m8/16/32/64/128 has enough SRAM, but doesn't support MSPI.  The only one I can find which meets all the criteria is the t1634.  Enough page-aligned SRAM, support for MSPI, and UDR0 in I/O space.

 

For m168 and friends, unrolling by 2 is sufficient:

loop:
        lds     r17,    UDR0    ; 2
        swap    r17             ; 1
        mov     r18,    r17     ; 1
        andi    r18,    0xF0    ; 1
        andi    r17,    0x0F    ; 1
        or      r17,    r16     ; 1
        st      Y,      r17     ; 2
        ld      r17,    Z       ; 2
        sts     UDR0,   r17     ; 2
        rjmp    .+0             ; 2
        lds     r17,    UDR0    ; 2
        swap    r17             ; 1
        mov     r16,    r17     ; 1
        andi    r16,    0xF0    ; 1
        andi    r17,    0x0F    ; 1
        or      r17,    r18     ; 1
        std     Y+1,    r17     ; 2
        ldd     r17,    Z+1     ; 2
        sts     UDR0,   r17     ; 2
        subi    YL,     -2      ; 1
        subi    ZL,     -2      ; 1
        rjmp    loop            ; 2 = 32

 

Now we have two methods.  One method can be used when the desired delay is an even number of nybbles, the other when we want an odd number of nybbles.
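
For the odd case, the swap-and-carry step can be modelled in a few lines of host-side C. This is a sketch under assumptions: `delay_one_nybble` is a hypothetical name, `*carry` plays the role of the register that holds the nybble carried between bytes (r16/r18 in the listings), and the MSPI is assumed to shift MSB first so that the high nybble of a byte is the older one.

```c
#include <stdint.h>

/* Equivalent of the AVR swap instruction: exchange the two nybbles. */
static uint8_t swap_nybbles(uint8_t v)
{
    return (uint8_t)((v << 4) | (v >> 4));
}

/* Shift the stream by one nybble.  Returns the byte to store in the buffer:
 * the low nybble held over from the previous byte, followed by the high
 * nybble of the byte just received.  *carry must start at 0. */
uint8_t delay_one_nybble(uint8_t received, uint8_t *carry)
{
    uint8_t swapped = swap_nybbles(received);
    uint8_t stored  = (uint8_t)((swapped & 0x0F) | *carry);  /* or  r17, r16 */
    *carry = (uint8_t)(swapped & 0xF0);                      /* andi r18, 0xF0 */
    return stored;
}
```

Feeding it 0x12, 0x34, 0x56 yields 0x01, 0x23, 0x45: the same nybble stream, one nybble late.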

 

I've also added the ability to change the delay at run-time by means of button inputs.  There are 4 inputs:

  • longer by 1 nybble
  • shorter by 1 nybble
  • longer by 10 nybbles
  • shorter by 10 nybbles

 

For simplicity of coding, when a button is pressed the delay line is stopped, reconfigured for the new length, and restarted.  The buttons are handled by a pin change interrupt and ISR, although they could just as easily have been handled by polling the interrupt flag and jumping out of the loop.  There are just enough free cycles in the method 1 loop.  It would be a bit more complicated in the method 2 loop, as there aren't enough cycles at the right time in the loop.  Unrolling further would solve that problem.

 

I've done some testing, and I believe it to be reasonably free of bugs.  I have not tested with a scope.  Instead, I used the delay line as a serial loop-back (using an UNO).  Also, serial debugging was peppered into various places to verify correct operation.  That debug code has since been stripped.  Testing with an appropriate scope or LA will be necessary to fully confirm correct operation.

 

Although I tried to make it easily configurable, no great effort was expended to make it a work of art, nor particularly clever.  As such, I rely on the linker for the vector table and for placement of the ISR.  The remainder of the CRT is not needed, but no effort is made to prevent it from being linked.

 

A simple build shell script is included:

#!/bin/bash

SRC=delay_line
TARGET=atmega168
F_CPU=20000000

avr-gcc -Wall -g -save-temps -mmcu=${TARGET} -DF_CPU=${F_CPU} -Wl,-Map,${SRC}.map ${SRC}.S -o ${SRC}.elf
avr-objcopy -O ihex ${SRC}.elf ${SRC}.hex
avr-objdump -t ${SRC}.elf > ${SRC}.sym
avr-objdump -Sz ${SRC}.elf > ${SRC}.lss

 

The .sym file shows the symbols contained within the .elf.  This is so that you can confirm various timing symbols that are created in the .S.  For example, when built for 20 MHz:

01312d00 l       *ABS*	00000000 delay_line_f_cpu
00989680 l       *ABS*	00000000 delay_line_sample_rate
00000190 l       *ABS*	00000000 delay_line_granularity_ns
00000640 l       *ABS*	00000000 delay_line_minimum_ns
00004e20 l       *ABS*	00000000 delay_line_default_ns
00031e70 l       *ABS*	00000000 delay_line_maximum_ns

Note that the symbols are in hexadecimal.  The granularity of 190 is actually 0x190, or 400 decimal.  That's now better than the 500 ns stipulated by the OP.
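
If reading hex by eye is a nuisance, a couple of lines of C on the host will decode the value column. The helper names below are hypothetical. `nybble_ns` restates the relationship between the sample rate and the granularity, assuming the step size is one nybble, i.e. 4 one-bit samples.

```c
#include <stdint.h>
#include <stdlib.h>

/* Convert the hexadecimal value column of an `avr-objdump -t` line. */
uint32_t symbol_value(const char *hex_field)
{
    return (uint32_t)strtoul(hex_field, NULL, 16);
}

/* Duration of one nybble (4 one-bit samples) in nanoseconds. */
uint32_t nybble_ns(uint32_t sample_rate_hz)
{
    return (uint32_t)(UINT64_C(4000000000) / sample_rate_hz);
}
```

For example, `symbol_value("00989680")` gives the 10 MHz sample rate, and `nybble_ns` of that agrees with the 400 ns granularity symbol.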

 

Included in the .zip file is a .hex built for an m168 at 20 MHz, a default delay of 20 us, and with buttons on port B:

  • down by one: PB0
  •   up by one: PB1
  • down by ten: PB2
  •   up by ten: PB3

 

EDIT: changed MISO to MOSI.  Surprised nobody caught that, going all the way back to @skeeve's post #67.

EDIT: changed preprocessor arithmetic to properly handle non-integral clock speeds (e.g. 18.432 MHz)

EDIT: added generalised hardware config macros for USART

EDIT: fixed bug introduced when USART config macros were added :(

EDIT: fixed bug responsible for out-of-bounds buffer access, leading to spurious edges

Attachment(s): 

 

Last Edited: Mon. Aug 15, 2016 - 07:59 PM

A SPI slave can be set to read a bit on either the rising or falling edge of SCK.

At a bit rate of F_cpu/2, those edges are one cycle apart.

Iluvatar is the better part of Valar.

Last Edited: Wed. Aug 10, 2016 - 03:20 AM

skeeve wrote:
A SPI slave can be set to read a bit on either the rising or falling edge of SCK. At a bit rate of F_cpu/2, those edges are one cycle apart.

 

The key word there is OR.

You can choose either clock polarity, but not both edges - Both edges is a DDR link, which modest MCUs still lack.

 


skeeve wrote:

A SPI slave can be set to read a bit on either the rising or falling edge of SCK.

At a bit rate of F_cpu/2, those edges are one cycle apart.

Ah yes, that was a failure of visualisation on my part.

 

Who-me wrote:

You can choose either clock polarity, but not both edges

You don't need both edges at the same time.  You can algorithmically select the correct polarity (and correct delay before enabling the slave by driving /SS low with another I/O pin) to achieve the desired delay.

 

I may take a crack at it one day.  For now, 8-cycle granularity will have to do ;-)

joeymorin wrote:
Who-me wrote: You can choose either clock polarity, but not both edges

You don't need both edges at the same time.  You can algorithmically select the correct polarity (and correct delay before enabling the slave by driving /SS low with another I/O pin) to achieve the desired delay.

That seems to assume the source pulse is clock-synchronised/locked, which was not what I took from the OP's post.

Any granularity is going to have to add a sampling error as well, as mentioned above.


Any granularity is going to have to add a sampling error as well, as mentioned above.

The granularity applies to the step size with which the nominal length of the delay line can be configured.  This is a separate issue from the jitter inherent with sampling an asynchronous signal.  In either case (the solution with 8-cycle [4-sample] granularity I posted above, or @skeeve's proposed 1-cycle [half-sample] granularity solution involving MSPI >>and<< SPI), that jitter will be +/- 0.5 clock cycles, or +/- 0.25 samples.  I touched on the issue of jitter in multiple posts in this thread.

 

Last Edited: Tue. Aug 9, 2016 - 03:42 AM

I had a few minutes today, so I hooked up an UNO to a scope to satisfy myself that this thing worked the way I expected.  Imagine my surprise when it did.  Imagine my greater surprise when it then seemed not to about 25 percent of the time.

 

As I changed the delay with the four buttons I'd hooked up to port B, I found that the delay changed as expected, but the high portions of the outgoing pulses showed occasional, short, spurious transitions down to logic 0, and back up to the expected logic 1.  This only happened every fourth step when changing delay.

 

Specifically, when delay_nybles had the two low bits set, the spurious edges would occur.  When the LSB of delay_nybles was set, the nyble-swapping code would be used, so it seemed clear that this was the code which was implicated.

 

I stared at the code a few minutes, poked it with a stick, couldn't see the flaw.  Then it dawned on me:

loop:
        lds     r17,    UDR0    ; 2
        swap    r17             ; 1
        mov     r18,    r17     ; 1
        andi    r18,    0xF0    ; 1
        andi    r17,    0x0F    ; 1
        or      r17,    r16     ; 1
        st      Y,      r17     ; 2
        ld      r17,    Z       ; 2
        sts     UDR0,   r17     ; 2
        rjmp    .+0             ; 2
        lds     r17,    UDR0    ; 2
        swap    r17             ; 1
        mov     r16,    r17     ; 1
        andi    r16,    0xF0    ; 1
        andi    r17,    0x0F    ; 1
        or      r17,    r18     ; 1
        std     Y+1,    r17     ; 2
        ldd     r17,    Z+1     ; 2
        sts     UDR0,   r17     ; 2
        subi    YL,     -2      ; 1
        subi    ZL,     -2      ; 1
        rjmp    loop            ; 2 = 32

In each of the two offset-by-one accesses (std Y+1 and ldd Z+1), there would be a point (every 256 bytes, in fact), where an out-of-bounds buffer access could occur.  In practice this would only ever happen for the head, and never for the tail.

 

What's more, this wouldn't manifest when the tail was an even number of bytes away from the head, even if it was an odd number of nybles away (which would result in the selection of the above nyble-swapping method) because both the head's write accesses and the tail's read accesses would always be on word-aligned addresses when entering the loop.

 

Take the case where the head is 4 bytes ahead of the tail:

        0x00  0x01  0x02  0x03  0x04  0x05          0xFA  0xFB  0xFC  0xFD  0xFE  0xFF
       +-----------+-----------+-----------+/     /+-----------+-----------+-----------+
head   |           |           |           |       |           |           | Y,   Y+1  |
tail   | ^         |           |           |       | ,Z   ,Z+1 |           | |         |
       +-|---------+-----------+-----------+/     /+--|--------+--^--------+-|---------+
         |                                            |           |          |
         |                                            +-----------+          |
         +-------------------------------------------------------------------+

As the head approaches the end of the buffer, Y points to the 0xFEth byte.  The first access in the loop is without offset, so no worry there.  The second access is offset by 1, but again, that's within the buffer.  At the end of the loop, the two pointers are each incremented by two.  Y overflows and ends up pointing at the 0x00th byte.  Later, the same will happen to Z.  All is well.

 

All is well, because after a change in the length of the delay, the pointers are reloaded to reflect the new length.  The tail (Z) is always reloaded to point at the beginning of the buffer, the 0x00th byte, an even address.  It will therefore always point to an even address since ZL is only ever incremented by 2.  The head (Y) is loaded to point the required number of bytes ahead.

 

Now imagine the case where the head is 5 bytes ahead of the tail:

        0x00  0x01  0x02  0x03  0x04  0x05          0xFA  0xFB  0xFC  0xFD  0xFE  0xFF  ----
       +-----------+-----------+-----------+/     /+-----------+-----------+-----------+
head   |           |           |           |       |           |           |       Y,  |,Y+1
tail   |       ^   |           |           |       | ,Z   ,Z+1 |           |       |   |
       +-------|---+-----------+-----------+/     /+--|--------+--^--------+-------|---+
               |                                      |           |                |
               |                                      +-----------+                |
               +-------------------------------------------------------------------+

Now as the head approaches the end of the buffer, the last thing Y will point to, before the increment-by-two which overflows YL, is the 0xFFth byte.  The first access in the loop is again without offset, so no worry.  However, the second access, with an offset by 1, is outside the buffer.  The head will write to a byte from which the tail will never read.  Afterwards, the head is incremented by 2, overflowing YL, and Y will now point to the 0x01st byte, having completely (and forever) skipped the 0x00th byte.

 

Meanwhile, when the tail returns to the beginning of the buffer, it will read from the 0x00th byte, into which the head has never deposited anything.  Since the buffer is zeroed before entering the loop after a change in the delay length, it means that every 256 bytes the tail will be reading a permanently zero byte and shoving that out the MSPI, which explains the spurious edges I was seeing.
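
The skipped byte can be reproduced on a host with a small model of the head pointer. This is a sketch with a hypothetical function name: offsets are relative to the start of the 256-byte buffer, only the low byte of Y is advanced (as with subi YL, -2), and the Y+1 access is computed at full width so it can run past the end of the page.

```c
#include <stdbool.h>
#include <stdint.h>

#define BUF_SZ 256u

/* Number of buffer bytes the head never writes when it uses
 * st Y / std Y+1 / subi YL, -2, starting at offset head_start. */
int unwritten_bytes(uint8_t head_start)
{
    bool written[BUF_SZ] = { false };
    uint8_t yl = head_start;                   /* low byte of Y */

    for (unsigned i = 0; i < BUF_SZ; i++) {    /* two full laps of the buffer */
        unsigned first  = yl;                  /* st  Y,   r17 */
        unsigned second = (unsigned)yl + 1u;   /* std Y+1, r17: no 8-bit wrap */

        written[first] = true;
        if (second < BUF_SZ)
            written[second] = true;            /* else the write lands past the buffer */

        yl = (uint8_t)(yl + 2u);               /* subi YL, -2 wraps within the page */
    }

    int missing = 0;
    for (unsigned i = 0; i < BUF_SZ; i++)
        if (!written[i])
            missing++;
    return missing;
}
```

An even starting offset such as 4 leaves nothing unwritten, while an odd one such as 5 leaves exactly one byte (offset 0x00) untouched.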

 

So.  Another rookie mistake ;-)

 

One way to fix this is to return to using auto-increment, and then reload the high byte of the pointers to the correct page.  There are just enough spare cycles in the loop to do it:

loop:
        lds     r17,    UDR0    ; 2
        swap    r17             ; 1
        mov     r18,    r17     ; 1
        andi    r18,    0xF0    ; 1
        andi    r17,    0x0F    ; 1
        or      r17,    r16     ; 1
        st      Y+,     r17     ; 2
        ld      r17,    Z+      ; 2
        sts     UDR0,   r17     ; 2
        ldi     YH,     hi8(buf); 1
        ldi     ZH,     hi8(buf); 1
        lds     r17,    UDR0    ; 2
        swap    r17             ; 1
        mov     r16,    r17     ; 1
        andi    r16,    0xF0    ; 1
        andi    r17,    0x0F    ; 1
        or      r17,    r18     ; 1
        st      Y+,     r17     ; 2
        ld      r17,    Z+      ; 2
        sts     UDR0,   r17     ; 2
        ldi     YH,     hi8(buf); 1
        ldi     ZH,     hi8(buf); 1
        rjmp    loop            ; 2 = 32

Technically, it's only necessary to do this with the head (Y), since it is the only one which could ever take on an odd value, so Z could still be handled with the previous by-two increment and offset-by-one access.  However, doing it this way for both Y and Z means that the head and tail can be loaded with >>any<< value and accesses will always remain within bounds, so any changes to the code which alter the way in which the head and tail are initialised shouldn't cause a similar bug to surface with Z.
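
A host-side model of the auto-increment-and-reload scheme shows why any starting value stays in bounds. It is a sketch only: the function names are hypothetical and `BUF_BASE` is an assumed page-aligned address, not necessarily where the real buffer lives.

```c
#include <stdbool.h>
#include <stdint.h>

#define BUF_BASE 0x0100u  /* assumed page-aligned buffer address */
#define BUF_SZ   256u

/* st Y+ followed by ldi YH, hi8(buf): post-increment, then force the page. */
uint16_t store_and_advance(uint16_t y, bool *written)
{
    written[y - BUF_BASE] = true;     /* st  Y+, r17 */
    y = (uint16_t)(y + 1u);           /* the post-increment may carry into YH */
    return (uint16_t)((BUF_BASE & 0xFF00u) | (y & 0x00FFu));  /* ldi YH, hi8(buf) */
}

/* Number of buffer bytes left unwritten after 256 stores from head_start. */
int unwritten_bytes_fixed(uint8_t head_start)
{
    bool written[BUF_SZ] = { false };
    uint16_t y = (uint16_t)(BUF_BASE + head_start);

    for (unsigned i = 0; i < BUF_SZ; i++)
        y = store_and_advance(y, written);

    int missing = 0;
    for (unsigned i = 0; i < BUF_SZ; i++)
        if (!written[i])
            missing++;
    return missing;
}
```

Every store lands inside the page and a full lap touches all 256 bytes, whether the head starts on an odd or an even offset.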

 

The attachment in #78 has been replaced with the bug-fixed code.