Delaying - but not what you think!


Hello,

This question has nothing to do with the normal questions based around _delay_us() and friends!

 

Instead, I am trying to use a mega168 to genuinely delay a pulse train by a time in the microsecond range (20-50us). Think "delay line".

 

Ideally, I want to work with an input signal of 3MHz, but an AVR can't cope, so I've divided down my input signal to 500kHz. Dividing down the input more to say 250kHz is not a problem.

 

My tactic was to poll for input changes, stick them in a buffer, then some time later bang the square wave out on another pin:

 

	// Put debugging symbols on Release mode to see why the NOP()S are where they are.
	// We want a duty-cycle of the same as the input at ALL times. All paths must take equal time.
        cli();

	while (1){

		uint8_t& myb = buf[i++];
		uint8_t in_state = PINC & IN_HIGH;

		if (in_state){
			myb = 1;
		}else{
			myb = 0;
		}

		if (buf[c++]){

			if (in_state){
				NOP();NOP();
			}
			PORTC |= OUT_HIGH;
		}else{
			NOP();
			PORTC &= ~OUT_HIGH;

		}

	};

 

This works. The NOPs are there to keep the timing consistent (in the Sim)

 

 

However, the output has jitter, even at 250kHz. I am surprised by this. Interrupts are off as you can see. I clearly see up to 2uS of jitter on the 'scope, and this won't do once the input is multiplied back up in frequency :(

 

Has anyone any tips on how to do this jitter free? I just want to exactly delay the pulse train: no artifacts!

 

Thanks!

 

 

 

Last Edited: Tue. Aug 2, 2016 - 03:35 PM

nobba wrote:
Has anyone any tips on how to do this jitter free?

I recall at least one extensive past thread on AVR "delay line".  Hmmm--this one didn't have a "resolution"

https://www.avrfreaks.net/forum/d...

...and "delay line" site search came up with too many hits.

 

If really important, then I might recommend investigating using an Xmega with port DMA  -- set the buffer(s) length to the needed delay time.  [I see I suggested looking into that in the thread above but without resolution]

 

 

 

 

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


Yes, the whole "delay" vs "delay line" thing makes it hard to Google, let alone search anywhere else.

 

Extra Info: External crystal osc, 20MHz, no prescaler.

 

I do have an Xmega + development breakout board here somewhere...where's that data sheet ...

 

Thanks!


A Google search for "digital "delay line" with microcontroller" gives some interesting results.


If you need a precise amount of delay,

probably you should use assembly.

If you just need jitter-free operation,

use a formula that does not require an if that depends on your data:

enum { DELAY=37 } ; // in while loop passes
enum { BUF_BITS=7 } ;
enum { BUF_SIZE=1<<BUF_BITS } ; // must be a power of two for the index mask
uint8_t buf[BUF_SIZE];
// the write index leads the read index by DELAY slots
uint8_t jin=DELAY, jout=0;
uint8_t delay_left=DELAY;

while(1) {
    buf[jin++ & (BUF_SIZE-1)]=PINC;
    volatile uint8_t portv=
            shift(buf[jout++ & (BUF_SIZE-1)] & IN_HIGH) | (PORTC & ~OUT_HIGH);
            // shift to get the right bit position
    uint8_t port3=portv;
    // getting the timing right on the following
    // if/else might be easier in in-line assembly
    // even slight compilation changes could make a difference otherwise
    if(delay_left) {
        --delay_left;   // buffer not filled yet: hold the output
    } else {
        PORTC=port3;
    }
} // while

 

Iluvatar is the better part of Valar.


Reviewing the loop that I posted last year, and as a given that the whole AVR will be used during this "delay line" operation [it can be like e.g. miniDDS operation -- a hard loop with interrupts enabled with sources such as UART for "new command" or a STOP button to go back to command mode] then let's do some noodling...

 

Ideal operation from the initial post is a 3MHz signal.  That is fewer than 7 AVR clocks per period at 20MHz (20/3 ≈ 6.7), right?

 

My first guess is to use an entire AVR port for the input, and especially the output so that OUT can be used.  And also, fastest operation would be if [wastefully] a whole byte is used to store each sample.

 

Does Mr. Nyquist come into play here?  If you want to catch a single 1/3us pulse then do you need 6MHz sampling rate?  So what is the minimum high/low time to be reproduced?

 

You said "no jitter"...but with any kind of sampling you would not reproduce the input signal exactly, right?  For example, let's say we make this ideal delay line with the AVR, sampling the input every [for example] 1 microsecond.  The actual start/end of the incoming pulse could be up to 1 microsecond earlier.  It is true, isn't it, that the output pulse high/low times will be like +/- one sample period?
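
To put a number on that last point, here is a rough host-side sketch in C (not AVR code). The function name `sampled_width_ns` and the integer-nanosecond model are illustrative assumptions, not anything from the thread. It counts how many sample instants land inside a pulse, which is all a sampling delay line can reproduce:

```c
#include <stdint.h>

/* Hypothetical helper: samples are taken at t = 0, ts, 2*ts, ...
 * and the pulse is high for start_ns <= t < start_ns + width_ns.
 * Returns the width the output would show: (number of high samples) * ts.
 * ts_ns must be greater than zero. */
uint32_t sampled_width_ns(uint32_t start_ns, uint32_t width_ns, uint32_t ts_ns)
{
    uint32_t highs = 0;
    for (uint32_t t = 0; t < start_ns + width_ns; t += ts_ns) {
        if (t >= start_ns) {
            highs++;
        }
    }
    return highs * ts_ns;
}
```

With a 1000 ns sample period, a 1500 ns pulse comes out as 2000 ns if it starts at t = 0 but as 1000 ns if it starts at t = 100 ns, so the reproduced width depends on where the edges fall relative to the sample clock.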

 

Let's start with some way of a "programmable" or "settable" delay period, with the maximum time of 50us.  If you sample once a microsecond and use the "wasteful" one byte per sample, that is a 50 byte buffer for max delay.  Not too bad.

 

Expand on your specs for min and max delay and resolution needed.  For the most efficient wrap of the buffer pointers, the buffer size should be a power of two.  And for the fewest cycles in the loop I think 256 would work best.  OK, here goes:

 

-- 256 byte buffer on a mod 256 boundary.

-- Arbitrary output port and pin, but overwrite entire PORT register.

-- Input pin on a different port, but on the same pin number as the output pin.  E.g. PB0 as input, PD0 as output.

 

With the above, there should be no conditional logic or masking in the loop.

 

Init the buffer to zeros or ones for the initial delay front porch.

 

Set "offset" based on the desired delay and the sample time.  E.g. if the loop below takes 10 clocks and you are running at 20MHz, then "offset" is 2x the desired delay time in microseconds.
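
As a rough illustration of that arithmetic, here is a small C sketch that could run on a PC to pick the value. The name `delay_to_offset` and its parameters are made up for this example; it assumes one sample per loop pass and an 8-bit offset into a 256-byte buffer:

```c
#include <stdint.h>

/* Hypothetical helper: number of buffer slots between the write and read
 * pointers for a wanted delay, rounded to the nearest sample and clamped
 * to what an 8-bit offset can hold. loop_cycles is the cycle count of one
 * pass through the sampling loop and must be greater than zero. */
uint8_t delay_to_offset(uint32_t delay_ns, uint32_t f_cpu_hz, uint32_t loop_cycles)
{
    uint64_t den = (uint64_t)loop_cycles * 1000000000u; /* cycles * ns per second */
    uint64_t num = (uint64_t)delay_ns * f_cpu_hz;
    uint64_t offset = (num + den / 2) / den;            /* round to nearest */
    return offset > 255 ? 255 : (uint8_t)offset;
}
```

For a 10-clock loop at 20MHz, `delay_to_offset(20000, 20000000, 10)` gives 40, i.e. 2x the delay in microseconds.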

 

As mentioned, have interrupts on for stop; change parameters; etc.  If you care to, fuss with the stack so the RETI doesn't jump back into the loop but rather to the setup/start code.

 

Using the aforementioned PB0=>PD0 the ASM might look something like:

 

    .DSEG
    .ORG 0x400
BUFFER:
    .BYTE 0x100
...
; SETUP
    LDI ZH, HIGH(BUFFER)        ; OUTPUT POINTER
    LDI ZL, 0

    LDI XH, HIGH(BUFFER)        ; INPUT POINTER
    LDS XL, offset              ; CALCULATED ELSEWHERE

LOOP:
    IN  R16, PINB   ; 1 CYCLE
    STS X, R16      ; 2
    INC XL          ; 1
    LDS R16, Z      ; 2
    INC ZL          ; 1
    OUT PORTD, R16  ; 1
    RJMP LOOP       ; 2

If I counted correctly that is 10 cycles; 1/2 us at 20MHz.  2MHz sampling rate.  0-128us delay with 1/2 us resolution.
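
For anyone more at home in C, the same idea can be modelled on a PC. This is only a sketch of the buffer logic, not a cycle-accurate replacement for the loop, and the names `delay_line_t`, `delay_line_init` and `delay_line_step` are invented for the example. The `uint8_t` indices wrap at 256 by themselves, which mirrors `INC XL` / `INC ZL` on a 256-byte aligned buffer:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side model of the 256-byte circular buffer. */
typedef struct {
    uint8_t buf[256];
    uint8_t in;   /* write index, plays the role of XL */
    uint8_t out;  /* read index, plays the role of ZL */
} delay_line_t;

/* Zero the buffer (the "front porch") and start writing offset slots ahead. */
void delay_line_init(delay_line_t *dl, uint8_t offset)
{
    memset(dl->buf, 0, sizeof dl->buf);
    dl->in = offset;
    dl->out = 0;
}

/* One loop pass: store the new sample, return the delayed one. */
uint8_t delay_line_step(delay_line_t *dl, uint8_t sample)
{
    dl->buf[dl->in++] = sample;   /* uint8_t index wraps 255 -> 0 */
    return dl->buf[dl->out++];
}
```

With an offset of 3, the first three calls return the zero fill and the fourth returns the first sample that was written.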

 

 

 

 

 


Last Edited: Tue. Aug 2, 2016 - 05:26 PM

@Skeeve:

 

Thanks for the code. I'll give it a try! Sadly, my C is OK, assembly = nil! It's all I could do to de-mangle the assembly output to figure out where to put the NOP()s!!

BTW, buf size is UINT8_MAX, so OK.

Last Edited: Tue. Aug 2, 2016 - 05:24 PM

@ theusch:

Thank you, btw this is the AVR's only job, so I'm not worried about anything else.

 

I'd like to be able to set a delay of up to 50uS, say in 1 or 2 uS intervals, in hard code, empirically, but when it's done, it's done!

 

Let me figure out how to put this into my C compiler. I guess I'd better crash course on inline assembly...

Last Edited: Tue. Aug 2, 2016 - 05:24 PM

nobba wrote:
Let me figure out how to put this into my C compiler.

I've often posted in "you can't do this in C" threads; [almost?] always on the C side.

 

But C has no concept of buffer pointer wrapping.  So my guess is that even with alignment and my other axioms above the C clocks would be double or more.

 

nobba wrote:
I'd like to be able to set a delay of up to 50uS, say in 1 or 2 uS intervals,
theusch wrote:
2MHz sampling rate. 0-128us delay with 1/2 us resolution.
nobba wrote:
in hard code, empirically, but when its done, its done!
I don't know what you are getting at.


theusch wrote:
nobba wrote:
in hard code, empirically, but when its done, its done!
I don't know what you are getting at.

 

I mean it's set and forget once it comes off the bench! I just have to match the group delay of an audio low-pass filter at RF: match it once (it's around 20uS) and then it's set forever.

So if I can't do it in C (for the reasons you suggest), I'll be doing assembler 101, i.e. finding out how to program my 168 with the code you posted. Mark you, I won't know what's happening if it doesn't work!

So, is the assembly you posted the complete solution? (Bearing in mind I will struggle to edit it.)

 

eg: offset is my "delay", set elsewhere -- yes, before the loop.

 

So what's the assembly version of

 

static const uint8_t offset = 20;

 

??

 

Thanks

Last Edited: Tue. Aug 2, 2016 - 05:43 PM

nobba wrote:

Sadly, my C is OK, assembly = nil!

 

Only a couple changes to make the above a complete program for Atmel assembler:

 

    .DSEG
    .ORG 0x400
BUFFER:
    .BYTE 0x100

    .CSEG
    .ORG 0
; SETUP
    LDI ZH, HIGH(BUFFER)        ; OUTPUT POINTER
    LDI ZL, 0
    
    LDI XH, HIGH(BUFFER)        ; INPUT POINTER
;   LDS XL, offset              ; CALCULATED ELSEWHERE
    LDI XL, 123 ; DELAY IN MICROSECONDS*2

LOOP:
    IN  R16, PINB   ; 1 CYCLE
    STS X, R16      ; 2
    INC XL          ; 1
    LDS R16, Z      ; 2
    INC ZL          ; 1
    OUT PORTD, R16  ; 1
    RJMP LOOP       ; 2     

 


Well, copy n paste: I can try! Thank you.


nobba wrote:
LOOP:
    IN  R16, PINB   ; 1 CYCLE
    STS X, R16      ; 2
    INC XL          ; 1
    LDS R16, Z      ; 2
    INC ZL          ; 1
    OUT PORTD, R16  ; 1
    RJMP LOOP       ; 2

 

Compiler chokes at the line:

STS X, R16      ; 2

 

Should that be STS XL, R16

??

thanks

 

EDIT: clearly not, just tried it :(

 

UPDATE: It compiles with ST instead of STS and LD instead of LDS.

 

Is this OK?

 

Error: Invalid number

Last Edited: Tue. Aug 2, 2016 - 06:11 PM

Sorry.  [I didn't actually run the code, as you see. ;) ]

 

ST X, R16

 

and

 

LD R16, Z

 

 


Hello!

 

Here is the full program. It works but still looks a bit wobbly. Perhaps it's my scope! Will investigate shortly with 'scope #2 in case it's leading me up the garden path.

I will also double-check it's running on the crystal, coz that's the only reason I can see for any remaining jitter. Mind you, having said that, it has a hard time "keeping up" with an input beyond about 500kHz... :(

 

;
; delayass.asm
;
; Created: 02/08/2016 19:02:34
; Author : steve
;

; Replace with your application code
    .DSEG
    .ORG 0x400
BUFFER:
    .BYTE 0x100 ; 256 bytes (0x100 is hex; a bare 100 would be decimal)

    .CSEG
    .ORG 0
; SETUP

    LDI    R16,  0xFF       ; Load 0b11111111 in R16
    OUT    DDRC, R16        ; Configure PortC as an Output port

    LDI    R16,  0x00       ; Load 0b00000000 in R16
    OUT    DDRB, R16        ; Configure PortB as an Input port
	; This is because I am using PB1 as input and will take output on PC1

    LDI ZH, HIGH(BUFFER)        ; OUTPUT POINTER
    LDI ZL, 0

    LDI XH, HIGH(BUFFER)        ; INPUT POINTER
    LDI XL, 0 ; DELAY IN MICROSECONDS*2

LOOP:
    IN  R16, PINB   ; 1 CYCLE
    ST X, R16      ; 2
    INC XL          ; 1
    LD R16, Z      ; 2
    INC ZL          ; 1
    OUT PORTC, R16  ; 1
    RJMP LOOP       ; 2 

Please let me know if there are any howlers in there - first asm build. Ever.

Just note that on the line:

LDI XL, 0 ; DELAY IN MICROSECONDS*2

You should put the required delay in. Zero is in there to measure just the propagation delay, and to prove it works without too much jitter.

 

Still worried that when I multiply the signal back up to 3Megs the 1u or so of jitter will smash the waveform up. Might need a "super AVR" (don't know what that is, just invented it)

Last Edited: Tue. Aug 2, 2016 - 08:11 PM

nobba wrote:
You should put the required delay in. Zero is in there to measure just the propagation delay, and to prove it works without too much jitter.

So, did you get about half a microsecond of propagation delay?

 

Describe what you mean by "jitter".  Remember what I outlined earlier:

theusch wrote:
You said "no jitter"...but with any kind of sampling you would not reproduce the input signal exactly, right? For example, let's say we make this ideal delay line with the AVR, sampling the input every [for example] 1 microsecond. The actual start/end of the incoming pulse could be up to 1 microsecond earlier. It is true, isn't it, that the output pulse high/low times will be like +/- one sample period?

 

Perhaps further describe your input signal.  An arbitrary pulse train?  Or a repeated "frequency"?   If so, is it 50% duty cycle?  And further, if so, are you really trying to do "phase shift"?

 

The above code should sample at 2Msps -- >>if<< your AVR is really running at 20MHz.  Have you proven that?  Nyquist would indicate that at 2Msps up to a 1MHz signal could be reproduced without losing any highs or lows.

 

Can you capture and post a 'scope trace that shows the jitter?


Last Edited: Tue. Aug 2, 2016 - 08:33 PM

And yes, I forgot to make the output pin high in my "complete program". ;)


Hello,

 

So I can't easily post a picture of the scope as it is analog.

 

To tell you a bit more: I set the fuse to give me clock output (no div) out on the pin, setting fuses like so:

-U lfuse:w:0x87:m -U hfuse:w:0xdf:m EDITED TO CORRECT VALUES

 

So I can scope the clock to see that it really is 20 MHz.

 

The application is quite complex: put simply I must preserve any phase-modulated information on the 3 megs (or 4 Megs) signal, but I must delay it by x microseconds.

 

Now, since the AVR cannot handle fast freqs, I thought it best to divide the signal (using D flip flop) to a frequency it can cope with, perform the delay, then multiply the frequency back up to the operating frequency.

 

The scope, with the input on channel A (and triggered on channel A) shows input B (from the AVR output) as "smeary". If I trigger on the output (channel B) then "A", the input frequency, looks smeary.

 

The reason I absolutely must minimize jitter is this: say I have 1 uS of jitter @ 500kHz. I'm worried I will end up with 8 times that when I multiply back up, and that's a whole 'nother can of worms...
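
One way to look at that worry is to express the jitter as a fraction of the signal period. The C sketch below is only an illustration: the name `jitter_milliperiods` is made up, and it assumes the multiplier passes the timing error through unchanged in absolute time, which depends on the multiplier or PLL actually used:

```c
#include <stdint.h>

/* Hypothetical helper: timing jitter expressed in thousandths of one
 * signal period (1000 = one whole period). */
uint32_t jitter_milliperiods(uint32_t jitter_ns, uint32_t freq_hz)
{
    /* periods = jitter_ns * freq_hz / 1e9, scaled by 1000 */
    return (uint32_t)(((uint64_t)jitter_ns * freq_hz) / 1000000u);
}
```

Under that assumption, 1 uS of jitter is half a period at 500kHz (result 500) but four whole periods at 4MHz (result 4000), i.e. 8 times worse relative to the period.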

 

All my tests so far are KISS: testing with 1:1 square wave only from a waveform generator.

 

According to my 'scope, with the buffer read set to 0, I am seeing a (jittery) 0.5 - 1uS prop delay.

Last Edited: Tue. Aug 2, 2016 - 09:34 PM

nobba wrote:
So I can't easily post a picture of the scope as it is analog.

That's why I like my smartphone.  Or other digital camera...

nobba wrote:
The scope, with the input on channel A (and triggered on channel A) shows input B (from the AVR output) as "smeary". If I trigger on the output (channel B) then "A", the input frequency, looks smeary.

But does it "catch up"?  As I said, as far as I can see, the edges will always be +/- one sample period for >>any<< sampling-type setup.  Right?

 

So your 'scope has no one-shot or similar mode to show a few cycles across the screen?

 

 


Maybe you could use the USART in SPI mode to handle the signal? After all, it is buffered, so maybe with the proper coding you will not lose any input. Also, the peripheral will do most of the work, leaving the cpu breathing time to handle memory transactions and delay calculations. I think 3-4 MHz may be possible; of course I'm just speculating and didn't do any actual coding.

And yeah, this is one of those times when assembly will be needed.

Edit: As theusch said, because of the Nyquist theorem, you have to sample the signal at no less than twice its frequency. So the USART clock should be set to the maximum possible value; I think it's system clock/2 (10 MHz). I don't actually know if this leaves enough cpu time to process the delay and handle the interface with the peripheral. Maybe tomorrow I'll write some code.

Last Edited: Tue. Aug 2, 2016 - 10:50 PM

nobba wrote:
Thanks for the code. I'll give it a try! Sadly, my C is OK, assembly = nil! Its all I could do to de-mangle the assembly output to figure out where to put the NOP()s!! BTW, buf size is UINT8_MAX, so OK.
My code requires BUF_SIZE to be a power of 2.

With BUF_SIZE=0x100 you would not need the index mask at all.

 

Reliable cycle-accurate timing pretty much requires assembly.

Subcycle-accurate timing is not available.

 

As another noted, there is a variable delay between the time a signal

changes on a pin and it is available for reading within the AVR.

How big a jitter are we discussing?

 

Edit: Is there a chance that you have been trying to read or write buffer[0x100]?


Last Edited: Wed. Aug 3, 2016 - 12:36 AM

skeeve wrote:
Edit: Is there a chance that you have been trying to read or write buffer[0x100]?

 

Good point. I did check for this at one point, I *think* it was reading from 0 to UINT8_MAX -1, with the array dimensioned to be ar[UINT8_MAX + 1] just for good measure when I doubted myself...

 

I was letting the counter (uint8_t) simply overflow.

 

 


El Tangas wrote:

Maybe you could use the USART in SPI mode to handle the signal? After all, it is buffered, so maybe with the proper coding you will not lose any input. Also, the peripheral will do most of the work, leaving the cpu breathing time to handle memory transactions and delay calculations. I think 3-4 MHz may be possible; of course I'm just speculating and didn't do any actual coding.

And yeah, this is one of those times when assembly will be needed.

Edit: As theusch said, because of the Nyquist theorem, you have to sample the signal at no less than twice its frequency. So the USART clock should be set to the maximum possible value; I think it's system clock/2 (10 MHz). I don't actually know if this leaves enough cpu time to process the delay and handle the interface with the peripheral. Maybe tomorrow I'll write some code.

 

OK, thanks. Let me know what you find. Always ready to learn!


El Tangas wrote:
Maybe you could use the USART in SPI mode to handle the signal? After all, it is buffered, so maybe with the proper coding you will not lose any input.

Good idea.  I didn't think of that one--a fake SPI.  Input on RXD; output on TXD.  At say 8MHz SPI clock that gives 1us granularity.

 

El Tangas wrote:
I think its system clock/2 (10 MHz). I don't actually know if this leaves enough cpu time to process the delay and handle interface with the peripheral.

If one sets the SPI clock "properly", then [theoretically at least] one could cycle-count and not have to do any flag checking.

 

The loop would look very much like my bit-at-a-time, right?  Now, are the needed registers in reach of IN/OUT...

No, darn it.  Another two cycles wasted.  But doing 8 bits at a time should give good results.

 

Or does it...

 

El Tangas wrote:
So the USART clock should be set at maximum possible value, I think its system clock/2 (10 MHz).

But it appears that the max clock rate would be clk/16 -- 1.25MHz.

 

But maybe/probably U2X applies so then clk/8 2.5MHz.  8 clocks per bit means 64 clocks per byte, right?  Lots of time...


Simpler, clearly jitter-free:

enum { DELAY=137 } ; // in while loop passes
enum { BUF_BITS=8 } ;
enum { BUF_SIZE=1<<BUF_BITS } ; // must be a power of two for the index mask
uint8_t buf[BUF_SIZE];
uint8_t port1=PORTC;
memset(buf, port1, DELAY);  // DELAY passes of sameness

uint8_t jin=DELAY, jout=0;
while(1) {
    buf[jin++ & (BUF_SIZE-1)]=shift(PINC & IN_HIGH) | (port1 & ~OUT_HIGH);
            // shift to get the right bit position
    PORTC=buf[jout++ & (BUF_SIZE-1)];
} // while

 


But it appears that the max clock rate would be clk/16 -- 1.25MHz.

Not for MSPI:
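
For reference, the datasheet baud-rate equations have the form BAUD = f_osc / (divisor * (UBRR + 1)), with divisor 16 for normal asynchronous mode, 8 for double-speed (U2X) asynchronous mode, and 2 for synchronous master mode, which is what MSPIM uses. A small C sketch of that arithmetic (the names `usart_baud` and the `USART_*` constants are invented for this example, not AVR-libc identifiers):

```c
#include <stdint.h>

/* Clock divisors from the datasheet baud-rate equations:
 * BAUD = f_osc / (divisor * (UBRR + 1)). */
enum {
    USART_ASYNC_NORMAL = 16,  /* asynchronous, U2Xn = 0 */
    USART_ASYNC_DOUBLE = 8,   /* asynchronous, U2Xn = 1 */
    USART_SYNC_MASTER  = 2    /* synchronous master, including MSPIM */
};

/* Hypothetical helper: bit rate in Hz for a given UBRR value and mode. */
uint32_t usart_baud(uint32_t f_osc_hz, uint16_t ubrr, uint32_t divisor)
{
    return f_osc_hz / (divisor * ((uint32_t)ubrr + 1u));
}
```

At 20 MHz with UBRR = 0 this gives 1.25 MHz, 2.5 MHz and 10 MHz respectively, which matches the figures quoted in the posts above.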

 

 

This might do it:

#define __SFR_OFFSET 0
#include <avr/io.h>

; Must be a power-of-two no greater than half of the available SRAM
#define DL_SIZE_BYTES 64

; 2 <= DELAY_BYTES < DL_SIZE_BYTES
; Since two bytes go  out from the tail of the delay line  before any are added
; to the head, the delay line must be at least 2 bytes long, or 16 samples, for
; a minimum delay of 1.6 us.
#define DELAY_BYTES 25

; Index registers by name
#define XL r26
#define XH r27
#define YL r28
#define YH r29

; 10 MHz sample rate  will generate 10 samples per us.  50  us will require 500
; 1-bit samples.  512 samples would fit in  64 bytes.  Although the m168 has 1K
; of SRAM, it is mapped starting at 0x100.  In order to keep the buffer aligned
; to a  power-of-two equal  to its size,  it cannot be  larger than  512 bytes.
; That would permit a 4096 sample delay line.  At 10 MHz, that's 409.6 us.
.section  .bss
; .comm symbols are placed by the linker, so request the alignment here
.comm    dl, DL_SIZE_BYTES, DL_SIZE_BYTES

.section .text

.global __do_clear_bss

.global main

main:

; configure SPI for F_OSC/2 = 10 MHz
        eor     r1,     r1
        sts     UBRR0H, r1
        sts     UBRR0L, r1
        sbi     DDRD,   4
        ldi     r16,    (1<<UMSEL01)|(1<<UMSEL00)
        sts     UCSR0C, r16
        ldi     r16,    (1<<RXEN0)|(1<<TXEN0)
        sts     UCSR0B, r16
        sts     UBRR0H, r1
        sts     UBRR0L, r1

; Since  it's implemented  as a  circular  buffer of  bytes, the  delay can  be
; configured with a granularity of 8 bits, or 0.8 us.

; X is  used to point to  the head of  the delay line, where  incoming samples
; will be deposited
        ldi     XH,     hi8(dl+DELAY_BYTES)
        ldi     XL,     lo8(dl+DELAY_BYTES)
; Y is used to point to the tail of the delay line, where outgoing samples will
; be withdrawn
        ldi     YH,     hi8(dl)
        ldi     YL,     lo8(dl)


; Fill the MSPI TX buffer and wait for  the first RX byte to be ready, i.e. the
; first read must  occur at least 16  cycles after the first  write.  The third
; write must  occur no more  than 32  cycles after the  first write, or  the TX
; buffer will be empty, and there will be a gap.
        ld      r16,    Y+                        ;                    
        sts     UDR0,   r16                       ;         1st write
        ld      r16,    Y+                        ;                   2
        sts     UDR0,   r16                       ;         2nd write 2

; 3 cycles per pass, total 15 cycle wait.
        ldi     r16,    5                         ;
wait:
        dec     r16                               ;
        brne    wait                              ;                  15

; Run the delay line.  Loop must be exactly 16 cycles
loop:
        lds     r16,    UDR0                      ; 2       1st read  2 = 21
        st      X+,     r16                       ; 2                 2
        ld      r16,    Y+                        ; 2                 2
        sts     UDR0,   r16                       ; 2       3rd write 2 = 27
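; Wrap both pointers back into the buffer.  Masking alone would drop the
; buffer's base address, so it is OR-ed back in.  Assumes DL_SIZE_BYTES <= 256
; and dl aligned to DL_SIZE_BYTES.
        andi    XL,     DL_SIZE_BYTES-1           ; 1
        ori     XL,     lo8(dl)                   ; 1
        ldi     XH,     hi8(dl)                   ; 1
        andi    YL,     DL_SIZE_BYTES-1           ; 1
        ori     YL,     lo8(dl)                   ; 1
        ldi     YH,     hi8(dl)                   ; 1
        rjmp    loop                              ; 2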
                                                  ; = 16

Compiles, and looks like it will work, but COMPLETELY UNTESTED.

 

At 20 MHz system clock, and a 10 MHz sample clock, a 3 MHz input will show a fair amount of jitter since 10 isn't an integer multiple of 3.  Nyquist tells us that the sample frequency must be > twice the signal frequency, so we need >>more<< than two samples per period of the input signal.  In order to minimise jitter we'd want an even integral number of samples per period, so a minimum of 4 samples per period total.  That would be a 12 MHz sample frequency.  Not achievable with a 20 MHz system clock.  You could overclock to 24 MHz, or live with the jitter, or reduce the frequency of your input signal.
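
The effect of a non-integer ratio can be seen with a little integer arithmetic. This C sketch is illustrative only (the name `samples_in_period` is made up, and it ignores any phase offset between the two clocks); it counts how many sample instants land in each successive period of the input:

```c
#include <stdint.h>

/* Hypothetical helper: number of sample instants k/fs that fall in the
 * n-th input period, counted over the interval (n/f, (n+1)/f]. */
uint32_t samples_in_period(uint32_t n, uint32_t fs_hz, uint32_t f_hz)
{
    uint64_t upto_end   = ((uint64_t)(n + 1) * fs_hz) / f_hz;
    uint64_t upto_start = ((uint64_t)n * fs_hz) / f_hz;
    return (uint32_t)(upto_end - upto_start);
}
```

With a 10 MHz sample clock and a 3 MHz input the count goes 3, 3, 4 and repeats, so the reproduced periods are not all the same length. At 12 MHz every period gets exactly 4 samples.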

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Wed. Aug 3, 2016 - 06:54 AM

Nice coding, yeah, that's what I had in mind. Let's see if nobba tests it and it works in practice. If there are any timing errors, it will be quite hard to debug, but if you are working at the limits of the MCU, that's just the way it is.


joeymorin wrote:
Not for MSPI:

Indeed, it was late when I went through that datasheet section and rolled right by the actual equations.

 

I was wondering whether one would want to start with a "seed" write to start filling the double-buffers.  But perhaps while the very first iteration may have a bit of a gap things would then "catch up" after cycle counting?

 

 

 

 


It's very likely I've missed something.  If I have time later today I will try to test it.

 

Oh, and the way to compute the number of bytes you'll need for a given delay is simple.  May as well add that to the code.

 

Note that floating point arithmetic in preprocessor directives isn't supported by GCC for assembler sources, thus the delay is specified in nanoseconds instead of microseconds.
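
The integer arithmetic involved can be checked on a PC first. The C sketch below is an illustration under stated assumptions: the names `delay_bytes` and `real_delay_ns` are made up, and it assumes one sample per two CPU clocks (MSPI at F_CPU/2), eight samples per byte, and an F_CPU that is a multiple of 2 MHz:

```c
#include <stdint.h>

/* Hypothetical helper: bytes of delay line for a wanted delay, clamped to
 * the range 2 .. buf_bytes - 1. */
uint32_t delay_bytes(uint32_t delay_ns, uint32_t f_cpu_hz, uint32_t buf_bytes)
{
    uint32_t bits_per_us = f_cpu_hz / 2000000u;   /* samples per microsecond */
    uint64_t bytes = ((uint64_t)delay_ns * bits_per_us) / 8000u;
    if (bytes >= buf_bytes) {
        bytes = buf_bytes - 1;                    /* delay too long */
    }
    if (bytes < 2) {
        bytes = 2;                                /* delay too short */
    }
    return (uint32_t)bytes;
}

/* Hypothetical helper: the delay actually obtained, in nanoseconds. */
uint32_t real_delay_ns(uint32_t bytes, uint32_t f_cpu_hz)
{
    uint32_t bits_per_us = f_cpu_hz / 2000000u;
    return (uint32_t)(((uint64_t)bytes * 8000u) / bits_per_us);
}
```

At 20 MHz, a 20000 ns request gives 25 bytes and exactly 20000 ns, while a 50000 ns request gives 62 bytes and 49600 ns because of the 0.8 us granularity.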

 

And I suppose there's no reason to limit the buffer to 64 bytes.  Might as well just go for the maximum size.

 

Also, note that the delay you specify does not include the 0.5-1.5 cpu cycles synchronisation delay imposed by hardware, nor the half-bit (i.e. 1 cpu cycle) phase delay between TX and RX.  The result is likely to be a -0.5 to +0.5 cycle jitter.  A scope or LA would be required to characterise the actual delay offset on real hardware.

 

Further, it should be possible to change the length of the delay line at runtime by communicating a new value over SPI (or TWI, or another USART, if your AVR has one).  With interrupt-driven SPI slave code, an ISR could compute a new value for DELAY_BYTES and set new values for the head and tail registers [X and Y], and then resynchronise MSPI.  Here's the new code with changes made.  I've omitted the code required to receive runtime changes to the length of the delay line, that is left as an exercise for the reader ;-)

 

#define F_CPU 20000000

#define __SFR_OFFSET 0
#include <avr/io.h>

#define DELAY_NS 20000

; Must be a power-of-two no greater than half of the available SRAM
#define BUF_SZ_BYTES (((RAMEND + 1) - RAMSTART) / 2)

; 2 <= DELAY_BYTES < BUF_SZ_BYTES
; Since two bytes go  out from the tail of the delay line  before any are added
; to the head, the delay line must be at least 2 bytes long, or 16 samples, for
; a minimum delay of 1.6 us.
#define BITS_PER_US (F_CPU / 2000000)
#define DELAY_BYTES ((DELAY_NS * BITS_PER_US) / 8000)
#if (DELAY_BYTES >= BUF_SZ_BYTES)
  #warning DELAY_NS is too long
  #undef DELAY_BYTES
  #define DELAY_BYTES (BUF_SZ_BYTES - 1)
#endif
#if (DELAY_BYTES < 2)
  #warning DELAY_NS is too short
  #undef DELAY_BYTES
  #define DELAY_BYTES 2
#endif

; Create a symbol reflecting the real delay.  Examine it with avr-objdump -t
; or similar.
.equ real_delay, (DELAY_BYTES * 8000) / BITS_PER_US

; Index registers by name
#define XL r26
#define XH r27
#define YL r28
#define YH r29

; 10 MHz sample rate  will generate 10 samples per us.  50  us will require 500
; 1-bit samples.  512 samples would fit in  64 bytes.  Although the m168 has 1K
; of SRAM, it is mapped starting at 0x100.  In order to keep the buffer aligned
; to a  power-of-two equal  to its size,  it cannot be  larger than  512 bytes.
; That would permit a 4096 sample delay line.  At 10 MHz, that's 409.6 us.
.section  .bss
; .comm symbols are placed by the linker, so request the alignment here
.comm    dl, BUF_SZ_BYTES, BUF_SZ_BYTES

.section .text

.global __do_clear_bss

.global main

main:

; configure SPI for F_OSC/2 = 10 MHz
        eor     r1,     r1
        sts     UBRR0H, r1
        sts     UBRR0L, r1
        sbi     DDRD,   4
        ldi     r16,    (1<<UMSEL01)|(1<<UMSEL00)
        sts     UCSR0C, r16
        ldi     r16,    (1<<RXEN0)|(1<<TXEN0)
        sts     UCSR0B, r16
        sts     UBRR0H, r1
        sts     UBRR0L, r1

; Since  it's implemented  as a  circular  buffer of  bytes, the  delay can  be
; configured with a granularity of 8 bits, or 0.8 us.

; X is  used to point to  the head of  the delay line, where  incoming samples
; will be deposited
        ldi     XH,     hi8(dl+DELAY_BYTES)
        ldi     XL,     lo8(dl+DELAY_BYTES)
; Y is used to point to the tail of the delay line, where outgoing samples will
; be withdrawn
        ldi     YH,     hi8(dl)
        ldi     YL,     lo8(dl)

; Fill the MSPI TX buffer and wait for  the first RX byte to be ready, i.e. the
; first read must  occur at least 16  cycles after the first  write.  The third
; write must  occur no more  than 32  cycles after the  first write, or  the TX
; buffer will be empty, and there will be a gap.
        ld      r16,    Y+                        ;
        sts     UDR0,   r16                       ;         1st write
        ld      r16,    Y+                        ;                   2
        sts     UDR0,   r16                       ;         2nd write 2

; 3 cycles per pass, total 15 cycle wait.
        ldi     r16,    5                         ;
wait:
        dec     r16                               ;
        brne    wait                              ;                  15

; Run the delay line.  Loop must be exactly 16 cycles
loop:
        lds     r16,    UDR0                      ; 2       1st read  2 = 21
        st      X+,     r16                       ; 2                 2
        ld      r16,    Y+                        ; 2                 2
        sts     UDR0,   r16                       ; 2       3rd write 2 = 27
        andi    XL,     (BUF_SZ_BYTES-1) & 0xFF   ; 1
        andi    XH,     (BUF_SZ_BYTES-1) >> 8     ; 1
        andi    YL,     (BUF_SZ_BYTES-1) & 0xFF   ; 1
        andi    YH,     (BUF_SZ_BYTES-1) >> 8     ; 1
        rjmp    .+0                               ; 2
        rjmp    loop                              ; 2
                                                  ; = 16

 

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Wed. Aug 3, 2016 - 01:45 PM

joeymorin wrote:

It's very likely I've missed something.  If I have time later today I will try to test it.

 

Oh, and the way to compute the number of bytes you'll need for a given delay is simple.  May as well add that to the code.

 

Note that floating point arithmetic in preprocessor directives isn't supported by GCC for assembler sources, thus the delay is specified in nanoseconds instead of microseconds.

 

And I suppose there's no reason to limit the buffer to 64 bytes.  Might as well just go for the maximum size.

....
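
As a rough host-side sketch of the calculation described in the quote above (plain C, not part of the posted assembler source; the function names are made up for illustration), the integer-only arithmetic works out as follows. It assumes the serial clock runs at F_CPU / 2, so a 20 MHz part shifts 10 bits per microsecond:

```c
/* Hypothetical helpers mirroring the integer-only macros in the listing. */

/* Bits shifted per microsecond when the serial clock runs at F_CPU / 2. */
static long bits_per_us(long f_cpu_hz) {
    return f_cpu_hz / 2000000L;
}

/* Whole buffer bytes needed for a delay given in nanoseconds,
   clamped to 2 <= bytes < buf_sz_bytes as in the listing. */
static long delay_bytes(long delay_ns, long f_cpu_hz, long buf_sz_bytes) {
    long bytes = (delay_ns * bits_per_us(f_cpu_hz)) / 8000L;
    if (bytes >= buf_sz_bytes) bytes = buf_sz_bytes - 1;
    if (bytes < 2) bytes = 2;
    return bytes;
}

/* Delay actually obtained, in nanoseconds, after rounding to whole bytes. */
static long real_delay_ns(long bytes, long f_cpu_hz) {
    return (bytes * 8000L) / bits_per_us(f_cpu_hz);
}
```

For a 20 us request at 20 MHz this gives 25 bytes and exactly 20000 ns; a 50 us request rounds down to 62 bytes, or 49600 ns, because the delay can only move in whole bytes.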

 

Hello guys and thanks for your sterling efforts.

 

Ok, baby steps here for me please.

 

I am totally willing to try to compile the code, and take a look on the scope (btw, it's of 1988 vintage, much like my 'tronics knowledge, hence baby steps)

 

So, baby steps in mind -- using this code, I have no clue which pin to put the input signal in on, and which pin the output will appear. I will pore over the datasheet to see if I can fig it out.

 

 

BTW, to the guy with the C code: shift(...) doesn't exist on my compiler, so I have no idea what you mean by that line. I did paste it in, and it fails to compile. I guess you mean a left shift, but it's just a guess.

 

Last Edited: Wed. Aug 3, 2016 - 02:49 PM

Input on RXD, output on TXD. For the m168 and friends, that's PD0 and PD1.


joeymorin wrote:
Input on RXD, output on TXD. For the m168 and friends, thats PD0 and PD1.

 

Ok, I see that on pins 2 & 3. Is the configuration of these pins required in addition to the code? (DDRD)


Ok, continuing on the baby steps theme.

 

It doesn't compile for me in AtmelStudio at all.

 

"Cannot find include file avr/io.h"

"Invalid directive: section"

 

I think this is not for the same compiler I have installed.

 

 


nobba wrote:

joeymorin wrote:
Input on RXD, output on TXD. For the m168 and friends, thats PD0 and PD1.

 

Ok, I see that on pins 2 & 3. Is the configuration of these pins required in addition to the code? (DDRD)

 

I think it's automatic for RXD and TXD; some peripherals just take over their pins. However, you can see in the initialization code that PD4 is set as an output: this is the clock pin (XCK). It is not needed here, but you could use it for debugging, since it generates a 10 MHz square wave synchronized with the output.

 

 

edit: yeah, it's a different assembler. To build it in Atmel Studio, some minor changes need to be made.

 

edit #2: as it happens, I had already converted the original version :) (hope there are no mistakes):

 



 #define __SFR_OFFSET 0

; Must be a power-of-two no greater than half of the available SRAM
#define DL_SIZE_BYTES 64

; 2 <= DELAY_BYTES < DL_SIZE_BYTES
; Since two bytes go  out from the tail of the delay line  before any are added
; to the head, the delay line must be at least 2 bytes long, or 16 samples, for
; a minimum delay of 1.6 us.
#define DELAY_BYTES 25

; Index registers by name
#define XL r26
#define XH r27
#define YL r28
#define YH r29

; 10 MHz sample rate  will generate 10 samples per us.  50  us will require 500
; 1-bit samples.  512 samples would fit in  64 bytes.  Although the m168 has 1K
; of SRAM, it is mapped starting at 0x100.  In order to keep the buffer aligned
; to a  power-of-two equal  to its size,  it cannot be  larger than  512 bytes.
; That would permit a 4096 sample delay line.  At 10 MHz, that's 409.6 us.
.DSEG
.IF	(DL_SIZE_BYTES > SRAM_START)
	.ORG	DL_SIZE_BYTES
.ELSE
	.ORG	SRAM_START
.ENDIF
dl:	.BYTE	DL_SIZE_BYTES

.CSEG

.ORG 0

; configure SPI for F_OSC/2 = 10 MHz
        eor     r1,     r1
        sts     UBRR0H, r1
        sts     UBRR0L, r1
        sbi     DDRD,   4
        ldi     r16,    (1<<UMSEL01)|(1<<UMSEL00)
        sts     UCSR0C, r16
        ldi     r16,    (1<<RXEN0)|(1<<TXEN0)
        sts     UCSR0B, r16
        sts     UBRR0H, r1
        sts     UBRR0L, r1

; Since  it's implemented  as a  circular  buffer of  bytes, the  delay can  be
; configured with a granularity of 8 bits, or 0.8 us.

; X is  used to point to  the head of  the delay line, where  incoming samples
; will be deposited
        ldi     XH,     HIGH(dl+DELAY_BYTES)
        ldi     XL,     LOW(dl+DELAY_BYTES)
; Y is used to point to the tail of the delay line, where outgoing samples will
; be withdrawn
        ldi     YH,     HIGH(dl)
        ldi     YL,     LOW(dl)

; Fill the MSPI TX buffer and wait for  the first RX byte to be ready, i.e. the
; first read must  occur at least 16  cycles after the first  write.  The third
; write must  occur no more  than 32  cycles after the  first write, or  the TX
; buffer will be empty, and there will be a gap.
        ld      r16,    Y+                        ;
        sts     UDR0,   r16                       ;         1st write
        ld      r16,    Y+                        ;                   2
        sts     UDR0,   r16                       ;         2nd write 2

; 3 cycles per pass, total 15 cycle wait.
        ldi     r16,    5                         ;
wait:
        dec     r16                               ;
        brne    wait                              ;                  15

; Run the delay line.  Loop must be exactly 16 cycles
loop:
        lds     r16,    UDR0                      ; 2       1st read  2 = 21
        st      X+,     r16                       ; 2                 2
        ld      r16,    Y+                        ; 2                 2
        sts     UDR0,   r16                       ; 2       3rd write 2 = 27
        andi    XL,     (DL_SIZE_BYTES-1) & 0xFF  ; 1
        andi    XH,     (DL_SIZE_BYTES-1) >> 8    ; 1
        andi    YL,     (DL_SIZE_BYTES-1) & 0xFF  ; 1
        andi    YH,     (DL_SIZE_BYTES-1) >> 8    ; 1
        rjmp    go                                ; 2
	go:
        rjmp    loop                              ; 2
                                                  ; = 16

 

Last Edited: Wed. Aug 3, 2016 - 03:14 PM

El Tangas wrote:

nobba wrote:

joeymorin wrote:
Input on RXD, output on TXD. For the m168 and friends, thats PD0 and PD1.

 

Ok, I see that on pins 2 & 3. Is the configuration of these pins required in addition to the code? (DDRD)

 

I think it's automatic for RXD and TXD, some peripherals just take over the pins. However, you can see in the initialization code that PD4 was set as output, this is the clock pin. It is not needed here, but maybe you can use it for debugging, it will generate a 10MHz square wave synchronized with the output.

 

 

edit: yeah, it's a different assembler, to do it in atmel studio, some minor changes need to be made.

 

To get it to compile: I made a gcc project, emptied the main.c file , and added your file as a .s (little s) file to the project.

 

About to try it


Hello,

 

Not really sure what I am seeing here. It looks like some other pulse modulated by the input signal.

 

In fact, with no input signal supplied (pin floating), I see what look like 1 or 2 us pulses every now and then.

 

Grounding the input pin just leaves a train of these spurious pulses.

 

I don't know if that might mean anything to you?

 

EDIT: I also noticed on PD4 there are bursts of 10MHz square waves on for about a second, off for about a second (just timed it in my head, nothing accurate)

Last Edited: Wed. Aug 3, 2016 - 03:32 PM

nobba wrote:
BTW, the guy with your C code -- shift(...) doesn't exist on my compiler, so no idea what you mean by that line. I did paste it in. It fails to compile. I guess you mean left shift, but its just a guess..

I did not know the bit positions, so I did not even know whether to shift left or right.

I'd intended OP to replace it with the appropriate << or >> .

Perhaps I should have added the comment shift(IN_HIGH)==OUT_HIGH.
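
To make that concrete, here is a small sketch in plain C of what the shift(...) placeholder stands for. The bit positions (input on bit 0, output on bit 3) and the helper name are assumptions chosen only for illustration; substitute the real pin masks.

```c
#include <stdint.h>

/* Assumed bit positions, for illustration only. */
#define IN_HIGH   (1u << 0)   /* input pin mask  (hypothetical) */
#define OUT_HIGH  (1u << 3)   /* output pin mask (hypothetical) */
#define SHIFT_BY  3           /* distance from input bit to output bit */

/* Move the sampled input bit to the output bit position.
   With these masks the output sits above the input, so shift left;
   if it sat below, this would be a right shift instead. */
static uint8_t shift_in_to_out(uint8_t port_sample) {
    return (uint8_t)((port_sample & IN_HIGH) << SHIFT_BY);
}
```

With masks chosen this way, shift_in_to_out(IN_HIGH) equals OUT_HIGH, which is the property the comment was meant to state.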

Iluvatar is the better part of Valar.


nobba wrote:
To get it to compile: I made a gcc project, emptied the main.c file , and added your file as a .s (little s) file to the project. About to try it

Are you talking about running the code as a result of this? It's not going to work without the use of -nostartfiles if you are really building it in an avr-gcc project.


El Tangas wrote:

nobba wrote:

joeymorin wrote:
Input on RXD, output on TXD. For the m168 and friends, thats PD0 and PD1.

 

Ok, I see that on pins 2 & 3. Is the configuration of these pins required in addition to the code? (DDRD)

 

I think it's automatic for RXD and TXD, some peripherals just take over the pins. However, you can see in the initialization code that PD4 was set as output, this is the clock pin. It is not needed here, but maybe you can use it for debugging, it will generate a 10MHz square wave synchronized with the output.

 

 

edit: yeah, it's a different assembler, to do it in atmel studio, some minor changes need to be made.

 

edit #2: it so happens I had converted the original version smiley  (hope there are no mistakes):

 




 

 

Aaaannnddd... the compiler says:

 

Severity: Warning
Description: .org 0x40 in .dseg is below start of RAM at 0x100
Project: delay-advanced
File: delay-advanced\main.asm
Line: 25

 


clawson wrote:

nobba wrote:
To get it to compile: I made a gcc project, emptied the main.c file , and added your file as a .s (little s) file to the project. About to try it

Are you talking about running the code as a result of this? It's not going to work without the use of -nostartfiles if you are really building it in an avr-gcc project.

 

Thank you, I enabled that flag and now I see what I described: the output square wave has a lot of unwanted junk along with it. I have no clue what it is. It's just a blur on the scope, so I'm guessing it's 10 MHz spikes.


Try the attached .hex file.  Built with avr-gcc.  I still haven't tested it myself, but I will try this afternoon.

 

 

 

 

Attachment(s): 


joeymorin wrote:

Try the attached .hex file.  Built with avr-gcc.  I still haven't tested it myself, but I will try this afternoon.

 

 

 

 

Ok, thanks.

 

For me, I don't see much. Difficult for the scope to sync. Even the input is jumping around.

 


I ought to stay out of this...

 

But that said:

 

1) I think you need the right tools for the job.

A decent dual-channel digital O'scope with a robust trigger module would sure make life easier.

It will be very difficult to measure your system performance with an analog scope, you might end up with single impulse test measurements to do this.

 

2) I generally only tinker with electronics, so I have much less cross platform experience than many of the Forum regulars.

That said, I started years ago with the Basic Stamp.

When I needed interrupts and more memory, I moved to Pic's.

When I found AVR's I moved to AVR's and haven't used a Pic since then, (except I guess now I am using MicroChip parts, again...)

When I wanted a faster micro and priority interrupts I switched to the Xmega line, and most of my projects have used an Xmega since then.

.

So, what's the point of this story?

Occasionally when one is trying to push the limits on one's current technology, it is time to make a paradigm shift to something totally new.

It seems to me that this project might be trivial on a several MHz - GHz ARM chip, except for the obvious learning curve of new hardware and new development platform.

 

3) Not ready for the switch to an ARM?

One might consider switching to an Xmega series AVR.

It will run, in spec, at 32 MHz, which might give one considerably less jitter in the signal processing chain.

As it appears that this won't use the analog functions or EEPROM, it is probably reasonable to overclock it to 48 MHz, giving even better performance.

Atomic Zombie routinely overclocks them higher, I started to get into trouble above 48 MHz when I built a custom "logic analyzer" to snoop a signal once.

I've not used the "virtual ports" feature of the Xmega, but it might also be worth reading that section of the manual to see if the port functions can be sped up any through their usage.

If you look at the DMA read to buffer or write from buffer capability remember that the DMA doesn't actually run in parallel with the uC core, if I remember correctly, so don't think of it as a parallel processor.

 

4) Obviously, as you are doing, start at a lower frequency where it is easier to see and measure the system performance, and debug the concept and the software, then start pushing the limits, (calculate your expected limit, and then measure it).

 

5) Some projects benefit from developing custom testbed hardware first, to help develop and debug your primary project.

For example, I built an EKG signal simulator to aid in the development of an EKG monitor.

You might consider a simple uC signal generator project, also.

Push a switch and it generates a single pulse, and a scope trigger, used to watch the pulse propagate through your project.

Push another switch and get a pulse of twice the period.

Push another switch and get a burst of 5 pulses.

Push another switch and get a continuous test stream, (along with your synchronizing O'scope trigger pulse).

etc.

 

6)  I've heard of PIC's DSP series micros, for digital signal processing.

I've never read their data sheets, or worked with one.

I don't know if they bring anything useful to the project or not.

 

7) At 48 MHz you might get away with coding your project entirely in "C", or at least minimizing your need for ASM, as you work towards getting a working low freq prototype up and running.

(No spurious 10 MHz noise spikes, etc.).

Then work on optimizing the system's performance.

 

Sounds like an interesting project!

 

JC

 

 

 


Appreciate your comments, JC.

 

If I wanted an easy life, I would do it all in analog: heterodyne the 3 MHz signal +/- LO to 200 kHz, all-pass filter it, then het it back up to 3 MHz using the same LO.

 

It's just the component count putting me off.

 

I do have an XMEGA here (and a dev board), but I figured that 32 MHz vs 20 MHz isn't *that* much faster.

 

Now, if I could do the whole shebang (3 MHz --> IF --> delay --> back up to 3 MHz) in one chip, then I would want *that* chip. So maybe ARM. But then the complexity might outweigh the heterodyne approach.


If I wanted an easy life, I would do it all in analog

:D  Right.  Easy.  Weekend Project!  Or not!

 

Anyway, I don't believe you described the BW of the base signal.

 

JC 


It's a single-sideband (SSB) voice signal, so approx. 150 Hz to 3 kHz.

 

But it's only the phase component I am concerned with in this part of the project. The amplitude part is done elsewhere, in a PWM modulator that does not use an AVR at all (just a triangle generator and a comparator, then some huge FETs and an LPF), and that path delays the amplitude by 20-50 us. Hence the need to match the delays so as to keep the phase relationships.

Last Edited: Wed. Aug 3, 2016 - 05:35 PM

Rookie mistake :(

 

When I was applying the mask to X and Y in the main loop, I was clearing the higher bit representing the base address of the buffer.  The result is that both X and Y were pointing at the GP register file.  I was clobbering state.  The machine was running amok!
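
The effect is easy to reproduce off-target. The sketch below is plain C on a host, not the AVR code itself. It assumes a 512-byte buffer based at 0x0200 (what aligning the buffer to its own size would give on an m168), and the function names are invented for the illustration.

```c
#include <stdint.h>

#define BUF_BASE  0x0200u   /* assumed buffer address, aligned to its size */
#define BUF_SIZE  0x0200u   /* 512 bytes, a power of two */

/* Buggy wrap: the mask also clears the base-address bit, so the pointer
   lands in 0x0000..0x01FF, a range that starts at the register file. */
static uint16_t wrap_buggy(uint16_t ptr) {
    return (uint16_t)(ptr & (BUF_SIZE - 1u));
}

/* Fixed wrap: mask down to an offset, then OR the base back in. */
static uint16_t wrap_fixed(uint16_t ptr) {
    return (uint16_t)((ptr & (BUF_SIZE - 1u)) | BUF_BASE);
}
```

The buggy version sends the very first pointer value, 0x0200, to address 0x0000, while the fixed version keeps every value inside 0x0200..0x03FF and wraps 0x0400 back to 0x0200.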

 

New code below.  Tested, and 'works', although I haven't confirmed if the delay is correct.  But the input is in fact now duplicated on the output.

 

#ifndef F_CPU
  #define F_CPU 20000000
#endif

#define __SFR_OFFSET 0
#include <avr/io.h>

; Set to desired delay
#define DELAY_NS 20000

; Must be a power-of-two no greater than half of the available SRAM
#define BUF_SZ_BYTES (((RAMEND + 1) - RAMSTART) / 2)

; 2 <= DELAY_BYTES < BUF_SZ_BYTES
; Since two bytes go  out from the tail of the delay line  before any are added
; to the head, the delay line must be at least 2 bytes long, or 16 samples, for
; a minimum delay of 1.6 us.
#define BITS_PER_US (F_CPU / 2000000)
#define DELAY_BYTES ((DELAY_NS * BITS_PER_US) / 8000)
#if (DELAY_BYTES >= BUF_SZ_BYTES)
  #warning DELAY_NS is too long
  #undef DELAY_BYTES
  #define DELAY_BYTES (BUF_SZ_BYTES - 1)
#endif
#if (DELAY_BYTES < 2)
  #warning DELAY_NS is too short
  #undef DELAY_BYTES
  #define DELAY_BYTES 2
#endif

; Create a symbol reflecting the real delay.  Examine it with avr-objdump -t
; or similar.
.equ real_delay_ns, (DELAY_BYTES * 8000) / BITS_PER_US

; Index registers by name
#define XL r26
#define XH r27
#define YL r28
#define YH r29

; 10 MHz sample rate  will generate 10 samples per us.  50  us will require 500
; 1-bit samples.  512 samples would fit in  64 bytes.  Although the m168 has 1K
; of SRAM, it is mapped starting at 0x100.  In order to keep the buffer aligned
; to a  power-of-two equal  to its size,  it cannot be  larger than  512 bytes.
; That would permit a 4096 sample delay line.  At 10 MHz, that's 409.6 us.
.section  .bss
.balign   BUF_SZ_BYTES
.comm    dl, BUF_SZ_BYTES

; Determine  address  linker  will  select  for  buffer  (used  for  conditional
; compilation below).   Can't think  of a  way to extract  this from  the .comm
; declaration of dl above.   I expect it can't be done.   Rather, would need to
; specify a custom section and use --section-start= when building.  Meh.
#if BUF_SZ_BYTES > RAMSTART
  #define DL BUF_SZ_BYTES
#else
  #define DL RAMSTART
#endif

.section .text

.global __vector_default
        rjmp    reset
.global __vector_default

reset:

.global __do_clear_bss

.global main

main:

; configure SPI for F_OSC/2 = 10 MHz
        eor     r1,     r1
        sts     UBRR0H, r1
        sts     UBRR0L, r1
        sbi     DDRD,   4
        ldi     r16,    (1<<UMSEL01)|(1<<UMSEL00)
        sts     UCSR0C, r16
        ldi     r16,    (1<<RXEN0)|(1<<TXEN0)
        sts     UCSR0B, r16
        sts     UBRR0H, r1
        sts     UBRR0L, r1

; Since  it's implemented  as a  circular  buffer of  bytes, the  delay can  be
; configured with a granularity of 8 bits, or 0.8 us.

; X is  used to point to  the head of  the delay line, where  incoming samples
; will be deposited
        ldi     XH,     hi8(dl+DELAY_BYTES)
        ldi     XL,     lo8(dl+DELAY_BYTES)
; Y is used to point to the tail of the delay line, where outgoing samples will
; be withdrawn
        ldi     YH,     hi8(dl)
        ldi     YL,     lo8(dl)

; Fill the MSPI TX buffer and wait for  the first RX byte to be ready, i.e. the
; first read must  occur at least 16  cycles after the first  write.  The third
; write must  occur no more  than 32  cycles after the  first write, or  the TX
; buffer will be empty, and there will be a gap.
        ld      r16,    Y+                        ;
        sts     UDR0,   r16                       ;         1st write
        ld      r16,    Y+                        ;                   2
        sts     UDR0,   r16                       ;         2nd write 2

; 3 cycles per pass, total 15 cycle wait.
        ldi     r16,    5                         ;
wait:
        dec     r16                               ;
        brne    wait                              ;                  15

; Run the delay line.  Loop must be exactly 16 cycles
loop:
        lds     r16,    UDR0                      ; 2       1st read  2 = 21
        st      X+,     r16                       ; 2                 2
        ld      r16,    Y+                        ; 2                 2
        sts     UDR0,   r16                       ; 2       3rd write 2 = 27
        andi    XL,     lo8(BUF_SZ_BYTES-1)       ; 1
        andi    XH,     hi8(BUF_SZ_BYTES-1)       ; 1
        andi    YL,     lo8(BUF_SZ_BYTES-1)       ; 1
        andi    YH,     hi8(BUF_SZ_BYTES-1)       ; 1
#if DL < 0x100
        ori     XL,     lo8(dl)                   ; 1
        ori     YL,     lo8(dl)                   ; 1
#else
        ori     XH,     hi8(dl)                   ; 1
        ori     YH,     hi8(dl)                   ; 1
#endif
        rjmp    loop                              ; 2
                                                  ; = 16

; Catch-all
__vector_default:
        reti

 

New .hex file for m168 attached.  Built for 20 MHz, and a 20 us delay.  Also, built without using -nostartfiles, so the full CRT is linked.  This clears the buffer to zero since it is in .bss.  It's not necessary, but ensures no spurious output at the beginning with the random contents of SRAM after a power-up.  I've built and tested it both with and without the CRT and it works either way.
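
The behaviour described here, zeros coming out until the first real samples reach the tail, can be checked with a small host-side model of the circular buffer. This is only a sketch in plain C under simplifying assumptions: one byte in and one byte out per step, a 64-byte buffer, and no modelling of the USART's own buffering, which adds some extra latency on the real chip. All names are made up for the example.

```c
#include <stdint.h>
#include <string.h>

#define BUF_SZ 64u                /* power of two, as in the listing */

typedef struct {
    uint8_t  buf[BUF_SZ];
    unsigned head;                /* where incoming bytes are stored  */
    unsigned tail;                /* where outgoing bytes are fetched */
} delay_line;

/* Zero the buffer (the job the CRT does for .bss) and place the head
   delay_bytes ahead of the tail. */
static void dl_init(delay_line *dl, unsigned delay_bytes) {
    memset(dl->buf, 0, sizeof dl->buf);
    dl->head = delay_bytes & (BUF_SZ - 1u);
    dl->tail = 0;
}

/* One pass of the loop: store the new byte, fetch the delayed one. */
static uint8_t dl_step(delay_line *dl, uint8_t in) {
    dl->buf[dl->head] = in;
    uint8_t out = dl->buf[dl->tail];
    dl->head = (dl->head + 1u) & (BUF_SZ - 1u);
    dl->tail = (dl->tail + 1u) & (BUF_SZ - 1u);
    return out;
}
```

With a 25-byte delay, the first 25 calls to dl_step() return 0 and every later call returns the byte that went in 25 steps earlier. Skipping the memset would instead emit whatever happened to be in the buffer for those first 25 steps.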

 

EDIT:  Whoops, had the wrong hex file attached.  It was built for a 1 ms delay.  Replaced with new file built for 20 us.

Attachment(s): 

Last Edited: Wed. Aug 3, 2016 - 07:29 PM

joeymorin wrote:
I was clearing the higher bit representing the base address of the buffer.

I read past that also.

 

I thought I was clever in my non-SPI version to stipulate the 256 byte buffer and then only increment the _L register.  But that loop is intended to be minimum cycles; this one exactly 16.
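
The 256-byte trick works because an 8-bit index wraps on its own. A rough illustration in plain C (host code with invented names, not the AVR routine from the earlier post), assuming the buffer sits on a 256-byte boundary:

```c
#include <stdint.h>

/* With a 256-byte buffer aligned to a 256-byte boundary, only the low
   byte of the pointer changes. Incrementing an 8-bit value wraps from
   255 back to 0 by itself, so no AND mask is needed. */
static uint8_t next_index(uint8_t low_byte) {
    return (uint8_t)(low_byte + 1u);
}

/* Full 16-bit address: fixed high byte (assumed page) plus wrapping low byte. */
static uint16_t address(uint8_t page, uint8_t low_byte) {
    return (uint16_t)(((unsigned)page << 8) | low_byte);
}
```

The cost is that the buffer size is fixed at 256 bytes, whereas the masked version accepts any power-of-two size at the price of the extra andi/ori instructions.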

 

This thread has been fun, in that it allows us to make the AVR do tricks.  "The AVR's UART doesn't work right" gets old after a while.

 

 

 

 

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Last Edited: Wed. Aug 3, 2016 - 07:40 PM

Luckily there were 2 clocks to spare ;) now there are none.
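
For anyone counting along, the budget can be tallied as below. This is just a host-side arithmetic check in C, with the per-instruction cycle counts taken from the comments in the corrected listing; the array and function names are invented.

```c
#include <stddef.h>

/* Cycle counts per instruction in the corrected loop, in order:
   lds, st, ld, sts, four andi, two ori, rjmp. */
static const unsigned loop_cycles[] = { 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2 };

/* One byte is 8 bits, and each bit takes 2 CPU cycles at F_CPU/2. */
static unsigned cycles_per_byte(void) {
    return 8u * 2u;
}

/* Sum of the loop's instruction cycles. */
static unsigned loop_total(void) {
    unsigned sum = 0;
    for (size_t i = 0; i < sizeof loop_cycles / sizeof loop_cycles[0]; i++)
        sum += loop_cycles[i];
    return sum;
}
```

The two ori instructions take the place of the 2-cycle padding jump in the earlier version, so the total still matches the 16 cycles available per byte.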

 

theusch wrote:

 

I thought I was clever in my non-SPI version to stipulate the 256 byte buffer and then only increment the _L register.  But that loop is intended to be minimum cycles; this one exactly 16.

 

This thread has been fun, in that it allows us to make the AVR do tricks.  "The AVR's UART doesn't work right" gets old after a while.

 

 

If there had been no slack cycles, it would have had to be done that way, but as it stands this is quite beautiful, right at the limit of the MCU :D

Last Edited: Wed. Aug 3, 2016 - 07:52 PM

Whew!

Saved by Joey.

 

I was tempted, very tempted actually, to grab my XmegaE and O'scope and tinker a bit, to see what kind of programmably variable digital signal delay I could come up with, perhaps reading a bit, or the analog comparator, +/- the DMA, to a variable sized buffer. 

 

Good thing Joey solved this as I really have other, (less fun), things to be working on at the moment.

 

I did Google around for some old fashioned Bit-Bucket-Brigade chips that have 1024 or 2048 FlipFlops and used to be used in audio reverb or telephone echo cancelation circuits, but I didn't find anything that stood out as immediately helpful with 20 - 50 uSec delays. 

 

JC

 

Edit Typo

Last Edited: Wed. Aug 3, 2016 - 08:53 PM

DocJC wrote:
+/- the DMA,

I'd also have to dig into it, but I think that port DMA would work well.  Probably [as I remember my reading] with sacrificing entire ports as in my polling example on AVR8.


joeymorin wrote:

Rookie mistake :(

 

When I was applying the mask to X and Y in the main loop, I was clearing the higher bit representing the base address of the buffer.  The result is that both X and Y were pointing at the GP register file.  I was clobbering state.  The machine was running amok!

 

New code below.  Tested, and 'works', although I haven't confirmed if the delay is correct.  But the input is in fact now duplicated on the output.

 

#ifndef F_CPU
  #define F_CPU 20000000
#endif

#define __SFR_OFFSET 0
#include <avr/io.h>

; Set to desired delay
#define DELAY_NS 20000

; Must be a power-of-two no greater than half of the available SRAM
#define BUF_SZ_BYTES (((RAMEND + 1) - RAMSTART) / 2)

; 2 <= DELAY_BYTES < BUF_SZ_BYTES
; Since two bytes go  out from the tail of the delay line  before any are added
; to the head, the delay line must be at least 2 bytes long, or 16 samples, for
; a minimum delay of 1.6 us.
#define BITS_PER_US (F_CPU / 2000000)
#define DELAY_BYTES ((DELAY_NS * BITS_PER_US) / 8000)
#if (DELAY_BYTES >= BUF_SZ_BYTES)
  #warning DELAY_NS is too long
  #undef DELAY_BYTES
  #define DELAY_BYTES (BUF_SZ_BYTES - 1)
#endif
#if (DELAY_BYTES < 2)
  #warning DELAY_NS is too short
  #undef DELAY_BYTES
  #define DELAY_BYTES 2
#endif

; Create a symbol reflecting the real delay.  Examine it with avr-objdump -t
; or similar.
.equ real_delay_ns, (DELAY_BYTES * 8000) / BITS_PER_US

; Index registers by name
#define XL r26
#define XH r27
#define YL r28
#define YH r29

; 10 MHz sample rate  will generate 10 samples per us.  50  us will require 500
; 1-bit samples.  512 samples would fit in  64 bytes.  Although the m168 has 1K
; of SRAM, it is mapped starting at 0x100.  In order to keep the buffer aligned
; to a  power-of-two equal  to its size,  it cannot be  larger than  512 bytes.
; That would permit a 4096 sample delay line.  At 10 MHz, that's 409.6 us.
.section  .bss
.balign   BUF_SZ_BYTES
.comm    dl, BUF_SZ_BYTES

; Determine  address  linker  will  select  for  buffer  (used  for conditional
; compilation below).   Can't think  of a  way to extract  this from  the .comm
; declaration of dl above.   I expect it can't be done.   Rather, would need to
; specify a custom section and use --section-start= when building.  Meh.
#if BUF_SZ_BYTES > RAMSTART
  #define DL BUF_SZ_BYTES
#else
  #define DL RAMSTART
#endif

.section .text

.global __vector_default
        rjmp    reset
.global __vector_default

reset:

.global __do_clear_bss

.global main

main:

; configure USART0 in master SPI mode (MSPIM) for F_OSC/2 = 10 MHz
        eor     r1,     r1
        sts     UBRR0H, r1
        sts     UBRR0L, r1
        sbi     DDRD,   4
        ldi     r16,    (1<<UMSEL01)|(1<<UMSEL00)
        sts     UCSR0C, r16
        ldi     r16,    (1<<RXEN0)|(1<<TXEN0)
        sts     UCSR0B, r16
        sts     UBRR0H, r1
        sts     UBRR0L, r1

; Since  it's implemented  as a  circular  buffer of  bytes, the  delay can  be
; configured with a granularity of 8 bits, or 0.8 us.

; X is  used to point to  the head of  the delay line, where  incoming  samples
; will be deposited
        ldi     XH,     hi8(dl+DELAY_BYTES)
        ldi     XL,     lo8(dl+DELAY_BYTES)
; Y is used to point to the tail of the delay line, where outgoing samples will
; be withdrawn
        ldi     YH,     hi8(dl)
        ldi     YL,     lo8(dl)

; Fill the MSPI TX buffer and wait for  the first RX byte to be ready, i.e. the
; first read must  occur at least 16  cycles after the first  write.  The third
; write must  occur no more  than 32  cycles after the  first write, or  the TX
; buffer will be empty, and there will be a gap.
        ld      r16,    Y+                        ;
        sts     UDR0,   r16                       ;         1st write
        ld      r16,    Y+                        ;                   2
        sts     UDR0,   r16                       ;         2nd write 2

; 3 cycles per pass, total 15 cycle wait.
        ldi     r16,    5                         ;
wait:
        dec     r16                               ;
        brne    wait                              ;                  15

; Run the delay line.  Loop must be exactly 16 cycles
loop:
        lds     r16,    UDR0                      ; 2       1st read  2 = 21
        st      X+,     r16                       ; 2                 2
        ld      r16,    Y+                        ; 2                 2
        sts     UDR0,   r16                       ; 2       3rd write 2 = 27
        andi    XL,     lo8(BUF_SZ_BYTES-1)       ; 1
        andi    XH,     hi8(BUF_SZ_BYTES-1)       ; 1
        andi    YL,     lo8(BUF_SZ_BYTES-1)       ; 1
        andi    YH,     hi8(BUF_SZ_BYTES-1)       ; 1
#if DL < 0x100
        ori     XL,     lo8(dl)                   ; 1
        ori     YL,     lo8(dl)                   ; 1
#else
        ori     XH,     hi8(dl)                   ; 1
        ori     YH,     hi8(dl)                   ; 1
#endif
        rjmp    loop                              ; 2
                                                  ; = 16

; Catch-all
__vector_default:
        reti
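
For anyone who wants to check the delay arithmetic from the preprocessor section of the listing above without building it, here is a rough host-side sketch in C. The function names are made up for illustration; the formulas and the clamping follow the `BITS_PER_US`, `DELAY_BYTES` and `real_delay_ns` definitions in the listing.

```c
#include <stdint.h>

/* Samples (bits) clocked per microsecond: the MSPI runs at F_CPU/2. */
static uint32_t bits_per_us(uint32_t f_cpu_hz)
{
    return f_cpu_hz / 2000000u;
}

/* Delay line length in bytes, clamped to 2 <= n <= buf_sz - 1
 * in the same way as the #if blocks in the listing. */
static uint32_t delay_bytes(uint32_t delay_ns, uint32_t f_cpu_hz, uint32_t buf_sz)
{
    uint32_t n = (delay_ns * bits_per_us(f_cpu_hz)) / 8000u;
    if (n >= buf_sz) n = buf_sz - 1u;
    if (n < 2u)      n = 2u;
    return n;
}

/* Delay actually achieved once the length is rounded to whole bytes. */
static uint32_t real_delay_ns(uint32_t n_bytes, uint32_t f_cpu_hz)
{
    return (n_bytes * 8000u) / bits_per_us(f_cpu_hz);
}
```

With F_CPU = 20 MHz, a 512-byte buffer and DELAY_NS = 20000, this gives 25 bytes and a real delay of 20000 ns. A request of 1000 ns is clamped to 2 bytes, i.e. 1600 ns.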

 

New .hex file for m168 attached.  Built for 20 MHz, and a 20 us delay.  Also, built without using -nostartfiles, so the full CRT is linked.  This clears the buffer to zero since it is in .bss.  It's not necessary, but ensures no spurious output at the beginning with the random contents of SRAM after a power-up.  I've built and tested it both with and without the CRT and it works either way.

 

EDIT:  Whoops, had the wrong hex file attached.  It was built for a 1 ms delay.  Replaced with new file built for 20 us.

 

I must say thanks to everyone for all the (what looks to me like considerable) effort.

 

I should get a chance to try this tomorrow.

 

I'll rev up the ol' frequency generator to see how it fares.

 

 


theusch wrote:

"The AVR's UART doesn't work right" gets old after a while.

You're not kidding.  I feel as though my patience and tolerance have gone down rapidly over the last year ;-)  ... I'm always impressed when the likes of you or Cliff keep at it patiently.  I'm trying.  At least, that's what my wife says.

 

El Tangas wrote:

Luckily there were 2 clocks to spare ;-) now there is

If there were no slack cycles, it could still have to be made like that, but like this is quite beautiful, at the limit of the MCU :-D

Well actually, I was being lazy, since I had 16 cycles to play with.   This is a little uglier, but it saves two cycles for future use, and still handles most any device.

; Fill the MSPI TX buffer and wait for  the first RX byte to be ready, i.e. the
; first read must  occur at least 16  cycles after the first  write.  The third
; write must  occur no more  than 32  cycles after the  first write, or  the TX
; buffer will be empty, and there will be a gap.
        ld      r16,    Y+                        ;
        sts     UDR0,   r16                       ;         1st write
        ld      r16,    Y+                        ;                   2
        sts     UDR0,   r16                       ;         2nd write 2

; 3 cycles per pass, total 15 cycle wait.
        ldi     r16,    5                         ;
wait:
        dec     r16                               ;
        brne    wait                              ;                  15

; Run the delay line.  Loop must be exactly 16 cycles
loop:
        lds     r16,    UDR0                      ; 2       1st read  2 = 21
        st      X+,     r16                       ; 2                 2
        ld      r16,    Y+                        ; 2                 2
        sts     UDR0,   r16                       ; 2       3rd write 2 = 27
#if BUF_SZ_BYTES <= 0x100
        andi    XL,     lo8(BUF_SZ_BYTES-1)       ; 1
        andi    YL,     lo8(BUF_SZ_BYTES-1)       ; 1
#else
        andi    XH,     hi8(BUF_SZ_BYTES-1)       ;     1
        andi    YH,     hi8(BUF_SZ_BYTES-1)       ;     1
#endif
#if DL < 0x100
        ori     XL,     lo8(dl)                   ; 1
        ori     YL,     lo8(dl)                   ; 1
#else
        ori     XH,     hi8(dl)                   ;     1
        ori     YH,     hi8(dl)                   ;     1
#endif
        rjmp    .+0                               ; 2
        rjmp    loop                              ; 2
                                                  ; = 16

I hope it doesn't offend your sense of limitless beauty ;-)

 

I've tidied up a few silly things, but I'm not going to post yet another copy of this thing unless someone asks ;-)

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 


joeymorin wrote:

 

I hope it doesn't offend your sense of limitless beauty ;-)
 

 

Well, I guess I can live with that, lol.

 

After all, we could save 2 more cycles if BUF_SZ_BYTES is set to 0x100 by incrementing only XL and YL like theusch did in his code. With 4 cycles we could do all kinds of stuff :-P

Better store this code snippet for future use.


Yes, but then the maximum delay would be 256*8*0.1 = 204.8 us. Good enough for the OP, but by using 1 kB on, say, a 328P, that goes up to 819.2 us ;-)
If I ever think of something useful to do with those 4 cycles, I can always change it. I suppose we could toggle a pin, or increment a whole port, so that the MSB would signal at F_CPU/4096. Is that useful? ;-)

 

Last Edited: Thu. Aug 4, 2016 - 12:03 AM

joeymorin wrote:
Yes, but then the maximum delay would be 256*8*0.1 = 204.8 us. Good enough for the OP, but by using 1 kB on, say, a 328P, that goes up to 819.2 us ;-) If I ever think of something useful to do with those 4 cycles, I can always change it. I suppose we could toggle a pin, or increment a whole port, so that the MSB would signal at F_CPU/4096. Is that useful? ;-)

 

Well, please don't get bored with it on my account :)

 

an LED that says "I am processing input" (as opposed to just waiting for input). You know how people love a flashing LED. But only if it's blue.

 

Or a way to attach an analog pot to adjust delay...

 

Or a couple pins to divide down FAST input (like 5Megs+)... and mul it back up again after processing

 

I'm not really taking the p...

 

 


Well, please don't get bored with it on my account :)

On the contrary, this was an interesting nut to crack :)

 

 

an LED that says "I am processing input" (as opposed to just waiting for input). You know how people love a flashing LED. But only if it's blue.

Hmmm.  We can just about do it:

; once for init
        eor     r30,    r30
        mov     r31,    r30
        sbi     DDRD,   7
; in the loop
        adiw    r30,    1                         ; 2
        sbrc    r31,    7                         ; 1/2
        out     PIND,   r31                       ; 1

The second part takes 4 cycles total, regardless.  If this takes our four extra cycles in the loop, the effect is that Z is incremented every 16 cpu cycles.  Z will overflow every 65,536*16 = 1,048,576 cycles.  Bit 7 of r31 will be low for 524,288 cycles, and then high for 524,288 cycles.  When it is low, the out is skipped, and PORTD remains unchanged.  When it is high, the out is executed, and PD7 is toggled.  The first toggle will turn PD7 high, the next will turn it low, etc.  It will be toggled at F_CPU/16, for a frequency of F_CPU/32 = 625 kHz.  Since bit 7 of r31 will be high an even number of times, the last toggle before Z overflows (and bit 7 of r31 is again 0) will leave PD7 low.

 

The overall effect is that an LED across PD7 and GND will appear to flash at a frequency of F_CPU / (16 * 65536) = 19.07 Hz, since one off half plus one 'on' half lasts exactly one overflow period of Z.  That's pretty fast, but it's slow enough to perceive.  The off half of the flash will indeed be off, while the 'on' half will actually have the LED pulsing a square wave at 625 kHz.
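
To double-check the blink behaviour without a scope, the counter can be stepped through on a PC. This is a rough model, not AVR code: the function names are invented, only PD7 is tracked, and one loop pass is assumed to cost exactly 16 cycles. One flash period is counted as one off half plus one pulsing half.

```c
#include <stdint.h>

#define CYCLES_PER_PASS 16u

/* Run the Z counter through one full wrap (65,536 passes) and return how
 * many times PD7 toggles.  *pd7 is the modelled pin state on entry/exit. */
static uint32_t simulate_one_wrap(uint8_t *pd7)
{
    uint16_t z = 0;
    uint32_t toggles = 0;
    for (uint32_t pass = 0; pass < 65536u; pass++) {
        z++;                        /* adiw r30, 1                */
        if (z & 0x8000u) {          /* sbrc r31, 7                */
            *pd7 ^= 1u;             /* out PIND, r31 toggles PD7  */
            toggles++;
        }
    }
    return toggles;
}

/* One off half plus one pulsing half lasts one full wrap of Z. */
static double flash_hz(double f_cpu_hz)
{
    return f_cpu_hz / (CYCLES_PER_PASS * 65536.0);
}

/* While pulsing, PD7 toggles once per pass, so a period is two passes. */
static double pulse_hz(double f_cpu_hz)
{
    return f_cpu_hz / (2.0 * CYCLES_PER_PASS);
}
```

In this model one wrap produces 32,768 toggles, an even number, so the pin ends in the state it started in. At 20 MHz the pulsing half runs at 625 kHz.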

 

Is that useful? ;-)

 

Note that PD2, PD3, PD5 and PD6 (and PD4 if you don't enable XCK) will also toggle, but since they aren't outputs, only the pull-ups will be enabled/disabled.

 

As has been noted, there is always XCK (PD4), which outputs a 10 MHz square wave while the loop is running.  As @El Tangas has mentioned, it isn't necessary for MSPI to function in the manner we are using it here, so it need not be made an output.

 

Or a way to attach an analog pot to adjust delay...

That wouldn't need any cycles from the loop.  It can be done much as I suggested in #29 for setting a new delay over SPI/TWI/whatever.  To set a new delay with a POT, you'd also need a switch or button to 'latch' the new value.  The switch would trigger an interrupt to read the POT, compute new offsets for X and Y, then restart the delay line.  For that matter, the button could just be on /RESET, with the POT position read only once on startup.  To set a new delay, change the POT, and just reset the AVR.

 

You could constantly poll the POT for a new value, but that would continually break the timing of the main loop.

 

If this basic solution works for you, I might take a crack at adding the variable delay.  I've got some unanticipated projects coming up, so it may not happen fast.

 

Or a couple pins to divide down FAST input (like 5Megs+)... and mul it back up again after processing

Are you proposing a kind of digital heterodyne?

I'm not really taking the p...

Go on.  Of course you are ;-)

 

Last Edited: Thu. Aug 4, 2016 - 03:28 AM

Hey, it just occurred to me that we can unroll the loop and get more or less all the extra cycles we need to do anything :-)

 

Using the 256 byte buffer technique, for example, unrolling by 8, the first 7 unrolled cycles could use auto address increment instructions, while only the last needs manual increment. This saves even more cycles and leaves the flags alone, so we can interleave any code we want and only need to be careful with the timings. We would have nearly 50% CPU time available.

 

I think it would actually be possible to control the delay with a pot without even interrupting the program flow.

 

Naturally, I leave any actual coding as an exercise to the interested reader ;-) I wouldn't be able to test it anyway, I really need to buy an oscilloscope.

Last Edited: Thu. Aug 4, 2016 - 10:39 AM

The trouble there is that there are >>two<< pointers.  One for the head of the circular buffer, and one for the tail.  They won't wrap at the same time, so it's not easy to predetermine where to unroll the loop.  It would be harder still if the delay were variable.

 

However, just by moving to a 256 byte buffer, we can avoid auto-increment altogether, as Lee has done with his software-only solution, and we save a few cycles permanently:

; Run the delay line.  Loop must be exactly 16 cycles
loop:
        lds     r16,    UDR0                      ; 2
        st      X,      r16                       ; 2
        ld      r16,    Y                         ; 2
        sts     UDR0,   r16                       ; 2
        inc     XL                                ; 1
        inc     YL                                ; 1
        ; 4 cycles available
        rjmp    loop                              ; 2

This is a loop we >>can<< unroll indefinitely and with impunity, giving us 6 free cycles instead of 8.  Mind you, there is the minor wrinkle that the inc instructions will affect the S, V, N, and Z flags, but careful coding could manage that restriction.  For example, since the C flag is unaffected by inc, it could be used for inquiry whenever possible.
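
For those who don't read AVR assembly, here is a rough C model of what the 256-byte version of the loop does. It runs on a PC, not on the AVR: `delay_line`, `dl_init` and `dl_step` are names invented for this sketch, and it ignores the USART's own buffering, which adds a further fixed delay on real hardware.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t buf[256];
    uint8_t head;   /* where incoming bytes are stored (X in the asm)  */
    uint8_t tail;   /* where outgoing bytes are fetched (Y in the asm) */
} delay_line;

/* Clear the buffer and place the head delay_bytes ahead of the tail. */
static void dl_init(delay_line *dl, uint8_t delay_bytes)
{
    memset(dl->buf, 0, sizeof dl->buf);
    dl->tail = 0;
    dl->head = delay_bytes;
}

/* One pass of the loop: store the received byte at the head, then fetch
 * the byte to transmit from the tail.  The uint8_t indices wrap from 255
 * to 0 by themselves, like inc XL / inc YL. */
static uint8_t dl_step(delay_line *dl, uint8_t rx)
{
    dl->buf[dl->head++] = rx;
    return dl->buf[dl->tail++];
}
```

After `dl_init(&dl, 25)`, the first 25 calls to `dl_step` return 0 and every later call returns the byte that was passed in 25 calls earlier.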

 

If we unrolled to the end of flash, an m168 would have up to 2K words of flash available in 6-cycle chunks to do other work, with the remaining 6K words consumed by the unrolled loop. (!!)

 

Now, since Y is preloaded with the address of the start of the circular buffer (while X is loaded with an offset representing the delay), we could use incremental constant offsets in the indexing for each pass through the unrolled loop:

; Run the delay line.  Loop must be exactly 16 cycles
loop:
        lds     r16,    UDR0                      ; 2
        st      X,      r16                       ; 2
        ld      r16,    Y                         ; 2
        sts     UDR0,   r16                       ; 2
        ; 7 cycles available

        inc     XL                                ; 1
        lds     r16,    UDR0                      ; 2
        st      X,      r16                       ; 2
        ldd     r16,    Y+1                       ; 2
        sts     UDR0,   r16                       ; 2
        ; 7 cycles available
.
.
.
        inc     XL                                ; 1
        lds     r16,    UDR0                      ; 2
        st      X,      r16                       ; 2
        ldd     r16,    Y+62                      ; 2
        sts     UDR0,   r16                       ; 2
        ; 7 cycles available

        inc     XL                                ; 1
        lds     r16,    UDR0                      ; 2
        st      X,      r16                       ; 2
        ldd     r16,    Y+63                      ; 2
        sts     UDR0,   r16                       ; 2
        ; 4 cycles available

        inc     XL                                ; 1
        subi    YL,     -64                       ; 1
        rjmp    loop                              ; 2

This buys another cycle per pass, and we can unroll by 64.  In this way, we'd have 63*7+4 = 445 cycles to play with (We can unroll by more, in chunks of 64; since only the last chunk needs the rjmp, that offers 447*chunks-2 cycles to play with).  However, it doesn't prevent the condition flags from being modified, since X is still incremented every pass.  Again, since X is offset from Y and not 'synchronised' with the unrolled loop, we can't use the same technique for X.  Some of those cycles would go to preserving flags across the unrolled passes to avoid the impact of the inc.
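
The cycle bookkeeping for chained chunks is easy to get wrong, so here is a small host-side C sketch of it. The function names are made up, and it assumes that when several by-64 chunks are laid end to end, only the final one keeps its 2-cycle rjmp.

```c
#include <stdint.h>

/* Free cycles in one by-64 unrolled loop that ends in rjmp:
 * 63 passes with 7 spare cycles each, plus a last pass with 4. */
static uint32_t free_cycles_one_chunk(void)
{
    return 63u * 7u + 4u;
}

/* Free cycles when `chunks` by-64 blocks are laid end to end.
 * Only the final block needs the rjmp; every earlier block gets
 * those 2 cycles back. */
static uint32_t free_cycles_chained(uint32_t chunks)
{
    if (chunks == 0u) return 0u;
    return chunks * free_cycles_one_chunk() + 2u * (chunks - 1u);
}
```

One chunk gives 445 free cycles, two give 892, and nine give 4,021.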

 

Then again, if we constrained the granularity with which the delay could be set, we could take advantage of the same offset technique for the head pointer.  Since X doesn't support index operations with an offset, we'd have to switch to using Z for the head, but no matter.  Then we'd have 8 cycles to play with for every pass, with no need to preserve flags across passes.  For example:

 

; Run the delay line.  Loop must be exactly 16 cycles
loop:
        lds     r16,    UDR0                      ; 2
        st      Z,      r16                       ; 2
        ld      r16,    Y                         ; 2
        sts     UDR0,   r16                       ; 2
        ; 8 cycles available

        lds     r16,    UDR0                      ; 2
        std     Z+1,    r16                       ; 2
        ldd     r16,    Y+1                       ; 2
        sts     UDR0,   r16                       ; 2
        ; 8 cycles available
.
.
.
        lds     r16,    UDR0                      ; 2
        std     Z+62,   r16                       ; 2
        ldd     r16,    Y+62                      ; 2
        sts     UDR0,   r16                       ; 2
        ; 8 cycles available

        lds     r16,    UDR0                      ; 2
        std     Z+63,   r16                       ; 2
        ldd     r16,    Y+63                      ; 2
        sts     UDR0,   r16                       ; 2
        ; 4 cycles available

        subi    ZL,     -64                       ; 1
        subi    YL,     -64                       ; 1
        rjmp    loop                              ; 2

This would give us 63*8+4 = 508 cycles, and no need to preserve flags.  However, it would mean that the delay could only be set to multiples of 64 bytes, or 512 bits.  At 10 MHz sampling, that's a granularity of 51.2 us.  Not especially useful to the OP.

 

We could reduce the granularity from 64 bytes to 32 bytes, or 25.6 us, leaving us 31*8+4 = 252 cycles to play with, and no need to preserve flags.  So it's a trade-off between available cycles and granularity of delay.  As is, the fully rolled-up loop has the best possible granularity at 0.8 us, which is still greater than the 0.5 us the OP had specified.
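
The trade-off can be tabulated with a few lines of C on a PC. This is just a sketch with invented function names; it assumes the 10 MHz sample rate used throughout (100 ns per bit), the variant where both pointers use constant offsets, and an unroll factor of at least 1.

```c
#include <stdint.h>

/* Delay granularity in nanoseconds when both pointers advance only once
 * per `unroll` bytes: unroll bytes * 8 bits * 100 ns per bit. */
static uint32_t granularity_ns(uint32_t unroll)
{
    return unroll * 8u * 100u;
}

/* Spare cycles per trip through the unrolled loop: 8 per pass, except the
 * last pass, which spends 4 of them on two subi and an rjmp. */
static uint32_t spare_cycles(uint32_t unroll)
{
    return (unroll - 1u) * 8u + 4u;
}
```

For an unroll factor of 64 this gives 51,200 ns and 508 cycles; for 32 it gives 25,600 ns and 252 cycles. An unroll factor of 1 corresponds to the rolled-up loop: 800 ns granularity and 4 spare cycles.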

 

Personally I don't see an advantage to trading granularity for cpu cycles.  I would stick with the by-64 unrolled loop and the restriction on preserving flags across passes within the unrolled loop.  445 cycles should be enough to, as you suggested, handle the POT directly in the loop, even with the restriction on flags.  I'd still want an input to 'latch' a new delay value from the POT, since even the 445 cycles might not be enough to do proper filtering on the ADC input, so the >>length<< of the delay line would fluctuate with any noise on the ADC input.

 

Note that for all of these unrolling options, there is a bit of wiggle room w.r.t. the cycles used between passes.  That is, while there are nominally 7 cycles between passes in the first example, since the MSPI is buffered, we could use, say, 11 in one gap, and 3 in the next gap.  So long as we don't allow the buffers to empty or overflow, all will be well.  Careful cycle counting will be required, and the whole unrolled loop must be cycle-accurate overall.

 

Naturally, I leave any actual coding as an exercise to the interested reader ;-)

Well, @El Tangas, I kind of hate you now ... This is going to tend to draw my attention away from a multitude of projects and responsibilities ;-)

 

Last Edited: Thu. Aug 4, 2016 - 02:59 PM

joeymorin wrote:
If we unrolled to the end of flash, an m168 would have up to 2K words of flash available in 6-cycle chunks to do other work, with the remaining 6K words consumed by the unrolled loop. (!!)

Depending on which loop you use it is about 10 words or so.  Or maybe 20, without counting.  256 full unroll x 20 is 5000 words, 10000 bytes.

 

(so we've gone from less-than-perfect C solutions; to my brute-force with 256 entries and full-port manipulation; to USART-as-SPI to get a bit-level manipulation and only 2/3 pins; to freeing up more cycles to e.g. read setpoint by unrolling)

 

 

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


Lol, what a thorough analysis. I'd say this one is more than solved now, leave something for the OP to do and get back to work :-P


My proposed loop is 5 words, 10 bytes, 64 long, 2 extra words at the bottom.  644 bytes.  Plus as many as 7 more words per pass, x64 = 448 more bytes.  1,092 bytes per by-64 unrolled loop.  Multiple loops can be serialised up to the limit of flash.  15 by-64 unrolled loops would be as long as 16,326 bytes (if all the extra cycles are 1-to-1 with words), leaving 58 bytes (29 words) for init (on an m168).  The up side is that there are about 6,660 cycles which can be used in this massive collection of unrolled loops for other tasks.

 

I'm dizzy.

 


joeymorin wrote:
My proposed loop is 5 words, 10 bytes, 64 long, 2 extra words at the bottom.

I count 6 words in the base loop (LDS, STS, LD, ST).  And you need words to use the other 8 cycles, right?  So more like 10...  [probably missed something...]



FEEDBACK FROM OP:

 

Thanks, Assembler dudes!

 

Tried it today with a 1.9 MHz signal, and it does exactly what's required.

 

Speculatively tried it on a 3.6 MHz signal, also good. Mr. Nyquist, I think, forbids my trying it on 7.1 MHz, but it's no problem to divide it down.

 

As far as developing the ideas further, I might need to read a book or two first, since I don't speak that lingo.

 

Bugger, they always told me C was close to the metal!

 


nobba wrote:
Bugger, they always told me C was close to the metal!

[don't let sparrow2 see this]  Well, it can be.  But one needs to be pretty familiar with the code generation model of the particular C toolchain.

 

And as I think I mentioned earlier

I've often posted in "you can't do this in C" threads; [almost?] always on the C side.

 

But C has no concept of buffer pointer wrapping.

It is indeed hard to get to min machine cycles in C where a better algorithm/code sequence with a particular micro's instruction set uses concepts not part of the C language.  The mentioned wrapping; the concept of Carry; unsupported operand widths; and others.

 

You've put a sow's ear to the task of making a silk purse.  So to get best results the proposed solutions reserve two of three AVR pointer registers exclusively to the task.  GCC is pretty good about that in "memcpy" type loops so we might have gotten that far.  But the wrapping to mod 256 or whatever isn't part of C.  Thus the ASM-only (or implemented as inline fragments) solutions above will give you better performance in this case than C only. 

 

Now, I don't think anyone has explored the SPI version in C, which is a bit more forgiving (with "extra" cycles) than the straight polling approaches.

 

[If one starts with a buffer of exactly 256 bytes on a mod 256 boundary, then the pointer wrapping could be:

 

-- if starting address is 0x100, then wrap by assigning 0x01 to the _H register.  LDI one cycle one word.

--  same as above for "odd" pages 0x300 0x500 ...

-- if starting address is 0x200 or other "even" pages, then wrap by clearing the low bit of _H register.  CBR one cycle one word

 

Hmmm--odd or even one could use method 1 I guess.]

 

[[ Wouldn't that come out the same as ST X, rrr INC _L and use  ST X+,rrr LDI XH, HIGH(buffer) ?  ]]

 

[edit]  As reloading _H takes same words/cycles as INC _L, the latter is probably better because buffer on mod 256 isn't necessarily needed (although the width needs to be 256).  That would be at the expense of a little more one-time work at setup to take care of wrap of the input pointer offset from output pointer.


Last Edited: Thu. Aug 4, 2016 - 07:27 PM

theusch wrote:

I count 6 words in the base loop (LDS, STS, LD, ST).    And you need words to use the other 8 cycles, right?  So more like 10...  [probably missed something...]

Whoops.  Actually, 7 words:

00000040 <loop>:
  40:	00 91 c6 00 	lds	r16, 0x00C6                     ; 2
  44:	0c 93       	st	X, r16                          ; 2
  46:	08 81       	ld	r16, Y                          ; 2
  48:	00 93 c6 00 	sts	0x00C6, r16                     ; 2
  4c:	a3 95       	inc	r26                             ; 1
                                                     
  4e:	00 91 c6 00 	lds	r16, 0x00C6                     ; 2
  52:	0c 93       	st	X, r16                          ; 2
  54:	09 81       	ldd	r16, Y+1	; 0x01          ; 2
  56:	00 93 c6 00 	sts	0x00C6, r16                     ; 2
  5a:	a3 95       	inc	r26                             ; 1 
.
.
.
  5c:	00 91 c6 00 	lds	r16, 0x00C6                     ; 2
  60:	0c 93       	st	X, r16                          ; 2
  62:	0e ad       	ldd	r16, Y+62	; 0x3e          ; 2
  64:	00 93 c6 00 	sts	0x00C6, r16                     ; 2
  68:	a3 95       	inc	r26                             ; 1

  6a:	00 91 c6 00 	lds	r16, 0x00C6                     ; 2
  6e:	0c 93       	st	X, r16                          ; 2
  70:	0f ad       	ldd	r16, Y+63	; 0x3f          ; 2
  72:	00 93 c6 00 	sts	0x00C6, r16                     ; 2
  76:	a3 95       	inc	r26                             ; 1
  78:	c0 5c       	subi	r28, 0xC0	; 192           ; 1
  7a:	e2 cf       	rjmp	.-60     	; 0x40 <loop>   ; 2

... because I still use the inc for X.

 

So 9/18 words/bytes for the last bit, and 7/14 words/bytes for the rest.  4 cycles free in the last, 7 cycles free in the rest.  Assuming worst case 1:1 ratio of words to cycles for the interleaved code, that's 13/26 words/bytes total for the last, and 14/28 words/bytes for the rest.  With a full boat of 64, that's 28*63+26 = 1,790 bytes.

 

If we serialise several, only the last one needs the rjmp.  The rest would have an extra 2 cycles to play with, which could take up to 2/4 words/bytes, so 1,792 bytes.  An m168 could hold 9 of these by-64 unrolled loops, totalling a worst case of 16,126 bytes, leaving 258/129 bytes/words for init.  In those 16,126 bytes, there would be a total of 4,021 cpu cycles to play with, with as many as 4,021/8,042 words/bytes with which to do so.
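
The flash budget above can be reproduced with a short host-side C sketch. The function names are invented, and it uses the same worst-case assumption as the text: every spare cycle is filled with a one-word, one-cycle instruction.

```c
#include <stdint.h>

#define FLASH_BYTES 16384u   /* ATmega168 */

/* Worst-case size of one by-64 block.
 * Passes 1..63:         7 words of loop + 7 words of filler = 28 bytes each.
 * Pass 64 with rjmp:    9 words of loop + 4 words of filler = 26 bytes.
 * Pass 64 without rjmp: 8 words of loop + 6 words of filler = 28 bytes. */
static uint32_t block_bytes(int has_rjmp)
{
    return 63u * 28u + (has_rjmp ? 26u : 28u);
}

/* `blocks` chained blocks; only the last one jumps back to the top. */
static uint32_t chained_bytes(uint32_t blocks)
{
    if (blocks == 0u) return 0u;
    return (blocks - 1u) * block_bytes(0) + block_bytes(1);
}

/* Flash left over for init code; only meaningful if the blocks fit. */
static uint32_t init_bytes_left(uint32_t blocks)
{
    return FLASH_BYTES - chained_bytes(blocks);
}
```

Nine blocks come to 16,126 bytes with 258 bytes left for init. A tenth block would need 17,918 bytes, which no longer fits in 16 KB.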

 

I'm still dizzy.

 

The OP doesn't need the AVR to do anything else, except perhaps runtime adjustment of the delay.  That doesn't require any loop unrolling at all, as it can be handled by an interrupt, or even just a device reset.

 

nobba wrote:

Mr. Nyquist, I think, forbids my trying it on 7.1 MHz, but it's no problem to divide it down.

Overclock to 30 MHz. 

 

Glad it's working out though.

 

/*
connect pin XCK to SCK
use MOSI for input, TXD for output

connect SS to another port pin
*/

#include <stdint.h>
#include <avr/io.h>

uint8_t buffer[0x100];
uint8_t jin, jout;

int main(void) {
    // set up left as an exercise for the reader

    while(1) {
        if(UCSR0A & _BV(UDRE0)) {       // USART data register empty?
            UDR0=buffer[jout++];        // send the oldest byte; jout wraps at 256
            //  UCSR0A=_BV(UDRE0);  useless, possibly counterproductive
        }
        if(SPSR & _BV(SPIF)) {          // SPI slave has received a byte?
            buffer[jin++]=SPDR;         // store it; jin wraps at 256
            // SPIF cleared automatically
        }
    }
}

 

Iluvatar is the better part of Valar.

Last Edited: Tue. Aug 9, 2016 - 04:48 AM

Neat.

 

Would preclude the use of SPI to receive new delay values from a master, but still neat.

 

But this:

        USART0A=_BV(UDRE0);

... is unnecessary isn't it?  UDREn is self-clearing.  Writing it to '1' has no effect.

 

And I assume you meant UCSR0A rather than USART0A?

 

It's going to need some coaxing to hit 16 cycles average.  Best results seem to be with -O1:

  while(1) {
    if(UCSR0A & (1<<UDRE0)) {
  98:   80 81           ld      r24, Z                                  ; 2
  9a:   85 ff           sbrs    r24, 5                                  ; 1/2
  9c:   0b c0           rjmp    .+22            ; 0xb4 <main+0x24>      ; 2
      UDR0 = buffer[jout++];
  9e:   a0 91 00 01     lds     r26, 0x0100                             ; 2
  a2:   81 e0           ldi     r24, 0x01       ; 1                     ; 1
  a4:   8a 0f           add     r24, r26                                ; 1
  a6:   80 93 00 01     sts     0x0100, r24                             ; 2
  aa:   b0 e0           ldi     r27, 0x00       ; 0                     ; 1
  ac:   af 5f           subi    r26, 0xFF       ; 255                   ; 1
  ae:   be 4f           sbci    r27, 0xFE       ; 254                   ; 1
  b0:   8c 91           ld      r24, X                                  ; 2
  b2:   88 83           st      Y, r24                                  ; 2
    }
    if(SPSR & (1<<SPIF)) {
  b4:   0d b4           in      r0, 0x2d        ; 45                    ; 1
  b6:   07 fe           sbrs    r0, 7                                   ; 1/2
  b8:   ef cf           rjmp    .-34            ; 0x98 <main+0x8>       ; 2
      buffer[jin++] = SPDR;
  ba:   a0 91 01 02     lds     r26, 0x0201                             ; 2
  be:   81 e0           ldi     r24, 0x01       ; 1                     ; 1
  c0:   8a 0f           add     r24, r26                                ; 1
  c2:   80 93 01 02     sts     0x0201, r24                             ; 2
  c6:   8e b5           in      r24, 0x2e       ; 46                    ; 1
  c8:   b0 e0           ldi     r27, 0x00       ; 0                     ; 1
  ca:   af 5f           subi    r26, 0xFF       ; 255                   ; 1
  cc:   be 4f           sbci    r27, 0xFE       ; 254                   ; 1
  ce:   8c 93           st      X, r24                                  ; 2
  d0:   e3 cf           rjmp    .-58            ; 0x98 <main+0x8>       ; 2

The no-action case takes 9 cycles.  The UDR0 only case takes 21 cycles.  The SPDR only case takes 22 cycles.  The UDR0+SPDR case takes 34 cycles.
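To put those path lengths next to the 16-cycle target, here is a minimal host-side sketch. It assumes the serial bit clock runs at F_CPU/2, i.e. 2 CPU cycles per bit, so a new byte is due every 16 cycles. The function names are made up for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* CPU cycles available per byte when one bit takes `cycles_per_bit`
 * CPU cycles (2 is assumed here, i.e. a bit clock of F_CPU/2). */
static uint32_t cycles_per_byte(uint32_t cycles_per_bit)
{
    return 8u * cycles_per_bit;
}

/* True when a loop path is short enough to be run once per byte. */
static bool path_fits(uint32_t path_cycles, uint32_t cycles_per_bit)
{
    return path_cycles <= cycles_per_byte(cycles_per_bit);
}
```

Under that assumption the 9-cycle idle path fits, while the 22-cycle and 34-cycle paths both overrun the budget, which is why the compiled loop cannot keep up as it stands.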

 

Since the MOSI and TXD will effectively be synchronised, you'd get better results by only checking one of the flags and doing the read and write together.  I've tried it, and it is better, but not yet fast enough.  Checking SPSR instead of UCSR0A is a bit faster as well, since it is within range of in/out.

 

The real problem is the pointer and index arithmetic which AVR GCC seems determined to perform.

 

EDIT:  MISO to MOSI

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Tue. Aug 9, 2016 - 03:49 AM

joeymorin wrote:
But this:

        USART0A=_BV(UDRE0);

... is unnecessary isn't it?  UDREn is self-clearing.  Writing it to '1' has no effect.

 

And I assume you meant UCSR0A rather than USART0A?

Correct.  Maybe.  Correct.

Also, documentation seems to imply that UDREn is not cleared by an ISR.

Iluvatar is the better part of Valar.


Also, documentation seems to imply that UDREn is not cleared by an ISR.

That is true.  Writing UDRn i.e. placing something into the TX buffer can clear UDREn.  If there is no TX operation under way, the write to UDRn goes directly to the transmitter's shift register, so the TX buffer is still empty, and UDREn remains set.  A second write to UDRn during the TX operation in progress will go to the TX buffer, thus clearing UDREn.
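The buffer behaviour described above can be tried out on a host with a small model. This is only a simplified sketch of the description in this post, not the AVR register interface; the struct and function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Simplified host-side model of a double-buffered transmitter.
 * All names are hypothetical; this is not the AVR register interface. */
typedef struct {
    bool    shifter_busy;  /* a frame is currently being shifted out */
    bool    buffer_full;   /* the TX buffer holds a byte waiting for the shifter */
    uint8_t shifter;
    uint8_t buffer;
} tx_model;

/* Models the UDREn flag: set while the TX buffer is empty. */
static bool tx_udre(const tx_model *t)
{
    return !t->buffer_full;
}

/* Models a write to UDRn. The caller is expected to check tx_udre() first. */
static void tx_write(tx_model *t, uint8_t byte)
{
    if (!t->shifter_busy) {
        /* Idle transmitter: the byte goes straight to the shift register. */
        t->shifter = byte;
        t->shifter_busy = true;
    } else {
        /* Transmission in progress: the byte waits in the TX buffer. */
        t->buffer = byte;
        t->buffer_full = true;
    }
}

/* Models the end of a frame: a buffered byte, if any, moves to the shifter. */
static void tx_frame_done(tx_model *t)
{
    if (t->buffer_full) {
        t->shifter = t->buffer;
        t->buffer_full = false;
    } else {
        t->shifter_busy = false;
    }
}
```

In this model the first write to an idle transmitter leaves the flag set, and only a second write during the frame clears it.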

 

Note my edits in post #68.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 


joeymorin wrote:
The no-action case takes 9 cycles.  The UDR0 only case takes 21 cycles.  The SPDR only case takes 22 cycles.  The UDR0+SPDR case takes 34 cycles.

 

Since the MISO and TXD will effectively be synchronised, you'd get better results by only checking one of the flags and doing the read and write together.  I've tried it, and it is better, but not yet fast enough.  Checking SPSR instead of UCSR0A is a bit faster as well, since it is within range of in/out.

 

The real problem is the pointer and index arithmetic which AVR GCC seems determined to perform.

An else would eliminate the 34-cycle case.

If 22 cycles is not fast enough, I suppose four lines of in-line assembly

(two each for load and store) would do the trick for an aligned 256-byte buffer.

 

Note that though both SPIs have to run at the same frequency,

they do not have to have the same phase.

By activating SS at the right time, the delay can be selected to within a bit time.

Note that since SS is an input, one will need another wire to control it.
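For reference, the aligned 256-byte buffer idea can be modelled on a host in plain C. This is a sketch under my own naming, not code from the thread: with 8-bit indices the wrap from 255 to 0 comes for free, much as it would when only the low byte of a pointer into a page-aligned buffer is incremented.

```c
#include <stdint.h>

/* 256-byte ring buffer with 8-bit indices that wrap on their own.
 * Names are illustrative. */
typedef struct {
    uint8_t buf[256];
    uint8_t in;    /* next slot to write */
    uint8_t out;   /* next slot to read  */
} delay_ring;

/* Start with `delay_bytes` (0..255) zero bytes queued ahead of the input. */
static void ring_init(delay_ring *r, uint8_t delay_bytes)
{
    for (int i = 0; i < 256; i++)
        r->buf[i] = 0;
    r->out = 0;
    r->in  = delay_bytes;
}

/* Store one received byte and return the byte due for transmission. */
static uint8_t ring_step(delay_ring *r, uint8_t received)
{
    r->buf[r->in++] = received;
    return r->buf[r->out++];
}
```

After ring_init(&r, 3), the first three calls to ring_step() return 0 and every later call returns the byte that went in three calls earlier.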

Iluvatar is the better part of Valar.


Yes, I would just do this in assembler:

loop:
        in      r17,    SPSR    ; 1
        sbrs    r17,    7       ; 1/2
        rjmp    skip            ; 2
        in      r16,    SPDR    ; 1
        st      Y,      r16     ; 2
        inc     YL              ; 1
skip:
        lds     r17,    UCSR0A  ; 2
        sbrs    r17,    5       ; 1/2
        rjmp    loop            ; 2
        ld      r16,    Z       ; 2
        sts     UDR0,   r16     ; 2
        inc     ZL              ; 1
        rjmp    loop            ; 2

Cycle counts are 9, 12, 15, and 18 for the four cases.  This will still be too long.  Needs to be <= 16.

 

I don't see an advantage to testing the bit first.  In C, yes, since it would in theory obviate the need to be exactly 16 cycles, but that seems not to be possible.  Even in assembler it seems difficult to remain under 16 cycles.

 

Maybe:

loop:
        in      r17,    SPSR    ; 1
        sbrs    r17,    7       ; 1/2
        rjmp    loop            ; 2
        in      r16,    SPDR    ; 1
        st      Y,      r16     ; 2
        inc     YL              ; 1
        ld      r16,    Z       ; 2
        sts     UDR0,   r16     ; 2
        inc     ZL              ; 1
        rjmp    loop            ; 2

Counts are 4 and 14, so works fine, and doesn't need to pad with nops since the flag is tested.  But I'd just as soon do it the same as the MSPI-only code posted earlier:

loop:
        in      r16,    SPDR    ; 1
        st      Y,      r16     ; 2
        ld      r16,    Z       ; 2
        sts     UDR0,   r16     ; 2
        inc     YL              ; 1
        inc     ZL              ; 1
        rjmp    .+0             ; 2
        rjmp    .+0             ; 2
        nop                     ; 1
        rjmp    loop            ; 2

This will still allow a granularity of 1 CPU cycle, or 50 ns @ 20 MHz, with your suggestion of the careful selection of SPI phase, and a well timed driving low of /SS.  Very smart, by the way.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 


I think the granularity is 2 CPU cycles.

That said, 100ns is only 0.005 of 20 microseconds.

Note that 256*8 bits is enough for a 200 microsecond delay.
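Those figures are easy to check on a host. A minimal sketch, assuming the 20 MHz clock and the F_CPU/2 bit rate discussed earlier in the thread; the function names are mine, and the integer division truncates for clocks that do not divide evenly.

```c
#include <stdint.h>

/* Duration of one bit in nanoseconds, for a bit clock of f_cpu / divider. */
static uint64_t bit_time_ns(uint64_t f_cpu_hz, uint64_t divider)
{
    return divider * 1000000000ull / f_cpu_hz;
}

/* Longest delay, in nanoseconds, that `buffer_bytes` of storage can hold. */
static uint64_t max_delay_ns(uint64_t buffer_bytes, uint64_t f_cpu_hz,
                             uint64_t divider)
{
    return buffer_bytes * 8u * bit_time_ns(f_cpu_hz, divider);
}
```

With f_cpu_hz = 20000000 and divider = 2, one bit lasts 100 ns and a 256-byte buffer holds 204800 ns, a little over the 200 microseconds quoted.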

 

According to documentation, /SS is an input in slave mode.

It needs a wire to something.

An internal pull-up would be counter-productive.

Iluvatar is the better part of Valar.


According to documentation, /SS is an input in slave mode.

Yes, I meant to drive it low with another pin, as you suggest in the code you posted in #67.

 

I think the granularity is 2 CPU cycles.

I suppose so.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 


On further thought, I suspect that playing with the SPI mode can get down to single-cycle resolution.

For subcycle resolution, one might need to adjust Fcpu.
 

Iluvatar is the better part of Valar.


When you declared you thought the granularity was 2 cycles, I first thought "No it's 1", but when I took a few moments to explain how, I couldn't.  So I'm curious what you've come up with.

 

My back-of-the-napkin analysis gets stuck on the fact that the MSPI clock itself has a minimum period of 2 cpu clock cycles.

 

While we can control the phase of the MSPI clock on XCK w.r.t. TXD, that will adjust the phase relationship to either +1 or -1 cycles, so again a granularity of 2.

 

The same is true of the SPI doing the listening.  Phase relationship between SCK and MOSI can be +1 or -1 cycles.

 

In truth, though, I haven't sat down with pencil and paper and the timing diagrams and seriously tried to work it out.

 

I suppose if there were a way to insert an additional 1 cycle of delay between XCK and SCK, or between TXD and MOSI, it could be done.  I'm struggling to think of a way to do that.  The AC has its own synchroniser which inserts 0.5-1.5 cycles of latency.  The variability in that figure is there to account for asynchronous input, but in our case the input would be synchronous so we can probably count on a fixed latency of 1 cycle.  The trouble is that the m168 and friends do not have a way to tie ACO to an output.  I can't think of another trick to get that one extra cycle.

 

However this may all be moot:

 

 

I read that to mean that we can't reliably clock the SPI slave at F_OSC/2.

 

EDIT:  MISO to MOSI

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Tue. Aug 9, 2016 - 03:49 AM

I'd forgotten about the SPI speed limit.

The next MSPI speed available is Fcpu/4.

Iluvatar is the better part of Valar.


I feel as though that limit is given with the assumption that the master would be a separate device, likely operating from a different clock.  Even if a separate master and slave are running on clocks which are nominally the same speed, mutual jitter and drift between the two clocks would be inevitable, making the master and slave effectively asynchronous.  Thus, I figure, the stipulation of < 2 cpu clock cycles, rather than <= 2 cpu clock cycles, which I would expect of a synchronous system.

 

Now, since in this case the master and slave are two separate peripherals on the same device, driven by the same underlying system clock, it may in fact be that the real limit would be <= 2 cpu clock cycles.  If that is so, it would seem to be possible to achieve both a high sample rate of F_OSC/2, and a granularity of 2 cpu clock cycles.  Whether or not this proves to be true would, I imagine, depend upon the precise phase relationship between TXD and MOSI.  That is, if TXD switches at the same instant as MOSI is latched, the effect would be metastability, despite the synchroniser.  However, if they are out-of-phase by any significant fraction of a cpu clock cycle, it might work consistently and reliably.  Only a bench test will reveal which is true.  I have not performed a bench test.

 

Notwithstanding an answer to the above, I'm still curious how you proposed to get 1-cycle granularity.

 

 

For my part, @skeeve, you got me thinking.  I figured it might be possible to get 4-bit granularity without involving SPI, by using swap:

loop:
        lds     r17,    UDR0    ; 2
        swap    r17             ; 1
        mov     r18,    r17     ; 1
        andi    r18,    0xF0    ; 1
        andi    r17,    0x0F    ; 1
        or      r17,    r16     ; 1
        st      Y,      r17     ; 2
        ld      r17,    Z       ; 2
        sts     UDR0,   r17     ; 2
        inc     YL              ; 1
        inc     ZL              ; 1
        mov     r16,    r18     ; 1
        rjmp    loop            ; 2 = 18

18 cycles.  Not quite.  Although on an AVR where UDRn is within reach of in/out, it would be perfect.  The t2313/4313 is such a beast, but it doesn't have enough page-aligned SRAM for our purposes, although with auto-increment, the cycles used by the inc instructions could be used by andi instructions instead, allowing the use of a smaller buffer.  The m8/16/32/64/128 has enough SRAM, but doesn't support MSPI.  The only one I can find which meets all the criteria is the t1634.  Enough page-aligned SRAM, support for MSPI, and UDR0 in I/O space.

 

For m168 and friends, unrolling by 2 is sufficient:

loop:
        lds     r17,    UDR0    ; 2
        swap    r17             ; 1
        mov     r18,    r17     ; 1
        andi    r18,    0xF0    ; 1
        andi    r17,    0x0F    ; 1
        or      r17,    r16     ; 1
        st      Y,      r17     ; 2
        ld      r17,    Z       ; 2
        sts     UDR0,   r17     ; 2
        rjmp    .+0             ; 2
        lds     r17,    UDR0    ; 2
        swap    r17             ; 1
        mov     r16,    r17     ; 1
        andi    r16,    0xF0    ; 1
        andi    r17,    0x0F    ; 1
        or      r17,    r18     ; 1
        std     Y+1,    r17     ; 2
        ldd     r17,    Z+1     ; 2
        sts     UDR0,   r17     ; 2
        subi    YL,     -2      ; 1
        subi    ZL,     -2      ; 1
        rjmp    loop            ; 2 = 32

 

Now we have two methods.  One method can be used when the desired delay is an even number of nybbles, the other when we want an odd number of nybbles.
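The nybble carry at the heart of the swap loops can be modelled on a host. This is a sketch of my reading of the swap/andi/or sequence, assuming the bits go out most-significant first; the type and function names are hypothetical.

```c
#include <stdint.h>

/* Host-side model of the swap/andi/or sequence: each output byte is built
 * from the low nybble carried over from the previous input byte and the
 * high nybble of the current one, which shifts the stream by 4 bits. */
typedef struct {
    uint8_t carry;  /* previous input's low nybble, already moved to bits 7..4 */
} nybble_shifter;

static uint8_t nybble_shift(nybble_shifter *s, uint8_t in)
{
    uint8_t swapped = (uint8_t)((in << 4) | (in >> 4));    /* swap      */
    uint8_t out = (uint8_t)((swapped & 0x0F) | s->carry);  /* andi, or  */
    s->carry = (uint8_t)(swapped & 0xF0);                  /* andi      */
    return out;
}
```

Starting with a zero carry, feeding 0xAB and then 0xCD yields 0x0A and then 0xBC: the nybble sequence A, B, C, D comes out as 0, A, B, C, one nybble late.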

 

I've also added the ability to change the delay at run-time by means of button inputs.  There are 4 inputs:

  • longer by 1 nybble
  • shorter by 1 nybble
  • longer by 10 nybbles
  • shorter by 10 nybbles

 

For simplicity of coding, when a button is pressed the delay line is stopped, reconfigured for the new length, and restarted.  The buttons are handled by a pin change interrupt and ISR, although they could just as easily have been handled by polling the interrupt flag and jumping out of the loop.  There are just enough free cycles in the method 1 loop.  It would be a bit more complicated in the method 2 loop, as there aren't enough cycles at the right time in the loop.  Unrolling further would solve that problem.
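The button arithmetic itself is simple enough to sketch in C. To be clear, this is not the code in the attachment: the limits of 4 and 511 nybbles are my own derivation from the 20 MHz minimum and maximum symbol values quoted in this post (1600 ns and 204400 ns at 400 ns per nybble), clamping at the limits is an assumption, and the names are hypothetical.

```c
#include <stdint.h>

/* Limits in nybbles, assumed from the 20 MHz build:
 * 1600 ns / 400 ns = 4 and 204400 ns / 400 ns = 511. */
enum { DELAY_MIN_NYBBLES = 4, DELAY_MAX_NYBBLES = 511 };

/* Apply a button step (+1, -1, +10 or -10 nybbles) and clamp the result. */
static uint16_t adjust_delay(uint16_t nybbles, int16_t step)
{
    int32_t next = (int32_t)nybbles + step;
    if (next < DELAY_MIN_NYBBLES)
        next = DELAY_MIN_NYBBLES;
    if (next > DELAY_MAX_NYBBLES)
        next = DELAY_MAX_NYBBLES;
    return (uint16_t)next;
}
```

For example, adjust_delay(50, 10) gives 60, while adjust_delay(5, -10) stops at 4 rather than wrapping.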

 

I've done some testing, and I believe it to be reasonably free of bugs.  I have not tested with a scope.  Instead, I used the delay line as a serial loop-back (using an UNO).  Also, serial debugging was peppered into various places to verify correct operation.  That debug code has since been stripped.  Testing with an appropriate scope or LA will be necessary to fully confirm correct operation.

 

Although I tried to make it easily configurable, no great effort was expended to make it a work of art, nor particularly clever.  As such, I rely on the linker for the vector table and for placement of the ISR.  The remainder of the CRT is not needed, but no effort is made to prevent it from being linked.

 

A simple build shell script is included:

#!/bin/bash

SRC=delay_line
TARGET=atmega168
F_CPU=20000000

avr-gcc -Wall -g -save-temps -mmcu=${TARGET} -DF_CPU=${F_CPU} -Wl,-Map,${SRC}.map ${SRC}.S -o ${SRC}.elf
avr-objcopy -O ihex ${SRC}.elf ${SRC}.hex
avr-objdump -t ${SRC}.elf > ${SRC}.sym
avr-objdump -Sz ${SRC}.elf > ${SRC}.lss

 

The .sym file shows the symbols contained within the .elf.  This is so that you can confirm various timing symbols that are created in the .S.  For example, when built for 20 MHz:

01312d00 l       *ABS*	00000000 delay_line_f_cpu
00989680 l       *ABS*	00000000 delay_line_sample_rate
00000190 l       *ABS*	00000000 delay_line_granularity_ns
00000640 l       *ABS*	00000000 delay_line_minimum_ns
00004e20 l       *ABS*	00000000 delay_line_default_ns
00031e70 l       *ABS*	00000000 delay_line_maximum_ns

Note that the symbols are in hexadecimal.  The granularity of 190 is actually 0x190, or 400 decimal.  That's now better than the 500 ns stipulated by the OP.
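The first three symbol values can be reproduced on a host. This sketch assumes a sample rate of F_CPU/2 and a granularity of one nybble (4 bit times), which is consistent with the values listed but is not lifted from the .S file; the function names are made up.

```c
#include <stdint.h>

/* Assumed relationships (not taken from the .S file):
 *   sample rate = F_CPU / 2, granularity = 4 bit times (one nybble). */
static uint32_t sample_rate_hz(uint32_t f_cpu_hz)
{
    return f_cpu_hz / 2u;
}

static uint32_t granularity_ns(uint32_t f_cpu_hz)
{
    return (uint32_t)(4ull * 1000000000ull / sample_rate_hz(f_cpu_hz));
}

/* Number of whole nybbles in a delay given in nanoseconds. */
static uint32_t delay_in_nybbles(uint32_t delay_ns, uint32_t f_cpu_hz)
{
    return delay_ns / granularity_ns(f_cpu_hz);
}
```

For 20 MHz this gives a sample rate of 0x989680 (10000000) and a granularity of 0x190 (400 ns), so the 20000 ns default works out to 50 nybbles.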

 

Included in the .zip file is a .hex built for an m168 at 20 MHz, a default delay of 20 us, and with buttons on port B:

  • down by one: PB0
  •   up by one: PB1
  • down by ten: PB2
  •   up by ten: PB3

 

EDIT: changed MISO to MOSI.  Surprised nobody caught that, going all the way back to @skeeve's post #67.

EDIT: changed preprocessor arithmetic to properly handle non-integral clock speeds (e.g. 18.432 MHz)

EDIT: added generalised hardware config macros for USART

EDIT: fixed bug introduced when USART config macros were added :(

EDIT: fixed bug responsible for out-of-bounds buffer access, leading to spurious edges

Attachment(s): 

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Mon. Aug 15, 2016 - 07:59 PM

A SPI slave can be set to read a bit on either the rising or falling edge of SCK.

At a bit rate of F_cpu/2, those edges are one cycle apart.

Iluvatar is the better part of Valar.

Last Edited: Wed. Aug 10, 2016 - 03:20 AM

skeeve wrote:
A SPI slave can be set to read a bit on either the rising or falling edge of SCK. At a bit rate of F_cpu/2, those edges are one cycle apart.

 

The key word there is OR.

You can choose either clock polarity, but not both edges - Both edges is a DDR link, which modest MCUs still lack.

 


skeeve wrote:

A SPI slave can be set to read a bit on either the rising or falling edge of SCK.

At a bit rate of F_cpu/2, those edges are one cycle apart.

Ah yes, that was a failure of visualisation on my part.

 

Who-me wrote:

You can choose either clock pol