Are Load/Store Effective Times Specified?


I'm working on a stripped-down AVR emulator in order to simulate in real time an ATmega1280/ATmega2560-based game I'm working on (my current emulator, simavr, is fairly accurate but slower than I'd like). While implementing timers, however, I've noticed a pretty consistent 1 cycle discrepancy between my emulator and the hardware. The problem, I think, comes down to the "effective" times of sts and lds.

 

These are two-cycle instructions. In my emulator, I do all the work on the first cycle and nothing on the second. But I think the hardware actually uses both cycles for its magic, effectively doing everything on the second cycle. My tests seem to confirm this. Usually this makes no difference, unless you're accessing certain timer registers (TCCRnB or TCNTn, and probably others).

 

; TCNT1 comments: value if the sts write lands on its 2nd cycle / on its 1st cycle
clr r1
ldi r16, (1 << CS10)
ldi r17, (1 << TSM) | (1 << PSRSYNC)
sts TCCR1A, r1  ; normal mode
sts TCCR1B, r16 ; 1/1 prescaling        ; TCNT1 = -/0
                                        ; TCNT1 = 0/1
nop                                     ; TCNT1 = 1/2
nop                                     ; TCNT1 = 2/3
out GTCCR, r17                          ; TCNT1 = 3/4
lds r20, TCNT1L

write_byte framebuffer, r20 ; get 3

 

Seems reasonable, but is this behavior specified somewhere? I feel like it probably is, since it can make a noticeable difference, but I haven't found anything in the datasheet or various app notes. 
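
The two values in each TCNT1 comment above can be reproduced with a small cycle-count model. This is only a sketch, not something taken from the datasheet: it assumes the timer ticks once per CPU cycle after the cycle in which the TCCR1B write lands, and that it still ticks during the cycle in which `out GTCCR` executes. The names `tcnt1_after_sequence` and `sts_write_cycle` are invented here for illustration.

```c
#include <assert.h>

/* Hypothetical model of the test sequence above (assumptions, not datasheet facts).
 * sts_write_cycle: 1 if the STS write lands on its first cycle,
 *                  2 if it lands on its second cycle. */
static int tcnt1_after_sequence(int sts_write_cycle)
{
    /* Cycle costs of: sts TCCR1B, nop, nop, out GTCCR */
    const int cycles[] = { 2, 1, 1, 1 };
    int start_cycle = sts_write_cycle;  /* cycle in which the timer is enabled */
    int stop_cycle = 0;                 /* cycle in which OUT halts the prescaler */

    assert(sts_write_cycle == 1 || sts_write_cycle == 2);

    for (int i = 0; i < 4; i++)
        stop_cycle += cycles[i];

    /* Assumed: one tick per cycle after the enabling write,
     * up to and including the OUT cycle. */
    return stop_cycle - start_cycle;
}
```

Under these assumptions `tcnt1_after_sequence(1)` returns 4 and `tcnt1_after_sequence(2)` returns 3, so the hardware reading of 3 is consistent with the write landing on the second cycle.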

This topic has a solution.
Last Edited: Wed. May 25, 2022 - 03:16 AM

Welcome!

thcopeland wrote:
... but I haven't found anything in the datasheet or various app notes. 
The instruction set manual specifies the instructions, though it isn't a design document (it doesn't cover the two simultaneous finite state machines [instruction, data]).

Description | STS | AVR® Instruction Set Manual

 

edit :

There's HDL for emulated AVR cores; HDL isn't a design document, though it will show various takes on the specification.

 

 

"Dare to be naïve." - Buckminster Fuller

Last Edited: Thu. May 19, 2022 - 05:28 PM

Thanks!

 

Yeah, I've read most of the instruction set reference, a real asset while implementing the instructions, but it doesn't seem to address this problem.

 

I hadn't thought of HDL emulation... any particular one you're thinking of? Searching brings up a few results (OpenAVR and some dude's EE project), but they seem to lack interrupt and timer support.


Morten from Microchip/Atmel has said in the past what they use to build the simulation DLLs used in Studio 7. They basically feed it the VHDL for the chip design and it does the rest.

HOWEVER, it used to be the case that accurate Microchip/Atmel simulation was only available for Windows, so Linux-based developers were a bit stuck and had to rely on third-party simulation like simavr. But now they've ported the Studio 7 magic into the cross-platform MPLABX. So if you want accurate simulation on Linux, why not simply use that?


thcopeland wrote:
any particular one you're thinking of?
Gadget Factory

GitHub - GadgetFactory/Arduino-Soft-Core


I hadn't realized Studio was available for Linux, good to know. But in this case I want a fast, sufficiently complete simulator that I can use to test my game, and eventually package with it so it can be played without creating a physical device. Studio would be accurate, complete (and maybe fast?) but doesn't really fit the bill.

 

And yeah, "sufficiently complete"... a one-cycle discrepancy isn't a big deal, but if I can fix it, might as well. 


Verilog ... several posts on this forum about Verilator

IIRC, both the Windows DLL and the Linux shared object come with Microchip Studio 7.

 

edit :

Atmel Packs

 


Last Edited: Thu. May 19, 2022 - 07:04 PM

Thanks! will look through..


Re simavr, an alternative is QEMU.

QEMU AVR | AVR Freaks

 


whoops!

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Fri. May 20, 2022 - 04:55 AM
This reply has been marked as the solution. 

thcopeland wrote:
These are two-cycle instructions. In my emulator, I do all the work on the first cycle and nothing on the second. But I think the hardware actually uses both cycles for its magic,

You'll have to pardon me for "stating the bleeding obvious".

 

If the hardware takes two cycles, then it surely needs those two cycles for it's magic [sic]. I seriously doubt the designers of the AVR core decided to waste a clock cycle just for the hell of it.

 

What is up for discussion is whether the SRAM write is synchronous or asynchronous. I'd vote for asynchronous or internally timed because the STS target address gets clocked in on the 2nd of the two CPU clock cycles leaving no clock edge available for the instruction to complete. I don't think it can use the fetch cycle of the next instruction to perform the actual write (or read, in the case of LDS).

 


Since the STS and LDS are 32 bit instructions (2x16) there is no way the data can be delivered on the first clk! (register for LDS and memory for STS).

But since your emulator also should make 2 memory reads how can you even get data on the first? (before you know full addr.)  


N.Winterbottom wrote:
... because the STS target address gets clocked in on the 2nd of the two CPU clock cycles leaving no clock edge available for the instruction to complete.
Two edges per clock period, two simultaneous engines (instruction, data)

 

P.S.

The thread below demonstrates CMOS's peak current during possibly any edge (it uses ST instead of STS; on AVRxm, ST is one clock less than on AVRe):

XMega SRAM slow turnaround? - Solved (Glitchy Power Supply). | AVR Freaks

 


N.Winterbottom wrote:
If the hardware takes two cycles, then it surely needs those two cycles for it's magic [sic]. I seriously doubt the designers of the AVR core decided to waste a clock cycle just for the hell of it.

 

Yep, definitely. But I figured the extra cycle might be a nop for the register file, or necessary for the RAM to stabilize or something. Looks like neither's the case, though.

 

N.Winterbottom wrote:
the STS target address gets clocked in on the 2nd of the two CPU clock cycles leaving no clock edge available for the instruction to complete

 

sparrow2 wrote:
But since your emulator also should make 2 memory reads how can you even get data on the first? (before you know full addr.)  

 

Because my emulator isn't that precise: I just read PC+1 during the first cycle. The extra cycles are just padding to get the correct cycle count.

 

But you've answered my question! The second cycle is necessary to get the address, so the memory access must be performed during the second cycle. I believe Arduino-Soft-Core does this too, though I could be wrong; my VHDL isn't up to much. Same for ld/st.
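
In emulator terms, that conclusion could look roughly like the sketch below: a core that is stepped one clock at a time and only commits the STS write on the instruction's second cycle. Everything here (the `Core` struct, `core_tick`, the tiny memory sizes, treating every other opcode as a one-cycle NOP) is a made-up simplification for illustration, not how simavr or any real emulator is organised.

```c
#include <assert.h>
#include <stdint.h>

#define FLASH_WORDS 16
#define DATA_BYTES  0x200

/* Hypothetical, heavily simplified core; names and layout are illustrative only. */
typedef struct {
    uint16_t flash[FLASH_WORDS]; /* program memory, in 16-bit words */
    uint8_t  data[DATA_BYTES];   /* data space: r0..r31 at 0x00-0x1F, then I/O and SRAM */
    uint16_t pc;                 /* word address of the next fetch */
    uint16_t opcode;             /* first word of the instruction in flight */
    int      phase;              /* 0 = start of an instruction, 1 = second cycle of STS */
    unsigned long cycle;         /* elapsed CPU cycles */
} Core;

/* Advance the core by exactly one clock cycle. Only STS is modelled in detail. */
static void core_tick(Core *c)
{
    assert(c->pc < FLASH_WORDS);
    c->cycle++;
    if (c->phase == 0) {
        c->opcode = c->flash[c->pc++];
        if ((c->opcode & 0xFE0F) == 0x9200)   /* STS k, Rr: first word */
            c->phase = 1;                     /* nothing is written yet */
        /* anything else is treated as a one-cycle NOP in this sketch */
    } else {
        uint16_t k = c->flash[c->pc++];       /* second word: 16-bit address */
        unsigned r = (c->opcode >> 4) & 0x1F; /* source register number */
        assert(k < DATA_BYTES);
        c->data[k] = c->data[r];              /* the write lands on cycle 2 */
        c->phase = 0;
    }
}
```

With `sts 0x0081, r16` loaded as the words `0x9300, 0x0081`, the first `core_tick` leaves `data[0x81]` untouched and the second one performs the store, so a timer model stepped between ticks would see the new value one cycle later than with a write-on-first-cycle emulator.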

 

Thanks!

 

gchapman wrote:

Two edges per clock period, two simultaneous engines (instruction, data)

 

Huh... that sounds relevant, do you know a place I could read more about these inner workings?



If I recall correctly, older AVR spec sheets had timing diagrams for a number of basic operation types. Doubt that it has changed up to XMega. XMega and its descendants are potentially different because there are now a few operations that take a different number of clock cycles than in the classic AVRs. For example, the following is from Mega8 in a section called "Instruction Timing". There is also a timing diagram for "Single Cycle ALU Operation". Did not see anything about multi-cycle operations - that might be in a different spec sheet.

 

Jim

 

 

Until Black Lives Matter, we do not have "All Lives Matter"!

 

 


thcopeland wrote:
do you know a place I could read more about these inner workings?
IDK; AVR computer architecture was well before I joined this forum.


Since the STS and LDS are 32 bit instructions (2x16) there is no way the data can be delivered on the first clk!

Yeah, that.  Since the address for LDS/STS is in the 2nd word of the instruction, it "obviously" takes an extra cycle to read that address, delaying the actual read/write till the 2nd cycle.
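
To make the "address is in the 2nd word" point concrete, here is a minimal decoding sketch for the 32-bit LDS/STS forms. The bit patterns in the comment are the standard encodings from the instruction set manual; the `DirectAccess` struct and `decode_lds_sts` function are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool     is_store;  /* true for STS, false for LDS */
    uint8_t  reg;       /* register number 0..31 */
    uint16_t addr;      /* data-space address, taken from the second word */
} DirectAccess;

/* Returns true if word0 is the first word of a 32-bit LDS/STS and fills *out.
 *   LDS Rd, k : 1001 000d dddd 0000  kkkk kkkk kkkk kkkk
 *   STS k, Rr : 1001 001r rrrr 0000  kkkk kkkk kkkk kkkk */
static bool decode_lds_sts(uint16_t word0, uint16_t word1, DirectAccess *out)
{
    assert(out != NULL);
    if ((word0 & 0xFC0F) != 0x9000)
        return false;
    out->is_store = (word0 & 0x0200) != 0;
    out->reg      = (word0 >> 4) & 0x1F;
    out->addr     = word1;   /* unknown until the second fetch completes */
    return true;
}
```

The first word only carries the opcode and the register number, so nothing can be read from or written to the data space until `word1` has been fetched.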

 

An interesting question is why the 16-bit load/store instructions (LD, ST, LDD, STD) also take 2 cycles, and whether they also do the transfer on the 2nd cycle.

(I'd "guess" that the first cycle does math on the address register, where non-offset, non-incrementing versions just do "add zero" in that cycle.)

 

do you know a place I could read more about these inner workings?

Of the AVR specifically?  I don't know.  "proprietary", and all that.  Although there are apparently FPGA implementations of the architecture that would probably be illuminating if they were actually readable and/or well documented.

 

In general, there are lots of books about computer architecture.  If you get the older (pre-RISC) descriptions, especially in those architectures that take "many" cycles per instruction, or cover bitslice-based designs, you'll gain a lot of insight about how things work "in general."  For that matter, each new architecture that you learn "in some detail" will increase your understanding.  (Most recently, a lot of TI data books describe more about chip internals than most people probably want to know.  The extensive description of the "constant generator" in MSP430 is particularly "interesting.")

 


I've coded a simulator also.  It took me some time to update the cycle times (and I'm not done), which I did based on the AVR Instruction Set Manual (DS40002198B).  Times vary with chip architecture.  For LDS, AVRe and AVRrc take two cycles, and AVRxm takes three cycles, unless accessing I/O registers, in which case it's two cycles.
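
In a simulator, per-architecture timings like these tend to end up as a small lookup. The sketch below encodes only the LDS figures quoted in this post, so treat the numbers as this post's reading of the manual rather than a verified table; the enum and function names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Core variants mentioned in the post; the enum names are hypothetical. */
typedef enum { CORE_AVRE, CORE_AVRXM, CORE_AVRRC } AvrCore;

/* LDS cycle count as quoted above:
 * AVRe and AVRrc: 2 cycles; AVRxm: 3 cycles, or 2 when accessing I/O registers. */
static int lds_cycles(AvrCore core, bool io_register)
{
    switch (core) {
    case CORE_AVRE:  return 2;
    case CORE_AVRRC: return 2;
    case CORE_AVRXM: return io_register ? 2 : 3;
    }
    assert(0 && "unknown core");
    return 0;
}
```

Other instructions and core variants would need their own entries checked against the manual.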


westfw wrote:

(I'd "guess" that the first cycle does math on the address register, where non-offset, non-incrementing versions just do "add zero" in that cycle.)

Indeed - Identical op codes are generated.

int main (void)
{
  asm("LD  R15, Y");
  80:	f8 80       	ld	r15, Y
  asm("LDD R15, Y + 0");
  82:	f8 80       	ld	r15, Y
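
The identical bytes follow directly from the encoding: `LD Rd, Y` is simply `LDD Rd, Y+q` with q = 0. A minimal sketch of the encoder, using the bit layout from the instruction set manual (the function name `encode_ldd_y` is made up for this example):

```c
#include <assert.h>
#include <stdint.h>

/* LDD Rd, Y+q : 10q0 qq0d dddd 1qqq   (d = 0..31, q = 0..63)
 * LD  Rd, Y   : 1000 000d dddd 1000   (the same pattern with q = 0) */
static uint16_t encode_ldd_y(unsigned d, unsigned q)
{
    assert(d < 32 && q < 64);
    return (uint16_t)(0x8008
        | ((q & 0x20) << 8)   /* q5     -> bit 13 */
        | ((q & 0x18) << 7)   /* q4..q3 -> bits 11..10 */
        | (d << 4)            /* d4..d0 -> bits 8..4 */
        | (q & 0x07));        /* q2..q0 -> bits 2..0 */
}
```

`encode_ldd_y(15, 0)` gives `0x80F8`, which is stored little-endian as the bytes `f8 80` shown in the listing.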

 

 

westfw wrote:
The extensive description of the "constant generator" in MSP430 is particularly "interesting.")

Had to read that section at least three times over to understand it. And to explain it to someone now, I'd have to read it again.

 


For the constant generator of the MSP430:

This is close to what we do by keeping R1 always 0; imagine R3 always 1, and so on (except that they are in hardware and can't be changed, and they are given another name).


Thanks all! I've implemented the proper lds/sts timings. This reduced the emulator vs. physical discrepancy in some cases, but unfortunately did not totally fix it. Still a solid improvement, but it looks like cycle-perfect timers are a tough row to hoe. I'll settle for what I've got, which should be sufficient for my use case, and maybe do some reading and testing before returning to the problem.

 


What do you mean by

physical discrepancy

How do you measure it? (is it when IO is involved, or IRQ or ... )


I suppose that if you're talking about a timer running at the CPU clock frequency, you might want to get into details like which edge of the clock the timer is actually loaded at, and which edge causes the increment/decrement...

 


sparrow2 wrote:
is it when IO is involved, or IRQ or

 

westfw wrote:
which edge of the clock the timer is actually loaded at, and which edge causes the increment/decrement

 

Not exactly, it's an off-by-two error, so I'm probably missing something more fundamental. I'll create a new thread though, so posterity can more easily benefit from freaks' wisdom...


In the case of store instructions,

testing is not hard.

1:
OUT
store
RJMP 1b

1:
OUT
NOP
OUT
RJMP 1b

1:
OUT
OUT
NOP
RJMP 1b

Initializations and other significant details
have been left as an exercise for reader

The second loop will produce pulses one cycle longer than the third.

The first loop should be equivalent to the second or to the third.
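
The expected pulse widths can be worked out with a toy model before going to the bench. This is a sketch under assumptions of my own: each instruction is described only by its length and by the cycle on which its port write takes effect, OUT is taken to write on its single cycle, and the store is taken to hit the same port through its data-space address. The `Insn` struct and `pulse_width` function are hypothetical.

```c
#include <assert.h>

typedef struct {
    int cycles;       /* duration of the instruction in CPU cycles */
    int write_cycle;  /* cycle (1-based) on which it writes the port; 0 = no write */
} Insn;

/* Cycles between the first two port writes in a straight-line sequence. */
static int pulse_width(const Insn *seq, int n)
{
    int t = 0;        /* cycles elapsed before the current instruction */
    int first = -1;   /* absolute cycle of the first write, -1 if none yet */

    for (int i = 0; i < n; i++) {
        if (seq[i].write_cycle > 0) {
            int when = t + seq[i].write_cycle;
            if (first < 0)
                first = when;
            else
                return when - first;
        }
        t += seq[i].cycles;
    }
    assert(0 && "sequence needs two port writes");
    return -1;
}
```

With OUT as `{1, 1}`, NOP as `{1, 0}` and RJMP as `{2, 0}`, the second loop gives a width of 2 and the third a width of 1. For the first loop, a store modelled as `{2, 1}` gives 1 (like the third loop) and a store modelled as `{2, 2}` gives 2 (like the second), which is the distinction the measurement is meant to reveal.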

 

I'm less sure about testing loads, but here goes.

As I think is well-known, an OUT immediately followed by an IN does not work.

The AVR's synchronization mechanism defeats it.

An OUT immediately followed by a load should work or not

depending on when the load actually occurs.

 

Moderation in all things. -- ancient proverb