## A modulus operation takes 40us on ATmega328P?

Hi there freaks,

hours are ticking by like seconds and Google presents only garbage...

I have now tried a lot of variations on a simple if-clause and have reached the point where I think this behaviour isn't normal. My oscilloscope shows 40us for this if-clause, which I think is way too much. From the assembler code I counted 31 clocks for another variation (which I calculated to be around 2us at 16MHz).

I have this code:

```
#define TIMING_TEST_PIN 6

int volatile pos = 16;

void setup() {
  pinMode(TIMING_TEST_PIN, OUTPUT);
  pinMode(5, OUTPUT);
}

void test2() {
  PORTD |= (1 << TIMING_TEST_PIN);
  if (micros() % (1000L * 1000) < 20000) {
    PORTB &= ~(1 << 5); // LED_BUILTIN(13) -> PB5 bit5
    //delayMicroseconds(10);
    //pos = pos+1;
  }
  PORTD &= ~(1 << TIMING_TEST_PIN);
}

void loop() {
  test2();
  delayMicroseconds(100);
}
```

Does anyone have a clue?

UPDATE:

I could shrink the code and still have the problem:

```
#define TIMING_TEST_PIN 6

unsigned volatile long var = 0;
unsigned volatile long var2 = 0;

void setup() {
  pinMode(TIMING_TEST_PIN, OUTPUT);
  pinMode(5, OUTPUT);
}

void test2() {
  PORTD |= (1 << TIMING_TEST_PIN);
  var2 = var % 1000000L;
  PORTD &= ~(1 << TIMING_TEST_PIN);
}

void loop() {
  test2();
  delayMicroseconds(100);
}
```

Is that bloody modulus operator taking all that time?

I use that quite often to time some functions. I don't know how I could then achieve this without using a timer (the timers are already used).

Yes, I use Google, but there is no search engine within Google to search through those many useless results...

Last Edited: Wed. Mar 22, 2017 - 09:31 PM

You need to tell us more.

In particular, this is an Arduino sketch, right?  So it could be interrupting and doing other stuff.  [just for fun, turn off interrupts during your test sequence--but I don't think that's it]

But anyway, I'd like to see the 31 clocks code.  You are invoking the micros() function.  I'm guessing that this might be e.g. a 32-bit value with timer overflows counted, augmented by a microsecond tick count.  That's going to take a few microseconds.

Then you force a modulus operation with a very large divider.  That could indeed take 10s of microseconds.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

The compare will first have to AND the result from micros() (which must be 32-bit) with 1000L * 1000. Just getting 4 bytes out via the stack will take a few cycles. Then, it will have to do a multi-byte compare. So, even that is not going to be speedy.

Jim

Jim Wagner Oregon Research Electronics, Consulting Div. Tangent, OR, USA http://www.orelectronics.net

40us is 640 cycles on 16M. Sounds quite possible for a divide

You're forcing a 32bit division for the modulus operation; 40us doesn't seem too unlikely.  I can't find benchmark info for modulus specifically, but a floating point division (which is similar, but only 24bits) claims to take 400+ cycles.  (A divide will take several operations for each bit of operand, each operation has to handle 32bit (4 bytes...))

Oooh, did not see that the operation is "%" rather than "&". Big difference.

Jim

I thought the title was "simple if..."

We still haven't seen the generated code.

One can plug a few numbers into a simulator program to see how long it takes.

Helpful if one knows the AVR model and clock speed and toolchain and version and optimization level, without trying to infer it all.

Racing to count microseconds with looping code wouldn't be my first approach.

Tobey wrote:

...40us...

I measure 47.6usecs at 16MHz clock using CVAVR.

#1 This forum helps those that help themselves

#2 All grounds are not created equal

#3 How have you proved that your chip is running at xxMHz?

#4 "If you think you need floating point to solve the problem then you don't understand the problem. If you really do need floating point then you have a problem you do not understand." - Heater's ex-boss

Tobey wrote:

I don't know how I could then achieve this without using a timer (the timers are already used).

It's often possible to use one timer to do multiple things.

Brian Fairchild wrote:
I measure 47.6usecs at 16MHz clock using CVAVR.

IIRC from previous threads where divide speed was measured, the time will depend on the values of the operands.  [Also] IIRC, CodeVision uses the "repeated subtract of powers of 10" and would be consistent with that.

Back when I was your age, I had an AT90S4433 app running at about 4MHz.  4x 7-seg display.  Taking an ADC reading and converting to selected pressure units and display with decimal point in the right spot took just shy of a millisecond.  And surprisingly was nearly the same using three different approaches.

theusch wrote:
Helpful if one knows the AVR model and clock speed and toolchain and version and optimization level, without trying to infer it all.

The fact that it is Arduino code possibly narrows the guessing game a little, but there have been a few issues of that with varying avr-gcc versions and varying build options, and of course we don't know if the Arduino is a mega2560 or a mega328 or what.

Anyway, here is my guess..

```
C:\SysGCC\avr\bin>type avr.c
#include <avr/io.h>

uint32_t a, b, c;

int main(void){

a = b % c;
return 0;
}
C:\SysGCC\avr\bin>avr-gcc -mmcu=atmega328p -Os -g avr.c -o avr.elf

C:\SysGCC\avr\bin>avr-objdump -S avr.elf

avr.elf:     file format elf32-avr

Disassembly of section .text:

00000000 <__vectors>:
0:   0c 94 34 00     jmp     0x68    ; 0x68 <__ctors_end>
4:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
8:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
c:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
10:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
14:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
18:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
1c:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
20:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
24:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
28:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
2c:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
30:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
34:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
38:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
3c:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
40:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
44:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
48:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
4c:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
50:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
54:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
58:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
5c:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
60:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>
64:   0c 94 46 00     jmp     0x8c    ; 0x8c <__bad_interrupt>

00000068 <__ctors_end>:
68:   11 24           eor     r1, r1
6a:   1f be           out     0x3f, r1        ; 63
6c:   cf ef           ldi     r28, 0xFF       ; 255
6e:   d8 e0           ldi     r29, 0x08       ; 8
70:   de bf           out     0x3e, r29       ; 62
72:   cd bf           out     0x3d, r28       ; 61

00000074 <__do_clear_bss>:
74:   21 e0           ldi     r18, 0x01       ; 1
76:   a0 e0           ldi     r26, 0x00       ; 0
78:   b1 e0           ldi     r27, 0x01       ; 1
7a:   01 c0           rjmp    .+2             ; 0x7e <.do_clear_bss_start>

0000007c <.do_clear_bss_loop>:
7c:   1d 92           st      X+, r1

0000007e <.do_clear_bss_start>:
7e:   ac 30           cpi     r26, 0x0C       ; 12
80:   b2 07           cpc     r27, r18
82:   e1 f7           brne    .-8             ; 0x7c <.do_clear_bss_loop>
84:   0e 94 48 00     call    0x90    ; 0x90 <main>
88:   0c 94 87 00     jmp     0x10e   ; 0x10e <_exit>

8c:   0c 94 00 00     jmp     0       ; 0x0 <__vectors>

00000090 <main>:

uint32_t a, b, c;

int main(void){

a = b % c;
90:   60 91 00 01     lds     r22, 0x0100
94:   70 91 01 01     lds     r23, 0x0101
98:   80 91 02 01     lds     r24, 0x0102
9c:   90 91 03 01     lds     r25, 0x0103
a0:   20 91 04 01     lds     r18, 0x0104
a4:   30 91 05 01     lds     r19, 0x0105
a8:   40 91 06 01     lds     r20, 0x0106
ac:   50 91 07 01     lds     r21, 0x0107
b0:   0e 94 65 00     call    0xca    ; 0xca <__udivmodsi4>
b4:   60 93 08 01     sts     0x0108, r22
b8:   70 93 09 01     sts     0x0109, r23
bc:   80 93 0a 01     sts     0x010A, r24
c0:   90 93 0b 01     sts     0x010B, r25
return 0;
c4:   80 e0           ldi     r24, 0x00       ; 0
c6:   90 e0           ldi     r25, 0x00       ; 0
c8:   08 95           ret

000000ca <__udivmodsi4>:
ca:   a1 e2           ldi     r26, 0x21       ; 33
cc:   1a 2e           mov     r1, r26
ce:   aa 1b           sub     r26, r26
d0:   bb 1b           sub     r27, r27
d2:   fd 01           movw    r30, r26
d4:   0d c0           rjmp    .+26            ; 0xf0 <__udivmodsi4_ep>

000000d6 <__udivmodsi4_loop>:
d6:   aa 1f           adc     r26, r26
d8:   bb 1f           adc     r27, r27
da:   ee 1f           adc     r30, r30
dc:   ff 1f           adc     r31, r31
de:   a2 17           cp      r26, r18
e0:   b3 07           cpc     r27, r19
e2:   e4 07           cpc     r30, r20
e4:   f5 07           cpc     r31, r21
e6:   20 f0           brcs    .+8             ; 0xf0 <__udivmodsi4_ep>
e8:   a2 1b           sub     r26, r18
ea:   b3 0b           sbc     r27, r19
ec:   e4 0b           sbc     r30, r20
ee:   f5 0b           sbc     r31, r21

000000f0 <__udivmodsi4_ep>:
f0:   66 1f           adc     r22, r22
f2:   77 1f           adc     r23, r23
f4:   88 1f           adc     r24, r24
f6:   99 1f           adc     r25, r25
f8:   1a 94           dec     r1
fa:   69 f7           brne    .-38            ; 0xd6 <__udivmodsi4_loop>
fc:   60 95           com     r22
fe:   70 95           com     r23
100:   80 95           com     r24
102:   90 95           com     r25
104:   9b 01           movw    r18, r22
106:   ac 01           movw    r20, r24
108:   bd 01           movw    r22, r26
10a:   cf 01           movw    r24, r30
10c:   08 95           ret

0000010e <_exit>:
10e:   f8 94           cli

00000110 <__stop_program>:
110:   ff cf           rjmp    .-2             ; 0x110 <__stop_program>
```

I'm not in a position to simulate that so cannot say how many cycles are there. But clearly there's a pretty big loop there being executed 32 times!
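
For anyone following along, the `__udivmodsi4` loop in the listing above is a bit-serial shift-and-subtract divide: one shift, one 32-bit compare and (sometimes) one 32-bit subtract per bit of the dividend. Below is a rough C rendering of the same idea. It is a sketch, not what libgcc literally does: the names `udivmod32_t` and `udivmod32_sketch` are made up for illustration, and the 64-bit remainder is only there to keep the host-side C simple. Counting the instructions in the listing by hand, each pass looks like roughly 17 to 20 cycles, so 32 passes would land somewhere around 550 to 650 cycles, which is in the same region as the ~40us seen at 16MHz.

```c
#include <stdint.h>

/* Quotient and remainder of an unsigned 32-bit division. */
typedef struct {
    uint32_t quot;
    uint32_t rem;
} udivmod32_t;

/* Illustrative shift-and-subtract division, one pass per bit.
   Assumption: den is non-zero (division by zero is not handled). */
udivmod32_t udivmod32_sketch(uint32_t num, uint32_t den)
{
    uint64_t rem = 0;   /* wider than needed so the shift cannot overflow */
    uint32_t quot = 0;

    for (int i = 31; i >= 0; i--) {
        /* Shift the next dividend bit into the remainder. */
        rem = (rem << 1) | ((num >> i) & 1u);

        /* If the divisor fits, subtract it and set this quotient bit. */
        if (rem >= den) {
            rem -= den;
            quot |= (uint32_t)1 << i;
        }
    }

    udivmod32_t result = { quot, (uint32_t)rem };
    return result;
}
```

The point is that the work is done for every bit regardless of how small the numbers are, and each step is a 4-byte operation on an 8-bit core.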

theusch wrote:

...from previous threads where divide speed was measured, the time will depend on the values of the operands.  [Also] IIRC, CodeVision uses the "repeated subtract of powers of 10" and would be consistent with that.

Over a cup of coffee I tried with divisors of between 10 and the OP's 1,000,000 (in steps of x10) and the variance was 1.5us.

Brian Fairchild wrote:

Tobey wrote:

I don't know how I could then achieve this without using a timer (the timers are already used).

It's often possible to use one timer to do multiple things.

I searched for the Interrupt and found it in wiring.c:

```
#if defined(__AVR_ATtiny24__) || defined(__AVR_ATtiny44__) || defined(__AVR_ATtiny84__)
ISR(TIM0_OVF_vect)
#else
ISR(TIMER0_OVF_vect)
#endif
{
  // copy these to local variables so they can be stored in registers
  // (volatile variables must be read from memory on every access)
  unsigned long m = timer0_millis;
  unsigned char f = timer0_fract;

  m += MILLIS_INC;
  f += FRACT_INC;
  if (f >= FRACT_MAX) {
    f -= FRACT_MAX;
    m += 1;
  }

  timer0_fract = f;
  timer0_millis = m;
  timer0_overflow_count++;
}
```

How can I define a function in my source file that is called from this interrupt?

Last Edited: Thu. Mar 23, 2017 - 06:54 PM

You can't. So you need to disable that copy and implement your own. If you want to retain what that one was doing then you need to do the work from there too.

Tobey wrote:
I don't know how I could then achieve this without using a timer (the timers are already used).

You haven't really said what you are trying to time, what the minimum and maximum duration is, and what is the needed accuracy.

Your simple if() example seems to imply 20ms.  So a simple "soft timer" using millis() would be my first guess.  But tell us all the needed parameters and specifications.  Tell us what else is going on in the app such that all the timers are used.  Tell >>how<< they are used.  For example, if one timer is free-running doing PWM that can also be used to clock things to one timer tick.
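
A minimal sketch of such a "soft timer", assuming the caller passes in the current `millis()` value. The type `soft_timer_t` and the function `soft_timer_expired` are hypothetical names, not part of the Arduino core. It needs only a 32-bit subtract and compare, so no division is involved, and the unsigned subtraction keeps working when the millisecond counter rolls over.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical soft timer: fires once every `interval` ticks. */
typedef struct {
    uint32_t last;      /* tick value at which the timer last fired */
    uint32_t interval;  /* period in the same units as `now` (e.g. ms) */
} soft_timer_t;

/* Returns true once per elapsed interval. `now` is the current tick
   count, e.g. the value returned by millis(). */
bool soft_timer_expired(soft_timer_t *t, uint32_t now)
{
    /* Unsigned wrap-around makes (now - last) correct across rollover. */
    if ((uint32_t)(now - t->last) >= t->interval) {
        t->last += t->interval;  /* advance by the period to avoid drift */
        return true;
    }
    return false;
}
```

In a sketch this would be polled from `loop()`, along the lines of `if (soft_timer_expired(&t, millis())) { ... }`. Because the event is latched by the comparison rather than by hitting a narrow window, a slow pass through `loop()` delays the action but does not skip it.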

clawson wrote:

You can't. So you need to disable that copy and implement your own. If you want to retain what that one was doing then you need to do the work from there too.

Unless you use the ancient forbidden arts and hook that interrupt vector into your own function, like an old DOS virus

Nah, that's just crazy talk, a flashback to my old days as a teenage hacker. Carry on

I wrote in the title that it is an ATmega328P at 16MHz.

The reason I was using the modulus is that I wanted to make things happen at an exact interval or at a certain timing.

Timer0 is used for millis(), delay().

Timer1 is used for a servo.

Timer2 for tone.

Now... this is what i came up with...

```
// ... from wiring.c
ISR(TIMER0_OVF_vect)
{
  // copy these to local variables so they can be stored in registers
  // (volatile variables must be read from memory on every access)
  unsigned long m = timer0_millis;
  unsigned char f = timer0_fract;

  m += MILLIS_INC;
  f += FRACT_INC;
  tobysCounter++;
  if (f >= FRACT_MAX) {
    f -= FRACT_MAX;
    m += 1;
    tobysCounter++;
  }

  timer0_fract = f;
  timer0_millis = m;
  timer0_overflow_count++;

  if (tobysCounter >= tobysCounterTop) {
    tobysCounter = 0;
  }
  if (tobysCounter == tobysCounterCompA) {
    tobysCounterCompFlagA = true;
  }
}

boolean tobysCounterCompFlagAResult() {
  if (tobysCounterCompFlagA) {
    tobysCounterCompFlagA = false;
    return true;
  }
  else
    return false;
}

void setTobysCounterCompA(unsigned int value) {
  tobysCounterCompA = value;
}
```

Usage:

```
if (tobysCounterCompFlagAResult()) { //tobysCounterCompare(8000)){
  PORTD |= (1 << TIMING_TEST_PIN);
  delay(1);
}
else {
  PORTD &= ~(1 << TIMING_TEST_PIN);
}
```

I just realized that this is much better: the flag stays set until the function is called, so the timing of the check does not have to be exact, and there is no "double execution" as there was with the modulus approach when the value being compared against was too big.
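
One thing worth checking in the ISR above: the counter is sometimes incremented twice in the same interrupt, so a test for exact equality can step over the compare value and the flag is then not set on that round. A hedged sketch of one way around it, written as plain C so it can be tried on a PC: the struct and function names are hypothetical, and it assumes `0 < comp < top`. On the real chip the shared variables would also need to be `volatile`, and the multi-byte ones accessed with interrupts briefly disabled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software counter, advanced from a periodic tick (e.g. an ISR). */
typedef struct {
    uint16_t count; /* current value, always < top */
    uint16_t top;   /* wrap point */
    uint16_t comp;  /* compare value, assumed 0 < comp < top */
    bool flag;      /* latched until the main loop takes it */
} soft_counter_t;

/* Advance by `step` (1 or 2 here) and latch the flag if comp was reached
   or stepped over, instead of testing for exact equality. */
void soft_counter_advance(soft_counter_t *c, uint16_t step)
{
    uint16_t prev = c->count;
    uint16_t next = (uint16_t)(prev + step); /* assumes top + step fits in 16 bits */

    if (prev < c->comp && next >= c->comp) {
        c->flag = true;
    }
    c->count = (next >= c->top) ? 0 : next;
}

/* Test-and-clear, same idea as tobysCounterCompFlagAResult(). */
bool soft_counter_take_flag(soft_counter_t *c)
{
    bool was_set = c->flag;
    c->flag = false;
    return was_set;
}
```

The latch-and-clear flag is kept exactly as in the post; only the comparison changes from "equal" to "crossed since the last tick".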

IIRC, CodeVision uses the "repeated subtract of powers of 10"

Is that for ALL divisions by powers of 10, or just for number to string conversion?

The reason I was using the modulus is that I wanted to make things happen at an exact interval or at a certain timing.

`    if(micros() % (1000L*1000) < 20000){`

Doesn't look THAT exact.   Within 20 milliseconds of every second, right?  An AND with 1024*1024-1 would be within 50ms...  (and on an Arduino, isn't millis() accurate to within <2ms, so `(millis() & 1023) < 8` should be within ~10ms...)
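
A small sketch of the mask idea, assuming the period can be rounded to a power of two (1024 ms instead of 1000 ms). The function names are hypothetical. For unsigned values, `x % 2^n` gives the same result as `x & (2^n - 1)`, and the AND is a handful of single-cycle instructions instead of a divide loop. Note the parentheses around the AND: in C, `<` binds tighter than `&`, so `a & 1023 < 8` would be parsed as `a & (1023 < 8)`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Remainder for a power-of-two modulus, without a division.
   Assumption: pow2 is a power of two (1, 2, 4, ..., 2^31). */
uint32_t mod_pow2(uint32_t x, uint32_t pow2)
{
    return x & (pow2 - 1u);
}

/* True during the first 8 ms of every 1024 ms period.
   `now_ms` would be the value returned by millis(). */
bool in_window_ms(uint32_t now_ms)
{
    return (now_ms & 1023u) < 8u;
}
```

The trade-off is that the period is 1.024 s rather than exactly 1 s, which only matters if the interval has to line up with real seconds.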