SREG C Bit on MUL (unsigned) instruction


According to the instruction set, the AVR MUL (unsigned) instruction copies result bit 15 to the C bit.

I can't really see the reason for that, except (this needs further investigation and I am not really sure about it) when using the 1:15 = 1:7 * 1:7 number format. Even so, it seems very strange, since 0xB6 x 0xB6 turns on bit 15, but not every 1:7 x 1:7 product does that; 0x85 x 0x85 does not.

Can you see any reason or advantage to having result bit 15 copied to the C bit?

What would one use this carry for in subsequent programming?

I think the N bit would be a more natural way to represent bit 15.
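For reference, here is a minimal Python model of what the instruction-set manual describes for MUL's flag updates (a sketch of the documented behaviour, not of the actual hardware):

```python
def avr_mul(rd, rr):
    """Model of AVR MUL Rd,Rr: R1:R0 <- Rd * Rr (unsigned),
    C <- bit 15 of the result, Z <- set if the result is zero."""
    assert 0 <= rd <= 0xFF and 0 <= rr <= 0xFF
    result = rd * rr                 # always fits in 16 bits
    c = (result >> 15) & 1           # C is a plain copy of bit 15
    z = 1 if result == 0 else 0
    return result, c, z

# The examples from the post:
print(avr_mul(0xB6, 0xB6))  # bit 15 set   -> C = 1
print(avr_mul(0x85, 0x85))  # bit 15 clear -> C = 0
```

As the 0xB6/0x85 pair shows, C is not an overflow indication here; it simply tells whether the product is 0x8000 or above.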

Wagner Lipnharski
Orlando Florida USA

Total votes: 0

I've used the C flag in a multiply-and-accumulate function. It's a warning that the product just calculated might overflow the accumulator. That gives an opportunity to scale data to prevent overflow.

I think the N flag should never be set by MUL since it's unsigned. You could make a case for MULS & MULSU setting the N flag, but I like the consistency of just Z & C across all the xMULx instructions.

Total votes: 0

balisong42 wrote:

I've used the C flag in a multiply-and-accumulate function. It's a warning that the product just calculated might overflow the accumulator. That gives an opportunity to scale data to prevent overflow.

I think the N flag should never be set by MUL since it's unsigned. You could make a case for MULS & MULSU setting the N flag, but I like the consistency of just Z & C across all the xMULx instructions.

That is an interesting use for the C bit copied from bit 15, even though it misses the exact and real mathematical function of the carry bit.

I wrote about the N bit because in every other instruction (ADD, SUB, the shifts except LSR, OR, AND, etc.) the N bit is a pure copy of bit 7, regardless of whether the value is considered signed or not.

When accumulating the result of a multiplication, even a high byte (R1) of just 0x01 can create an overflow at the accumulator, so it would be dangerous code to rely on bit 15 for overflow awareness; any multiply-and-accumulate should always test the carry bit after adding.
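The warning here can be made concrete with a small Python model of an ADD/ADC chain (a sketch, not real AVR code): the real overflow indication is the carry out of the final ADC, not MUL's bit-15 copy.

```python
def adc_chain(acc, addend):
    """Add a multi-byte value into a multi-byte accumulator,
    low byte first, like an ADD followed by ADCs on the AVR.
    Returns the new accumulator bytes and the final carry."""
    out, c = [], 0
    for a, b in zip(acc, addend):        # [low, high, ...] order
        s = a + b + c
        out.append(s & 0xFF)
        c = s >> 8
    return out, c

# A product whose high byte is only 0x01 (so MUL's C = 0) can
# still overflow a 16-bit accumulator that is nearly full:
print(adc_chain([0xFF, 0xFF], [0x00, 0x01]))   # final carry = 1
```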

Interesting to observe that SER (set all register bits) changes no flags, while CLR (clear all register bits) sets S=V=N=0, Z=1; there should be a strong reason for such a decision, breaking reciprocity rules.

I was unable to find ANY information anywhere in the instruction-set documentation about the differences in operation logic between MUL, MULS and MULSU, apart from the range of registers used; I mean, what they really do.

Examples:

```LDI R25, 0x82  ; 130
LDI R26, 0x83  ; 131
MUL R25, R26```

Results in 0x4286 (17030), R1=0x42, R0=0x86

But if R25 and R26 are to be considered signed, -126 * -125 must result in a positive 15750 (0x3D86); to do that, the negative signed bytes need to be NEGated, so

```LDI R25, 0x82 ; (-126)
LDI R26, 0x83 ; (-125)
MULS R25, R26```

When both are negative, this should do the same as

```LDI R25, 0x82 ; (-126)
LDI R26, 0x83 ; (-125)
NEG R25       ; R25 = 0x7E
NEG R26       ; R26 = 0x7D
MUL R25, R26```

Another simple example: a multiplication of two signed numbers, one negative, the other positive.

```LDI R27, 0xFD      ; -3
LDI R28, 3         ; +3
MUL R27, R28       ; regular unsigned mul
; Results in 0x02F7 (759)
; while
MULS R27, R28      ; Should result in -9 (0xFFF7)```

So, how does MULS do it?

As far as I know about signed ALU processing in other chips, it checks bit 7 of both R27 and R28, then NEGates whichever has bit 7 set.

The MULS decoding, based on the register's bit 7, NEGates the R27 value as it enters the ALU (InputDataBusNegate), from 0xFD to 0x03; this also latches OutputDataBusNegate at the ALU output, since the result will be negative. Then the MUL takes place, giving 3 x 3 = 9 (0x0009), and because of OutputDataBusNegate the logic automatically NEGates the ALU output, changing 0x0009 to 0xFFF7, which is a signed -9.

So, I guess MULS always NEGates any register with bit 7 = 1, but only NEGates the result if the XOR of both bit 7s is 1; I mean, if both bit 7s are the same, it doesn't NEGate the result.

If both are positive (bit 7 clear), no NEGate takes place at either the ALU input or output, and a regular MUL takes place.

Then, a simple (human) way for MULS to decide is to add both bit 7s:

0 = Both values are positive. Just do a regular MUL and nothing else; the result is positive.

1 = NEGate the one register with bit 7 set, do a regular MUL, then NEGate the result, since the result is negative.

2 = Both values are negative. NEGate both registers, do a regular MUL, and nothing else; the result is positive.
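If this guess is right, the NEGate-in/NEGate-out scheme has to agree with an ordinary two's-complement multiply for every input pair. A quick Python check of the three cases above (a sketch of the guessed logic, not anything from Atmel):

```python
def avr_muls(rd, rr):
    """Guessed MULS logic: NEGate negative inputs, do an unsigned
    MUL on the magnitudes, NEGate the 16-bit result if exactly one
    input was negative (XOR of the two bit 7s)."""
    negate_out = ((rd >> 7) ^ (rr >> 7)) & 1
    if rd & 0x80: rd = (-rd) & 0xFF      # InputDataBusNegate
    if rr & 0x80: rr = (-rr) & 0xFF
    result = rd * rr                     # plain unsigned multiply
    if negate_out:
        result = (-result) & 0xFFFF      # OutputDataBusNegate
    return result

def signed8(x):
    return x - 256 if x & 0x80 else x

# Exhaustive check against a direct signed multiply:
print(all(avr_muls(a, b) == (signed8(a) * signed8(b)) & 0xFFFF
          for a in range(256) for b in range(256)))
# The examples from the post:
print(hex(avr_muls(0x82, 0x83)))  # -126 * -125 = 15750 = 0x3d86
print(hex(avr_muls(0xFD, 0x03)))  # -3 * 3 = -9 = 0xfff7
```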

I guess that for "MULSU Rd,Rr", the decision is based only on the signed register, Rd.

If Rd's bit 7 is set, the result will be negative: latch InputDataBusNegate and OutputDataBusNegate, NEGate Rd, do a regular MUL, and NEGate the result; otherwise just do a regular MUL.
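That guess can be checked the same way: keyed only on Rd's bit 7, the NEGate-in/NEGate-out scheme matches a signed-times-unsigned multiply for every input pair (again, a sketch of guessed logic, not anything from Atmel):

```python
def avr_mulsu(rd, rr):
    """Guessed MULSU logic: Rd signed, Rr unsigned; negate in and
    out based only on Rd's bit 7."""
    negate = (rd >> 7) & 1
    if negate:
        rd = (-rd) & 0xFF      # NEGate Rd on the way in
    result = rd * rr           # plain unsigned multiply
    if negate:
        result = (-result) & 0xFFFF   # NEGate the 16-bit result
    return result

def signed8(x):
    return x - 256 if x & 0x80 else x

# Exhaustive check against a direct signed * unsigned multiply:
print(all(avr_mulsu(a, b) == (signed8(a) * b) & 0xFFFF
          for a in range(256) for b in range(256)))
```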

The point of my complaints for years: why is this not explained in the documentation? Perhaps "we don't need to know". Well, I work in assembly language, on bare-chip code, not because I want to suffer but to super-optimize and make the best code possible. I am entitled to such information, especially when the datasheet for MUL/MULS/MULSU doesn't show any difference among them.

In the same way, I still wonder why MUL/MULS/MULSU copy bit 15 to the C bit... and don't set the SREG N bit based on OutputDataBusNegate. We don't need to know; we just need to guess.

Wagner Lipnharski
Orlando Florida USA

Total votes: 0

wagnerlip wrote:

The point of my complaints for years: why is this not explained in the documentation? Perhaps "we don't need to know". Well, I work in assembly language, on bare-chip code, not because I want to suffer but to super-optimize and make the best code possible. I am entitled to such information, especially when the datasheet for MUL/MULS/MULSU doesn't show any difference among them.

Wow.  I think the problem is too many RET in your assembly code, approaching one for each [R]CALL.

If I were faced with your conundrum, I guess I would first use the Studio simulator, feed it various interesting multiplicand/multiplier values, and see if the results shed any light. Then I would build a test program that, with a bit more instrumentation, could be run on the AVR to check the results.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Total votes: 0

wagnerlip wrote:
... The point of my complaints for years: why is this not explained in the documentation? Perhaps "we don't need to know". Well, I work in assembly language ... to super-optimize and make the best code possible. I am entitled to such information, especially when the datasheet for MUL/MULS/MULSU doesn't show any difference among them....

That is old-fashioned thinking. These days you are entitled to however much information maximizes shareholder value. Some day, I hope, people will understand that that was nonsense.

- John

Total votes: 0

Of course, Theusch, that is what I will do as soon as I reach the other station with Studio; I don't have it on this PC...

I confused the Call/Ret relationship in the assembly stats I posted; not a good day, I guess. I use a little invasive application that counts how many times each subroutine is called during a specific period of time or task. That is used to evaluate the clock cycles wasted in call/ret, in order to decide whether to keep them as called subroutines or as straight duplicated code. I was just reviewing some results when I wrote that comment, and at that particular moment it was really alarming to see that huge discrepancy. Shame on me. Too much aluminum in my life... ;)

The statistics routine is pretty simple: instead of using RET at the end of each subroutine, there is a JMP to a small piece of code that reads the caller's return address from the stack, increments a counter in SRAM for that particular address, and then does the RET. On the next reset (with the engine warm) it dumps that block of data for analysis. In some way it helps decide between speed and code size. The JMP or RET at the end of each subroutine is conditional assembly; a simple 1 or 0 in a variable assembles the code one way or the other. Another nice small feature of these stats is seeing how far down SRAM the stack went: at reset it checks down from RAMTOP for where the run of zeros starts, then cleans up and restarts everything.

Wagner Lipnharski
Orlando Florida USA

Total votes: 0

wagnerlip wrote:
there is this JMP to this small piece of code
C compilers do that - they call it "tail call optimization" ;-)

Total votes: 0

From #3:

Interesting to observe that SER (set all register bits) changes no flags, while CLR (clear all register bits) sets S=V=N=0, Z=1; there should be a strong reason for such a decision, breaking reciprocity rules.

There is no SER instruction!

Just look at the real instruction behind SER and you will know why nothing changes (it's an LDI).

And there is no CLR instruction; it is an EOR of a register with itself.

If you also want to update C, then SUB a register with itself.

Add:

And in some cases where you need to sign-extend, you can SBC a register with itself; that gives 0xFF if C==1 and 0x00 if C==0.
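The SBC trick is easy to model: subtracting a register from itself with borrow leaves 0x00 or 0xFF depending only on C, i.e. a one-instruction sign extension of the carry. A Python sketch:

```python
def sbc(rd, rr, c):
    """AVR SBC Rd,Rr: Rd <- Rd - Rr - C, wrapped to 8 bits."""
    return (rd - rr - c) & 0xFF

# SBC a register with itself: the register's old value cancels out,
# leaving 0x00 if C was clear and 0xFF if C was set.
for r in (0x00, 0x42, 0xFF):
    print(sbc(r, r, 0), sbc(r, r, 1))   # always: 0 255
```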

Last Edited: Fri. Mar 29, 2019 - 11:23 AM
Total votes: 1

wagnerlip wrote:

The point of my complaints for years: why is this not explained in the documentation? Perhaps "we don't need to know". Well, I work in assembly language, on bare-chip code, not because I want to suffer but to super-optimize and make the best code possible. I am entitled to such information, especially when the datasheet for MUL/MULS/MULSU doesn't show any difference among them.

It is worth remembering that the AVR was not designed for assembly programmers but for people who use C. The instruction set was designed in conjunction with IAR and was optimised so that compilers could be efficient.

Attachment(s):

#1 Hardware Problem? https://www.avrfreaks.net/forum/...

#2 Hardware Problem? Read AVR042.

#3 All grounds are not created equal

#4 Have you proved your chip is running at xxMHz?

#5 "If you think you need floating point to solve the problem then you don't understand the problem. If you really do need floating point then you have a problem you do not understand."

Total votes: 0

Brian Fairchild wrote:
... It is worth remembering that the AVR was not designed for assembly programmers but for people who use C. The instruction set was designed in conjunction with IAR and was optimised so that compilers could be efficient....

I would add that it is arguably even more important that the authors of a compiler know the varied implementations of an instruction set, since their work must automatically accommodate a good many targets, and likely a great many more to come. Satisfying the compiler writers was the more critical concern.

- John

Total votes: 0

If you want to learn more about the different MUL instructions, read the old Atmel AVR201 appnote; at Microchip it's AN_1631: Using the 8-bit AVR Hardware Multiplier.

Total votes: 0

sparrow2 wrote:

If you want to learn more about the different MUL instructions, read the old Atmel AVR201 appnote; at Microchip it's AN_1631: Using the 8-bit AVR Hardware Multiplier.

Nothing new to learn there about the six forms of MUL, except for the oddball C bit on SREG copying bit 15.

No other AVR instruction raises the C bit without a real overflow, meaning a result that could not fit into the register bits.

I have already opened a ticket at Microchip about this; they accepted it and the status is "on-going". Curious about the answer.

Wagner Lipnharski
Orlando Florida USA

Total votes: 0

It's impossible for an 8-bit multiply to overflow. 255 x 255 is 65,025, which fits in 16 bits. I'd guess that the engineers knew this and were only trying to save you some effort if you ever needed to know the top bit of the result.

--Mike

Total votes: 0

sparrow2 wrote:

From #3:

Interesting to observe that SER (set all register bits) changes no flags, while CLR (clear all register bits) sets S=V=N=0, Z=1; there should be a strong reason for such a decision, breaking reciprocity rules.

There is no SER instruction!

Just look at the real instruction behind SER and you will know why nothing changes (it's an LDI).

And there is no CLR instruction; it is an EOR of a register with itself.

If you also want to update C, then SUB a register with itself.

Add:

And in some cases where you need to sign-extend, you can SBC a register with itself; that gives 0xFF if C==1 and 0x00 if C==0.

You are quite correct; I forgot to check whether SER and CLR were aliases, and they are.

Also aliases: TST is an alias of AND, and SBR is an alias of ORI.

There is something strange about RCALL and RJMP: these 16-bit instructions can hold only a 12-bit address displacement. That means 4096 bytes forward, or 4094 bytes backward. The instruction set describes it as PC-2K+1 to PC+2K, in words. This little difference of 2 bytes (1 word) exists because, after fetching the RCALL/RJMP instruction, the Program Counter is already pointing at the word after the RCALL/RJMP; moving forward just adds the full 12-bit value, while going backward has already lost one word in the count. But the catch here, never explained exactly, is what item 91 of the datasheet says:

RCALL... For AVR microcontroller with program memory not exceeding 4k words (8kBytes), this instruction can address the entire memory from every address location.

How can it reach any address in 4K words (8K bytes) with only 12 bits of byte displacement? It could swing within only 4K bytes, not 8K. Something is not very well documented.

The only way I can see this being possible: it is not the chip but the compiler that decides to jump backward even if the target address is forward, or vice versa, when the distance is farther than 2K words, and that only works if the whole program memory is seen as wrapping around, tail to head. For example, when needing to jump back from 0xF05 to 0x007, it is easier to jump forward 0x102; the addressing bits just ignore the extra, stay within the 0xFFF range, roll over to the start of the program memory, and the distance is shorter than 2K+1 words.
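That rollover can be modeled directly from the manual's RJMP/RCALL semantics (PC <- PC + k + 1, modulo the PC width). A Python sketch for a 4K-word part, using the 0xF05 example (the raw distance of 0x102 encodes as k = 0x101 because the PC has already advanced):

```python
PC_MASK = 0xFFF        # 12-bit word PC: a 4K-word (8 KB) part

def rjmp_target(pc, k):
    """RJMP/RCALL: PC <- PC + k + 1, modulo the 12-bit PC."""
    assert -2048 <= k <= 2047          # signed 12-bit displacement, in WORDS
    return (pc + k + 1) & PC_MASK

# Reaching a low address from near the top via forward wrap-around:
print(hex(rjmp_target(0xF05, 0x101)))  # lands on 0x007

# From any location, some in-range k reaches every word address:
reachable = {rjmp_target(0x001, k) for k in range(-2048, 2048)}
print(len(reachable) == 0x1000)        # the whole 4K-word space
```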

Wagner Lipnharski
Orlando Florida USA

Total votes: 0

"it is not the chip, but the compiler, that decides to jump backward " - Yes, the compilers do that. But they wouldn't do that if the chip didn't support that? So I'm not sure why you say "it is not the chip". Or are you just complaining about this not being (well) documented?

Total votes: 0

The 12 bits of address displacement are in WORD address space, not byte. That's because all AVR opcodes are 2 bytes, so the PC always points to a word address.

Total votes: 0

ezharkov wrote:

"it is not the chip, but the compiler, that decides to jump backward " - Yes, the compilers do that. But they wouldn't do that if the chip didn't support that? So I'm not sure why you say "it is not the chip". Or are you just complaining about this not being (well) documented?

When I say "it is not the chip", I am referring to the datasheet text that says "For AVR microcontroller with program memory not exceeding 4k words (8kBytes)". I mean it is not the chip that has some magic allowing it to jump within 8K bytes with only 12 bits of address displacement; it is the assembler or compiler that recognizes that the selected chip has only 13 bits of addressing, so the overlap can be used. Of course, the chip does its natural address overlap within its 13 bits of addressing, but it is the compiler/assembler that prepares the jump the other way around when the displacement is larger than 2K words.

Wagner Lipnharski
Orlando Florida USA

Total votes: 0

balisong42 wrote:

The 12-bit of address displacement are WORD address space not byte. That's because all AVR opcodes are 2-bytes, so PC is always pointing to a word address.

The 12-bit address displacement is bytewise; it means 4096 addresses, and this is why the displacement of an RJMP or RCALL is only +2048/-2047 words.

The program memory addressing is still bytewise, but the Program Counter register lacks the low-order bit; it doesn't need it, since it counts instructions instead.

AVR memory addressing is linear; it needs to address bytes for registers, I/Os, SRAM and even program memory.

The LPM instruction reads bytes exactly because byte addressing is still possible.

Interestingly enough, SPM (Store Program Memory) writes 16 bits, with registers R1:R0 as the data source. Even so, I am not sure whether the core writes the 16 bits at once or does it in two cycles of 8 bits; maybe it uses the same 16-bit pipeline bus as instruction fetch when writing to the flash.
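The byte-wise view that LPM gives over the word-wide flash can be sketched in Python (the flash size and addresses here are just illustrative assumptions):

```python
FLASH_WORDS = 0x1000            # assume a 4K-word (8 KB) part

def lpm(flash, z):
    """Model of LPM: Z is a BYTE address; bit 0 selects the low or
    high byte of the flash word at Z >> 1, wrapped to the flash size."""
    word = flash[(z >> 1) % FLASH_WORDS]
    return word & 0xFF if (z & 1) == 0 else word >> 8

flash = [0] * FLASH_WORDS
flash[0x123] = 0xBEEF           # one 16-bit flash word
print(hex(lpm(flash, 0x246)), hex(lpm(flash, 0x247)))  # low byte, then high
```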

Wagner Lipnharski
Orlando Florida USA

Total votes: 0

avr-mike wrote:

It's impossible for an 8-bit multiply to overflow. 255 x 255 is 65,025, which fits in 16 bits. I'd guess that the engineers knew this and were only trying to save you some effort if you ever needed to know the top bit of the result.

--Mike

Mike, that is exactly why I question this C bit = bit 15; the C bit is for overflow, period.

It is impossible to have an overflow in an 8 x 8-bit MUL when the result is 16 bits.

The engineers may have thought of some other use for the C bit copying bit 15 in subsequent instructions.

If they thought about that, I want to know; it may be very useful.

This kind of knowledge should not be lost just because the documentation is short on explanation.

Wagner Lipnharski
Orlando Florida USA

Total votes: 0

wagnerlip wrote:

No other AVR instruction raises the C bit without a real overflow, meaning a result that could not fit into the register bits.

I have already opened a ticket at Microchip about this; they accepted it and the status is "on-going". Curious about the answer.

Do you really think that a mature (perhaps obsolete) microcontroller instruction set is going to be changed because you question something that has been around for several decades?

If you really want to be pedantic:

COM

SEC

BSET

I'll leave it to you ecclesiastical scholars whether to list all the shift and rotate instructions, and also to make the judgement call on FMULS "before the shift".

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Total votes: 0

No, you miss the point of the AVR memory structure; it's a Harvard structure where code and data are separate.

The flash is 16 bits wide, NOT 8; that is why it can do one 16-bit instruction in one clock. The program counter counts in 16-bit words (not because it's missing a bit!).

Everything else is 8 bits (registers, I/Os, RAM, etc.; 16-bit timers often have an extra HW temp/shadow register so a 16-bit access can be done in sync, as two 8-bit reads/writes).

Now we come to the bridges between 16 bits and 8 bits:

First, some of the old original chips did not have any; the AVR 1200, for example, had no LPM instruction (and this was before Atmel dared to make an SPM instruction).

LPM (the first chips had only one form of this instruction) is special: Z gets mapped onto the flash (shifted down 1 bit; whether that's done in SW or HW only the designers know, but the instruction takes an extra clock, so either way could work), and bit 0 of Z selects whether the high or low byte is put on the 8-bit data bus.

SPM is kind of new, and there it looks like they made R1:R0 a 16-bit bridge. (It would have been nice if LPM could load 16 bits directly into R1:R0.)

I'm not sure how the EEPROM is implemented today, but in the past it behaved like 8-bit memory while actually being part of a 16-bit flash, and that was why an EEPROM read halted the CPU for some clocks.

Total votes: 0

sparrow2 wrote:

The flash is 16 bits wide, NOT 8; that is why it can do one 16-bit instruction in one clock. The program counter counts in 16-bit words (not because it's missing a bit!).

Now we come to the bridges between 16 bits and 8 bits:

First, some of the old original chips did not have any; the AVR 1200, for example, had no LPM instruction (and this was before Atmel dared to make an SPM instruction).

LPM (the first chips had only one form of this instruction) is special: Z gets mapped onto the flash (shifted down 1 bit; whether that's done in SW or HW only the designers know, but the instruction takes an extra clock, so either way could work), and bit 0 of Z selects whether the high or low byte is put on the 8-bit data bus.

SPM is kind of new, and there it looks like they made R1:R0 a 16-bit bridge. (It would have been nice if LPM could load 16 bits directly into R1:R0.)

I wonder, then, why the old (and new) LPM could not just take both bytes from the 16-bit word it reads and stuff them into R0 and R1 at once, exactly like SPM does when writing the flash. It would be easier and faster for programmers, right?

Wagner Lipnharski
Orlando Florida USA

Total votes: 0

theusch wrote:

Do you really think that a mature (perhaps obsolete) microcontroller instruction set is going to be changed because you question something that has been around for several decades?

Theusch, no, I am not here trying to fix anything that might be wrong; that's not the idea. My curiosity sticks to the possibility that the original designer thought of using the C bit as a copy of bit 15 in MUL in some productive way for subsequent instructions. I cannot see it, and I would love to know. And you are correct: the fractional multiplications use the C bit correctly as bit 16 after the left shift (required for fractional multiplication), as a real (2's complement) overflow. I am really curious about the Microchip engineers' answer to that.

Wagner Lipnharski
Orlando Florida USA

Total votes: 0

Yes, it would be nice with a 16-bit LPM, and I'm sure that if the chip were designed today it would have one. But the old AVR design was "more" 8-bit: apart from X, Y and Z, the registers were 8-bit and had no 16-bit pairing like the MOVW instruction. (What I'm trying to say is that LPM is kind of a hack on top of a Harvard structure.)

OK, I have to add that R25:R24 also works with ADIW/SBIW, but since those take 2 clocks I'm sure there isn't any real HW for it; it is done as two 8-bit operations.

And as a side note, one of the best-known 8-bit CPUs, the Z80, has only a 4-bit ALU.

Total votes: 0

I got the ticket result from Microchip about this issue, and the original intention is not clear to them either; they are just guessing.

Item #1, the example Aarthi cited:

```mac16x16_32_method_B:			; uses two temporary registers
; (r4,r5), but reduces cycles/words
; by 1
clr	r2

muls	r23, r21		; (signed)ah * (signed)bh
movw	r5:r4,r1:r0

mul	r22, r20		; al * bl

add	r16, r0
adc	r17, r1
adc	r18, r4
adc	r19, r5

mulsu	r23, r20		; (signed)ah * bl
sbc	r19, r2
add	r17, r0
adc	r18, r1
adc	r19, r2

mulsu	r21, r22		; (signed)bh * al
sbc	r19, r2
add	r17, r0
adc	r18, r1
adc	r19, r2

ret
```

Item #2, multiplying by 1/2: Aarthi is trying to say that when you divide by 2 (rotate right) a negative number resulting from a MUL, bit 15 (the sign) is also in carry, so bit 15 stays set after the shift right into bit 14. But that is also what the ASR instruction does. So... not really convincing.
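Mechanically, point #2 does work: seeding the first ROR with MUL's C (the bit-15 copy) makes a multi-byte right shift behave like a 16-bit ASR, preserving the sign. A Python sketch of ROR (mine, not from the app note):

```python
def ror(byte, c_in):
    """AVR ROR: rotate right through carry. Returns (byte, carry out)."""
    return (byte >> 1) | (c_in << 7), byte & 1

# Halve a 16-bit MUL product with its sign bit already in C:
hi, lo = 0x82, 0x34          # product 0x8234, so MUL left C = 1
c = hi >> 7                  # C as MUL set it: a copy of bit 15
hi, c = ror(hi, c)           # most significant byte first
lo, c = ror(lo, c)
print(hex(hi), hex(lo))      # 0xc1 0x1a: 0x8234 arithmetically shifted right
```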

Created By: Aarthi Thiravidamani (4/15/2019 6:50 AM) Hi Wagner Lipnharski,

We have heard from our internal team.

The real reason behind copying the 15th bit of multiplication results to the C bit of the SREG register is not clear. However, there are several estimates of this from our end:

1. In assembly language routines that do multiplication of quantities that are wider than 8-bits. Please look at the source code given for the application note “AVR201-Using the 8-bit AVR Hardware Multiplier”. The asm file contains routines for several different types of 16-bit by 16-bit multiplies. Some of them, for example signed 16*signed 16, use a multiply instruction followed by an SBC instruction that takes advantage of the SREG C bit.

2. When multiplying by a factor of (1/2). It should be possible to make a fairly generic multi-byte scaling routine by repeatedly calling Rotate Right Through Carry (ROR) multiple times on the individual bytes, starting with the most significant byte.

Hope this information helps.

Thanks and Regards,
Aarthi

Created By: Aarthi Thiravidamani (4/1/2019 7:31 AM)
Hi Wagner Lipnharski,

Thank you for contacting Microchip Technical Support.

We are discussing your query, which is very valid, with our internal team. We shall let you know of any inputs once we hear from our internal team.

Hope you understand.

Thanks and Regards,
Aarthi

Wagner Lipnharski
Orlando Florida USA

Total votes: 0

How can it reach any address in 4K words (8K bytes) with only 12 bits of byte displacement?

You are over-thinking it. First, the 12-bit address is a word address, not a byte address. This is only occasionally noticed in any form of source code, because the assembler or compiler "knows" and takes care of the conversion for you.

Second, on a chip with 8K of flash, you also have a 12-bit program counter (4K words, again). "Obviously," when you add a 12-bit offset to a 12-bit number, you can reach any other possible 12-bit number...

Total votes: 0

westfw wrote:

How can it reach any address in 4K words (8K bytes) with only 12 bits of byte displacement?

You are over-thinking it. First, the 12-bit address is a word address, not a byte address. This is only occasionally noticed in any form of source code, because the assembler or compiler "knows" and takes care of the conversion for you.

Second, on a chip with 8K of flash, you also have a 12-bit program counter (4K words, again). "Obviously," when you add a 12-bit offset to a 12-bit number, you can reach any other possible 12-bit number...

Westfw, exactly, and it is only possible because the chip has no more than 8192 bytes (4096 words) of flash, as the datasheet states. The displacement of an RJMP or RCALL is limited to +2048/-2047 words, which can still reach any of the 4K words because of the rollover on overflow. 0001 + 2048 does not reach word address 3000, but 0001 - 1097 does, and only if the memory is 4096 words long. The compiler or assembler takes care of that, IF it knows the chip has 8192 bytes of code memory or less. I will need to make a few tests with different device definitions and see how the assembler complains.

It would be very interesting to use a 16K chip, lie to the assembler that it has only 8K, then duplicate the code from the lower 8K bytes into the upper 8K. The program would jump between the upper and lower 8K bytes without even noticing, while saving one word on every RJMP/RCALL instruction. This would happen because the hardware rollover on the 16K chip is different: 0001 - 1097 would not point to word 3000 but to 7096, which is exactly the counterpart address in the high block (7096 is 4096 + 3000).

In the past I did some crazy stuff like that with the 8051, addressing several banks of 64K of code. The transition-jump section at the bottom of every 64K block was exactly the same, so when the code changed the port bits selecting another 64K EPROM, the code never got interrupted, lost or broken.

This webtext is very old; it explains the idea in my silly wording at the time.

Wagner Lipnharski
Orlando Florida USA

Total votes: 0

The GCC compiler has a flag to mark whether the top and bottom should be wrapped around on chips bigger than 8K (-mshort-calls).

Total votes: 0

No, -mshort-calls is an option used internally by avr-gcc, and the documentation says DO NOT SET IT BY HAND. In most cases, avr-gcc can determine that this option is not appropriate and removes it from the command line.

If the goal is smaller code size, use -mrelax.

If wrap-around is needed for correct operation, the specs file should add --pmem-wrap-around=size.

avrfreaks does not support Opera. Profile inactive.

Last Edited: Tue. Apr 16, 2019 - 07:09 PM
Total votes: 0

I just copied the text from AVR Studio 7.

Total votes: 0

In very old versions of avr-gcc, -mshort-calls could be used as an optimization option, but many users had problems applying it correctly. Hence it was deprecated and removed; the option to use is -mrelax.

When avrxmega3 (linear address space) was supported in v8, -mshort-calls was re-introduced internally for multilib selection, cf. the multilib structure of v8. avr-libc uses that option to determine the size of vector-table entries. If not compiling for avrxmega3, avr-gcc will remove that option. If used with avrxmega3, you might get wrong code and the wrong multilib, and encounter other problems with incompatible -m[no-]short-calls.

avr-gcc uses RCALL/RJMP for devices with up to 8KiB of flash and CALL/JMP for families with more flash (except in situations where the compiler knows that RJMP is OK, of course). For calls, usually only the linker knows whether it's appropriate to replace CALL with RCALL etc., and it has to be told to do the replacement, ditto for address wrap-around. The latter is performed by the driver via the specs file.

avrfreaks does not support Opera. Profile inactive.

Last Edited: Tue. Apr 16, 2019 - 08:43 PM
Total votes: 0

I made a test on AVR Studio 4.19 regarding the wrap-around addressing.

TEST 1)

Assembly for ATmega88 - 8K flash ($0000-$0FFF words), simulator ATmega88

The assembler correctly performs the automatic wrap-around of the RCALL MF00; it does the subtraction of $100 correctly.

```.include "m88def.inc"
.list

.cseg
.org $0

rcall MF00

.org $F00
MF00: ret```

The code disassembled, observe the PC-0x0100 below.

```7:        rcall MF00
+00000000:   DEFF        RCALL     PC-0x0100      Relative call subroutine
8:        rcall M800
+00000001:   D7FE        RCALL     PC+0x07FF      Relative call subroutine```

TEST 2)

Assembly for ATmega168 - 16K flash ($0000-$1FFF words). Simulator ATmega168

The assembler does not do the automatic wrap-around and complains about the RCALL M1F00; it could easily subtract $100 as on the Mega88 and wrap back correctly within the +2048/-2047 displacement, but it does not.

```.include "m168def.inc"
.list
.cseg
.org $0

rcall M1F00

.org $1F00
M1F00: ret```

It seems that this wrap-around is only taken into consideration if the chip in fact has no more than 8192 bytes of flash.

Even so, there is no reason why it could not do it, given the wrap-around would be within the +2048/-2047 displacement.

Either the assembler strictly follows the rule of doing it only for chips with no more than 8192 bytes of flash, or the chip itself forbids the wrap-around in its internal address buses. Maybe the flash memory cells are not there, but the addressing and internal gates are already prepared for a larger memory.

TEST 3)

ATmega88, simulator ATmega88

Even though Z points at a second-image address of $F00 (which doesn't exist on the ATmega88), it gets the right values of 1, 2, 3 and 4 into R0, R1, R2 and R3.

```.include "m88def.inc"
.list

.cseg
.org $0

ldi ZH,$3E
ldi ZL,0
lpm R0,z+
lpm R1,z+
lpm R2,z+
lpm R3,z+

.org $F00
MF00:
.DB 1,2,3,4,5,6,7,8```

TEST 4)

ATmega168, simulator ATmega168

The same here: the data (1,2,3,4) from the words at $0F00 is read via words at $2F00 (bytes $5E00), an address that does not exist on the ATmega168. At least the simulator reads it correctly, from that second block of non-existent memory, into R0, R1, R2, R3.

Either the simulator is doing what it should not, or the chip would really do the same, so the 168 could in fact wrap around.

I will try to reproduce this effect on a real Mega328 I may have around here.

```.include "m168def.inc"
.list

.cseg
.org $0

ldi ZH,$5E
ldi ZL,0
lpm R0,z+
lpm R1,z+
lpm R2,z+
lpm R3,z+

.org $0F00
MF00:
.DB 1,2,3,4,5,6,7,8```

Wagner Lipnharski
Orlando Florida USA

Total votes: 0

But none of this should be a surprise.

The chip can wrap around, but some of the software doesn't know that; it only does it for chips with 8K or less (which therefore don't have the CALL and JMP instructions; perhaps today they have them, but there was never any need).

There is no limitation on the value of Z; the HW will AND off the bits that fall outside the flash size (flash is always 2^n in size).

That is why RJMP works: the core gladly adds/subtracts past the end of the flash, but the bits outside the actual size will just be 0.

The same would happen if you pushed two values (on chips with 128 KB of flash or less) and then did a RET: if the address is outside the flash, the extra bits will be ANDed off. (Not long ago there was a thread here about an erratum in one of the chips that actually pushed a wrong return value onto the stack, and nobody had noticed because the chip behaved correctly.)