Trampolines are not created by the compiler


Hi, I am writing a USB bootloader for the ATxmega128A1U.  Currently, the bootloader is >8kB, so it won't fit in the boot section of flash.  However, I planned on splitting the code between the table (0x1e000) and the boot (0x20000) sections, which gives me about 16kB for the bootloader.  The problem is that extended indirect calls (eicall) are used throughout the compiled code.  This instruction (and eijmp) takes the 16-bit address in the Z register and uses the EIND register as the upper bits to form the complete 24-bit address.  In my application, this would require two EIND values (0 for functions in the table section and 1 for those in the boot section), and the compiler does not support that.
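To make the address arithmetic concrete, here is a minimal host-side sketch (plain C, not code from the bootloader) of how EIND and Z combine for eicall/eijmp. The function names are made up for illustration, and it assumes the usual AVR convention that code addresses are word addresses, so the byte addresses 0x1e000 and 0x20000 are halved before the EIND part is extracted.

```c
#include <assert.h>
#include <stdint.h>

/* Word address reached by EICALL/EIJMP: EIND supplies the bits above
   the 16-bit word address held in Z. (Illustrative helper, not an API.) */
uint32_t eicall_target_word(uint8_t eind, uint16_t z)
{
    return ((uint32_t)eind << 16) | z;
}

/* AVR flash byte address = word address * 2. */
uint32_t word_to_byte(uint32_t word_addr)
{
    return word_addr << 1;
}

/* EIND value needed to reach code at a given flash byte address. */
uint8_t eind_for_byte_addr(uint32_t byte_addr)
{
    assert((byte_addr & 1u) == 0);          /* code is word aligned */
    return (uint8_t)((byte_addr >> 1) >> 16);
}
```

With these helpers, eind_for_byte_addr(0x1E000) gives 0 and eind_for_byte_addr(0x20000) gives 1, which is the two-value situation described above.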

 

To get around this problem, as I understand it, the compiler will use "trampolines."

 

The GCC compiler documentation says that it will automatically create "trampolines" so that the EIND register can keep a single value: every eicall goes through a trampoline, which can then reach any function in the entire address space no matter where it is.

 

However, the compiler (avr-gcc) does not seem to be doing that.  Am I missing a compiler option?

 

Last Edited: Tue. Dec 7, 2021 - 07:51 AM

You do know that EIND is exposed as a C symbol don't you?

 

So:

foo();

can become:

EIND = 1;
foo();
EIND = 0;

It's cumbersome but it gives you control over the extended address.

 

BTW you misunderstand "trampolines". They are not for EIJMP/EICALLs. They are for when you have a JMP that cannot reach its target, so it goes via a locally positioned table to "bounce" it to the right place.

 

One solution to all your problems, by the way, is to construct your own dispatch table so you isolate all the "boundary problems" in just one place. Say one half of the code has 18 functions that are accessed by the other half. Then just make an 18-entry table of function pointers or, better yet, 24-bit JMPs, and CALL the table entries rather than the functions themselves.
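As a rough illustration of the dispatch-table idea, here is a minimal sketch in portable C. The two functions (usb_poll, usb_send) and the index names are hypothetical placeholders, not part of any real driver. It only shows the structure: callers go through one table instead of naming the target functions. On the real part the table would need to sit at a known location, and calls through plain function pointers would still depend on EIND, which is why a table of 24-bit JMPs is the suggested better variant.

```c
#include <assert.h>

/* Hypothetical functions living in the "other half" of the bootloader. */
int usb_poll(int arg) { return arg + 1; }
int usb_send(int arg) { return arg * 2; }

typedef int (*boundary_fn)(int);

/* One index per function that is called across the boundary. */
enum { FN_USB_POLL, FN_USB_SEND, FN_COUNT };

/* The single place where all cross-boundary targets are listed. */
static const boundary_fn dispatch_table[FN_COUNT] = {
    [FN_USB_POLL] = usb_poll,
    [FN_USB_SEND] = usb_send,
};

/* Callers use an index, never the function's own address. */
int call_across(unsigned index, int arg)
{
    assert(index < FN_COUNT);
    return dispatch_table[index](arg);
}
```

A caller would then write call_across(FN_USB_SEND, value) instead of usb_send(value), so any boundary handling only has to be solved once, inside call_across.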

 

Personally I'd just be looking to simplify the bootloader to fit in the single BLS. It shouldn't need to be 14K. You can get USB off the ground for less than 4K so if you have an 8K BLS that should easily fit. If necessary simplify the delivery protocol so you don't need great lumps of software stack to support the data transport.


I didn’t think you needed eijmp/eicall for a 128k chip…

 


?

Trampolines are only for chips with a PC bigger than 16 bits, so a 128 doesn't use them (remember: 64k words).



I did not know about EIND being exposed as a symbol.  However, I did not write the problematic code.  The eicall instructions are the result of the C code that was written for the USB drivers (e.g., usb_device.c, udc.c, etc.) so I have no control over them.  My understanding of trampolines was directly from a link you provided.  It was in section 3.18.6.1 where EIND is explained very clearly and how, with trampolines, eicall and eijmp allow access to any part of the code space.  It goes on to explain that EIND must be initialized during the startup of my code and should not be changed.  Here is the link:

 

https://gcc.gnu.org/onlinedocs/gcc-9.1.0/gcc/AVR-Options.html

 

The solution is to get the compiler to work as stated and use trampolines for eicall or eijmp instructions that need to access code in another 128kB page, as has to happen when calling functions between the table and boot sections (byte addresses 0x1e000 and 0x20000 respectively).  The link states that specifying the "relax" options on the compiler and linker command lines will make it use trampolines, but it does not seem to do that.

 

As a test, I moved all of the bootloader code (about 12k bytes) down to 0x1c000 so that all of the code is contained in the same 128kB page, with only the vector table and SPM-related code in the boot section, and all works perfectly.  The drawback is that I can't protect the code that sits before the table section (0x1c000 to 0x1e000), since there is no fuse that protects it from being erased when the application section is erased, as there is for the table and boot sections.

 

I am new to this level of control, so maybe I am misunderstanding what is being said in the manual, but I don't think so.

 

If I can't get the compiler to use trampolines, my temporary solution is to locate the bootloader in unprotected flash as I did in my successful test (i.e., at 0x1c000) and prevent the boot loader from erasing or overwriting itself (i.e., by ignoring any commands from the application software driving the bootloader over Modbus that would do that).  It does reduce the useful application section space by 8k bytes, but that is not a problem in my case.  An even more complex solution would be to keep a copy of the bootloader around the vector table from 0x1e000 to 0x22000 and restore the code if it is detected that the location 0x1c000 has been erased.  This would be a lot of work just to get around maybe an improper use of the command line options (I sure hope that that is the problem).

 

Thanks again for your help.  I have learned quite a bit and hope to continue to do that.

-David

 

 

 


I would say these are some bad USB drivers, then.

ICALL can reach everything on this chip.


There is something seriously wrong with a Bootloader > 8kB.

 

You possess the source code.  So you should be able to see where and how it wastes so much Flash memory.

 

Bootloader code generally erases a page at a time, as it needs to upload each new page.
 

The Lock bits and Fuse bits will protect the official Bootloader pages.

Bootloader code generally tests the page for validity before replacing that page.   So you can prevent it replacing any of your "extended" Bootloader pages.

 

David.


sparrow2 wrote:

I will say that this is some bad USB drivers then.

ICALL can reach everything on this chip.

 

According to the description of icall in the AVR instruction manual, it can only access code in the same 128kB page.  That would work fine if all of the bootloader could fit in the boot section where the interrupt vector table and reset vectors must be.  However, that is not possible using the CDC implementation of the ASF USB driver; it is simply too big to fit in 8kB.  To use it, which is required in my application, I have to split the code between the table and the boot sections which are in different 128kB pages.  This requires the use of the eicall and eijmp instructions.

 

So why all of this effort?  Because this product will be buried inside equipment that would take a very long time to physically access, so updating by holding a pin low while you reset it (as is required for the DFU bootloader) is not an option.  The firmware update process must be possible through the existing PC connection (in this case, via Modbus).

 

I'm almost there...  I only need the compiler to use trampolines as explained in the manual.

 

 

 

 

Last Edited: Tue. Dec 7, 2021 - 01:35 PM


David, I would like to agree with you, but I have no control over the size of third party drivers and would not dare to try to implement my own USB CDC driver (I work for a living).  My original bootloader erased the entire application flash section as soon as the first block to be programmed was received.  That is much faster than erasing each block just before writing it.  Not the end of the world, however, since I expect the entire application flash to be reprogrammed in less than a couple of minutes.

 

Yes, my bootloader will ignore any blocks that would overwrite any part of it and I will use the lock bits to protect what I can.

 

Thanks...



 

I think the thing some folks here may have forgotten is that in Xmega the bootloader is in EXTRA flash:

 

 

So the chip is not 64Kword/128Kbyte but 68Kword/136Kbyte.

 

So while the main 128K does not break the 64K word boundary the added bootloader bit on the end does.
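The size arithmetic can be checked with a few lines of host-side C. This is only a sketch: the constant and function names are invented for illustration, and the 8 KB boot section size is taken from the figures quoted in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define APP_BYTES  (128UL * 1024UL)  /* application flash, per the part name */
#define BOOT_BYTES (8UL * 1024UL)    /* extra boot section (assumed 8 KB)    */

uint32_t total_flash_bytes(void) { return APP_BYTES + BOOT_BYTES; }

/* AVR program memory is addressed in 16-bit words. */
uint32_t total_flash_words(void) { return total_flash_bytes() / 2UL; }

uint32_t byte_to_word(uint32_t byte_addr)
{
    assert((byte_addr & 1UL) == 0);   /* code addresses are word aligned */
    return byte_addr >> 1;
}

/* Non-zero when a word address no longer fits a 16-bit pointer. */
int exceeds_16_bit_pointer(uint32_t word_addr)
{
    return word_addr > 0xFFFFUL;
}
```

The table section at byte 0x1e000 is word 0xF000 and still fits in 16 bits, while the boot section at byte 0x20000 is word 0x10000 and does not.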

 I only need the compiler to use trampolines as explained in the manual.

So, to be honest I don't know why you think the compiler does  NOT generate trampolines when it needs to. So I wrote this:

 

 

I then positioned fn() up high by moving that function in section ".boot" as follows:

 

 

The BLS starts at 0x10000 (word), i.e. 0x20000 (byte), which during the build the IDE passes to the linker as:

"C:\Program Files (x86)\Atmel\Studio\7.0\toolchain\avr8\avr8-gnu-toolchain\bin\avr-gcc.exe" -o testGCC.elf  main.o   -Wl,-Map="testGCC.map" -Wl,--start-group -Wl,-lm  -Wl,--end-group
-Wl,--gc-sections -mrelax -Wl,-section-start=.boot=0x20000  -mmcu=atxmega128a1u -B "C:\Program Files (x86)\Atmel\Studio\7.0\Packs\atmel\XMEGAA_DFP\1.2.141\gcc\dev\atxmega128a1u"  

And the proof of the pudding is:

 

 

which is supported by:

 

 

So to make the call from down near 0x0242, the location of the trampoline (0x1FC byte = 0xFE word) is first stored to the function pointer (a 16-bit function pointer, because all AVR pointers are 16 bit). This is then loaded into ZH:ZL for the EICALL, which calls to 0xFE/0x1FC, where it finds the 24-bit jump on to 0x20000 (byte) / 0x10000 (word), where the fn() function has been located.

 

That's trampolines in action and they are working for this atxmega128a1u code!!

 

PS forgot to mention (but hopefully this is obvious anyway?) that trampolines only come into play when you make an indirect call through a 16 bit wide function pointer. If I had simply used:

 

 

then this generates:

 

 

The CALL and JMP instructions on AVRs with >128Kbyte flash are 24 bit so they can reach in one go. It's only when you start forcing things through (callback?) pointers that the trickery needs to come into play.
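For readers who want to try something similar, here is a minimal sketch of this kind of experiment. It is not the test program from this post, only an illustration under assumptions: the names BOOT_SECTION, fn_ptr, call_indirect and call_direct are invented, and the section name ".boot" matches the linker option shown above. The macro falls back to nothing on a non-AVR compiler so the same file also builds on a PC.

```c
#include <assert.h>

#ifdef __AVR__
/* Put the function in a section that the linker command line places at
   byte address 0x20000 (see the -section-start=.boot=... option above). */
#define BOOT_SECTION __attribute__((section(".boot"), noinline))
#else
#define BOOT_SECTION   /* host build: no special placement */
#endif

BOOT_SECTION int fn(int x)
{
    return x + 1;
}

/* volatile keeps the compiler from turning the indirect call below
   back into a direct one. */
int (* volatile fn_ptr)(int) = fn;

/* Indirect call through a 16-bit function pointer: this is the case
   where a trampoline would be expected on the AVR build. */
int call_indirect(int x) { return fn_ptr(x); }

/* Direct call: the compiler can use a plain CALL to the target. */
int call_direct(int x) { return fn(x); }
```

Built for the AVR with -mrelax and the section start option, one would then look in the disassembly and map file to see whether the pointer holds the address of a low-memory stub rather than that of fn() itself.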

Last Edited: Tue. Dec 7, 2021 - 02:12 PM


Thank you for the detailed response.  I see what we did differently.  You specified the location of the boot section in the memory settings of the linker and I only did that in the linker script file.  I'll try doing that and, fingers crossed, it will work as it did for you.

 

I should say that any extended call or jump to the trampolines requires that the EIND register be set properly first.  I see that the compiler startup code does actually set the EIND register to where the trampolines are located, so I do not need to do that myself, as I had understood from the linker description.

 

I'll let you know if this fixes my issue.

 

 


Yes,  erasing a page at a time is slower than erasing the whole AVR in one go.   i.e. what an ISP programmer does.

The erase page and program page operations don't take long.   Most of the time is spent transferring the data to load each page buffer e.g. on USB or UART link.

 

I can understand your reluctance to interfere with third party code.

I would still like to see where and how they are "wasting" Flash.

 

David.


David. Well a USB stack (for one device) might be expected to "eat" about 4-6K (which would have fitted nicely on its own) but it appears he's just using USB as "transport" and then running Modbus on top of this. So I guess one might conclude that Modbus is using 8-10K ?

 

Personally I think it's a very dangerous strategy. Because a bootloader allows all the rest of the code to be replaced for bug fixes and improvements it is the only bit of the whole code that must be 100% rock solid on day one. You can probably ensure that in a few K with simple protocols. Once you start piling stuff on top it makes it very difficult to guarantee the integrity of the bootloader in all conditions. I would be pushing for a much simpler mechanism/protocol during bootloading and save the really wild stuff for the part that can easily be fixed/replaced.



David, I do this as a profession and accountability is very important.  I won't modify something as complex as a USB CDC driver; my clients or customers would have a field day if there were problems with my modifications.

 


I agree.   It is "safer" to just check for valid pages.

If the Third Party code is not broke you don't need to fix it.

 

I just said that I would investigate the bloat for my own curiosity.

 

David.



dnaviaux1 wrote:
However, that is not possible using the CDC implementation of the ASF USB driver; it is simply too big to fit in 8kB.
fyi, similar issue for XMEGA in SerialBootloader and FOTAU where the communications stack is only in the application; SerialBootloader loads via another interface or external memory.

Though these are likely a no-go in your case :

  • XMEGA128A1U : attached memory (EBI, SPI, TWI)
  • XMEGA384 : application images into non-application __flash1 .. __flash5

If considering a follow-on to XMEGA, some PIC24 and dsPIC have dual-partition flash.

dnaviaux1 wrote:
... so updating by holding a pin low while you reset it (as is required for the DFU bootloader)
One case where that behavior has been changed.

dnaviaux1 wrote:
I'm almost there...
Concur

 


Atmel AVR2054: Serial Bootloader User Guide

[page 16]

via ATMEL LIGHTWEIGHT MESH | Microchip Technology

Atmel Lightweight Mesh stack | AVR Freaks

dsPIC33CK256MP508 | Microchip Technology

 

Booting from XMEGA application code into the bootloader (and staying there!) | AVR Freaks

fwupd-test-firmware/AVR/XMEGA-A3BU-XPLAINED-1.23 at master · fwupd/fwupd-test-firmware · GitHub

 

"Dare to be naïve." - Buckminster Fuller


dnaviaux1 wrote:
... but I have no control over the size of third party drivers ...
Not in SLOC, though you do for object code (willing to try IAR EWAVR?)

dnaviaux1 wrote:
... and would not dare to try to implement my own USB CDC driver ...
If willing to forego CDC then USB packet code is small (likewise USB megaAVR : HalfKay, ubaboot)

dnaviaux1 wrote:
... (I work for a living)
Staying within the bid can be difficult; consider presenting alternatives to the customer.

 


IAR Embedded Workbench for AVR | IAR Systems

 

https://github.com/nonolith/USB-XMEGA/blob/master/example/stream/test.py

GitHub - kuro68k/xmega_usb: Minimalist portable USB device stack for XMEGA

XMEGA USB bulk transfer, without ASF (or LUFA) | AVR Freaks

GitHub - kevinmehall/usb: Minimalist portable USB device stack for SAMD21, LPC1800, LPC4300, Xmega

HalfKay Communication Protocol | PJRC

GitHub - rrevans/ubaboot: USB bootloader for atmega32u4 in 512 bytes

 



clawson wrote:
(a 16 bit function pointer because all AVR pointers are 16 bit)
in AVR GCC; up to 24 bit function pointers in IAR EWAVR.

IAR C/C++ Compiler User Guide for Microchip Technology’s AVR Microcontroller Family

[page 287]

FUNCTION POINTERS

via IAR Information Center for Microchip AVR

 

"Dare to be naïve." - Buckminster Fuller


westfw wrote:
I didn’t think you needed eijmp/eicall for a 128k chip…

ATxmega128 has more than 128k of Flash. For Xmegas, the Flash size in the MCU name is the application area (128k in this case); the bootloader area comes on top of that. If your application or bootloader executes only within its own area, jumps that address a 128k range are sufficient. If you want to jump between these areas, you need a 24 bit address.

I once created a bootloader that was located partially in the application area. There were no issues, with the exception of switch/case. With a small number of cases in a given switch, everything was OK. When the number of cases increased (to 7, I think), the bootloader crashed. For a small number of cases the switch/case is compiled as a chain of if/else; for a larger number it uses a table of addresses to jump through, but these are apparently only 16 bit. I solved the issue by creating a new section in the application area and defining explicitly which functions are located in it. This way I could avoid calling code located in the application area directly from a switch/case, and this solved the issue.

Last Edited: Wed. Dec 8, 2021 - 07:51 AM

clawson, I wanted to thank you again for the detailed response.  After revisiting your response, and "really" reading the details, I now see how I can write code that extends beyond the 128kB boundary.  However, the code that was written for the ASF USB driver was not written to support its use from code located in another 128kB page, i.e., no trampolines are created. 

 

However, I do see that it might be possible to locate all of the USB driver in the boot section and only call the few functions that I need via a special function that sets EIND to 1, calls an entry in a local jump table, then restores EIND on return.  This way I can keep all my code split between the table and the boot sections without extending into the application space.  The product is getting sent off to the customer in 10 hours... I hope I can get this done and thoroughly tested before then (if not, the current version that uses some application flash works fine, but is just not as secure as I would like it).
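The wrapper idea could look roughly like the sketch below. Everything here is an assumption for illustration: call_with_eind, boot_fn and sample_entry are invented names, and on a PC build EIND is replaced by an ordinary variable so the logic can be exercised without hardware. Note that this deliberately steps outside the model described in the GCC documentation linked earlier, which expects EIND to be set once at startup and left alone, so interrupts and any indirect calls made while EIND is changed would need careful checking on the real part.

```c
#include <assert.h>

#ifdef __AVR__
#include <avr/io.h>            /* provides EIND on parts that have it */
#else
volatile unsigned char EIND;   /* host stand-in for the real register */
#endif

typedef void (*boot_fn)(void);

/* Set EIND for the duration of one indirect call, then put it back. */
void call_with_eind(unsigned char page, boot_fn entry)
{
    unsigned char saved = EIND;   /* remember the startup value */
    EIND = page;                  /* select the upper address bits */
    entry();                      /* indirect call made with the new EIND */
    EIND = saved;                 /* restore for the rest of the code */
}

/* Hypothetical callee used only to observe EIND during the call. */
unsigned char eind_seen_by_callee;
void sample_entry(void) { eind_seen_by_callee = EIND; }
```

Calling call_with_eind(1, sample_entry) leaves eind_seen_by_callee at 1 and EIND back at its previous value, which is the set, call, restore sequence described above.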

 

 


Hope it works out for you! Have a good Christmas.