Understanding the .hex file


I'm using an ATMega328p on an Arduino board.

I've been trying to understand the .hex output generated when I compile my program using the Arduino IDE.

The hex output for my program is:

:100000000C9434000C944F000C944F000C944F004F
:100010000C944F000C944F000C944F000C944F0024
:100020000C944F000C944F000C944F000C944F0014
:100030000C944F000C944F000C944F000C944F0004
:100040000C944F000C944F000C944F000C944F00F4
:100050000C944F000C944F000C944F000C944F00E4
:100060000C944F000C944F0011241FBECFEFD4E02E
:10007000DEBFCDBF11E0A0E0B1E0E8EFF0E002C0EC
:1000800005900D92A030B107D9F711E0A0E0B1E0E2
:1000900001C01D92A030B107E1F70C9467000C94E9
:1000A00000008FEF84B987B98EEF8AB9089501C037
:1000B0000197009759F020E00000000000000000C8
:1000C000000000002F5F2A3599F3F6CF08958FEFD7
:1000D00084B987B98EEF8AB98FEF88B985B98BB9A2
:1000E00084EF91E00E94570018B815B81BB884EF50
:0800F00091E00E945700F0CFDF
:00000001FF

I've looked up the Intel HEX file format on Wikipedia, and that all makes sense to me.
The format is:

[Colon] [Data Size] [Start Address] [Record Type] [Data] [Checksum]

Great, so I've got the first record, which is 16 bytes starting at address 0x0000.

So the first line of the program can be parsed as:

[:][10][0000][00][0C9434000C944F000C944F000C944F00][4F]
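
As a rough sketch of that field breakdown (not taken from any particular tool), the following Python splits a record into the fields listed above and verifies the checksum. The function name `parse_ihex_record` and the dictionary keys are made up for illustration. The checksum rule assumed here is the one from the Intel HEX description: all bytes of the record, checksum included, sum to zero modulo 256.

```python
def parse_ihex_record(line):
    """Split one Intel HEX record into its fields and verify the checksum."""
    line = line.strip()
    if not line.startswith(":"):
        raise ValueError("record must start with ':'")
    raw = bytes.fromhex(line[1:])          # every field is a run of hex digit pairs
    count = raw[0]                         # [Data Size]
    if len(raw) != count + 5:              # size + 2 address bytes + type + data + checksum
        raise ValueError("record length does not match its size field")
    if sum(raw) & 0xFF != 0:               # all bytes, checksum included, sum to 0 mod 256
        raise ValueError("checksum mismatch")
    return {
        "count": count,
        "address": (raw[1] << 8) | raw[2],  # [Start Address], high byte first
        "type": raw[3],                     # [Record Type]: 00 = data, 01 = end of file
        "data": raw[4:4 + count],           # [Data]
        "checksum": raw[4 + count],         # [Checksum]
    }

rec = parse_ihex_record(":100000000C9434000C944F000C944F000C944F004F")
print(rec["count"], hex(rec["address"]), rec["type"], rec["data"].hex())
```

For the first record this gives 16 data bytes at address 0x0000 with record type 00, and the last line of the file (`:00000001FF`) parses as an empty record of type 01.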

The data contains eight machine instructions:

0C94
3400
0C94
4F00
0C94
4F00
0C94
4F00

The first instruction, when I look at an assembler listing (avr-objdump.exe -S) shows as:

   0:	0c 94 34 00 	jmp	0x68	; 0x68 <__ctors_end>

That makes sense to me. The first instruction is a JMP 0x68, which should skip over to the beginning of the program (or at least to some overhead setup instructions before the main() of the program).

However, when I look at the AVR Instruction Set, this isn't what that machine instruction should be doing. http://www.atmel.com/dyn/resources/prod_documents/doc0856.pdf

If we convert the hex to a binary instruction (so we can match it to the AVR Instruction Set PDF), we get:

0x0C94 = 0b'0000110010010100

But the JMP instruction in the PDF shows:

1001 010k kkkk 110k
kkkk kkkk kkkk kkkk

That's 0x9[4 or 5]X[C or D]
which could be 0x940C, but then the bytes of the
instruction would be backwards.

I'm not quite sure how it calculated the jump address to be 0x68 either.

I'm confused why the assembler listing shows this as a JMP instruction, while the AVR instruction set shows this as an ADD instruction. I believe the compiler made the .hex file correctly, as my compiled programs run correctly on my microcontroller.

Anyone have any idea what's going on here?


You can load the HEX file in Studio. It will disassemble it for you.

But since you wrote the original source code, you might just as well study that.

You should find a .ELF file somewhere. I have never looked for the generated files from an Arduino. Studio will not only disassemble the ELF file, it should show you the originating source code too.

If your interest is solely in 'parsing' a machine instruction, say so.

Examining a HEX file is a bit like translating your own words into an unknown foreign language, and then trying to read the foreign language. It is much easier to read your original words!

David.


Looks like the bytes are swapped with big endian storage. Since the jumps will always occur on an even boundary, they save a bit by taking the real address and shifting it right 1 bit before storing. When the opcode is executed the address is shifted to the left 1 bit before the jump occurs. So your 0x34 becomes 0x68 for the jump.


Thank you dksmall. That makes a lot more sense. I had suspected that the bytes may have been reversed, but I could not find any documentation indicating that.

Do they swap bytes for all the double-word instructions? Is there some method for determining ahead of time if the bytes are swapped? There must be. The MCU knows they're swapped.

The method I'm currently using to determine which instruction the bytes indicate is to create an instruction mask and do a bitwise AND of the mask and the instruction. If the result is the same as the mask, then I know I've found the correct instruction. However, this method does not work if the bytes are reversed.
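
A minimal sketch of the mask test, assuming each mask is paired with a separate pattern that holds the expected values of the fixed bits, so that fixed 0 bits are checked as well as fixed 1 bits. Only the two encodings that come up in this thread are listed (JMP as quoted from the instruction set PDF above, ADD from the same document). The table and the function name `identify` are illustrative, and the word is expected in the form the PDF prints it, high byte first.

```python
# (mask, pattern, name): the mask selects the fixed opcode bits,
# the pattern gives the values those bits must have.
OPCODES = [
    (0xFC00, 0x0C00, "ADD"),  # 0000 11rd dddd rrrr
    (0xFE0E, 0x940C, "JMP"),  # 1001 010k kkkk 110k (first of two words)
]

def identify(word):
    """Return the mnemonic whose fixed bits match the 16-bit word, or None."""
    for mask, pattern, name in OPCODES:
        if (word & mask) == pattern:
            return name
    return None

print(identify(0x0C94))  # the two bytes read in file order
print(identify(0x940C))  # the same two bytes with the second byte as the high byte
```

With these two entries, 0x0C94 matches ADD and 0x940C matches JMP, which is consistent with the mismatch between the listing and the instruction set PDF described above.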


I don't know if it's all instructions or not.


The byte order is not swapped; I think you are reading them backwards.
The hex file is in incrementing byte order, or I guess you could call it little endian.

Considering your 1st line:
[:][10][0000][00][0C9434000C944F000C944F000C944F00][4F]

This suggests the following memory map:
0000: 0C
0001: 94
0002: 34
0003: 00
0004: 0C
and so on.

AVR8 'is' little endian and so your 1st 2-word (4-byte) opcode is: 0034 940C

Note this is reverse order to what you had.

Steve


Quote:

AVR8 'is' little endian and so your 1st 2-word (4-byte) opcode is: 0034 940C

Surely you mean 940C 0034? The bytes in the 16-bit opcode words may be little endian, but the opcode and operand aren't swapped. That to me, without looking it up, looks like a JMP 0x0034, or in GCC parlance JMP 0x0068.
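
To see the pairing concretely, here is a small Python sketch that groups the first eight bytes of the dump into 16-bit words, low byte first, while keeping the words themselves in address order. It uses the standard `struct` module; the variable names are arbitrary.

```python
import struct

# First eight data bytes of the first record, in file (address) order.
data = bytes.fromhex("0C9434000C944F00")

# "<" means little endian, "H" is an unsigned 16-bit value.
words = struct.unpack("<%dH" % (len(data) // 2), data)

print(" ".join("%04X" % w for w in words))
```

This prints 940C 0034 940C 004F: each pair of bytes is combined low byte first, but the opcode word still comes before its operand word.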


Clawson's right, and I lay money that 0x0034 is the first instruction after the interrupt vector table ;-)

I have a natty little utility I use for turning plain files into intel hex to upload into SRAM. If there's interest I could add a few tweaks and make it work in both directions.

Cheers,

Joey


Quote:

and I lay money that 0x0034 is the first instruction after the interrupt vector table

Then I'll take your money. The 0x0034 is half of the JMP opcode and it's all part of the reset vector which precedes the interrupt vectors. Clearly this is on one of the >8K AVRs which have 4 bytes for the reset and each vector entry, rather than the small AVRs that have just 2 bytes and hence only room for RJMPs.
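
One way to check how far the vector table extends in the dump from the first post is to count the leading 4-byte entries. This is only a sketch, assuming every table entry is a two-word JMP (which is what the dump suggests); the mask and pattern are the fixed bits of the JMP encoding quoted earlier, and the variable names are arbitrary.

```python
import struct

# Data fields of the first seven records of the dump, copied verbatim.
rows = [
    "0C9434000C944F000C944F000C944F00",
    "0C944F000C944F000C944F000C944F00",
    "0C944F000C944F000C944F000C944F00",
    "0C944F000C944F000C944F000C944F00",
    "0C944F000C944F000C944F000C944F00",
    "0C944F000C944F000C944F000C944F00",
    "0C944F000C944F0011241FBECFEFD4E0",
]
flash = bytes.fromhex("".join(rows))
words = struct.unpack("<%dH" % (len(flash) // 2), flash)

# Step through the words two at a time while the first word of each pair is a JMP.
entries = 0
while 2 * entries + 1 < len(words) and (words[2 * entries] & 0xFE0E) == 0x940C:
    entries += 1

table_end = entries * 4  # each entry is 4 bytes
print(entries, hex(table_end))
```

On this dump the count comes to 26 entries, so the table occupies byte addresses 0x00 to 0x67, and byte address 0x68 (word address 0x0034) is the first location after it.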

Quote:

I have a natty little utility I use for turning plain files into intel hex to upload into SRAM.

Anyone using GCC already has that - it's called avr-objcopy. It's more than happy to take any form of binary (including text) as input and either output as linkable ELF object or straight to Intel .hex


Quote:

Anyone using GCC already has that - it's called avr-objcopy. It's more than happy to take any form of binary (including text) as input and either output as linkable ELF object or straight to Intel .hex

And I bet it takes a lengthy command line rather than a simple windows interface though ;-)

Cheers,

Joey


Quote:
And I bet it takes a lengthy command line rather than a simple windows interface though ;-)

Yes. It works perfectly, rather than a non-responsive / crashed GUI.

David.


Quote:

And I bet it takes a lengthy command line rather than a simple windows interface though

Not necessarily - you could wrap the command in a batch file or Tcl/Tk or Perl or Python or some other scripting language to add some eye candy if you plan for seven year old kids to use it.

(Seven year old kids have never used a command line and aren't generally capable of reading a user manual.)

FYI I can convert a text file to Intel hex as follows:

E:\avr[i386_vc]>type test.c
#include <avr/io.h>

int main(void) {
        DDRB = 0xFF;
        while (1) {
                PORTB ^=0xFF;
        }
}

E:\avr[i386_vc]>avr-objcopy -I binary -O ihex test.c foo.hex

E:\avr[i386_vc]>type foo.hex
:1000000023696E636C756465203C6176722F696F3D
:100010002E683E0D0A0D0A696E74206D61696E28A6
:10002000766F696429207B0D0A0944445242203DC1
:1000300020307846463B0D0A097768696C652028B0
:100040003129207B0D0A0909504F525442205E3D50
:0E005000307846463B0D0A097D0D0A7D0D0AEB
:00000001FF

23 = #
69 = i
6E = n
63 = c
6C = l
etc.
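
For a sense of what such a conversion involves, here is a minimal Python sketch that produces the same style of output from raw bytes. It is not how avr-objcopy is implemented; the function names `ihex_record` and `to_ihex` are hypothetical, and it only handles images below 64 KB because it emits no extended address records.

```python
def ihex_record(address, rectype, data):
    """Build one Intel HEX record: size, 16-bit address, type, data, checksum."""
    body = bytes([len(data), (address >> 8) & 0xFF, address & 0xFF, rectype]) + data
    checksum = (-sum(body)) & 0xFF       # two's complement of the byte sum
    return ":" + (body + bytes([checksum])).hex().upper()

def to_ihex(data, width=16):
    """Return data records of up to `width` bytes each, then the end-of-file record."""
    lines = [
        ihex_record(offset, 0x00, data[offset:offset + width])
        for offset in range(0, len(data), width)
    ]
    lines.append(ihex_record(0, 0x01, b""))
    return lines

# The first 16 bytes of test.c as shown above.
for line in to_ihex(b"#include <avr/io"):
    print(line)
```

Fed the first 16 bytes of test.c, this reproduces the first line of foo.hex above, followed by the end-of-file record.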


Quote:

Anyone using GCC already has that - it's called avr-objcopy. It's more than happy to take any form of binary (including text) as input and either output as linkable ELF object or straight to Intel .hex


LOL--now you got me curious enough to participate in this snipe hunt.

CodeVision has a nice set of functional tools to manipulate .HEX/.BIN. Load, edit, save [as].

But I never thought about loading a text file before. I found I could indeed load a text file, edit, and save as .HEX.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


Thank you for all the replies. AVR-GCC writes the .hex file with the least significant byte first (Little-Endian). Once I read the bytes in reverse order, things are working much better.

As a follow-up question:

Are the bytes in Program Memory on the ATMega328p also in "Little-Endian" format? Or does the programmer swap the bytes as it uploads the hex to the microcontroller? I assume the latter, since the AVR Instruction Set shows the instructions in Big-Endian format.


Quote:

AVR-GCC writes the .hex file with the least significant byte first (Little-Endian).

No it doesn't - it writes them in program address order. Which rather answers your follow up question doesn't it?


Nope. You just confused the heck out of me.

We previously determined that the machine instruction 940C is saved in the .hex file as 0C94. The bytes are stored in little-endian format. (http://en.wikipedia.org/wiki/Little-endian)

I'm not sure what 'program address order' is.

You may have cryptically answered the question, but at the same time, answered it in a way that was not entirely useful to me.

Perhaps a simpler question is:

Does the Microcontroller use Big-Endian or Little-Endian notation when reading the Program Memory?


This is the second time today I've explained this, but:

:100000000C9434000C944F000C944F000C944F004F 

is really:

: 10 0000 00 0C9434000C944F000C944F000C944F00 4F 

which says write as follows:

0x0000: 0C
0x0001: 94
0x0002: 34
0x0003: 00
0x0004: 0C
0x0005: 94
0x0006: 4F
0x0007: 00
0x0008: 0C
0x0009: 94
0x000A: 4F
0x000B: 00
0x000C: 0C
0x000D: 94
0x000E: 4F
0x000F: 00

Those bytes in flash will hold those values - without a doubt. However, when you apply power to an AVR it makes a 16-bit opcode fetch, which is little-endian. It therefore reads:

0x0000: 0C
0x0001: 94

as 0x940C and passes it to the instruction decoder, which identifies it as a JMP opcode. This in turn triggers a second fetch as part of the same opcode, and this time it picks up:

0x0002: 34
0x0003: 00

which again it reads as little endian and therefore treats as WORD address 0x0034 (which is BYTE address 0x0068). So after reading the first 4 bytes in two 16-bit chunks it has determined it has "JMP 0x0068" and sets PC to be that address (really 0x0034 in fact - it's just GCC that expresses addresses byte-wise to be common across all architectures).

So the bytes are programmed in exactly the order they appear in the .hex, at increasing byte addresses. When the AVR core makes fetches (two bytes, 16 bits at a time) it treats them as little endian, using the lower-addressed of the two bytes as the lower part of each 16-bit fetch. So it's not when programming the hex that little-endianness comes into play - it's when the data/code is later fetched back from the flash. Hence my reply above - sorry if it was confusing.
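
The fetch and decode steps above can be sketched in Python as follows. This is an illustration of the description only, not a model of the actual hardware: the function name `decode_jmp` is made up, and the mask and pattern are the fixed bits of the JMP encoding (1001 010k kkkk 110k) quoted earlier in the thread.

```python
import struct

JMP_MASK = 0xFE0E      # selects the fixed bits of 1001 010k kkkk 110k
JMP_PATTERN = 0x940C   # value of those fixed bits

def decode_jmp(four_bytes):
    """Return (word_address, byte_address) if the 4 bytes hold a JMP, else None."""
    w0, w1 = struct.unpack("<HH", four_bytes)    # two little-endian 16-bit fetches
    if (w0 & JMP_MASK) != JMP_PATTERN:
        return None
    high = (((w0 >> 4) & 0x1F) << 1) | (w0 & 1)  # the six k bits held in the first word
    k = (high << 16) | w1                        # 22-bit word address
    return k, k << 1                             # byte address is the word address times 2

word_addr, byte_addr = decode_jmp(bytes.fromhex("0C943400"))
print(hex(word_addr), hex(byte_addr))
```

For the first four bytes of the file this gives word address 0x34 and byte address 0x68, matching the `jmp 0x68` in the avr-objdump listing.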

Cliff

PS today's other, similar thread:

https://www.avrfreaks.net/index.p...
(Apr 27th traffic only)


Thank you clawson.

That clarified everything. Now, back to programming!


I am unsure as to BKnight960's interests. Is it:

How an instruction decoder works.
How to disassemble program code.
How an Intel Hex file is formed.
How to re-create the original C code.

The bits of an instruction are organised for the benefit of the original chip designer.

You simply have to rearrange them for yourself. Typically, you extract / build the fields and associate the required opcode and operands.

You can always compare your 'interpretation' with Studio's disassembly. OTOH, you could just read the 'correct' disassembly in the first place.

Nowadays PCs have plenty of memory. I can remember struggling to create a full Z80 disassembly in a few hundred bytes. Even your AVR has plenty of program memory.

I have always looked at microprocessors as black boxes. They just do what it says on the tin.

There are a few books on processor design. Google is your friend.

David.