ASCII Compression silliness - Tiny/mega/AVR

Go To Last Post
24 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Some of you who arguably speak English may have noticed that printable ASCII characters only require seven bits - the high bit is never set.  Some may have also noticed that the AVR is fundamentally an eight-bit processor, and therefore, because 8x7 == 7x8, one could theoretically stuff eight ASCII characters into only seven bytes.

 

Nota bene:  If you actually try to implement this in a functional gizmo, you are prematurely optimizing, and kindly Knock It Off Already.

 

With that out of the way, because I was bored, mostly, I have developed, implemented, and tested routines for compressing (and decompressing) eight (8) ASCII characters represented in seven bits into only seven (7) bytes.  This might be marginally helpful for those of you who load up your memories with strings (which I do incessantly).

 

This particular application:

a) receives eight characters from the RS-232 serial port

b) stores them uncompressed in the EEPROM, consuming eight bytes

c) compresses them, and stores them again in the next seven bytes

d) stores a cheerful $00 in the sixteenth EEPROM byte (because why not?)

e) retrieves the seven compressed bytes from the EEPROM

f) decompresses them back into valid ASCII

f) and dumps them out the serial port, in reverse order.*

 

It also dumps the first sixteen bytes, in ASCII hexademcial, of the EEPROM at pretty much every stage of the operation.

 

It's written in AVR assembler, and I have programmed and tested it in an ATmega8535.  Converting it to other chips is left up to you.  Those of you who consider yourselves C coder wizards are welcome to rewrite it in C - who knows, perhaps your code might be even more efficient!

 

* Fixing that, if you care, is left as an exercise for the student.

 

Now for the fun part!

 

I have included all the supporting files in the attached .zip, but the fun bits I shall paste into a code window here.  Enjoy!

 

; Compress and store in EEPROM <-- The fun bit starts here

; Load to registers r0-r7
	ldi ZL, low(SRAM_START)
	ldi ZH, high(SRAM_START)
	ld r0, Z+
	ld r1, Z+
	ld r2, Z+
	ld r3, Z+
	ld r4, Z+
	ld r5, Z+
	ld r6, Z+
	ld r7, Z+

; and compress to r0-r6

; Generally speaking, putting multiple instructions
; on one line is horrible practice.  But in this
; case, I think it better shows what is actually going on.

	lsl r7	; Toss carry

	lsl r7
	rol r6	; Toss carry
	lsl r7
	rol r6 rol r5	; Toss the carry
	lsl r7
	rol r6 rol r5 rol r4	; Toss the carry
	lsl r7
	rol r6 rol r5 rol r4 rol r3	; Toss the carry
	lsl r7
	rol r6 rol r5 rol r4 rol r3 rol r2	; Toss the carry
	lsl r7
	rol r6 rol r5 rol r4 rol r3 rol r2 rol r1	; Toss the carry
	lsl r7
	rol r6 rol r5 rol r4 rol r3 rol r2 rol r1 rol r0	; And toss the final carry

;< At this point the registers r0-r6 contain the
;< compressed characters - 8 of them, in 7 bytes.  I didn't
;< paste the boring storing stuff.

;< Next, once the seven bytes have been retrieved into
;< registers r0-r6. we decompress them...

	lsr r0
	ror r1 ror r2 ror r3 ror r4 ror r5 ror r6 ror r7
	lsr r1
	ror r2 ror r3 ror r4 ror r5 ror r6 ror r7
	lsr r2
	ror r3 ror r4 ror r5 ror r6 ror r7
	lsr r3
	ror r4 ror r5 ror r6 ror r7
	lsr r4
	ror r5 ror r6 ror r7
	lsr r5
	ror r6 ror r7
	lsr r6
	ror r7
	lsr r7

; And lo, the registers r0-r7 contain the same eight valid ASCII characters.

Have fun!

 

S.

 

 

Attachment(s): 

This topic has a solution.
Last Edited: Sat. Mar 21, 2020 - 09:31 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Let me tell you about SIXBIT (6 bits per character, an uppercase subset of ascii) and SQUOZE (6 popular characters used in symbols, squished into 32bits)

PDP-10s had a 36bit word, and instructions that dealt with any "byte size" between 1 and 36, so they regularly stored text as 5 seven-bit byes in a word (which wrecked havoc when trying to write or use a C compiler.)

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

westfw wrote:

so they regularly stored text as 5 seven-bit byes in a word (which wrecked havoc when trying to write or use a C compiler.)

 

But five seven-bit characters is only 35 bits, not 36!  Wasteful!!  laugh  S.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

But five seven-bit characters is only 35 bits, not 36!  Wasteful!!

they used the extra bit to indicate whether a word contained the line number.  So you had magic 5-digit line numbers that editors could use, and compilers (for example) would automatically ignore.

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Use Every Bit.  This might be a bit of a walnut and brass-plated anachronism, then - but that's how they played the game in the Spanish Sahara!

 

cheeky  S.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Scroungre wrote:
Some may have also noticed that ... one could theoretically stuff eight ASCII characters into only seven bytes

The GSM people noticed that in the 80s - SMS does exactly that!

 

There used to be lots of posts about it here when the "host" microcontroller had to do the PDU encoding & decoding itself - before cellular modules became smart enough to do it themselves.

 

I also remember that DEC had "Radix-50" - which packed a subset of ASCII ...

 

https://en.wikipedia.org/wiki/DEC_Radix-50

 

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillatorhttps://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...
Last Edited: Sat. Mar 21, 2020 - 10:03 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

It surprises me not that this particular wheel has been invented before.  But since when did you see an SMS implementation in AVR assembler code?  frown

 

'sides, I think the code is pretty.  So there.  S.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I remember doing this for what was then called a "telemetry" project with vending machines (the trendy new term at the time was "M2M"; nowadays, we'd say "IoT").

 

There was too much data to get into one plain text SMS message - and SMS was expensive in those days.

 

So I came up with a scheme of cramming only the required bits into a PDU.

 

It looked quite pretty in 'C' - with all the shifts & masks nicely arranged ...

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillatorhttps://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

awneil wrote:

It looked quite pretty in 'C' - with all the shifts & masks nicely arranged ...

 

Show me.  S.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

Using a different concept you could save plenty of code and time to compress and decompress.

 

For example for encoding you could copy all bits 0-6 of r7 to bit 7 of registers r0-r6, which requires only 21 instruction words and 21 cycles instead of 36.

 

Decoding is equivalent and would need only 22 instruction words and cycles instead of 36.

 

The assembly code could look as follows (just an example):

; compressing eight 7-bit bytes into 7 bytes using registers r0-r7, requires 21 cycles
; compressing algorithm: copy all bits 0-6 of r7 to bit 7 of registers r0-r6

	lsl	r0	; prepare r0
	lsr	r7	; bit 0 of r7 to CY
	ror	r0	; CY to bit 7 of r0

	lsl	r1	; prepare r1
	lsr	r7	; next bit of r7 to CY
	ror	r1	; CY to bit 7 of r1

	lsl	r2	; prepare r2
	lsr	r7	; next bit of r7 to CY
	ror	r2	; CY to bit 7 of r2

	lsl	r3	; prepare r3
	lsr	r7	; next bit of r7 to CY
	ror	r3	; CY to bit 7 of r3

	lsl	r4	; prepare r4
	lsr	r7	; next bit of r7 to CY
	ror	r4	; CY to bit 7 of r4

	lsl	r5	; prepare r5
	lsr	r7	; next bit of r7 to CY
	ror	r5	; CY to bit 7 of r5

	lsl	r6	; prepare r6
	lsr	r7	; next bit of r7 to CY
	ror	r6	; CY to bit 7 of r6

; decompressing seven 8-bit bytes into 8 bytes using registers r0-r7, requires 22 cycles
; decompressing algorithm: copy all 7th bit of r6-r0 to bits 6-0 of r7 and reset bit 7 of r7

	lsl	r7	; reset final bit 7 of r7 (could also use clr r7)

	lsl	r6	; bit 7 of r6 to CY
	rol	r7	; CY to next bit of r7
	lsr	r6	; readjust r6

	lsl	r5	; bit 7 of r5 to CY
	rol	r7	; CY to next bit of r7
	lsr	r5	; readjust r5

	lsl	r4	; bit 7 of r4 to CY
	rol	r7	; CY to next bit of r7
	lsr	r4	; readjust r4

	lsl	r3	; bit 7 of r3 to CY
	rol	r7	; CY to next bit of r7
	lsr	r3	; readjust r3

	lsl	r2	; bit 7 of r2 to CY
	rol	r7	; CY to next bit of r7
	lsr	r2	; readjust r2

	lsl	r1	; bit 7 of r1 to CY
	rol	r7	; CY to next bit of r7
	lsr	r1	; readjust r1

	lsl	r0	; bit 7 of r0 to CY
	rol	r7	; CY to next bit of r7
	lsr	r0	; readjust r0

For multiples of eight 7-bit ASCII bytes in the memory you copy 7 bits of the first byte to bit 7 of the next 7 bytes when encoding, and decoding accordingly.

 

In other words: You don’t need 8 registers, but 2 will be sufficient for the shifts and in addition the Z register as memory pointer.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

just for fun comments :

 

if you really want to compress then look here :

http://prize.hutter1.net

 

20 years ago when I started with AVR i was writing a bit lib (was missing a lot of the 8051 bit things), and found out that if you wan't to compress like you it's actually "cleaner" to have the bit's the "wrong" way round (transposed) so if you want 7 bit words then 7 bytes hold 8 chars. and all bit 0 hold one char all bit 1 the next etc. This cost 7 loads for 1 char (and some bit check) but you know the placement. 

This is best when you have a big array where 3 bit hen hold bit number (placement), and the addr is the rest *7 + offset   

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I always knew there was someone out there smarter than I.

Last Edited: Sat. Mar 21, 2020 - 09:37 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

My code was for size not speed !    (I make controllers I think in worst case not AVG.)

 

With some layout planning before doing the code I don't think it's a problem. 

 

 

add

Ups I see that I'm not the only one that do it that way.

Last Edited: Sat. Mar 21, 2020 - 11:44 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

It's worth asking:  Do you want flat-out speed, or smaller code size?  Each should lead to a different solution. 

 

Edited to add:  Or human-readable code.  Token-selecting from a library is going to be annoying reading for code maintainers.  Maybe nobody cares...

 

S.

Last Edited: Sat. Mar 21, 2020 - 09:40 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Scroungre wrote:
Show me.  

I'd have to find it first - this was going on for 20 years ago!

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillatorhttps://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Build it.  See if it works.  I did.  So can you.  S.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Scroungre wrote:
It's worth asking:  Do you want flat-out speed, or smaller code size?  Each should lead to a different solution.  Does yours really require 22 cycles per character?  That's much slower than mine.  S.

 

As mentioned in the text and in the source code, my proposal needs 21 + 22 words of instruction code for encoding + decoding and is running in 21 + 22 instruction cycles for encoding + decoding (instead of 36 + 36) for all 8 characters, of course.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

You can save one instruction per decompressed bit bit placing the 8th ASCII character bits in bit 0 instead of bit 7. This way you don't need the "readjust" instructions.

 

	lsr	r6	; bit 0 of r6 to CY & adjust
	rol	r7	; CY to next bit of r7
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

El Tangas wrote:

You can save one instruction per decompressed bit bit placing the 8th ASCII character bits in bit 0 instead of bit 7. This way you don't need the "readjust" instructions.

 

	lsr	r6	; bit 0 of r6 to CY & adjust
	rol	r7	; CY to next bit

Indeed, a great idea, which can be used both for encoding and decoding, which would result in only 2 instead of 3 instructions each!

This reply has been marked as the solution. 
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

Here is the improved version:

; compressing eight 7-bit bytes into seven 8-bit bytes using registers r0-r7, requires 14 instruction words and cycles
; compressing algorithm: copy all bits 0-6 of r7 to LSB of registers r0-r6 being shifted left by one

	lsr	r7	; LSB of r7 to CY
	rol	r0	; CY to LSB of r0

	lsr	r7	; next bit of r7 to CY
	rol	r1	; CY to LSB of r1

	lsr	r7	; next bit of r7 to CY
	rol	r2	; CY to LSB of r2

	lsr	r7	; next bit of r7 to CY
	rol	r3	; CY to LSB of r3

	lsr	r7	; next bit of r7 to CY
	rol	r4	; CY to LSB of r4

	lsr	r7	; next bit of r7 to CY
	rol	r5	; CY to LSB of r5

	lsr	r7	; next bit of r7 to CY
	rol	r6	; CY to LSB of r6

; decompressing seven 8-bit bytes into eight 7-bit bytes using registers r0-r7, requires 15 instruction words and cycles
; decompressing algorithm: copy all LSB of r6-r0 (and shift right by one) to bits 6-0 of r7 and reset MSB of r7

	lsl	r7	; reset MSB of r7 (could also use clr r7)

	lsr	r6	; LSB of r6 to CY
	rol	r7	; CY to next bit of r7

	lsr	r5	; LSB of r5 to CY
	rol	r7	; CY to next bit of r7

	lsr	r4	; LSB of r4 to CY
	rol	r7	; CY to next bit of r7

	lsr	r3	; LSB of r3 to CY
	rol	r7	; CY to next bit of r7

	lsr	r2	; LSB of r2 to CY
	rol	r7	; CY to next bit of r7

	lsr	r1	; LSB of r1 to CY
	rol	r7	; CY to next bit of r7

	lsr	r0	; LSB of r0 to CY
	rol	r7	; CY to next bit of r7

This improved code results in 14 + 15 = 29 instruction words and cycles (for 8 characters) instead of 36 + 36 = 72, which is a big improvement of almost 60%, both in code size and speed.

 

Thanks for the hint El Tangas!

smiley

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If you have a lot of ASCII text and you need to save space then don't dick around with small fish like this.  Just go straight to at least huffman+token-substitution.

 

If your strings total a few hundred bytes then it wont be worth it - but if you are going to <quote>load up your memories with strings</quote> then it can save a lot.

 

I have had projects I inherited that had over 1/2 of the micro filled with human readable text for menus and messages on 16x2 LCD screens.

 

The kind of text in menus you will get about 5 bits per char without limiting yourself to a subset of ASCII.  If you can do token substitution as well you can get even more dramatic gains.  The example project I am thinking about here the word "flow " was almost as common as the letter "e" so would have been only 4 or 5 bits to encode.  That is 5 bits to encode rather than 40.

 

Of course this kind of effort only makes sense if you can't just put a micro with bigger flash in, but that counts for most large optimizations.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

marginally helpful for those of you who load up your memories with strings (which I do incessantly)

Why are you putting in so much text? Any examples?  Are you reusing words & phrases?  Doing so can save a lot.

I/YOU/BILLY/FREDDY= SHALL /EAT/COOK UP/DELIVER= A VERY TASTY /STEAK/HOGIE/PIZZA/HOT DOG= AT NOON

 

  

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

andrewm1973 wrote:

I have had projects I inherited that had over 1/2 of the micro filled with human readable text for menus and messages on 16x2 LCD screens.

 

I've built projects like that - maybe you inherited one of mine!!  laugh  But the biggest one had more like a 16x16 character LCD, and the screenwriting routines went to some effort to use ASCII as a rest-of-the-world interface, even though each actual character drawn on the screen was a custom-designed 8x8 pixel array.  Some of them were quite cute...  but of course, every phase of machine operation required a different screen, and redrawing the whole screen with a new one was 256 bytes.  We eventually did both - bought an AVR with more flash, and threw in some optimizing routines.

 

The whole thing was actually a shim.  The original display was character-oriented, with its own character generator, and was replaced at a late moment with another display that didn't have any internal logic at all - just 'put a dot here' instructions.  I put a lot of dots.

 

In current practice, although this will get up the nose of some around here, I put a lot of strings in flash such that the MCU can talk reasonably intelligently to a dumb terminal.  The AVR will present menus and expect a user selection from them, and the menus are written to be human* readable.  Obviously a much better way would be an intelligent head-end program, but I'm not smart enough to write one of those yet (working on it, in Python).  S.

 

* in this case, "human" refers to me.  Arguable, I know.  S.

 

PS - If any of you are still worrying about the point of this, see the original thread title, and note the operative term "silliness".

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

one could theoretically stuff eight ASCII characters into only seven bytes.

Off-topic but somewhat related... Kurzweil synthesizers have been doing something like this since the 1990s. To send binary data over MIDI, one has to "pack" seven 8-bit bytes into eight 7-bit values. Makes for some fun coding in C and Javascript.

Mike