Open source compressing algorithm

Go To Last Post
18 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hi everyone,

Where can I find an open source compression algorithm (zip, rar or any equivalent format)?

Many thanks

Have a nice day

Pippo

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

7Zip should be open source.

Regards
Sebastian

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

infozip is as well (www.infozip.org).

The question is what is a good zip algorithm for the AVR, i.e.
something that doesn't need too much RAM.

Jörg Wunsch

Please don't send me PMs, use email if you want to approach me personally.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Any form of dictionary compression is probably out unless you are talking about one of the real big AVRs with plenty of RAM. I'd maybe look at a static Huffman but as with any kind of compression it kind of depends what input data you are talking about. Clearly if it were audio, for example, you'd probably be better off with ADPCM or one of the g. standards such as G.729, if it were static images then .png might be a good bet while if it were video I guess one of the flavours of MPEG would be the compressor of choice (though you'd almost definitely need external silicon support for this!). But if it's fairly "loose" data of a general nature but with a recurrent type of content then Huffman (using a static tree so it doesn't need to be in RAM) might well be the best choice for the limited resource environment inside an AVR

Cliff

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I have a similar problem ... except it's the other way around.

I want/need to store alot of message strings in a Mega128 EEprom (4K) and I need to compress them at least 5:1 to fit in. I can do the compression on a PC and program them either at compile time or configuration time (via serial). So I'm not worried about the compression side (in terms of resources).

The question is, what's the algorithm of choice and what's out there (and available) for decompressing the packed messages at run time?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

5:1 may be a bit ambitious, not? Plus the time to decompress on the fly, and the coding time and space of the decompression.

Why not store them in flash?

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
I want/need to store alot of message strings in a Mega128 EEprom (4K) and I need to compress them at least 5:1 to fit in.

Hmm...well, in the end you might end up in the way, that the decompresion alg. will be bigger and more bloated than storing them simply in the FLASH. In the end (if 5:1) it's 20k of prog mem, so if you don't have a really ALL FLASH EATING program, you might use that.

There are pointy haired bald people.
Time flies when you have a bad prescaler selected.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thanks for the comments folks ...

Lee,

5:1 probably isn't too ambitious for this particular set/type of message strings. They are extremely repititious and ideally suited to a dictionary/look-up table approach.

I did a quick test and fed them to WinZip which managed to compress them to 3% of their former size ... so I should be in with a chance.

Daqq,

The idea of using FLASH is an interesting one ... perhaps I shouldn't have mentioned the idea of doing the compression at compile time as its important that the end user have control over the messages via a configuration program. (Which implies a serial link rather than ISP etc)

I suppose we could come up with some bootloader type scheme for getting the strings into program memory but we're using an AES based encrypting bootloader and I'd really like to keep this thing as simple as possible.

At the moment my thoughts are moving along the lines of using a PC based program to sit and sift through the messages finding duplicated portions and constructing a dictionary/lookup table. But being a typical programmer I'd really like to find a snippet of code that's done it all for me ... ok maybe I shouldn't make generalisations, let me restate that ... but, being lazy I'd really like to find a snippet of code that's done it all for me.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
The idea of using FLASH is an interesting one ... perhaps I shouldn't have mentioned the idea of doing the compression at compile time as its important that the end user have control over the messages via a configuration program. (Which implies a serial link rather than ISP etc)

Well, in a mega128 you can WRITE to the flash memory. Although the flash can only withstand a limited number of overwrites (10000), it sohuld be sufficient, unless the customer plans on changing 20k of string on a daily basis.

A well writen bootloader might do the trick!

[edit] Understood more of the post. Makes this post obsolete. But I think you can write into the flash EVEN when the program is running.

There are pointy haired bald people.
Time flies when you have a bad prescaler selected.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

LordAdam wrote:
5:1 probably isn't too ambitious for this particular set/type of message strings. They are extremely repititious and ideally suited to a dictionary/look-up table approach.

You can't really do dictionary compression on an AVR - very few have sufficient RAM for that. However your data sounds like a perfect candidate for a static Huffman analysis. If it's just English text then, for example you should find that the 'e's code into just one bit and the 'a's in just 2, etc. - 5:1 might well be achievable.

cliff

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Just a collection of links:

http://datacompression.info/Sour...

You might find something helpful there !

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Cliff,

The compression would be done on the PC, only the compressed data would be transferred to the AVR.

The AVR itself would only need to perform the decompression part of the method chosen.

But ... if we do go for a dictionary or look-up table approach then the dictionary either has to be statically designed and pre-loaded in program memory space (which is inflexible and not likely to be acceptable) or we have to load the look-up table in EEprom, and that's space etc which we have to take into account.

Overnight I've been thinking/sleeping on the problem ... what do you think to the following approach?

The data is really only 7-bit ascii ... so encode it as follows ... anything with the top bit unset represents itself... anything with the top bit set is a look-up into a table which is passed to the AVR at configuration time along with the rest of the encoded messages.

The look-up table would contain whole words on an optimal basis determined by tokenizing/sorting/counting the messages at the PC end before being sent to the AVR.

This way we could have up to 128 frequently occurring words or sequences that would be replaced by a single byte in the EEprom ... strings of spaces (common in this message set) would also be replaced by single bytes.

Ossi,

Just seen your message and had a quick butchers at the link ... looks really good, you can be sure that I'll be exploring that thoroughly (and putting it in my LinkStash file!).

Mike.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If you are using only alphanumeric characters e.g. a-z, A-z, 0-9, then you might store two characters in a single byte... Build up your own character set, eg
a = 0
b = 1
c = 2
..
z = 26
A = 27
...
Z = 42
0 = 43
..
9 = 52

This way you can store a character in int(ln 52+1) = 6 bits.

If you ommit capital chracters, that shrinks to 36 = 6 bits..

Okay so this is not enough... :(

axos88

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Axos,

That's an interesting idea ...

Your posting reminded me of something that I've not seen for 30 years ...

On the old PDP11 series of assemblers there was a special mode for storing three characters in a 16-bit word. It worked on the idea that 40 cubed was just less than 2^16 ... so you could store a 40 character alphabet three to a word by doing the maths at run time.

It was a poor, and inconvenient tradeoff, and reflects the extreme cost of storage space at that point in history ...

In another twist of the tale ... it was supprisingly called something like Dec50 or Ascii50 storage .... DEC from the makers of the PDP11 and the 50 from octal which was a common number base of those machines.

I wonder if anyone else recalls seeing 177777 rather than FFFF ?

And do you remember the console switches coloured in groups of 3? (Were they really grey and purple?)

Perhaps someone can still remember the bootloader sequence to get the thing running!

Mike.

Ps. thanks for the memories ... ah, the good old days.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

May be too late, but could you add some kind of external storage chip for the strings, like EEPROM, to some bus (I2C, SPI, bit-banged, etc)? Any leftover pins or connectors on your board?

- Jani

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Why reinvent the wheel..use MORSE CODE..it will provide a degree of compression and ease of decoding.

Alternatively google for arithmetic compression.. it needs the dictionary and the message gets compressed into a single decimal fraction ( number ) altough the decompression algorithm may be a bit more expensive in terms of software overhead

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Morse Code is effectively a form of static Huffman compression in fact. (in the sense that the most common characters get the shortest symbols and the dit-dahs are really just 0-1's)

cliff

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The idea of Morse Code is a very intriguing one ... in that as Cliff says it is a form of pre-set Huffman encoding which produces a form of semi-optimised bit length encoding (if you replace the dots and dashes by 0's and 1's).

But as it stands its not actually a binary code ... the silence between the dots and dashes is significant. True its not significant within a character but it is essential to determine the end of one character and the start of the next. Similarly in traditional Morse Code it determines the spaces between the words (there being no symbol to represent a space ... that came later with machine sent code).

The significance of the silence makes Morse a 3-level code (tertiary?) ... this was significant in that a hand produced coding method used by Russia in the 50's remained unbroken for many years until a defector revealed its secrets. Fundamentally the code was based on a 3-level code similar to morse and that is what had defeated the attempts of the time to crack it.

In terms of the encoding of single characters (as opposed to dictionary or table based methods) the Arithmetic compression method seems to be the optimal solution ... although as you say the implementation could be tricky. (But there's some good stuff out there on the web for a starting point.)

Mike.