Efficient storage of and access to variable-length collections of strings

Total votes: 0

I have a lot of strings in multiple languages (a basic i18n attempt for an embedded project) and I've been playing with ways to access and store them. They are of two types: some are short messages to be displayed on an OLED display; others are longer strings, e.g. status messages and, in particular, menus.

 

One of the aspects of multilingualism is that it seems not all languages are created equal - some (I'm looking at you, French!) take many more words to convey the same information as, say, English or Dutch. German is fond of long words... (I won't go into the question of diacritics, or the possibility that a terminal expecting ASCII can't cope with them.)

 

The problem is that while a list of error messages, constrained in length to the width of the OLED, can be grouped without too much waste in a three-dimensional array:

const char i18n_faults [FLT_LAST][LANG_LAST][22] = {
	{
		// FLT_MAX_LEAN:
		// the fuel-air mixture has reached its limit and is still too lean (too much air)
		"Mixture max lean     ",	// EN
		"Miscela max magra    ",	// IT
		"Mischung max mager   ",	// DE
		"Melange max maigre   ",	// FR	(should be Mélange)
		"Mengsel max mager    ",	// NL
	},
...
};

and accessed simply by printing (in this case) i18n_faults[error][language], it gets a bit more complicated with menus. Ideally I'd want a single routine to handle printing a menu in the selected language, but with the same approach either all the strings have to be the same length so the offsets can be calculated, or the calling routine needs to know how long the strings are in a particular menu - which has its own problems if anything changes at a later date. A single menu entry might vary from a few tens of characters to a few hundred - that's a lot of space wasted...

 

Instead, what I've done is to define *every* string in each language individually (with a sensible name so I can keep track). That way there isn't any space wasted equalising line lengths. Arrays of pointers group the same block of strings (in their different languages) to provide access by [language], and sometimes these are grouped further to provide, e.g., [msg][language] access.

 

<...>
const char func_lam_NL [] = "lambda sensor spanning (alleen 20v)";
const char func_int_NL [] = "geintegreerde lambda-spanning (alleen 20v)";
const char func_mph_NL [] = "voertuigsnelheid (alleen 20v)";

const char * func_ptr [LANG_LAST][12] = {
	{ func_rpm_EN, func_inj_EN, func_adv_EN, func_man_EN, func_air_EN, func_wat_EN, func_tps_EN, func_bat_EN, func_icv_EN, func_lam_EN, func_int_EN, func_mph_EN },
	{ func_rpm_IT, func_inj_IT, func_adv_IT, func_man_IT, func_air_IT, func_wat_IT, func_tps_IT, func_bat_IT, func_icv_IT, func_lam_IT, func_int_IT, func_mph_IT },
	{ func_rpm_DE, func_inj_DE, func_adv_DE, func_man_DE, func_air_DE, func_wat_DE, func_tps_DE, func_bat_DE, func_icv_DE, func_lam_DE, func_int_DE, func_mph_DE },
	{ func_rpm_FR, func_inj_FR, func_adv_FR, func_man_FR, func_air_FR, func_wat_FR, func_tps_FR, func_bat_FR, func_icv_FR, func_lam_FR, func_int_FR, func_mph_FR },
	{ func_rpm_NL, func_inj_NL, func_adv_NL, func_man_NL, func_air_NL, func_wat_NL, func_tps_NL, func_bat_NL, func_icv_NL, func_lam_NL, func_int_NL, func_mph_NL },
};

Have I missed something obvious?

 

Neil

Total votes: 1

barnacle wrote:
some (I'm looking at you, French!) take many more words to convey the same information as, say, English or Dutch
Try marrying someone Irish! ;-)

 

And, no, I don't think you've missed much - separate strings, then arrays of pointers to them, is usually the approach to take. Of course, what you might want to do (to prevent errors) is hold the text in some other format and then use a scripting language like Python to auto-generate the actual source to be built.

Total votes: 0

Why not make a linked list?

Or just count the number of zero terminators; I'd guess speed doesn't matter as long as it takes less than 1/10 of a second.
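
Roughly like this, say - one language's entries packed into a single block, and the n-th one found by counting terminators (the names and data below are invented for illustration):

/* One language's menu text packed into a single block: entries are separated
   by '\0', and an empty entry (a second '\0' in a row) marks the end. */
static const char menu_packed[] =
	"First menu entry\0"
	"Second, rather longer, menu entry\0"
	"Third entry\0";	/* the string literal supplies the final extra '\0' */

/* walk past n terminators to find the n-th entry; NULL if n is off the end */
static const char *nth_string(const char *block, unsigned n)
{
	while (n--) {
		while (*block)
			block++;	/* skip to the end of the current entry */
		block++;		/* step over its terminator */
		if (*block == '\0')
			return (const char *)0;	/* hit the empty entry that ends the block */
	}
	return block;
}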

Total votes: 0

barnacle wrote:
i18n_faults[error][language]

Rather than have to index for language at every single use, might it not be simpler to set up a pointer to the appropriate language table at startup ... ?

 

Just a thought.
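
Something like this, perhaps (a sketch borrowing the func_ptr table from #1 - everything except func_ptr is an invented name):

extern const char * func_ptr[][12];	/* the [language][function] table from #1 */
void oled_write(const char *s);		/* stand-in for the real display routine */

static const char **cur_lang;		/* the selected language's row of the table */

/* call once at startup, or whenever the user changes language */
void set_language(unsigned language)
{
	cur_lang = func_ptr[language];
}

/* after that, callers index by message only - no language argument needed */
void show_func_name(unsigned func)
{
	oled_write(cur_lang[func]);
}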

 

EDIT

 

I'm assuming you do want the language to be configurable at runtime - not just do a separate build for each language?

 

+1 to Cliff's idea of preprocessing rather than trying to create it all manually in C code.

Last Edited: Thu. Jan 7, 2021 - 09:45 AM
Total votes: 0

PS: forgot to say that a further advantage of "pre-processing" from some text layout (JSON? XML? plain ASCII?) to C source is that you don't hit the problem where your French translator wasn't counting right and has entered 65 characters into a field that can only hold/display 64. At generation time the tool can either simply say "Oi matey, that one is too long" or even apply some kind of intelligent truncation algorithm.
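
(Even before any tool gets involved, a C11 _Static_assert can catch the same thing at build time for hand-written strings - e.g. against the 21-character field from #1; the names here are just illustrative:)

#define FIELD_WIDTH 21		/* e.g. the OLED line width from #1 */

const char flt_lean_FR[] = "Melange max maigre";	/* FR text from the example in #1 */

/* sizeof() counts the trailing '\0', hence the +1 */
_Static_assert(sizeof(flt_lean_FR) <= FIELD_WIDTH + 1,
               "translation too long for its display field");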

Total votes: 0

@awneil - yes, the language has to be selectable by the user at any time, so the build has to include everything.

 

@sparrow - a string with internal '\0' separators, and two '\0' at the end, is one I considered (I think some of the MS Windows APIs used to use a similar approach?) but rejected because of the extra calculation step required and the possibility of screwing it up spectacularly. The linked list has the same objections; it's a search each time I want to use a string. This way, I think, the compiler does all the maths at build time...

 

@Cliff - yeah, the possibility of automatic build scripts occurred to me, but for now the corpus is small enough to manage manually. The bulk of the strings don't have a length limit - just the 21-character one, and the compiler currently complains about any string exceeding it. Though I may change that to the same mechanism; one way to manage longer strings is to scroll them sideways, which looks impressive when the timing is right (the display makes it easy to do pixel shifts), but I'm not sure that really fits the desired user interface.
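
(The character-level version of that scroll is just a sliding window over the string - the pixel-shift timing is down to the display driver. A sketch, with placeholder names:)

#include <string.h>

#define OLED_COLS 21		/* visible width, as in the fault table */

/* copy the OLED_COLS characters visible at scroll position 'offset' into buf,
   padding with spaces once the end of the message scrolls into view */
void scroll_window(const char *msg, size_t offset, char buf[OLED_COLS + 1])
{
	size_t len = strlen(msg);

	for (size_t i = 0; i < OLED_COLS; i++) {
		size_t pos = offset + i;
		buf[i] = (pos < len) ? msg[pos] : ' ';
	}
	buf[OLED_COLS] = '\0';
}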

 

I discovered some of these issues the hard way fifteen years ago, with the first AVR iteration of this device. That had only two languages - English and Italian - and my testing showed everything fine. Then the Italians started using it and discovered one particular combination of configuration and displayed error that crashed the system. I hadn't used progmem for the string storage, so when the longer strings got stuffed into RAM they collided with the stack... <blush> It took a lot of rewriting to get everything into progmem, as I recall.
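
(For anyone following along, the classic-AVR idiom is roughly this - a sketch only, with oled_write() standing in for the real display call:)

#include <avr/pgmspace.h>
#include <string.h>

void oled_write(const char *s);		/* stand-in for the real display routine */

/* keep both the strings and the pointer table in flash, not SRAM */
const char flt_lean_EN[] PROGMEM = "Mixture max lean     ";
const char flt_lean_IT[] PROGMEM = "Miscela max magra    ";
/* ... */

const char * const i18n_fault_ptrs[] PROGMEM = {
	flt_lean_EN, flt_lean_IT, /* ... */
};

void show_fault(unsigned index)
{
	char buf[22];
	/* fetch the pointer from flash, then copy the string itself out of flash */
	PGM_P p = (PGM_P)pgm_read_word(&i18n_fault_ptrs[index]);
	strcpy_P(buf, p);
	oled_write(buf);
}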

 

Neil

Total votes: 0

I had this very problem and went the "whole hog" route of using a Python program and the gettext module to build C and H files. Those C and H files, being machine-generated, are essentially unreadable, but I'll accept that because adding a new message and translating it into as many or as few languages as I desire becomes fairly trivial. Should the user select a language for which no translation of a particular word/message/phrase exists, he gets English as the default.

 

So each word is tokenised (via a machine-generated enum), and each message string is tokenised as well.

 

As an example, for menu text I have this (MessageID is a typedef around an enum):

static const MessageID SetupMenuText[] = {
	MSG_LANGUAGE,
	MSG_OPERATING_REGION,
	MSG_PRINTER,
	MSG_TIME,
	MSG_DATE,
	MSG_ALARMS,
};

 

Messages are printed to the screen via an accessor function with a particularly pertinent name.

WriteStringToDisplay(babel_getMessage(MSG_PRESS_ENTER_TO_ABORT));
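
In outline (heavily simplified - the real tables are machine-generated, and the names and strings below are just illustrative), the lookup amounts to:

typedef enum {
	MSG_LANGUAGE,
	MSG_PRESS_ENTER_TO_VIEW,
	/* ... */
	MSG_COUNT
} MessageID;

/* per-language tables; a NULL slot means "no translation, use English" */
static const char * const msg_EN[MSG_COUNT] = {
	[MSG_LANGUAGE]            = "Language",
	[MSG_PRESS_ENTER_TO_VIEW] = "Press enter to view",
};
static const char * const msg_ES[MSG_COUNT] = {
	/* MSG_LANGUAGE left out, i.e. NULL, so it falls back to English */
	[MSG_PRESS_ENTER_TO_VIEW] = "Pulsar enter para ver",
};

static const char * const * const msg_tables[] = { msg_EN, msg_ES };
static unsigned current_language;	/* index into msg_tables, set from user config */

const char *babel_getMessage(MessageID id)
{
	const char *s = msg_tables[current_language][id];
	return s ? s : msg_EN[id];	/* English default when no translation exists */
}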

 

I can go into more detail should you be interested.

Total votes: 0

Does your script actually do the translation?

 

Total votes: 0

No

awneil wrote:

Does your script actually do the translation ?

That part is handled by the standard gettext utilities. They use .po files in specific folders (Messages\locale\es\LC_MESSAGES\messages.po) that consist of entries looking like this Spanish example:

#. Popup Warning Message on 2-lines (Maxlen=32)
msgid "Press enter to view"
msgstr "Pulsar enter para ver"

 

Non-programmer translation agents have no trouble with this recognised format.

 

Total votes: 0

If size doesn't matter but speed does, then write a script that generates pointers to all the text starts, so your C code has an array of pointers.