Using accented characters with Arduino


  While making a French/English flash card program, it became apparent that there was no easy way to work with accented
characters.  The simplest way is to ignore the accents, but that is not a real solution.  My platform is the Arduino
Mega2560, which has enough memory to store a non-trivial number of words and phrases.

 

 There is no single standard format for displaying accented chars as there is for ordinary ASCII characters. PC
Windows uses two separate sets of ALT+keypad number codes: ALT+130 and ALT+0130 show different characters.
Most text editors like UltraEdit use the Latin Majuscule-C-cedilla format, while the Arduino IDE (and serial monitor) use
the Unicode UTF-8 format.  The Arduino TFT screen libraries commonly use the original Adafruit 256-char font file, which
uses yet another encoding in the 'hidden' chars between 128 and 255. And then it seems that the accented characters
created by the Arduino IDE for the Mega are different from those created for the UNO/Nano.  All I wanted was to have
the same characters displayed on the Arduino TFT screen as were being displayed in the Arduino IDE editor.

 

I downloaded a long list of 1544 phrases from the French Reverso website using a Select-All [Ctrl-A] and
copy-to-buffer [Ctrl-C].  Then I pasted this list into a new OpenOffice-Word file using Ctrl-V. MS Word is one of the
few programs that can save text in the Unicode UTF-8 format used by Arduino's IDE.  When the UTF-8 file of the phrases
was loaded into the Arduino IDE, all the accented characters displayed correctly.  Each phrase was then framed for flash
memory storage using:

 

   const char frExpr0000[] PROGMEM = "Accuser réception";
   const char frExpr0001[] PROGMEM = "Acheter/vendre chat en poche";

 

Next create an array consisting of the beginning flash address of each phrase:

 

   const char * const PROGMEM frExprArray[] = { frExpr0000, frExpr0001 };

 

The main code can now use the phrases directly from flash memory with:

 

   unsigned int  expressionNumber ;
   char          expressionBuffer[80];    // make sure this is large enough for the largest string it must hold

   strcpy_P( (char *) expressionBuffer, (char *) pgm_read_word (& frExprArray[expressionNumber]) );
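
A bounds-checked variant of the copy above might look like the sketch below. The helper name loadExpression and the host-side stand-in macros are assumptions made for this example only; on the AVR it relies on the avr-libc calls strncpy_P and pgm_read_ptr.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifdef __AVR__
  #include <avr/pgmspace.h>
#else
  // Host-side stand-ins (assumption: only so the sketch can be tried off-target).
  #define PROGMEM
  #define strncpy_P strncpy
  #define pgm_read_ptr(addr) (*(addr))
#endif

// The escape is split from "ception" because 'c' and 'e' are hex digits
// and would otherwise be read as part of the \xA9 escape.
const char frExpr0000[] PROGMEM = "Accuser r\xC3\xA9" "ception";
const char frExpr0001[] PROGMEM = "Acheter/vendre chat en poche";
const char * const frExprArray[] PROGMEM = { frExpr0000, frExpr0001 };
const size_t FR_EXPR_COUNT = sizeof(frExprArray) / sizeof(frExprArray[0]);

// Hypothetical helper: copies phrase n into buf (capacity bufSize), truncating
// if needed and always terminating. Returns false if n is out of range.
bool loadExpression(size_t n, char *buf, size_t bufSize) {
    assert(buf != NULL && bufSize > 0);
    if (n >= FR_EXPR_COUNT) return false;
    const char *src = (const char *) pgm_read_ptr(&frExprArray[n]);
    strncpy_P(buf, src, bufSize - 1);
    buf[bufSize - 1] = '\0';
    return true;
}
```

Truncating instead of overrunning means a phrase longer than the buffer shows up cut short on the screen rather than corrupting memory.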

   The char array 'expressionBuffer[]' now holds one complete phrase, with a terminating zero at the end.  Each
char in the array has to be tested to determine if it is an accented char.  The UTF-8 format used by the .ino source
file uses a sentinel byte with value 0xC3 to indicate that an accented character follows. The next byte has a
value between 0x80 and 0xBF.   The char has to be converted to an unsigned byte in order to be tested, because a plain
char is signed on the AVR and values above 127 would compare as negative.  For some reason,
Arduino binary code for the UNO/Nano doesn't precede the accented char with a 0xC3 sentinel byte, and the single
byte it stores is offset by 0x40 from the Mega's compiled accent-char value.  I found this out by doing a hex-pair display
of the bytes stored in flash by the different Arduino boards.  A Mega2560 will show:  

    0000:  61 C3 A9 62 63 ...   while a UNO/Nano will have  0000:  61 E9 62 63 ...   for the same character display.
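
A hex-pair display like the one above can be produced with a few lines of code. This is only a sketch: the helper name hexPairs is made up for this example, and it formats into a buffer so it can be tried on a PC as well as on a board.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper: writes the bytes of s as space-separated hex pairs
// into out (capacity outSize). Returns false if out is too small.
bool hexPairs(const char *s, char *out, size_t outSize) {
    assert(s != NULL && out != NULL);
    static const char digits[] = "0123456789ABCDEF";
    size_t len = strlen(s);
    // Two digits per byte, a space between bytes, and the terminating zero.
    size_t needed = (len == 0) ? 1 : len * 3;
    if (outSize < needed) return false;
    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t b = (uint8_t) s[i];      // unsigned, so values above 127 print correctly
        if (i > 0) out[o++] = ' ';
        out[o++] = digits[b >> 4];
        out[o++] = digits[b & 0x0F];
    }
    out[o] = '\0';
    return true;
}
```

Feeding it the string "a", 0xC3, 0xA9, "bc" gives the Mega line shown above, and "a", 0xE9, "bc" gives the UNO/Nano line. On a board the result can be sent out with Serial.println().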
    

   Finally the accented char has to be mapped to a bit-map found in the font table used by the TFT screen.   The
Adafruit 256-char font table has about 30 of the most commonly-used accented chars that are found in about 99%
of French words.  The big switch statement replaces the char's UTF-8 value with the index of the matching font bit-map.

                                                                                                                
   for (uint8_t index = 0; expressionBuffer[index] != 0; index++) {  // examine each char in the expression string
       uint8_t myOneChar = (uint8_t) expressionBuffer[index];        // convert the string char to an unsigned byte
       if (myOneChar < 128) tft.write(myOneChar);                    // TFT displays the char of ASCII# myOneChar
       else {
              if (myOneChar == 0xc3) {    // this is the Mega2560 version. I don't know why the
                   index++;               //  Mega version writes different numbers for accented chars.
                   myOneChar = (uint8_t) expressionBuffer[index] + 0x40;
              }                                                                                                
                                                                                                               
              switch (myOneChar) {  // myOneChar is 128 or more: check if it is on the list.
                 case 0xe9 :         // If it is, then display instead the alternate char from the font table.
                   tft.write(char(130)); break;  // e aigu  lower_case
                 case 0xc9 :
                   tft.write(char(144)); break;  // e aigu  upper_case (code page 437 has this one at 144)
                 case 0xe0 :
                   tft.write(char(133)); break;  // a grave  lower_case
                 case 0xc0 :
                   tft.write(char(133)); break;  // a grave  upper_case {to lower case}
                 case 0xe8 :
                   tft.write(char(138)); break;  // e grave  lower_case
                 case 0xc8 :
                   tft.write(char(138)); break;  // e grave  upper_case {to lower case; not in code page 437}
                 case 0xf9 :
                   tft.write(char(151)); break;  // u grave  lower_case
                 case 0xd9 :
                   tft.write(char(151)); break;  // u grave  upper_case {to lower case}
                 case 0xe2 :
                   tft.write(char(131)); break;  // a circumflex lower_case
                 case 0xea :
                   tft.write(char(136)); break;  // e circumflex lower_case
                 case 0xca :
                   tft.write(char(136)); break;  // e circumflex upper_case {to lower case}
                 case 0xee :
                   tft.write(char(140)); break;  // i circumflex lower_case
                 case 0xf4 :
                   tft.write(char(147)); break;  // o circumflex lower_case
                 case 0xfb :
                   tft.write(char(150)); break;  // u circumflex lower_case
                 case 0xe7 :
                   tft.write(char(135)); break;  // c cedilla lower_case
                 case 0xc7 :
                   tft.write(char(128)); break;  // c cedilla upper_case
                 default :
                   tft.print(F(" -accent- ")); break;  // not on the list: show a visible placeholder
               }  // switch
       }  // else                                                                                              
   }  // for (index=...            
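
One way to read the 0x40 offset, offered as a sketch and not as a confirmed diagnosis of the two boards: 0xC3 0xA9 is the standard two-byte UTF-8 encoding of é (U+00E9), while 0xE9 is the same character as a single Latin-1 byte. Decoding the two-byte form gives 0xC0 + (second byte & 0x3F), which is the same as adding 0x40 to a second byte in the range 0x80..0xBF. The helper name nextLatin1 is made up for this example.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: returns the character at s[*pos] as a value 0..255 and
// advances *pos past it. Accepts both two-byte UTF-8 and single-byte Latin-1.
// The caller is expected to stop at the terminating zero.
uint8_t nextLatin1(const char *s, size_t *pos) {
    assert(s != NULL && pos != NULL);
    uint8_t lead = (uint8_t) s[*pos];
    uint8_t next = (lead != 0) ? (uint8_t) s[*pos + 1] : 0;
    // Two-byte UTF-8 is 110xxxxx 10yyyyyy; leads 0xC2 and 0xC3 cover U+0080..U+00FF.
    if ((lead == 0xC2 || lead == 0xC3) && (next & 0xC0) == 0x80) {
        *pos += 2;
        return (uint8_t) (((lead & 0x1F) << 6) | (next & 0x3F));
    }
    *pos += 1;
    return lead;   // plain ASCII, or a byte that is already Latin-1
}
```

With a decoder like this in front of it, the same switch statement can serve either byte layout without a board-specific branch.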
   
   
   All in all, it was a lot of work for a character conversion that should be easily done in the background. I suppose that
 this would have been fixed (or made easier) long ago if English had lots of accented characters.  C'est la vie.

 


Are there really no translating libraries for Unicode?

- John

Last Edited: Thu. Dec 22, 2016 - 09:39 AM

Many hits using "arduino unicode translation library" with Google.

 

Ross McKenzie ValuSoft Melbourne Australia


Why not just use a table with 128 bytes (use char - 128 as the index) that holds the translated char for each entry, and either the same number as the index or 0 for those you don't know?
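
A sketch of that table idea, assuming the input bytes are Latin-1 codes and the font follows code page 437 for the entries listed. The names latin1ToFontTable and latin1ToFont are made up for this example, upper-case letters missing from code page 437 fall back to lower case where the first post did the same, and 0 marks "unknown". On a real board the table would probably go in PROGMEM.

```cpp
#include <assert.h>
#include <stdint.h>

// Index is (Latin-1 code - 128); value is the assumed code page 437 font index, 0 = unknown.
const uint8_t latin1ToFontTable[128] = {
    /* 0x80 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    /* 0x90 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    /* 0xA0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 174, 0, 0, 0, 0,   // 0xAB «
    /* 0xB0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 175, 0, 0, 0, 0,   // 0xBB »
    /* 0xC0 */ 133, 0, 0, 0, 142, 143, 146, 128, 138, 144, 136, 0, 0, 0, 0, 0,
    /* 0xD0 */ 0, 165, 0, 0, 0, 0, 153, 0, 0, 151, 0, 0, 154, 0, 0, 0,
    /* 0xE0 */ 133, 160, 131, 0, 132, 134, 145, 135, 138, 130, 136, 137, 141, 161, 140, 139,
    /* 0xF0 */ 0, 164, 149, 162, 147, 0, 148, 0, 0, 151, 163, 150, 129, 0, 0, 152
};

// Hypothetical helper: ASCII passes through unchanged, anything else goes
// through the table. A result of 0 means the font has no matching glyph.
uint8_t latin1ToFont(uint8_t c) {
    if (c < 128) return c;
    return latin1ToFontTable[c - 128];
}
```

The caller can then print a placeholder whenever the result is 0, much as the default branch of the switch statement does.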


It appears Adafruit use Code Page 437, also known as "OEM" https://en.wikipedia.org/wiki/Co.... I use a text editor that supports that (PsPad). However, my keyboard generates characters from ISO8859-1 or similar.

 

Unfortunately it is hard to get away from doing some conversion. UTF-8 seems to be commonly supported in editors, although it may not be the best choice for small embedded systems.

Bob.


The Adafruit_GFX library defaults to the monospaced 7x5 "System font".   However it now supports a whole range of "Free" fonts.   Both monospaced and proportional.

 

It all depends on your target audience.    Either stick with the 7x5 and map characters to suit.   Or use a compliant Free font.

 

There is a similar situation with 16x2 LCDs.   The typical inbuilt ROM has several characters in the 128..255 range.    You have to map these too.

 

Life is simpler with a "smallish" font.    Full International character sets would be overkill for most AVR apps.

 

David.


Having developed many Europe-wide products over the years, I'd say that if you are limited to 8-bit symbols then Latin-1 is the best compromise.


What kind of font size are we talking about?

 

In the old days with an 8x8 font, our Å had to use a smaller A than normal, etc.

 

I guess that today, with graphical displays, I would draw all the ü â ã å etc. as an overlay on a normal letter, so the font itself could be rather small.


I didn't know about the "code page 437 OEM" name of the Latin Majuscule-C-cedilla format.  The Adafruit 7x5 font supports this format.  A custom font would also work, but I want something that is Arduino-simple (i.e. understandable by beginners/lay people who want to get a job done quickly and easily).  The Latin-1 format has the most Euro language chars, but their order is different from Page_437, and they would display correctly neither in the Arduino IDE nor in most text editors.

 

  A 128-byte table and supporting code would be overly elaborate, because about 70 of the chars are either "pretty print" formatting or Greek symbols. I think that for French accented chars, the switch statement is the best approach.

 

  Nearly all the custom fonts outside of the 7x5 place the bit-map pixels left to right, top to bottom.  This is different from the Arduino font, which runs from top-left to bottom-left, one column at a time, so that the last pixel displayed is the one in the lower-right corner.  Most of the Adafruit-based TFT libraries use this column-by-column placement for chars: a command for a one-pixel-sized window is written to the TFT controller, and then one pixel is stuffed into this window.  Most TFT bit-map routines instead write the pixels left to right, top to bottom: a single command to establish the window size gets sent first, then all the pixels are stuffed into this window.  So a 7x5 bit-map gets written twice as fast as a 7x5 character.  It makes little difference in the real world, because few apps write full pages of characters.  However, using available bit-map fonts means that the TFT library code has to be studied and modified, which is outside of the Arduino-simple level that I would like to keep all my code in.
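
To make the two pixel orders concrete, here is a small sketch that converts a column-by-column glyph into row-by-row bytes. It assumes one byte per column with bit 0 as the top pixel; the glyph data and the name columnsToRows are written for this example and are not taken from any library's font file.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Illustrative 5-column glyph shaped like 'A'; bit 0 of each byte is the top pixel.
const uint8_t glyphA[5] = { 0x7C, 0x12, 0x11, 0x12, 0x7C };

// Converts a column-major 5-wide, 7-high glyph into 7 row bytes.
// In each row byte the leftmost pixel ends up in bit 4, the rightmost in bit 0.
void columnsToRows(const uint8_t cols[5], uint8_t rows[7]) {
    assert(cols != NULL && rows != NULL);
    for (uint8_t y = 0; y < 7; y++) {
        uint8_t rowBits = 0;
        for (uint8_t x = 0; x < 5; x++) {
            rowBits <<= 1;
            if (cols[x] & (1u << y)) rowBits |= 1;   // pixel (x, y) is set
        }
        rows[y] = rowBits;
    }
}
```

Converting the font data once, ahead of time, is one possible way to reuse a row-ordered drawing routine without touching the library's own character code.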

 

 

Hopefully this information will be accessible through a Google search to others in the future needing an easy solution to the Arduino accented chars problem.

 

Last Edited: Fri. Dec 23, 2016 - 09:46 PM

As I say Latin-1 (ISO-8859-1) is the best compromise for Euro languages. A decent editor will let you type using it. Another good one is ISO-8859-15. 

 

The only thing that matters is that your source strings match the final output display device in terms of character ordering. If necessary the source can use things like

 

#define F_CEDILLA "\xB7" 

 

then 

 

#define MY_STRING "Woo" F_CEDILLA

 

etc. 
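
A slightly fuller sketch of the same macro idea, assuming the display font uses Latin-1 ordering. The macro and variable names are made up for this example, and the byte values would need changing to match whatever font table the display really uses.

```cpp
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Assumed Latin-1 (ISO-8859-1) codes; change them to match the display's font table.
#define C_CEDILLA "\xE7"   // ç
#define E_ACUTE   "\xE9"   // é

// Adjacent string literals are joined after the escapes have been processed,
// so the hex escape cannot swallow the letters that follow it
// (written directly, "\xE9cole" would be read as one over-long escape).
const char msgFrancais[] = "fran" C_CEDILLA "ais";
const char msgEcole[]    = E_ACUTE "cole";
```

Keeping the codes in one block of defines means a change of display font only touches those lines, not every string.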


As I say Latin-1 (ISO-8859-1) is the best compromise for Euro languages.

(AKA Windows 1252 IIRC) That's what I used when I used to send some of my electronics signs to Europe.

John Samperi

Ampertronics Pty. Ltd.

www.ampertronics.com.au

* Electronic Design * Custom Products * Contract Assembly