[C] Absolute fastest way to zero/clear a large array but also save some program space by using ST instead of STS

Go To Last Post
38 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hello everyone.

 

Here is something that I found on one of my programming endeavor's. There was a task where I had to clear/fill an array with zero's as fast as possible for some refreshing purposes and to avoid a race condition. As I program only in C, I knew the fastest way would be to do this.

uint8_t data[64];

data[0] = 0;
data[1] = 0;
data[2] = 0;
data[3] = 0;
data[4] = 0;
data[5] = 0;
data[6] = 0;
data[7] = 0;
// .
// .
// .
// and so on.

My compiler was AVR-GCC. Now looking at the generated assembly in simulator.

data[0] = 0;	data[1] = 0;	data[2] = 0;	data[3] = 0;	data[4] = 0;	data[5] = 0;	data[6] = 0;	data[7] = 0;

00000054  STD Y+0,R1		Store indirect with displacement
00000055  STD Y+1,R1		Store indirect with displacement
00000056  STD Y+2,R1		Store indirect with displacement
00000057  STD Y+3,R1		Store indirect with displacement
00000058  STD Y+4,R1		Store indirect with displacement
00000059  STD Y+5,R1		Store indirect with displacement
0000005A  STD Y+6,R1		Store indirect with displacement
0000005B  STD Y+7,R1		Store indirect with displacement
// .
// .
// .
// and so on.

The compiler uses STD instruction which takes 1 word of program space and 2 clock cycles to complete. All fine until here. (this is about as fast as it can get on megaAVR)

But if I increase the array size to more than 64 bytes. Now how would complier handle this one? Because due to limitation of STD instruction, which can only add up to 63 to the address that it has.

 

 

To see let's increase the array size to 256 bytes. Here is dummy program. (I am writing statements horizontally to avoid huge scrolling)

 

static uint8_t data[256];

int main(void)
{
    while (1)
    {
        for(uint16_t x = 0; x < 256; x++)	// Filling the array with 0xFF
        {
            data[x] = 0xff;
        }

        data[0] = 0;	data[1] = 0;	data[2] = 0;	data[3] = 0;	data[4] = 0;	data[5] = 0;	data[6] = 0;	data[7] = 0;
        data[8] = 0;	data[9] = 0;	data[10] = 0;	data[11] = 0;	data[12] = 0;	data[13] = 0;	data[14] = 0;	data[15] = 0;
        data[16] = 0;	data[17] = 0;	data[18] = 0;	data[19] = 0;	data[20] = 0;	data[21] = 0;	data[22] = 0;	data[23] = 0;
        data[24] = 0;	data[25] = 0;	data[26] = 0;	data[27] = 0;	data[28] = 0;	data[29] = 0;	data[30] = 0;	data[31] = 0;
        data[32] = 0;	data[33] = 0;	data[34] = 0;	data[35] = 0;	data[36] = 0;	data[37] = 0;	data[38] = 0;	data[39] = 0;
        data[40] = 0;	data[41] = 0;	data[42] = 0;	data[43] = 0;	data[44] = 0;	data[45] = 0;	data[46] = 0;	data[47] = 0;
        data[48] = 0;	data[49] = 0;	data[50] = 0;	data[51] = 0;	data[52] = 0;	data[53] = 0;	data[54] = 0;	data[55] = 0;
        data[56] = 0;	data[57] = 0;	data[58] = 0;	data[59] = 0;	data[60] = 0;	data[61] = 0;	data[62] = 0;	data[63] = 0;

        data[64] = 0;	data[65] = 0;	data[66] = 0;	data[67] = 0;	data[68] = 0;	data[69] = 0;	data[70] = 0;	data[71] = 0;
        data[72] = 0;	data[73] = 0;	data[74] = 0;	data[75] = 0;	data[76] = 0;	data[77] = 0;	data[78] = 0;	data[79] = 0;
        data[80] = 0;	data[81] = 0;	data[82] = 0;	data[83] = 0;	data[84] = 0;	data[85] = 0;	data[86] = 0;	data[87] = 0;
        data[88] = 0;	data[89] = 0;	data[90] = 0;	data[91] = 0;	data[92] = 0;	data[93] = 0;	data[94] = 0;	data[95] = 0;
        data[96] = 0;	data[97] = 0;	data[98] = 0;	data[99] = 0;	data[100] = 0;	data[101] = 0;	data[102] = 0;	data[103] = 0;
        data[104] = 0;	data[105] = 0;	data[106] = 0;	data[107] = 0;	data[108] = 0;	data[109] = 0;	data[110] = 0;	data[111] = 0;
        data[112] = 0;	data[113] = 0;	data[114] = 0;	data[115] = 0;	data[116] = 0;	data[117] = 0;	data[118] = 0;	data[119] = 0;
        data[120] = 0;	data[121] = 0;	data[122] = 0;	data[123] = 0;	data[124] = 0;	data[125] = 0;	data[126] = 0;	data[127] = 0;

        data[128] = 0;	data[129] = 0;	data[130] = 0;	data[131] = 0;	data[132] = 0;	data[133] = 0;	data[134] = 0;	data[135] = 0;
        data[136] = 0;	data[137] = 0;	data[138] = 0;	data[139] = 0;	data[140] = 0;	data[141] = 0;	data[142] = 0;	data[143] = 0;
        data[144] = 0;	data[145] = 0;	data[146] = 0;	data[147] = 0;	data[148] = 0;	data[149] = 0;	data[150] = 0;	data[151] = 0;
        data[152] = 0;	data[153] = 0;	data[154] = 0;	data[155] = 0;	data[156] = 0;	data[157] = 0;	data[158] = 0;	data[159] = 0;
        data[160] = 0;	data[161] = 0;	data[162] = 0;	data[163] = 0;	data[164] = 0;	data[165] = 0;	data[166] = 0;	data[167] = 0;
        data[168] = 0;	data[169] = 0;	data[170] = 0;	data[171] = 0;	data[172] = 0;	data[173] = 0;	data[174] = 0;	data[175] = 0;
        data[176] = 0;	data[177] = 0;	data[178] = 0;	data[179] = 0;	data[180] = 0;	data[181] = 0;	data[182] = 0;	data[183] = 0;
        data[184] = 0;	data[185] = 0;	data[186] = 0;	data[187] = 0;	data[188] = 0;	data[189] = 0;	data[190] = 0;	data[191] = 0;

        data[192] = 0;	data[193] = 0;	data[194] = 0;	data[195] = 0;	data[196] = 0;	data[197] = 0;	data[198] = 0;	data[199] = 0;
        data[200] = 0;	data[201] = 0;	data[202] = 0;	data[203] = 0;	data[204] = 0;	data[205] = 0;	data[206] = 0;	data[207] = 0;
        data[208] = 0;	data[209] = 0;	data[210] = 0;	data[211] = 0;	data[212] = 0;	data[213] = 0;	data[214] = 0;	data[215] = 0;
        data[216] = 0;	data[217] = 0;	data[218] = 0;	data[219] = 0;	data[220] = 0;	data[221] = 0;	data[222] = 0;	data[223] = 0;
        data[224] = 0;	data[225] = 0;	data[226] = 0;	data[227] = 0;	data[228] = 0;	data[229] = 0;	data[230] = 0;	data[231] = 0;
        data[232] = 0;	data[233] = 0;	data[234] = 0;	data[235] = 0;	data[236] = 0;	data[237] = 0;	data[238] = 0;	data[239] = 0;
        data[240] = 0;	data[241] = 0;	data[242] = 0;	data[243] = 0;	data[244] = 0;	data[245] = 0;	data[246] = 0;	data[247] = 0;
        data[248] = 0;	data[249] = 0;	data[250] = 0;	data[251] = 0;	data[252] = 0;	data[253] = 0;	data[254] = 0;	data[255] = 0;

        for(uint16_t x = 0; x < 256; x++)	// Ignore, this is just so that compiler won't optimize everything away.
        {
            GPIOR0 = data[x];
        }
    }
}

And the assembly :-

data[56] = 0;	data[57] = 0;	data[58] = 0;	data[59] = 0;	data[60] = 0;	data[61] = 0;	data[62] = 0;	data[63] = 0;

000000B9  STD Y+56,R1		Store indirect with displacement
000000BA  STD Y+57,R1		Store indirect with displacement
000000BB  STD Y+58,R1		Store indirect with displacement
000000BC  STD Y+59,R1		Store indirect with displacement
000000BD  STD Y+60,R1		Store indirect with displacement
000000BE  STD Y+61,R1		Store indirect with displacement
000000BF  STD Y+62,R1		Store indirect with displacement
000000C0  STD Y+63,R1		Store indirect with displacement 

data[64] = 0;	data[65] = 0;	data[66] = 0;	data[67] = 0;	data[68] = 0;	data[69] = 0;	data[70] = 0;	data[71] = 0;

000000C1  MOVW R26,R12		Copy register pair
000000C2  ST X,R1		Store indirect
000000C3  MOVW R26,R14		Copy register pair
000000C4  ST X,R1		Store indirect
000000C5  MOVW R26,R16		Copy register pair
000000C6  ST X,R1		Store indirect
000000C7  MOVW R26,R22		Copy register pair
000000C8  ST X,R1		Store indirect
000000C9  MOVW R26,R20		Copy register pair
000000CA  ST X,R1		Store indirect
000000CB  MOVW R26,R18		Copy register pair
000000CC  ST X,R1		Store indirect
000000CD  MOVW R26,R10		Copy register pair
000000CE  ST X,R1		Store indirect
000000CF  MOVW R26,R8		Copy register pair
000000D0  ST X,R1		Store indirect
// .
// .
// .
// and so on.

This program on ATmega328P takes :-

Notice how after clearing first 64 bytes the compiler goes bonkers and generates messy assembly (and we loose our taking 2 clock cycles, clearing per byte). And this is with optimization level -O1

Let's bump up the optimization to -O3 and see.

 

data[56] = 0;	data[57] = 0;	data[58] = 0;	data[59] = 0;	data[60] = 0;	data[61] = 0;	data[62] = 0;	data[63] = 0;

000000C0  STS 0x0138,R1		Store direct to data space
000000C2  STS 0x0139,R1		Store direct to data space
000000C4  STS 0x013A,R1		Store direct to data space
000000C6  STS 0x013B,R1		Store direct to data space
000000C8  STS 0x013C,R1		Store direct to data space
000000CA  STS 0x013D,R1		Store direct to data space
000000CC  STS 0x013E,R1		Store direct to data space
000000CE  STS 0x013F,R1		Store direct to data space

data[64] = 0;	data[65] = 0;	data[66] = 0;	data[67] = 0;	data[68] = 0;	data[69] = 0;	data[70] = 0;	data[71] = 0;

000000D0  STS 0x0140,R1		Store direct to data space
000000D2  STS 0x0141,R1		Store direct to data space
000000D4  STS 0x0142,R1		Store direct to data space
000000D6  STS 0x0143,R1		Store direct to data space
000000D8  STS 0x0144,R1		Store direct to data space
000000DA  STS 0x0145,R1		Store direct to data space
000000DC  STS 0x0146,R1		Store direct to data space
000000DE  STS 0x0147,R1		Store direct to data space

data[72] = 0;	data[73] = 0;	data[74] = 0;	data[75] = 0;	data[76] = 0;	data[77] = 0;	data[78] = 0;	data[79] = 0;

000000E0  STS 0x0148,R1		Store direct to data space
000000E2  STS 0x0149,R1		Store direct to data space
000000E4  STS 0x014A,R1		Store direct to data space
000000E6  STS 0x014B,R1		Store direct to data space
000000E8  STS 0x014C,R1		Store direct to data space
000000EA  STS 0x014D,R1		Store direct to data space
000000EC  STS 0x014E,R1		Store direct to data space
000000EE  STS 0x014F,R1		Store direct to data space
// .
// .
// .
// and so on.

And this takes even more memory :-

 

The compiler just went STS all the way. Notice now STS takes 2 words of program memory and 2 clock cycles.

 

This got me thinking that I should use the ST instruction. Since we are just placing zero in consecutive memory addresses. That would be perfect. Why compiler is not using ST instead of STS? And the good thing about it that it also takes 2 clock cycles but takes 1 word of program memory. It's a win win situation.

 

It's equivalent of doing STD instruction over the whole array which that compiler can't do. (1 word instruction for clearing 1 byte in SRAM in 2 clock cycles

I tried writing code in every different way I could in C, so that compiler would use the ST instruction. But nope, that won't happen.

Then finally I threw some inline assembly in C. (never written assembly before and hoped for the best)

static uint8_t data[256];

int main(void)
{
    while (1)
    {
        for(uint16_t x = 0; x < 256; x++)
        {
            data[x] = 0xff;
        }

        __asm__ __volatile__(
        "push r30 \n\t"
        "push r31 \n\t"
        "push r16 \n\t"
        "ldi r30,lo8(data) \n\t"
        "ldi r31,hi8(data) \n\t"
        "ldi r16,0 \n\t"

        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"

        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"

        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"

        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"
        "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t" "st z+,r16 \n\t"

        "pop r16 \n\t"
        "pop r30 \n\t"
        "pop r31 \n\t"
        );

        for(uint16_t x = 0; x < 256; x++)   // Ignore, this is just so that compiler won't optimize everything away.
        {
            GPIOR0 = data[x];
        }
    }
}

And this takes :-

 

732 bytes cheeky. That's so much better than before. So I think compiler does not like generating assembly code where there is a very long repetition of same instruction even though knowing that, it is best for speed and also saving some space at the same time. 

 

“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?” - Brian W. Kernighan

Last Edited: Sat. May 22, 2021 - 06:40 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Go on.  You simply ST Z+,R1 in a SUBI.W loop.

If time is really important you unroll 8 ST instructions in the main loop.   And finish with the odd count in a regular single ST loop.

 

No,  I have not checked the AVR syntax.   But the principle is there e.g.

single loop is 6 cycles per byte.  The unrolled loop is 20 cycles for 8 bytes (2.5 cycles per byte)

 

There is little to be gained by doing an 16 unroll (2.25 cycles).  The main gain is 6 cycles vs 2.5 cycles. 

 

I have no idea what all your massive arrays are doing.

 

Hint.  I have not looked.  But the crts.o is going to zero the BSS.   Why not look to see how the crts.S does it.

 

Oh,  you can gain a little more by doing an H-L nested loop.   Decrementing a single register is 1 cycle vs 2 cycles for a register-pair.

 

David. 

Last Edited: Fri. May 21, 2021 - 03:35 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
void * memset ( void *  dest,
    int  val,
    size_t  len 
  )    

Fill memory with a constant byte.

The memset() function fills the first len bytes of the memory area pointed to by dest with the constant byte val.

Returns

The memset() function returns a pointer to the memory area dest.

 

Source: AVR Libc

   memset(data, 0, sizeof(data));

verify syntax, but it will be some thing like that.

 

Edit: syntax corrected

 

 

Keys to wealth:

Invest for cash flow, not capital gains!

Wealth is attracted, not chased! 

Income is proportional to how many you serve!

Lets go Brandon!

Last Edited: Fri. May 21, 2021 - 06:40 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 2

Yes,  and I bet that memset() will have been written for health and efficiency.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If you use r1 which is a zero register, r16 is unnecessary.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

might eve be used by the  crts.o ... ?

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillatorhttps://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Does it really need to be all set to zeros ... ?

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillatorhttps://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

awneil wrote:
Does it really need to be all set to zeros ... ?

OP showed:

data[x] = 0xff;

 

 

 

Keys to wealth:

Invest for cash flow, not capital gains!

Wealth is attracted, not chased! 

Income is proportional to how many you serve!

Lets go Brandon!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

david.prentice wrote:
SUBI.W loop.

David you just lost me here. Are you suggesting that there is even a better way? We talking absolute fastest way down to every last clock cycle.

ki0bk wrote:
The memset() function

Memset is slow. We are looking at sheer speed.

kabasan wrote:
If you use r1 which is a zero register, r16 is unnecessary.

Does r1 stays zero in a real program? I guess it's not. This just a subroutine, that can be called at any point in program to clear your array real fast with taking minimal space.

awneil wrote:
might eve be used by the  crts.o ... ?

What is crts.o?

“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?” - Brian W. Kernighan

Last Edited: Fri. May 21, 2021 - 03:50 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

awneil wrote:
Does it really need to be all set to zeros ... ?

Program is for demonstration purpose.

“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?” - Brian W. Kernighan

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I forgot to mention this can be a subroutine for when you have to clear your array repeatedly in a program. (I am sure 99.9% don't need it). It's not for one time thing.

“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?” - Brian W. Kernighan

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The obvious answer is to use memset() i.e. optimised Assembly code.

 

I am taking my temporary dog out.

 

I bet that you can write the loop in C.   And GCC will generate an almost perfect result.

 

Hey-ho.  I might try it when we get home.

 

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

david.prentice wrote:
I bet that you can write the loop in C.   And GCC will generate an almost perfect result.

That's what I died trying.

 

The problem is only with arrays whos size is larger than 64 bytes and you just want put 0 in all the bytes as fast as possible.

No matter what I do in C, the complier will always use STS instruction (which is perfectly fine for speed) but not use ST instruction.

 

The difference is that they both take 2 clock cycles but STS take 2 words of program space and ST takes 1 word of program space. (meaning there is room for improvement in saving space department, and our speed will not be affected)

 

The compiler is not realizing that even when writing code for absolute speed he can also save space by just using ST instruction consecutively instead of STS.

“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?” - Brian W. Kernighan

Last Edited: Fri. May 21, 2021 - 04:33 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I don't get it. Why do you think that the obvious way (memset) wouldn't also be the fastest way? Don't you trust the libc authors?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

clawson wrote:
I don't get it. Why do you think that the obvious way (memset) wouldn't also be the fastest way? Don't you trust the libc authors?

Let me check, if I am doing something wrong here.

“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?” - Brian W. Kernighan

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

clawson wrote:
I don't get it. Why do you think that the obvious way (memset) wouldn't also be the fastest way? Don't you trust the libc authors?

Yep, memset is way slower. memset is optimized for space.

 

To clear a 256 byte array memset will take +1500 clock cycles on any optimization level.

But that assembly subroutine will do that for you in 529 clock cycles.

“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?” - Brian W. Kernighan

Last Edited: Fri. May 21, 2021 - 04:55 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

awneil wrote:

Does it really need to be all set to zeros ... ?

Heisen wrote:
Program is for demonstration purpose.

That doesn't answer the question!

 

If the setting to zeros can be dispensed with, then the whole question of how to set to zeros - quickly or otherwise - becomes moot.

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillatorhttps://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Heisen wrote:
I had to clear/fill an array with zero's

awneil wrote:
Does it really need to be all set to zeros ... ?

ki0bk wrote:
OP showed:

data[x] = 0xff;

I think that's just as a non-zero starting value - to prove that the code actually has set everything to zero?

 

 

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillatorhttps://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

OK - So memset seems to be 5 cycles per byte if you use an 8-bit count. If you want to improve on that, your assembly code in #1 is probably the way to go.

I would probably write 16 or 32  st X+,__zero_reg__ instructions in a row and loop over those to avoid wasting flash.

 

Now what happens to this array after you've cleared it ?

I'll put money on it taking significantly longer to re-populate the array than to clear it using memset.

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

awneil wrote:

awneil wrote:

Does it really need to be all set to zeros ... ?

Heisen wrote:

Program is for demonstration purpose.

That doesn't answer the question!

In my niche app it was necessary to clear a large array quickly. 

awneil wrote:
If the setting to zeros can be dispensed with, then the whole question of how to set to zeros - quickly or otherwise - becomes moot.

Yeah I guess, nobody will need it, like ever, it's already moot. I was just showing what I found. The STS and ST thing.

awneil wrote:
I think that's just as a non-zero starting value - to prove that the code actually has set everything to zero?

True.

“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?” - Brian W. Kernighan

Last Edited: Fri. May 21, 2021 - 05:52 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

Memset source is here...
.
http://svn.savannah.gnu.org/viewvc/avr-libc/trunk/avr-libc/libc/string/memset.S?revision=2515&view=markup
.
Note that there are two variants one is OPTIMIZE_FOR_SPEED. Perhaps libc is not built with that setting? It should be possible to pull the source, add it to your project with that build option set. I think the lib version is weak yours should replace it. Also I think the implied -lc is last anyway but failing all else you should be able to do some -u magic I think.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

N.Winterbottom wrote:
I'll put money on it taking significantly longer to re-populate the array than to clear it using memset.

that's kind of what I'm getting at with the "why does it need to be all set to zero?" question ...

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillatorhttps://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

N.Winterbottom wrote:
I would probably write 16 or 32  st X+,__zero_reg__ instructions in a row and loop over those to avoid wasting flash.

+1 I like this idea.

N.Winterbottom wrote:
Now what happens to this array after you've cleared it ?

awneil wrote:
that's kind of what I'm getting at with the "why does it need to be all set to zero?" question ...

Lets say you have 16x16 pixel image. Total pixels 256.

 

Lets assume you are generating new images on the fly based on data the MCU receives or calculates from within or combination of both or whatever (meaning that these images are not prestored) and it takes time to generate data for those images.

 

To keep refreshing the image with new data, 

 

One approach is to update all the pixels every time. i.e write 256 bytes.
This means that there will be 256 pixel writes that will happen every time, which also means 256 pixels will be generated every time. 
Total time comes out to be (time taken to generate 256 pixels + time taken to write those 256 pixels).

 

Second approach is to write only the pixels that are necessary.
How can that be? Assume if in the image where there is nothing we set that pixel to zero.

In the very first step, we set everything to zero by the help of our quick routine in assembly, which will take N fixed amount of cycles every time. Now we only have to generate those pixels which are needed and update them.

Total time comes out to be (time taken to erase 256 pixels(which will be quick because assembly) + time taken to generate only the needed N pixels + time taken to write only the needed N pixels) Now N can be lower than 256, as compared to approach one. This means less time spent generating and writing pixels.

 

And if you are not updating every single pixel every time meaning data suggests there is no need too, because most of time a lot of pixels stay the same (background ones) or some of them, second approach can be faster. (resulting in higher refresh rate)

It's very dependent on application but now you know the context of clearing the array real fast. Some times it can be the way to go.

“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?” - Brian W. Kernighan

Last Edited: Fri. May 21, 2021 - 07:24 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

static uint8_t data[256];
should cause data to be initialized to zeros.
I'd expect crts.o to use a 5 or 6 cycle loop.
At startup, that should be fast enough.

6 cycles per byte is the slowest for which there is any excuse.
If straightforward C will not give you that and you want something faster,
look at this:
 

// assumes sizeof(data) a multiple of 8
// if not, at most 7 additional assignments are required.
uint16_t j;
uint8_t counterhi, counterlo;

for(j=0, counterhi=0, counterlo=...; counterhi< ...; ++counterhi) {
    do {
        data[j++]=0;
        data[j++]=0;
        data[j++]=0;
        data[j++]=0;
        data[j++]=0;
        data[j++]=0;
        data[j++]=0;
        data[j++]=0;
    } while(0 != (counterlo+=...)) ;
}

The ...'s are left as exercises for the reader.

The inner loop should be 2.38 cycles per byte.

If not, the code should not be hard to hand compile.
I think 15 explicit inline assembly instruction in total.
cc should be given as a clobber.
Initialization of the counterhi, counterlo and data+j
registers should be handled by the inline assembly
syntax that allows one to let the compiler choose registers.

 

Edit:

The outer loop is probably unnecessary.

Up to 8*256=2K can be handled without it.

Moderation in all things. -- ancient proverb

Last Edited: Fri. May 21, 2021 - 07:25 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Well - now you have come into an area I can probably speak with some authority about.  Pixels on AVRs.

 

As others have suggested partially unrolled loops to trade speed for size.

 

However I am going to offer a 3rd approach that MIGHT be suitable if your ratio of set pixels to unset pixels is low (ie. a mostly black screen)

 

Keep a list of those pixels that are set and then only blank out them.

 

Also I am assuming your pixel array is not really 16x16 or the speed of clearing doing it the slowest way you can think of should not be a problem.  You are not prematurely optimizing are you?

 

 

Heisen wrote:

N.Winterbottom wrote:
I would probably write 16 or 32  st X+,__zero_reg__ instructions in a row and loop over those to avoid wasting flash.

+1 I like this idea.

N.Winterbottom wrote:
Now what happens to this array after you've cleared it ?

awneil wrote:
that's kind of what I'm getting at with the "why does it need to be all set to zero?" question ...

Lets say you have 16x16 pixel image. Total pixels 256.

 

Lets assume you are generating new images on the fly based on data the MCU receives or calculates from within or combination of both or whatever (meaning that these images are not prestored) and it takes time to generate data for those images.

 

To keep refreshing the image with new data, 

 

One approach is to update all the pixels every time. i.e write 256 bytes.
This means that there will be 256 pixel writes that will happen every time, which also means 256 pixels will be generated every time. 
Total time comes out to be (time taken to generate 256 pixels + time taken to write those 256 pixels).

 

Second approach is to write only the pixels that are necessary.
How can that be? Assume if in the image where there is nothing we set that pixel to zero.

In the very first step, we set everything to zero by the help of our quick routine in assembly, which will take N fixed amount of cycles every time. Now we only have to generate those pixels which are needed and update them.

Total time comes out to be (time taken to erase 256 pixels(which will be quick because assembly) + time taken to generate only the needed N pixels + time taken to write only the needed N pixels) Now N can be lower than 256, as compared to approach one. This means less time spent generating and writing pixels.

 

And if you are not updating every single pixel every time meaning data suggests there is no need too, because most of time a lot of pixels stay the same (background ones) or some of them, second approach can be faster. (resulting in higher refresh rate)

It's very dependent on application but now you know the context of clearing the array real fast. Some times it can be the way to go.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

My apologies.   I walked the dog.   Saw that memset() was not perfect.   Went out on my bike.   Very windy.

 

Anyway.   I wrote a lttle program and simulated on a mega328P.   BSS loop is

0000003E 1d.92                ST X+,R1		Store indirect and postincrement 
0000003F a4.30                CPI R26,0x04		Compare with immediate 
00000040 b2.07                CPC R27,R18		Compare with carry 
00000041 e1.f7                BRNE PC-0x03		Branch if not equal 

memset() is inlined as

00000057 1d.92                ST X+,R1		Store indirect and postincrement 
00000058 21.50                SUBI R18,0x01		Subtract immediate 
00000059 30.40                SBCI R19,0x00		Subtract immediate with carry 
0000005A e1.f7                BRNE PC-0x03		Branch if not equal 

and my regular C is generated as

0000005F 81.93                ST Z+,R24		Store indirect and postincrement 
00000060 e2.17                CP R30,R18		Compare 
00000061 f3.07                CPC R31,R19		Compare with carry 
00000062 e1.f7                BRNE PC-0x03		Branch if not equal 

So they all produce 6-cycle loops.   Stopping at each NOP,  memset() was 1539 cycles.  my C code was 1538 cycles.

 

Yes,  you could split into a H-L nested loop with 8-bit decrement i.e. 5-cycle

And you can unroll several ST statements which would reduce the effect of the loop control statements.

 

My code is naughty in that it only works for 1 to 256 loops.   I could add an H outer loop to give a full 1 to 65536 range.   Not that any AVR has 64k SRAM.

 

I think that I have proved the point that the Compiler will generate X+ or Z+ statements.   The loop-control becomes insignificant when you unroll 8 or more Z+.

 

#define F_CPU 16000000
#include <avr/io.h>
#include <string.h>  //for memset()

volatile char david[260];  //in BSS

int main(void)
{
    asm("nop");
    david[0] = 'D';
    david[1] = 'P';
    david[256] = 0x55;
    asm("nop");
    memset(david, 0, 256);
    asm("nop");
    char *p = david;
    uint8_t count = 0, fill = 0xAA;
    do {
        *p++ = fill;
    }while(--count);
    asm("nop");
    while (1)
    {
    }
}

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

david.prentice wrote:
So they all produce 6-cycle loops.   Stopping at each NOP,  memset() was 1539 cycles.  my C code was 1538 cycles.

My old Imagecraft CC is similar using memset()

    00073 9321      ST	Z+,R18
    00074 9701      SBIW	R24,1
    00075 F7E9      BNE	0x0073

total cycle count was1561

 

Jim

 

Keys to wealth:

Invest for cash flow, not capital gains!

Wealth is attracted, not chased! 

Income is proportional to how many you serve!

Lets go Brandon!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Don't forget to put an external system clock function generator on the AVR to determine the fastest overclocking speed for this individual board.  If you can run a 20MHz AVR at 24MHz without error then you've got another 20% reduction in the time needed to fill the array.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

-- I'd still like to hear more about this absolute need for speed.  What are we racing?  [race was mentioned] 

 

-- What AVR model are we dealing with?  Respondents use a "generic" AVR8 such as '328.  But those models don't have any alternate data path to independently e.g. fill this mysterious buffer.

 

-- If indeed this is the critical part of the application, then pick a more appropriate model?  Xmega e.g. have DMA.  A double-buffered DMA can have the "spare" buffer already cleared.  Alternatively, it can fill quite quickly from a source.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Last Edited: Sat. May 22, 2021 - 12:34 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I simulated some XMEGA code using DMA and results aren't stellar. The best I could get was 5 cycles per byte cleared. The DMA controller throughput is limited by the CPU instruction fetch. It should be quicker using Idle sleep instead of busy wait, but the simulator doesn't handle sleep.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

At 20MHz, if one could see 1000fps, you still have 20,000 clocks to 'generate' every frame. For something more reasonable like 20fps you have 1million instructions to generate each frame. Whatever you do to refresh the display and at what rate is a separate unrelated matter.

 

You would also double buffer so the refresh is not showing the 'drawing' frame. I would imagine you could switch the buffers from 256 bytes to 32 bytes (1bit per pixel) if not enough ram for 512 bytes. Dealing with pixels in bit positions is not a big deal for either the drawing or refresh function. With the smaller buffers, you could also have many frames ready to go in advance (circular frame buffer), although probably would never even get to a point where the second buffer is not ready in time.

 

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

The way I see it, don't clear anything, double-buffer. Just define an area of memory to serve as a buffer, split it into two pages, and separately store a start address and a length for each page. Then just write to the inactive page, update your pointers, and flip the "active page" bit. Your speed will be the time it takes to write your data + ~3 cycles.

Last Edited: Sat. May 22, 2021 - 05:40 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

As I suggested in #2.   Unrolling a few ST Z+ instructions means that the byte time is 2.5 cycles for 8-unroll, 2.25 cycles for 16-unroll on an ATmega device.

i.e. 2 cycles per ST Z+ is the "theoretical best" you can do.   2.5 cycles seems a practical compromise.

 

On an Xmega,  ST Z+ is a 1-cycle instruction.  So your theoretical best is 1 cycle per byte.  8-unroll is 1.5 cycles.  16-unroll is 1.25 cycles.

I have not checked the newer DA and DB times for ST Z+

 

What are you really doing?   I would be happy enough with 1548 cycles for 256 bytes.  i.e. 6.05 cycles per byte.

648 cycles for 256 bytes seems practical.

Or change to Xmega and do it in 328 cycles.

 

If you are sending screen pixels via SPI, I2C, 8080-8, 6800-8, ... the important step is to minimise the traffic.     e.g. only update the smallest rectangle.   or selected pixels.

 

Yes,  if you have DMA like on an Xmega,  you can update one half of a buffer while the other half is blitting the pixels to the display controller.

Incidentally,   decoding a 16x16 JPEG tile takes much more grunt than blitting the resulting 256 pixels.

 

David.

Last Edited: Sat. May 22, 2021 - 06:29 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

theusch wrote:
I'd still like to hear more about this absolute need for speed. 

Indeed:

 

Heisen wrote:
Second approach is to write only the pixels that are necessary.
How can that be? Assume if in the image where there is nothing we set that pixel to zero.

In the very first step, we set everything to zero by the help of our quick routine in assembly

But if you only ever do this the zeroing once - at the very beginning - why this absolute need for speed in doing it?

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillatorhttps://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

david.prentice wrote:

As I suggested in #2.   Unrolling a few ST Z+ instructions means that the byte time is 2.5 cycles for 8-unroll, 2.25 cycles for 16-unroll on an ATmega device.

i.e. 2 cycles per ST Z+ is the "theoretical best" you can do.   2.5 cycles seems a practical compromise.

+1

Now I understand what you guys meant by that. I tried 16 unroll like this.

volatile char david[256];  //in BSS

int main(void)
{
    char *p = david;
    uint8_t count = 16, fill = 0xAA;
    do {
        *p++ = fill;
        *p++ = fill;
        *p++ = fill;
        *p++ = fill;
        *p++ = fill;
        *p++ = fill;
        *p++ = fill;
        *p++ = fill;
        *p++ = fill;
        *p++ = fill;
        *p++ = fill;
        *p++ = fill;
        *p++ = fill;
        *p++ = fill;
        *p++ = fill;
        *p++ = fill;
    }while(--count);
    asm("nop");
    while (1)
    {
    }
}

This takes

607 cycles on -O1 Program Memory used - 202 bytes

623 cycles on -O2 Program Memory used - 200 bytes

512 cycles on -O3 Program Memory used - 1178 bytes

 

And that assembly routine in #1 post takes

512 cycles on any O level Program Memory used - 732 bytes (as I suspected)

 

So that inline assembly is a balance between -O2 and -O3.

“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?” - Brian W. Kernighan

Last Edited: Sat. May 22, 2021 - 10:45 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

It’s something about optimizations that gets me going, seeing slight changes here and there in code can add up and could make big differences in timing/space occupied.

 

And truly for me that is satisfying to do (what a nerd). Well today new microcontrollers have gotten so fast that normal programming doesn’t require deep level of optimization. We pick the appropriate MCU for our job, we throw code at it and compiler takes care of everything. Which is nice, robust, economical and fast solution.

 

Today we write algorithms based on our problems, we take it for granted that MCU/compiler will handle it (whatever is on the internet which is famous is the best). Nobody writes algorithm taking into account the instruction set and the problem, specifically to run on an old MCU.

 

Me an odd ball here, seeing how far I can push the megaAVR. Playing it like a game. Hehe.

 

One thing I learned, note for other young players, don’t optimize things just for sake of optimizing. Have a set goal and work up to it, otherwise you can get into limbo mode.

 

Learned a lot from the forum I must say, still learning. Now I am really getting interested in writing some nice hand crafted assembly to see if I can beat the compiler in some niche tasks.

 cheeky

“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?” - Brian W. Kernighan

Last Edited: Sat. May 22, 2021 - 10:44 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

First off.   Think about the inside loop.   i.e. functional statements.  loop control statements.

Any housekeeping,  initialisation,  outer loops, ... is not significant.

 

Write straightforward C code.   Then inspect the inside loop.

 

There is nothing better than ST Z+  (that I can think of)

If the Compiler has generated ST Z+ there is nothing to be gained from using gobbledygook.

 

Don't be proud.   Unroll-4, unroll-8,  ... are practical solutions.   Unrolling more than 16 gets a bit silly.

 

Regarding loop-control.  do-while probably makes a neater loop.  but while-do might feel more comfortable.   Once you have done an unroll-8 the loop control is insignificant.

 

Seriously.  You can create an efficient loop but you should concentrate on minimising pixel traffic.

Think about string sorting.   An Insertion Sort may be short and simple but a Hoare's Quick Sort has a dramatic performance.

Hint.   Try sorting 10k, 100k, ... items.

 

David.

Last Edited: Sat. May 22, 2021 - 11:56 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Just an idle thought: Are there any points in your code, elsewhere, where you have to delay for small time periods? If so, could they be modified to do bits of the zeroing during those small time periods? If you had something that had to kill time for a hundred cycles or so every so often somewhere else, stealing those cycles would buy you a lot of time.

 

That might also be an advantage of DMA -- if it's slow, but it's slow while you are doing other things, it's still better than not doing it. Say you can write a loop that clears one byte per 2.5 cycles, and DMA can only clear one byte per 5 cycles. Okay. So hand a third of your bytes off to the DMA to zero, and you just zero the other two thirds while the DMA runs.