Posted by theusch: Mon. Sep 12, 2016 - 02:00 PM(Reply to #52)
1
2
3
4
5
Total votes: 0
sparrow2 wrote:
Do you talk about utoa(16bit) or ltoa (32bit)?
There is a big difference, a utoa can be done in about 70 clk (worst case)
"Here we go again!" ;)
Indeed, what are the rules for the "contest"? Obviously there will be a difference in time/space between 16-bit and 32-bit. Also a difference between unsigned and signed. And are you going to right-justify? Leading-zero-supress? And one always needs to tell what number is being converted, as that will change the cycles for most approaches.
Once over 32,768 the LCD displays - negative numbers.
This is the behaviour of a signed 16 bit, so I guess I jumped to the conclusion this thread was about 16bit numbers. Now I will read these other threads, I like to learn algos.
#define val_lo r22
#define val_hi r23
#define val_hlo r24
#define val_hhi r25
#define str_lo r20
#define radix r18
#define counter r19
#define digit r26
#define sign r27
ENTRY __ultoa_ncheck
clr sign
ENTRY __ultoa_common
X_movw ZL, str_lo
1: ; Saves one iteration of the digit-loop:
; If val < radix we can use the low byte of val as digit
mov digit, val_lo
cp val_lo, radix
cpc val_hi, __zero_reg__
cpc val_hlo, __zero_reg__
cpc val_hhi, __zero_reg__
; now C is set, if val < radix
sbc counter, counter
bst counter, 0
brts 4f
; counter == 0 here
; If val >= radix, then pop one digit from val
clr digit
2: ; Vanilla 32:8 quotient and remainder to pop the digit
; digit <- val % radix
; val <- val / radix
lsl val_lo
rol val_hi
rol val_hlo
rol val_hhi
rol digit
cp digit, radix
brlo 3f
sub digit, radix
; val |= 1
inc val_lo
3: ; Loop the 32 bits
subi counter, 8 ; 256 / 8 == 32 loops
brne 2b
4: ; Convert the digit to ASCII...
subi digit, -'0'
cpi digit, '9'+1
brlo 5f
subi digit, '0'-'a'+10
5: ; ... and store it to the reversed string
st Z+, digit
; Popped all digits?
brtc 1b
; Yes: Store the sign (if any)
cpse sign, __zero_reg__
st Z+, sign
; Terminate the string with '\0'
7: st Z, __zero_reg__
; Reverse the string and return the original string pointer
X_movw r24, str_lo
XJMP _U(strrev)
#include "macros.inc"
#define str_hi r25
#define str_lo r24
#define ltemp r23
#define rtemp r22
ASSEMBLY_CLIB_SECTION
.global _U(strrev)
.type _U(strrev), @function
_U(strrev):
X_movw XL, str_lo ; X is start of string
X_movw ZL, str_lo ; Z becomes end of string
; find end of string
1: mov rtemp, ltemp ; to obtain right nonzero character
ld ltemp, Z+
tst ltemp
brne 1b
sbiw ZL, 2 ; to last nonzero byte
rjmp 3f
; swap bytes
2: ld ltemp, X
st X+, rtemp
st Z, ltemp
ld rtemp, -Z
3: cp XL, ZL
cpc XH, ZH
brlo 2b
ret
If anyone can spot ways to speed/improve this feel free to say - you can become famous as one of the AVR-LibC authors and you get your name on this page...
After staring at this code for quite a while, I concluded they are doing a long division using some clever shifting of the dividend instead of the divisor. This cleverness means the code can probably not be much improved unless a different division algorithm is used (not long division).
I'm not sure if an alternative using the multiply by inverse algo would be faster, it would probably require a lookup table with "magic numbers" for all the valid inverses (1/2, 1/3....1/36).
I've had 1G37T30 times I've needed to use base 30!
But as the AVR-LibC [i|l]toa() functions has offered radix 36 since time immemorial I guess any replacement needs to continue to do so for the one person who's used a non 2/10/16 radix. Of course the code could look out for 2/10/16 and use a more optimal solution for those specific bases.
But as the AVR-LibC [i|l]toa() functions has offered radix 36 since time immemorial I guess any replacement needs to continue to do so
That's the issue. The code is probably fine, although we might be able to save a cycle here or there. The generic algo is likely good too.
The problem lies in the library interface itself. It makes sense to offer free radix in a powerful library, but we MCU users don't need it as much as we need it fast and small.
You said it yourself, Cliff: most ltoa usage is for user feedback, which means it's gonna be base 10 99% of the time. The need for base 2 and 16 is much smaller already (debugging?) and should not impact our "main" library code. The need for other radices pretty much does not exist (if you need it, just use ltoa and its sisters). If we kept to these radices, as sparrow2 mentioned, we could get much better/faster library code.
That would be the sort of library we really need, despite all the merits of a standardized libc.
Posted by Chuck99: Mon. Sep 12, 2016 - 08:50 PM(Reply to #50)
1
2
3
4
5
Total votes: 0
netizen wrote:
I wouldn't give up 2300+ cycles to put 1000000 in a string.
It's interesting that you consider 2300+ cycles (147us at 16MHz) to be a great burden that must be avoided.
Every program that I have written for microcontrollers has periods where it wastes cycles waiting for something to happen.
It is during these periods that I schedule operations such as ltoa(), so the burden is nil.
What is nice about having ltoa() in the library is that it's an alternative to sprinf() that uses less flash. Also, being in the library it is likely to be bug free.
If there was a smaller and faster equivalent function in the library that only did radix 10, I would gladly use that.
It's interesting that you consider 2300+ cycles (147us at 16MHz) to be a great burden that must be avoided.
Well, I get your point but I don't want to get drowned into that. Surely you'll agree that you can't implement a library like libc with this mindset though…
Chuck99 wrote:
If there was a smaller and faster equivalent function in the library that only did radix 10, I would gladly use that.
I'm glad to hear that. :-)
In fact, another non-libc dedicated function would be the best choice: I don't think it's realistic to imagine freaks will get to weight on the libc interface though. And if we keep to the current interface, even with branching to radix-optimized code, we'll still be clobering a parameter register and wasting a few cycles because we have to specify a 10 radix. It's not much, but it will happen 99% of the time…
Nevertheless, what we could do is propose alternative implementation for existing libc functions that do provide the benefits we need. That would be an improvement.
In fact, another non-libc dedicated function would be the best choice: I don't think it's realistic to imagine freaks will get to weight on the libc interface though. And if we keep to the current interface, even with branching to radix-optimized code, we'll still be clobering a parameter register and wasting a few cycles because we have to specify a 10 radix. It's not much, but it will happen 99% of the time…
Nevertheless, what we could do is propose alternative implementation for existing libc functions that do provide the benefits we need. That would be an improvement.
Maybe not, since the base is a known constant at compile time, the compiler might be smart enough to optimize all the overhead away, and directly call the "right" alternative function. Who knows, only with testing...
Maybe not, since the base is a known constant at compile time, the compiler might be smart enough to optimize all the overhead away, and directly call the "right" alternative function. Who knows, only with testing...
Good point. I don't think GCC will mess with a function definition (remove an unused/constant parameter) though…
If worst case needs to be improved some extra code can be added
either first (forget first digit it's not full) do sub 4000 until negative and then add 1000 until positiv then with 400 100 40 10 4 1.
or do some bit checks if bit 12 is set (and first digit found) init with 4 and sub 5000 (we know what we can do 4000) and check etc.
I have not made it but I guess about 100- 120 clk for worst-case can be done.
The good thing is that this code works on tiny's.
For mega's the HW mul can be used to make div by mul with 1/x.
And that can the be done in worst-case of 65 clk (my old version can be improved)
Both can be made for 32 bit
(with the 1/x version I would div 32 bit with 10000 and the use the 16 bit rutine without first digit, then div with 10000 again .. .. and the the last 2 digits)
I personally won't bother spending time "fixing" this unless the powers that be agree on the principle beforehand. In which case, they'll probably implement it themselves —but that's another story.
Do we have an agreement here that optimized code for radices 10, 2 and 16 would be beneficial, and provide more opportunities for these stdlib conversion functions to be used more widely?
Do we have an agreement here that optimized code for radices 10, 2 and 16 would be beneficial, and provide more opportunities for these stdlib conversion functions to be used more widely?
FWIW I don't use any of these XtoY functions for this reason alone, but perhaps that's just me…
Do we have an agreement here that optimized code for radices 10, 2 and 16 would be beneficial, and provide more opportunities for these stdlib conversion functions to be used more widely?
The way collaborative software tends to work is that when someone finds some missing/deficient feature in what exists that they simply cannot live without they put in the time/effort to implement it and then if they are a generous soul they donate their work back for others to share later. While it sounds like you may consider Xtoa() to be deficient I get the sneaking feeling that your need to have it "corrected" is possibly not great enough to drive you to be the person that must have it fixed at all costs ;-) If someone comes along saying "I need radix 10/16 as fast as possible at any cost possible!!" then I guess we may have found our candidate to make changes?
The way collaborative software tends to work is that when someone finds some missing/deficient feature in what exists that they simply cannot live without they put in the time/effort to implement it and then if they are a generous soul they donate their work back for others to share later.
My experience in dealing with established open source projects is slightly different: it is likely that your personal improvements will be refused because in the general case there are reasons not to implement them that way. It could be constraints you're not aware of, because you only see part of the picture. It could be some less significant habits that someone higher up in the food chain happens to have. Sometimes it might just be an ego trip from someone powerful enough to impose it's tantrum on the rest of us. I could link to examples of each from my own experience, but that wouldn't help. :-)
The thing is, coding something for one project and coding it for something like stdlib is not the same: of course the later requires much higher standards, and they obviously come at a cost. Thus my question: I'm not gonna bear that cost to play bingo and eventually see it all go to waste…
I've used base 2 sometimes but it sure would help if there'd also been a leading 0 padding option! You end up itoa() and then doing str*() to put in the leading zeroes anyway.
I've used base 2 sometimes but it sure would help if there'd also been a leading 0 padding option! You end up itoa() and then doing str*() to put in the leading zeroes anyway.
I feel you, but this ain't going to happen: we're not gonna change the stdlib API. What about optimized code for common radices?
I did eventually - there's something very odd about this whole Freaks/Communities site thing. It seemed to believe I *was* logged into Freaks but was not logged into Communities.
netizen wrote:
What about optimized code for common radices?
I'm not really the person to ask (just another user). The people who develop AVR-LibC gather here:
I'm not really the person to ask (just another user). The people who develop AVR-LibC gather here
I know where they gather. I just want to know where freaks stand before I bother them: it's not the same if it's just me, or experienced freaks backing up the effort.
Well, for binary keeping leading zeroes would be trivially simple, you just need to shift the bits to the carry flag, then adc with '0' char, repeat 32 times, add the zero string terminator, and that's it. For other bases that are powers of 2, it would also be very easy.
Nearly always I want my display numbers to be right-justified [and fixed width]. The biggest exception might be free--format "dump" strings, but then size/speed don't matter that much. So my home-built routines are almost always adaptations of "itoa()"-type.
The same goes for leading-zero suppression. Usually I want it for display, but not for manipulating setpoints and a few other situations. So I've got an option for that.
When catered for, both of those are virtually free on top of the actual conversion itself. IMO/IME a lot cheaper when integrated into the conversion than with a post-processing pass. Suit yourself. If the app has a need for (s)printf otherwise, I'll tend to use that and hang the extra cycles a few times a second. If the app is updating a many-digit 7-seg high-speed counter https://www.marshbellofram.com/a... then indeed I'll try to keep the binary=>BCD time down.
You can put lipstick on a pig, but it is still a pig.
I've never met a pig I didn't like, as long as you have some salt and pepper.
I'm not really the person to ask (just another user). The people who develop AVR-LibC gather here
I know where they gather. I just want to know where freaks stand before I bother them: it's not the same if it's just me, or experienced freaks backing up the effort.
By the way, Lee, in light of this opinion: how come you're using your own version instead of a stdlib ready-made function?
Isn't that all explored in the link I gave in #53? (#21; #23; ...)
Indeed it is. But I don't want to take chances and let misinterpretation blur my point: I want your explicit opinion on this particular matter.
theusch wrote:
When catered for, both of those are virtually free on top of the actual conversion itself. IMO/IME a lot cheaper when integrated into the conversion than with a post-processing pass.
So, you think having optimized code for common radices is a "solution in search of a problem" (#81), yet you roll your own code with integrated right-justified and fixed width to save a few cycles of post-processing. Am I getting this right?
Define "few". The best itoa/utoa implementations, depending on the rules, are a couple hundred cycles. I'll wager that taking that output, placing right-justified into a fixed-width field, and leading-zero-supress will double that.
And indeed, IMO/IME radices other than 10 are rare. Base 16, once in a while. [BUT THAT IS SUCH A TRIVIAL CASE THAT IT WILL BLOW THE SHORTS OFF ANY GENERIC IMPLEMENTATION--the value is already in packed-BCD format.]
I don't know what you are pounding at. If you care to propose changes to this particular toolchain for consideration, that is up to you. No blessing from me is needed. It ain't my usual toolchain anyway.
netizen wrote:
instead of a stdlib ready-made function?
Remember that itoa and variants aren't part of any standard C library. You are free to extend the language however you see fit.
You can put lipstick on a pig, but it is still a pig.
I've never met a pig I didn't like, as long as you have some salt and pepper.
Define "few". The best itoa/utoa implementations, depending on the rules, are a couple hundred cycles. I'll wager that taking that output, placing right-justified into a fixed-width field, and leading-zero-supress will double that.
First, remember the initial case for this thread is ltoa, not itoa/utoa. And the cycles count is 2300, thus ±200 would indeed be "few".
But let's make it easy on you and consider the fastest: utoa, so you might have a point on a relative basis.
The post-processing function would be something like:
_utoa(char *src, char *dst, char fill);
Its job is to move a pointer through *src until it encounters /0 (6 steps max), then move backward copying *src to *dst+i and filling what's left with fill (5 steps max). That, according to you, is 200 cycles long? Challenge accepted: show me the code!
theusch wrote:
And indeed, IMO/IME radices other than 10 are rare. Base 16, once in a while.
Alright, so you actually are agreeing with me/us. :-)
theusch wrote:
I don't know what you are pounding at. If you care to propose changes to this particular toolchain for consideration, that is up to you. No blessing from me is needed.
I need feedback from experienced people to support my case. You happen to be one of them, although you claim not to. :-)
Posted by theusch: Wed. Sep 14, 2016 - 06:16 PM(Reply to #96)
1
2
3
4
5
Total votes: 0
netizen wrote:
Alright, nevermind. You have chosen to object, no matter what. It's your choice and I respect that. Let's leave it at that. :-)
No. The 2300+ cycles that you are using was your number, from Chuck's ltoa() experiment with 1000000. How do you relate that to itoa or a utoa implementation?
In the other thread, my toolchain's itoa() was 341 cycles for 9999.
danni's approach was 165 cycles for 9999.
sparrow2's approach is about 100, right?
Mixing apples and oranges will give fruit salad. And my post-processing comments stand. As well as how widely arcane bases might be used.
You can put lipstick on a pig, but it is still a pig.
I've never met a pig I didn't like, as long as you have some salt and pepper.
Your post-processing comments will be validated when you accept the challenge I gave you. Until then, they're just FUD.
theusch wrote:
No. The 2300+ cycles that you are using was your number, from Chuck's ltoa() experiment with 1000000. How do you relate that to itoa or a utoa implementation?
You wouldn't ask if you had followed this thread: they're all under-optimized for common radices because they use a generic algorithm that supports uncommon ones.
theusch wrote:
In the other thread, my toolchain's itoa() was 341 cycles for 9999. danni's approach was 165 cycles for 9999. sparrow2's approach is about 100, right?
Right. And precisely my point: I'd like to have optimized code such as these when I use a XtoY function without thinking. That's what stdlib should be about in AVR platforms anyway.
Look, I'm getting tired of this. Please, let's leave it at that.
This one uses the hardware multiplier, so it's not for ATtinys. And it's not working yet, most results differ from the right one by 1 unit less, needs a less sloppy division, so it will take more cycles. Working on it.
8 bit input in r16. Result in r18:r17:r16 (3 digits)
This incorrect version takes 14 cycles for 3 digits, less significant digit is in error by 1 unit. Sorry for no comments, but I'm a bit sleepy now, maybe tomorrow.
There is no byte-to-string facility in stdlib, though: candidates are int-to-string and long-to-string, both signed and unsigned.
Even if there's probably less to be gained than with ltoa —but it's probably more commonly used, here is utoa for reference (utoa_ncheck.S):
#if !defined (__AVR_TINY__)
#include "asmdef.h"
/* char *__utoa_ncheck (unsigned val, char *s, unsigned char radix)
This function is the utoa() without checking the radix for 2..36 margins.
*/
#define val_lo r24
#define val_hi r25
#define str_lo r22
#define radix r20
#define counter r21
#define digit r26
#define sign r27
// clock cycles
ENTRY __utoa_ncheck
clr sign // 1
ENTRY __utoa_common
X_movw ZL, str_lo // 1
clr counter // 1
1: ; Vanilla 16:8 quotient and remainder to pop the digit
; digit <- val % radix
; val <- val / radix
clr digit // 1
2: lsl val_lo // 1
rol val_hi // 1
rol digit // 1
cp digit, radix // 1
brlo 3f // 1/2
sub digit, radix // 1
inc val_lo // 1
3: subi counter, 16 ; 256 / 16 == 16 loops // 1
brne 2b // 1/2
; Convert the digit to ASCII...
subi digit, -'0' // 1
cpi digit, '9'+1 // 1
brlo 4f // 1/2
subi digit, '0'-'a'+10 // 1
4: ; ... and store it to the reversed string
st Z+, digit // 1(tiny/xmega)/2
; Iterate until all digits are sucked out of VAL.
sbiw val_lo, 0 // 2
brne 1b // 1/2
; Store the sign (if any)
cpse sign, __zero_reg__ // 1/2
st Z+, sign // 1(tiny/xmega)/2
; Terminate the string with '\0'
st Z+, __zero_reg__ // 1(tiny/xmega)/2
; Reverse the string and return the original string pointer
X_movw r24, str_lo // 1
XJMP _U(strrev) // …
ENDFUNC
#endif /* !__AVR_TINY__ */
And its front-end (stdlib.h):
/**
\ingroup avr_stdlib
\brief Convert an unsigned integer to a string.
The function utoa() converts the unsigned integer value from \c val into an
ASCII representation that will be stored under \c s. The caller
is responsible for providing sufficient storage in \c s.
\note The minimal size of the buffer \c s depends on the choice of
radix. For example, if the radix is 2 (binary), you need to supply a buffer
with a minimal length of 8 * sizeof (unsigned int) + 1 characters, i.e. one
character for each bit plus one for the string terminator. Using a larger
radix will require a smaller minimal buffer size.
\warning If the buffer is too small, you risk a buffer overflow.
Conversion is done using the \c radix as base, which may be a
number between 2 (binary conversion) and up to 36. If \c radix
is greater than 10, the next digit after \c '9' will be the letter
\c 'a'.
The utoa() function returns the pointer passed as \c s.
*/
#ifdef __DOXYGEN__
extern char *utoa(unsigned int val, char *s, int radix);
#else
extern __inline__ __ATTR_GNU_INLINE__
char *utoa (unsigned int __val, char *__s, int __radix)
{
if (!__builtin_constant_p (__radix)) {
extern char *__utoa (unsigned int, char *, int);
return __utoa (__val, __s, __radix);
} else if (__radix < 2 || __radix > 36) {
*__s = 0;
return __s;
} else {
extern char *__utoa_ncheck (unsigned int, char *, unsigned char);
return __utoa_ncheck (__val, __s, __radix);
}
}
#endif
Do you talk about utoa(16bit) or ltoa (32bit)?
There is a big difference, a utoa can be done in about 70 clk (worst case)
- Log in or register to post comments
Top"Here we go again!" ;)
Indeed, what are the rules for the "contest"? Obviously there will be a difference in time/space between 16-bit and 32-bit. Also a difference between unsigned and signed. And are you going to right-justify? Leading-zero-supress? And one always needs to tell what number is being converted, as that will change the cycles for most approaches.
Starting with
https://www.avrfreaks.net/forum/s...
...one toolchain took a bit over 100 cycles for itoa(9999). https://www.avrfreaks.net/comment...
sparrow2, what is your worst-case number to convert?
In that thread, danni gave a routine that handles up to 42000 in a couple hundred cycles and is small.
sparrow2 referred to a different approach, and posted the result in
https://www.avrfreaks.net/comment...
...
You can put lipstick on a pig, but it is still a pig.
I've never met a pig I didn't like, as long as you have some salt and pepper.
- Log in or register to post comments
Topremember this thread is 100% 32 bit!
that was why I made a comment.
I would say that anything about 16 bit in this thread would be noise!
- Log in or register to post comments
Top???
It looks like unsigned 16-bit to me... "utoa"
You can put lipstick on a pig, but it is still a pig.
I've never met a pig I didn't like, as long as you have some salt and pepper.
- Log in or register to post comments
TopChill out, the OP said:
This is the behaviour of a signed 16 bit, so I guess I jumped to the conclusion this thread was about 16bit numbers. Now I will read these other threads, I like to learn algos.
- Log in or register to post comments
Topsorry but the last 20 post's have been about ltoa.
- Log in or register to post comments
TopAs you have in the last link about 70 clk for a full 16 bit
and if max is 9999 it's about 50 clk
And about 300 if it's extented to a ltoa (use same rutine and make two div with 10000)
- Log in or register to post comments
TopNow I'll suspect your clock counts, if 20/50 is 100%. ;)
I'll have to examine your code. Will this give exact results for all input numbers?
You can put lipstick on a pig, but it is still a pig.
I've never met a pig I didn't like, as long as you have some salt and pepper.
- Log in or register to post comments
TopFor the record this is what you are calling when you call ltoa() in AVR-LibC:
If anyone can spot ways to speed/improve this feel free to say - you can become famous as one of the AVR-LibC authors and you get your name on this page...
http://www.nongnu.org/avr-libc/u...
(note that you have to willing to make the code available under a BSD licence).
- Log in or register to post comments
Top+1
Let's make this interesting… ^^
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
TopYes
- Log in or register to post comments
TopAfter staring at this code for quite a while, I concluded they are doing a long division using some clever shifting of the dividend instead of the divisor. This cleverness means the code can probably not be much improved unless a different division algorithm is used (not long division).
I'm not sure if an alternative using the multiply by inverse algo would be faster, it would probably require a lookup table with "magic numbers" for all the valid inverses (1/2, 1/3....1/36).
I think I'll pass on this one.
- Log in or register to post comments
TopBut in reality who want other than 2 10 and 16 as base numbers?
and 2 and 16 are easy
so it's only 10 we need to look at.
- Log in or register to post comments
TopI've had 1G37T30 times I've needed to use base 30!
But as the AVR-LibC [i|l]toa() functions has offered radix 36 since time immemorial I guess any replacement needs to continue to do so for the one person who's used a non 2/10/16 radix. Of course the code could look out for 2/10/16 and use a more optimal solution for those specific bases.
- Log in or register to post comments
TopThat's the issue. The code is probably fine, although we might be able to save a cycle here or there. The generic algo is likely good too.
The problem lies in the library interface itself. It makes sense to offer free radix in a powerful library, but we MCU users don't need it as much as we need it fast and small.
You said it yourself, Cliff: most ltoa usage is for user feedback, which means it's gonna be base 10 99% of the time. The need for base 2 and 16 is much smaller already (debugging?) and should not impact our "main" library code. The need for other radices pretty much does not exist (if you need it, just use ltoa and its sisters). If we kept to these radices, as sparrow2 mentioned, we could get much better/faster library code.
That would be the sort of library we really need, despite all the merits of a standardized libc.
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
TopJust for fun: how many cycles does it take ltoa to convert 1000000 to binary?
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
TopYou discarded an important part (stdlib.h):
There is provision here to do further radix checks and branch to radix-optimized code.
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
TopIt's interesting that you consider 2300+ cycles (147us at 16MHz) to be a great burden that must be avoided.
Every program that I have written for microcontrollers has periods where it wastes cycles waiting for something to happen.
It is during these periods that I schedule operations such as ltoa(), so the burden is nil.
What is nice about having ltoa() in the library is that it's an alternative to sprinf() that uses less flash. Also, being in the library it is likely to be bug free.
If there was a smaller and faster equivalent function in the library that only did radix 10, I would gladly use that.
- Log in or register to post comments
TopWell, I get your point but I don't want to get drowned into that. Surely you'll agree that you can't implement a library like libc with this mindset though…
I'm glad to hear that. :-)
In fact, another non-libc dedicated function would be the best choice: I don't think it's realistic to imagine freaks will get to weight on the libc interface though. And if we keep to the current interface, even with branching to radix-optimized code, we'll still be clobering a parameter register and wasting a few cycles because we have to specify a 10 radix. It's not much, but it will happen 99% of the time…
Nevertheless, what we could do is propose alternative implementation for existing libc functions that do provide the benefits we need. That would be an improvement.
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
TopMaybe not, since the base is a known constant at compile time, the compiler might be smart enough to optimize all the overhead away, and directly call the "right" alternative function. Who knows, only with testing...
- Log in or register to post comments
TopGood point. I don't think GCC will mess with a function definition (remove an unused/constant parameter) though…
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
TopFor fast and small take a look at danni's code there https://www.avrfreaks.net/comment...
It is small has a max of 170 clk.
If worst case needs to be improved some extra code can be added
either first (forget first digit it's not full) do sub 4000 until negative and then add 1000 until positiv then with 400 100 40 10 4 1.
or do some bit checks if bit 12 is set (and first digit found) init with 4 and sub 5000 (we know what we can do 4000) and check etc.
I have not made it but I guess about 100- 120 clk for worst-case can be done.
The good thing is that this code works on tiny's.
For mega's the HW mul can be used to make div by mul with 1/x.
And that can the be done in worst-case of 65 clk (my old version can be improved)
Both can be made for 32 bit
(with the 1/x version I would div 32 bit with 10000 and the use the 16 bit rutine without first digit, then div with 10000 again .. .. and the the last 2 digits)
- Log in or register to post comments
TopI don't think that the radix needs to be known at compile time, radix is not a const.
- Log in or register to post comments
TopThanks, sparrow2.
I personally won't bother spending time "fixing" this unless the powers that be agree on the principle beforehand. In which case, they'll probably implement it themselves —but that's another story.
Do we have an agreement here that optimized code for radices 10, 2 and 16 would be beneficial, and provide more opportunities for these stdlib conversion functions to be used more widely?
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
TopFWIW I don't use any of these XtoY functions for this reason alone, but perhaps that's just me…
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
Top- Log in or register to post comments
TopMy experience in dealing with established open source projects is slightly different: it is likely that your personal improvements will be refused because in the general case there are reasons not to implement them that way. It could be constraints you're not aware of, because you only see part of the picture. It could be some less significant habits that someone higher up in the food chain happens to have. Sometimes it might just be an ego trip from someone powerful enough to impose it's tantrum on the rest of us. I could link to examples of each from my own experience, but that wouldn't help. :-)
The thing is, coding something for one project and coding it for something like stdlib is not the same: of course the later requires much higher standards, and they obviously come at a cost. Thus my question: I'm not gonna bear that cost to play bingo and eventually see it all go to waste…
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
TopBut I notice you conveniently dodged the question: do you support the effort or not?
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
TopBy the way, Cliff: you've got mail. :-)
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
TopHow many times has anyone here used conversion to other than base 10 and an occasional base 16 in any application?
And as a "real" part of the application and not a curiosity or "don't care about the cycles"?
There are some that like their base 2 displays. I cannot remember ever doing that myself.
I can't remember doing octal since PDP and VAX days.
I remember a base 36 in a past life. Let's call that a curiosity.
So IMO your effort would be a solution in search of a problem.
You can put lipstick on a pig, but it is still a pig.
I've never met a pig I didn't like, as long as you have some salt and pepper.
- Log in or register to post comments
TopThanks. We happen to disagree, but at least you made your opinion clear. :-)
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
TopBy the way, Lee, in light of this opinion: how come you're using your own version instead of a stdlib ready-made function?
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
TopI've used base 2 sometimes but it sure would help if there'd also been a leading 0 padding option! You end up itoa() and then doing str*() to put in the leading zeroes anyway.
- Log in or register to post comments
TopCliff: did you get my private message?
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
TopI feel you, but this ain't going to happen: we're not gonna change the stdlib API. What about optimized code for common radices?
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
Tophttps://lists.nongnu.org/mailman...
There is this too:
https://lists.nongnu.org/mailman...
(several say they prefer to deal with this stuff via email rather than internet fora).
- Log in or register to post comments
TopI know where they gather. I just want to know where freaks stand before I bother them: it's not the same if it's just me, or experienced freaks backing up the effort.
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
TopWell, for binary keeping leading zeroes would be trivially simple, you just need to shift the bits to the carry flag, then adc with '0' char, repeat 32 times, add the zero string terminator, and that's it. For other bases that are powers of 2, it would also be very easy.
- Log in or register to post comments
TopIsn't that all explored in the link I gave in #53? (#21; #23; ...)
https://www.avrfreaks.net/comment...
When catered for, both of those are virtually free on top of the actual conversion itself. IMO/IME a lot cheaper when integrated into the conversion than with a post-processing pass. Suit yourself. If the app has a need for (s)printf otherwise, I'll tend to use that and hang the extra cycles a few times a second. If the app is updating a many-digit 7-seg high-speed counter https://www.marshbellofram.com/a... then indeed I'll try to keep the binary=>BCD time down.
You can put lipstick on a pig, but it is still a pig.
I've never met a pig I didn't like, as long as you have some salt and pepper.
- Log in or register to post comments
TopCan I refer Sir back to:
#47: https://www.avrfreaks.net/comment...
My personal opinion has not changed from there I'm afraid.
- Log in or register to post comments
TopIndeed it is. But I don't want to take chances and let misinterpretation blur my point: I want your explicit opinion on this particular matter.
So, you think having optimized code for common radices is a "solution in search of a problem" (#81), yet you roll your own code with integrated right-justified and fixed width to save a few cycles of post-processing. Am I getting this right?
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
TopDefine "few". The best itoa/utoa implementations, depending on the rules, are a couple hundred cycles. I'll wager that taking that output, placing right-justified into a fixed-width field, and leading-zero-supress will double that.
And indeed, IMO/IME radices other than 10 are rare. Base 16, once in a while. [BUT THAT IS SUCH A TRIVIAL CASE THAT IT WILL BLOW THE SHORTS OFF ANY GENERIC IMPLEMENTATION--the value is already in packed-BCD format.]
I don't know what you are pounding at. If you care to propose changes to this particular toolchain for consideration, that is up to you. No blessing from me is needed. It ain't my usual toolchain anyway.
Remember that itoa and variants aren't part of any standard C library. You are free to extend the language however you see fit.
You can put lipstick on a pig, but it is still a pig.
I've never met a pig I didn't like, as long as you have some salt and pepper.
- Log in or register to post comments
TopFirst, remember the initial case for this thread is ltoa, not itoa/utoa. And the cycles count is 2300, thus ±200 would indeed be "few".
But let's make it easy on you and consider the fastest: utoa, so you might have a point on a relative basis.
The post-processing function would be something like:
Its job is to move a pointer through *src until it encounters /0 (6 steps max), then move backward copying *src to *dst+i and filling what's left with fill (5 steps max). That, according to you, is 200 cycles long? Challenge accepted: show me the code!
Alright, so you actually are agreeing with me/us. :-)
I need feedback from experienced people to support my case. You happen to be one of them, although you claim not to. :-)
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
TopSigh.
and then the complaint "Once over 32,768 the LCD displays - negative numbers."
The cycles count for what? Many cycle counts given in the other thread, and in this one.
You can put lipstick on a pig, but it is still a pig.
I've never met a pig I didn't like, as long as you have some salt and pepper.
- Log in or register to post comments
TopAlright, nevermind. You have chosen to object, no matter what. It's your choice and I respect that. Let's leave it at that. :-)
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
TopNo. The 2300+ cycles that you are using was your number, from Chuck's ltoa() experiment with 1000000. How do you relate that to itoa or a utoa implementation?
In the other thread, my toolchain's itoa() was 341 cycles for 9999.
danni's approach was 165 cycles for 9999.
sparrow2's approach is about 100, right?
Mixing apples and oranges will give fruit salad. And my post-processing comments stand. As well as how widely arcane bases might be used.
You can put lipstick on a pig, but it is still a pig.
I've never met a pig I didn't like, as long as you have some salt and pepper.
- Log in or register to post comments
TopYour post-processing comments will be validated when you accept the challenge I gave you. Until then, they're just FUD.
You wouldn't ask if you had followed this thread: they're all under-optimized for common radices because they use a generic algorithm that supports uncommon ones.
Right. And precisely my point: I'd like to have optimized code such as these when I use a XtoY function without thinking. That's what stdlib should be about in AVR platforms anyway.
Look, I'm getting tired of this. Please, let's leave it at that.
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
TopIf it only needs to work up to 9999 my routine take 45 clk (it's relatively hard to find the first digit)
and for all 16 bit it can be made in 65 clk max (min is almost the same 2 less ).
I was looking on some 6502 code for div a byte with 10
It's correct for all numbers (I have checked it)
input in r24
output in r24
r25 change
It take 13 clk.
for the itoa it only needs to be correct up to 99 are there there anyone that know if it can be done faster ? (with a bit more sloppy fraction code)
add:
fractions to the code
- Log in or register to post comments
TopThis one uses the hardware multiplier, so it's not for ATtinys. And it's not working yet, most results differ from the right one by 1 unit less, needs a less sloppy division, so it will take more cycles. Working on it.
8 bit input in r16. Result in r18:r17:r16 (3 digits)
This incorrect version takes 14 cycles for 3 digits, less significant digit is in error by 1 unit. Sorry for no comments, but I'm a bit sleepy now, maybe tomorrow.
- Log in or register to post comments
TopThanks for the code guys.
There is no byte-to-string facility in stdlib, though: candidates are int-to-string and long-to-string, both signed and unsigned.
Even if there's probably less to be gained than with ltoa —but it's probably more commonly used, here is utoa for reference (utoa_ncheck.S):
And its front-end (stdlib.h):
ɴᴇᴛɪᴢᴇᴎ
- Log in or register to post comments
TopPages