my_strerror() as API function

With commit ec268cc8df7c7a90811a099d422eef6a31bf9f8b, between 5.27.1
and 5.27.2, my_strerror() was removed from the public API.  I think this
was a mistake and it should be restored.  It provides a useful facility
that's otherwise difficult to achieve: errno-based messages that are
responsive to "use locale" in the same way as $!.  I use it for this
purpose in the CPAN module Hash::SharedMem, and its withdrawal from the
public API breaks the module's intended behaviour of matching $!.
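
For illustration, this is roughly the Perl-level behaviour the module
wants to match from XS (EACCES is just an example errno, and LC_MESSAGES
support is platform-dependent):

    use Errno qw(EACCES);
    use POSIX qw(setlocale LC_MESSAGES);
    use locale;

    setlocale(LC_MESSAGES, '');   # adopt the user's locale from the environment
    $! = EACCES;
    print "$!\n";                 # the EACCES message, localized like my_strerror()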

-zefram
zefram
8/12/2017 10:53:54 AM

I wrote:
>                                          It provides a useful facility
>that's otherwise difficult to achieve: errno-based messages that are
>responsive to "use locale" in the same way as $!.

Actually it's not quite the same, because there's an encoding issue.
In scope of "use locale", my_strerror() returns a string encoded in the
locale's charset.  $! uses a dodgy heuristic to sometimes decode this.

As a CPAN author, it'd be nice to have an API function that shows
what would go into $! for a given errno.  It'd have to return an SV,
or operate by writing to a supplied SV.  Currently the behaviour would
be my_strerror() plus dubious setting of SvUTF8.

As a core coder and general Perl programmer, it'd be nice to have proper
string decoding on $!.  It should be decoded based on the actual character
encoding of the locale that supplied the string, not just a guess.
It should be decoded regardless of what the encoding is, not only if
it's UTF-8.

-zefram
zefram
8/12/2017 5:55:00 PM
On 08/12/2017 11:55 AM, Zefram wrote:
> I wrote:
>>                                           It provides a useful facility
>> that's otherwise difficult to achieve: errno-based messages that are
>> responsive to "use locale" in the same way as $!.
> 
> Actually it's not quite the same, because there's an encoding issue.
> In scope of "use locale", my_strerror() returns a string encoded in the
> locale's charset.  $! uses a dodgy heuristic to sometimes decode this.
> 
> As a CPAN author, it'd be nice to have an API function that shows
> what would go into $! for a given errno.  It'd have to return an SV,
> or operate by writing to a supplied SV.  Currently the behaviour would
> be my_strerror() plus dubious setting of SvUTF8.
> 
> As a core coder and general Perl programmer, it'd be nice to have proper
> string decoding on $!.  It should be decoded based on the actual character
> encoding of the locale that supplied the string, not just a guess.
> It should be decoded regardless of what the encoding is, not only if
> it's UTF-8.

I don't understand much of your point, but patches welcome.

The heuristic you say is dodgy has been used traditionally in perl, and 
it actually works well.  For those of you who aren't familiar with it, 
it leaves the UTF-8 flag off on strings that have the same 
representation in UTF-8 as not.  For those, the flag's state is 
immaterial.  For other strings, it turns on the flag if and only if it 
is syntactically legal UTF-8.  It turns out that, due to the structured
nature of UTF-8 and the way that symbols vs word characters happen to be
laid out in Latin-1, it's very unlikely that a string of real words
containing UTF-8-variant characters will be incorrectly classified.
The comments in the code quote
http://en.wikipedia.org/wiki/Charset_detection to that effect.
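
For concreteness, a rough pure-Perl rendering of that heuristic as
described (the real logic is C in the core; the function name here is
invented for illustration):

    sub maybe_utf8_upgrade {
        my ($bytes) = @_;
        # All-ASCII strings look the same either way, so leave the flag off.
        return $bytes unless $bytes =~ /[^\x00-\x7F]/;
        my $copy = $bytes;
        # utf8::decode() succeeds, and marks the string as UTF-8, only if
        # the bytes are syntactically legal UTF-8.
        return utf8::decode($copy) ? $copy : $bytes;
    }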

There is no way of being able to determine with total reliability the 
locale that something is encoded in across all systems that Perl can run on.
> 
> -zefram
> 
public
8/12/2017 10:31:40 PM
Karl Williamson wrote:
>The heuristic you say is dodgy has been used traditionally in perl,

I don't recall ever encountering it before.  Though looking now, I see
some other locale-related uses and, scarily, some in the tokeniser.

>                      For those of you who aren't familiar with it, it leaves
>the UTF-8 flag off on strings that have the same representation in UTF-8 as
>not.  For those, the flag's state is immaterial.

This is presupposing that the only thing to decide is whether to turn
on SvUTF8.  A more accurate statement of this part of the heuristic
would be that it interprets any byte sequence that could be valid ASCII
as ASCII.  This gives the correct result if the actual encoding of the
input is ASCII-compatible, which one would hope would always be the case
for locale encodings on an ASCII-based platform.  (I'm ignoring EBCDIC.)

>                                                  For other strings, it turns
>on the flag if and only if it is syntactically legal UTF-8.

So the effect is to decode as UTF-8 if it looks like UTF-8.  This will
correctly decode strings for any UTF-8 locale.  But you ignored what
happens in the other case: in your terminology it "leaves the flag off";
the effect is that it decodes as ISO-8859-1.  As you say, it will usually
avoid decoding as UTF-8 if the encoding was actually ISO-8859-1, so it'll
usually get a correct decoding for an ISO-8859-1 locale.  (Usually is
not always: I wouldn't want to rely on this for semantic purposes,
but if only message legibility is at stake then it might be acceptable.)

But since UTF-8 and ISO-8859-1 are the only decoding options (because
it's only willing to decide the SvUTF8 flag state), it's *guaranteed*
to decode incorrectly for anything that's neither of these encodings.
Cyrillic in ISO-8859-5?  Guaranteed to get that wrong.  And the layout of
ISO-8859-5 is very different from ISO-8859-1, having many more letters,
such that a natural string is considerably more likely to accidentally
look like UTF-8.  So no guarantee of which kind of mojibake you'll get.
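
A concrete illustration of that guaranteed mismatch, using Encode (the
octet values are the ISO-8859-5 encoding of a Cyrillic word, worked out
by hand):

    use Encode qw(decode);
    binmode STDOUT, ':encoding(UTF-8)';

    my $bytes = "\xBE\xE8\xD8\xD1\xDA\xD0";    # "Ошибка" ("error") in ISO-8859-5
    print decode('ISO-8859-5', $bytes), "\n";  # the intended Cyrillic word
    print decode('ISO-8859-1', $bytes), "\n";  # "¾èØÑÚÐ": what the heuristic yields,
                                               # since the bytes aren't valid UTF-8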

$! used to be consistently mojibaked in a locale-to-Latin-1 manner.
That sucked.  Now, outside the scope of "use locale" it's consistently
English, which is better.  But if one wants localised messages and so
uses "use locale", now $! isn't consistently anything.  It's worse than
when it was wrong in a consistent way.

>There is no way of being able to determine with total reliability the locale
>that something is encoded in

Wrong question.  We're not given an arbitrary string and made to guess
its locale.  We *know* the locale, because it's the LC_MESSAGES setting
under which we just called strerror().  The tricky bit is to determine
the character encoding that the locale uses.

>                             across all systems that Perl can run on.

True, but we can do a lot better than we do now.  nl_langinfo(CODESET)
yields a string naming the encoding, on a lot of systems.  We can feed
that encoding name into Encode.
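
A minimal sketch of that approach in pure Perl (not core code; the
nl_langinfo() interface is exposed by I18N::Langinfo, and this assumes
LC_CTYPE and LC_MESSAGES agree on the codeset):

    use POSIX qw(strerror);
    use I18N::Langinfo qw(langinfo CODESET);
    use Encode qw(find_encoding);

    # Decode strerror() output using the codeset the locale claims to use.
    sub decoded_strerror {
        my ($errno) = @_;
        my $bytes = strerror($errno);
        my $enc   = find_encoding(langinfo(CODESET));
        return $enc ? $enc->decode($bytes) : $bytes;   # fall back to raw bytes
    }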

In fact, we've already got code using nl_langinfo() in the core, in
locale.c, to try to determine whether a locale uses the UTF-8 encoding.
Apparently to control the behaviour of -CL.  We could do a lot more
with this.

-zefram
zefram
8/12/2017 11:21:55 PM
Karl Williamson wrote:
>patches welcome.

Branch zefram/sv_string_from_errnum implements the first thing that
I want.  New API function, doing my_strerror() but with return as an SV.
There's no change in the decoding heuristics; this is just making $!'s
logic available to XS modules.  If there's any change in the decoding
for $! in the future (or any change in the locale selection logic),
that would be reflected in the behaviour of this API function.

I'm dubious about this function going into the "Magical Functions"
section of perlapi(1).  It ends up there by default, because it's in mg.c,
because it calls a static function that's defined there (and which is
called from other locations in that file).  It's not really a suitable
section, but I don't see what the correct section would be.

If there are no objections, I'll probably merge this in a few days,
possibly doing something about the doc section.

-zefram
zefram
8/13/2017 1:14:20 AM
On 08/12/2017 05:21 PM, Zefram wrote:
> Karl Williamson wrote:
>> The heuristic you say is dodgy has been used traditionally in perl,
> 
> I don't recall ever encountering it before.  Though looking now, I see
> some other locale-related uses and, scarily, some in the tokeniser.
> 
>>                       For those of you who aren't familiar with it, it leaves
>> the UTF-8 flag off on strings that have the same representation in UTF-8 as
>> not.  For those, the flag's state is immaterial.
> 
> This is presupposing that the only thing to decide is whether to turn
> on SvUTF8.  A more accurate statement of this part of the heuristic
> would be that it interprets any byte sequence that could be valid ASCII
> as ASCII.  This gives the correct result if the actual encoding of the
> input is ASCII-compatible, which one would hope would always be the case
> for locale encodings on an ASCII-based platform.  (I'm ignoring EBCDIC.)
> 
>>                                                   For other strings, it turns
>> on the flag if and only if it is syntactically legal UTF-8.
> 
> So the effect is to decode as UTF-8 if it looks like UTF-8.  This will
> correctly decode strings for any UTF-8 locale.  But you ignored what
> happens in the other case: in your terminology it "leaves the flag off";
> the effect is that it decodes as ISO-8859-1.  As you say, it will usually
> avoid decoding as UTF-8 if the encoding was actually ISO-8859-1, so it'll
> usually get a correct decoding for an ISO-8859-1 locale.  (Usually is
> not always: I wouldn't want to rely on this for semantic purposes,
> but if only message legibility is at stake then it might be acceptable.)

The point of this is message legibility, so that "$!" doesn't create 
mojibake.  We have had no complaints since it got fixed to work this 
way.  It was never intended to do what you want to extend it to.

Please don't use the terms decode and encode.  They are ambiguous.

What it appears you want to do is to translate the text from the user's 
locale into Perl's underlying encoding, which is ASCII/ISO8859-1, or 
UTF-8.  That may be a worthwhile enhancement, but for current purposes, 
as I said, that hasn't been necessary.  We don't analyze the error 
message; it's just displayed, and as long as it comes out in the 
encoding the user expects, it all works.
> 
> But since UTF-8 and ISO-8859-1 are the only decoding options (because
> it's only willing to decide the SvUTF8 flag state), it's *guaranteed*
> to decode incorrectly for anything that's neither of these encodings.
> Cyrillic in ISO-8859-5?  Guaranteed to get that wrong.  And the layout of
> ISO-8859-5 is very different from ISO-8859-1, having many more letters,
> such that a natural string is considerably more likely to accidentally
> look like UTF-8.  So no guarantee of which kind of mojibake you'll get.

As I said, we are not currently trying to find the encoding that the 
message text is in, just to prevent mojibake, and for that, all that is 
needed is to determine if something is UTF-8 or not, since UTF-8 is the 
only multi-byte encoding that Perl supports.

I had never really thought about this before, but I was wrong that the 
result depended on the particular way word characters vs punctuation 
were positioned in 8859-1.  I read that somewhere sometime, and just 
assumed it was true.  In fact, the range of code points 80 - 9F are not 
allocated in any ISO 8859 encoding.  This range is entirely controls, 
hardly used anywhere anymore, and certainly not in the middle of text. 
However, characters from this range are used in every non-ASCII UTF-8 
sequence as continuation bytes.  This means that the heuristic is 100% 
accurate in distinguishing UTF-8 from any of the 8859 encodings, 
contrary to what you said about 8859-5.

I concede that there are encodings that do use the 80-9F range, and 
these could be wrongly guessed.  The most likely one still in common use 
is CP 1252.  I did try once to create a string that made sense in both 
encodings, and I did succeed, but it was quite hard for me to do, and 
was very short; much shorter than an error message.
> 
> $! used to be consistently mojibaked in a locale-to-Latin-1 manner.
> That sucked.  Now, outside the scope of "use locale" it's consistently
> English, which is better.  But if one wants localised messages and so
> uses "use locale", now $! isn't consistently anything.  It's worse than
> when it was wrong in a consistent way.

That statement doesn't make sense to me.

> 
>> There is no way of being able to determine with total reliability the locale
>> that something is encoded in
> 
> Wrong question.  We're not given an arbitrary string and made to guess
> its locale.  We *know* the locale, because it's the LC_MESSAGES setting
> under which we just called strerror().  The tricky bit is to determine
> the character encoding that the locale uses.
> 
>>                              across all systems that Perl can run on.
> 
> True, but we can do a lot better than we do now.  nl_langinfo(CODESET)
> yields a string naming the encoding, on a lot of systems.  We can feed
> that encoding name into Encode.
> 
> In fact, we've already got code using nl_langinfo() in the core, in
> locale.c, to try to determine whether a locale uses the UTF-8 encoding.
> Apparently to control the behaviour of -CL.  We could do a lot more
> with this.
> 

Actually, this is used for various reasons.  Again, perl internally, 
both currently and in the past, only cares whether something is UTF-8 
or not.  That's been sufficient for our purposes.

If you look carefully, you will see that it doesn't trust the output of 
nl_langinfo, but checks that a claimed UTF-8 codeset has expected behavior.

I do not know if the codesets returned by nl_langinfo match Encode's 
names in all cases, or even if the names are standardized across 
platforms, and, to repeat, nl_langinfo is not available on some modern 
systems, such as win32, which doesn't even have LC_MESSAGES.

Note that your translation will always end up being in UTF-8 for any 
8-bit encoding that is not ISO-8859-1.

And finally, I want to reiterate again that what you are proposing is 
not how perl has ever operated on locale data.  We do not care what the 
encoding is, except for UTF-8.  For all others, it's just a series of 
bytes that should make sense to the user.

Also, what you are proposing should be trivially achievable in pure Perl 
using POSIX::nl_langinfo and Encode.  If you were to prototype it that 
way you could find out if there are glitches between the names each 
understands.
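
One way such a prototype survey might look (a sketch, assuming a working
`locale -a` command and that I18N::Langinfo is available):

    use POSIX qw(setlocale LC_ALL);
    use I18N::Langinfo qw(langinfo CODESET);
    use Encode qw(find_encoding);

    # For every installed locale, report codeset names Encode doesn't know.
    for my $loc (grep { length } split /\n/, `locale -a`) {
        next unless defined setlocale(LC_ALL, $loc);
        my $codeset = langinfo(CODESET);
        print "Encode has no match for '$codeset' (locale $loc)\n"
            unless find_encoding($codeset);
    }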
public
8/15/2017 1:46:19 AM
Karl Williamson wrote:
>fact, the range of code points 80 - 9F are not allocated in any ISO 8859
>encoding.  This range is entirely controls, hardly used anywhere anymore, and
>certainly not in the middle of text. However, characters from this range are
>used in every non-ASCII UTF-8 sequence as continuation bytes.  This means
>that the heuristic is 100% accurate in distinguishing UTF-8 from any of the
>8859 encodings, contrary to what you said about 8859-5.

No, that's not correct.  The C1 controls are indeed there, in all the ISO
8859 encodings, but they only cover half the range of UTF-8 continuation
bytes.  0xa0 to 0xbf are also continuation bytes.  So many, not all,
multibyte UTF-8 character representations consist entirely of byte values
that represent printable characters in ISO-8859-*.  The thing about the
distribution of letters and symbols comes from the fact that none of 0xa0
to 0xbf represent letters in ISO-8859-1.  But most of them are letters
in ISO-8859-5.  (Luckily they're capital letters, which provides some
lesser degree of safety against accidentally forming UTF-8 sequences.)

>And finally, I want to reiterate again that what you are proposing is not how
>perl has ever operated on locale data.

True, but how it's operating now is crap.  It was somewhat crap
when it didn't decode locale strings at all, and just trusted that
the bytes should make sense to the user.  It was an oversight that
when Unicode was embraced this wasn't changed to decode to the native
Unicode representation.  But at least it was consistent in providing
a locale-encoding byte string.  Now it's inconsistent: $! may provide
either the locale-encoded byte string or the character string that the
byte string probably represents.  Consistently decoding it to a character
string would certainly be novel, but it's the only behaviour that makes
sense in a Unicode environment.

>Also, what you are proposing should be trivially achievable in pure Perl
>using POSIX::nl_langinfo and Encode.

It's not trivial to apply this to $!, because of the aforementioned
inconsistency.  It's *possible* with some mucking about with SvUTF8,
but we'd never say that that kind of treatment of $! was a supported
interface.

>                                      If you were to prototype it that way
>you could find out if there are glitches between the names each understands.

Yes, ish.  The basic decoding can certainly be prototyped this way, and so
can the additional logic for places where nl_langinfo() is unavailable or
where we can detect that it gives bad data.  But this doesn't sound all
that useful as an investigatory tool.  The way to find out how useful
this logic is is to gather strerror()/nl_langinfo() pairs from a wide
range of OSes.  In any case, as a porting task it's not something that
one person can do alone.

-zefram
zefram
8/15/2017 4:38:03 AM
Quoth Karl Williamson:
> I concede that there are encodings that do use the 80-9F range, and
> these could be wrongly guessed.  The most likely one still in common
> use is CP 1252.  I did try once to create a string that made sense in
> both encodings, and I did succeed, but it was quite hard for me to do,
> and was very short; much shorter than an error message.

Actual, non-synthetic example:
    https://en.wikipedia.org/wiki/Muvrar%C3%A1%C5%A1%C5%A1a

The name "Muvrar=C3=A1=C5=A1=C5=A1a" can be encoded in Windows-1252 as =
the octets
(hex) 4D 75 76 72 61 72 E1 9A 9A 61
which is also the correct UTF-8 encoding of the string "Muvrar=E1=9A=9Aa",=

where the next-to-last character is U+169A OGHAM LETTER PEITH.
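
For anyone who wants to check, a quick Encode-based verification of the
octets above:

    use Encode qw(decode);
    binmode STDOUT, ':encoding(UTF-8)';

    my $octets = "\x4D\x75\x76\x72\x61\x72\xE1\x9A\x9A\x61";
    print decode('cp1252', $octets), "\n";   # Muvrarášša
    print decode('UTF-8',  $octets), "\n";   # Muvrarᚚa, ending in U+169A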


/Bo Lindbergh
blgl
8/15/2017 8:01:43 PM
On 08/14/2017 10:38 PM, Zefram wrote:
> Karl Williamson wrote:
>> fact, the range of code points 80 - 9F are not allocated in any ISO 8859
>> encoding.  This range is entirely controls, hardly used anywhere anymore, and
>> certainly not in the middle of text. However, characters from this range are
>> used in every non-ASCII UTF-8 sequence as continuation bytes.  This means
>> that the heuristic is 100% accurate in distinguishing UTF-8 from any of the
>> 8859 encodings, contrary to what you said about 8859-5.
> 
> No, that's not correct.  The C1 controls are indeed there, in all the ISO
> 8859 encodings, but they only cover half the range of UTF-8 continuation
> bytes.  0xa0 to 0xbf are also continuation bytes.  So many, not all,
> multibyte UTF-8 character representations consist entirely of byte values
> that represent printable characters in ISO-8859-*.  The thing about the
> distribution of letters and symbols comes from the fact that none of 0xa0
> to 0xbf represent letters in ISO-8859-1.  But most of them are letters
> in ISO-8859-5.  (Luckily they're capital letters, which provides some
> lesser degree of safety against accidentally forming UTF-8 sequences.)

I'm sorry I got confused, and additionally misstated stuff I wasn't 
confused about.  I do sometimes space out that continuation bytes go up 
through BF.  And my point about the C1 controls was not that they are 
unusable in 8859 texts, but that they are separate from 8859, unlike 
Windows CP1252 which does use most of the C1-defined code points to 
represent graphic characters.

But my point remains: you are just not going to see C1 controls in text.

That leaves the range A0-BF, which are legal continuation bytes but are 
mostly symbols in 8859-1, making it hard to confuse that encoding with 
UTF-8.  That's why the layout of 8859-1 does make a difference, though 
the chance of confusion is above 0%.

The range A0-BF in 8859-5 is almost entirely letters.  Modern Russian is 
represented by the 4 rows B0-EF, plus A1 and F1 (though these last two 
are often transliterated to other letters these days).  The other word 
characters in 8859-5 are used in other Cyrillic languages.  Text in 
those languages will use a mixture of Russian characters plus characters 
from the A0-AF row and the F0-FF row.

The capital letters are those up through CF; anything above is lowercase.

For a byte sequence to be confusable with a UTF-8-encoded character, it 
must begin with a byte of C0 or greater, followed by one or more 
continuation bytes in the 80-BF range.

The range C0-CF is essentially the last half of the capital letters in 
the modern Russian alphabet, including half the vowels.

Let's take that case first.  To be valid UTF-8, the next byte must be 
below C0, and hence must also be uppercase.  If this represents a word 
in Cyrillic, the next byte must again be C0 and above, and so on.  One 
could construct a confusable sequence of uppercase letters as long as 
every other one comes from the last half of the Russian alphabet, and 
the others from the first half, or are from Macedonian, Ukrainian and 
the like.

I took Russian in college; the capitalization rules are similar to 
English.  You just don't see strings of all caps.  So yes, this is 
confusable for short strings of all caps, provided the other conditions 
are met.  Something like the Cyrillic equivalent of EINVAL might be 
confusable.
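
For instance, the two-byte 8859-5 rendering of the all-caps Cyrillic word
"ТО" is already well-formed UTF-8 (a small check with Encode; octet values
worked out by hand):

    use Encode qw(decode);
    binmode STDOUT, ':encoding(UTF-8)';

    my $octets = "\xC2\xBE";                    # "ТО" in ISO-8859-5
    print decode('ISO-8859-5', $octets), "\n";  # ТО
    print decode('UTF-8',      $octets), "\n";  # ¾  (the same bytes parse as UTF-8)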

Now let's look at the other case, where the first byte is D0 or above. 
This is a lowercase letter, and it must be followed by one or more bytes 
that are all uppercase.  Again, you won't see things like aB, bAR, eINV 
in text.

I looked at the remaining 8859 code pages
-2  only one vowel below C0
-3  only one vowel below C0
-4  only two vowels below C0
-6  no letters below C0
-7  7 letters below C0, all polytonic Greek, and I'm not qualified to 
analyze this.
-8 only punctuation below E0
-9 only punctuation below C0
-10 almost all characters C0 and above are vowels
-11 I'm not qualified to analyze Thai, but I notice that of the code 
points C0 and above, more than half are: 1) unassigned; 2) digits; 3) 
must immediately follow another byte; whereas in UTF-8 they are start bytes.
-12 this code page was never finished
-13 only three letters (2 of them vowels) below C0
-14 almost all the letters C0 and above are vowels, so the text would 
have to mostly be vc vcc vccc.  That's quite unlikely for more than a 
couple of words in a row
-15 only two vowels below C0
-16 only three vowels below C0

It looks to me like this heuristic can fail on strings of a few bytes, 
but for real text does a pretty good job.
> 
>> And finally, I want to reiterate again that what you are proposing is not how
>> perl has ever operated on locale data.
> 
> True, but how it's operating now is crap.  It was somewhat crap
> when it didn't decode locale strings at all, and just trusted that
> the bytes should make sense to the user.  It was an oversight that
> when Unicode was embraced this wasn't changed to decode to the native
> Unicode representation.  But at least it was consistent in providing
> a locale-encoding byte string.  Now it's inconsistent: $! may provide
> either the locale-encoded byte string or the character string that the
> byte string probably represents.  Consistently decoding it to a character
> string would certainly be novel, but it's the only behaviour that makes
> sense in a Unicode environment.

I don't believe most of this.  Perhaps some of that is because you used 
the word 'decode' again in a way that obscures your meaning.
> 
>> Also, what you are proposing should be trivially achievable in pure Perl
>> using POSIX::nl_langinfo and Encode.
> 
> It's not trivial to apply this to $!, because of the aforementioned
> inconsistency.  It's *possible* with some mucking about with SvUTF8,
> but we'd never say that that kind of treatment of $! was a supported
> interface.

Since I don't understand and don't believe the above stuff, I don't see 
that writing in C gives you any more tools than pure perl.
> 
>>                                       If you were to prototype it that way
>> you could find out if there are glitches between the names each understands.
> 
> Yes, ish.  The basic decoding can certainly be prototyped this way, and so
> can the additional logic for places where nl_langinfo() is unavailable or
> where we can detect that it gives bad data.  But this doesn't sound all
> that useful as an investigatory tool.  The way to find out how useful
> this logic is is to gather strerror()/nl_langinfo() pairs from a wide
> range of OSes.  In any case, as a porting task it's not something that
> one person can do alone.
> 
> -zefram
> 
public
8/16/2017 4:11:26 AM
On 08/15/2017 10:11 PM, Karl Williamson wrote:
> I looked at the remaining 8859 code pages
> -2  only one vowel below C0
> -3  only one vowel below C0
> -4  only two vowels below C0
> -6  no letters below C0
> -7  7 letters below C0, all polytonic Greek, and I'm not qualified to 
> analyze this.
> -8 only punctuation below E0
> -9 only punctuation below C0
> -10 almost all characters C0 and above are vowels
> -11 I'm not qualified to analyze Thai, but I notice that of the code 
> points C0 and above, more than half are: 1) unassigned; 2) digits; 3) 
> must immediately follow another byte; whereas in UTF-8 they are start 
> bytes.
> -12 this code page was never finished
> -13 only three letters (2 of them vowels) below C0
> -14 almost all the letters C0 and above are vowels, so the text would 
> have to mostly be vc vcc vccc.  That's quite unlikely for more than a 
> couple of words in a row
> -15 only two vowels below C0
> -16 only three vowels below C0


I realized that my analysis is flawed for the code pages that are some 
variant of Latin.  With the Cyrillic script, you aren't going to be 
using any of the ASCII letters to fill out words, because they aren't in 
the same script.  But the Latin variants can have ASCII letters 
intermixed to make words, so the constraints aren't as severe as I 
indicated.  Take for example 8859-2, which has only one non-ASCII vowel 
below C0.  One could imagine a word that starts with a consonant C0 and 
above, then has that vowel, and the rest are ASCII.  That would be 
confusable, and if it were the only word with non-ASCII in the text, the 
guess would be wrong.

An exercise one could do is take dictionaries in various languages in 
the appropriate code pages, filtering out all the words that are just 
ASCII, and then checking each word to see if it is legal UTF-8.  That would 
quantify how good the heuristic (which I suspect has been around since 
UTF-8 was added to perl) is.  This would be pretty easy to mostly 
automate if there were a source of dictionaries in UTF-8.  It could be 
brute forced, just trying every encoding Encode knows about on every 
dictionary, and then ignoring that case if Encode says it can't output 
that dictionary in that encoding.
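
A sketch of how that exercise might look for one dictionary and one code
page (the file name and the choice of 8859-2 are hypothetical):

    use Encode qw(find_encoding);

    my $enc = find_encoding('iso-8859-2') or die "Encode lacks iso-8859-2";
    open my $fh, '<:encoding(UTF-8)', 'words.txt' or die "words.txt: $!";

    my ($total, $confusable) = (0, 0);
    while (my $word = <$fh>) {
        chomp $word;
        next unless $word =~ /[^\x00-\x7F]/;   # ignore words that are just ASCII
        # Skip words the code page can't represent at all.
        my $bytes = eval { $enc->encode($word, Encode::FB_CROAK) };
        next unless defined $bytes;
        $total++;
        my $copy = $bytes;
        $confusable++ if utf8::decode($copy);  # the 8859-2 bytes also parse as UTF-8
    }
    printf "%d of %d non-ASCII words would be guessed as UTF-8\n",
        $confusable, $total;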
public
8/16/2017 5:36:37 AM
On 08/15/2017 02:01 PM, Bo Lindbergh wrote:
> Quoth Karl Williamson:
>> I concede that there are encodings that do use the 80-9F range, and
>> these could be wrongly guessed.  The most likely one still in common
>> use is CP 1252.  I did try once to create a string that made sense in
>> both encodings, and I did succeed, but it was quite hard for me to do,
>> and was very short; much shorter than an error message.
> 
> Actual, non-synthetic example:
>      https://en.wikipedia.org/wiki/Muvrar%C3%A1%C5%A1%C5%A1a
> 
> The name "Muvrarášša" can be encoded in Windows-1252 as the octets
> (hex) 4D 75 76 72 61 72 E1 9A 9A 61
> which is also the correct UTF-8 encoding of the string "Muvrarᚚa",
> where the next-to-last character is U+169A OGHAM LETTER PEITH.
> 
> 
> /Bo Lindbergh
>

I'm curious how you found this?

(This particular example could be solved by realizing that Ogham is not
a script likely to be represented in 1252.)
public
8/18/2017 11:07:44 PM
On 08/12/2017 11:55 AM, Zefram wrote:
> In scope of "use locale", my_strerror() returns a string encoded in the
> locale's charset.  $! uses a dodgy heuristic to sometimes decode this.

I have addressed any "dodginess" by

commit a8f4b0c691d6f1b08948976e74087b646bf8c6ef
  Author: Karl Williamson <khw@cpan.org>
  Date:   Fri Aug 18 13:46:25 2017 -0600

      Improve heuristic for UTF-8 detection in "$!"

      Previously, the stringification of "$!" was considered to be UTF-8 if it
      had any characters with the high bit set, and everything was
      syntactically legal UTF-8.  This may incorrectly guess on short strings
      where there are only a few non-ASCII bytes.  This could happen in
      languages based on the Latin script where many words don't use
      non-ASCII.

      This commit adds a check that the locale is a UTF-8 one.  That check is
      a call to an already-existing subroutine which goes to some lengths to
      get an accurate answer, and should be essentially completely reliable on
      modern systems that have nl_langinfo() and/or mbtowc().
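
A pseudo-Perl restatement of the improved heuristic (the real check is C
code in the core; langinfo(CODESET) merely stands in for its more careful
locale probing):

    use I18N::Langinfo qw(langinfo CODESET);

    sub message_looks_utf8 {
        my ($bytes) = @_;
        return 0 unless $bytes =~ /[^\x00-\x7F]/;            # all-ASCII: flag is moot
        return 0 unless langinfo(CODESET) =~ /\AUTF-?8\z/i;  # new: locale must be UTF-8
        my $copy = $bytes;
        return utf8::decode($copy) ? 1 : 0;                  # bytes must be well-formed
    }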
public
8/18/2017 11:10:59 PM
I wrote:
>Branch zefram/sv_string_from_errnum implements the first thing that
>I want.  New API function, doing my_strerror() but with return as an SV.

Now applied to blead as commit 658db62260a2a680132cf1a36a3788db37a6941b.

-zefram
zefram
8/18/2017 11:12:34 PM
Quoth Karl Williamson:
> On 08/15/2017 02:01 PM, Bo Lindbergh wrote:
>> Actual, non-synthetic example:
>>     https://en.wikipedia.org/wiki/Muvrar%C3%A1%C5%A1%C5%A1a
>
> I'm curious how you found this?

Back when the English-language Wikipedia used Windows-1252,
people used to create redirects to articles with non-ASCII titles
from the UTF-8-interpreted-as-Windows-1252 title.  These were all
made obsolete by the transition to UTF-8 but not cleared away
until several years later.

Circa 2008, I was experimenting with automatic detection of these
using a database dump as input, and Muvrarᚚa was one of the very few
false positives that came up.  (I never finished this project.)


/Bo Lindbergh
blgl
8/19/2017 5:08:36 AM