[Encode] Encode::Supported revised

Folks,

   Encode is near completion.  I am still building a djgpp environment for 
possible fixes, but everything else is done.
   Meanwhile, please have a look at the revised Encode::Supported for the added 
encodings (Encode now comes with all encodings covered by 
http://www.unicode.org/Public/MAPPINGS/ -- except for the Indics, which are 
beyond the capability of the current encengine; algorithmic approaches are still 
possible.  Porters wanted.  See below).  Enjoy.

Dan the Encode Maintainer

=head1 NAME

Encode::Supported -- Supported encodings by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive. White space in names
is ignored.  In addition an encoding may have aliases.
Each encoding has one "canonical" name.  The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence:

        o The MIME name as defined in IETF RFCs.
        o The name in the IANA registry.
        o The name used by the organization that defined it.
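The selection rule above can be sketched in a few lines.  This is only an illustration written in Python, not how Encode itself is implemented; the function name C<pick_canonical> and the dictionary keys C<mime>, C<iana> and C<vendor> are hypothetical.

```python
# Hypothetical sketch of the "first name in this order wins" rule.
# The keys "mime", "iana" and "vendor" are invented for illustration.
PRIORITY = ("mime", "iana", "vendor")


def pick_canonical(names):
    """Return the first available name in MIME > IANA > vendor order."""
    for source in PRIORITY:
        name = names.get(source)
        if name:
            return name
    raise ValueError("encoding has no known name")


print(pick_canonical({"vendor": "cp878", "iana": "KOI8-R", "mime": "KOI8-R"}))
print(pick_canonical({"vendor": "MacRoman"}))
```

An encoding with neither a MIME nor an IANA name, as is the case for the Macintosh encodings, falls through to the vendor's name.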

Because of all the alias issues, and because in the general case
encodings have state, "Encode" uses the encoding object internally
once an operation is in progress.
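
What "encodings have state" means can be seen with ISO-2022-JP, which switches character sets by escape sequences.  The sketch below uses Python's standard C<codecs> module purely as an illustration; it says nothing about how the Encode objects are implemented.

```python
import codecs

# ISO-2022-JP selects a character set with an escape sequence, so the
# encoder has to remember which set is currently selected.
encoder = codecs.getincrementalencoder("iso2022_jp")()

first = encoder.encode("\u3042")        # HIRAGANA LETTER A
second = encoder.encode("\u3044")       # HIRAGANA LETTER I
tail = encoder.encode("", final=True)   # flush: switch back to ASCII

# The second chunk relies on the state left behind by the first one.
print(first, second, tail)
```

Because the second chunk only makes sense in the context set up by the first, a stateful encoding is more naturally handled through an object than through a bare name.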

=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'.  In
other words, "ISO 8859 1" and "iso-8859-1" are identical.
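
The normalization described above amounts to something like the following.  This is a minimal Python sketch of the rule as stated, assuming that lowercasing plus space-to-hyphen replacement is all that is needed; C<normalize_name> is a hypothetical helper, not part of Encode.

```python
import re


def normalize_name(name):
    """Lowercase a name and turn each run of spaces into a single '-'."""
    return re.sub(r" +", "-", name.strip().lower())


print(normalize_name("ISO 8859 1"))
print(normalize_name("iso-8859-1"))
```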

Encodings are categorized and implemented in several different modules,
but in most cases you don't have to C<use Encode::XX> to make them
available.  Encode.pm automatically loads those modules on demand.

=head2 Built-in Encodings

The following encodings are always available.

   Canonical     Aliases                      Comments & References
   ----------------------------------------------------------------
   US-ascii      ascii                                       [ECMA]
   iso-8859-1    latin1                                       [ISO]
   UCS-2         ucs2, iso-10646-1                    [IANA, et al]
   UCS-2le
   UTF-8         utf8                                     [RFC2279]
   ----------------------------------------------------------------

=head2 Encode::Byte -- Extended ASCII

Encode::Byte implements most single-byte encodings except for
Symbols and EBCDIC. The following encodings are single-byte
encodings implemented as extended ASCII.  In most cases they use
\x80-\xff (the upper half) to map non-ASCII characters.
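
The "extended ASCII" idea can be checked quickly: the lower half decodes the same everywhere, while a byte in the upper half means something different in each encoding.  The sketch uses Python's built-in codecs as stand-ins for the corresponding Encode encodings; the choice of the four codecs and of the byte 0xE4 is arbitrary.

```python
codecs_to_try = ("latin_1", "koi8_r", "cp1251", "cp437")
ascii_bytes = b"Perl"     # all bytes below 0x80
high_byte = b"\xe4"       # one byte from the upper half

for codec in codecs_to_try:
    # The ASCII part is identical everywhere; the high byte is not.
    print(codec, ascii_bytes.decode(codec), repr(high_byte.decode(codec)))
```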

=over 2

=item ISO-8859 and corresponding vendor mappings

Since there are so many, they are presented in table format with
languages and the corresponding encoding names by vendor.  Note that the
table is sorted in order of ISO-8859 and that the corresponding vendor
mappings are slightly different from those of ISO.  See
L<http://czyborra.com/charsets/iso8859.html> for details.

   Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
   ----------------------------------------------------------------
   U.S           (ASCII)         cp437        AdobeStandardEncoding
                                 cp863 (DOSCanadaF)
   W.  Europe    (iso-8859-1)    cp850   cp1252  MacRoman  nextstep
                                                          hp-roman8
                                 cp860 (DOSPortuguese)
   CE. Europe    iso-8859-2      cp852   cp1250  MacCentralEurRoman
                                                 MacCroatian
                                                 MacRomanian
                                                 MacRumanian
   Latin3(*3)    iso-8859-3
   Latin4(*4)    iso-8859-4
   Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
     (Also see next section)     cp866           MacUkrainian
   Arabic        iso-8859-6      cp864   cp1256  MacArabic
                                 cp1006          MacFarsi
   Greek         iso-8859-7      cp737   cp1253  MacGreek
                                 cp869 (DOSGreek2)
   Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
   Turkish       iso-8859-9      cp857   cp1254  MacTurkish
   Nordics       iso-8859-10     cp865
                                 cp861           MacIcelandic
                                                 MacSami
   Thai          iso-8859-11     cp874           MacThai
   (iso-8859-12 is nonexistent. Reserved for Indics?)
   Baltics      iso-8859-13      cp775           cp1257
   Celtics      iso-8859-14
   Latin9(*15)  iso-8859-15
   Latin10      iso-8859-16
   Vietnamese    viscii                  cp1258  MacVietnamese
   ----------------------------------------------------------------

   (*3) Esperanto, Maltese, and Turkish. Turkish is now on 8859-9
   (*4) Baltics.  Now on 8859-10
   (*9) Nicknamed Latin0; Euro sign as well as  French and Finnish
        letters that are missing from 8859-1 are added.

All cp* are also available as ibm-*, ms-*, and windows-* .  See also
L<http://czyborra.com/charsets/codepages.html>.

Macintosh encodings don't seem to be registered with bodies such as
IANA.  "Canonical" names in Encode are based upon Apple's Tech Note
1150.  See L<http://developer.apple.com/technotes/tn/tn1150.html>
for details.

=item KOI8 - De Facto Standard for Cyrillic world

Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more popular
on the Net.   L<Encode> comes with the following KOI charsets.  For
gory details, see L<http://czyborra.com/charsets/cyrillic.html>.

   ----------------------------------------------------------------
   koi8-f
   koi8-r cp878                                           [RFC1489]
   koi8-u                                                 [RFC2319]


=item gsm0338 - Hentai Latin 1

GSM0338 is for GSM handsets. Though it shares alphanumerics with ASCII,
control character ranges and other parts are mapped very differently,
presumably to store Cyrillics.  This one is also covered in
Encode::Byte even though it does not comply with extended ASCII.

=back

=head2 The CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above.  Also read "Encoding vs Charset"
below.  Also note that these are implemented in distinct modules by
language, due to size concerns.  Please also refer to their
respective documentation pages.

=over 4

=item Encode::CN -- Continental China

   Standard      DOS/Win Macintosh       Comment
   ----------------------------------------------------------------
   euc-cn                MacChineseSimp  GB2312 is aliased to this
   (gbk)         cp936                   GBK is aliased to this
   gb12345-raw                           GB12345 as is
   gb2312-raw                            GB2312 as is
   hz
   iso-ir-165
   ----------------------------------------------------------------

=item Encode::JP -- Japan

   Standard      DOS/Win Macintosh       Comment/Reference
   ----------------------------------------------------------------
   euc-jp
   shiftjis      cp932   macJapanese
   7bit-jis        jis
   euc-jp          ujis
   iso-2022-jp                           [RFC1468]
   iso-2022-jp-1                         [RFC2237]
   ----------------------------------------------------------------

=item Encode::KR -- Korea

   ----------------------------------------------------------------
   euc-kr                MacKorean
                 cp949                   ks_c_5601-1987
   iso-2022-kr                           [RFC1557]
   johab
   ksc5601-raw                           KSC5601 as is
   ----------------------------------------------------------------

=item Encode::TW -- Taiwan

   ----------------------------------------------------------------
   big5          cp950   MacChineseTrad
   big5-hkscs
   ----------------------------------------------------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

   ----------------------------------------------------------------
   gb18030
   euc-tw
   big5plus
   ----------------------------------------------------------------

=back

=head2 Miscellaneous encodings

=over 4

=item Encode::EBCDIC

See perlebcdic for details.

   ----------------------------------------------------------------
   cp1047
   cp37
   posix-bc
   ----------------------------------------------------------------

=item Encode::Symbols

For symbols  and dingbats.

   ----------------------------------------------------------------
   symbol
   dingbats
   MacDingbats
   AdobeZdingbat
   AdobeSymbol
   ----------------------------------------------------------------

=back

=head1 Unsupported encodings

The following are not supported as yet, some because they are rarely
used, some because of technical difficulty.  They may be supported by
external modules via CPAN in the future, however.

=over 4

=item   ISO-2022-JP-2 [RFC1554]

Not very popular yet.  Needs the Unicode Database or equivalent to
implement encode() (because it includes JIS X 0208/0212, KSC5601, and
GB2312 simultaneously, whose code points in Unicode overlap, so you
need to look up the database to determine which character set a given
Unicode character should belong to).

=item   ISO-2022-CN [RFC1922]

Not very popular.  Needs CNS 11643-1 and 2, which are not available in
this module.  CNS 11643 is supported (via euc-tw) in
Encode::HanExtra.  Autrijus may add support for this encoding in his
module in the future.

=item various HP-UX encodings

The following are unsupported due to the lack of mapping data.

   '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
   '15' - japanese15, korean15, and  roi15

=item Cyrillic encoding ISO-IR-111

Anton doubts its usefulness.

=item ISO-8859-8-1 [Hebrew]

None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and
MacHebrew are supported only because there were mappings
available at L<http://www.unicode.org/>).  Contributions welcome.

=item Thai encoding TCVN

Ditto.

=item Vietnamese encodings VPS

Ditto.

=item various Mac encodings

The following are unsupported due to the lack of mapping data.

   MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
   MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
   MacLaotian,   MacMalayalam, MacMongolian, MacOriya
   MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
   MacVietnamese

The Mac encodings that are already available are based upon the vendor
mappings at L<http://www.unicode.org/>.

=item (Mac) Indic encodings

The maps for the following are available at L<http://www.unicode.org/>
but remain unsupported because those encodings need an algorithmic
approach, which is not supported by F<enc2xs>.

   MacDevanagari
   MacGurmukhi
   MacGujarati

For details, please see C<Unicode mapping issues and notes:> at
L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .

I believe this issue is prevalent not only for Mac Indics but also in
other Indic encodings, but those mentioned above were the only Indic
encoding maps that I could find at L<http://www.unicode.org/> .

=back

=head1 Encoding vs. Charset

Character encoding (or just "encoding") and Character Set (or just
"charset") are often used interchangeably but they are different
concepts.

=over 2

=item Character I<Set> (I<charset> for short)

Is a collection of characters in which each character is distinguished
by a unique ID (in most cases, the ID is a number).

=item Character I<Encoding>

Is a way to represent character set(s) in a stream of bits.

=back

A character encoding may contain a single character set
(e.g. US-ascii) or multiple character sets (e.g. EUC-JP;
US-ascii, JIS X 0201 Kana, JIS X 0208 and JIS X 0212).

A character encoding may also encode a character set as-is (also called
a I<raw> encoding, e.g. US-ascii) or processed (e.g. EUC-JP: US-ascii is
as-is, JIS X 0201 is prepended with \x8E, JIS X 0208 has 0x8080 added,
and JIS X 0212 has 0x8080 added and is then prepended with \x8F).
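
The EUC-JP arithmetic above can be written out directly.  The following is a small Python sketch; the three helper functions are hypothetical names introduced for illustration, and Python's built-in C<euc_jp> codec is used only to cross-check the two cases it is compared against.

```python
def euc_jp_from_jisx0208(code):
    """JIS X 0208 code in GL form (e.g. 0x2422) -> EUC-JP: add 0x8080."""
    return (code + 0x8080).to_bytes(2, "big")


def euc_jp_from_jisx0201_kana(code):
    """JIS X 0201 kana byte (e.g. 0xB1) -> EUC-JP: prepend 0x8E."""
    return b"\x8e" + bytes([code])


def euc_jp_from_jisx0212(code):
    """JIS X 0212 code in GL form -> EUC-JP: add 0x8080, prepend 0x8F."""
    return b"\x8f" + (code + 0x8080).to_bytes(2, "big")


# HIRAGANA LETTER A is 0x2422 in JIS X 0208.
print(euc_jp_from_jisx0208(0x2422) == "\u3042".encode("euc_jp"))
# HALFWIDTH KATAKANA LETTER A is 0xB1 in JIS X 0201.
print(euc_jp_from_jisx0201_kana(0xB1) == "\uff71".encode("euc_jp"))
```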

As the name suggests, the Encode module supports encodings, not
individual charsets.

However, the word I<charset> is casually used, even by the Internet
Assigned Numbers Authority, to actually mean I<encoding>.  Encode tries
to soothe this misconception via aliases.  For instance,
C<gb2312> is aliased to C<euc-cn>, while "raw" encoded version is
available as C<gb2312-raw>.

=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their
applicability for information exchange over the Internet and to
choose the most suitable aliases to name them in the context of
such communication.

=over 2

=item *

To (en|de)code encodings marked as C<(*)>, you need C<Encode::HanExtra>,
available from CPAN.

=back

Encoding names

   US-ASCII    UTF-8     ISO-8859-*  KOI8-R
   Shift_JIS   EUC-JP  ISO-2022-JP ISO-2022-JP-1
   EUC-KR      Big5

are registered with IANA as preferred MIME names and can probably be used 
over the Internet.

C<Shift_JIS> is no longer Microsoft proprietary since it has been
made official by JIS X 0208-1997.

   EUC-CN

has not been registered with IANA (as of March 2002) but
seems to be supported by major web browsers. In Encode, GB2312
is aliased to EUC-CN, with the "uncooked" version of GB2312 canonicalized
as gb2312-raw.  See L<Encode::CN> for details.

   KS_C_5601-1987

has been registered with IANA, but when it is used, the data is
EUC-coded.  The Internet community in Korea is not happy with this,
so C<KS_C_5601-1987> is aliased to C<cp949>, an enhanced version
of C<euc-kr>, with ksc5601-raw for "uncooked".
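
The relation between C<euc-kr> and C<cp949> can be probed with one syllable that is in KS X 1001 and one that is not.  This sketch uses Python's C<euc_kr> and C<cp949> codecs as stand-ins, on the assumption that they match the encodings of the same names in Encode; C<fits_in_two_bytes> is a hypothetical helper.

```python
def fits_in_two_bytes(char, codec):
    """True if `codec` encodes `char` as exactly two bytes."""
    try:
        return len(char.encode(codec)) == 2
    except UnicodeEncodeError:
        return False


# U+AC00 is part of KS X 1001; U+B620 is a Hangul syllable outside it.
for char in ("\uac00", "\ub620"):
    print(hex(ord(char)),
          fits_in_two_bytes(char, "euc_kr"),
          fits_in_two_bytes(char, "cp949"))
```

Both codecs agree on the KS X 1001 syllable, while only C<cp949> has a two-byte code for the other one.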

   UTF-16
   KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)

are IANA-registered (C<UTF-16> even as a preferred MIME name)
but probably should be avoided as encodings for web pages due to
the lack of browser support.

   ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
   GBK
   VISCII
   GB 12345
   GB 18030 (*)  (see links below)
   EUC-TW   (*)

are totally valid encodings but not registered at IANA.
The names under which they are listed here are probably the
most widely-known names for these encodings and are recommended
names.

   BIG5PLUS (*)

is a somewhat proprietary name.

=head1 Bookmarks

=over 2

=item czyborra.com

L<http://czyborra.com/>

Contains a lot of useful information, especially gory details of ISO
vs. vendor mappings.

=item Assigned Charset Names by IANA

L<http://www.iana.org/assignments/character-sets>

Most of the C<canonical names> in Encode derive from this list,
so you can directly apply the string you have extracted from the MIME
headers of mail and web pages.

=item CJK.inf

L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>

Somewhat obsolete (last update in 1996), but still useful.  Also try

L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>

You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>

=item ECMA-035 (eq C<ISO-2022>)

L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>

The very specification of ISO-2022 is available from the link above.

=back

=head1 See Also

L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>

=cut

I could not find this page because the hostname doesn't resolve!

  Brief description for most of the mentioned CJK encodings
L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>

dankogai (76)
4/3/2002 2:02:14 PM
perl.unicode

Hello, Dan!

DK>    Encode is near completion.  I am still bulding djgpp environment for
DK> possible fixes needed but anything else is over.

My congratulations! :-)

DK>    (*9) Nicknamed Latin0; Euro sign as well as  French and Finnish
DK>         letters that are missing from 8859-1 are added.
Hmmm.. There seems to be no (*9) footnote in the table, but there's
a dangling (*15)..

DK> in the Net.   L<Encode> comes with the following KOI charsets.  for
DK> gory details, See <http://czyborra.com/charsets/cyrillic.html> for
         ^^
DK> details.
    ^^


DK> "Encoding vs Charset"

Hmm.. I seem to have a "special opinion" on this!

Though I'm still rewriting this I'm making half-cooked variant
available:
http://tagunov.tripod.com/survey2.html
(under construction)
(http://tagunov.tripod.com/survye.html
(original variant, that
I have decided to nuke - complex and hash)

In short,
- [RFC 2130], [RFC 2278] have established CCS, CES terminology

- "Coded Character Set" sounds ambiguous with the ISO terminology,
  as cited by [RFC 1345].

- My opinion is that 'ISO Coded Character Set' = 'CES + CCS'

- CCS is not ambiguous with ISO terminology, as the abbreviation
  has first been introduced by RFC 2130 and seemed not to be
  used before
  
  I have already seen in some articles CCS being used to
  mean "Coded Character Set" in the [RFC 2130] meaning.

- note that [RFC 2278] (logical successor to [RFC 2130])
  recommends to tear apart the "charset" abbreviation from
  "character set" and "coded character set".

  It recommends to use "charset" in a meaning
  identical to CES.

So we have

'RFC 2130 Coded Character Set'             = CCS
'RFC 2130 Character Encoding Scheme'       = CES
                                           = encoding(?)
'MIME charset, as recommended by RFC 2278' = CES

'ISO Coded Character Set' = CCS+CES
'Coded Character Set', 'Character Set' are not clear
                                       outside of context


So maybe this heading should better become
"Encoding vs CCS" or "CES vs CCS"? I know it sounds less
understandable, but maybe it is a less controversial approach?

DK> However, the word I<charset> is casually used even in Internet
DK> Assigned Number Authority to actually mean I<encoding>.

Oops! I hadn't seen this when writing my previous set of comments!
However, CCS sounds more accurate (scarier and less
understandable, too) to me. I prefer accuracy, and you?

DK> Encode tries to soothe this misconception via aliases.

Hmm.. this leaves an impression that this is the only thing
that aliases do. Was that the intent?

Otherwise the description of the difference between CCS and CES
sounds _very_ good to me :-)


DK> The very dspecification of ISO-2022 is available from the link above.
            -dspecification
            + specification

DK> I could not find this page because the hostname doesn't resolve!

DK>   Brief description for most of the mentioned CJK encodings
DK> L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>

Okay, let's nuke this!
- it is more than covered by czyborra
- I have no trouble resolving the host but the page fails to load
  to the end anyway

- Anton


tagunov
4/4/2002 12:21:39 AM
DK>> "Encoding vs Charset"

AT> Hmm.. I seem to have a "special opinion" on this!

AT> Though I'm still rewriting this I'm making half-cooked variant
AT> available:
AT> http://tagunov.tripod.com/survey2.html
AT> (under construction)
AT> (http://tagunov.tripod.com/survye.html

A typo:

http://tagunov.tripod.com/survey.html

AT> (original variant, that
AT> I have decided to nuke - complex and hash)


tagunov
4/4/2002 12:23:41 AM
Hello, Dan!

  None of the Encode team knows Hebrew enough (ISO-8859-8, cp1255 and
- MacHebrew are supported because and just because there were mappings
+ MacHebrew are supported just because there were mappings
  available at L<http://www.unicode.org/>).  Contribution welcome.

? - Anton


tagunov
4/4/2002 12:41:52 AM
On Wed, 3 Apr 2002, Dan Kogai wrote:

  Dan,

  Thank you for your write-up. Below are some comments.

>         o The MIME name as defined in IETF RFCs.
>    UCS-2         ucs2, iso-10646-1                    [IANA, et al]
>    UCS-2le
>    UTF-8         utf8                                     [RFC2279]
>    ----------------------------------------------------------------

  How about UCS-2BE? Of course, if UCS-2 is network byte order
(big endian), it's not necessary. In that case, you may alias UCS-2
to UCS-2BE.


> =item Encode::KR -- Korea
> 
>    ----------------------------------------------------------------
>    euc-kr                MacKorean      

     euc-kr                MacKorean      [RFC1557, IANA, KS X 2901]

>                  cp949                   ks_c_5601-1987
>    iso-2022-kr                           [RFC1557]
>    johab                                 

     johab                                 KS X 1001:1998 Annex 3

>    ksc5601-raw                           KSC5601 as is
>    ----------------------------------------------------------------

> =item Vietnamese encodings VPS

  Mozilla supports VPS. See

   http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf
   http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut


> =head1 Encoding vs. Charset
> 
> Character encoding (or just "encoding") and Character Set (or just
> "charset") are often used interchangeably but they are different
> concepts.

> =item Character I<Set> (I<charset> for short)

  Could you please just say 'Encoding vs Character Set'
and remove parenthetical 'charset for short' or 'just charset' following
'character set'?  I agree to your distinction between 'encoding' and
'character set', but what is bothering me is that you treat 'charset'
as a synonym to 'character set'.

Whether you like it or not, 'charset' is overloaded by MIME to mean
'encoding' (Character set Encoding Scheme=CES as defined in RFC 2130).
Everyday numerous html documents are produced with meta tags that read
'Content-Type=text/html; charset=XXXX'. The same is true of email
messages with C-T header like 'text/plain; charset=ISO-2022-JP'.
Therefore 'Encoding vs Charset' can be interpreted as 'Encoding vs
Encoding'.  On the other hand, no one with *sufficient understanding*
of the issue uses 'character set' to mean encoding.


> Is a collection of characters in which each character is distinguished
> with unique ID (in most cases, ID is number).

  Some people like to distinguish between a mere collection of characters
and a collection of characters with unique (numeric) IDs / code points.
The former is sometimes referred to as a character repertoire
or a character set, whereas the latter is called a 'coded character set'.

> =item Character I<Encoding>
> 
> A character encoding may also encode character set as-is (also called
> a I<raw> encoding.  i.e. US-ascii) or processed (i.e. EUC-JP, US-ascii is
> as-is, JIS X 0201 is prepended  with \x8E, JIS X 0208 is added by
> 0x8080, and JIS X 0212 is added by 0x8080 then prepended with \x8F).

   In a strict sense, the concept of 'raw' or 'as-is' (which you
apparently use to mean a coded character set invoked on GL)  is not
appropriate, because JIS X 0208, JIS X 0212 and KS X 1001 don't map
characters to their GL position when enumerating characters in their
charts. The numeric ID used in JIS X 0208, JIS X 0212 and KS X 1001
are row (ku) and column(ten?)  while GB 2312-80 appears to use GL
codepoints. That's why I prefer gb2312-gl and ksx1001-gl to gb2312-raw
and ksx1001-raw. 'gl' doesn't have a risk of being mistaken for row and
column numbers.


>    KS_C_5601-1987
> 
> has been registered to IANA but when they are used, they are
> EUC-coded.  Internet community in Korea is not happy with this.
> so C<KS_C_5601-1987> is aliased to C<cp949>, an enhanced version
> of C<euc-kr>, with ksc5601-raw for "uncooked".

  I'm afraid this could give an impression that
IANA is to blame for misuse of the CCS name to mean encoding/CES. Whether
ks_c_5601-1987 is registered with IANA or not, nobody had used it in
MIME charset designation (although the general public used KS C 5601 or
Wansung to mean EUC-KR) before Microsoft began to use it in 1997~1998
for their own CP949 (not EUC-KR per se). BTW, I wouldn't call CP949 an
*enhanced* version of EUC-KR. CP949 doesn't have some nice properties
of EUC-KR/JP/CN. Rather, I'd say it's an extension of EUC-KR used
in MS-Windows 9x/ME/NT4/2k/XP. CP949 will never be supported under
Linux/Unix.  We'll just go straight to UTF-8.


>    UTF-16
>    KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)
> 
> are IANA-registered (C<UTF-16> even as a preferred MIME name)
> but probably should be avoided as encoding for web pages due to
> the lack of browser supports.

  Not that I'd encourage people to use UTF-16 for their web pages,
but  UTF-16 is supported by MS IE and KOI8-U is supported by both MS IE
and Mozilla.

> =item CJK.inf
> 
> L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
> Somewhat obsolete (last update in 1996), but still useful.  Also try

  Is there any rule against mentioning a book in print as opposed
to online docs :-) ?  Why don't you also  refer to a successor to
CJK.inf, CJKV Information Processing with a very comprehensive coverage
on character sets and encodings.

   Cheers,

  Jungshik 

jshin
4/4/2002 6:06:32 AM
Annyonhaseyo!

   You are definitely my strict.pm!  Never write a serious program 
without it :)

On Thursday, April 4, 2002, at 03:06 , Jungshik Shin wrote:
>>         o The MIME name as defined in IETF RFCs.
>>    UCS-2         ucs2, iso-10646-1                    [IANA, et al]
>>    UCS-2le
>>    UTF-8         utf8                                     [RFC2279]
>>    ----------------------------------------------------------------
>
>   How about UCS-2BE? Of course, if UCS-2 is network byte order
> (big endian), it's not necessary. In that case, you may alias UCS-2
> to UCS-2BE.

   And UCS2-NB (Network Byte order)?  Unicode terminology is confusing 
sometimes.
   I've checked http://www.unicode.org/glossary/ and it seems that the 
canonical - alias order should be as follows.

    UCS-2         ucs2, iso-10646-1, utf-16be
    UTF-16LE      ucs2-le
    UTF-8         utf8

   I left UCS-2 as is because it is IANA registered. UCS-2 is indeed a 
name of encoding as the URL above clearly states.  It is also less 
confusing than UTF-16.
   ucs2-le will be fixed.

>>    euc-kr                MacKorean
>
>      euc-kr                MacKorean      [RFC1557, IANA, KS X 2901]

Kamsahamnida.

>>    johab
>
>      johab                                 KS X 1001:1998 Annex 3

Ditto.


>> =item Vietnamese encodings VPS
>
>   Mozilla supports VPS. See
>
>    http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf
>    http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut

   Thank you.  I've been away from the Mozilla team for too long (I helped 
with the first build documentation and Lesstif stuff more than 100 million 
seconds ago)...
   Hopefully it will be added in the near future.

>   Could you please just say 'Encoding vs Character Set'
> and remove parenthetical 'charset for short' or 'just charset' following
> 'character set'?  I agree to your distinction between 'encoding' and
> 'character set', but what is bothering me is that you treat 'charset'
> as a synonym to 'character set'.
> [snip]

    Now I agree.  charset is more appropriate for "coded character set" 
and that was the MIME header's first intention.  EUC is indeed a coded 
character set but charset=ISO-2022-(JP|KR|CN)(-\d+)?  is absolutely 
confusing --  it is a character encoding scheme at best.  I am thinking 
of adding a small glossary to this document as follows.

>    In a strict sense, the concept of 'raw' or 'as-is' (which you
> apparently use to mean a coded character set invoked on GL)  is not
> appropriate. Because JIS X 0208, JIS X 0208 and KS X 1001 don't map
> characters to their GL position when enumerating characters in their
> charts. The numeric ID used in JIS X 0208, JIS X 0212 and KS X 1001
> are row (ku) and column(ten?)  while GB 2312-80 appears to use GL
> codepoints. That's why I prefer gb2312-gl and ksx1001-gl to gb2312-raw
> and ksx1001-raw. 'gl' doesn't have a risk of being mistaken for row and
> column numbers.

   I wonder whether the ku-ten form is canonical or derived.  JIS X 0208 was 
clearly designed to be ISO-2022 compliant.  Technically speaking, 
0x21-0x7e should be the original and 1 - 94 is derived to make decimal 
people happier.  But you've got a point.
   Speaking of '-raw', that's the BSD way of naming unprocessed data, and 
for a Daemon freak it came out naturally.

>   I'm afraid this could give an impression that
> IANA is to blame for misuse of the CCS name to mean encoding/CES. 
> Whether
> ks_c_5601-1987 is registerd with IANA or not, nobody had used it in
> MIME charset designation (although the general public used KS C 5601 or
> Wansung to mean EUC-KR) before Microsoft began to use it in 1997~1998
> for their own CP949 (not EUC-KR per se). BTW, I wouldn't call CP949 an
> *enhanced* version of EUC-KR. CP949 doesn't have some nice properties
> of EUC-KR/JP/CN. Rather, I'd say it's an extension of EUC-KR used
> in MS-Windows 9x/ME/NT4/2k/XP. CP949 will never be supported under
> Linux/Unix.  We'll just go straight to UTF-8.

    If I were IANA (or W3 or whosoever writes a standard), I'd have done 
the same because we need Content-Encoding: for other purpose (such as 
compression).  But once again this Content-Encoding: should have been 
called Content-Encapsulation: to make room for Content-Encoding :)  At 
any rate, it is too late.

>> are IANA-registered (C<UTF-16> even as a preferred MIME name)
>> but probably should be avoided as encoding for web pages due to
>> the lack of browser supports.
>
>   Not that I'd encourage people to use UTF-16 for their web pages,
> but  UTF-16 is supported by MS IE and KOI8-U is supported by both MS IE
> and Mozilla.

   The problem is not just browsers.  As a network consultant I would 
advise against UTF-16 or any text encoding that may croak cat(1) and 
more(1) (We can go frank on "Mojibake"  For cases like mojibake, the 
text goes to EOF).  After all, we have UTF-8 already that good old cat 
of ours can read till EOF with no problem.

>> =item CJK.inf
>>
>> L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
>> Somewhat obsolete (last update in 1996), but still useful.  Also try
>
>   Is there any rule against mentioning a book in print as opposed
> to online docs :-) ?  Why don't you also  refer to a successor to
> CJK.inf, CJKV Information Processing with a very comprehensive coverage
> on character sets and encodings.

   No.  I was just too lazy to browse for ISBN number and such (I know it 
is trivially easy to search at amazon and such but sometimes a simple 
copy and paste is too hard for my finger :).

And here is a glossary I manually parsed out of 
http://www.unicode.org/glossary/ , right after the signature.

Dan the Linted Man

=head2 Glossary

=over 2

=item character repertoire

A collection of unique characters.  A I<character> set in the 
strictest sense. At this stage characters are not numbered.

=item coded character set (CCS)

A character set that is mapped in a way computers can use directly.  
Many character encodings, including EUC, fall into this category.

=item character encoding scheme (CES)

An algorithm to map a character set to a byte sequence.   You don't have 
to be able to tell which character set a given byte sequence belongs to.  
7-bit ISO-2022 is a CES but it cannot be a CCS.  EUC is an example of 
being both a CCS and a CES.

=item Unicode

A Character Set.

=item UCS

Short for I<Universal Character Set>.  When you say just UCS, it means 
I<Unicode>.

=item UCS-2

ISO/IEC 10646 encoding form: Universal Character Set coded in two octets.

=item UTF

Short for I<Unicode Transformation Format>.

=item UTF-16

A UTF in 16-bit encoding.  Can either be in big endian or little 
endian.  Big endian version is called UTF-16BE and little endian version 
is UTF-16LE.

=back
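
The byte-order variants in the glossary can be compared side by side.  The sketch below uses Python's C<utf-16-be>, C<utf-16-le> and C<utf-16> codecs for illustration; the remark about surrogate pairs is an addition not taken from the glossary, namely that a character outside the Basic Multilingual Plane takes two 16-bit units in UTF-16 and cannot be represented in plain two-octet UCS-2.

```python
text = "A"
print(text.encode("utf-16-be"))   # big endian: high octet first
print(text.encode("utf-16-le"))   # little endian: low octet first
print(text.encode("utf-16"))      # byte order mark, then native order

# U+10400 lies outside the BMP, so UTF-16 needs a surrogate pair.
outside_bmp = "\U00010400"
print(outside_bmp.encode("utf-16-be").hex())
```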

dankogai
4/4/2002 9:37:41 AM
Hello, Dan!

1)
This is my second portion of comments on the renewed Supported.pod.
This part is 100% orthogonal to the first part.

2)

This patch
- changes status of KOI8-U on Jungshik's comment
  (sorry, I have never tested that myself :-(
- upgrades GB2312 to the "first class citizen"
  (why not?)
- adds a section on Microsoft naming acrobatics
- that patch includes a comment on the Shift_JIS
  differences between JIS X 0208-1997 Appendix 1
  and cp932
- ...
- this patch also makes clear that Encode supports
  the standards for GB2312 and Big5 not Microsoft
  extensions (have I grasped it right? :-)

--- ext/Encode/lib/Encode/Supported.pod.orig    Mon Apr  1 03:42:52 2002
+++ ext/Encode/lib/Encode/Supported.pod Thu Apr  4 15:16:10 2002
@@ -308,8 +308,8 @@
 
 =item * 
 
-To (en|de) code Encodings marked as C<*>, You need C<Encode::HanExtra>
-,available from CPAN.
+To (en|de) code Encodings marked as C<(*)>, You need 
+C<Encode::HanExtra>, available from CPAN.
 
 =back
 
@@ -317,33 +317,43 @@
 
   US-ASCII    UTF-8     ISO-8859-*  KOI8-R
   Shift_JIS   EUC-JP  ISO-2022-JP ISO-2022-JP-1
-  EUC-KR      Big5
+  EUC-KR      Big5      GB2312
 
-are registered to IANA as preferred MIME names and may probably be used over the Internet.
+are registered to IANA as preferred MIME names and may probably 
+be used over the Internet.
 
-C<Shift_JIS> is no longer Microsft proprietary since it has been
-officialized by JIS X 0208-1997.
+C<Shift_JIS> has been officialized by JIS X 0208-1997.
+L<Microsoft-related naming mess> gives details.
+
+C<GB2312> is the IANA name for C<EUC-CN>.
+See L<Microsoft-related naming mess> for details.
+
+C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
+with Encode. See L<Encode::CN -- Continental China> for details.
 
   EUC-CN
+  KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)
 
-has not been registered with IANA (as of march 2002) but
-seems to be supported by major web browsers. In Encode, GB2312
-is aliased to EUC-CN, with "uncooked" version of GB2312 canonicalized
-as gb2312-raw.  See L<Encode::CN> for details.
+have not been registered with IANA (as of March 2002) but
+seem to be supported by major web browsers. 
+IANA name for C<EUC-CN> is C<GB2312>.
 
   KS_C_5601-1987
 
-has been registered to IANA but when they are used, they are
-EUC-coded.  Internet community in Korea is not happy with this.
-so C<KS_C_5601-1987> is aliased to C<cp949>, an enhanced version
-of C<euc-kr>, with ksc5601-raw for "uncooked".
+is heavily misused.
+See L<Microsoft-related naming mess> for details.
+
+C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
+with Encode. See L<Encode::KR -- Korea> for details.
 
   UTF-16 
-  KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)
 
-are IANA-registered (C<UTF-16> even as a preferred MIME name)
+=for comment
+waiting for comments from Jungshik Shin to soften this - Anton
+
+is an IANA-registered preferred MIME name
 but probably should be avoided as encoding for web pages due to 
-the lack of browser supports.
+the lack of browser support.
 
   ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
   GBK
@@ -360,6 +370,73 @@
   BIG5PLUS (*)
 
 is a bit proprietary name. 
+
+=head2 Microsoft-related naming mess
+
+Microsoft products misuse the following names:
+
+=over 2
+
+=item KS_C_5601-1987
+
+Microsoft extension to C<EUC-KR>.
+
+Proper name: C<CP949>.
+
+See
+http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html
+for details.
+
+Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect
+this common misusage. 
+I<Raw> C<KS_C_5601-1987> encoding is available as C<ksc5601-raw>.
+
+See L<Encode::KR -- Korea> for details.
+
+=item GB2312
+
+Microsoft extension to C<EUC-CN>.
+
+Proper names: C<CP936>, C<GBK>.
+
+C<GB2312> has been registered in the C<EUC-CN> meaning at
+IANA. This has partially repaired the situation: Microsoft's 
+C<GB2312> has become a superset of the official C<GB2312>.
+
+Encode aliases C<GB2312> to C<euc-cn> in full agreement with
+IANA registration. C<cp936> is supported separately.
+I<Raw> C<GB_2312-80> encoding is available as C<kcs5601-raw>.
+
+See L<Encode::CN -- Continental China> for details.
+
+=item Big5
+
+Microsoft extension to C<Big5>.
+
+Proper name: C<CP950>.
+
+Encode separately supports C<Big5> and C<cp950>.
+
+=item Shift_JIS
+
+Microsoft's understanding of C<Shift_JIS>.
+
+JIS has not endorsed the full Microsoft standard however.
+The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
+subsets, while Microsoft has always been meaning C<Shift_JIS> to
+encode a wider character repertoire.
+
+As a historical predecessor Microsoft's variant
+probably has more rights for the name, albeit it may be objected
+that Microsoft shouldn't have used JIS as part of the name
+in the first place.
+
+Unambiguous name: C<CP932>.
+
+Encode separately supports C<Shift_JIS> and C<cp932>.
+
+=back
+
 
 =head1 Bookmarks
 
What do you think of it, Dan? :-)

3)

Jungshik, I would have certainly advocated linking not only to
http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html
but also to your comments on the KS_C_5601-1987 in the list archive,
but all your mails were on several subjects each.

Jungshik> ... refer to Ken Lunde's CJKV Information Processing
Jungshik> about that 'epic war' between two camps. (see p.197 of
Jungshik> the book and http://jshin.net/faq/qa8.html)
Jungshik> We even set up a web page to prevent M$ from spreading that
Jungshik> ill-defined name.

Maybe we could link to this page? What is the address?

4)

Certainly the bug
[ID 20020312.006] pod2html does not translate space to '_' in L<>-s
still spoils our links. I have sent a new mail on that to
perl5-porters.

Furthermore, I don't understand why C<gb2312-raw> converts
to <CODE>gb2312-raw</CODE> while C<GB2312> becomes a link?

Anyway I have gone for putting C<> around, but if that feature/bug
persists maybe it's better to drop the C<> in my patch.

- Anton


tagunov
4/4/2002 11:30:36 AM
Hello Jungshik!

Our comments go in the same direction, but will you
let me strengthen your statements a bit?

>> =head1 Encoding vs. Charset
JS> Whether you like it or not, 'charset' is overloaded by MIME to mean
JS> 'encoding' (Character set Encoding Scheme=CES as defined in RFC 2130).
Indeed it is.
RFC 2278 additionally makes it explicit.

JS> On the other hand, no one with *sufficient understanding*
JS> of the issue uses 'character set' to mean encoding.

[ECMA-35, (equivalent of ISO 2022?)]:
coded character set; code
  A set of unambiguous rules that establishes a
  character set and the one-to-one relationship between the 
  characters of the set and their coded representation.

[RFC 1345]:
  The ISO definition of the term "coded character set" is as
  follows: "A set of unambiguous rules that establishes a 
  character set and the one-to-one relationship between the 
  characters of the set and their coded representation."

Hmmm... can this potentially lead to mistaking "character set" for
a short form of "coded character set" (in the ISO meaning)?

I see that these definitions themselves make a distinction between a
"character set"       (= repertoire    ) and
"coded character set" (= CCS + encoding = CCS + CES),

Jungshik?

>> Is a collection of characters in which each character is distinguished
>> with unique ID (in most cases, ID is number).

JS>   Some people like to distinguish between a mere collection of characters
JS> and a collection of characters with uniq(numeric) ID /code points.
JS> The former is sometimes refered to as a character repertoire
JS> or a character set whereas the latter is called a 'coded character set'.
or rather CCS to rule out the ISO understanding

>> =item Character I<Encoding>
>> 
>> A character encoding may also encode character set as-is (also called
>> a I<raw> encoding.  i.e. US-ascii) or processed (i.e. EUC-JP, US-ascii is
>> as-is, JIS X 0201 is prepended  with \x8E, JIS X 0208 is added by
>> 0x8080, and JIS X 0212 is added by 0x8080 then prepended with \x8F).

JS>    In a strict sense, the concept of 'raw' or 'as-is' (which you
JS> apparently use to mean a coded character set invoked on GL)  is not
JS> appropriate. Because JIS X 0208, JIS X 0208 and KS X 1001 don't map
JS> characters to their GL position when enumerating characters in their
JS> charts.
Looks like RFC 1345 has made one big pile:

  JIS_C6226-1978, JIS_C6226-1978 = JIS_C6226-1983
  GB_1988-80
  KS_C_5601-1987
  
are all listed in a similar manner there. Does this RFC change
anything?

JS> The numeric ID used in JIS X 0208, JIS X 0212 and KS X 1001
JS> are row (ku) and column(ten?)  while GB 2312-80 appears to use GL
JS> codepoints.
Thanks a lot! I would never have caught this subtlety from the
reading I have done.

JS>  That's why I prefer gb2312-gl and ksx1001-gl to gb2312-raw
JS> and ksx1001-raw. 'gl' doesn't have a risk of being mistaken for row and
JS> column numbers.
I used to advocate the RFC 1345 names, but they apparently
do nothing to ease the situation (too long and too complex :)


>>    KS_C_5601-1987
>> 
>> has been registered to IANA but when they are used, they are
>> EUC-coded.  Internet community in Korea is not happy with this.
>> so C<KS_C_5601-1987> is aliased to C<cp949>, an enhanced version
>> of C<euc-kr>, with ksc5601-raw for "uncooked".

JS>   I'm afraid this could give an impression that
JS> IANA is to blame for misuse of the CCS name to mean encoding/CES. Whether
JS> ks_c_5601-1987 is registerd with IANA or not, nobody had used it in
JS> MIME charset designation (although the general public used KS C 5601 or
JS> Wansung to mean EUC-KR) before Microsoft began to use it in 1997~1998
JS> for their own CP949 (not EUC-KR per se). BTW, I wouldn't call CP949 an
JS> *enhanced* version of EUC-KR. CP949 doesn't have some nice properties
JS> of EUC-KR/JP/CN. Rather, I'd say it's an extension of EUC-KR used
JS> in MS-Windows 9x/ME/NT4/2k/XP. CP949 will never be supported under
JS> Linux/Unix.  We'll just go straight to UTF-8.

I have incorporated your ideas into a patch, let's see what Dan
thinks on it! (patch sent in reply to Dan's core message on
Supported.pod renewal)


>>    UTF-16
Awaiting more comments from you (see below)

>>    KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)
Haven't tested this one myself :-(
No objections to changing its status. My patch has that.

>> 
>> are IANA-registered (C<UTF-16> even as a preferred MIME name)
>> but probably should be avoided as encoding for web pages due to
>> the lack of browser supports.

JS>   Not that I'd encourage people to use UTF-16 for their web pages,
JS> but  UTF-16 is supported by MS IE and KOI8-U is supported by both MS IE
JS> and Mozilla.
Hmm.. My attempts to use UTF-16 failed with IE5.5..
Has anyone demonstrated it to work?

>> =item CJK.inf
>> 
>> L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
>> Somewhat obsolete (last update in 1996), but still useful.  Also try

JS>   Is there any rule against mentioning a book in print as opposed
JS> to online docs :-) ?  Why don't you also  refer to a successor to
JS> CJK.inf, CJKV Information Processing with a very comprehensive coverage
JS> on character sets and encodings.

http://www.oreilly.com/catalog/cjkvinfo/ is the link for the book
"CJKV Information Processing" is the name
But someone has to write a good recommendation for that.
Let it be someone who has the book ;-)

Or may it be

Ken Lunde's book "CJKV Information Processing"
http://www.oreilly.com/catalog/cjkvinfo/

Successor to CJK.inf. Features a very comprehensive coverage
on CJKV character sets and encodings.

?

Heartiest regards, Anton


tagunov
4/4/2002 11:45:59 AM
On Thursday, April 4, 2002, at 08:30 , Anton Tagunov wrote:
> Hello, Dan!
>
> 1)
> This my second portion of comments on the renewed Supported.pod.
> This part is 100% orthogonal to the first part
>
> 2)
>
> This patch
> - changes status of KOI8-U on Jungshik's comment
>   (sorry, I have never tested that myself :-(
> - upgrades GB2312 to the "first class citizen"
>   (why not?)
> - adds a section on Microsoft naming acrobatics
> - that patch includes a comment on the Shift_JIS
>   differences between JIS X 0208-1997 Appendix 1
>   and cp932
> - ...
> - this patch also makes clear that Encode supports
>   the standards for GB2312 and Big5 not Microsoft
>   extensions (have I grasped it right? :-)

Spahsseebah.  Will be reflected in the next revision.

    __
   / |
/---+ AH

P.S.  I tried to compose a UTF-8 version thereof, only to find my Mac OS X 
is yet to support Russian script.  I could have resorted to yanking it out 
of the ucm but that's too much :)

dankogai
4/4/2002 12:07:42 PM
Hello Dan!
Double glad to hear from you ;-)

Anton> This patch .. <snip/>


DK> Spahsseebah.
??? :-))

DK> Will be reflected in the next revision.
:-)

DK>     __
DK>    / |
DK> /---+ AH
I guess this is some Ideograph :-)


DK> P.S.  I tried to compose UTF-8 version thereof,
Dan, you're so mysterious today! UTF-8 version of what? :-))

DK> only to find my Mac OS X 
DK> is yet to support Russian script.

DK> I could've resort to yank it out of 
DK> ucm but that's too much :)

- Anton


tagunov
4/4/2002 12:16:36 PM
Anton Tagunov wrote on 2002-04-04 11:45 UTC:
> [RFC 1345]:
>   The ISO definition of the term "coded character set" is as

Why only quote old secondary literature, if the ISO standards themselves
are available online:

  http://www.evertype.com/sc2wg3.html

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

Markus
4/4/2002 1:46:51 PM
On Thu, 4 Apr 2002, Anton Tagunov wrote:

 Hi Anton,

 Thanks a lot.

> - changes status of KOI8-U on Jungshik's comment
>   (sorry, I have never tested that myself :-(

  I haven't tested it either :-), but both Mozilla/Netscape6 and MS IE
list it in the view|encoding menu, which I interpret as having support
for it.


>    UTF-16 
> -  KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)
>  
> -are IANA-registered (C<UTF-16> even as a preferred MIME name)
> +=for comment
> +waiting for comments from Jungshik Shin to soften this - Anton
> +
> +is a IANA-registered preferred MIME name
>  but probably should be avoided as encoding for web pages due to 
> -the lack of browser supports.
> +the lack of browser support.

   The reason your test didn't work with MS IE was probably that
you didn't prepend a BOM (byte order mark) to your UTF-16 HTML document.
Note that the conventional way of informing web browsers
of the MIME charset, a <meta> tag, doesn't work for UTF-16/UTF-32.
Either you have to configure your web server to emit a Content-Type header
with 'charset=UTF-16(LE|BE)' or you have to put a BOM at the beginning.
When a BOM is present, MS IE 5/6, Mozilla/Netscape6 and Netscape4
have no problem rendering UTF-16(LE|BE) encoded pages. I put
up a couple of test pages at

   http://jshin.net/i18n/utf16le_kr2.html
   http://jshin.net/i18n/utf16be_kr2.html

For more details on UTF-16 and HTML, you can refer to HTML4 spec. at
 
  http://www.w3.org/TR/html4/charset  (see section 5.2.1)

As I wrote before, I have no intention to encourage use of UTF-16 over
UTF-8 although some people  whose primary script  has a more 'economical'
(in terms of file size) representation in UTF-16 than in UTF-8 may want
to use it.
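
To illustrate the BOM point above, here is a minimal sketch in Python
(chosen only to keep the example self-contained; it does not use Encode).
The helper names utf16_with_bom and sniff_utf16 are made up for this
illustration. It writes the BOM explicitly and then detects the byte order
from it, which is roughly what a browser has to do when no charset
parameter is sent.

```python
import codecs


def utf16_with_bom(text, little_endian=True):
    """Return UTF-16 bytes for text with an explicit BOM in front."""
    if little_endian:
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


def sniff_utf16(data):
    """Guess the UTF-16 byte order from a leading BOM; None if there is no BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


# A tiny page containing two Hangul syllables (U+D55C, U+AE00).
page = "<html><body>\ud55c\uae00</body></html>"
le = utf16_with_bom(page, little_endian=True)
be = utf16_with_bom(page, little_endian=False)

# Python's generic "utf-16" decoder consumes the BOM and picks the byte order.
assert le.decode("utf-16") == page
assert be.decode("utf-16") == page
```

Without the BOM (plain "utf-16-le" or "utf-16-be" output) sniff_utf16
returns None, and the reader has to be told the byte order out of band,
for example by the Content-Type header.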


> +=head2 Microsoft-related naming mess
> +
> +Microsoft products misuse the following names:
> +
> +=over 2
> +
> +=item KS_C_5601-1987
> +
> +Microsoft extension to C<EUC-KR>.
> +
> +Proper name: C<CP949>.
> +
> +See
> +http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html
> +for details.

 Wow, I didn't know that Martin wrote this. Thanks a lot for
digging this up.  He 'rediscovered' what a lot of people in Korea had
complained about. One thing I don't agree with him on is what designation
to use for  CP949. I think it'd better be 'windows-949' because that's
more in line with other MS code pages such as windows-125x (for European
scripts). By the same token, the MS version of Shift_JIS can be labeled as
'windows-932'. At the moment, Mozilla uses 'x-windows-949' for CP949/UHC
because it's not yet registered with IANA. Probably, I have to contact
Martin and discuss this issue.

> +Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect
> +this common misusage. 

 If my patch is accepted, cp949 will have a couple more aliases,
'uhc' and '(x-)-windows-949'. CP949 is commonly known as 
'통합 완성형' (Unified Hangul Code) in Korea.


> +I<Raw> C<KS_C_5601-1987> encoding is available as C<kcs5601-raw>.

  ksc5601-raw had better be renamed  ksx1001-raw and ksc5601-raw
can be made an alias to ksx1001-raw. Please note that what's now called
ksc5601-raw has two new characters which were only added in Dec. 1998,
over a year after the name change (KS C 5601 -> KS X 1001).

> +=item GB2312
> +
> +Encode aliases C<GB2312> to C<euc-cn> in full agreement with
> +IANA registration. C<cp936> is supported separately.
> +I<Raw> C<GB_2312-80> encoding is available as C<kcs5601-raw>.

  Oops... You meant gb2312-raw, didn't you? :-)


> Jungshik, I would have certainly advocated linking not only to
> http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html
> but also to your comments on the KS_C_5601-1987 in the list archive,
> but all your mails were on several subjects each.
> 
> Jungshik> ... refer to Ken Lunde's CJKV Information Processing
> Jungshik> about that 'epic war' between two camps. (see p.197 of
> Jungshik> the book and http://jshin.net/faq/qa8.html)
> Jungshik> We even set up a web page to prevent M$ from spreading that
> Jungshik> ill-defined name.
> 
> maybe we may link to this page? What is the address?

  The campaign web has disappeared since. It was almost 5 years
ago :-). However, my Hangul FAQ subject 8 deals with the issue
(http://jshin.net/faq/qa8.html) so that you may add the link to it.
Well, be aware that it's been untouched for a few years (if not longer)
and needs a complete overhaul.




jshin
4/4/2002 9:49:18 PM
On Thu, 4 Apr 2002, Anton Tagunov wrote:

 Hi Anton !!

AT> Our comments go in the same direction, but will you
AT> let me strengthen your statements a bit?

  Thank you !

JS> On the other hand, no one with *sufficient understanding*
JS> of the issue uses 'character set' to mean encoding.

AT> [ECMA-35, (equivalent of ISO 2022?)]:

  Yes, I think they're a verbatim equivalent of ISO 2022. I'd never
have been able to read ISO 2022 unless ECMA released it free as ECMA 35.

AT> coded character set; code
AT>   A set of unambiguous rules that establishes a
AT>   character set and the one-to-one relationship between the 
AT>   characters of the set and their coded representation.

AT> [RFC 1345]:
AT>   The ISO definition of the term "coded character set" is as
AT>   follows: "A set of unambiguous rules that establishes a 
AT>   character set and the one-to-one relationship between the 
AT>   characters of the set and their coded representation."

AT> Hmmm... can this potentially lead to messing "character set" for
AT> a short form of "coded character set" (in the ISO meaning)?

AT> I see that these definitions themselves make a distinction between a
AT> "character set"       (= repertoire    ) and
AT> "coded character set" (= CCS + encoding = CCS + CES),

> Jungshik?

  Hmm, I feel like being treated as 'the' ultimate something here, which
I'm certainly not and never wanted to be :-)

  I think Dan is right when he wrote that EUC-JP,EUC-KR,EUC-CN,
EUC-TW and even UTF-8 could be regarded as both CCS and CES. Even though
they involve multiple character set standards, the mapping from abstract
characters in those multiple character set standards to integers (despite
being of multiple 'lengths') is strictly one-to-one.  I didn't realize
that it's possible to view things that way until he wrote that. On the
other hand, as he wrote, any encoding that utilizes any form of escape
sequence (locking/single shift, designator, etc) , whether defined in
ISO 2022 or not (I have HZ in mind here)  cannot be called a CCS because
just providing the mapping alone cannot fully specify the way actual
text in that encoding is 'serialized' in octet-sequence. Therefore,
I believe the below doesn't hold true for all encodings we have
to deal with although it's the case for some encodings.

AT> "coded character set" (= CCS + encoding = CCS + CES),

Then, I realize that RFC 1345 has the following after quoting
ISO definition of coded character set which you quoted above.

1345> This memo does not put further
1345> restrictions on the term of "coded character set" than the following:
1345>  "A coded character set is a set of rules that unambiguously and
1345>  completely determines which sequence of characters, if any, is
1345>  represented by each possible sequence of n-bit bytes for a certain
1345>  value of n." This implies that e.g. a coded character set extended
1345>  with one or more other coded character sets by means of the extension
1345>  techniques of ISO 2022 constitutes a coded character set in its own
1345>  right.  In this memo the term "charset" is used to refer to the above
1345>  interpretation of the ISO term "coded character set".

However, even RFC 1345 came up with a new term 'charset' for its
*extended* definition of 'coded character set'  to distinguish it from
the original ISO definition. The definition of 'charset' in RFC 1345
is actually in line with RFC 2130/2278. Therefore, what I wrote about
the statement that "coded character set" (= CCS + encoding = CCS + CES)
is still the case, IMO.



DOC> Is a collection of characters in which each character is distinguished
DOC> with unique ID (in most cases, ID is number).

JS>   Some people like to distinguish between a mere collection of characters
JS> and a collection of characters with uniq(numeric) ID /code points.
JS> The former is sometimes refered to as a character repertoire
JS> or a character set whereas the latter is called a 'coded character set'.

AT> or rather CCS to rule out the ISO understanding

  I don't see any conflict between RFC 2130 CCS and ISO coded character
set _quoted_ in RFC 1345. It's not the original ISO definition of 'coded
character set' but  RFC 1345's extension of the definition that made
things complicated. However, even RFC 1345 gave it a new term 'charset'
to tell it from the original ISO definition.


DOC> =item Character I<Encoding>
DOC> A character encoding may also encode character set as-is (also called
DOC> a I<raw> encoding.  i.e. US-ascii) or processed (i.e. EUC-JP, US-ascii is

JS>    In a strict sense, the concept of 'raw' or 'as-is' (which you
JS> apparently use to mean a coded character set invoked on GL)  is not
JS> appropriate. Because JIS X 0208, JIS X 0208 and KS X 1001 don't map
JS> characters to their GL position when enumerating characters in their
JS> charts.
AT> Looks like RFC 1345 has made one big pile:

AT>   JIS_C6226-1978, JIS_C6226-1978 = JIS_C6226-1983
AT>   GB_1988-80
AT>   KS_C_5601-1987
AT>   
AT> are all listed in a similar manner there. Does this RFC change
AT> anything?

  As we all know well now (and you documented), at least Encode cannot
use 'ks_c_5601-1987' to mean what's described in RFC 1345 (mapping
between characters and row/column numbers) because MS took it away for
their own CP949. A similar misuse of GB2312 made it undesirable to use
GB_2312-80 to mean the row/column (or GL) representation of GB 2312-1980
in Encode.


JS> The numeric ID used in JIS X 0208, JIS X 0212 and KS X 1001 
JS> are row (ku) and column(ten?)  while GB 2312-80 appears to use GL 
JS> codepoints.

AT> Thanks a lot! I would have never caught this subtlety from what
AT> reading I have.

  Then, you also have to note what Dan wrote about the difference. JIS and
KS may have tried to 'please' the decimal-oriented :-)  Reading
what RFC 1345 wrote about GB 2312-80, 

1345> Considering the Chinese standard GB 2312-1980, the
1345> Japanese standards JIS X0208 and JIS X0212, and the Korean standard
1345> KS C 5601, they are all given by row and column numbers between 1 and
1345> 94. So two positions for row and column and a character set
1345> identifier of one character would be almost as short as possible

I developed a reservation about what I wrote about GB 2312-80.  Either I
(or Ken Lunde) am(is) wrong or the author of RFC 1345 was wrong. Or,
both could be right because it's possible that the printed version of
GB 2312-80 in Chinese used GL code points while the English document
submitted to ISO to register GB 2312-80 used row/column number.
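
For readers following the row/column versus GL distinction, here is a
small Python sketch (not Encode; the function names kuten_to_gl and
gl_to_gr are invented for this note). It assumes the usual 94x94 layout:
adding 0x20 to a row or column number gives the GL octet, and setting the
high bit on top of that gives the GR octet used by the EUC encodings.

```python
def kuten_to_gl(row, col):
    """Map a 1-based (row, column) pair of a 94x94 set to its two GL octets."""
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("row and column must both be in 1..94")
    return bytes([row + 0x20, col + 0x20])


def gl_to_gr(gl_octets):
    """Move GL octets (0x21..0x7E) to GR (0xA1..0xFE) by setting the high bit."""
    return bytes(octet | 0x80 for octet in gl_octets)


# KS X 1001 row 16, column 1 is the first Hangul syllable, U+AC00.
gl = kuten_to_gl(16, 1)      # 0x30 0x21
gr = gl_to_gr(gl)            # 0xB0 0xA1, the EUC-KR form
char = gr.decode("euc_kr")   # Python's EUC-KR codec, used here as a cross-check
```

So the same character has three spellings here: (16, 1) as row/column,
0x3021 in GL and 0xB0A1 in GR. Which of these a given standard document
prints in its charts is exactly the question raised above.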


DOC>    KS_C_5601-1987

JS>   I'm afraid this could give an impression that
JS> IANA is to blame for misuse of the CCS name to mean encoding/CES. Whether
JS> ks_c_5601-1987 is registerd with IANA or not, nobody had used it in
JS> MIME charset designation (although the general public used KS C 5601 or
JS> Wansung to mean EUC-KR) before Microsoft began to use it in 1997~1998
JS> for their own CP949 (not EUC-KR per se). BTW, I wouldn't call CP949 an
JS> *enhanced* version of EUC-KR. CP949 doesn't have some nice properties

 By 'nice properties', I mean you don't have to go back and forth
to figure out which character set any given octet in a file/stream
belongs to because all octets to represent characters in KS X 1001
have MSB=1 while octets for US-ASCII have MSB=0. That doesn't hold
true for CP949/UHC, Shift_JIS, Big5, and Johab.
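
A quick way to see this property is to look at the octets themselves.
The Python sketch below is only an illustration (high_bit_only is a
made-up helper, and the codec names are Python's, not Encode's). It
compares EUC-JP with Shift_JIS for KATAKANA LETTER SO, whose Shift_JIS
trail byte is 0x5C, the same octet as an ASCII backslash.

```python
def high_bit_only(text, encoding):
    """True if every octet produced for text has its most significant bit set."""
    return all(octet >= 0x80 for octet in text.encode(encoding))


KATAKANA_SO = "\u30bd"       # Shift_JIS 0x83 0x5C, EUC-JP 0xA5 0xBD
HANGUL = "\ud55c\uae00"      # two syllables that are in KS X 1001

euc_jp_ok = high_bit_only(KATAKANA_SO, "euc_jp")        # both octets are in GR
shift_jis_ok = high_bit_only(KATAKANA_SO, "shift_jis")  # trail byte is in the ASCII range
euc_kr_ok = high_bit_only(HANGUL, "euc_kr")

trail_byte = KATAKANA_SO.encode("shift_jis")[1:]        # the octet that collides with "\"
```

In the EUC case a scanner can classify any octet in isolation by its top
bit. In the Shift_JIS case it has to know whether it is looking at a lead
byte or a trail byte, which means reading from a known starting point.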


AT> I have incorporated your ideas into a patch, let's see what Dan
AT> thinks on it! (patch sent in reply to Dan's core message on
AT> Supported.pod renewal)

  Thanks a lot.


DOC>    UTF-16

JS>   Not that I'd encourage people to use UTF-16 for their web pages,
JS> but  UTF-16 is supported by MS IE and KOI8-U is supported by both MS IE
JS> and Mozilla.

AT> Hmm.. My attempts to use UTF-16 failed with IE5.5..
AT> Has anyone demonstrated it to work?

  See my comment about this in my other reply to you. It works fine
with MS IE,Netscape 4, Netscape 6 and Mozilla.



JS> to online docs :-) ?  Why don't you also  refer to a successor to
JS> CJK.inf, CJKV Information Processing with a very comprehensive coverage
JS> on character sets and encodings.

AT> http://www.oreilly.com/catalog/cjkvinfo/ is the link for the book
AT> "CJKV Information Processing" is the name
AT> But someone has to write a good recommendation for that.
AT> Let it be someone who has the book ;-)

  Hmm, is it me :-) ? A collection of reviews is supposed to be at

ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/review/cjkv-reviews.txt

At the moment, the link is broken, though.

AT> Ken Lunde's book "CJKV Information Processing"
AT> http://www.oreilly.com/catalog/cjkvinfo/

  Or, his web page on the book at

  http://www.oreilly.com/~lunde/cjkv-ip.html

AT> Successor to CJK.inf. Features a very comprehensive coverage
AT> on CJKV character sets and encodings.

 How about just adding the following after '...sets and encodings'

  along with many other issues faced by anyone trying to
  better support CJKV languages/scripts in all the areas of information
  processing.

  Cheers,

  Jungshik 

jshin
4/5/2002 3:29:25 AM
Hello, Jungshik!


JS> One thing I don't agree with him is what designation
JS> to use for  CP949. I think it'd better be 'windows-949'

To me that's no problem. Currently I have written

Proper name: C<CP949>.
Proper names: C<CP936>, C<GBK>.
Proper name: C<CP950>.

How do you advise rewriting this?

JS>  because that's
JS> more in line with other MS code pages such as windows-125x (for European
JS> scripts). By the same token, MS version of Shift_JIS can be labeled as
JS> 'windows-932. At the moment, Mozilla uses 'x-windows-949' for CP949/UHC
JS> because it's not yet registered with IANA. Probably, I have to contact
JS> Martin and discuss this issue.

Anton> +Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect
Anton> +this common misusage.

JS>  If my patch is accepted, cp949 has a couple of more aliases,
JS> 'uhc' and '(x-)-windows-949'. CP949 is commonly known as 
JS> '통합 완성형'(Unified Hangul Code) in Korea.

Adding this to the [level 1/2+0.1] patch :-)

>> +I<Raw> C<KS_C_5601-1987> encoding is available as C<kcs5601-raw>.

JS>   ksc5601-raw had better be renamed  ksx1001-raw and ksc5601-raw
JS> can be made an alias to ksx1001-raw. Pls, note that now what's now called
JS> ksc5601-raw has two new characters which were only added in Dec. 1998
JS> over a year after the name change (KS C 5601 -> KS X 1001).

Sure it's not me, it's Dan! :-)

JS>   Oops... You meant gb2312-raw, didn't you? :-)
Thanks for tracking this!

JS>   The campaign web has disappeared since. It was almost 5 years
JS> ago :-). However, my Hangul FAQ subject 8 deals with the issue
JS> (http://jshin.net/faq/qa8.html) so that you may add the link to it.
JS> Well, be aware that it's been untouched for a few years (if not longer)
JS> and needs a complete overhaul.

The only thing I'm worried about with this is whether I have given
a proper annotation for this (in the patches)!

- Anton


tagunov
4/5/2002 11:22:44 AM
Hello, Jungshik!

http://tagunov.tripod.com/survey2.html is largely an answer,
so, if you allow, I will comment with links into this page :)

JS> On the other hand, no one with *sufficient understanding*
JS> of the issue uses 'character set' to mean encoding.

ISO> coded character set; code
ISO>   A set of unambiguous rules that establishes a
ISO>   character set and the one-to-one relationship between the
ISO>   characters of the set and their coded representation.

AT> Hmmm... can this potentially lead to messing "character set" for
AT> a short form of "coded character set" (in the ISO meaning)?

JS>   I think Dan is right when he wrote that EUC-JP,EUC-KR,EUC-CN,
JS> EUC-TW and even UTF-8 could be regarded as both CCS and CES.
They can :)
http://tagunov.tripod.com/survey2.html#BD
classifies it as the ISO point of view: every encoding inevitably
defines a "Character Set" too.

I understand that this is CCS, not
a character repertoire. And you?

JS> Even though
JS> they involve multiple character set standards, the mapping from abstract
JS> characters in those multiple character set standards to integers (despite
JS> being of multiple 'lengths') is strictly one-to-one.  I didn't realize
JS> that it's possible to view things that way until he wrote that.
Neither did I!

JS> On the other hand, as he wrote, any encoding that utilize any form of
JS> escape sequence (locking/single shift, designator, etc) , whether
JS> defined in ISO 2022 or not (I have HZ in mind here)  cannot be called
JS> a CCS because just providing the mapping alone cannot fully specify
JS> the way actual text in that encoding is 'serialized' in octet-sequence.
I agree that EUC-JP is "more" a CCS than ISO-2022-JP :-)
Still, as I write at
http://tagunov.tripod.com/survey2.html#BD
I think that the [RFC 2130] approach is better than ISO, and you? ;)
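
To make the stateful versus stateless difference concrete, here is a
short Python sketch (Python only for self-containment; iso2022_jp and
euc_jp are Python codec names, not Encode names). It encodes HIRAGANA
LETTER A both ways and shows that in ISO-2022-JP the two payload octets
mean something else once the escape sequence is taken away.

```python
HIRAGANA_A = "\u3042"   # JIS X 0208 row 4, column 2

stateful = HIRAGANA_A.encode("iso2022_jp")   # ESC $ B, two payload octets, ESC ( B
stateless = HIRAGANA_A.encode("euc_jp")      # 0xA4 0xA2, no escape sequences

# The two octets between the escape sequences are plain 7-bit values.
payload = stateful[3:5]

# Decoded on their own (initial state is ASCII) they are just '$' and '"'.
payload_alone = payload.decode("iso2022_jp")

# The EUC-JP octets decode to the same character with no context at all.
roundtrip = stateless.decode("euc_jp")
```

The mapping table alone (row 4, column 2) is therefore not enough to
read ISO-2022-JP text; one also needs the current shift state, which is
the sense in which it is a CES but not a CCS.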

JS> Therefore, I believe the below doesn't hold true for all encodings
JS> we have to deal with although it's the case for some encodings.
I'm afraid I just do not understand you well here, Jungshik.
AT> "coded character set" (= CCS + encoding = CCS + CES),
My statement is "ISO coded character set" = CCS + CES
This does always hold, doesn't it?

JS> Then, I realize that RFC 1345 has the following after quoting
JS> ISO definition of coded character set which you quoted above.
1345> This memo does not put further
1345> restrictions on the term of "coded character set" than the following:
1345>  "A coded character set is a set of rules that unambiguously and
1345>  completely determines which sequence of characters, if any, is
1345>  represented by each possible sequence of n-bit bytes for a certain
1345>  value of n." This implies that e.g. a coded character set extended
1345>  with one or more other coded character sets by means of the extension
1345>  techniques of ISO 2022 constitutes a coded character set in its own
1345>  right.  In this memo the term "charset" is used to refer to the above
1345>  interpretation of the ISO term "coded character set".
JS> However, even RFC 1345 came up with a new term 'charset' for its
JS> *extended* definition of 'coded character set'  to distinguish it from
JS> the original ISO definition. The definition of 'charset' in RFC 1345
JS> is actually in line with RFC 2130/2278.
I was more than happy when I opened RFC 2277. The 'charset' definition
there is the best I have seen :-))

Yes, RFC 1345's second definition of "coded character set", also named
'charset', is identical to the one in RFC 2130/2277/2278.

JS> Therefore, what I wrote about
JS> the statement that "coded character set" (= CCS + encoding = CCS + CES)
JS> is still the case, IMO.
I'm sorry, Jungshik. I'm afraid I did not understand that. Could you
explain that again?

DOC> Is a collection of characters in which each character is distinguished
DOC> with unique ID (in most cases, ID is number).

JS>   Some people like to distinguish between a mere collection of characters
JS> and a collection of characters with unique (numeric) IDs / code points.
JS> The former is sometimes referred to as a character repertoire
JS> or a character set whereas the latter is called a 'coded character set'.

AT> or rather CCS to rule out the ISO understanding

JS>   I don't see any conflict between RFC 2130 CCS and ISO coded character
JS> set _quoted_ in RFC 1345.
Thanks to Markus G. Kuhn we now have the
http://www.evertype.com/standards/iso8859/8859-14-en.pdf link :)
Both 8859-14-en.pdf and ECMA 35 contain very close, slightly reworded
wording:
ISO 8859-14> coded character set; code
ISO 8859-14>   A set of unambiguous rules that establishes a
ISO 8859-14>   character set and the one-to-one relationship between the
ISO 8859-14>   characters of the set and their bit combinations.

2130>  A Coded Character Set (CCS) is a mapping from a set of abstract
2130>  characters to a set of integers.
Does the conflict look more evident now?
[RFC 2130] CCS is not about encoding at all. Rather, it is about
_enumerating_ a set of characters, IMO.
Here's how I try to reword the [RFC 2130] CCS definition:
http://tagunov.tripod.com/survey2.html#BB What do you think of it? ;-)
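
As a rough illustration of the RFC 2130 split, the sketch below treats Unicode as the CCS (an assumption chosen for convenience; the thread itself does not pick one) and shows that the enumeration step yields only an integer, while several different CESs serialize that same integer into different octets. The function names `ccs` and `ces_forms` are hypothetical:

```python
def ccs(ch: str) -> int:
    """CCS in the RFC 2130 sense: abstract character -> integer, no octets yet."""
    return ord(ch)


def ces_forms(ch: str) -> dict:
    """Serialize the same code point with three different CESs."""
    return {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}


ch = "é"
print(hex(ccs(ch)))
for name, octets in ces_forms(ch).items():
    print(name, octets.hex())
```

The integer stays the same in every row; only the octet sequence changes, which is why the CCS alone says nothing about encoding.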

JS>  It's not the original ISO definition of 'coded
JS> character set' but  RFC 1345's extension of the definition that made
JS> things complicated. However, even RFC 1345 gave it a new term 'charset'
JS> to tell it from the original ISO definition.
Yes, it does conflict, '[RFC 2130] CCS' and '[RFC 2277] charset'==encoding

And furthermore, my opinion is that
http://tagunov.tripod.com/survey2.html#A3.1
ISO coded character set == CCS + CES
Do you approve?

So,
'ISO coded character set' is a 'charset' (not vice versa)
'ISO coded character set' is a CCS       (not vice versa)
'charset'  == 'encoding' == 'RFC 1345 second definition'
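
The claim "ISO coded character set == CCS + CES" can be sketched as a composition of two functions. This is only an illustration under stated assumptions: ISO 8859-1 is the example, its code positions are taken to coincide with the first 256 Unicode code points, and the helper names are hypothetical:

```python
def latin1_ccs(ch: str) -> int:
    """CCS step: character -> integer (assumed equal to the Unicode code point)."""
    cp = ord(ch)
    if cp > 0xFF:
        raise ValueError(f"{ch!r} is not in the ISO 8859-1 repertoire")
    return cp


def one_octet_ces(n: int) -> bytes:
    """CES step: each integer becomes exactly one octet (its bit combination)."""
    return bytes([n])


def latin1_charset(text: str) -> bytes:
    """The 'ISO coded character set' viewed as CCS followed by CES."""
    return b"".join(one_octet_ces(latin1_ccs(c)) for c in text)


print(latin1_charset("café"))
```

Here the CES is trivial, which is probably why the ISO wording can fold both steps into a single "one-to-one relationship" between characters and bit combinations.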

DOC> =item Character I<Encoding>
DOC> A character encoding may also encode character set as-is (also called
DOC> a I<raw> encoding.  i.e. US-ascii) or processed (i.e. EUC-JP, US-ascii is

JS>    In a strict sense, the concept of 'raw' or 'as-is' (which you
JS> apparently use to mean a coded character set invoked on GL)  is not
JS> appropriate, because JIS X 0208, JIS X 0212 and KS X 1001 don't map
JS> characters to their GL position when enumerating characters in their
JS> charts.
AT> Looks like RFC 1345 has made one big pile:

AT>   JIS_C6226-1978, JIS_C6226-1978 = JIS_C6226-1983
AT>   GB_1988-80
AT>   KS_C_5601-1987
AT>
AT> are all listed in a similar manner there. Does this RFC change
AT> anything?

JS>   As we all know well now (and you documented), at least Encode cannot
JS> use 'ks_c_5601-1987' to mean what's described in RFC 1345 (mapping
JS> between characters and row/column numbers) because MS took it away for
JS> their own CP949. A similar misuse of GB2312 made it not desirable to
JS> use GB_2312-80 to mean row/column (or GL) repr. of GB 2312-1980 in Encode.
Yes, yes, yes!
But we're speaking about beautiful theory, not rude practice! :-)
And even in theory the situation is fun to me:
GB 2312-80 _has_ defined a raw CES
JIS X 0208 and KS C 5601 _haven't_
But [RFC 1345] has lumped them together and has defined a raw
encoding for each, hasn't it?
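
A small sketch of the naming problem, again with Python's codecs standing in for Encode (an assumption): the codec registered under the name "gb2312" emits 8-bit EUC-CN octets, and the 7-bit "raw" (GL) form has to be derived from it by clearing the high bit. The helper `euc_to_raw` is hypothetical and only valid for data made up purely of two-byte characters:

```python
def euc_to_raw(data: bytes) -> bytes:
    """Clear the high bit of every octet: GR (EUC) form -> GL ('raw') form.

    Only meaningful for data consisting purely of two-byte characters.
    """
    return bytes(b & 0x7F for b in data)


euc = "啊".encode("gb2312")  # the label says GB 2312, the octets are EUC-CN
raw = euc_to_raw(euc)

print(euc.hex(), raw.hex())
```

So a label such as charset="gb2312" in practice names the EUC form, not the raw one.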

JS> The numeric ID used in JIS X 0208, JIS X 0212 and KS X 1001
JS> are row (ku) and column (ten?), while GB 2312-80 appears to use GL
JS> codepoints.

AT> Thanks a lot! I would never have caught this subtlety from the
AT> reading I have done.

JS>   Then, you also have to note what Dan wrote about the difference. JIS and
JS> KS may have tried to 'please' the decimal-oriented :-)
:-) Given that we're hex-oriented rather than decimal-oriented, does
http://tagunov.tripod.com/survey2.html#BB please us?

JS> Reading what RFC wrote about GB 2312-80,

1345> Considering the Chinese standard GB 2312-1980, the
1345> Japanese standards JIS X0208 and JIS X0212, and the Korean standard
1345> KS C 5601, they are all given by row and column numbers between 1 and
1345> 94. So two positions for row and column and a character set
1345> identifier of one character would be almost as short as possible

Just what I was talking about. [RFC 1345] has neglected that
difference and has lumped them all together. And has presented us with raw
encodings for each!!
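
The arithmetic behind the row/column versus GL distinction is small enough to sketch. Assuming the usual 94x94 layout (row and column each between 1 and 94, as the RFC 1345 quote says), adding 0x20 gives the GL octets and setting the high bit gives the GR octets used by the EUC encodings. The function names are hypothetical, and Python's codecs are used only to check the result:

```python
def kuten_to_gl(row: int, col: int) -> bytes:
    """Row/column numbers (1..94) -> two GL octets (0x21..0x7E)."""
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("row and column must be between 1 and 94")
    return bytes([row + 0x20, col + 0x20])


def gl_to_gr(gl: bytes) -> bytes:
    """Set the high bit of each octet: GL -> GR, as used in EUC."""
    return bytes(b | 0x80 for b in gl)


gl = kuten_to_gl(16, 1)
gr = gl_to_gr(gl)

# The same row/column position names a different character in each standard.
chars = {codec: gr.decode(codec) for codec in ("euc_jp", "euc_kr", "gb2312")}
print(gl.hex(), gr.hex(), chars)
```

Whether a standard prints its charts with row/column numbers or with GL code points, the two are related by this fixed offset, so the "raw" octets come out the same either way.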

(Quite useless as I retell your, Autrijus's and
Dan's explanations in  http://tagunov.tripod.com/survey2.html#A5.3)

JS> I developed a reservation about what I wrote about GB 2312-80.  Either I
JS> (or Ken Lunde) am(is) wrong or the author of RFC 1345 was wrong. Or,
JS> both could be right because it's possible that the printed version of
JS> GB 2312-80 in Chinese used GL code points while the English document
JS> submitted to ISO to register GB 2312-80 used row/column number.
The world is a mess :-)
And it seems [RFC 2130] has added to the mess.
No matter that Microsoft has stolen the name, the raw encoding
continues to live. As I've recently heard on perl5-porters,
jis0201-raw and jis0208-raw are probably going to come back, because
of some issues I do not understand. I'm indifferent about it, just
noting that I blame (or praise :-) [RFC 1345] for bringing them to us.

JS><snip/>

JS>   Not that I'd encourage people to use UTF-16 for their web pages,
JS> but  UTF-16 is supported by MS IE and KOI8-U is supported by both MS IE
JS> and Mozilla.
Was in my last patch.

JS> Why don't you also  refer to a successor to
JS> CJK.inf, CJKV Information Processing
JS> ...
JS>   Hmm, is it me :-) ?
;-)
JS>   ...
JS>   along with many other issues faced by anyone trying to
JS>   better support CJKV languages/scripts in all the areas of information
JS>   processing.
Done.
Thanks to Dan for speedy application!

My ultimate regards,
 - Anton

P.S.

JS>   Hmm, I feel like being treated as 'the' ultimate something here, which
JS> I'm certainly not and never wanted to be :-)

Settled :)


tagunov
4/6/2002 8:17:03 AM