-T result changed at 5.22.4

My customer has a script that uses -T to test whether it is looking at a 
binary file, and this script changed behavior on perl upgrade when 
invoked on the same input file.  I narrowed the version range down to 
5.20.3 (last version returning false) to 5.22.4 (first version returning 
true).  gdb output (attached) suggests this is due to this change from 
perl5220delta:

    It has always been the intention for the |-B| and |-T| file test
    operators to treat UTF-8 encoded files as text. (-X FILEHANDLE
    <https://perldoc.perl.org/functions/-X.html> has been updated to say
    this.) Previously, it was possible for some files to be considered
    UTF-8 that actually weren't valid UTF-8. This is now fixed. The
    operators now work on EBCDIC platforms as well.


However, the file being tested is not UTF-8 and should surely be
interpreted as binary as it was before.  Most of it is null characters.

I am attaching the file in question.  Test with:

perl -le 'print "Version $] : ", -T "binary_file_test" ? "text" : "not text"'

Output:

Version 5.020003 : not text
Version 5.022004 : text

I am not perlbugging yet because I have been disconnected from P5P for 
some time and I may be wrong in my assessment, in which case I would 
appreciate an explanation.  But if this is a bug I'm happy to file the 
report. My question boils down to: Is this change in behavior warranted 
for this input?

----
Peter Scott



[Attachment: dash-T-change.tar.gz]
peter
5/22/2018 10:36:33 PM
perl.perl5.porters

On 05/22/2018 04:36 PM, Peter Scott wrote:
> My customer has a script that uses -T to test whether it is looking at a
> binary file, and this script changed behavior on perl upgrade when
> invoked on the same input file.  I narrowed the version range down to
> 5.20.3 (last version returning false) to 5.22.4 (first version returning
> true).  gdb output (attached) suggests this is due to this change from
> perl5220delta:
>
>     It has always been the intention for the |-B| and |-T| file test
>     operators to treat UTF-8 encoded files as text. (-X FILEHANDLE
>     <https://perldoc.perl.org/functions/-X.html> has been updated to say
>     this.) Previously, it was possible for some files to be considered
>     UTF-8 that actually weren't valid UTF-8. This is now fixed. The
>     operators now work on EBCDIC platforms as well.
>
> However, the file being tested is not UTF-8 and should surely be
> interpreted as binary as it was before.  Most of it is null characters.
>
> I am attaching the file in question.  Test with:
>
> perl -le 'print "Version $] : ", -T "binary_file_test" ? "text" : "not text"'
>
> Output:
>
> Version 5.020003 : not text
> Version 5.022004 : text
>
> I am not perlbugging yet because I have been disconnected from P5P for
> some time and I may be wrong in my assessment, in which case I would
> appreciate an explanation.  But if this is a bug I'm happy to file the
> report. My question boils down to: Is this change in behavior warranted
> for this input?

Please file a bug report.

I looked at this, and in fact all bytes of this file are legal UTF-8,
hence the heuristic kicks in to mark it as text.

But all bytes are ASCII except for one two-byte sequence, \xC4\x9C, which
happens to form the code point U+011C.
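
The arithmetic can be checked with a short sketch using the core Encode module. The helper name utf8_code_points and the sample buffer are made up for illustration; the buffer is only a stand-in for a file that is mostly NULs and is not the attached file.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: returns an array ref of code points if $bytes is
# strictly valid UTF-8, or undef if the strict decode fails.
sub utf8_code_points {
    my ($bytes) = @_;                       # work on a copy; FB_CROAK may modify its input
    my $chars = eval { decode('UTF-8', $bytes, FB_CROAK) };
    return unless defined $chars;           # malformed UTF-8
    return [ map { ord($_) } split //, $chars ];
}

# The two bytes 0xC4 0x9C carry the payload bits 00100 and 011100,
# which combine to 0x11C.
printf "U+%04X\n", $_ for @{ utf8_code_points("\xC4\x9C") };

# Stand-in buffer (an assumption, not the real file): NULs around the
# one non-ASCII character. NUL is itself a valid one-byte UTF-8 sequence.
my $buffer = ("\0" x 100) . "\xC4\x9C" . ("\0" x 100);
my $cps = utf8_code_points($buffer);
print defined $cps ? "valid UTF-8\n" : "not valid UTF-8\n";
```

A strict decode accepts such a buffer, so a check that only asks "is this valid UTF-8?" cannot tell it apart from text.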

So the heuristic should be improved.  I need to give some thought to
how.  Suggestions welcome.

Looking at how it currently works for non-UTF-8 text, it looks in the
first 512 bytes for characters it deems "odd", which are C0 controls
minus most of the \s ones, and minus \e.  If more than a third of the
buffer is odd, it is considered binary.  Except, if even a single NUL is
found, it is marked as binary.  Now I'm not so sure that is correct.  I
could potentially see a file consisting of C strings.  But in that case
there wouldn't be multiple NULs in a row, except perhaps at the end of
the file.
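
For discussion, the non-UTF-8 path described above can be sketched roughly as follows. This is a reading of the description, not perl's actual source. The function name looks_binary is hypothetical, and the exact set of exempt whitespace controls (\t, \n, \f and \r here) is an assumption.

```perl
use strict;
use warnings;

# Sketch only: follows the prose description, not the real implementation.
# Assumption: the exempt whitespace controls are \t, \n, \f and \r.
sub looks_binary {
    my ($bytes) = @_;
    my $buf = substr($bytes, 0, 512);       # only the first 512 bytes are examined
    return 1 if index($buf, "\0") >= 0;     # a single NUL marks the buffer as binary
    # Count "odd" bytes: C0 controls other than the exempt whitespace and ESC (\e).
    my $odd = () = $buf =~ /[\x01-\x08\x0B\x0E-\x1A\x1C-\x1F]/g;
    return $odd * 3 > length($buf) ? 1 : 0; # more than a third odd means binary
}

print looks_binary("plain text\n")  ? "binary" : "text", "\n";
print looks_binary("C string\0pad") ? "binary" : "text", "\n";
```

Any replacement rule for the NUL case, such as only treating runs of consecutive NULs as binary, would slot in where the index() check is.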

>
> ----
> Peter Scott
public
5/23/2018 2:08:58 AM