TONT 40093 Some files come up strange in Notepad

The “联通” (China Unicom) battery.

Original post: https://blogs.msdn.microsoft.com/oldnewthing/20040324-00/?p=40093

David Cumps discovered that certain text files come up strange in Notepad.

The reason is that Notepad has to edit files in a variety of encodings, and when its back is against the wall, sometimes it’s forced to guess.

Here’s a file containing the text “Hello” in various encodings:

48 65 6C 6C 6F

This is the traditional ANSI encoding.

48 00 65 00 6C 00 6C 00 6F 00

This is the Unicode (little-endian) encoding, that is, UTF-16LE, with no BOM.

FF FE 48 00 65 00 6C 00 6C 00 6F 00

This is the Unicode (little-endian) encoding with BOM. The BOM (FF FE) serves two purposes: First, it tags the file as a Unicode document, and second, the order in which the two bytes appear indicates that the file is little-endian.

00 48 00 65 00 6C 00 6C 00 6F

This is the Unicode (big-endian) encoding with no BOM. Notepad does not support this encoding.

FE FF 00 48 00 65 00 6C 00 6C 00 6F

This is the Unicode (big-endian) encoding with BOM. Notice that this BOM is in the opposite order from the little-endian BOM.

EF BB BF 48 65 6C 6C 6F

This is the UTF-8 encoding. The first three bytes are the UTF-8 encoding of the BOM.

2B 2F 76 38 2D 48 65 6C 6C 6F

This is the UTF-7 encoding. The first five bytes are the UTF-7 encoding of the BOM. Notepad doesn’t support this encoding.

Notice that the UTF-7 BOM encoding is just the ASCII string “+/v8-”, which is difficult to distinguish from a regular file that happens to begin with those five characters (as odd as they may be).
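The prefixes listed above can be checked mechanically. The following is a minimal Python sketch, not Notepad's actual code: the names `BOM_TABLE`, `sniff_bom`, and `hexdump` are made up for illustration, and the table covers only the four prefixes shown in this post.

```python
import codecs

# Prefixes taken from the examples in this post. None of them is a prefix
# of another, so the order of the table does not matter.
BOM_TABLE = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (b"+/v8-", "utf-7"),                 # 2B 2F 76 38 2D
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no known prefix matches."""
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def hexdump(data: bytes) -> str:
    """Format bytes the way the post does, e.g. '48 65 6C 6C 6F'."""
    return " ".join(f"{b:02X}" for b in data)


if __name__ == "__main__":
    sample = bytes.fromhex("FF FE 48 00 65 00 6C 00 6C 00 6F 00")
    encoding, skip = sniff_bom(sample)
    print(encoding, sample[skip:].decode(encoding))
```

Note that an ordinary ANSI file whose text happens to start with the characters +/v8- is reported as UTF-7 by this sketch, which is exactly the ambiguity described above.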

The encodings that do not have special prefixes and which are still supported by Notepad are the traditional ANSI encoding (i.e., “plain ASCII”) and the Unicode (little-endian) encoding with no BOM. When faced with a file that lacks a special prefix, Notepad is forced to guess which of those two encodings the file actually uses. The function that does this work is IsTextUnicode, which studies a chunk of bytes and does some statistical analysis to come up with a guess.

And as the documentation notes, “Absolute certainty is not guaranteed.” Short strings are most likely to be misdetected.
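To give a feel for what a statistical guess between the two prefix-less encodings might look like, here is a toy Python sketch. It is not the algorithm IsTextUnicode uses; the function name `guess_ansi_or_utf16le` and the 0.5 threshold are invented for illustration. It relies on one observation: mostly-Latin text stored as little-endian Unicode has a zero byte in nearly every odd position.

```python
def guess_ansi_or_utf16le(data: bytes, threshold: float = 0.5) -> str:
    """Toy guess between 'ansi' and 'utf-16-le' for data with no BOM.

    Hypothetical heuristic, not the real IsTextUnicode logic: count how many
    of the bytes in odd positions (the high bytes of little-endian 16-bit
    units) are zero.
    """
    if len(data) < 2 or len(data) % 2 != 0:
        # UTF-16 data always has an even number of bytes.
        return "ansi"
    high_bytes = data[1::2]
    zero_fraction = high_bytes.count(0) / len(high_bytes)
    return "utf-16-le" if zero_fraction >= threshold else "ansi"


if __name__ == "__main__":
    print(guess_ansi_or_utf16le(bytes.fromhex("48 65 6C 6C 6F")))
    print(guess_ansi_or_utf16le(bytes.fromhex("48 00 65 00 6C 00 6C 00 6F 00")))
```

Even this toy shows why certainty is out of reach: little-endian Unicode text made up entirely of non-Latin characters has no zero high bytes, so the sketch calls it ANSI, and the shorter the input, the fewer bytes there are to base any statistic on.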

Comments

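  1. Earlier versions of Windows (probably Windows 2000) had an encoding-detection bug in Notepad. To reproduce it:
    1. Create a new text document in Notepad and type the two characters 联通 (the name of the carrier China Unicom).
    2. Save the file as ANSI-encoded text.
    3. Close Notepad, then reopen the file you just saved.
    Because the first few bytes of the encoded 联通 resemble a Unicode signature, Notepad tries to load the file as Unicode, and the user sees a small black block, which users jokingly call “the burnt Unicom battery.”
    The problem still exists in the new Notepad in Windows 10 1809, although it no longer shows up as a small black block; even Notepad++ has the same problem when opening the file.
    A more detailed explanation: https://www.cnblogs.com/candyboy/articles/1743033.html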
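The bytes involved in the comment above can be inspected directly. The Python sketch below assumes the ANSI code page is GBK (cp936), which the comment does not state. It also tests a commonly cited explanation that is an assumption here, not something the post confirms: that the four bytes fit the bit pattern of two 2-byte UTF-8 sequences. The helper name `looks_like_two_byte_utf8` is made up for illustration.

```python
# Assumption: "ANSI" on the commenter's system means GBK (code page 936).
data = "联通".encode("gbk")  # C1 AA CD A8


def looks_like_two_byte_utf8(data: bytes) -> bool:
    """True if data consists entirely of byte pairs shaped 110xxxxx 10xxxxxx."""
    if len(data) == 0 or len(data) % 2 != 0:
        return False
    return all(
        (lead & 0xE0) == 0xC0 and (trail & 0xC0) == 0x80
        for lead, trail in zip(data[0::2], data[1::2])
    )


if __name__ == "__main__":
    print(data.hex(" ").upper())
    print(looks_like_two_byte_utf8(data))
```

A loose check like this accepts the bytes, while a strict decoder such as Python's `data.decode("utf-8")` rejects them because C1 is never a valid UTF-8 lead byte. A detector that only looks at the bit pattern would therefore be fooled where a strict one would not.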
