TONT 40093 Some files come up strange in Notepad

The 联通 (China Unicom) battery.

Original post: https://blogs.msdn.microsoft.com/oldnewthing/20040324-00/?p=40093

David Cumps discovered that certain text files come up strange in Notepad.

David Cumps发现某些特定的文本文件在记事本中打开时好像有些怪怪的。

The reason is that Notepad has to edit files in a variety of encodings, and when its back is against the wall, sometimes it’s forced to guess.

原因是记事本需要应对不同编码的文件,而当被逼到没法的时候,(文件的编码)也就只能靠猜了。

Here’s the file “Hello” in various encodings:

以下是包含字符串『Hello』的文本文件,但编码不同:

48 65 6C 6C 6F

This is the traditional ANSI encoding.(这是传统的ANSI编码。)

48 00 65 00 6C 00 6C 00 6F 00

This is the Unicode (little-endian) encoding with no BOM.(这是不带BOM的小端序Unicode编码。)

FF FE 48 00 65 00 6C 00 6C 00 6F 00

This is the Unicode (little-endian) encoding with BOM. The BOM (FF FE) serves two purposes: First, it tags the file as a Unicode document, and second, the order in which the two bytes appear indicates that the file is little-endian.

这是带BOM的小端序Unicode编码。BOM(即开头的FF FE)用途有二:一来,标示该文件为Unicode编码;二来,这两个字节的顺序表明这个文件是小端序的。

00 48 00 65 00 6C 00 6C 00 6F

This is the Unicode (big-endian) encoding with no BOM. Notepad does not support this encoding.

这是不带BOM的大端序Unicode编码。记事本不支持这种编码。

FE FF 00 48 00 65 00 6C 00 6C 00 6F

This is the Unicode (big-endian) encoding with BOM. Notice that this BOM is in the opposite order from the little-endian BOM.

这是带BOM的大端序Unicode编码,注意BOM的字节顺序与小端序BOM相反。

EF BB BF 48 65 6C 6C 6F

This is UTF-8 encoding. The first three bytes are the UTF-8 encoding of the BOM.

这是UTF-8编码,开头的三个字节是UTF-8编码的BOM。

2B 2F 76 38 2D 48 65 6C 6C 6F

This is UTF-7 encoding. The first five bytes are the UTF-7 encoding of the BOM. Notepad doesn’t support this encoding.

这是UTF-7编码,开头的五个字节是UTF-7编码的BOM,记事本不支持这种编码。

Notice that the UTF-7 BOM encoding is just the ASCII string “+/v8-”, which is difficult to distinguish from a regular file that happens to begin with those five characters (as odd as they may be).

请注意,UTF-7的BOM头的编码正好是ASCII字符串『+/v8-』,如果文本文件正好以这五个字符开头,对猜测其编码会造成一定的困难(虽然以这五个字符开头本身就有点怪怪的了)。
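The prefixes listed above can be checked mechanically. What follows is a minimal Python sketch, not Notepad's actual code: the names `BOMS`, `sniff_bom`, and `decode_with_bom` are made up for illustration. It covers only the three prefixes that the article says Notepad supports; UTF-7 is left out because, as noted, its "BOM" is indistinguishable from ordinary ASCII text.

```python
import codecs

# Byte-order marks for the prefixed encodings discussed in the article.
# (UTF-32 and UTF-7 are deliberately not handled in this sketch.)
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no known BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes):
    """Decode data using its BOM; return None when there is no BOM to go on."""
    name, skip = sniff_bom(data)
    if name is None:
        return None  # the caller would have to guess
    return data[skip:].decode(name)


# The hex dumps from the article.
for hex_dump in (
    "FF FE 48 00 65 00 6C 00 6C 00 6F 00",
    "FE FF 00 48 00 65 00 6C 00 6C 00 6F",
    "EF BB BF 48 65 6C 6C 6F",
    "48 65 6C 6C 6F",
):
    raw = bytes.fromhex(hex_dump)
    print(sniff_bom(raw), decode_with_bom(raw))

# The UTF-7 prefix really is plain ASCII.
print(bytes.fromhex("2B 2F 76 38 2D"))
```

The last dump has no prefix, so the sketch returns `None` for it; that is the case where a program like Notepad has nothing to go on and must guess.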

The encodings that do not have special prefixes and which are still supported by Notepad are the traditional ANSI encoding (i.e., “plain ASCII”) and the Unicode (little-endian) encoding with no BOM. When faced with a file that lacks a special prefix, Notepad is forced to guess which of those two encodings the file actually uses. The function that does this work is IsTextUnicode, which studies a chunk of bytes and does some statistical analysis to come up with a guess.

不包含任何特殊的前缀、但仍被记事本支持的编码是传统的ANSI编码(亦即所谓的纯ASCII)和不带BOM的小端序Unicode编码。当面对没有特殊前缀的文本文件时,记事本将被迫猜测文件实际使用的编码。用以处理这项业务的函数叫IsTextUnicode,通过对一块字节进行研究、并进行某些统计性分析来对文件的编码进行猜测。

And as the documentation notes, “Absolute certainty is not guaranteed.” Short strings are most likely to be misdetected.

并且这个函数的文档亦有注明『无法保证对编码绝对准确的猜测』。短小的字符串被猜错的几率相对会比较大。
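To see why guessing can go wrong, here is a toy Python heuristic. It is emphatically not the algorithm used by IsTextUnicode, whose internals the article does not describe; `guess_encoding` and its 50% threshold are assumptions made up for this illustration. It guesses "BOM-less UTF-16LE" when most of the odd-positioned bytes are zero, which works for Latin text and fails for other scripts.

```python
def guess_encoding(data: bytes) -> str:
    """Toy guess between 'ansi' and 'utf-16-le' for a file with no BOM.

    Assumption for this sketch: UTF-16LE text is mostly Latin, so the high
    byte of each 16-bit unit (the odd-indexed bytes) is usually zero.
    """
    if not data or len(data) % 2 != 0:
        return "ansi"  # UTF-16 data always has an even number of bytes
    high_bytes = data[1::2]
    zero_ratio = high_bytes.count(0) / len(high_bytes)
    return "utf-16-le" if zero_ratio > 0.5 else "ansi"


# Latin text: the toy heuristic gets both of these right.
print(guess_encoding(b"Hello"))                      # ANSI bytes
print(guess_encoding("Hello".encode("utf-16-le")))   # BOM-less UTF-16LE

# Chinese text in BOM-less UTF-16LE has no zero high bytes,
# so the toy heuristic misdetects it as ANSI.
print(guess_encoding("你好".encode("utf-16-le")))

# The underlying ambiguity: four ANSI bytes are also two valid UTF-16LE
# code units, so a short file offers very little evidence either way.
as_utf16 = b"Hell".decode("utf-16-le")
print(len(as_utf16), [hex(ord(ch)) for ch in as_utf16])
```

A real detector weighs many more signals than this, but the same limitation applies: the fewer bytes there are to study, the less evidence there is, which is why short strings are the most likely to be misdetected.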


3 comments

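    1. Notepad in earlier versions of Windows (probably Win2K) had an encoding-detection bug. Steps to reproduce:
      1. Create a new text document in Notepad and type the two characters 『联通』;
      2. Save the file as an ANSI-encoded text file;
      3. Close Notepad, then reopen the file you just saved.
      Because the first few bytes of the encoded 『联通』 resemble a Unicode signature, Notepad tries to load the file as Unicode, and the user ends up looking at a small black block, jokingly known as the "burnt 联通 (China Unicom) battery".
      The problem is still present in the new Notepad in Windows 10 1809, although the text no longer shows up as a small black block. Opening the file in Notepad++ runs into the same problem.
      For a more detailed explanation, see: https://www.cnblogs.com/candyboy/articles/1743033.html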
