TONT 40093 Some files come up strange in Notepad

The “联通 battery”.

The reason is that Notepad has to edit files in a variety of encodings, and when its back is against the wall, it is sometimes forced to guess the file's encoding.

Here is a text file containing the string “Hello” in various encodings:

48 65 6C 6C 6F

This is the traditional ANSI encoding.

48 00 65 00 6C 00 6C 00 6F 00

This is the Unicode (little-endian) encoding with no BOM.

FF FE 48 00 65 00 6C 00 6C 00 6F 00

This is the Unicode (little-endian) encoding with BOM. The BOM (the leading FF FE) serves two purposes: first, it tags the file as a Unicode document, and second, the order in which the two bytes appear indicates that the file is little-endian.
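The byte-order role of the BOM can be illustrated with a minimal Python sketch. It relies on the fact that the BOM is the character U+FEFF stored as one 16-bit unit; the helper name `read_code_unit` is made up for this illustration and is not part of Notepad or any Windows API.

```python
# The BOM is the character U+FEFF stored as one 16-bit code unit.
# These are the two bytes as they appear at the start of a little-endian file.
BOM_LE = bytes([0xFF, 0xFE])


def read_code_unit(two_bytes: bytes, byteorder: str) -> int:
    """Interpret two bytes as one 16-bit code unit in the given byte order.

    Hypothetical helper for illustration only.
    """
    return int.from_bytes(two_bytes, byteorder)


as_little = read_code_unit(BOM_LE, "little")  # yields 0xFEFF, the BOM itself
as_big = read_code_unit(BOM_LE, "big")        # yields 0xFFFE, a noncharacter

print(f"little-endian reading: U+{as_little:04X}")
print(f"big-endian reading:    U+{as_big:04X}")
```

Only the little-endian reading produces U+FEFF. The other reading produces U+FFFE, which is never a valid character, so a reader can tell which byte order the writer used.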

00 48 00 65 00 6C 00 6C 00 6F

This is the Unicode (big-endian) encoding with no BOM. Notepad does not support this encoding.

FE FF 00 48 00 65 00 6C 00 6C 00 6F

This is the Unicode (big-endian) encoding with BOM. Notice that the bytes of this BOM are in the opposite order from the little-endian BOM.

EF BB BF 48 65 6C 6C 6F

This is UTF-8 encoding. The first three bytes are the UTF-8 encoding of the BOM.

2B 2F 76 38 2D 48 65 6C 6C 6F

This is UTF-7 encoding. The first five bytes are the UTF-7 encoding of the BOM. Notepad does not support this encoding.
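The byte sequences listed above can be reproduced with Python's standard codecs, as a rough cross-check. Two assumptions: "ANSI" is taken to mean code page 1252 (on a real system it is whatever the active code page is), and "Unicode" is taken to mean UTF-16. The `hexdump` helper and the labels are made up for this sketch.

```python
import codecs

TEXT = "Hello"


def hexdump(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return data.hex(" ").upper()


samples = {
    # Assumption: "ANSI" means code page 1252 here.
    "ANSI (cp1252 assumed)": TEXT.encode("cp1252"),
    "UTF-16LE, no BOM": TEXT.encode("utf-16-le"),
    "UTF-16LE, BOM": codecs.BOM_UTF16_LE + TEXT.encode("utf-16-le"),
    "UTF-16BE, no BOM": TEXT.encode("utf-16-be"),
    "UTF-16BE, BOM": codecs.BOM_UTF16_BE + TEXT.encode("utf-16-be"),
    # "utf-8-sig" prepends the UTF-8 encoding of U+FEFF.
    "UTF-8, BOM": TEXT.encode("utf-8-sig"),
    # Encoding U+FEFF explicitly produces the "+/v8-" prefix.
    "UTF-7, BOM": ("\ufeff" + TEXT).encode("utf-7"),
}

for label, data in samples.items():
    print(f"{label:24} {hexdump(data)}")
```

Each printed line should match the corresponding hex dump in the list above.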

Notice that the UTF-7 encoding of the BOM is just the ASCII string “+/v8-”, which is difficult to distinguish from a regular file that happens to begin with those five characters (as odd as they may be).
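A prefix check along these lines might look like the following sketch. It is not Notepad's code: the function name `sniff_prefix` and its return labels are invented for illustration, and it only covers the prefixes discussed in this post.

```python
import codecs

# (prefix, label) pairs for the unambiguous byte order marks.
KNOWN_PREFIXES = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

# The UTF-7 form of the BOM is plain ASCII text.
UTF7_PREFIX = b"+/v8-"


def sniff_prefix(data: bytes) -> str:
    """Classify a buffer by its leading bytes (hypothetical helper)."""
    for prefix, label in KNOWN_PREFIXES:
        if data.startswith(prefix):
            return label
    if data.startswith(UTF7_PREFIX):
        # These five bytes are also a legal start for an ordinary text file,
        # so the prefix alone cannot settle the question.
        return "utf-7 or plain text (ambiguous)"
    return "no prefix"


print(sniff_prefix(bytes.fromhex("FFFE480065006C006C006F00")))
print(sniff_prefix(b"+/v8-Hello"))
print(sniff_prefix(b"Hello"))
```

The first three prefixes contain bytes that rarely begin ordinary text, whereas the UTF-7 prefix does not, which is why the sketch reports it as ambiguous rather than committing to an answer.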

The encodings that have no special prefix but are still supported by Notepad are the traditional ANSI encoding (i.e., “plain ASCII”) and the Unicode (little-endian) encoding with no BOM. When faced with a file that lacks a special prefix, Notepad is forced to guess which of those two encodings the file actually uses. The function that does this work is IsTextUnicode, which studies a chunk of bytes and performs some statistical analysis to come up with a guess.
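To give a feel for what "statistical analysis" can mean here, below is a toy heuristic. It is emphatically not the algorithm inside IsTextUnicode, which this post does not describe. The function name `looks_like_utf16le` and both thresholds are arbitrary choices made for this sketch. The idea is that mostly-ASCII text stored as little-endian UTF-16 has a zero in the second byte of nearly every pair.

```python
def looks_like_utf16le(data: bytes) -> bool:
    """Toy guess: is this mostly-ASCII text stored as UTF-16LE?

    Hypothetical heuristic for illustration; thresholds are arbitrary.
    """
    # UTF-16 data comes in 2-byte units, so an odd length rules it out.
    if len(data) < 2 or len(data) % 2 != 0:
        return False

    low_bytes = data[0::2]   # first byte of each unit (low byte in LE)
    high_bytes = data[1::2]  # second byte of each unit (high byte in LE)

    zero_high = high_bytes.count(0) / len(high_bytes)
    zero_low = low_bytes.count(0) / len(low_bytes)

    # Mostly-zero high bytes and mostly-nonzero low bytes suggest UTF-16LE.
    return zero_high > 0.5 and zero_low < 0.1


utf16le_hello = bytes.fromhex("480065006C006C006F00")
utf16be_hello = bytes.fromhex("00480065006C006C006F")
ansi_hello = b"Hello!"

print(looks_like_utf16le(utf16le_hello))
print(looks_like_utf16le(utf16be_hello))
print(looks_like_utf16le(ansi_hello))
```

Any heuristic of this kind is a bet on what typical text looks like, so it can be wrong for text that is not typical.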

And as the documentation notes, “Absolute certainty is not guaranteed.” Short strings are the most likely to be misdetected.
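The string “联通” from the comment thread is a handy specimen of a very short file. The sketch below only shows facts about its bytes: when saved in GBK (assumed here to be the "ANSI" code page on a Simplified Chinese system), the four bytes happen to have the bit layout of two-byte UTF-8 sequences, even though they are not valid UTF-8. Whether that is what trips up Notepad's detection is an assumption; neither the post nor the comment spells out the mechanism. The helper name `fits_two_byte_utf8_shape` is made up.

```python
# Assumption: "ANSI" on a Simplified Chinese system means GBK.
data = "联通".encode("gbk")
print(data.hex(" ").upper())  # C1 AA CD A8


def fits_two_byte_utf8_shape(pair: bytes) -> bool:
    """Check the bit layout 110xxxxx 10xxxxxx (hypothetical helper)."""
    lead, trail = pair
    return (lead & 0xE0) == 0xC0 and (trail & 0xC0) == 0x80


pairs = [data[i:i + 2] for i in range(0, len(data), 2)]
shape_matches = all(fits_two_byte_utf8_shape(p) for p in pairs)

# A strict decoder still rejects the data: 0xC1 can only start an
# overlong sequence, which UTF-8 forbids.
try:
    data.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False

print("two-byte UTF-8 shape:", shape_matches)
print("valid strict UTF-8:  ", strict_utf8_ok)
```

With only four bytes to look at, a detector that checks shape rather than strict validity has nothing else in the file to contradict a wrong first impression.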

3 comments

石樱灯笼 said:

February 14, 2019, 17:09

What is the “联通 battery”?

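1. mmiaow said:
  
  February 14, 2019, 19:52
  
  It is an encoding-detection bug in Notepad in earlier versions of Windows (probably Win2K). Steps to reproduce:
  1. Create a new text document in Notepad and type the two characters “联通”;
  2. Save the file as an ANSI-encoded text file;
  3. Close Notepad, then reopen the file you just saved.
  Because the first few bytes of the encoded “联通” resemble the signature of a Unicode file, Notepad tries to load the file as Unicode, and the user sees a small black block, jokingly nicknamed the “burnt 联通 battery”.
  The problem still exists in the new Notepad in Windows 10 1809, although it no longer shows up as a small black block, and it even occurs when the file is opened in Notepad++.
  For a more detailed explanation, see: https://www.cnblogs.com/candyboy/articles/1743033.html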
  
石樱灯笼 said:

February 21, 2019, 15:08

I have always used UTF-8 and never trust encodings that are not in international use.
