Category (分类): Win故知新

TONT 36673 The importance of error code backward compatibility (保持错误代码向下兼容的重要性)

Original article (原文链接): https://devblogs.microsoft.com/oldnewthing/20050118-00/?p=36673

I remember a bug report that came in on an old MS-DOS program (from a company that is still in business so don’t ask me to identify them) that attempted to open the file “”. That’s the file with no name.

我记得有一个bug报告,是关于一个老旧的 MS-DOS 程序(开发这个程序的公司目前仍在存续中,所以不要问我具体是哪家公司)尝试打开文件“”,也就是一个没有名字的文件。

This returned error 2 (file not found). But the program didn’t check the error code and thought that 2 was the file handle. It then began writing data to handle 2, which ended up going to the screen because handle 2 is the standard error handle, which by default goes to the screen.

这样做会使系统报告错误代码2(文件未找到),但程序没有检查错误代码,以为2就是文件句柄,然后就会开始向句柄2填充数据,而数据会显示在屏幕上,而这是因为句柄2是标准错误输出句柄,其默认行为就是输出到屏幕上。
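
The failure mode described above can be sketched in a few lines. This is a minimal Python simulation, not real MS-DOS code: `dos_open`, `DosResult`, and `DosError` are hypothetical names that model a call which signals failure through a carry flag and returns either a file handle or an error code in AX.

```python
from typing import NamedTuple

ERROR_FILE_NOT_FOUND = 2


class DosResult(NamedTuple):
    carry: bool  # True means the call failed
    ax: int      # error code on failure, file handle on success


class DosError(Exception):
    """Hypothetical exception carrying the DOS error code."""

    def __init__(self, code: int):
        super().__init__(f"DOS error {code}")
        self.code = code


def dos_open(name: str, open_files: dict) -> DosResult:
    """Hypothetical stand-in for the INT 21h open call."""
    if name in open_files:
        return DosResult(carry=False, ax=open_files[name])
    return DosResult(carry=True, ax=ERROR_FILE_NOT_FOUND)


def buggy_open(name: str, open_files: dict) -> int:
    # Ignores the carry flag, so an error code is mistaken for a handle.
    return dos_open(name, open_files).ax


def careful_open(name: str, open_files: dict) -> int:
    result = dos_open(name, open_files)
    if result.carry:
        raise DosError(result.ax)
    return result.ax
```

With an empty file name, `buggy_open` hands back 2 and the caller goes on to write to "handle 2", while `careful_open` raises instead of returning a bogus handle.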

It so happened that this program wanted to print the message to the screen anyway.

碰巧这个程序要做的就是向屏幕输出消息。

In other words, this program worked completely by accident.

换句话说,这个程序只是撞了大运正常工作了。

Due to various changes to the installable file system in Windows 95, the error code for attempting to open the null file changed from 2 (file not found) to 3 (path not found) as a side-effect.
Watch what happens.

由于 Windows 95 中可安装文件系统(installable file system)的若干改动,产生了一个副作用:尝试打开空文件名的文件时,返回的错误代码从2(文件未找到)变成了3(路径未找到)。现在来看看会发生什么事。

The program tries to open the file “”. Now it gets error 3 back. It mistakenly treats the 3 as a file handle and writes to it.

程序尝试打开文件“”(空文件名)。现在它获得了错误代码3。程序照旧误打误撞将3作为文件句柄,并开始向其中写入数据。

What is handle 3?

那么句柄3是什么呢?

The standard MS-DOS file handles are as follows:

标准的 MS-DOS 文件句柄如下所示:

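Handle  Name    Meaning
0       stdin   standard input (标准输入设备)
1       stdout  standard output (标准输出设备)
2       stderr  standard error (标准错误输出)
3       stdaux  standard auxiliary, serial port (标准辅助设备,串口)
4       stdprn  standard printer (标准打印机)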

What happens when the program writes to handle 3?

当程序尝试向句柄3写入时会发生什么呢?

It tries to write to the serial port.

会尝试向串口写数据。

Most computers don’t have anything hooked up to the serial port. The write hangs.

大多数计算机的串口上什么也没连,所以写操作挂起了。

Result: Dead program.

结果就是:程序死掉了。

The file system folks had to tweak their parameter validation so they returned error 2 in this case.

文件系统开发组对参数校验做了些调整,使其在这种情况下返回错误代码2(来解决这个问题)。
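
To make the before-and-after concrete, here is a small sketch of why the changed error code mattered to the buggy program. The function names and the `compat_fix` flag are hypothetical; the handle table and the error codes 2 and 3 come from the text above.

```python
STANDARD_HANDLES = {
    0: "stdin",
    1: "stdout",
    2: "stderr",
    3: "stdaux",
    4: "stdprn",
}

ERROR_FILE_NOT_FOUND = 2
ERROR_PATH_NOT_FOUND = 3


def open_error_for_empty_name(compat_fix: bool) -> int:
    """Error code for opening the file ""; compat_fix models the validation tweak."""
    return ERROR_FILE_NOT_FOUND if compat_fix else ERROR_PATH_NOT_FOUND


def device_written_by_buggy_program(compat_fix: bool) -> str:
    # The buggy program uses the error code as if it were a file handle.
    return STANDARD_HANDLES[open_error_for_empty_name(compat_fix)]
```

Without the tweak the program ends up writing to `stdaux` (the serial port); with it, the output lands on `stderr` again, which is what the program accidentally relied on.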

TONT 36683 How did MS-DOS report error codes? (MS-DOS 是如何报告错误代码的?)

Original article (原文链接): https://devblogs.microsoft.com/oldnewthing/20050117-00/?p=36683

The old MS-DOS function calls (ah, int 21h), typically indicated error by returning with carry set and putting the error code in the AX register. These error codes will look awfully familiar today: They are the same error codes that Windows uses. All the small-valued error codes like ERROR_FILE_NOT_FOUND go back to MS-DOS (and possibly even further back).

旧式的 MS-DOS 功能调用(啊,INT 21h)通常通过在返回中设置carry标志、并将错误代码放在AX寄存器中来表明发生了错误。这些错误代码即使今天看起来也极其眼熟,因为 Windows 也使用了相同的错误代码。所有这些由小小的数字代表的错误代码——如 ERROR_FILE_NOT_FOUND ——都可以追溯到 MS-DOS(并且可能更早)。

Error code numbers are a major compatibility problem, because you cannot easily add new error code numbers without breaking existing programs. For example, it became well-known that “The only errors that can be returned from a failed call to OpenFile are 3 (path not found), 4 (too many open files), and 5 (access denied).” If MS-DOS ever returned an error code not on that list, programs would crash because they used the error number as an index into a function table without doing a range check first. Returning a new error like 32 (sharing violation) meant that the programs would jump to a random address and die.

错误代码是一项主要的兼容性问题,因为你无法简单地增加新的错误代码,而不影响已有的应用程序。例如,广为人知的是『调用 OpenFile 且失败时,可能的返回只会是3(找不到路径)、4(打开的文件数已超出上限)或者5(拒绝访问)』。如果 MS-DOS 返回了一个不在这个列表上的错误代码,(第三方)程序们就会崩溃,因为这些程序将错误代码用作了函数列表的索引,甚至连边界检查都没做。返回一个新的错误代码(例如32)会让这些程序跳到一个随机的地址,然后炸掉。
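
The table-dispatch pattern described above might have looked roughly like the following. This is an illustrative Python sketch with hypothetical handler names, not code from any actual program. Python raises `IndexError` where a DOS program would have jumped to a garbage address, and a negative index would silently wrap around, so the checked version tests both bounds.

```python
def on_path_not_found() -> str:
    return "path not found"


def on_too_many_open_files() -> str:
    return "too many open files"


def on_access_denied() -> str:
    return "access denied"


# Handlers for the "well-known" codes 3, 4 and 5, indexed by (code - 3).
FIRST_KNOWN_CODE = 3
HANDLERS = [on_path_not_found, on_too_many_open_files, on_access_denied]


def dispatch_unchecked(code: int) -> str:
    # Mirrors the fragile pattern: index the table without a range check.
    return HANDLERS[code - FIRST_KNOWN_CODE]()


def dispatch_checked(code: int) -> str:
    index = code - FIRST_KNOWN_CODE
    if 0 <= index < len(HANDLERS):
        return HANDLERS[index]()
    # Unknown codes fall through to a generic path instead of crashing.
    return f"unexpected error {code}"
```

Calling `dispatch_unchecked(32)` fails outright, while `dispatch_checked(32)` degrades to a generic message.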

More about error number compatibility next time.

下次有机会时,我们再来说有关错误代码兼容性的事。

When it became necessary to add new error codes, compatibility demanded that the error codes returned by the functions not change. Therefore, if a new type of error occurred (for example, a sharing violation), one of the previous “well-known” error codes was selected that had the most similar meaning and that was returned as the error code. (For “sharing violation”, the best match is probably “access denied”.) Programs which were “in the know” could call a new function called “get extended error” which returned one of the newfangled error codes (in this case, 32 for sharing violation).

等到增加新的错误代码变得有必要时,兼容性需求会要求函数返回的错误代码不能改变。因此,当某种新型的错误发生时(例如共享违例),系统会从之前那些『广为人知』的错误代码中选出含义最为接近的一个来返回。(对于『共享违例』来说,最佳的匹配项大概是『拒绝访问』。)那些『知道内情』的(新)程序可以通过调用名为『获取扩展错误代码』的方法来获取那些『新奇』的错误代码(在前面的例子中,程序会获得32,即共享违例)。
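
The two-tier scheme described above can be sketched as follows. The `FileSystem` class and its method names are hypothetical and only model the idea: legacy callers see an old code with the closest meaning, while callers "in the know" ask for the real one.

```python
ERROR_ACCESS_DENIED = 5
ERROR_SHARING_VIOLATION = 32

# New error codes mapped to the closest "well-known" legacy code.
LEGACY_EQUIVALENT = {ERROR_SHARING_VIOLATION: ERROR_ACCESS_DENIED}


class FileSystem:
    """Hypothetical model of a system that remembers the real error."""

    def __init__(self) -> None:
        self._last_extended_error = 0

    def fail(self, real_code: int) -> int:
        # Remember the precise code, but return one old programs understand.
        self._last_extended_error = real_code
        return LEGACY_EQUIVALENT.get(real_code, real_code)

    def get_extended_error(self) -> int:
        return self._last_extended_error
```

An old program that only handles codes 3, 4 and 5 sees 5 (access denied); a newer program calls `get_extended_error()` afterwards and sees 32.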

The “get extended error” function returned other pieces of information. It gave you an “error class” which gave you a vague idea of what type of problem it is (out of resources? physical media failure? system configuration error?), an “error locus” which told you what type of device caused the problem (floppy? serial? memory?), and what I found to be the most interesting meta-information, the “suggested action”. Suggested actions were things like “pause, then retry” (for temporary conditions), “ask user to re-enter input” (for example, file not found), or even “ask user for remedial action” (for example, check that the disk is properly inserted).

这个『获取扩展错误代码』方法还返回了其它的信息,它会给你返回一个『错误类』来通知你关于问题的大致类别(资源不足?物理介质故障?系统设置出错?),一个『错误位置』来告知你导致错误发生的具体设备类型(软驱?串口?内存?),以及我认为最有趣的元信息部分:『建议操作』。『建议操作』会是类似『暂停,然后重试』(对于暂时性的问题来说),『要求用户重新提供输入』(例如找不到文件这类错误),甚至『要求用户实行补救措施』(例如检查磁盘是否正确插入了)等等。

The purpose of these meta-error values is to allow a program to recover when faced with an error code it doesn’t understand. You could at least follow the meta-data to have an idea of what type of error it was (error class), where the error occurred (error locus), and what you probably should do in response to it (suggested action).

这些有关错误的元数据有助于程序在面对一个其不了解的错误代码时,从错误中恢复过来。至少你可以从元数据的描述中,知晓出错的类型(错误类)、出错的所在(错误位置)以及面对错误时可能应该进行的操作(建议操作)。
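
As a rough illustration of how a program could use that metadata, the sketch below picks a recovery strategy from the suggested action alone, without understanding the error code. The type names, field names, and enum members are illustrative assumptions and do not reproduce the real MS-DOS numeric values.

```python
from dataclasses import dataclass
from enum import Enum, auto


class SuggestedAction(Enum):
    # Illustrative names only; not the real MS-DOS numeric values.
    PAUSE_THEN_RETRY = auto()
    REENTER_INPUT = auto()
    USER_REMEDIAL_ACTION = auto()


@dataclass(frozen=True)
class ExtendedError:
    code: int                 # possibly a code this program has never seen
    error_class: str          # e.g. "out of resources"
    locus: str                # e.g. "floppy", "serial", "memory"
    action: SuggestedAction


def recovery_plan(err: ExtendedError) -> str:
    """Choose a response from the suggested action, ignoring the code itself."""
    if err.action is SuggestedAction.PAUSE_THEN_RETRY:
        return "wait briefly, then retry"
    if err.action is SuggestedAction.REENTER_INPUT:
        return "ask the user to re-enter the input"
    return f"ask the user to fix the {err.locus} problem"
```

Even for an unknown code such as 999, `recovery_plan` still yields a sensible response, which is the point of the metadata.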

Sadly, this type of rich error information was lost when 16-bit programming was abandoned. Now you get an error code or an exception and you’d better know what to do with it. For example, if you call some function and an error comes back, how do you know whether the error was a logic error in your program (using a handle after closing it, say) or was something that is externally-induced (for example, remote server timed out)? You don’t.

可惜的是,这种丰富的错误信息设计随着16位程序退出历史舞台被遗弃了。现在当你面对错误代码或异常信息时,你最好知道自己应该做什么。例如,如果你调用了某个方法,然后返回了一个错误,你如何知道这是你程序设计中的逻辑错误(例如在关闭某个句柄后又去使用它),还是某些外界因素导致的(例如远程服务器超时)?你没法知道。

This is particularly gruesome for exception-based programming. When you catch an exception, you can’t tell by looking at it whether it’s something that genuinely should crash the program (due to an internal logic error – a null reference exception, for example) or something that does not betray any error in your program but was caused externally (connection failed, file not found, sharing violation).

这种情形在面对以异常为错误机制的编程时尤为可惧。当你捕获了一个异常时,你没有办法通过观察异常信息,来判断它究竟是真的应该让程序崩溃的问题(源于内部的逻辑错误,例如空引用异常),还是并不代表你的程序本身有错、而是由外界因素导致的问题(例如连接失败、未找到文件、共享违例)。

TONT 36743 Why doesn't \\ autocomplete to all the computers on the network? (为什么\\不会触发自动完成、并列出网络上所有的计算机?)

Original article (原文链接): https://devblogs.microsoft.com/oldnewthing/20050111-00/?p=36743

Wes Haggard wishes that \\ would autocomplete to all the computers on the network. An early beta of Windows 95 actually did something similar to this, showing all the computers on the network when you opened the Network Neighborhood folder. And the feature was quickly killed.

Wes Haggard 希望(在『运行』对话框或地址栏中)输入 \\ 时,自动完成功能可以列出网络上的所有计算机。Windows 95 的一个早期 beta 版本实际上有一个与此类似的功能,当你打开『网上邻居』文件夹时便列出网络上的所有计算机,然而这个功能很快就被砍掉了。

Why?

为什么呢?

Corporations with large networks were having conniptions because needlessly enumerating all the machines on the network can bring a large network to its knees. Think about all the times you type “\\”. Now imagine if every single time you did that, Explorer started enumerating all the machines on the network. And imagine how your network administrator would feel if their network traffic saturated with enumerations each time you did that.

拥有大型网络的企业对此大动肝火,因为毫无必要地枚举出网络上所有的计算机有将一个大型网络搞到跪的能力。想像一下每次你输入 \\ 的时候。然后再想像一下每次你这样做的时候,资源管理器都会开始枚举网络上所有的计算机。再想像一下每次你这样做时,网络上的巨额流量会让网管的脸有多难看。

Network administrators made it clear in no uncertain terms that having Windows casually enumerate all the machines on their LAN was totally unacceptable.

网管们非常清楚且毫不含糊地表示,让 Windows 随随便便就在局域网上枚举所有计算机是完全不可接受的事情。

The needs of the corporate environment are very different from those of the home network, and Windows needs to operate in both worlds.

企业环境的需求与家庭网络大相径庭,而 Windows 需要在两种环境下都能正常操作。

TONT 37003 The hunt for a faster syscall trap (追寻更加迅速的syscall陷阱)

Original article (原文链接): https://devblogs.microsoft.com/oldnewthing/20041215-00/?p=37003

The performance of the syscall trap gets a lot of attention.

有关 syscall 陷阱的效率问题吸引了很多人的注意。

I was reminded of a meeting that took place between Intel and Microsoft over fifteen years ago. (Sadly, I was not myself at this meeting, so the story is second-hand.)

我想起了十五年前 Intel 和微软之间的一次会议。(很遗憾当时我没有亲自在场,所以接下来的故事是转述的。)

Since Microsoft is one of Intel’s biggest customers, their representatives often visit Microsoft to show off what their latest processor can do, lobby the kernel development team to support a new processor feature, and solicit feedback on what sort of features would be most useful to add.

鉴于微软是 Intel 最大的客户之一,Intel 的代表经常到访微软,展示他们最新款处理器的能力,游说内核开发团队支持某项新的处理器功能,并就添加哪类功能最有用征求反馈。

At this meeting, the Intel representatives asked, “So if you could ask for only one thing to be made faster, what would it be?”

在那次会议上,Intel 的代表问道,『如果只有一件事可以提速,你们希望是什么呢?』

Without hesitation, one of the lead kernel developers replied, “Speed up faulting on an invalid instruction.”

内核团队的一位领头开发者不假思索地回答道:『执行无效指令时的出错再快一点。』

The Intel half of the room burst out laughing. “Oh, you Microsoft engineers are so funny!” And so the meeting ended with a cute little joke.

会议室里 Intel 一侧的人们大笑起来,『哎呀,你们微软的工程师可真有意思!』会议在这个小而有趣的玩笑中收场了。

After returning to their labs, the Intel engineers ran profiles against the Windows kernel and lo and behold, they discovered that Windows spent a lot of its time dispatching invalid instruction exceptions. How absurd! Was the Microsoft engineer not kidding around after all?

等回到实验室之后,Intel 的工程师们对 Windows 内核做了性能剖析,结果惊讶地发现 Windows 花了大量的时间来分发无效指令异常。这也太荒谬了吧!难道微软的那位工程师并不是在开玩笑?

No, he wasn’t.

还真不是。

It so happens that on the 80386 chip of that era, the fastest way to get from V86-mode into kernel mode was to execute an invalid instruction! Consequently, Windows/386 used an invalid instruction as its syscall trap.

原来在那个时代的 80386 处理器上,从虚拟8086模式切换到内核模式最快的方法,正是执行一个无效的指令!因此,Windows/386 将无效指令作为了 syscall 的陷阱。

What’s the moral of this story? I’m not sure. Perhaps it’s that when you create something, you may find people using it in ways you had never considered.

至于这个故事教给我们的道理是什么,我并不太确定。大概是当你创造了一项事物时,你会发现人们会用你从未考虑过的方式去使用它。

TONT 37153 Why did Windows 95 run the timer at 55ms? (为什么 Windows 95 的定时器的运行频率是 55ms?)

Original article (原文链接): https://devblogs.microsoft.com/oldnewthing/20041202-00/?p=37153

The story behind the 55ms timer tick rate goes all the way back to the original IBM PC BIOS. The original IBM PC used a 1.19MHz crystal, and 65536 cycles at 1.19MHz equals approximately 55ms. (More accurately, it was more like 1.19318MHz and 54.92ms.)

定时器的运行频率是 55ms 追根究底要回到原始的 IBM PC BIOS 上。最初的 IBM PC 使用了一颗 1.19MHz 的晶振,而 1.19MHz 上 65536 个时钟周期所需的时间大约就是 55ms。(更准确的说,应该是 1.19318 MHz 和 54.92ms。)

But that just pushes the question to another level. Why 1.19…MHz, then?

不过这样一解释只是将问题又推高了一个级别,为什么是 1.19 MHz 呢?

With that clock rate, 2^16 ticks equals approximately 3600 seconds, which is one hour. (If you do the math it’s more like 3599.59 seconds.) [Update: 4pm, change 2^32 to 2^16; what was I thinking?]

在这样的时钟频率下,2^16 个嘀嗒(tick)大约就是 3600 秒,也就是一小时。(精确一些的话,也可以说是3599.59 秒。)
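
The arithmetic behind these figures can be checked with a short calculation. The constant 1,193,182 Hz is an assumption: it is the commonly quoted more precise value of the timer clock that the text rounds to 1.19318 MHz.

```python
TIMER_HZ = 1_193_182      # assumed precise timer input clock, in Hz
COUNTS_PER_TICK = 2 ** 16  # a 16-bit counter wraps after 65536 clock cycles

# One timer tick is one full wrap of the 16-bit counter.
tick_seconds = COUNTS_PER_TICK / TIMER_HZ
tick_ms = tick_seconds * 1000
ticks_per_second = 1 / tick_seconds

# A 16-bit tick count overflows after 2^16 ticks.
rollover_seconds = (2 ** 16) * tick_seconds

print(f"tick period:      {tick_ms:.3f} ms")
print(f"ticks per second: {ticks_per_second:.2f}")
print(f"16-bit rollover:  {rollover_seconds:.2f} s")
```

The tick period comes out at roughly 54.9 ms, and the 16-bit tick count rolls over slightly before a full hour has elapsed.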

What’s so special about one hour?

为什么『一个小时』这个周期那么特别呢?

The BIOS checked once an hour to see whether the clock has crossed midnight. When it did, it needed to increment the date. Making the hourly check happen precisely when a 16-bit tick count overflowed saved a few valuable bytes in the BIOS.

BIOS 每小时会检查一次系统时钟来确定是否跨越了午夜,当这种情况发生时,系统就会将日期向前推进一天。让这种检查机制发生在16位嘀嗒存储器溢出的时刻,可以在 BIOS 中节约宝贵的几个字节。

Another reason for the 1.19MHz clock speed was that it was exactly one quarter of the original CPU speed, namely 4.77MHz, which was in turn 4/3 times the NTSC color burst frequency of 3.5MHz. Recall that back in these days, personal computers sent their video output to a television set. Monitors were for the rich kids. Using a timer related to the video output signal saved a few dollars on the motherboard.

另一个采用 1.19MHz 时钟频率的原因是因为这个值正好是原始设计中 CPU 运行速度—— 4.77MHz ——的四分之一,而这正好又是 NTSC 制式的彩色信号频率的三分之四倍(译注:没有打错,4.77除以3.5约等于4除以3)。当年,个人电脑是将其视频信号输出到电视上的,那时候显示器是有钱人的玩具,而将定时器频率与视频信号关联起来则又在主板上省出了几美元的成本。
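
The frequency chain in this paragraph can be worked through numerically. The color burst value of 3.579545 MHz is an assumption: it is the commonly quoted precise figure that the text rounds to 3.5 MHz.

```python
NTSC_COLOR_BURST_MHZ = 3.579545  # assumed precise value; the text says 3.5MHz

# The CPU clock is 4/3 of the color burst frequency.
cpu_mhz = NTSC_COLOR_BURST_MHZ * 4 / 3

# The timer clock is one quarter of the CPU clock.
timer_mhz = cpu_mhz / 4

print(f"CPU clock:   {cpu_mhz:.4f} MHz")
print(f"timer clock: {timer_mhz:.4f} MHz")
```

Rounded to two decimal places these give the 4.77 MHz and 1.19 MHz figures quoted in the text.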

Calvin Hsia has another view of the story behind the 4.77MHz clock.

Calvin Hsia 提供了有关 4.77 MHz 时钟频率的另一个角度的故事。(译注:链接已失效)

(Penny-pinching was very common at this time. The Apple ][ had its own share of penny-saving hijinks.)

(那时候一分钱掰成两半花是很常见的事,Apple ][ 有其自己的省钱小妙招。)(译注:链接已失效)