TONT 40223 On a server, paging = death

Be at my doorstep within one nanosecond, or else I'll...

Original article: https://blogs.msdn.microsoft.com/oldnewthing/20040317-00/?p=40223

Chris Brumme’s latest treatise contained the sentence “Servers must not page”. That’s because on a server, paging = death.


I had occasion to meet somebody from another division who told me this little story: They had a server that went into thrashing death every 10 hours, like clockwork, and had to be rebooted. To mask the problem, the server was converted to a cluster, so what really happened was that the machines in the cluster took turns being rebooted. The clients never noticed anything, but the server administrators were really frustrated. (“Hey Clancy, looks like number 2 needs to be rebooted. She’s sucking mud.”)

(Translator's note: "sucking mud" in the original describes a server that is failing or has crashed. The phrase comes from oil drilling, where a well that misses the oil pumps up mud from underground instead, which makes for a vivid image.)

The reason for the server’s death? Paging.


There was a four-bytes-per-request memory leak in one of the programs running on the server. Eventually, all the leakage filled available RAM and the server was forced to page. Paging means slower response, but of course the requests for service kept coming in at the normal rate. So the longer you take to turn a request around, the more requests pile up, and then it takes even longer to turn around the new requests, so even more pile up, and so on. The problem snowballed until the machine just plain keeled over.
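
The pile-up described above can be sketched with a toy simulation. This is only a rough model under assumptions that are not in the story: time advances in fixed ticks, the function name `simulate` and all of its parameters are hypothetical, and paging is modeled as a fixed drop in throughput once the leaked bytes reach the available RAM. The only number taken from the story is the four bytes leaked per request. The real feedback loop would be worse, since the model does not account for queued requests consuming memory themselves.

```python
LEAK_BYTES_PER_REQUEST = 4  # from the story: four bytes leaked per request


def simulate(ram_bytes, arrivals_per_tick, capacity_per_tick,
             paging_slowdown, ticks):
    """Toy model of a leaking server (all parameters are illustrative).

    Returns (tick at which paging starts, or None; backlog after each tick).
    """
    leaked = 0
    backlog = 0
    paging_started_at = None
    history = []
    for tick in range(ticks):
        # Requests keep arriving at the normal rate, paging or not.
        backlog += arrivals_per_tick

        # Assumption: the server starts paging once the leak fills RAM.
        paging = leaked >= ram_bytes
        if paging and paging_started_at is None:
            paging_started_at = tick

        # Assumption: paging cuts throughput by a fixed factor.
        capacity = capacity_per_tick
        if paging:
            capacity = capacity_per_tick // paging_slowdown

        served = min(backlog, capacity)
        backlog -= served
        leaked += served * LEAK_BYTES_PER_REQUEST
        history.append(backlog)
    return paging_started_at, history


if __name__ == "__main__":
    start, history = simulate(ram_bytes=4000, arrivals_per_tick=100,
                              capacity_per_tick=150, paging_slowdown=10,
                              ticks=20)
    print("paging starts at tick", start)
    print("backlog per tick:", history)
```

With these made-up numbers the server keeps up comfortably until the leak fills the available memory, and from that tick on the backlog grows every tick and never recovers. Plugging the leak, rather than adding machines, is what removes the cliff.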


After much searching, the leak was identified and plugged. Now the servers chug along without a hitch.


(And since the reason for the cluster was to cover for the constant crashes, I suspect they reduced the size of the cluster and saved a lot of money.)

