TONT 40223: On a server, paging = death



Chris Brumme’s latest treatise contained the sentence “Servers must not page”. That’s because on a server, paging = death.


I had occasion to meet somebody from another division who told me this little story: They had a server that went into thrashing death every 10 hours, like clockwork, and had to be rebooted. To mask the problem, the server was converted to a cluster, so what really happened was that the machines in the cluster took turns being rebooted. The clients never noticed anything, but the server administrators were really frustrated. (“Hey Clancy, looks like number 2 needs to be rebooted. She’s sucking mud.”) [Link repaired, 8am.]

(Translator's note: "sucking mud" describes a server that is malfunctioning or has crashed. The phrase comes from oil drilling, where a well that misses the oil pumps up underground mud instead, which makes for a rather vivid image.)

The reason for the server’s death? Paging.


There was a four-bytes-per-request memory leak in one of the programs running on the server. Eventually, all the leakage filled available RAM and the server was forced to page. Paging means slower response, but of course the requests for service kept coming in at the normal rate. So the longer you take to turn a request around, the more requests pile up, and then it takes even longer to turn around the new requests, so even more pile up, and so on. The problem snowballed until the machine just plain keeled over.
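
To make the snowball concrete, here is a toy simulation of that feedback loop. It is only a sketch under invented assumptions, not a model of the actual server: the arrival rate, the service capacity, the amount of spare RAM, the memory held by each waiting request, and the paging penalty are all hypothetical numbers, picked so that the leak uses up the spare RAM after roughly ten hours. The only figure taken from the story is the four bytes leaked per request.

```python
def simulate(seconds,
             arrival_rate=1000,           # requests arriving per second (assumed)
             base_capacity=1200,          # requests served per second with no paging (assumed)
             leak_per_request=4,          # bytes leaked per served request (from the story)
             headroom_bytes=160_000_000,  # spare RAM before paging starts (assumed)
             queued_request_bytes=16_000, # memory held by each waiting request (assumed)
             paging_penalty=50.0):        # how sharply paging slows service (assumed)
    """Return the request backlog at the end of each simulated second."""
    leaked = 0
    backlog = 0.0
    history = []
    for _ in range(seconds):
        # Requests keep arriving at the normal rate no matter how slow we are.
        backlog += arrival_rate

        # Memory in use: everything leaked so far plus the requests in flight.
        used = leaked + backlog * queued_request_bytes
        excess = max(0.0, used - headroom_bytes)

        # The further memory use exceeds RAM, the slower each request gets.
        slowdown = 1.0 + paging_penalty * excess / headroom_bytes
        capacity = base_capacity / slowdown

        served = min(backlog, capacity)
        backlog -= served
        leaked += served * leak_per_request
        history.append(backlog)
    return history


if __name__ == "__main__":
    history = simulate(37_000)
    # Sample the backlog at a few points around the ten-hour mark.
    for second in (1_000, 35_000, 36_100, 36_200, 36_500, 37_000):
        print(f"t={second:>6}s  backlog={history[second - 1]:>12.0f}")
```

In this toy model the backlog sits at zero for hours while the leak quietly eats the spare RAM. Once memory use crosses the limit, capacity dips below the arrival rate, the waiting requests themselves consume more memory, and the backlog climbs without bound within a few seconds. Different assumed numbers move the tipping point but not the shape of the curve.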


After much searching, the leak was identified and plugged. Now the servers chug along without a hitch.


(And since the reason for the cluster was to cover for the constant crashes, I suspect they reduced the size of the cluster and saved a lot of money.)




