TONT 39923 Why a really large dictionary is not a good thing

Congratulations! The text ewe entered has past the spell cheque; know errors where found!

Original article: https://blogs.msdn.microsoft.com/oldnewthing/20040402-00/?p=39923

Sometimes you’ll see somebody brag about how many words are in their spell-checking dictionary. It turns out that having too many words in a spell checker’s dictionary is worse than having too few.

Suppose you had a spell checker whose dictionary contained every word in the Oxford English Dictionary. Then you hand it this sentence:

Therf werre eyght bokes.

(Translator's note: this sentence has no real grammatical meaning, and it is not an Old English spelling of "There were eight books" either.)

That sentence would pass with flying colors, because all of the words in the above sentence are valid English words, though most people would be hard-pressed to provide definitions.

The English language has so many words that if you included them all, then common typographical errors would often match (by coincidence) a valid English word and therefore not be detected by the spell checker. Which would go against the whole point of a spell checker: To catch spelling errors.
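The effect can be sketched in a few lines, assuming the spell checker is nothing more than a set-membership test. The word lists and the `misspelled` helper are illustrative assumptions, not taken from any real product; the four unusual words come from the example sentence above.

```python
import string

# Toy word lists for illustration only; a real dictionary is far larger.
COMMON_WORDS = {"there", "were", "eight", "books"}
# The "exhaustive" dictionary also accepts the obscure words from the example.
EXHAUSTIVE_WORDS = COMMON_WORDS | {"therf", "werre", "eyght", "bokes"}


def misspelled(text, dictionary):
    """Return the words in `text` that are not found in `dictionary`."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w and w not in dictionary]


sentence = "Therf werre eyght bokes."
print(misspelled(sentence, COMMON_WORDS))      # all four words are flagged
print(misspelled(sentence, EXHAUSTIVE_WORDS))  # nothing is flagged
```

With the smaller list, every typo is caught. With the larger list, the same typos pass silently because each one happens to be a listed word.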

So be glad that your spell checker doesn’t have the largest dictionary possible. If it did, it would end up doing a worse job.

After I wrote this article, I found a nice discussion of the subject of spell check dictionary size on the Wintertree Software web site.

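(Translator's note: what follows is a backup copy of the article linked at the end of the post, reproduced untranslated for interested readers.)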

How many words should be in the spell checker’s dictionary?

Wintertree Software’s American and British English dictionaries each contain about 100,000 words. We’re frequently asked if this is enough. Sometimes customers or potential customers call and mention that a competitor’s product comes with a dictionary containing 130,000 words, or 150,000 words, or some other large number, and they want to know why Sentry’s dictionary doesn’t contain that many.

The short answer to this question may surprise you: The other dictionaries probably contain too many words. Read on to find out why.

You probably know that spell checkers work by checking words against a dictionary containing words known to be correct. If a word isn’t found in the dictionary, the word is reported as a misspelling. If a word is found, it is skipped over without being reported. Two key measures of a spell checker’s accuracy are its detection rate, which is the number of misspelled words reported vs. the number of words actually misspelled, and the false-positive rate, which is the number of valid words incorrectly reported as misspelled vs. the number of words checked. A high detection rate and a low false-positive rate are desirable.
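The two measures defined above can be computed with a short sketch. It assumes the intended word is known for every typed word, which is only true in a test set; the function name `accuracy_measures` and the sample data are hypothetical.

```python
def accuracy_measures(checked_words, dictionary):
    """Return (detection_rate, false_positive_rate).

    `checked_words` is a list of (typed, intended) pairs. A word counts as
    actually misspelled when typed != intended.
    """
    misspelled_total = 0
    misspelled_reported = 0
    false_positives = 0
    for typed, intended in checked_words:
        reported = typed not in dictionary  # the checker flags unknown words
        if typed != intended:
            misspelled_total += 1
            misspelled_reported += reported
        elif reported:
            false_positives += 1  # a valid word was flagged
    # Assumption: with no misspellings at all, treat the detection rate as 1.0.
    detection = misspelled_reported / misspelled_total if misspelled_total else 1.0
    false_positive = false_positives / len(checked_words) if checked_words else 0.0
    return detection, false_positive


sample = [("the", "the"), ("star", "stare"), ("teh", "the"),
          ("taxidermy", "taxidermy")]
print(accuracy_measures(sample, {"the", "star", "stare"}))  # (0.5, 0.25)
```

In this sample, "teh" is caught but "star" (typed for "stare") slips through, so the detection rate is 1 of 2. The valid word "taxidermy" is flagged, giving 1 false positive out of 4 words checked.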

The number of words in the dictionary has a strong bearing on both of these measures. If the dictionary contains too many words, the probability will increase that a misspelled word will match one of the words in the dictionary, and therefore will not be reported. This will decrease the spell checker’s detection rate. If the dictionary contains too few words, more valid words will be reported because they aren’t in the dictionary. This will increase the spell checker’s false-positive rate.
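One rough way to see the first half of this claim is to generate typos and count how many collide with dictionary entries. The sketch below assumes a typo is a single deletion, insertion, substitution, or adjacent swap. The helper names and the tiny word lists are hypothetical, and the numbers it prints say nothing about real dictionaries.

```python
import string


def single_edit_typos(word):
    """Strings one deletion, insertion, substitution, or adjacent swap from `word`."""
    typos = set()
    for i in range(len(word) + 1):
        for c in string.ascii_lowercase:
            typos.add(word[:i] + c + word[i:])                # insertion
            if i < len(word):
                typos.add(word[:i] + c + word[i + 1:])        # substitution
        if i < len(word):
            typos.add(word[:i] + word[i + 1:])                # deletion
        if i + 1 < len(word):
            typos.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])  # swap
    typos.discard(word)  # an edit that reproduces the word is not a typo
    return typos


def undetected_fraction(intended_words, dictionary):
    """Fraction of generated typos that happen to be dictionary words."""
    typos = [t for w in intended_words for t in single_edit_typos(w)]
    return sum(t in dictionary for t in typos) / len(typos)


vocabulary = ["there", "were", "eight", "books"]
small = set(vocabulary)
large = small | {"therf", "werre", "eyght", "bokes"}
print(undetected_fraction(vocabulary, small))  # 0.0: every typo is reported
print(undetected_fraction(vocabulary, large))  # above zero: some typos now pass
```

Adding words can only raise this fraction, never lower it, which is the sense in which a larger dictionary reduces the detection rate.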

The ideal dictionary for you would contain every word in your vocabulary, but no other words. This dictionary would yield an excellent detection rate and a false-positive rate of 0%. The detection rate would not be 100% because you could still misspell a word and match a different valid word — you might accidentally leave the e off stare and match star, for example. The false-positive rate would be 0% because every word reported by the spell checker must necessarily be misspelled, a condition which would remain true until you learned a new word.

Unfortunately, a dictionary that is ideal for you would likely be less than ideal for someone else, since different people have different vocabularies. Moreover, creating a dictionary containing the words in only one person’s vocabulary would be prohibitively expensive. A cost-effective dictionary contains the words most commonly used by the population of its users.

To maintain a high detection rate, the dictionary should contain only words common to a large portion of the population. If the dictionary contains technical terms used only by the small portion of the population who are taxidermists, for example, there is an increased chance that a misspelling made by an average user will match one of these specialized terms and therefore not be reported.

To maintain a low false-positive rate, the dictionary should contain most of the words used by the population. If the dictionary does not contain a word commonly used by the population, people will experience frustration when the spell checker reports the word as a misspelling.

Incidentally, a dictionary in a spell checker isn’t like a paper dictionary such as Webster’s or the OED. Paper dictionaries have an obligation to include as many words, no matter how obscure, as possible. One could even argue that paper dictionaries should focus on obscure words and not waste space on common words such as the, or, and of, since most users of a language will have an intuitive understanding of the meanings of these words.

Of the two measures, the detection rate is more important. A spell checker that flags valid words as misspellings may be annoying, but a spell checker that allows a misspelled word to pass through without report has failed to do its job. For this reason, the dictionary should contain as many common words as are needed to maintain a reasonable false-positive rate, but no more. Putting it another way, the dictionary should contain the minimum number of words needed to avoid incorrectly reporting common valid words.
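The "minimum number of words" idea can be sketched as a frequency cutoff: take words in order of frequency until they cover a chosen share of a sample text. This is an assumption about how such a goal might be approached, not a description of any vendor's process; `smallest_dictionary` and `target_coverage` are hypothetical names.

```python
from collections import Counter


def smallest_dictionary(tokens, target_coverage):
    """Pick the fewest most-frequent words covering `target_coverage` of `tokens`."""
    counts = Counter(tokens)
    total = sum(counts.values())
    chosen, covered = set(), 0
    for word, count in counts.most_common():
        if covered / total >= target_coverage:
            break  # enough running text is covered; stop adding words
        chosen.add(word)
        covered += count
    return chosen


corpus = "the cat sat on the mat and the dog sat on the log".split()
print(sorted(smallest_dictionary(corpus, 0.6)))  # ['on', 'sat', 'the']
```

On this sample, three words cover 8 of the 13 tokens. The share left uncovered is the false-positive rate the checker would show on the same text, so the coverage target is effectively a false-positive budget.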

This is the goal Wintertree Software has established for our dictionaries. We build our dictionaries by statistically analyzing vast amounts of text from many sources to ensure that the most common words — and only the most common words — are included, with words ranging from the, a, and of to less common but still far from obscure words like plenipotentiary and disenfranchisement. Even a person with a large vocabulary is unlikely to use a word not in our American or UK English dictionary, unless that word is a highly specialized technical term, such as the name of a disease or a rare insect. Specialized terms are best handled by supplemental dictionaries, and we carry medical and legal dictionaries for just this purpose. We could easily dump words willy-nilly into our dictionaries, beating our competitors’ counts by hundreds of thousands of words, but that would serve only to lower our detection rate. The count of words in a spell checker’s dictionary is like body weight: Once an optimum level has been achieved, adding or taking away will just make things worse.
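The supplemental-dictionary approach can be sketched as an optional extra word set consulted alongside the base dictionary. The `check` function and both word lists are hypothetical and do not reflect the API of any actual product.

```python
def check(words, base, supplements=()):
    """Flag words missing from the base dictionary and every enabled supplement."""
    allowed = set(base).union(*supplements)
    return [w for w in words if w not in allowed]


BASE = {"the", "patient", "has"}
MEDICAL = {"tachycardia"}  # hypothetical supplemental word list

text = ["the", "patient", "has", "tachycardia"]
print(check(text, BASE))             # ['tachycardia']
print(check(text, BASE, [MEDICAL]))  # []
```

Keeping specialized terms in a separate set means only users who enable it accept the cost of those extra words possibly matching a typo.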

So the next time you come across a company offering a spell checker with a dictionary containing 150,000 or more words, ask them one question: Why?
