    27 August

    A Brief Survey of Bakeoff-3 (Part 1)

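    The third international Chinese language processing Bakeoff (http://www.sighan.org/bakeoff2006/), organized by SIGHAN, ran from April 17 to May 19, 2006. The first two editions were held in 2003 and 2005 under the name International Chinese Word Segmentation Bakeoff. This year named entity recognition was added, so the competition was renamed.
    As for why this international evaluation is called a "bakeoff", there seems to be no official explanation. "Bakeoff" is the compound form of the phrase "bake off", which has to do with baking, so perhaps it means that the participants' efforts are being roasted. In any case, my MSN nickname during the competition was "bakeoff starts to bake off".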
    First, the results. The detailed results of bakeoff-3 are listed at:
    http://sighan.cs.uchicago.edu/bakeoff2006/longstats.html
    This address has not been publicized, but it can in fact be found through Google.
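    The Chinese word segmentation track of this competition was a resounding victory for Microsoft Research Asia (MSRA), which I represented. Because MSRA was itself a corpus provider, we did not enter the segmentation tasks on our own corpus. We entered the six tasks on the remaining three corpora, each corpus having a closed and an open test. In the official results, our system took four first places and two third places out of the six. Afterwards, however, I found that in the UPUC2006 open test, one of the two tasks where we placed third, my submitted file had an encoding error; our actual F score on that task is 0.9 higher than the first place in the official results. Our other third place was the CityU2006 closed test, where first and second place were two runs of the same Academia Sinica (Taiwan) system with different parameters and identical scores. I also ran the same system on our own MSRA2006 corpus: on the closed test we would be second, and on the open test we would still be first, 0.3 above the best official score.
    Broadly speaking, barring surprises, we took nearly all of the first places in the word segmentation track of this competition.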
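The F scores quoted above are the balanced F-measure of word-level precision and recall. As a rough illustration only, the sketch below scores a single sentence by comparing word spans. The function names `word_spans` and `segmentation_prf` and the toy sentence are hypothetical, and the official scoring script is not reproduced here.

```python
def word_spans(words):
    """Return the set of (start, end) character offsets covered by each word."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold_words, pred_words):
    """Word-level precision, recall and balanced F for one sentence.

    A predicted word counts as correct only if both of its boundaries
    match a word in the gold segmentation.
    """
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy example (invented for illustration, not bakeoff data).
gold = ["我们", "参加", "了", "比赛"]
pred = ["我们", "参", "加", "了", "比赛"]
print(segmentation_prf(gold, pred))
```

In this toy case three of the five predicted words match the four gold words, so precision is 3/5 and recall is 3/4.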
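    Before the SIGHAN-2006 workshop in July this year, I had already read, through various channels, most of this year's system description reports. Last month, once the workshop was over and I had the proceedings CD, I read all of the papers. So I think I can now give a fairly comprehensive survey.
    I counted every system that won at least one first place; there are five of them, including ours. Of the eight first places in total, we took four, and the remaining four each went to a different group. Of the 24 top-three placings, 19 were also taken by these five groups. So even when measured by top-three placings, the results are highly concentrated in the same five participants. For convenience, I refer to these five participants as the top-5 below.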
    The references for these papers are:
    \bibitem{Zhao:2006}
    Hai Zhao, Chang-Ning Huang and Mu Li. 2006. An Improved Chinese Word Segmentation System with Conditional Random Field. {\em Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing}, 108-117. Sydney, Australia.
    \bibitem{Wang:2006}
    Xinhao Wang, Xiaojun Lin, Dianhai Yu, Hao Tian and Xihong Wu. 2006. Chinese Word Segmentation with Maximum Entropy and N-gram Language Model. {\em Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing}, 108-117. Sydney, Australia.
    \bibitem{Jacobs:2006}
    Aaron J. Jacobs and Yuk Wah Wong. 2006. Maximum Entropy Word Segmentation of Chinese Text. {\em Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing}, 108-117. Sydney, Australia.
    \bibitem{Tsai:2006}
    Richard Tzong-Han Tsai, Hsieh-Chuan Hung, Cheng-Lung Sung, Hong-Jie Dai and Wen-Lian Hsu. 2006. On Closed Task of Chinese Word Segmentation: An Improved CRF Model Coupled with Character Clustering and Automatically Generated Template Matching. {\em Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing}, 108-117. Sydney, Australia.
    \bibitem{Liu:2006}
    Wu Liu, Heng Li, Yuan Dong, Nan He, Haitao Luo and Haila Wang. 2006. France Telecom R\&D Beijing Word Segmenter for Sighan Bakeoff 2006. {\em Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing}, 108-117. Sydney, Australia.

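    Top-5 system information
    MSRA / Natural Language Computing Group
    First places: UPUC-C/CityU-O/AS-O/AS-C
    Learning model: CRF
    Features: improved from Low and Ng
    Peking University / National Laboratory on Machine Perception
    First places: MSRA-C
    Learning model: ME
    Features: copied from Low and Ng
    Academia Sinica (Taiwan) / Intelligent Agent Systems Laboratory
    First places: CityU-C
    Learning model: ME
    Features: Low and Ng, reproduced using a clustering algorithm
    France Telecom R&D Beijing
    First places: MSRA-O
    Learning model: Gao method (language model)/ME
    Features: similar to Low and Ng
    University of Texas at Austin / Department of Linguistics
    First places: UPUC-O
    Learning model: ME
    Features: copied from Low and Ng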

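    The "Low and Ng" mentioned above is the following reference:
    \bibitem{Low:2005}
    Jin Kiat Low, Hwee Tou Ng and Wenyuan Guo. 2005. A Maximum Entropy Approach to Chinese Word Segmentation. {\em Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing}, 161-164. Jeju Island, Korea.
    This was the system that won the most first places at the 2005 bakeoff: across all four open tests it entered, it took three first places and one second place.
    I am therefore willing to say that the results of bakeoff-3 are a victory not for this year's participants, but for the bakeoff-2 participants Low and Ng.
    Let us briefly look back at bakeoff-2. Low and Ng nearly monopolized the open tests, while Tseng et al. nearly monopolized the closed tests. What matters, though, is that both participants used machine learning methods based on character tagging. The originator of this approach is Xue, from bakeoff-1: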
    \bibitem{Xue:2003a}
    Nianwen Xue. 2003. Chinese Word Segmentation as Character Tagging. {\em Computational Linguistics and Chinese Language Processing}, Vol. 8(1): 29-48.
    \bibitem{Xue:2002}
    Nianwen Xue and S. P. Converse. 2002. Combining Classifiers for Chinese Word Segmentation. {\em Proceedings of the First SIGHAN Workshop on Chinese Language Processing}, 57-63.
    \bibitem{Xue:2003b}
    Nianwen Xue and Libin Shen. 2003. Chinese Word Segmentation as LMR Tagging. In {\em Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing, in conjunction with ACL'03}, 176-179. Sapporo, Japan.
    Traced back this way, all of the bakeoffs' success so far is really a victory for Xue's methodology in Chinese word segmentation.
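For readers unfamiliar with character tagging, the idea is to turn segmentation into a per-character labeling problem: each character receives a tag describing its position inside a word, a classifier such as ME or CRF predicts those tags, and the words are then read back off the tag sequence. The sketch below shows only the conversion in both directions, not the learning step. The four tag names B/M/E/S (begin, middle, end, single) and the function names are assumptions chosen for illustration; the cited papers define their own tag sets and features.

```python
def words_to_tags(words):
    """Map a segmented sentence (list of words) to per-character position tags."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, any middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild the word list from characters and their (predicted) tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


# Toy sentence, invented for illustration.
words = ["我们", "参加", "了", "比赛"]
tags = words_to_tags(words)
print(tags)
print(tags_to_words("".join(words), tags))
```

Training data for a tagger is obtained by applying the first function to a segmented corpus, and the second function turns the tagger's output back into a segmentation.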

    Comments (3)


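    扬 张 wrote:
    Quoted part 2, hehe.
    Looking forward to your next piece.

    Actually, someone has already reposted part 1 to the NLP board on SMTH (水木).
    7 Apr.
    areal wrote:
    Someone has finally read it. Encouraged by that, I am posting part 2, and I will try to finish the series within this month.
    7 Apr.
    扬 张 wrote:
    Good post, quoted it. Looking forward to part 2, hoho.
    7 Mar.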
