    07 April

    A brief survey of Bakeoff-3, part 2

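    OK, a basic conclusion can now be drawn.
    Achieving state-of-the-art Chinese word segmentation performance really takes only two ingredients:
    1. Character-based tagging
    2. A learning model along the lines of CRF/ME (conditional random fields or maximum entropy)
    As for features, those of Low and Ng, together with this year's results, already seem to be sufficient.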
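To make the first ingredient concrete, here is a minimal sketch of character-based tagging: a segmented sentence is turned into one tag per character, and a tag sequence is turned back into words. The four-tag B/M/E/S scheme (begin, middle, end, single) and the function names are assumptions chosen for illustration; the post itself does not fix a tag set, and a real system would predict the tags with a CRF/ME model rather than read them off gold segmentation.

```python
def words_to_tags(words):
    """Map a segmented sentence (list of words) to one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, any middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


words = ["中文", "分词", "很", "有意思"]
tags = words_to_tags(words)
print(tags)  # ['B', 'E', 'B', 'E', 'S', 'B', 'M', 'E']
print(tags_to_words("".join(words), tags))
```

Once segmentation is cast this way, it becomes an ordinary sequence labeling problem, which is why a CRF or ME tagger can be applied directly.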
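    From what I have observed, ever since segmentation standards came to be defined by corpora and machine-learning-based segmentation algorithms appeared, the feature systems used for segmentation seem to have gradually settled into two mainstream families.
    One is the feature system of Gao Jianfeng at MSRA (a senior fellow student of mine at Jiao Tong University), whose representative work is: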

    \bibitem{Gao:2005}
    Jianfeng Gao, Mu Li, Andi Wu and Chang-Ning Huang. 2005. Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach. Computational Linguistics, Vol. 31(4): 531-574.
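    This feature system was fully showcased in Tseng's closed-test system last year, and it was also put to good use in this year's France Telecom system.
    It is no accident that the France Telecom system took first place in the open test on the MSRA2006 corpus: Jianfeng's feature system was developed hand in hand with linguistic feature statistics gathered on corpora that follow the MSRA standard.
    France Telecom used Gao's features plus TBL (transformation-based learning), so their first place was really to be expected (although their score is actually still slightly lower than ours).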
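    The other feature system was proposed by Low and Ng last year, or actually the year before at EMNLP-04. On top of the n-gram features that everyone already knows, Ng introduced distinctions for punctuation, letters, times/dates, and digits.
    Simple as it is, this may well be the biggest breakthrough in linguistic features of the past three years. Chinese word segmentation differs from other language learning tasks: because it is the most basic, first-stage language processing step, it simply cannot draw on rich linguistic phenomena as auxiliary information. A typical example is POS features, which are unthinkable when learning segmentation, unless someone can assign POS tags to individual Chinese characters.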
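The two kinds of features mentioned above, character n-grams plus a coarse distinction between punctuation, letters, date/time characters, and digits, can be sketched roughly as follows. This is an illustration only, not the exact templates of Low and Ng: the class names, the `DATE_CHARS` set, the window size, the padding symbol, and the feature string format are all assumptions.

```python
import unicodedata

# Hypothetical set of characters treated as date/time markers.
DATE_CHARS = set("年月日时分秒")


def char_class(ch):
    """Assign a single character to a coarse class (assumed class inventory)."""
    if ch.isdigit():
        return "digit"
    if ch in DATE_CHARS:
        return "date"
    if ch.isascii() and ch.isalpha():
        return "letter"
    if unicodedata.category(ch).startswith("P"):  # any Unicode punctuation
        return "punct"
    return "other"


def char_features(chars, i, window=2):
    """Feature strings for position i: unigrams, bigrams, and class pattern."""
    def at(k):
        j = i + k
        return chars[j] if 0 <= j < len(chars) else None

    def sym(k):
        return at(k) if at(k) is not None else "<PAD>"

    def cls(k):
        return char_class(at(k)) if at(k) is not None else "pad"

    feats = []
    # Character unigrams in the window, e.g. C-2 ... C2
    for k in range(-window, window + 1):
        feats.append(f"C{k}={sym(k)}")
    # Adjacent character bigrams, e.g. C-1C0
    for k in range(-window, window):
        feats.append(f"C{k}C{k + 1}={sym(k)}{sym(k + 1)}")
    # Character-class pattern over the whole window
    feats.append("T=" + "|".join(cls(k) for k in range(-window, window + 1)))
    return feats


for feat in char_features(list("2006年"), 4):
    print(feat)
```

Each character's feature strings would then be fed, together with its tag, to the CRF/ME learner. Note that every feature here is computed from the raw characters alone, which is consistent with the point that segmentation cannot rely on higher-level annotations such as POS.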
    Next, I will briefly answer this question: why did MSRA's systems come close to sweeping every first place this time?

    Comments (2)

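    Jinhui YUAN wrote:
    Where can I find the proceedings you mentioned in your previous post? Many thanks.
    22 Apr.
    ke wu wrote:
    Senior, your understanding of word segmentation is truly deep. :) Looking forward to the next installment.
    12 Apr.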
