Enhancing LSTM-based Word Segmentation Using Unlabeled Data

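Source: The 16th China National Conference on Computational Linguistics and the 5th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data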
Word segmentation is commonly cast as a sequence labeling problem. The traditional approach is a machine learning method such as a conditional random field with hand-crafted features. Recently, deep learning approaches have achieved state-of-the-art performance on word segmentation, and LSTM networks are a popular choice among them. This paper presents a method for introducing numerical, statistics-based features computed from unlabeled data into LSTM networks and analyzes how they improve the word segmentation model. We add pre-trained character-bigram embeddings, pointwise mutual information, accessor variety, and punctuation variety to our model and compare their performance on several datasets, including three datasets from the CoNLL-2017 shared task and three simplified Chinese datasets. We achieve state-of-the-art performance on two of them and comparable results on the rest.
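Two of the statistics named above, pointwise mutual information and accessor variety, can be counted directly from raw, unsegmented text. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names `bigram_pmi` and `accessor_variety` are hypothetical, probabilities are plain relative frequencies without smoothing, and each sentence boundary is treated as a single context symbol, which simplifies the usual accessor variety definition. The abstract does not say how these values are fed into the LSTM, so that step is left out.

```python
import math
from collections import Counter, defaultdict


def bigram_pmi(corpus):
    """Pointwise mutual information (log base 2) of adjacent character pairs.

    `corpus` is an iterable of raw, unsegmented sentences (strings).
    Probabilities are unsmoothed relative frequencies (an assumption).
    """
    unigrams = Counter()
    bigrams = Counter()
    for sent in corpus:
        unigrams.update(sent)                # count single characters
        bigrams.update(zip(sent, sent[1:]))  # count adjacent character pairs
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def accessor_variety(corpus, max_len=2):
    """Accessor variety of every substring of length 1..max_len.

    AV(s) = min(number of distinct left neighbours of s,
                number of distinct right neighbours of s).
    Sentence start and end each count as one neighbour symbol
    (a simplification made for this sketch).
    """
    left = defaultdict(set)
    right = defaultdict(set)
    for sent in corpus:
        for n in range(1, max_len + 1):
            for i in range(len(sent) - n + 1):
                s = sent[i:i + n]
                left[s].add(sent[i - 1] if i > 0 else "<s>")
                right[s].add(sent[i + n] if i + n < len(sent) else "</s>")
    return {s: min(len(left[s]), len(right[s])) for s in left}
```

A high PMI for a character pair suggests the two characters tend to belong to the same word, while a high accessor variety for a string suggests it occurs in many different contexts and is therefore more likely to be a standalone word.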
As a minority language, Tibetan has received relatively little attention in the field of natural language processing (NLP), especially in current various neural network models. In this paper, we investig
Collocation Extraction plays an important role in machine translation, information retrieval, secondary language learning, etc., and has obtained significant achievements in other languages, e.g. English a
Recently, image caption which aims to generate a textual description for an image automatically has attracted researchers from various fields. Encouraging performance has been achieved by applying deep
This paper puts forward theme analysis problem in order to automatically solve composition writing questions in Chinese college entrance examination. Theme analysis is to distillate the embedded seman
One of the important works of Information Content Security is evaluating the theme words of the text. Because of the variety of the Chinese expression, especially of the abbreviation, the supervision o
Entity linking is a task of linking mentions in text to the corresponding entities in a knowledge base. Recently, entity linking has received considerable attention and several online entity linking syst
Recently, many researchers have concentrated on using neural networks to learn features for Distant Supervised Relation Extraction (DSRE). However, these approaches generally employ a softmax classifier
Recently, Chinese implicit discourse relation recognition has attracted more and more attention, since it is crucial to understand the Chinese discourse text. In this paper, we propose a novel memory augm
In this paper, we put forward UIDS, a new high-performing extensible framework for extractive MultiLingual Document Summarization. Our approach looks on a document in a multilingual corpus as an item se
Named Entity Recognition (NER) is a tough task in Chinese social media due to a large portion of informal writings. Existing research uses only limited in-domain annotated data and achieves low performa