An End-to-end Method for Data Filtering on Tibetan-Chinese Parallel Corpus via Negative Sampling

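Source: The 18th China National Conference on Computational Linguistics and the 2019 Annual Academic Conference of the Chinese Information Processing Society of China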
In machine translation, a parallel corpus is the most important prerequisite for learning the complex mappings between a target language pair. In practice, however, the scale of the parallel corpus is not the only factor to consider when improving translation models, because the quality of the parallel data also has a large impact on model capacity. In recent years, neural machine translation (NMT) systems have become the de facto choice in MT research, but they are more vulnerable to noise in the training data than traditional statistical machine translation models. Data filtering is therefore an indispensable step in the NMT pre-processing pipeline. Instead of using discrete feature representations of basic language units to build a ranking function over sentence pairs, we propose a fully end-to-end parallel sentence classifier that estimates the probability that a given sentence pair consists of mutual translations. We tested the model in three scenarios: classification, sentence extraction, and NMT data filtering. All experiments showed promising results; in the Tibetan-Chinese NMT experiments in particular, applying our data filtering method yielded a gain of 3.7 BLEU, indicating the effectiveness of the model.
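The abstract does not spell out how training examples are built or how the classifier's scores are applied, so the following is only a minimal sketch of the general idea behind negative sampling and score-based filtering, not the authors' implementation. It assumes that aligned pairs are treated as positives, that negatives are made by pairing a source sentence with a randomly chosen non-matching target, and that pairs are kept when a score reaches a threshold. The names `build_training_examples`, `filter_corpus`, `negatives_per_pair`, `score`, and `threshold` are all hypothetical, and `score` stands in for whatever trained classifier produces the translation-equivalence probability.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]            # (source sentence, target sentence)
Example = Tuple[str, str, int]    # (source, target, label); 1 = parallel, 0 = not


def build_training_examples(
    pairs: List[Pair], negatives_per_pair: int = 1, seed: int = 0
) -> List[Example]:
    """Label aligned pairs 1 and add randomly mismatched pairs labelled 0.

    Assumption: the input pairs are clean enough to serve as positives.
    """
    rng = random.Random(seed)  # seeded so that sampling is reproducible
    examples: List[Example] = []
    for i, (src, tgt) in enumerate(pairs):
        examples.append((src, tgt, 1))
        # Candidate negatives: targets of other pairs that differ from the
        # true target, so a duplicate target is never labelled as negative.
        candidates = [
            j for j in range(len(pairs)) if j != i and pairs[j][1] != tgt
        ]
        k = min(negatives_per_pair, len(candidates))
        for j in rng.sample(candidates, k):
            examples.append((src, pairs[j][1], 0))
    return examples


def filter_corpus(
    pairs: List[Pair],
    score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Pair]:
    """Keep the pairs whose score is at least the threshold.

    `score` is a placeholder for a trained classifier that returns the
    estimated probability that the two sentences are mutual translations.
    """
    return [(src, tgt) for src, tgt in pairs if score(src, tgt) >= threshold]
```

In a setup like this, the labelled examples would train the classifier, and `filter_corpus` would then be run over the noisy corpus before NMT training. The threshold trades corpus size against corpus quality and would need to be tuned on held-out data.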