[11-19] Three Lectures on Multi-language Information Processing

Date: 2013-11-15

The venue for these lectures is: Lecture Room (334), 3rd Floor, Building #5, State Key Laboratory of Computer Science, Institute of Software, CAS.

The title, speaker, abstract, and biography for each of the three lectures are as follows:

(1) Title: Tibetan Base Noun Phrase and Out-of-Vocabulary Word Recognition Methods from Large-Scale Web Resources

Speaker: Minghua Nuo

Time: 2:00-3:00 pm, 19th November 2013

Abstract:

Out-of-vocabulary (OOV) words in large-scale Tibetan web text degrade the performance of dictionary-based Tibetan word segmentation systems. Many Tibetan OOV words are noun phrases. We present a novel method for extracting Tibetan base noun phrases from their Chinese correspondences. It is a two-phase procedure: identifying Chinese base noun phrases, then finding their Tibetan correspondences. In accordance with the morphological characteristics of Tibetan, we propose a Tibetan base noun phrase identification method based on head-phrase extension. The Tibetan base noun phrases produced by our method form a candidate phrase-level Tibetan lexicon. In addition, we present approaches for recognizing Tibetan OOV words. In preprocessing, explicit natural annotation and Tibetan enclitics are used to recognize correct OOV words. OOV words are then recognized from word segmentation fragments: a newly built statistical function, SEC, determines whether a candidate Tibetan syllable string is an OOV word. Preliminary experiments show that the proposed approaches recognize Tibetan OOV words effectively and improve the performance of Tibetan word segmentation.
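
As a rough illustration of how OOV candidates can surface as word segmentation fragments, the sketch below runs forward maximum matching over a syllable sequence and merges adjacent unmatched syllables into candidate strings. It is a minimal sketch under stated assumptions, not the speaker's method: the lexicon, the romanized placeholder syllables, and the function names `segment` and `oov_candidates` are all hypothetical, and the SEC function from the abstract is not specified here, so no scoring step is included.

```python
def segment(syllables, lexicon, max_len=4):
    """Forward maximum matching over syllables.

    Returns a list of (word, known) pairs, where word is a tuple of
    syllables and known says whether it was found in the lexicon.
    """
    result = []
    i = 0
    while i < len(syllables):
        match = None
        # Try the longest span first, then shrink.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = tuple(syllables[i:i + n])
            if candidate in lexicon:
                match = candidate
                break
        if match is None:
            # No lexicon entry starts here: emit a one-syllable fragment.
            result.append((tuple(syllables[i:i + 1]), False))
            i += 1
        else:
            result.append((match, True))
            i += len(match)
    return result


def oov_candidates(segmented):
    """Merge runs of adjacent unmatched syllables into candidate strings."""
    candidates, run = [], []
    for word, known in segmented:
        if known:
            if run:
                candidates.append(tuple(run))
                run = []
        else:
            run.extend(word)
    if run:
        candidates.append(tuple(run))
    return candidates


# Toy lexicon and input (romanized placeholders, not real corpus data).
LEXICON = {("bod",), ("skad", "yig")}
SYLLABLES = ["bod", "skad", "yig", "dra", "rgya", "bod"]

segmented = segment(SYLLABLES, LEXICON)
candidates = oov_candidates(segmented)
```

In a fuller pipeline, each candidate string would then be passed to a statistical test before being accepted as a new word.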

Biography:

Minghua Nuo is an assistant professor at the Institute of Software, Chinese Academy of Sciences. She received her Ph.D. degree in computer science from the Institute of Software, Chinese Academy of Sciences. Her research interests include multilingual processing and machine translation.

(2) Title: Hybrid Statistical-Structural Online Handwritten Tibetan Character Recognition Based on Tibetan Components

Speaker: Longlong Ma

Time: 3:00-4:00 pm, 19th November 2013

Abstract:

The structural nature of Tibetan characters has inspired component-based recognition, but segmenting components from characters remains a challenge. This talk describes a new component-based online handwritten Tibetan character recognition approach that combines the advantages of statistical methods and component-based structural methods. We first establish a Tibetan component model database using a semi-supervised component annotation method. A strategy of optimizing segmentation hypotheses improves the accuracy of automatic component annotation. We then present a component-based recognition framework using an integrated segmentation-recognition method. The character pattern is over-segmented into a sequence of sub-structure blocks, and an integrated segmentation and recognition method based on a CRF model determines the component segmentation points in these block sequences. The CRF model combines component shape likelihood with geometrical likelihood, and its parameters are learned using an energy minimization method. Our experimental results demonstrate the effectiveness of the proposed approach.
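
The idea of choosing segmentation points over an over-segmented block sequence can be sketched with a small dynamic program. This is only an illustrative stand-in, not the CRF model from the talk: the function `best_segmentation`, the toy `SHAPE_SCORES` table, and the geometric penalty are hypothetical, and the scores are hand-set rather than learned.

```python
def best_segmentation(blocks, score, max_span=3):
    """Find the grouping of consecutive blocks with the highest total score.

    blocks:   sequence of over-segmented sub-structure blocks
    score:    function mapping a span of blocks to a number
    max_span: largest number of blocks allowed in one component
    Returns (total_score, [(start, end), ...]) with end exclusive.
    """
    n = len(blocks)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_span), end):
            total = best[start] + score(blocks[start:end])
            if total > best[end]:
                best[end] = total
                back[end] = start
    # Walk the back-pointers to recover the chosen component boundaries.
    spans = []
    end = n
    while end > 0:
        spans.append((back[end], end))
        end = back[end]
    spans.reverse()
    return best[n], spans


# Toy stand-ins for the two likelihood terms (assumed values, not learned).
SHAPE_SCORES = {"ab": 2.0, "c": 1.0}


def toy_score(span):
    shape = SHAPE_SCORES.get("".join(span), -5.0)  # component shape term
    geometry = -0.5 * (len(span) - 1)              # geometric term
    return shape + geometry


total, spans = best_segmentation(["a", "b", "c"], toy_score)
```

Here the blocks `a` and `b` are merged into one component and `c` is kept separate, because that grouping has the highest combined score among all candidate segmentations.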

Biography:

Longlong Ma is an associate professor at the Institute of Software, Chinese Academy of Sciences. He received his Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences. His research interests include pattern recognition and machine learning, especially their applications to character recognition and document analysis.

(3) Title: Tibetan Text Resource Collection and Corpus Building from News and Broadcasting Websites

Speaker: Huidan Liu

Time: 4:00-5:00 pm, 19th November 2013

Abstract:

Text on news and broadcasting websites is an important resource for building a Tibetan text corpus. Based on a study of the distribution of Tibetan text resources on the internet, we crawled eight Tibetan websites that focus mainly on news and broadcasting. We extracted the topic title, publishing date, author, topic content, and other topic-related information by building one or more templates for each website, taking advantage of information such as the page's content layout, its meta information, and the commented text in the HTML. Using this method, we collected web pages created from February 2008 to May 2013 and built a corpus containing 140 thousand documents, 3.56 million sentences, or 64 million syllables of Tibetan text in total.
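
A per-site extraction template of the kind described above might look roughly like the following sketch, which uses only Python's standard `html.parser`. Everything specific in it is an assumption for illustration: the `TEMPLATE` layout, the tag and attribute names, the `pubdate` meta name, the `author:` comment convention, and the sample page are hypothetical and do not come from the crawled websites.

```python
from html.parser import HTMLParser

# Hypothetical template for one site: where each field lives in the markup.
TEMPLATE = {
    "title": ("h1", "class", "title"),       # content layout
    "content": ("div", "class", "article"),  # content layout
    "date_meta": "pubdate",                  # meta information
    "author_comment_prefix": "author:",      # commented text in the HTML
}


class TemplateExtractor(HTMLParser):
    """Collects fields from one page according to a site template.

    Limitation of this sketch: a field element must not contain a nested
    element with the same tag name.
    """

    def __init__(self, template):
        super().__init__()
        self.template = template
        self.fields = {}
        self._current = None  # field currently being captured
        self._tag = None      # tag that closes the current field

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == self.template["date_meta"]:
            self.fields["date"] = attrs.get("content", "")
            return
        if self._current is None:
            for field in ("title", "content"):
                t, attr, value = self.template[field]
                if tag == t and attrs.get(attr) == value:
                    self._current, self._tag = field, tag
                    self.fields[field] = ""

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._tag:
            self.fields[self._current] = self.fields[self._current].strip()
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def handle_comment(self, data):
        text = data.strip()
        prefix = self.template["author_comment_prefix"]
        if text.startswith(prefix):
            self.fields["author"] = text[len(prefix):].strip()


def extract(page, template=TEMPLATE):
    parser = TemplateExtractor(template)
    parser.feed(page)
    parser.close()
    return parser.fields


# Made-up sample page for demonstration only.
PAGE = """<html><head><meta name="pubdate" content="2013-05-01"></head>
<body><!-- author: Example Author -->
<h1 class="title">Sample headline</h1>
<div class="article">First sentence. Second sentence.</div>
</body></html>"""

fields = extract(PAGE)
```

Keeping the site-specific details in a data structure rather than in code means a new website only needs a new template, which matches the one-or-more-templates-per-site approach in the abstract.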

Biography:

Huidan Liu received his B.S. degree in science from Beijing Jiaotong University and his M.S. and Ph.D. degrees in computer science from the Institute of Software, Chinese Academy of Sciences. He has been an engineer at the Institute of Software, Chinese Academy of Sciences, since 2009. He is involved in research on Tibetan information processing, including Tibetan word segmentation, Tibetan web text resource extraction, and Tibetan information retrieval.