Information for IWSLT 2008 Participants

Last updated (Tuesday, 30 November 1999 08:00). Author: Duan Zhiyan (段志岩). Monday, 23 June 2008 09:07

The "HIT-Corpus" is mainly a multi-source Chinese-English parallel corpus (including a proportion of spoken-language data), which should be useful for spoken language machine translation.

In response to many requests from IWSLT 2008 participants, and to advance research and development in the spoken language translation community, we are releasing part of the HIT-Corpus exclusively to IWSLT 2008 participants.

Please note that this one-off release is specifically for IWSLT 2008 evaluation only.

The released data may not be used in any other activity or for any other purpose.

After IWSLT, we will release a new version to the community for research use only.

To get the data, please download the following agreement and send the signed copy to:

       Mr. Yang Muyun
       P.O. Box 321,
       No 92, West Dazhi Street,
       Nangang District,
       Harbin, Heilongjiang, P.R. China
       (150001)

We will send the data as soon as we receive the signed agreement. (Delivery may be sped up by first faxing the signed copy to +86-451-8641-6225 (ext. 608), but the hard copy is still required.)

For further details, please contact Mr. Yang Muyun at "ymy-AT-mtlab-DOT-hit-edu-cn"
 

Machine Translation (Multi-Domain) Chinese-English Dictionary (1.0)

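Last updated (Tuesday, 30 November 1999 08:00). Author: Duan Zhiyan (段志岩). Wednesday, 23 April 2008 16:53

The basic Chinese-English dictionary contains 88,000 general-vocabulary entries (including the 6,763 Chinese characters in GB2312). Each entry carries several kinds of annotation: besides part of speech, there is a secondary classification (similar to semantic information), the head position within the phrase translation, and other necessary grammatical information.

The Chinese-English domain dictionary collects and organizes Chinese-English dictionaries for 25 domains, including: catering, computers, telecommunications, law, textiles, clothing, chemistry, environment, machinery, household appliances, construction, energy, agriculture, automobiles, business, biology, petroleum, mathematics, geology, physics, psychology, medicine, papermaking, and philosophy. All entries are recorded as nouns. The total is 265,000 entries.

License price: for-profit use in mainland China, RMB 200,000; for-profit use abroad, USD 80,000.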

 

Share bilingual corpus

Last updated (Friday, 26 February 2010 11:40). Author: Duan Zhiyan (段志岩). Wednesday, 23 April 2008 16:52

MITLAB is now providing the following linguistic resources for research purposes only:

1) Chinese-English Bilingual Text for the Travel Domain, 20K beads, about USD 1000;
2) Chinese-English Bilingual Text for Food and Drink Orders, 10K beads, about USD 800;
3) Chinese-English Bilingual Text for the Traffic Domain, 10K beads, about USD 800;
4) Chinese-English Bilingual Text for the Sports Domain, 5K beads, about USD 500;
5) Chinese-English Bilingual Text for the Business Domain, 10K beads, about USD 800;
6) Other Chinese-English sentence-aligned bilingual corpora, price negotiable.

For details of the corpora and how to obtain them, please see "Information for IWSLT 2008 Participants" above, or contact Mr. Yang Muyun at the address given there.


The HIT-MI&TLab Chinese-English structure-aligned parallel treebank consists of 17k tree pairs with sub-tree alignment. The sentence pairs come mainly from Chinese-English machine translation evaluation campaigns, English books used in Chinese universities and middle schools, and dictionaries. The sub-tree alignment was done manually so that aligned counterparts are precisely semantically equivalent. The Chinese word segmentation and the parse trees in both languages were checked manually and are therefore gold standard.
    Each entry in the corpus is a group of four lines: the 1st line is the English parse tree; the 2nd line is the Chinese parse tree; the 3rd line is the word alignment; the 4th line is the sub-tree alignment. The following is an example:
[000019] #1#S[#2#NP[#3#He/PRP/1 ]#4#VP[#5#is/VBZ/2 #6#NP[#7#a/ART/3 #8#Chinese/NNP/4 #9#boy/NN/5 ]]#10#./FSP/6 ]
[000019] #1#S[#2#他/r/1 #3#VP[#4#是/vx/2 #5#NP[#6#个/q/3 #7#BNP[#8#中国/nd/4 #9#男孩/nc/5 ]]]#10#。/wj/6 ]
[000019] (1:1); (2:2); (5:5); (4:4); (6:6);
[000019] (4:3); (6:5);
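
As a rough illustration of how such a four-line entry could be read, here is a minimal Python sketch. It is not an official tool of the lab. The token patterns are inferred from the single example above, and the names `Node`, `parse_tree`, `parse_entry`, `words` and `find` are hypothetical. It assumes that `#n#` is a node id, that a leaf is written `word/POS/index`, that word-alignment pairs refer to word indices, that sub-tree alignment pairs refer to node ids, and that each pair is ordered English:Chinese.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Token patterns inferred from the one published example (an assumption).
TOKEN = re.compile(
    r"#(?P<nid>\d+)#(?P<label>[^\s\[\]#/]+)\["                          # "#4#VP["
    r"|#(?P<lid>\d+)#(?P<word>\S+?)/(?P<pos>[^/\s\[\]]+)/(?P<idx>\d+)"  # "#3#He/PRP/1"
    r"|(?P<close>\])"                                                   # "]"
)
LINE = re.compile(r"^\[(\d+)\]\s*(.*)$")   # "[000019] <payload>"
PAIR = re.compile(r"\((\d+):(\d+)\)")      # "(4:3)"


@dataclass
class Node:
    node_id: int
    label: str                        # phrase label, or POS tag for a leaf
    word: Optional[str] = None        # set only on leaves
    word_index: Optional[int] = None  # set only on leaves
    children: List["Node"] = field(default_factory=list)

    def walk(self):
        """Yield this node and all descendants in left-to-right order."""
        yield self
        for child in self.children:
            yield from child.walk()


def parse_tree(text: str) -> Node:
    """Build a tree from one bracketed parse line (without the [id] prefix)."""
    root, stack = None, []
    for m in TOKEN.finditer(text):
        if m.group("close"):
            if not stack:
                raise ValueError("unbalanced ']'")
            stack.pop()
            continue
        if m.group("nid"):
            node = Node(int(m.group("nid")), m.group("label"))
        else:
            node = Node(int(m.group("lid")), m.group("pos"),
                        m.group("word"), int(m.group("idx")))
        if stack:
            stack[-1].children.append(node)
        elif root is None:
            root = node
        else:
            raise ValueError("more than one root node")
        if m.group("nid"):
            stack.append(node)  # internal node stays open until its ']'
    if root is None or stack:
        raise ValueError("empty or unbalanced tree")
    return root


def parse_entry(lines):
    """Parse the four lines of one entry into trees and alignment pairs."""
    matches = [LINE.match(line.strip()) for line in lines]
    if len(matches) != 4 or any(m is None for m in matches):
        raise ValueError("expected four lines of the form '[id] payload'")
    ids = {m.group(1) for m in matches}
    if len(ids) != 1:
        raise ValueError("the four lines do not share one sentence id")
    en, zh, word_al, tree_al = (m.group(2) for m in matches)

    def to_pairs(s):
        return [(int(a), int(b)) for a, b in PAIR.findall(s)]

    return {"id": ids.pop(), "en": parse_tree(en), "zh": parse_tree(zh),
            "word_alignment": to_pairs(word_al),
            "subtree_alignment": to_pairs(tree_al)}


def words(tree: Node):
    return [n.word for n in tree.walk() if n.word is not None]


def find(tree: Node, node_id: int) -> Node:
    return next(n for n in tree.walk() if n.node_id == node_id)


EXAMPLE = [
    "[000019] #1#S[#2#NP[#3#He/PRP/1 ]#4#VP[#5#is/VBZ/2 #6#NP[#7#a/ART/3 "
    "#8#Chinese/NNP/4 #9#boy/NN/5 ]]#10#./FSP/6 ]",
    "[000019] #1#S[#2#他/r/1 #3#VP[#4#是/vx/2 #5#NP[#6#个/q/3 "
    "#7#BNP[#8#中国/nd/4 #9#男孩/nc/5 ]]]#10#。/wj/6 ]",
    "[000019] (1:1); (2:2); (5:5); (4:4); (6:6);",
    "[000019] (4:3); (6:5);",
]

entry = parse_entry(EXAMPLE)
for en_id, zh_id in entry["subtree_alignment"]:
    en_node, zh_node = find(entry["en"], en_id), find(entry["zh"], zh_id)
    print(en_node.label, words(en_node), "<->", zh_node.label, words(zh_node))
```

Under these assumptions, the pair (4:3) links the English VP "is a Chinese boy" to the Chinese VP "是 个 中国 男孩", and (6:5) links the two NPs beneath them. Words that themselves contain brackets or whitespace are not handled by this sketch.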
To our knowledge, this is the first and the largest manually constructed Chinese-English bilingual corpus that is parsed on both sides and aligned at both the word and node level. It is designed to serve as a basic resource for NLP, machine translation, linguistic studies, and many other kinds of research and applications.
We welcome universities, institutes, organizations and companies to license and use our corpus for their own purposes.
For any inquiries, please contact us by email (the address is obscured on this page by the site's spam protection).