Proceedings of the National Assembly for Wales

Hosted by: Universitat d'Alacant, Grup Transducens

This is a bilingual, sentence-aligned corpus of around 510,000 sentences
in Welsh and English, taken from the proceedings of the National Assembly
for Wales. The corpus was assembled by Dafydd Jones and Andreas Eisele for
use in their experiments on Welsh-English statistical machine translation.

The corpus contains only selected sentences from the Record of Proceedings
and is intended for linguistic analysis. It should not be taken as an
accurate account of the activity of the National Assembly for Wales.

The copyright terms and conditions may be found in the enclosed COPYRIGHT
file. A copy of the correspondence regarding the determination of
distribution rights may be found in the file CORRESPONDENCE.

If you use this corpus in your work, please cite the following paper:

  Jones, D. and Eisele, A. (2006) "Phrase-based statistical machine
  translation between English and Welsh". LREC-2006: Fifth International
  Conference on Language Resources and Evaluation. 5th SALTMIL Workshop
  on Minority Languages: “Strategies for developing machine translation
  for minority languages”, Genoa, Italy, 23 May 2006; pp.75-77.

Statistics:

  510792 cy-clean.txt
  510792 en-clean.txt

Bugs:

The text is currently in UTF-8, but it was converted from ISO-8859-1.
Some characters in the Welsh alphabet, such as ŵ and ŷ, do not appear in
the text. We hope to fix this and include them, but we need _your_ help.

Version history:

  * 1.0.1 Minor spelling and orthographic changes, removal of blank lines.
  * 1.0.0 The raw corpus as used by Jones and Eisele (2006)

Comments:

Please send all comments and suggestions to Francis Tyers.
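
Example:

The Statistics section lists the same count for cy-clean.txt and
en-clean.txt, which suggests that line N of one file is the translation of
line N of the other. The following is a minimal Python sketch under that
assumption; it is not part of the corpus distribution. The function names
read_aligned_pairs and count_chars are hypothetical, chosen for
illustration. The first reads the two files in step and stops if their
lengths differ; the second counts occurrences of ŵ and ŷ, which may help
anyone looking into the issue described under Bugs.

```python
from itertools import zip_longest


def read_aligned_pairs(cy_path, en_path, encoding="utf-8"):
    """Yield (welsh, english) sentence pairs, one pair per line number.

    Assumes line N of cy_path corresponds to line N of en_path.
    Raises ValueError if one file ends before the other.
    """
    with open(cy_path, encoding=encoding) as cy, \
         open(en_path, encoding=encoding) as en:
        for n, (cy_line, en_line) in enumerate(zip_longest(cy, en), start=1):
            if cy_line is None or en_line is None:
                raise ValueError(f"files differ in length at line {n}")
            yield cy_line.rstrip("\n"), en_line.rstrip("\n")


def count_chars(lines, chars="ŵŷŴŶ"):
    """Count how often each character in `chars` occurs across `lines`."""
    counts = {c: 0 for c in chars}
    for line in lines:
        for c in chars:
            counts[c] += line.count(c)
    return counts


# Usage sketch (file names taken from the Statistics section):
#
#   pairs = read_aligned_pairs("cy-clean.txt", "en-clean.txt")
#   print(count_chars(welsh for welsh, _ in pairs))
```

Reading both files lazily keeps memory use low, since only one pair of
lines is held at a time.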