Twi Corpus: A Massively Twi-to-Handful Languages Parallel Bible Corpus

Authors: Adjeisah, M., Liu, G., Nortey, R.N., Song, J., Lamptey, K.O. and Frimpong, F.N.

Conference: 2020 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)

Dates: 17-19 December 2020

DOI: 10.1109/ISPA-BDCloud-SocialCom-SustainCom51426.2020.00157

Abstract:

This paper presents detailed modeling of a massively parallel Bible corpus that pairs Twi, a widely spoken Ghanaian language, with a handful of other languages. We discuss common issues encountered in obtaining, processing, converting, and formatting the corpus, and its latent value for success in NLP. The sentence-aligned data are stored in separate files, one per Twi-to-target language pair, with tab-delimited separation; verses sharing the same line number in a file pair are mappings of each other. It is often challenging to learn what a “clean” corpus looks like in lower-resource situations, especially when the target corpus is the only sample of the language's parallel text. We therefore performed unsupervised measurements on each sentence pair, using squared Mahalanobis distances to predict parallelism in the dataset. We then perform a statistical analysis of the collected corpora based on selected text categorization models for text classification, leveraging vector embeddings (such as Word2vec). Finally, we trained the Twi vocabulary for a 2D representation, in which similar words have closer vectors, using t-Distributed Stochastic Neighbor Embedding (t-SNE).
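
The abstract's unsupervised parallelism check can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tab-delimited verse pairs, the surrogate length features, and the outlier threshold are all assumptions; the paper does not specify which features feed the squared Mahalanobis distance.

```python
import numpy as np

def squared_mahalanobis(X):
    """Squared Mahalanobis distance of each row of X from the sample mean."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    inv = np.linalg.pinv(cov)  # pseudo-inverse guards against a singular covariance
    diff = X - mu
    return np.einsum("ij,jk,ik->i", diff, inv, diff)

# Hypothetical tab-delimited verse pairs (Twi \t English), one pair per line,
# mirroring the corpus file layout; the last pair is deliberately misaligned.
lines = [
    "Mfiase no na Onyankopon boo osoro ne asase.\tIn the beginning God created the heavens and the earth.",
    "Na asase no yee basaa.\tNow the earth was formless and empty.",
    "Onyankopon kae se, Hann mmra.\tAnd God said, Let there be light.",
    "Na hann bae.\tAnd there was light.",
    "Onyankopon huu hann no se eye.\tGod saw that the light was good.",
    "Aane.\tThis target side is far too long to be a plausible translation of a one-word verse.",
]
pairs = [ln.split("\t") for ln in lines]

# Crude surrogate features per pair: character length of each side.
feats = np.array([[len(s), len(t)] for s, t in pairs], dtype=float)
d2 = squared_mahalanobis(feats)
suspect = int(np.argmax(d2))  # the pair least consistent with the rest
```

Pairs with unusually large distances are flagged as likely misalignments; here the length-mismatched final pair stands out from the length correlation the other verses share.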

Source: Manual