Project title

Research area and problem

The research area for this project is dialectology, specifically the ability of API-based corpora to capture elements of regional dialect variation.

The problem is the uncertainty over whether typed language can capture distinctive, regional aspects of language use (Clopper and Pisoni 2006; Grieve et al. 2019; Moisl, n.d.; Nguyen et al. 2017; “Methods and Objectives in Contemporary Dialectology,” n.d.; Zaghouani and Charfi 2018; Szmrecsanyi and Wolk 2011).

Research aim and question

To compare findings from API-based data with those from more traditionally compiled corpora, and to discover whether attributes of regional dialects are noticeable in language used on Twitter.

Are established aspects of regional dialects found in API-based data, such as tweets? To what extent are dialects discernible in typed language?



Clopper, Cynthia G., and David B. Pisoni. 2006. “The Nationwide Speech Project: A New Corpus of American English Dialects.” Speech Communication 48 (6): 633–44.
Grieve, Jack, Chris Montgomery, Andrea Nini, Akira Murakami, and Diansheng Guo. 2019. “Mapping Lexical Dialect Variation in British English Using Twitter.” Frontiers in Artificial Intelligence 2: 11.
Moisl, Hermann. n.d. “Using Electronic Corpora in Historical Dialectology Research: The Problem of Document Length Variation.” In Studies in English and European Historical Dialectology, edited by M. Dossena and R. Lass. Bern: Peter Lang.
Nguyen, Trong Duc, Anh Tuan Nguyen, Hung Dang Phan, and Tien N. Nguyen. 2017. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), 438–49.
Szmrecsanyi, Benedikt, and Christoph Wolk. 2011. “Holistic Corpus-Based Dialectology.” Revista Brasileira de Linguística Aplicada 11: 561–92.
“Methods and Objectives in Contemporary Dialectology.” n.d.
Zaghouani, Wajdi, and Anis Charfi. 2018. “Arap-Tweet: A Large Multi-Dialect Twitter Corpus for Gender, Age and Language Variety Identification.” arXiv:1808.07674 [cs], August.