Topic Modelling Experiments on Hellenistic Corpora

Abstract: The focus of this study is Hellenistic Greek, a variation of Greek that continues to be of particular interest within the humanities. The Hellenistic variant of Greek, we argue, requires tools that are specifically tuned to its orthographic and semantic idiosyncrasies. This paper aims to put available documents to use in two ways: 1) by describing the development of a POS tagger and a lemmatizer trained on annotated texts written in Hellenistic Greek, and 2) by representing the lemmatized products as topic models in order to examine the effects of a) automatically processing the texts, and b) semi-automatically correcting the output of the lemmatizer on tokens occurring frequently in Hellenistic Greek corpora. In addition to topic models, we also generate and compare lists of semantically related words.

Get the gist of "Topic Modelling Experiments" with these excerpts

"The Hellenistic variant of Greek, we would argue, requires tools that are specifically tuned to its orthographic and even semantic idiosyncrasies." (p. 39) 
"The main contribution of this paper is its illustration of the importance of targeting machine learning tools toward specific datasets." (p. 47)

How to cite "Topic Modelling Experiments"

Wishart, Ryder, and Prokopis Prokopidis. “Topic Modelling Experiments on Hellenistic Corpora.” In Proceedings of the Workshop on Corpora in the Digital Humanities 17, 39–47. Bloomington, IN: CEUR Workshop Proceedings, 2017. Online: https://pdfs.semanticscholar.org/bd71/ab40960e481006117bafd0ae952d3e8d1f66.pdf.

How to access "Topic Modelling Experiments"

These conference proceedings are freely available online; see the URL given in the citation above.
