Rikker Dockum - Swarthmore College

Rikker Dockum

Swarthmore College, Linguistics, Faculty Member

Yale University, Linguistics, Graduate Student

Personal website: http://rikkerdockum.com

Projects I have worked on:
http://www.pamanyungan.net/chirila/
http://sealang.net/sala
http://sealang.net/library
http://sealang.net/lab
http://sealang.net/dictionary/jones
http://sealang.net/dictionary/bradley/
http://sealang.net/ok/

Conference Presentations by Rikker Dockum

Toward a big tent linguistics: Inclusion and the myth of the lone genius [pre-print]

by Rikker Dockum and Caitlin Green

(In press) Oxford Collection on Inclusion in Linguistics, 2023

Linguistics has a documented history of divisiveness and remains poorly understood by the general public. Nevertheless, linguistics also has great unrealized potential for positive impact on global society, hand in hand with scholarship. We argue for an inclusive big tent linguistics that will help the discipline achieve its potential, and we outline three sources of current exclusion: (1) socialization into gatekeeping what counts as linguistics, with legitimacy tied to outdated opinions of what is more “scientific”, “rigorous”, “rational”, or “prestigious”, (2) epistemic injustice, including a tendency for hero-worship of “lone geniuses” of the field, and (3) a pattern of ignoring power imbalances in interactions, such as the demand for “civility,” often from the discipline’s least powerful members. We discuss the origins of these problems, some recent events that exemplify them, and suggest ways that all linguists, inclusively defined, can contribute to helping our scholarly community achieve a more uplifting culture.

The dialexical set: A diagnostic tool for studying sound change

25th International Conference on Historical Linguistics, Oxford University, 2022

The study of sound change in comparative linguistics includes many familiar tools and outputs such as sound correspondences, cognate sets, and reconstructed phonemes. In this paper I propose a broadly generalizable new concept for sound change comparison called the dialexical set, defined as a diagnostic set of specific etyma which pattern together regularly within a given individual lect, but vary across related lects.

The dialexical set differs from existing comparative concepts as follows: (1) Cognate sets focus on shared form-meaning correspondence; by contrast, dialexical sets are for studying sound change, and compile etyma and associated reflexes, regardless of semantic change. A given etymon may belong to one cognate set in a given language, but to several dialexical sets for each constituent sound. (2) Reflex sets are attestations associated with cognate sets, but do not necessarily entail any regularity across reflexes. The reflex set of some etymon includes all forms descended from it, whether regular or idiosyncratic, and is often used to reconstruct the proto-form of that etymon; whereas dialexical sets are clusters of etyma that pattern together, entailing regularity, and thus useful as diagnostics for studying sound change. (3) Sound correspondences are a form of analysis of the surface outcomes of sound change, but identifying them also often precedes the hypothesizing of etyma. They are derived from the examination of surface forms, but unlike dialexical sets, sound correspondences are not always tied back to specific etyma, especially not when published, and thus are not immediately extensible to additional related languages. (4) Etysets (Cooper 2014) are a type of distributable dataset, ‘cognate sets, phylogenetic trees, and reconstructed proto-forms’, and thus could include dialexical sets but are distinct from them.

The notion of the dialexical set is inspired by insights from two distinct comparative traditions: English lexical sets (Wells 1982) and Tai tone boxes (Gedney 1972). Wellsian lexical sets are ‘large groups of words which tend to share the same vowel’ and represented by a keyword. For example, the set KIT comprises ‘those words whose form in the two standard accents [RP and General American English] has the stressed vowel /ɪ/’ (1982:127). While not an explicitly diachronic concept, lexical sets neatly reflect and reveal sound changes in the many varieties of English across the globe. Tai tone boxes, on the other hand, group etyma based on shared segmental environments that conditioned modern tone categories (e.g. voiced vs. voiceless proto-onsets). Each cell represents some combination of proto-tone and natural class of proto-onsets; it comprises a set of etyma that pattern together tonally across the set of Tai lects. Reflexes of these etyma share a modern tone in a given Tai language, regardless of subsequent changes to onsets.

The major benefit of dialexical sets is that they are clusters of specific etyma. Unlike, say, tables of reconstructed phonemes, or lists of sound correspondences, defining a dialexical set entails listing the set of etyma that it comprises (as Wells and Gedney both did). This makes the sets immediately useful as diagnostics to other researchers wishing to study additional related lects at any level of granularity (language, dialect, sociolect, idiolect, etc.), without necessarily needing access to all of the original reflex data available to the linguist(s) who proposed the dialexical sets. It thus also addresses a shortcoming of much published comparative work.
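
As a rough illustration of the idea, the sketch below groups a handful of English words by their stressed vowel in two accents, so that words sharing the same pattern in every lect fall into one set. This is one possible reading of the concept, not the author's procedure: the toy data, the broad transcriptions, and the function names `group_by_profile` and `reflexes_within_sets` are all assumptions made for the example, and modern words stand in for etyma.

```python
from collections import defaultdict

# Hypothetical toy data: stressed vowel of each word in two accents
# (broad transcriptions, chosen only for illustration).
reflexes = {
    "trap":   {"RP": "æ",  "GenAm": "æ"},
    "cat":    {"RP": "æ",  "GenAm": "æ"},
    "bath":   {"RP": "ɑː", "GenAm": "æ"},
    "staff":  {"RP": "ɑː", "GenAm": "æ"},
    "palm":   {"RP": "ɑː", "GenAm": "ɑ"},
    "father": {"RP": "ɑː", "GenAm": "ɑ"},
}

def group_by_profile(reflexes, lects):
    """Group words whose reflexes match in every listed lect."""
    sets = defaultdict(list)
    for word, forms in reflexes.items():
        profile = tuple(forms[lect] for lect in lects)
        sets[profile].append(word)
    return dict(sets)

def reflexes_within_sets(sets, new_lect_forms):
    """For a further lect, list the distinct reflexes found inside each set.

    A set with exactly one reflex patterns regularly in that lect.
    """
    return {
        profile: {new_lect_forms[w] for w in members if w in new_lect_forms}
        for profile, members in sets.items()
    }

sets = group_by_profile(reflexes, ["RP", "GenAm"])
for profile, members in sets.items():
    print(profile, members)

# A third, hypothetical lect in which the BATH words pattern with TRAP.
new_lect = {"trap": "a", "cat": "a", "bath": "a", "staff": "a",
            "palm": "ɑː", "father": "ɑː"}
print(reflexes_within_sets(sets, new_lect))
```

Because each set is stored as an explicit list of words, a researcher with data from only the third lect can still check how those words pattern, which is the diagnostic use described above.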

Dating the emergence of phonemic length contrasts in แ- /ɛ/ and -อ /ɔ/ in Standard Thai

29th Meeting of the Southeast Asian Linguistics Society, 2019

From Haas (1964) we can establish that at some point prior to the publication of that dictionary, length contrast had emerged in the vowels แ- /æ/ and -อ /ɔ/ that the Thai script had not, and still has not, fully adapted to reflect. One of the dictionary’s useful innovations is the careful recording in IPA of a phonemic rendition of each word, regardless of its spelling in Thai script. Words whose pronunciation differs from that expected by their spelling are marked with an asterisk, to draw attention to their unexpected pronunciation and confirm that they are not typographic errors (Haas 1964: xxv). Entries thus marked include both unexpected long vowels, as in เช้า /cháaw/ ‘morning’, and unexpected short vowels, as in แล่น /læ̂n/ ‘to run’.

However, the question of precisely when these length contrasts emerged has remained unsettled. In a survey of Thai literature from Ayutthaya through the 20th century CE, and of dictionaries and grammars from the 19th-20th centuries, Pittayaporn (2016) concludes that length contrasts in these two vowels are quite recent, most likely emerging in the mid-20th century, based on lack of evidence prior to Haas, and use of <็> to indicate short /æ/ and /ɔ/ beginning in the mid-20th century (Dhananjayananda 1993).

New textual evidence indicates a much earlier emergence of these contrasts. In a 1917 letter, Cornelius Beach Bradley wrote to His Majesty Vajiravudh of Siam, King Rama VI, to provide constructive feedback on proposed modifications to the Thai script, changes which in the end never saw wide adoption. One of the critiques Bradley offers is that the new system fails to capture all of the vowel length contrasts in Thai. This represents the earliest known explicit documentation of these as phonemic contrasts. Bradley also includes a hand-drawn diagram of the vowel phonemes in Thai, clearly indicating contrasts in both vowels, with example words (Figure 1). A second piece of previously overlooked evidence comes from Michell (1892). Although not clearly explained in the frontmatter, this dictionary also appears to indicate with some regularity short vowels in the same places they are found in modern Thai, through the use of <a> instead of <aa> in its quasi-phonemic romanized pronunciation guides (Figure 2).

The combination of these two new sources of evidence allows us to confidently antedate this sound change to no later than the second decade of the 20th century CE, and most likely well before the end of the 19th century.

Which Came First, the Register or the Tone?: Tonogenesis and the East Asian Voicing Shift

by Ryan Gehrmann and Rikker Dockum

25th International Conference on Historical Linguistics (ICHL25), 2022

Languages from five major language families (Austroasiatic, Austronesian, Hmong-Mien, Kra-Dai & Sino-Tibetan) participate in the Greater Mainland Southeast Asia linguistic convergence area (GMSEA), covering Myanmar, Thailand, Laos, Cambodia, Vietnam, the Malay peninsula, northeast India and southern China (Enfield & Comrie 2015). The vast majority of these languages employ lexical contrasts of tone or register, which are typically thought of as the contrastive implementation of pitch and voice quality, respectively; but this oversimplifies the phenomena. Tone and register overlap in both synchronic phonetic expression and in diachronic development to such an extent that various recent proposals have questioned the necessity of the tone-register typological bifurcation (Brunelle & Kirby 2016, Dockum 2019, Dockum & Gehrmann 2021, Tạ 2021, Gehrmann 2022).

One such proposal is the concept of the East Asian Voicing Shift (EAVS), a massively crosslinguistic transphonologization of onset voicing contrasts as tones or registers that spread across GMSEA over the past millennium (Dockum 2019, Dockum & Gehrmann 2021). EAVS is integral to the received models of both tonogenesis and registrogenesis (Haudricourt 1954, Huffman 1976), but the question remains: why should EAVS produce clearly suprasegmental contrasts cued by pitch and voice quality (i.e. tones) in some languages and debatably suprasegmental contrasts cued by vowel quality and voice quality (i.e. registers) in others? We present two hypotheses here to address this question.

The Sequential Interpretation builds on the conventional explanations (Haudricourt 1965, Matisoff 1973) and predicts two outcomes when a language undergoes EAVS, depending on their suprasegmental typology: (1) If the language is non-tonal, it will become registral, but (2) if it has already developed tones under conditioning from historical coda laryngeal contrasts, it will undergo tone splits (cf. the ‘Great Tone Split’; Brown 1975). Thus, it is the relative historical ordering of EAVS with respect to tonogenetic events that determines the output.

The Simultaneous Interpretation is a new proposal, developed in response to the absence of unambiguous examples of languages innovating coda-conditioned tone contrasts without also undergoing EAVS (Gehrmann 2022). In this hypothesis, EAVS comes first, in the form of a phonetic shift in historical onset voicing contrasts from a VOT phasing realization to a register realization, cued by any combination of differential VOT, pitch, voice quality and/or vowel quality (Kirby & Brunelle 2017). Thereafter, two outcomes are predicted for languages that have undergone EAVS, depending on the relative prominence of these cues: (1) tone tends to develop in languages in which pitch has greater prominence (e.g. Vietnamese, Sinitic, Hmong-Mien, Kra-Dai), but (2) register tends to develop in languages where vowel quality cues are more prominent (e.g. various Austroasiatic languages, including Khmer). The Simultaneous Interpretation follows Thurgood (2007) in casting register as the primary driver of tonogenetic innovation in GMSEA, albeit with a different interpretation of the phonetic underpinnings of the process, decentering the role of voice quality.

We weigh the merits of these two interpretations, drawing on examples from modern GMSEA languages. The nature and chronology of EAVS and associated sound changes carry significant implications for the investigation and interpretation of suprasegmental diachrony in this region, specifically for modelling tone diversification within families, the regional spread of tonogenesis, reconstruction of past suprasegmental systems, and the interpretation of linguistic evidence from historical written sources.

The East Asian Voicing Shift and its role in the origins of tone and register

by Ryan Gehrmann and Rikker Dockum

95th Annual Meeting of the Linguistic Society of America, 2021

Video poster presentation with narration available here:
https://youtu.be/_Ma7PfrSSn4

We present here a new unified model to explain the origins and evolution of tone and register in East and Southeast Asia, topics which have generally been investigated separately in different academic circles. After briefly introducing this new Desegmental Model, we focus in on one of its principal component sound change mechanisms, which we call the East Asian Voicing Shift (EAVS). Through EAVS, onset voicing contrasts transphonologize into a register contrast upheld by a bundle of co-varying phonetic cues including pitch, voice quality and vowel quality. With few exceptions, this register contrast then either conditions a doubling of the lexical tone inventory in languages which already employ lexically contrastive pitch or it conditions a doubling of the vowel inventory in languages which do not. EAVS is massive in scope, having affected the vast majority of languages in the region across five separate language families. It is perhaps the most sweeping sound change event yet described in linguistics.

Gender representation in linguistic example sentences

Men are featured more as agents in linguistic examples, engaged in diverse intellectual activities, and violence depiction.

When enough is enough: Drawing linguistic generalizations from limited data

Linguistics necessarily involves generalizing from limited data. Documentation, no matter how detailed, can never completely capture the full complexity of a linguistic system. Given a limited lexicon for an underdocumented language, are conclusions that can be drawn from those data representative of the language as a whole? The question of what constitutes a minimal sufficient dataset is significant in evaluating the validity of generalizations drawn from those data, and significant to typology generally. The question is also necessary for both extinct languages, where further documentation is impossible, and living languages, where the finite resources of linguists--time and funding--place limits on the available data.

This talk presents the results of experiments attempting to answer this question with respect to phonology. Drawing from the CHIRILA lexical database of Australian indigenous languages (Bowern 2016), the languages with the most extensively documented lexicons were selected to stand as proxy for “complete” natural languages. This study used a sample of 37 languages with lexicons in CHIRILA ranging from 2,000 to 10,000 lexical items. A series of sampling experiments were performed on each test language to determine the smallest necessary wordlist size to achieve (1) full phonemic coverage for the language, and (2) accurate phonemic distribution compared to the full dataset. It is hypothesized that when these two criteria are met, a dataset represents a minimally complete sample of a language for basic phonological typology, and thus suitable for further analysis and generalization. For each test language, samples were generated of varying sizes (50, 100, 250, 500, 1000 and 2000 lexical items) and with varying sampling methods (first n items, last n items, n random items, and every nth interval). These were then analyzed to determine phonemic coverage and fidelity of phonemic distribution.
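
The sampling design described above could be sketched roughly as follows. This is an illustrative reconstruction, not the code used in the study: the toy lexicon is invented, words are assumed to be already segmented into phonemes, and total variation distance is an assumed stand-in for whatever measure of distributional fidelity the study actually used.

```python
import random

def sample_wordlist(lexicon, n, method, seed=0):
    """Draw n items using one of the four sampling methods named above."""
    if n >= len(lexicon):
        return list(lexicon)
    if method == "first":
        return lexicon[:n]
    if method == "last":
        return lexicon[-n:]
    if method == "random":
        return random.Random(seed).sample(lexicon, n)
    if method == "interval":
        step = len(lexicon) // n          # take every step-th item
        return lexicon[::step][:n]
    raise ValueError(f"unknown method: {method}")

def phoneme_counts(words):
    """Count phoneme tokens; each word is a tuple of phoneme symbols."""
    counts = {}
    for word in words:
        for ph in word:
            counts[ph] = counts.get(ph, 0) + 1
    return counts

def coverage(sample, full):
    """Share of the full lexicon's phoneme inventory attested in the sample."""
    full_inventory = set(phoneme_counts(full))
    return len(set(phoneme_counts(sample)) & full_inventory) / len(full_inventory)

def distribution_distance(sample, full):
    """Total variation distance between relative phoneme frequencies.

    0.0 means identical distributions. This measure is an assumption of
    the sketch, not something stated in the abstract.
    """
    p, q = phoneme_counts(sample), phoneme_counts(full)
    p_total, q_total = sum(p.values()), sum(q.values())
    return 0.5 * sum(abs(p.get(ph, 0) / p_total - q[ph] / q_total) for ph in q)

# Hypothetical toy lexicon, far smaller than the 2,000+ item lexicons in the study.
toy_lexicon = [
    ("p", "a"), ("t", "a"), ("k", "a"), ("m", "a"),
    ("p", "i"), ("t", "i"), ("k", "u"), ("ŋ", "a"),
]

for method in ("first", "last", "random", "interval"):
    for n in (2, 4, 8):
        s = sample_wordlist(toy_lexicon, n, method)
        print(method, n, coverage(s, toy_lexicon),
              round(distribution_distance(s, toy_lexicon), 3))
```

In a real run the sample sizes would be those listed above (50 to 2000 items), and the smallest size at which coverage reaches 1.0 would be recorded for each language and method.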

The results show that coverage is consistently achieved at an average lexicon size of approximately 350 items, regardless of the original lexicon size sampled from. For linguistics generally, these results hold broad significance, given the predominance of word lists smaller than 400 items. For fieldwork, this study provides a guideline for designing documentation tasks in the face of limited time and resources. And for analysis, sufficiently representative data is imperative for drawing reliable conclusions. These results help to make empirically grounded decisions about which datasets are suitable for use in which research tasks.

3,232 languages are endangered; 100 language families have no speakers left, highlighting urgent documentation needs.

Numeral classifiers in areal perspective: Khmer and Thai 'syntactic borrowing' revisited

Huffman (1973) described ‘syntactic borrowing’ between Thai and Khmer, coming to the general conclusion that many aspects of Khmer syntax pattern unidirectionally after Thai as the result of extensive language contact. While the parallelism that holds between them remains striking, Thai is by no means the sole or obvious source of influence on Khmer. In order to draw firm conclusions about the historical syntax of Khmer, the Mon-Khmer branch, or of Austroasiatic generally, a cross-family perspective is required. We must not simply show their similarity, but either demonstrate or rule out larger family resemblance, larger geographic convergence, or mutual influence on both languages from a third source.

Increasing access to comparative data, diachronic data from epigraphic texts, and understanding of sociohistorical facts about mainland Southeast Asia, all motivate revisiting Huffman’s conclusions through careful point-by-point comparison of syntactic features. Each feature shared by Khmer and Thai can now be examined from a broader perspective than has been previously accomplished. This paper chooses numeral classifiers as a starting point, and examines them within Ross's (2007) framework for syntactic contact. The Ross schema is as follows:

Lexical calquing → Syntactic Calquing → Metatypy

Numeral classifiers serve as a case study in how we might approach each facet of this syntactic parallelism more thoroughly. This helps us to better determine the strength of the evidence at hand, and to move towards a more complete picture of how contact-induced change occurs in areas with strong linguistic convergence.

Study suggests Thai influences the reordering of classifiers in Khmer, indicating language contact and potential sources of classification changes.

Tone analysis in Southeast Asia: Computational modeling & traditional methods

The study of tone systems is a fundamental component of Southeast Asian linguistics. With three of the region’s five major language families predominantly tonal, and with tonal languages found in all five families, it is difficult to overstate the importance of tonal phenomena for both language-specific and larger areal study.

Computational methods are opening up possibilities for studying tone systems in novel ways. Recent quantitative work by Brunelle and Kirby (2015) confirms, among other findings, that (i) the simplistic tonal-atonal dichotomy is a poor representation of the actual complexity of tonal typology in Southeast Asia, and (ii) vertical transmission is crucial, as language family was the greatest predictor of a language’s tone profile. Meanwhile, Shosted et al (2014, 2015) in their work with Iu Mien argue for the benefits of implementing unsupervised computational modeling of lexical tone very early in the documentation process for understudied languages.

Results such as these motivate the expanded use of computational methods in comparison with traditional methods. This study implements and expands on methods outlined in Shosted et al (2014), and applies them to Chindwin Khamti, a variety of Khamti [kht] spoken in Khamti District, Sagaing Region, Burma. Khamti tones were manually segmented and extracted, and then modeled using principal components analysis (Johnson 2008:95) and k-means clustering in R (R Core Team 2015). This paper compares the results of this modeling against the author’s earlier work assigning tone categories using traditional methods of researcher audition and instrumental analysis, in consultation with native informants.
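
To give a sense of the clustering step, the sketch below runs a plain k-means (Lloyd's algorithm) over a few invented pitch contours. It is not the study's code: the original analysis was carried out in R and applied principal components analysis first, whereas this sketch is in Python, omits the PCA step, and uses a simple farthest-point initialization so that the result is reproducible. The contour values (Hz at three time points) and the choice of four clusters are assumptions for the toy example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two contours."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Pointwise mean of a list of equal-length contours."""
    n = len(points)
    return tuple(sum(dim) / n for dim in zip(*points))

def init_centers(points, k):
    """Deterministic farthest-point initialization (an assumption of this sketch)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and center update until stable."""
    centers = init_centers(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                      # keep the old center if a cluster empties
                centers[j] = mean(members)
    return labels, centers

# Hypothetical time-normalized f0 contours: two tokens each of four tone shapes.
contours = [
    (200, 200, 200), (205, 202, 198),   # high level
    (150, 170, 200), (148, 172, 205),   # rising
    (210, 170, 130), (208, 168, 128),   # falling
    (120, 118, 115), (122, 119, 117),   # low
]

labels, centers = kmeans(contours, k=4)
print(labels)
```

The cluster labels can then be compared token by token with tone categories assigned by ear, which is the kind of comparison described above.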

Such comparison enables mutual methodological improvement. Computational modeling can help to strengthen existing analyses, where fieldworker expertise often varies. Moreover, linguists experienced in traditional methods can help to further delineate and refine the usefulness of the newer quantitative methods, and explore new areas of linguistics to apply them in. By so doing we can achieve more sophisticated and increasingly nuanced understanding of tonal phenomena in Southeast Asia and beyond.

Tonal evidence in historical linguistics: genetic signal or typological noise?

The internal structure of Southwestern Tai (SWT) remains an unresolved issue in Tai historical linguistics and is the subject of numerous proposals (Chamberlain 1975; Robinson 1994; Edmondson & Solnit 1997; Kullavanijaya & L-Thongkum 1998; Luo 2001, etc). One reason for this may be that tonal data figures prominently in many SWT classifications, despite cross-family areal similarities in tonal systems. Disagreement on the reliability of such evidence for historical analysis (e.g. Matisoff 1973; Chamberlain 1979) is not fully resolved, but recent quantitative work shows that typological diversity is much greater than the common tonal/atonal dichotomy, and that genetic relationship is a strong predictor of typological tone profile (Brunelle & Kirby 2015).

The place of Khamti [kht] within SWT has been one point of disagreement in past classifications. Treated as homogeneous in the literature, most work on Khamti to date is based on data gathered in Northeast India (Robinson 1849; Needham 1894; Harris 1976; Weidert 1977), though the scholarly consensus is that Khamti speakers migrated to India from northern Burma (Inglis 2014). New documentation work by the author in Khamti District, Burma, shows just four lexical tonemes, instead of the five described in India, and analysis following the Gedney (1972) method shows a very different history of splits and mergers. Morey (2005) has also reconstructed the tonal system from Robinson's (1849) Khamti sketch grammar, showing a tone system in some ways more similar to Khamti of modern Burma than of modern India.

The result is what at first appears to be a mismatch between the tonal and segmental evidence for these different varieties. However, by approaching the historical tone categories using the logic of the comparative method, Khamti serves as a case study in better integrating tonal evidence into language classification. Comparison of the modern and historical varieties allows us to reconstruct a tonal system of their nearest common ancestor, Proto-Khamti, turning the apparent conflict of traditional methods between the tonal and segmental evidence into mutual corroboration.

Improving our understanding of how to integrate tonal evidence into historical linguistics is critical for resolving both the immediate and larger Tai subgrouping problems, as well as classification issues in tonal languages generally. This paper provides one example of how tonal evidence can be used in conjunction with traditional segmental evidence for more complete and reliable historical reconstruction and classification.

Tai Khamti of Burma and language classification in Southwestern Tai

Varieties of Khamti [ISO 639-3: kht] spoken in India and Burma (Myanmar) have largely been assumed to be the same, despite little comparative work between the two. This paper presents findings from 2014 fieldwork conducted in the Chindwin River valley, Khamti District, Burma. While uncontroversially classified as Southwestern Tai (SWTai), Khamti has been a point of disagreement. Chamberlain (1975) grouped Khamti with Shan, while Edmondson and Solnit (1997) place Khamti and Shan in separate sub-branches. Luo (2001) suggests Khamti may be part of a new branch of Tai altogether.

A significantly different tonal system is found in Khamti of the Chindwin River valley, with just four lexical tonemes instead of the five described in India. Analysis following Gedney (1972) shows a different history of splits and mergers, and Morey’s (2005) reconstruction of mid-19th century Khamti tones shows a system in some ways more similar to Khamti of modern Burma than modern India. As previous SWTai subgroupings have been based on Indian data, this requires revisiting both the question of Khamti’s alignment within SWTai and the use of tonal evidence for language classification generally.

The mismatch between tonal and segmental evidence can be resolved using the logic of the comparative method, and Khamti is presented as a case study in the utility of tonal evidence for language classification. Comparison of the modern and historical varieties allows us to reconstruct a tonal system of their nearest common ancestor, and the apparent conflict between the tonal and segmental evidence becomes mutual corroboration.

Additional documentation in Burma is needed to draw more firm conclusions and tease out relationships between closely related SWTai varieties. However, improving our understanding of tonal evidence in historical linguistics is critical for resolving both the immediate and larger Tai subgrouping problems, as well as classification issues in tonal languages generally, and Khamti provides us one example of how we can do so.

Sanskrit loanword adaptation in Old Thai: epigraphic evidence from the Sukhothai corpus

Southeast Asia has long been a crossroads of culture and commerce, and these interactions inevitably leave their mark upon the local languages. Modern Thai is replete with words borrowed from Sanskrit, Pali, Khmer, English, and many other languages. However, spelling standardization of the Thai writing system in the 20th century, while often preserving or even restoring useful etymological information, has also obscured some facts in the linguistic record. Spelling variation that predates modern standardization provides important evidence for how Sanskrit loanwords were pronounced in the past, how they were adapted into the language, and how those borrowings have evolved over time.
This study will return to the earliest Thai texts: the corpus of epigraphic inscriptions of the Sukhothai kingdom. The corpus comprises around 100 texts of varying lengths, found primarily on stone inscriptions dating from the 13th-15th centuries CE. This corpus begins with the genesis of Thai writing and extends until the decline of Sukhothai. Under the auspices of a previous Fulbright grant, these texts have been compiled, romanized, and analyzed by the author.
Sanskrit loanwords in the Sukhothai corpus provide an untapped source of linguistic evidence into the influence of Sanskrit in a transitional period of Southeast Asian history. As the neighboring Khmer empire was gradually undergoing a change in religion, a parallel shift in linguistic influence was also taking place throughout Southeast Asia, from the Sanskrit of Hinduism to the Pali of Buddhism. The Sukhothai texts offer us a glimpse into this changing world, through the window of these Sanskrit loanwords.

Towards a regional convention for epigraphic transliteration of Southeast Asia’s Brahmic scripts

Abugida writing systems from the Brahmic script family date back in Southeast Asia more than 1500 years. They still dominate mainland Southeast Asia, and were once widely used throughout maritime Southeast Asia as well. A rich epigraphic record of mostly stone and metal inscriptions has long been appreciated for the historical information it provides. However, its linguistic content remains sorely underutilized, despite being an abundant source of evidence and insight into the intricate history of contact between Southeast Asia’s major language families.

With very similar scripts being used to write often very different languages, the epigraphic record readily reveals etymological connections and loan relationships that can be obscured by phonology. For instance, shared Sanskrit vocabulary becomes immediately apparent, to the point of finding many identical spellings in inscriptions centuries apart from different language families. In addition to general difficulty of access, widely varying local conventions for transcription—if the texts have been transcribed at all—have also served to obscure this important resource.

This paper examines the utility of a unified convention for epigraphic transliteration in Southeast Asia, as well as the challenges involved in developing such a convention. This effort has academic parallels both in the development of IAST for South Asian epigraphy beginning at the end of the 19th century, and in the role IPA has played in modern descriptive linguistics.

Thai in Transition and the Thai Gigaword/Terabyte Web Corpus

The Thai language has always been in transition; the result of both external influence and internal innovation. English influence is only the most recent in a long line; over the past 1,000 years, Chinese, Khmer, Mon, Sanskrit, and Pali have all left substantial marks on the language. Thai has changed internally as well, with new patterns of nominalization, use and intent of passive forms, and innovations in the form of classifiers all appearing in the last century or two. These changes are sometimes prescriptive, as in the ‘linguistic’ edicts promulgated by King Mongkut, and in the work of the Royal Institute’s word-coining committee. But they are also entirely spontaneous; created by the tongues and hands of more than 60 million Thai speakers and writers.

It is only within the past few years, however, that we are able to observe the transition of Thai in progress via the Internet. We estimate that at the end of 2007, roughly 100 million Thai-language Web pages have been indexed by the major search engines (based on counts of the extremely common word ที่). A conservative estimate of 1,000 Thai characters per page implies that this Web corpus includes some 100 billion characters, or more than 25 billion words. Thus, we are already able to consult a gigaword corpus 25 times over, and can anticipate access to a terabyte corpus within a few years.
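
The arithmetic behind this estimate can be laid out as a short calculation. The page count and characters per page come from the paragraph above; the figure of 4 characters per word is an assumption inferred from the ratio of the two totals, and the byte sizes assume plain Thai text with no markup (1 byte per character in a single-byte encoding such as TIS-620, 3 bytes per Thai character in UTF-8).

```python
pages = 100_000_000           # Thai-language pages indexed (estimate from the text)
chars_per_page = 1_000        # conservative estimate from the text
chars_per_word = 4            # assumption: implied by 100 billion chars / 25 billion words

total_chars = pages * chars_per_page             # 100 billion characters
total_words = total_chars // chars_per_word      # 25 billion words
gigaword_multiples = total_words / 1_000_000_000

# Rough storage size of the text alone, ignoring markup (assumption of this sketch).
bytes_single_byte = total_chars * 1              # e.g. TIS-620
bytes_utf8 = total_chars * 3                     # Thai characters take 3 bytes in UTF-8

print(f"{total_chars:,} characters")
print(f"{total_words:,} words = {gigaword_multiples:.0f} gigaword corpora")
print(f"{bytes_single_byte / 1e9:.0f} GB single-byte, {bytes_utf8 / 1e9:.0f} GB UTF-8")
```

On these assumptions the text already amounts to a few hundred gigabytes in UTF-8, which is consistent with expecting a terabyte-scale corpus within a few years.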

This paper discusses the Thai Web Corpus, part of a suite of tools and lexical resources for Southeast Asian language study. We describe its design and operation, and show how it is used to explore two new works by the Royal Institute: the Dictionary of New Words (2006), and Foreign Words that Can be Replaced with Thai Words (2007).


From Lost to Online: A Digital eText + Image Edition of the First Thai-English Dictionary

In 2005 a brief entry in an unpublished handlist from the British Museum Library was revealed to be a hand-written Thai-English dictionary dating from the first half of the nineteenth century. On inspection, the microfilmed manuscript of “Or. 11828, 256 ff” yielded an extensive dictionary of nearly 10,000 entries, arranged in an authentic Thai alphabetical order not seen again until the publication of Dan Beach Bradley’s 1873 monolingual dictionary. The microfilm continued with what appeared to be an early draft of Rev. J. Caswell’s Treatise on the Tones of the Siamese Language, written in possibly the same hand.

Because the manuscript is unsigned and tantalizingly similar to other early texts, it took comparisons of every feature – content, ordering, and handwriting – to develop a theory of the dictionary’s origin. The first known monolingual dictionary, also written in longhand, is dated 1846 by J.H. Chandler, who ‘copied and enlarged’ it from a lost earlier work by Caswell. But in text and form it is clearly unrelated to the British Museum manuscript. More promising clues emerged along a different path – the well-preserved letters and journals of early missionaries in Siam, including those of Bradley. These pointed to Mr. and Mrs. J. Taylor Jones, the first American missionaries in Siam, as the likely authors, and to the manuscript itself as being the earliest Thai-English dictionary.

This talk will present my investigation of the British Museum manuscript, now believed to be the long-lost Jones dictionary, and describe its lexicographic properties and orthographic features. I will also discuss the methodology used to prepare an integrated, searchable online edition of the text, meant to mitigate a significant roadblock for Southeast Asian studies – the world-wide dispersal of early texts. The online Jones incorporates both digital text (searchable by both original spellings and their modern equivalents) and digitized images of the original document, providing immediate access to all text features, and enabling ongoing research on the region’s early manuscripts.


Convergences in Khumi and Marma morphosyntax


Papers by Rikker Dockum

Tai Khamti of the Upper Chindwin River Valley

Basic word order in Tai Khamti: Language contact with Burmese

Abstract: The issue of basic word order in Tai Khamti (hereafter Khamti) has been a topic of debate among linguists for decades. There is no question that historically it would have been Subject-Verb-Object (SVO), like most Tai languages, but speakers also use Subject-Object-Verb (SOV) word order, apparently due to contact with neighboring SOV languages. Needham's (1894) grammar states that SOV is the basic word order in Khamti. Using data from her own fieldwork, Khanittanan (1986) argued that SVO was all but gone from Khamti, and that the language had fully transitioned to SOV. Diller (1992) showed that the syntactic generalizations laid out by Needham do not always hold, and used data from other Tai languages of Northeast India to argue for pragmatically driven word order, rather than SOV as basic. Morey (2006) introduced extensive additional data from Northeast India, also arguing that Khamti's verb-final ordering is pragmatically driven. The Khamti case presents an inte...

Khamti's SOV structure influenced by prolonged contact with various SOV languages over 200 years.


Gender bias and stereotypes in Large Language Models

Proceedings of The ACM Collective Intelligence Conference

Large Language Models (LLMs) have made substantial progress in the past several months, shattering state-of-the-art benchmarks in many domains. This paper investigates LLMs' behavior with respect to gender stereotypes, a known issue for prior models. We use a simple paradigm to test the presence of gender bias, building on but differing from WinoBias, a commonly used gender bias dataset, which is likely to be included in the training data of current LLMs. We test four recently published LLMs and demonstrate that they express biased assumptions about men and women's occupations. Our contributions in this paper are as follows: (a) LLMs are 3-6 times more likely to choose an occupation that stereotypically aligns with a person's gender; (b) these choices align with people's perceptions better than with the ground truth as reflected in official job statistics; (c) LLMs in fact amplify the bias beyond what is reflected in perceptions or the ground truth; (d) LLMs ignore crucial ambiguities in sentence structure 95% of the time in our study items, but when explicitly prompted, they recognize the ambiguity; (e) LLMs provide explanations for their choices that are factually inaccurate and likely obscure the true reason behind their predictions. That is, they provide rationalizations of their biased behavior. This highlights a key property of these models: LLMs are trained on imbalanced datasets; as such, even with the recent successes of reinforcement learning with human feedback, they tend to reflect those imbalances back at us. As with other types of societal biases, we suggest that LLMs must be carefully tested to ensure that they treat minoritized individuals and communities equitably.

CCS Concepts: • Human-centered computing → HCI theory, concepts and models; Interactive systems and tools; Natural language interfaces; • Social and professional topics → Gender.

Models chose pronouns in a biased manner, citing ambiguity only 5% of the time and selecting one occupation consistently, skewed towards male.


Data on acoustic phonetic properties of non-coronal fricatives in monosyllabic words of Zhongjiang Chinese

Data in Brief, Jun 1, 2022

The data reported in this article are non-coronal fricative measurements from 10 (5 male; 5 female) native speakers of Zhongjiang Chinese. Each speaker produced 10 repetitions of 90 monosyllabic words beginning with either a velar fricative, /x/, or a labiodental fricative, /f/. The measurements reported include spectral properties often used to characterize fricative variation, including: spectrum center of gravity (CoG), spectrum standard deviation (SD), spectrum skew, spectrum kurtosis, maximum amplitude frequency, and maximum amplitude. These measurements are compared across two data filtering conditions: a high pass filter condition, in which a 300 Hz high pass filter was applied to the data before spectral measurements were calculated, and a no filter condition. The 90 monosyllabic words include the target fricatives in different phonetic environments. Target words include some that historically derive from different fricatives and show variation across regional varieties of Mandarin Chinese. Subsets of the target materials enable several closely matched comparisons of items. We describe measurements across the whole dataset, comparing as well the effect that filtering has on the measurements. The data also include a CSV file with measurements of each token, which enables comparison of phonetic contexts, lexical effects and individual differences in fricative variation beyond those described here. For further discussion of the data, please refer to the full length article entitled “The role of gestural timing in non-coronal fricative mergers in Southwestern Mandarin: acoustic evidence from a dialect island,” Journal of Phonetics [6].
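To make the spectral measures named above concrete, the following is a minimal sketch of how the first four spectral moments might be computed from a power spectrum, with and without a 300 Hz cutoff. Everything in it is an assumption rather than the article's procedure: the function name `spectral_moments`, the toy spectrum, the use of excess kurtosis (subtracting 3), and especially the cutoff, which simply drops bins below the threshold as a crude stand-in for a real high pass filter.

```python
import math

def spectral_moments(freqs_hz, power, highpass_hz=None):
    """Return CoG, SD, skewness and excess kurtosis of a power spectrum.

    highpass_hz, if given, discards bins below that frequency. This is a
    crude brick-wall stand-in for a high pass filter (an assumption).
    """
    bins = [(f, p) for f, p in zip(freqs_hz, power)
            if highpass_hz is None or f >= highpass_hz]
    total = sum(p for _, p in bins)

    # Centre of gravity: power-weighted mean frequency.
    cog = sum(f * p for f, p in bins) / total

    def central(n):
        # n-th power-weighted central moment about the CoG.
        return sum(((f - cog) ** n) * p for f, p in bins) / total

    var = central(2)
    return {
        "cog": cog,
        "sd": math.sqrt(var),
        "skew": central(3) / var ** 1.5,
        "kurtosis": central(4) / var ** 2 - 3.0,  # excess kurtosis
    }

# Toy spectrum (hypothetical values): strong low-frequency energy at 100 Hz
# plus a symmetric peak centred on 1000 Hz.
freqs = [100.0, 500.0, 1000.0, 1500.0]
power = [10.0, 1.0, 2.0, 1.0]

unfiltered = spectral_moments(freqs, power)
filtered = spectral_moments(freqs, power, highpass_hz=300.0)
print(unfiltered)
print(filtered)
```

In this toy case, removing the 100 Hz bin raises the centre of gravity and leaves a symmetric spectrum with zero skew, which illustrates why the two filtering conditions can yield different measurements for the same token.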
