publications
2024
- Evolutionary pathways of complexity in gender systems. Olena Shcherbakova and Marc Allassonnière-Tang. Journal of Language Evolution 2024
Humans categorize the experience they encounter in various ways, which is mirrored, for instance, in grammatical gender systems of languages. In such systems, nouns are grouped based on whether they refer to masculine/feminine beings, (non-)humans, (in)animate entities, or objects with specific shapes. Languages differ greatly in how many gender assignment rules are incorporated in gender systems and how many word classes carry gender marking (gender agreement patterns). It has been suggested that these two dimensions are positively associated, as numerous assignment rules are better sustained by numerous agreement patterns. We test this claim by analyzing the correlated evolution (Continuous method in BayesTraits) and making causal inferences about the relationships (phylogenetic path analysis) between these two dimensions in 482 languages from the global Grambank database. By applying these methods to linguistic data matched to phylogenetic trees (a world tree and individual families), we evaluate whether various types of gender assignment rules (semantic, phonological, and unpredictable) are causally linked to more gender agreement patterns on the global level and in individual language families. Our results on the world language tree suggest that semantic rules are weakly positively correlated with gender agreement and that the development of agreement patterns is facilitated by different rules in individual families. For example, in Indo-European languages, more agreement patterns are caused by the presence of phonological and unpredictable rules, while in Bantu languages, the driving force of agreement patterns is the variety of semantic rules. Our study shows that the relationships between agreement and rules are family-specific and lends support to the idea that more distinct rules and/or rule types might be more robust in languages with more pervasive gender agreement.
- On the distribution and origin of sortal classifiers in Altaic languages. Shen-An Chen, Marc Allassonnière-Tang, Yung-Ping Liang, and 1 more author. Journal of Chinese Linguistics 2024
The grammatical feature of sortal classifiers, common in East and Southeast Asian languages, is also found in 15 of the 65 Altaic languages we have examined, though the classifiers are far fewer and used optionally. These observations suggest that the Altaic classifier systems are not indigenous. Based on the Single Origin Hypothesis that Chinese is the only language with an indigenous classifier system in Eurasia, we propose that the rise of classifiers in Altaic is due to the influence of neighboring classifier languages. Having first confirmed that the putative classifiers in these 15 languages are genuine classifiers, we then examine the phonological and semantic characteristics of the classifiers identified in each language and detect influence from either Chinese or Persian. Taking historical and geographical factors into consideration, we suggest that classifier languages east of Uyghur were influenced by Chinese, while those to the west were influenced by Persian; Uyghur itself was influenced by both. Assuming that Persian classifiers are not indigenous either, these findings suggest that the Single Origin Hypothesis is applicable to classifier languages in Altaic.
- The evolutionary dynamics of grammatical gender in Torricelli languages. Jose A Jódar-Sánchez and Marc Allassonnière-Tang. STUF - Language Typology and Universals 2024
Grammatical gender in New Guinea is an often neglected area in typological research, even though it is extremely diverse. For example, in New Guinea, some languages have grammatical gender systems with two sex-based categories, more than four gender-indexing targets, and no gender marking on nouns, while some languages have grammatical gender systems with many more categories, which are only marginally sex-based. This paper infers the processes of development and change of grammatical gender in Torricelli languages from two perspectives. First, it synthesizes the available data in the existing literature and hypothesizes the evolutionary pathway of gender systems in Torricelli languages. Nineteen Torricelli languages are selected as representative coverage of the 55 Torricelli languages listed in Glottolog within the limits of the available documentation. These languages are then coded based on six presence-absence features relating to gender marking on verbs, adjectives, nouns, numerals, pronouns, and demonstratives. Second, it conducts an analysis with phylogenetic comparative methods to provide a quantitative assessment of the evolutionary possibilities for gender systems in Torricelli languages. The preliminary results show that gender is likely marked at the root of Torricelli languages, with pronouns and verbs being at the core of the system. This is in agreement with trends reflecting the evolution of gender systems in languages across the world.
- How network structure shapes languages: Disentangling the factors driving variation in communicative agents. Mathilde Josserand, Marc Allassonnière-Tang, François Pellegrino, and 2 more authors. Cognitive Science 2024
Languages show substantial variability between their speakers, but it is currently unclear how the structure of the communicative network contributes to the patterning of this variability. While previous studies have highlighted the role of network structure in language change, the specific aspects of network structure that shape language variability remain largely unknown. To address this gap, we developed a Bayesian agent-based model of language evolution, contrasting between two distinct scenarios: language change and language emergence. By isolating the relative effects of specific global network metrics across thousands of simulations, we show that global characteristics of network structure play a critical role in shaping interindividual variation in language, while intraindividual variation is relatively unaffected. We effectively challenge the long-held belief that size and density are the main network structural factors influencing language variation, and show that path length and clustering coefficient are the main factors driving interindividual variation. In particular, we show that variation is more likely to occur in populations where individuals are not well-connected to each other. Additionally, variation is more likely to emerge in populations that are structured in small communities. Our study provides potentially important insights into the theoretical mechanisms underlying language variation.
- Early Segmental Production in Thai Preschool Children Learning Mandarin. I-Ping Wan, Marc Allassonnière-Tang, and Pu Yu. International Journal of Asian Language Processing 2024
The research aims to conduct a corpus-based and data-driven analysis of the early-stage Mandarin learning production of 11 Thai preschool children in Bangkok, Thailand, within the interlanguage system. These children consist of 8 boys and 3 girls, with an age range of 4;1-6;5 (M = 5.455, SD = 0.688; total tokens = 36,565). Data were extracted from a spoken corpus constructed between 2018 and 2022, which was time-stamped, phone-aligned, and multi-tiered using Praat [P. Boersma and D. Weenink, Praat: Doing phonetics by computer (2022), http://www.praat.org/]. The data were annotated and labeled through a semi-automatic process employing various applications in Hybrid-DNN-HMM. The findings indicate the following: (1) Most sound deviations in learning do not mirror the phonetic inventory of L1; (2) Sound deviations can be influenced by L2, with marked phones exhibiting more deviations between L1 and L2; (3) Interlanguage manifests as a self-organizing and self-adaptive system. The study delves into the Contrastive Analysis Hypothesis, Markedness Differential Hypothesis and Interlanguage theory. It compares data with cross-linguistic universal trends in Mandarin acquisition and spoken corpus in Mandarin adults. Segmental similarities regarding phonological distances are quantitatively measured through Levenshtein edit distance and Hamming distance based on multivalued distinctive features.
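The two distance measures named in this abstract can be illustrated with a minimal sketch. The feature table below is a hypothetical toy example, not the multivalued distinctive-feature coding used in the study, and the function names are assumptions made for illustration only.

```python
def levenshtein(a, b):
    """Edit distance between two sequences; insertion, deletion and
    substitution each cost 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def hamming(u, v):
    """Number of positions at which two equal-length feature vectors differ."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return sum(1 for p, q in zip(u, v) if p != q)


# Hypothetical multivalued features: (place, manner, voicing).
FEATURES = {
    "p": ("labial", "stop", "voiceless"),
    "b": ("labial", "stop", "voiced"),
    "s": ("coronal", "fricative", "voiceless"),
}

print(levenshtein("kitten", "sitting"))
print(hamming(FEATURES["p"], FEATURES["b"]))
```

Levenshtein distance compares whole segment strings, while Hamming distance compares two segments feature by feature, so the two measures capture different granularities of phonological distance.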
- LA80: A Lexical Database of 10 Bantu A80 Languages. Tessa Vermeir, Marc Allassonnière-Tang, and Guillaume Segerer. Journal of Open Humanities Data 2024
In this paper, we present LA80, a database containing lexical data of 10 Bantu A80 languages (Bekwel, Gyeli, Kol, Koonzime, Kwasio, Makaa, Mpiemo, Njyem, Shiwa and Sso). Data from existing fieldwork datasets have been compiled and formatted. We standardised French translations, corrected spelling mistakes, and merged overlapping data points, resulting in a database with 5,588 concepts. Furthermore, for a subset of 557 concepts available in at least six of the 10 languages, we did additional reformatting by separating prefixes from stems, something that is not done systematically in the source data. The LA80 database can be used for comparative linguistic analyses and diachronic reconstructions.
- The meaning of morphomes: distributional semantics of Spanish stem alternations. Borja Herce and Marc Allassonnière-Tang. Linguistics Vanguard 2024
Romance stem alternations have been argued to represent exclusively morphological objects (or “morphomes”) independent from semantic and syntactic categories. This conclusion has been based on feature-value analyses of the inflected forms, and definitions of natural classes that are theoretically driven and about which no consensus exists. Individual examples of morphomes are thus frequently challenged, while their autonomously morphological nature has never been tested quantitatively or experimentally. This is the purpose of the present study. We use context-based embeddings to explore the semantic profile of Spanish verb stem alternations. At the paradigmatic level, our findings suggest that Spanish morphomes’ cells are characterized by significantly above-chance distributional-semantic similarity. At the lexical level, similarly, verbs that show more similar patterns of alternation have also been found to be closer in meaning. Both of these findings suggest that these structures may have an extramorphological function. Using gradient distributional-semantic similarity offers a way to objectively assess the degree of (un)naturalness of a set of forms and meanings, something which has been lacking from most discussions on the structure of features and the architecture of paradigms.
- Vowel alternation with final i offers an easy-to-learn morphological option for a sex-blind grammatical gender in French. Marie-Claude Marsolier, Pris Touraille, and Marc Allassonnière-Tang. Frontiers in Psychology 2024
Like all modern Romance languages, French has a sex-based grammatical gender with two genders, feminine and masculine, and a lexicon that is highly sex-differentiated. These characteristics give rise to a number of issues, including the problematic generic use of the masculine grammatical gender, coupled with the challenge of sex categorization itself, and the epistemological difficulty of an adequate sociological description and analysis of what gender commonsense categories really are about. To remedy these concerns, several authors have proposed the creation of an additional, epicene grammatical gender. We have identified three such systematic proposals, or solutions, which specify various morphological options for new epicene nouns and gender markers on their satellite elements. These options include the use of non-standard or rarely used characters, the merging of feminine and masculine gender markers, as well as consonantal and vowel changes. In the simplest proposal, referred to as “solution I,” new epicene forms are mostly derived from feminine forms by systematically replacing with an i the final e that generally differentiates feminines from their masculine counterparts in written French. Although these solutions are used in some communities, their learnability has not been addressed so far, even though it could be a determining factor in their popularity and their eventual integration into standard French. In the present study, we provide a first assessment of this aspect by means of an online translation test. For each solution, French-speaking participants were instructed that they would be trained to learn an “alien” language that does not mark sex/gender categories (these alien languages correspond to standard French where only gendered words referring to people are replaced by the new epicene forms recommended by each solution). 
After a short learning-by-example phase, participants were required to translate into the alien language a set of 16 standard French sentences. The translations were analyzed as a function of several variables including the participants’ self-reported age and sex, the word categories and the solutions themselves. While all solutions proved quickly learnable, participants’ responses with solution I achieved the highest accuracy score, in particular with regard to the production of non-standard epicene forms.
- ‘Reflexemes’ – a first cross-linguistic insight into how and why reflexive constructions encode emotions. Alex Stephenson, Maïa Ponsonnet, and Marc Allassonnière-Tang. STUF - Language Typology and Universals 2024
This article presents the first study on reflexive expressions having lexicalized an emotional meaning, as in the English example enjoy oneself. Such lexicalized forms, which we call ‘reflexemes’, occur in a number of genetically unrelated languages worldwide. Here we interrogate the cross-linguistic distribution and semantics of reflexemes, based on a sample of 58 languages from 6 genetic groups throughout Europe, Australia, and Asia. Reflexemes exhibit uneven distribution in this sample. Despite the presence of reflexemes across all three continents, European languages generally display much larger inventories. Based on our language sample’s contrasts, we hypothesize that these disparities could be driven by: the form of reflexive markers; their semantic range, including colexifications with anticausative constructions; and their longevity, with ancient, cognate European markers fostering accumulation of reflexemes via inheritance and borrowing. As for semantics, reflexemes target comparable emotions across languages. Specifically, categories labelled ‘Good feelings’, ‘Anger’, ‘Worry’, ‘Bad feelings’ and ‘Fear’ are consistently most prevalent. These tendencies apply across our sample, with no sign of family- or continent-specific semantic tendency. The observed semantic distribution may reflect universal lexicalization tendencies not specific to reflexemes, perhaps combined with an emphasis on self-evaluation and other social emotions imparted by reflexive semantics.
- Early humans out of Africa had only base-initial numerals. One-Soon Her, Yung-Ping Liang, Eugene Chan, and 3 more authors. Humanities and Social Sciences Communications 2024
The vast majority of languages have numerals involving multiplication. Cross-linguistically, a numeral that involves a multiplier and a numeral base can be base-final, e.g., three hundred [three × hundred] in English, or base-initial, e.g., ikie ita [hundred × three] in Ibibio (Niger-Congo). A worldwide survey of 4099 languages reveals that 39% of the languages are base-initial, 48% are base-final, 4% use both orders, and 8% are without numeral bases. As the first step towards explaining this diversity and worldwide distribution, we offer convergent evidence to support the hypothesis that the languages of early humans in Africa had base-initial numerals. From a linguistic point of view, linearization is necessary for the verbal expression of multiplicative numerals. Between the two linear orders of multiplication, we demonstrate that the base-initial order has an initial advantage in communicative efficiency. We also offer typological evidence from the dominant head-initial word order in present-day numeral systems and nominal phrases in African languages. Finally, results from a phylogenetic analysis based on a global tree of human languages show that the base-initial order is more stable diachronically and more likely to be at the root of the reconstructed tree of languages in Africa between 100 and 150 thousand years ago. The dominant base-final order in non-African languages of modernity is thus likely to be a development after the Out-of-Africa exodus between 60 and 80 thousand years ago.
- Semantic and Phonological Distances in Free Word Association Tasks. Marc Allassonnière-Tang, I.-Ping Wan, and Chainwu Lee. In Chinese Lexical Semantics 2024
Free word association tasks are used to evaluate different hypotheses proposed by interactive and cascade models of speech processing. The interactive model predicts a small semantic and phonological distance between the target and the response words, whereas the cascade model predicts that the responses are semantically close to the targets but are phonologically far from the targets. One hundred forty-five stimuli tested with 22 participants resulted in 2289 tokens available for testing. The phonological and semantic distances were automatically measured using Levenshtein distance and word embeddings; additional metadata over 10M drawn from the Academia Sinica Corpus in Taiwan was computed. The results show that the stimuli and the responses are closer than random semantically and phonologically, supporting the predictions from the interactive models. However, we also observe that the semantic distance is shorter than the phonological distance. A concomitant increase in chronometry is found with longer semantic distance.
2023
- Phylogenetic analyses for the origin of sortal classifiers in Mongolic, Tungusic, and Turkic languages. Marc Allassonnière-Tang, Zhong-Liang Gao, Shen-An Chen, and 1 more author. Concentric 2023
Numeral classifiers are one of the most common types of nominal classification systems. Their geographical distribution worldwide is concentrated in Asia, which suggests a pattern of diffusion from a linguistic innovation. This study investigates the origin of classifier systems in the Mongolic, Tungusic, and Turkic languages in the Altaic region with a phylogenetic analysis based on data from 55 languages. The Single Origin Hypothesis suggests that Sinitic is the most probable original source of classifier systems found in Asia. Under this hypothesis, classifiers are unlikely to be an indigenous feature of the Altaic region, and indeed their phylogenetic signal turns out to be weak. We also conduct a qualitative analysis on the classifier inventory of the studied languages to assess the robustness of phylogenetic methods. The results also indicate that classifiers are most likely a borrowed feature in the Mongolic, Tungusic, and Turkic languages.
- Variation du genre des substantifs dans les dialectes gallo-romans. Étude exploratoire. Guylaine Brun-Trigaud, Maguelone Sauzet, and Marc Allassonnière-Tang. Géolinguistique 2023
This article analyzes a corpus of about 900 maps from the Atlas linguistique de la France (1902‑1910) in order to explore variation in the gender (masculine/feminine) of nouns in the Gallo-Romance dialects (Oïl, Occitan, Francoprovençal), compared with standard French, where this grammatical category has been heavily regularized by the norm. We used qualitative and quantitative methods (linear regression). The first results reveal an abundance of cases of variation, which semantic, etymological, and morpho‑phonological criteria inhibit or favor. The work of Platz (1918), a precursor in the study of gender in the ALF, provided interesting leads for our reflections.
- Nominal classification in Asia and Oceania: Functional and diachronic perspectives. Marc Allassonnière-Tang and Marcin Kilarski. In John Benjamins 2023
Linguists have long been interested in systems of nominal classification due to their diverse functions as well as cognitive and cultural correlates. Among others, ongoing research has focused on semantic, functional and morphosyntactic properties of complex systems such as co-occurring gender and numeral classifiers. Such approaches have typically focused on the languages of north-western South America and Papua New Guinea. This volume proposes to fill in a gap in existing research by focusing on Asia, based on case studies from languages belonging to a wide range of families, i.e., Austroasiatic, Austronesian, Dravidian, Hmong-Mien, Indo-European, Mongolic, Sino-Tibetan and Tai-Kadai as well as the language isolate Nivkh. Gender and classifiers in these languages are approached within several different perspectives, i.e., functional, typological and diachronic, thus revealing complex patterns in their lexical and pragmatic functions as well as origin, development and loss. Describing and analysing such properties is a unique and innovative contribution of the volume.
- Nominal classification in Assamese: An analysis of function. Pori Saikia and Marc Allassonnière-Tang. In Current Issues in Linguistic Theory 2023
We provide an analysis of the classifier system in Assamese (Indo-European) via the framework of functional typology. Assamese is located at the border of Indo-European and Sino-Tibetan language families, which are typically associated with grammatical gender and classifiers, respectively. Assamese represents an insightful example of an Indo-European language relying on classifiers rather than grammatical gender to fulfill the functions typical for a nominal classification system. Our analysis shows that classifiers in Assamese behave similarly to other classifier languages in terms of lexical and discourse functions, except for the functions of definiteness marking and individuation. The implications of such findings are connected to typology, research in human cognition, and language contact.
- Why we need a gradient approach to word order. Natalia Levshina, Savithry Namboodiripad, Marc Allassonnière-Tang, and 12 more authors. Linguistics 2023
This article argues for a gradient approach to word order, which treats word order preferences, both within and across languages, as a continuous variable. Word order variability should be regarded as a basic assumption, rather than as something exceptional. Although this approach follows naturally from the emergentist usage-based view of language, we argue that it can be beneficial for all frameworks and linguistic domains, including language acquisition, processing, typology, language contact, language evolution and change, and formal approaches. Gradient approaches have been very fruitful in some domains, such as language processing, but their potential is not fully realized yet. This may be due to practical reasons. We discuss the most pressing methodological challenges in corpus-based and experimental research of word order and propose some practical solutions.
- L’apport des données participatives pour l’étude linguistique des français du monde : le cas de l’opposition /a∼ɑ/. Mathilde Hutin and Marc Allassonnière-Tang. Journal of French Language Studies 2023
French is a language spoken by hundreds of millions of speakers in Europe, Africa, and America. Such widespread use favours variation, yet large homogeneous corpora that can account for this variation worldwide are scarce and would in any case require considerable financial and human resources, as did for instance the project Phonologie du Français Contemporain. In this study, we present a possible alternative: crowdsourcing. We introduce Lingua Libre, Wikimedia’s open linguistic library, and use it to describe the variation of a phonemic opposition between two low vowels, /a/ and /ɑ/, in several varieties of French. The recordings of 38 speakers from 26 survey points are processed automatically and compared to values from past research. Results show that the platform has the potential to provide results mostly congruent with those of professional field recordings. The study concludes with the advantages and limitations of the platform and offers suggestions for its improvement.
- Idéer une catégorie épicène et la matérialiser cohéremment dans la langue. Une nécessité épistémologique autant que politique. Priscille Touraille and Marc Allassonnière-Tang. In Qu’est-ce qu’une femme ? Catégories homme/femme : débats contemporains 2023
Pris Touraille wrote this chapter and takes scientific responsibility for it. This work is the first part of a collaboration with Claude (Miki) Marsolier, a linguistics enthusiast (and genetics researcher at the CEA and the MNHN, Paris), and Arc Allassonnière-Tang (with whom the paper at the conference Qu’est-ce qu’une femme ? was presented in Nantes in 2022). This collaboration consists in developing an epicene grammatical solution for French, which Pris Touraille began to think about as a possible tool for an epistemological break in the social sciences. The second part of this collaboration, carried out largely by M. Marsolier, is a concrete proposal of epicene forms that make up a system, « la solution en i » (the "solution in i"), which we also call « le français hors-sexe » ("sex-free French"). A third part of the project, largely led by Arc Allassonnière-Tang, is forthcoming and will be devoted to the test results and to the application created to develop this tool.
- Investigating the Syntax-Discourse Interface in the Phonetic Implementation of Discourse Markers. Mathilde Hutin, Liesbeth Degand, and Marc Allassonnière-Tang. In Proc. INTERSPEECH 2023
Discourse markers (DMs) are (chunks of) words stemming from the diachronic development of other parts-of-speech that tag the discourse’s organization (e.g., "well then", "innit"). However, in synchrony, formal accounts of the DM class vary from purely discourse-oriented definitions to models relying on a combination of lexico-grammatical and discursive information. We propose to bring new evidence into this debate by comparing the phonetic realizations of four DM types: those stemming originally from adverbs, coordinators, subordinators and interjections. A discourse-only account would predict that the four types would be realized similarly, while a syntactic-discursive account predicts that subordinators would stand out, as they are less prone to syntactic independence. The analysis of various acoustic parameters (segment duration, F0, F1, F2 and HNR) in a finely annotated 4-hour-long corpus of French indicates that a hybrid approach may indeed be more accurate.
- Intra- and inter-speaker variation in eight Russian fricatives. Natalja Ulrich, François Pellegrino, and Marc Allassonnière-Tang. The Journal of the Acoustical Society of America 2023
Acoustic variation is central to the study of speaker characterization. In this respect, specific phonemic classes such as vowels have been particularly studied, compared to fricatives. Fricatives exhibit important aperiodic energy, which can extend over a high-frequency range beyond that conventionally considered in phonetic analyses, often limited up to 12 kHz. We adopt here an extended frequency range up to 20.05 kHz to study a corpus of 15 812 fricatives produced by 59 speakers in Russian, a language offering a rich inventory of fricatives. We extracted two sets of parameters: the first is composed of 11 parameters derived from the frequency spectrum and duration (acoustic set) while the second is composed of 13 mel frequency cepstral coefficients (MFCCs). As a first step, we implemented machine learning methods to evaluate the potential of each set to predict gender and speaker identity. We show that gender can be predicted with a good performance by the acoustic set and even more so by MFCCs (accuracy of 0.72 and 0.88, respectively). MFCCs also predict individuals to some extent (accuracy = 0.64) unlike the acoustic set. In a second step, we provide a detailed analysis of the observed intra- and inter-speaker acoustic variation.
- A corpus-based quantitative study of numeral classifiers in Nepali. Krishna Parajuli and Marc Allassonnière-Tang. Corpus Linguistics and Linguistic Theory 2023
Nepali is typologically rare in terms of nominal classification systems, as it is one of the few languages of the world having simultaneously two gender systems (human/non-human, masculine/feminine) and one numeral classifier system (distinguishing features such as human, round-shaped objects, and long objects among others). Such a rare co-occurrence of different nominal classification systems is highly relevant for investigating linguistic complexity, as languages generally do not have several systems of the same type fulfilling the same functions. However, no corpus-based quantitative analyses have been conducted on the productive use of nominal classification systems in Nepali. The current paper aims at filling this gap by providing a token-based study from the Nepali National Corpus (∼20 million words). Our preliminary results show that there is in fact little formal overlap between the classifier and the gender systems.
2022
- Defining numeral classifiers and identifying classifier languages of the world. One-Soon Her, Harald Hammarström, and Marc Allassonnière-Tang. Linguistics Vanguard 2022
This paper presents a precise definition of numeral classifiers, steps to identify a numeral classifier language, and a database of 3,338 languages, of which 723 languages have been identified as having a numeral classifier system. The database, named World Atlas of Classifier Languages (WACL), has been systematically constructed over the last 10 years via a manual survey of relevant literature and also an automatic scan of digitized grammars followed by manual checking. The open-access release of WACL is thus a significant contribution to linguistic research in providing (i) a precise definition and examples of how to identify numeral classifiers in language data and (ii) the largest dataset of numeral classifier languages in the world. As such it offers researchers a rich and stable data source for conducting typological, quantitative, and phylogenetic analyses on numeral classifiers. The database will also be expanded with additional features relating to numeral classifiers in the future in order to allow more fine-grained analyses.
- The noncausal/causal alternation and genealogical affiliation: Quantitative testing in three Niger-Congo language families. Marc Allassonnière-Tang, Stéphane Robert, and Sylvie Voisin. Linguistique et Langues Africaines 2022
The noncausal/causal alternation is the pairing of two verb forms that refer to the same core event but differ in the absence vs. presence of a causer for this event (e.g. rise vs. raise, open (intr.) vs. open (tr.), die vs. kill). Languages differ in their overall preferences among the possible strategies for coding this alternation. This study uses machine-learning methods (clustering and tree-based computational classifiers) to investigate the predictive power of the noncausal/causal alternation for the genealogical affiliation of 38 languages belonging to the Atlantic, Mande and Mel families. The languages studied here belong to different contact areas in Senegal and its surroundings. The three families are all affiliated with the Niger-Congo phylum but display quite different typological profiles. The present paper elaborates on an earlier study that used a standard list of 18 verb pairs to establish the coding strategies in these languages. Apart from highlighting which coding strategies are favored in each family, our quantitative analyses show that the family affiliation of the 38 languages can be predicted with an accuracy above the majority baseline based on the information of the noncausal/causal alternation in the 18 verb pairs, but that the predictive power of verb pairs 1‑9 is generally lower than that of verb pairs 10‑18. Our results confirm the hypothesis that the first group of verb pairs shows universal rather than lineage-specific tendencies concerning the noncausal/causal alternation. Furthermore, our analyses identify which of the 18 verb pairs (and their correlated coding strategies) have the highest predictive power. This study opens new avenues for identifying the relevant synchronic data for genealogical classification in historical linguistics. Future studies could replicate the same analysis in different language families to assess whether our results are universal or specific to some language families.
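The "majority baseline" mentioned in this abstract is the accuracy obtained by always predicting the most frequent class. The sketch below shows how such a baseline can be computed and compared with a classifier's accuracy. The labels and predictions are hypothetical toy data, not the study's 38 languages, and the function names are assumptions.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def accuracy(true_labels, predicted_labels):
    """Share of predictions that match the true labels."""
    hits = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return hits / len(true_labels)


# Hypothetical family labels for ten languages (toy data only).
families = ["Atlantic"] * 5 + ["Mande"] * 3 + ["Mel"] * 2
# Hypothetical classifier output with two errors.
predicted = ["Atlantic"] * 5 + ["Mande"] * 2 + ["Atlantic"] + ["Mel"] + ["Mande"]

baseline = majority_baseline(families)
score = accuracy(families, predicted)
print(baseline, score, score > baseline)
```

A classifier is only informative about family affiliation if its accuracy exceeds this baseline, which is why the baseline depends on how unevenly the families are represented in the sample.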
- On Taiwanese Universities’ Two–One Academic Dismissal Policies: A Quantitative Fairness Analysis of the Four Policies of National Chengchi University. One-Soon Her, Jie-Wen Tsai, and Marc Allassonnière-Tang. Journal of Educational Research and Development 2022
Academic dismissal policies are used by universities worldwide for quality control purposes. Taiwanese universities base their policies solely on the credit fail rate (CFR) of individual semesters (S-CFR). The most common S-CFR is 50% and is called er-yi (two-one), which indicates half or more of the course credits of a semester were failed. Though actual policies vary among universities, their core designs generally rely on the concept of S-CFR. The present study first compares the dismissal policies among universities in the United States, the Netherlands, and Taiwan to demonstrate how the two–one design lacks consultation and review processes. We then argue that disregarding cumulative grade point average, semester grade point average, and cumulative credit pass rate may introduce bias, because students with better overall academic performance may end up being dismissed. We further validate the argument by conducting a quantitative analysis of data on the academic performance of students (N=22,703) from National Chengchi University over 11 years under four different policies. Our findings strongly indicate that the core design common in such policies, i.e., the S-CFR, should be reconsidered.
- Predicting grammatical gender in Nakh languages: Three methods compared. Jesse Wichers Schreur, Marc Allassonnière-Tang, Kate Bellamy, and 1 more author. Linguistic Typology at the Crossroads 2022
The Nakh languages Chechen and Tsova-Tush each have a five-valued gender system: masculine, feminine, and three “neuter” genders named for their singular agreement forms: B, D and J. Gender assignment in languages is generally analysed as being dependent on both forms and semantics (e.g. Corbett, 1991), with semantics typically prevailing over form (e.g. Bellamy & Wichers Schreur, 2021, Allassonnière-Tang et al., 2021). Most previous studies have considered only binary or tripartite gender systems possessing masculine, feminine, and neuter values. The five-valued system of Nakh thus represents an innovative and insightful case study for analysing gender assignment. In this paper we build on the existing qualitative linguistic analyses of gender assignment in Tsova-Tush (Wichers Schreur, 2021) and apply three machine-learning methods to investigate the weight of form and semantics in predicting grammatical gender in Chechen and Tsova-Tush. The results show that while both form and semantics are helpful for predicting grammatical gender in Nakh, semantics is dominant, which supports findings from existing literature (Allassonnière-Tang, Brown & Fedden, 2021). However, the results also show that the coded semantic information could be further fine-grained to improve the accuracy of the predictions (see also Plaster et al., 2013). In addition, we discuss the implications of the output for our understanding of language-internal and family-internal processes of language change, including how loanwords are integrated from Russian, a three-gender language.
- Operation LiLi: Using Crowd-Sourced Data and Automatic Alignment to Investigate the Phonetics and Phonology of Less-Resourced Languages. Mathilde Hutin and Marc Allassonnière-Tang. Languages 2022
Less-resourced languages are usually left out of phonetic studies based on large corpora. We contribute to the recent efforts to fill this gap by assessing how to use open-access, crowd-sourced audio data from Lingua Libre for phonetic research. Lingua Libre is a participative linguistic library developed by Wikimedia France in 2015. It contains more than 670k recordings in approximately 150 languages across nearly 740 speakers. As a proof of concept, we consider the Inventory Size Hypothesis, which predicts that, in a given system, variation in the realization of each vowel will be inversely related to the number of vowel categories. We investigate data from 10 languages with various numbers of vowel categories, i.e., German, Afrikaans, French, Catalan, Italian, Romanian, Polish, Russian, Spanish, and Basque. Audio files are extracted from Lingua Libre to be aligned and segmented using the Munich Automatic Segmentation System. Information on the formants of the vowel segments is then extracted to measure how vowels expand in the acoustic space and whether this is correlated with the number of vowel categories in the language. The results provide valuable insight into the question of vowel dispersion and demonstrate the wealth of information that crowd-sourced data has to offer.
- Inferring case paradigms in Koalib with computational classifiers. Nicolas Quint and Marc Allassonnière-Tang. Corpus Linguistics and Linguistic Theory 2022
The object case inflection in Koalib (Niger-Congo) represents complex patterns that involve phoneme position, syllable structure, and tonal pattern. Few attempts have been made with qualitative and quantitative approaches to identify the rules of the object case paradigms in Koalib. In the current study, information on phonemes, tones, and syllables is automatically extracted from a Koalib sample of 2,677 lexemes. The data is then fed to decision-tree-based classifiers to predict the object case paradigms and extract the interactive patterns between the variables. The results improve on the prediction accuracy of existing studies and identify the case paradigms predicted by linguistic hypotheses. New case paradigms are also found by the computational classifiers and explained from a linguistic perspective. Our work demonstrates that the combination of linguistic theoretical knowledge with machine learning techniques can become one of the methodological approaches for linguistic analyses.
- The evolutionary trends of noun class systems in Atlantic languages. Neige Rochant, Marc Allassonnière-Tang, and Chundra Cathcart. In Proceedings of the Joint Conference on Language Evolution 2022
Nominal classification systems such as grammatical gender (e.g., the masculine/feminine distinction in French) and noun classes (e.g., Bantu noun classes based on fruits, plants, liquids, among others) provide a window on how the human brain perceives and categorizes objects and experiences it encounters. While the diachronic development of grammatical gender systems is well studied, noun class systems have received less attention. We use phylogenetic comparative methods to analyze where noun classes are marked (on nouns, pronouns, demonstratives, articles, adjectives, numbers, and verbs) in thirty-six Atlantic languages and how these markers change diachronically. Our results show that noun class marking is generally preferred and more stable within the noun phrase, i.e., on nouns, demonstratives, and adjectives.
- Investigating phonological theories with crowd-sourced data: The Inventory Size Hypothesis in the light of Lingua Libre. Mathilde Hutin and Marc Allassonnière-Tang. In Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology 2022
Data-driven research in phonetics and phonology relies massively on oral resources, and access thereto. We propose to explore a question in comparative linguistics using an open-source crowd-sourced corpus, Lingua Libre, Wikimedia’s participatory linguistic library, to show that such corpora may offer a solution to typologists wishing to explore numerous languages at once. For the present proof of concept, we compare the realizations of Italian and Spanish vowels (sample size = 5000) to investigate whether vowel production is influenced by the size of the phonemic inventory (the Inventory Size Hypothesis), by the exact shape of the inventory (the Vowel Quality Hypothesis) or by none of the above. Results show that the size of the inventory does not seem to influence vowel production, thus supporting previous research, but also that the shape of the inventory may well be a factor determining the extent of variation in vowel production. Most of all, these results show that Lingua Libre has the potential to provide valuable data for linguistic inquiry.
- Crowd-sourcing for Less-resourced Languages: Lingua Libre for Polish. Mathilde Hutin and Marc Allassonnière-Tang. In Proceedings of the International Conference on Language Resources and Evaluation 2022
Oral corpora for linguistic inquiry are frequently built based on the content of news, radio, and/or TV shows, sometimes also of laboratory recordings. Most of these existing corpora are restricted to languages with a large amount of data available. Furthermore, such corpora are not always accessible under a free open-access license. We propose a crowd-sourced resource to fill this gap. Lingua Libre is the participatory linguistic media library hosted by Wikimedia France. It includes recordings from more than 140 languages. These recordings have been provided by more than 750 speakers worldwide, who voluntarily recorded word entries of their native language and made them available under a Creative Commons license. In the present study, we take Polish, a less-resourced language in terms of phonetic data, as an example, and compare our phonetic observations built on the data from Lingua Libre with the phonetic observations found by previous linguistic studies. We observe that the data from Lingua Libre partially matches the phonetic inventory of Polish as described in previous studies, but that the acoustic values are less precise, thus showing both the potential and the limitations of Lingua Libre to be used for phonetic research.
2021
- Expansion by migration and diffusion by contact is a source to the global diversity of linguistic nominal categorization systems. Marc Allassonnière-Tang, Olof Lundgren, Maja Robbers, and 5 more authors. Humanities and Social Sciences Communications 2021
Languages of diverse structures and different families tend to share common patterns if they are spoken in geographic proximity. This convergence is often explained by horizontal diffusibility, which is typically ascribed to language contact. In such a scenario, speakers of two or more languages interact and influence each other’s languages, and in this interaction, more grammaticalized features tend to be more resistant to diffusion compared to features of more lexical content. An alternative explanation is vertical heritability: languages in proximity often share genealogical descent. Here, we suggest that the geographic distribution of features globally can be explained by two major pathways, which are generally not distinguished within quantitative typological models: feature diffusion and language expansion. The first pathway corresponds to the contact scenario described above, while the second occurs when speakers of genetically related languages migrate. We take the worldwide distribution of nominal classification systems (grammatical gender, noun class, and classifier) as a case study to show that more grammaticalized systems, such as gender, and less grammaticalized systems, such as classifiers, are almost equally widespread, but the former spread more by language expansion historically, whereas the latter spread more by feature diffusion. Our results indicate that quantitative models measuring the areal diffusibility and stability of linguistic features are likely to be affected by language expansion that occurs by historical coincidence. We anticipate that our findings will support studies of language diversity in a more sophisticated way, with relevance to other parts of language, such as phonology.
- Identifying the Russian voiceless non-palatalized fricatives /f/, /s/, and /ʃ/ from acoustic cues using machine learning. Natalja Ulrich, Marc Allassonnière-Tang, François Pellegrino, and 1 more author. The Journal of the Acoustical Society of America 2021
This paper shows that machine learning techniques are very successful at classifying the Russian voiceless non-palatalized fricatives [f], [s], and [ʃ] using a small set of acoustic cues. From a data sample of 6320 tokens of read sentences produced by 40 participants, temporal and spectral measurements are extracted from the full sound, the noise duration, and the middle 30 ms windows. Furthermore, 13 mel-frequency cepstral coefficients (MFCCs) are computed from the middle 30 ms window. Classifiers based on single decision trees, random forests, support vector machines, and neural networks are trained and tested to distinguish between these three fricatives. The results demonstrate that, first, the three acoustic cue extraction techniques are similar in terms of classification accuracy (93% and 99%) but that the spectral measurements extracted from the full frication noise duration result in slightly better accuracy. Second, the center of gravity and the spectral spread are sufficient for the classification of [f], [s], and [ʃ] irrespective of contextual and speaker variation. Third, MFCCs show a marginally higher predictive power over spectral cues (<2%). This suggests that both sets of measures provide sufficient information for the classification of these fricatives and their choice depends on the particular research question or application.
- Investigating the branching of Chinese classifier phrases: Evidence from speech perception and production. Marc Allassonnière-Tang, Ying-Chun Chen, Nai-Shing Yen, and 1 more author. Journal of Chinese Linguistics 2021
The formal structure of the construction formed by a numeral (Num), a sortal classifier (C) or mensural classifier (M), and a noun (N), is controversial, as both left-branching [[Num C/M] N] and right-branching [Num [C/M N]] structures have been argued for in the literature. In this paper we report two psycholinguistic experiments on speech production and perception in Mandarin to investigate this branching issue. First, we applied the syntax-phonology interface of tone 3 (T3) sandhi and performed a phonological analysis of native speakers’ tone sandhi patterns of [Num C/M N] phrases composed of T3 monosyllabic words. Second, we conducted a click-detection experiment to see how native speakers would perceive a click inserted in a C/M phrase composed of monosyllabic words, as compared to when it is inserted in other syntactic structures with attested left- or right-branching. Results from both experiments supported the left-branching structure of classifier phrases.
- Syllable Complexity and Morphological Synthesis: A Well-Motivated Positive Complexity Correlation Across Subdomains. Shelece Easterday, Matthew Stave, Marc Allassonnière-Tang, and 1 more author. Frontiers in Psychology 2021
Relationships between phonological and morphological complexity have long been proposed in the linguistic literature, with empirical investigations often seeking complexity trade-offs. Positive complexity correlations tend not to be viewed in terms of motivations. We argue that positive complexity correlations can be diachronically well-motivated, emerging from crosslinguistically prevalent processes of language change. We examine the correlation between syllable complexity and morphological synthesis, hypothesizing that the process of grammaticalization motivates a positive relationship between the two features. To test this, we conduct a typological survey of 95 diverse languages and a corpus study of 21 languages with substantive (predominantly >10,000 words) corpora from the DoReCo project. The first study establishes a significant positive correlation between syllable complexity, measured in terms of maximal syllable patterns, and the index of synthesis (morpheme/word ratio). The second study tests the hypothesis that the relationship between syllable complexity and synthesis holds at local (word-initial and word-final) levels and within noun and verb types, as predicted by a grammaticalization account. While the findings of the corpus study are limited in their statistical power, the observed tendencies are consistent with our predictions. This study contributes important findings to the complexity literature, as well as a novel method which incorporates broad typological sampling and deep corpus analysis.
- Testing Semantic Dominance in Mian Gender: Three Machine Learning Models. Marc Allassonnière-Tang, Dunstan Brown, and Sebastian Fedden. Oceanic Linguistics 2021
The Trans-New Guinea language Mian has a four-valued gender system that has been analyzed in detail as semantic. This means that the principles of gender assignment are based on the meaning of the noun. Languages with purely semantic systems are at one end of a spectrum of possible assignment types, while others are assumed to have both semantic and formal (i.e., phonology- or morphology-based) assignment. Given the possibility of gender assignment by both semantic and formal principles, it is worthwhile testing the empirical validity of the categorization of the Mian system as predominantly semantic. Here, we apply three machine learning models to determine independently what role semantics and phonology play in predicting Mian gender. Information about the formal and semantic features of nouns is extracted automatically from a dictionary. Different types of computational classifiers are trained to predict the grammatical gender of nouns, and the performance of the computational classifiers is used to assess the relevance of form and semantics in relation to gender prediction. The results show that semantics is dominant in predicting the gender of nouns in Mian. While it validates the original analysis of the Mian system, it also provides further evidence that claims of an equal contribution of form-based and semantic features in gender assignment do not hold for at least a proper subset of languages with gender.
- Interindividual Variation Refuses to Go Away: A Bayesian Computer Model of Language Change in Communicative Networks. Mathilde Josserand, Marc Allassonnière-Tang, François Pellegrino, and 1 more author. Frontiers in Psychology 2021
Treating the speech communities as homogeneous entities is not an accurate representation of reality, as it misses some of the complexities of linguistic interactions. Inter-individual variation and multiple types of biases are ubiquitous in speech communities, regardless of their size. This variation is often neglected due to the assumption that “majority rules,” and that the emerging language of the community will override any such biases by forcing the individuals to overcome their own biases, or risk having their use of language being treated as “idiosyncratic” or outright “pathological.” In this paper, we use computer simulations of Bayesian linguistic agents embedded in communicative networks to investigate how biased individuals, representing a minority of the population, interact with the unbiased majority, how a shared language emerges, and the dynamics of these biases across time. We tested different network sizes (from very small to very large) and types (random, scale-free, and small-world), along with different strengths and types of bias (modeled through the Bayesian prior distribution of the agents and the mechanism used for generating utterances: either sampling from the posterior distribution [“sampler”] or picking the value with the maximum probability [“MAP”]). The results show that, while the biased agents, even when being in the minority, do adapt their language by going against their a priori preferences, they are far from being swamped by the majority, and instead the emergent shared language of the whole community is influenced by their bias.
- The Diversity of Classifier Inventory in Mandarin Dialects: A Case Study of Baoding. Na Song and Marc Allassonnière-Tang. Faits de Langues 2021
Our study compares Standard Mandarin (the Beijing dialect used in spoken and written registers) with the Mandarin dialect of Baoding (one of the Mandarin dialects belonging to the Jì-lŭ Mandarin group, Hebei-Shandong). Standard Mandarin and Baoding are geographically and phylogenetically closely related, but they differ in terms of their classifier system, as Standard Mandarin resorts to a wide array of sortal classifiers whereas Baoding only uses one general classifier. We first provide a detailed analysis of the unconventional classifier system in Baoding. Then, we compare the lexical and discourse functions of sortal classifiers in Standard Mandarin and Baoding. We show that Standard Mandarin does present a certain level of convergence with its geographical neighbour Baoding. However, these varieties also display significant divergences, as several lexical and discourse functions typically associated with classifier systems cannot be fulfilled by the only classifier found in Baoding.
- Topic modelling on archive documents from the 1970s: global policies on refugees. Philip Grant, Ratan Sebastian, Marc Allassonnière-Tang, and 1 more author. Digital Scholarship in the Humanities 2021
This study conducts a historical analysis of global policies on refugees within typewritten and digitally born documents (c. 55,000 pages) from international and national archives. The data originate from the 1970s and are stored in archives from the UK and US governments, plus the United Nations High Commissioner for Refugees (UNHCR). The overarching theme is to analyse the involvement of the UK, the USA, and the UNHCR in different refugee cases that occurred during the 1970s. To do so, we (1) identify the main topics in each document; (2) investigate the transmission of topics horizontally (between organizations) and vertically (through time); and (3) suggest targeted areas of the document set for further close reading by historians. Standard Optical Character Recognition and object detection are used to extract information from documents and categorize them. Then, natural language processing (NLP) methods like topic modelling and clustering are used to identify topics and the relationships between them across time. The results identify several main themes covered by different organizations and how the focus of each organization changes diachronically. Besides its academic contribution, this study also demonstrates how, through the use of existing techniques with limited customization, digital technologies in the hands of the historian can augment and complement qualitative methods in bringing to light the themes and trends demonstrated in large bodies of historical documents.
- What conditions tone paradigms in Yukuna: Phonological and machine learning approaches. Magdalena Lemus-Serrano, Marc Allassonnière-Tang, and Dan Dediu. Glossa: a journal of general linguistics 2021
Yukuna is an understudied Arawak language of North-West Amazonia with a privative tonal system. In this system, roots are underlyingly specified for tone, whilst affixes are toneless. However, affixation interacts with tone, leading to many variations in surface tonal patterns. This paper puts forth a qualitative analysis of Yukuna’s tonal system, and provides data-driven evidence in favor of this analysis using machine learning methods. More precisely, we use decision trees and random forests to assess quantitatively the predictions of the phonological analysis. A manually annotated corpus of verbal paradigms was split into a training and a testing set. We trained the computational classifiers on the first and tested their predictions on the second. We found that they predict the majority of the patterns and support the qualitative analysis. Additionally, they suggest avenues for enhancing the phonological analysis, by providing a ranking of the variables that highlights statistical tendencies within tonal patterns. Besides its contribution to understanding tonal systems in general and that of Yukuna in particular, our work also suggests that such machine learning approaches might become part of the complex theoretical and methodological toolkit needed for language description and linguistic theory development.
- A corpus study of lexical speech errors in Mandarin. I-Ping Wan and Marc Allassonnière-Tang. Taiwan Journal of Linguistics 2021
We investigate a corpus of lexical substitution speech errors in Mandarin conversation data and present how Mandarin speakers produce erroneous lexical items and how these items are related to the intended words. The corpus includes 747 lexical speech errors from 100 participants and applies the part-of-speech definition of the Academia Sinica Corpus. Our results partially match with the observations in Germanic and Romance languages. As an example, the data from Mandarin native speakers shows that erroneously produced words and target words are almost always found in the same parts of speech. Moreover, noun substitutions are the most common type of substitution within the majority of content word pairs. However, the occurrence of verb errors is higher in Mandarin than in other languages, possibly reflecting a word frequency effect.
- An empirical study on the contribution of formal and semantic features to the grammatical gender of nouns. Ali Basirat, Marc Allassonnière-Tang, and Aleksandrs Berdicevskis. Linguistics Vanguard 2021
This study conducts an experimental evaluation of two hypotheses about the contributions of formal and semantic features to the grammatical gender assignment of nouns. One of the hypotheses (Corbett and Fraser 2000) claims that semantic features dominate formal ones. The other hypothesis, formulated within the optimal gender assignment theory (Rice 2006), states that form and semantics contribute equally. Both hypotheses claim that the combination of formal and semantic features yields the most accurate gender identification. In this paper, we operationalize and test these hypotheses by trying to predict grammatical gender using only character-based embeddings (that capture only formal features), only context-based embeddings (that capture only semantic features) and the combination of both. We performed the experiment using data from three languages with different gender systems (French, German and Russian). Formal features are a significantly better predictor of gender than semantic ones, and the difference in prediction accuracy is very large. Overall, formal features are also significantly better than the combination of form and semantics, but the difference is very small and the results for this comparison are not entirely consistent across languages.
- Classifiers in Morphology. Marcin Kilarski and Marc Allassonnière-Tang. In Oxford Research Encyclopedia of Linguistics 2021
Classifiers are partly grammaticalized systems of classification of nominal referents. The choice of a classifier can be based on such criteria as animacy, sex, material, and function as well as physical properties such as shape, size, and consistency. Such meanings are expressed by free or bound morphemes in a variety of morphosyntactic contexts, on the basis of which particular subtypes of classifiers are distinguished. These include the most well-known numeral classifiers which occur with numerals or quantifiers, as in Mandarin Chinese yí liàng chē (one clf.vehicle car) ‘one car’. The other types of classifiers are found in contexts other than quantification (noun classifiers), in possessive constructions (possessive classifiers), in verbs (verbal classifiers), as well as with deictics (deictic classifiers) and in locative phrases (locative classifiers). Classifiers are found in languages of diverse typological profiles, ranging from the analytic languages of Southeast Asia and Oceania to the polysynthetic languages of the Americas. Classifiers are also found in other modalities (i.e., sign languages and writing systems).
- Classifiers in Southeast Asian languages. Alice Vittrant and Marc Allassonnière-Tang. In The Languages and Linguistics of Mainland Southeast Asia 2021
Classifiers are one of the types of nominal classification systems that help speakers to identify discourse referents. They are commonly found in Southeast Asian languages, which motivates the geographical focus of this chapter. Given the semantic as well as the morphosyntactic overlap between the various systems, classifier devices are first presented in the context of all systems of nominal classification. Then, the analysis focuses on the different constructional subtypes of classifiers and discusses their origin along with how they are used by speakers in discourse.
- The Effect of Word Frequency and Position-in-Utterance in Mandarin Speech Errors: A Connectionist Model of Speech Production. I-Ping Wan and Marc Allassonnière-Tang. In Chinese Lexical Semantics 2021
The connectionist model of speech processing infers that word frequency and position-in-utterance play a major role in the occurrence of speech errors. First, words that are not frequently used are more likely to result in speech errors since they generally receive less activation than frequently occurring words and require more activation to be chosen. Second, speech errors are more likely to occur near the end of utterances since, according to the given-before-new-principle, utterance-final words convey new information that has not yet been activated in the preceding context. The information of word frequency and position-in-utterance is extracted automatically from 382 utterances of a Mandarin speech error corpus and fed to generalized linear mixed models and a decision-tree based classifier. The results show that word frequency and position-in-utterance can predict the occurrence of speech errors with a performance over (but close to) the majority baseline. Therefore, additional information is required to improve the accuracy in the predictions.
- Keyword Spotting: A quick-and-dirty method for extracting typological features of language from grammatical descriptions. Harald Hammarström, One-Soon Her, and Marc Allassonnière-Tang. In Proceedings of the Swedish Language Technology Conference 2021
Starting from a large collection of digitized raw-text descriptions of languages of the world, we address the problem of extracting information of interest to linguists from these. We describe a general technique to extract properties of the described languages associated with a specific term. The technique is simple to implement, simple to explain, requires no training data or annotation, and requires no manual tuning of thresholds. The results are evaluated on a large gold standard database on classifiers with accuracy results that match or supersede human inter-coder agreement on similar tasks. Although accuracy is competitive, the method may still be enhanced by a more rigorous probabilistic background theory and usage of extant NLP tools for morphological variants, collocations and vector-space semantics.
2020
- The evolutionary trends of grammatical gender in Indo-Aryan languages. Marc Allassonnière-Tang and Michael Dunn. Language Dynamics and Change 2020
This paper infers the processes of development and change of grammatical gender in Indo-Aryan languages using phylogenetic comparative methods. 48 Indo-Aryan languages are coded based on 44 presence-absence features relating to gender marking on the verbs, adjectives, personal pronouns, demonstrative pronouns, and possessive pronouns. A Bayesian Reverse Jump Hyper Prior analysis, which infers the evolutionary dynamics of changes between feature values, gives results that are consistent with historical linguistic and typological studies on gender systems in Indo-Aryan languages and predicts the evolutionary trends of the features included in the dataset.
- A Statistical Explanation of the Distribution of Sortal Classifiers in Languages of the World via Computational Classifiers. One-Soon Her and Marc Tang. Journal of Quantitative Linguistics 2020
Previous studies demonstrate that morphosyntactic plural markers and the structure of numeral systems have individually strong predictive power with regard to the usage of sortal classifiers in languages. We use these two factors as explanatory variables to train the computational classifier of random forests and evaluate the accuracy of their predictive power when selecting the existence/absence of sortal classifiers as response variable. Our results show that these two factors result in an excellent discrimination performance of random forests, even when taking into account sortal classifiers as an areal feature. However, the correlation between morphosyntactic plural markers and multiplicative bases is weaker than the correlation between sortal classifiers and plural markers plus multiplicative bases. We are thus able to provide novel insights with regard to probabilistic universals on sortal classifiers, and suggest an innovative cross-disciplinary approach to test the effect of implicational universals with computational methods.
- Functions of gender and numeral classifiers in NepaliMarc Allassonnière-Tang, and Marcin KilarskiPoznan Studies in Contemporary Linguistics 2020
We examine the complex nominal classification system in Nepali (Indo-European, Indic), a language spoken at the intersection of the Indo-European and Sino-Tibetan language families, which are usually associated with prototypical examples of grammatical gender and numeral classifiers, respectively. In a typologically rare pattern, Nepali possesses two gender systems based on the human/non-human and masculine/feminine oppositions, in addition to which it has also developed an inventory of at least ten numeral classifiers as a result of contact with neighbouring Sino-Tibetan languages. Based on an analysis of the lexical and discourse functions of the three systems, we show that their functional contribution involves a largely complementary distribution of workload with respect to individual functions as well as the type of categorized nouns and referents. The study thus contributes to the ongoing discussions concerning the typology and functions of nominal classification as well as the effects of long-term language contact on language structure.
- A simple introduction to programming and statistics with decision trees in RMarc TangTeaching Statistics 2020
University students in other disciplines without prior knowledge of statistics and/or programming languages are introduced to the statistical method of decision trees in the programming language R during a 45-minute teaching and practice session. Statistics and programming skills are now frequently required within a wide variety of research fields and private industries. However, students unfamiliar with these subjects may be reluctant to join a full course because of time constraints, workloads, other commitments, or a belief that it is not for them. The proposed session is short and can be used as an ice-breaker to give students a basic understanding of running statistical models in a programming language.
- Numeral base, numeral classifier, and noun: Word order harmonizationMarc Allassonnière-Tang, and One-Soon HerLanguage and Linguistics 2020
Greenberg (1990a: 292) suggests that classifiers (clf) and numeral bases tend to harmonize in word order, i.e. a numeral (Num) with a base-final [n base] order appears in a clf-final [Num clf] order, e.g. in Mandarin Chinese, san1-bai3 (three hundred) ‘300’ and san1 zhi1 gou3 (three clf animal dog) ‘three dogs’, and a base-initial [base n] Num appears in a clf-initial [clf Num] order, e.g. in Kilivila (Eastern Malayo-Polynesian, Oceanic), akatu-tolu (hundred three) ‘300’ and na-tolu yena (clf animal-three fish) ‘three fish’. In non-classifier languages, base and noun (N) tend to harmonize in word order. We propose that harmonization between clf and N should also obtain. A detailed statistical analysis of a geographically and phylogenetically weighted set of 400 languages shows that the harmonization of word order between numeral bases, classifiers, and nouns is statistically highly significant, as only 8.25% (33/400) of the languages display violations, which are mostly located at the meeting points between head-final and head-initial languages, indicating that language contact is the main factor in the violations to the probabilistic universals.
- Sociocultural gender in nominal classification: A study of grammatical genderMarc Allassonnière-Tang, and Hiram RingIndian Linguistics 2020
We analyse how sociocultural gender can be reflected through grammatical gender and select Hindi (Indo-European) and Pnar (Austroasiatic) as case studies. We demonstrate that these grammatical gender systems share universal tendencies based on human cognition, i.e. long, thin, and vertical objects are associated with masculine grammatical gender, whereas round, flat, and horizontal ones are associated with feminine grammatical gender. We also show that these grammatical gender systems reflect the sociocultural values of the language speakers. Speakers of Hindi maintain a patrilineal kinship system, and in their language objects of large size are generally assigned to the masculine gender. Pnar kinship is matrilineal, and in the language large-sized objects tend to be associated with feminine gender. Similar asymmetries are observed with regard to generic gender and gender reversal. These results contribute to our understanding of the impact of universal cognitive principles and culture on grammatical structures by showing that both tendencies are not necessarily complementary and that they can co-exist in the same language.
- Cross-lingual Embeddings Reveal Universal and Lineage-Specific Patterns in Grammatical Gender AssignmentHartger Veeman, Marc Allassonnière-Tang, Aleksandrs Berdicevskis, and 1 more authorIn Proceedings of the 24th Conference on Computational Natural Language Learning 2020
Grammatical gender is assigned to nouns differently in different languages. Are all factors that influence gender assignment idiosyncratic to languages or are there any that are universal? Using cross-lingual aligned word embeddings, we perform two experiments to address these questions about language typology and human cognition. In both experiments, we predict the gender of nouns in language X using a classifier trained on the nouns of language Y, and take the classifier’s accuracy as a measure of transferability of gender systems. First, we show that for 22 Indo-European languages the transferability decreases as the phylogenetic distance increases. This correlation supports the claim that some gender assignment factors are idiosyncratic, and as the languages diverge, the proportion of shared inherited idiosyncrasies diminishes. Second, we show that when the classifier is trained on two Afro-Asiatic languages and tested on the same 22 Indo-European languages (or vice versa), its performance is still significantly above the chance baseline, thus showing that universal factors exist and, moreover, can be captured by word embeddings. When the classifier is tested across families and on inanimate nouns only, the performance is still above baseline, indicating that the universal factors are not limited to biological sex.
2019
- Word order of numeral classifiers and numeral basesOne-Soon Her, Marc Tang, and Bing-Tsiong LiSTUF - Language Typology and Universals 2019
In a numeral classifier language, a sortal classifier (C) or a mensural classifier (M) is needed when a noun is quantified by a numeral (Num). Num and C/M are adjacent cross-linguistically, either in a [Num C/M] order or [C/M Num]. Likewise, in a complex numeral with a multiplicative composition, the base may follow the multiplier as in [n×base], e.g., san-bai ‘three hundred’ in Mandarin. However, the base may also precede the multiplier in some languages, thus [base×n]. Interestingly, base and C/M seem to harmonize in word order, i.e., [n×base] numerals appear with a [Num C/M] alignment, and [base×n] numerals, with [C/M Num]. This paper follows up on the explanation of the base-C/M harmonization based on the multiplicative theory of classifiers and verifies it empirically within six language groups in the world’s foremost hotbed of classifier languages: Sinitic, Miao-Yao, Austro-Asiatic, Tai-Kadai, Tibeto-Burman, and Indo-Aryan. Our survey further reveals two interesting facts: base-initial ([base×n]) and C/M-initial ([C/M Num]) orders exist only in Tibeto-Burman (TB) within our dataset. Moreover, the few scarce violations to the base-C/M harmonization are also all in TB and are mostly languages having maintained their original base-initial numerals but borrowed from their base-final and C/M-final neighbors. We thus offer an explanation based on Proto-TB’s base-initial numerals and language contact with neighboring base-final, C/M-final languages.
- Insights on the Greenberg-Sanches-Slobin generalization: Quantitative typological data on classifiers and plural markersMarc Tang, and One-Soon HerFolia Linguistica 2019
This paper offers quantitative typological data to investigate a revised version of the Greenberg-Sanches-Slobin generalization (GSSG), which states that (a) a language is unlikely to have both sortal classifiers and morphosyntactic plural markers, and (b) if a language does have both, then their use is in complementary distribution. Morphosyntactic plurals engage in grammatical agreement outside the noun phrase, while morphosemantic plurals that relate to collective and associative marking do not. A database of 400 phylogenetically and geographically weighted languages was created to test this generalization. The statistical test of conditional inference trees was applied to investigate the effect of areal, phylogenetic, and linguistic factors on the distribution of classifiers and morphosyntactic plural markers. The results show that the presence of classifiers is affected by areal factors as most classifier languages are concentrated in Asia. Yet, the low ratio of languages with both features simultaneously is still statistically significant. Part (a) of the GSSG can thus be seen as a statistical universal. We then look into the few languages that do have both features and tentatively conclude that part (b) also seems to hold but further investigation into some of these languages is needed.
- A typology of classifiers and gender: From description to computationMarc TangIn Acta Universitatis Upsaliensis 2019
Categorization is one of the most relevant tasks realized by humans during their life, as we consistently need to categorize the things and experiences that we encounter. This need is reflected in language via various mechanisms, the most prominent being nominal classification systems (e.g., grammatical gender such as the masculine/feminine distinction in French). Typological methods are used to investigate the underlying functions and structures of such systems, using a wide variety of cross-linguistic data to examine universality and variability. This analysis is itself a classification task, as languages are categorized and clustered according to their grammatical features. This thesis provides a cross-linguistic typological analysis of nominal classification systems and in parallel compares a number of quantitative methods that can be applied at different scales. First, this thesis provides an analysis of nominal classification systems (i.e., gender and classifiers) via the description of three languages with respectively gender, classifiers, and both. While the analyses of the first two languages are more descriptive in nature and align with findings in the existing literature, the third language provides novel insights into the typology of nominal classification systems by demonstrating how classifiers and gender may co-occur in one language in terms of distribution of functions. Second, the underlying logic of nominal classification systems is commonly considered difficult to investigate, e.g., is there a consistent logic behind gender assignment in language? Is it possible to explain the distribution of classifier languages of the world while taking into account geographical and genealogical effects?
This thesis addresses the lack of arbitrariness of nominal classification systems at three different scales: The distribution of classifiers at the worldwide level, the presence of gender within a language family, and gender assignment at the language-internal level. The methods of random forests, phylogenetics, and word embeddings with neural networks are selected since they are respectively applicable at three different scales of research questions (worldwide, family-internal, language-internal).
- Linguistic Information in Word EmbeddingsAli Basirat, and Marc TangIn Agents and Artificial Intelligence 2019
We study the presence of linguistically motivated information in word embeddings generated with statistical methods. The nominal aspects of uter/neuter, common/proper, and count/mass in Swedish are selected to represent respectively grammatical, semantic, and mixed types of nominal categories within languages. Our results indicate that typical grammatical and semantic features are easily captured by word embeddings. The classification of semantic features required significantly fewer neurons than that of grammatical features in our experiments based on a single-layer feed-forward neural network. However, semantic features also generated higher entropy in the classification output despite their high accuracy. Furthermore, the count/mass distinction posed difficulties for the model, even though the quantity of neurons was almost tuned to its maximum.
- Predicting Speech Errors in Mandarin Based on Word FrequencyMarc Tang, and I-Ping WanIn From Minimal Contrast to Meaning Construct 2019
This paper investigates the effect of word frequency on the occurrence of speech errors in Mandarin. A corpus of 390 speech errors along with their surrounding linguistic context was gathered. The information of word frequency was extracted from the Academia Sinica Corpus. Our analysis with a computational classifier based on conditional inference trees shows that intended words having a frequency lower than words of the surrounding context are more likely to generate speech errors.
2018
- The lexical and discourse functions of grammatical gender in MarathiPär Eliasson, and Marc TangJournal of South Asian Languages and Linguistics 2018
We provide a functional analysis of the grammatical gender system of Marathi (Indo-Aryan) in Western India. The majority of the new Indo-Aryan languages typically classify each noun of the lexicon according to biological gender as masculine or feminine. Only a few Indo-Aryan languages such as Marathi diverge in terms of agreement pattern by categorizing nouns as masculine, feminine, and neuter. Yet gender in Marathi has not been extensively described in terms of functions. We thus apply functional typology to analyze grammatical gender in Marathi and provide detailed examples of its lexical and discourse functions.
- Lexical and morpho-syntactic features in word embeddings: A case study of nouns in SwedishAli Basirat, and Marc TangIn Proceedings of the 10th International Conference on Agents and Artificial Intelligence 2018
We apply real-valued word vectors combined with two different types of classifiers (linear discriminant analysis and feed-forward neural network) to scrutinize whether basic nominal categories can be captured by simple word embedding models. We also provide a linguistic analysis of the errors generated by the classifiers. The targeted language is Swedish, in which we investigate three nominal aspects: uter/neuter, common/proper, and count/mass. They represent respectively grammatical, semantic, and mixed types of nominal classification within languages. Our results show that word embeddings can capture typical grammatical and semantic features such as uter/neuter and common/proper nouns. Nevertheless, the model encounters difficulties to identify classes such as count/mass which not only combine both grammatical and semantic properties, but are also subject to conversion and shift. Hence, we answer the call of the Special Session on Natural Language Processing in Artificial Intelligence by approaching the topic of interfaces between morphology, lexicon, semantics, and syntax via interdisciplinary methods combining machine learning of language and general linguistics.
- The coalescence of grammatical gender and numeral classifiers in the general classifier wota in NepaliMarcin Kilarski, and Marc TangIn Proceedings of the Linguistic Society of America 2018
While nominal classification has received considerable attention, relatively little is known about cross-linguistically rare complex systems. An example is provided by Nepali (Indo-European, Indic), which possesses both grammatical gender and numeral classifiers. Our aim is to examine morphosyntactic and functional properties of the general classifier wota. Unusually, the classifier exhibits gender agreement both in its independent forms and as fused with a numeral, raising questions about its lexical and pragmatic functions. Our study contributes to the typology of nominal classification by proposing a functional approach to cases of complex co-occurrence of gender and classifiers.
2017
- Explaining the acquisition order of classifiers and measure words via their mathematical complexityMarc TangJournal of Child Language Acquisition and Development 2017
We provide a theoretical explanation for the acquisition of numeral classifiers (sortal classifiers) and measure words (mensural classifiers) in Mandarin Chinese. Previous research in various languages separately observed that the general classifier is acquired before specific classifiers and that classifiers are acquired prior to measure words. However, no theoretical discussion was fully developed and no study combined the general classifier, specific classifiers, and measure words in one dataset. We propose to fill these gaps by combining semantic complexity (Brown, 1973) and a mathematical approach (Her, 2012): given that the relative complexity of x, y and z is unknown, x + y is more complex than either x or y, and x + y + z is more complex than any of them. By applying the mathematical approach, it is observed that the general classifier carries the mathematical value of times one, noted x, while specific classifiers possess x plus a semantic value y, which highlights an inherent feature of the referent. Finally, measure words carry both x and y, along with additional quantity information z. Therefore, the acquisition order is expected to start from the simplest semanticity and develop toward the most complex, i.e. general classifier (x) > specific classifier (x + y) > measure word (x + y + z). As supporting evidence, we gathered longitudinal data from CHILDES (Child Language Data Exchange System; Zhou, 2008). The participants included 110 children aged 1 to 6 years, providing a total of 110 conversations of 20 minutes each with 1851 tokens of numeral classifiers and measure words. Our methodology applied the definition of acquisition from Brown (1973) and the equation of Suppliance in Obligatory Context (SOC) cross-checked with Target-Like Usage (TLU) from Pica (1983). The results demonstrated that our model generated correct predictions, serving as a theoretical basis for future studies in the field of language acquisition.