publications | Marc Allassonnière-Tang

To make my publications easier to identify in the literature, I have combined my two parents' surnames (Allassonnière and Tang) since 2020. In total, since 2017, my publications comprise 53 articles, 9 book chapters, 2 books, and 9 papers in conference proceedings.

2026

GRAF – Gendered reference analysis in French

Magdalena Lemus-Serrano, Marine Cozzolino, Tessa Vermeir, and 2 more authors

Journal of Open Humanities Data 2026

Grammatical gender is only found in about 20% of the world’s languages, including French, which marks masculine and feminine distinctions through determiners and agreement of adjectives. In French, the masculine form is traditionally used as a “generic” to describe mixed-gender groups, yet research shows that these masculine generics strongly bias mental representations toward men from childhood onward and contribute to the invisibility of women in social contexts. In response, several inclusive or gender-neutral strategies have emerged, such as using epicene nouns, feminization through double forms, or newly created gender forms (e.g., ‘æ’, ‘ë’, capital letters, or the ‘-i’ ending). Critics argue that these innovations are difficult to learn and grammatically complex, but recent work suggests that they can be acquired quickly. However, we still lack quantitative data on how often gender-marked words referring to humans appear in real French usage. The current database addresses this gap by analyzing newspaper articles and speech data to estimate the scale of change required.
@article{serrano_graf_2026, title = {{GRAF} – {G}endered reference analysis in {F}rench}, doi = {10.5334/johd.510}, journal = {Journal of Open Humanities Data}, author = {Lemus-Serrano, Magdalena and Cozzolino, Marine and Vermeir, Tessa and Josserand, Mathilde and Allassonnière-Tang, Marc}, year = {2026}, volume = {12}, number = {59}, pages = {1-9}, }
SEB-LEK (Sebitoli Ecological Biodiversity – Local Ecological Knowledge): Dataset on a Research Project Staff Knowledge of Local Taxa and Encounter Frequencies in Kibale National Park, Uganda

Gabriel Dubus, Hugo Magaldi, Raymond Katumba, and 4 more authors

Journal of Open Humanities Data 2026

SEB-LEK documents local ecological knowledge of forest vertebrates among research and conservation staff working in Sebitoli, Kibale National Park in Uganda. Using a two-step group survey, participants identified vertebrate taxa from visual and auditory cues and reported ordinal encounter frequencies for seeing and hearing each taxon. Responses were collected in English and the local language, Rutooro. The dataset comprises identification responses and encounter-frequency ratings for 54 taxa. It enables comparative analyses of human knowledge, biodiversity monitoring methods, and ethnozoological research. Beyond its applications in the analysis of biodiversity, the dataset documents vernacular naming practices and patterns of local ecological knowledge in Rutooro, which provides a valuable resource for linguistic, anthropological, and ethnobiological research. It enables studies analyzing how environmental knowledge is structured within a language and its cultural contexts. It also provides data for comparative analyses of human–environment relationships.
@article{Dubus_seblek_2026, title = {SEB-LEK (Sebitoli Ecological Biodiversity – Local Ecological Knowledge): Dataset on a Research Project Staff Knowledge of Local Taxa and Encounter Frequencies in Kibale National Park, Uganda}, doi = {10.5334/johd.543}, journal = {Journal of Open Humanities Data}, author = {Dubus, Gabriel and Magaldi, Hugo and Katumba, Raymond and Rugonge, Harold and Tibesigwa, John Justice and Allassonnière-Tang, Marc and Krief, Sabrina}, year = {2026}, volume = {12}, number = {75}, pages = {1-7}, }
A GIS view of the word orders of numeral bases and numeral classifiers in Kuki-Chin languages

Khawlsonkim Suantak, Marc Allassonnière-Tang, Eugene Chan, and 2 more authors

Digital Scholarship in the Humanities 2026

Tibeto-Burman (TB) languages are known for their diversity in numeral systems and classifiers. This paper investigates the Kuki-Chin (KC, also called South-Central Tibeto-Burman) languages in TB’s Northeast Indian Areal Group and provides a comprehensive description of the four types of numeral systems: base-final, base-initial, base-split, and no-base, and the four types of classifier systems: CL-final, CL-initial, CL-split, and no-CL. A thorough survey of the literature, aided by fieldwork, enables a GIS view of the distribution of the different types of languages. Also, KC languages conform to Greenberg’s Universal 20A: N does not come between Num and CL, and the numeral base and classifier harmonize in word order. We propose the following hypothesis for Proto-KC (PKC) and Proto-TB (PTB) to account for the variation in numeral bases and classifiers in KC, i.e., PKC, like PTB, is base-initial and without numeral classifiers, and the current variation in numeral bases and numeral classifiers in KC is due to horizontal external influence via language contact. Bayesian phylogenetic inference tests, however, show mixed results, as PKC is likely to be base-initial, as expected, but CL-initial, and PTB is likely to be base-final but CL-initial. Thus, we plan to conduct a comprehensive survey of all TB languages and further explore the state of PTB in terms of numeral bases and classifiers.
@article{suantak_gis_2026, title = {A {GIS} view of the word orders of numeral bases and numeral classifiers in {Kuki}-{Chin} languages}, doi = {10.1093/llc/fqaf133}, journal = {Digital Scholarship in the Humanities}, author = {Suantak, Khawlsonkim and Allassonnière-Tang, Marc and Chan, Eugene and Hsu, Anthony and Her, One-Soon}, year = {2026}, volume = {41}, number = {1}, pages = {420-433}, }
Lexical, pronominal and zero argument encoding in Movima

Katharina Haude, and Marc Allassonnière-Tang

In Topicality and the shaping of grammar: New perspectives from lesser-studied languages 2026

Past research on the cross-linguistic discourse conditions for the lexical and nonlexical expression of arguments has shown that semantic role and animacy both play an important role. Less attention has been paid so far to the choice between different nonlexical expressions, in particular unstressed pronouns and zero. This choice is possible in Movima (isolate, Bolivia), where the single argument of a basic intransitive clause and one of the two arguments of a basic transitive clause can remain unexpressed. Based on data from spontaneous oral discourse, the present study investigates the lexical, pronominal, and zero expression of S of the intransitive and P of the ergative transitive clause and shows that S and P do not display the same behaviour in discourse: overall, P is less often expressed by a pronoun than S, and inanimate referents favour zero rather than pronominal expression. Only an animate S is more often encoded as a pronoun than as zero. It is argued that this exceptional behaviour of animate S arguments reflects their affinity with the A argument of the ergative transitive clause, which typically encodes an animate and topical referent, is obligatorily overtly expressed, and is typically expressed by a pronoun.
@incollection{haude_argument_2026, author = {Haude, Katharina and Allassonnière-Tang, Marc}, editor = {Palancar, Enrique and Chamoreau, Claudine and Donabédian, Anaid}, title = {Lexical, pronominal and zero argument encoding in Movima}, booktitle = {Topicality and the shaping of grammar: New perspectives from lesser-studied languages}, year = {2026}, publisher = {John Benjamins}, address = {Amsterdam}, pages = {99-129}, isbn = {978-981-97-0586-3}, doi = {10.1075/tsl.137.03hau}, }

2025

The evolution of gender and number agreement in the noun phrase

Olena Shcherbakova, Marc Allassonnière-Tang, and Francesca Di Garbo

Linguistic Typology 2025

We test the dependency relations between gender and number posited by Greenberg’s Universal 36 by focusing on patterns of gender and number agreement in the noun phrase from an evolutionary perspective. To do so, we use data from Grambank, the largest existing database of morphosyntactic structures in the world’s languages. Based on data from 1,608 languages worldwide, we use a Reversible Jump Markov Chain Monte Carlo method to investigate the order of emergence of gender and number marking on adjectives, demonstratives and (for a smaller dataset) articles. Globally, our findings support Greenberg’s idea that gender marking hinges on number marking. In addition, they show that both adjectives and demonstratives play a special role in the development of noun-phrase internal agreement, in that under different evolutionary scenarios the occurrence of gender and number agreement on adjectives or demonstratives may favor the spreading of agreement to other target types. We compare these results with family-specific patterns of language change and further discuss their relevance to the general understanding of nominal morphosyntax.
@article{ShcherbakovaAllassonnièreTangDiGarbo_2025, url = {https://doi.org/10.1515/lingty-2024-0072}, title = {The evolution of gender and number agreement in the noun phrase}, author = {Shcherbakova, Olena and Allassonnière-Tang, Marc and Di Garbo, Francesca}, journal = {Linguistic Typology}, doi = {10.1515/lingty-2024-0072}, year = {2025}, volume = {30}, number = {1}, pages = {33-52}, lastchecked = {2025-08-19}, }
Internal order and areal patterns in South Asian numerals

Mamta Kumari, Ezequiel Koile, and Marc Allassonnière-Tang

STUF-Language Typology and Universals 2025

The linguistic diversity present in South Asia stems from its historical mix of cultures, as well as its heterogeneous topography. This diversity can be observed in numeral systems, which play an important role in cognitive and cultural history. We provide the first typological overview of numeral systems in South Asia, presenting a database of 122 languages (mostly Austroasiatic, Dravidian, Indo-Aryan, and Tibeto-Burman), with a majority of the data from original fieldwork. We also provide a framework for analyzing numeral systems based on the internal ordering of their complex numerals (teens, crowns, running numbers, hundreds, and thousands). Quantitative analyses based on decision trees and phylogenetic regressions suggest that the internal ordering of complex numerals is generally stable within each language family over time, although horizontal transmission due to historical contact is observed as well. Our study also has a societal impact: the diversity of numeral systems featured here is quickly disappearing, and we claim that it is of the utmost importance to document and preserve it.
@article{kumari_numerals_2025, title = {Internal order and areal patterns in South Asian numerals}, doi = {10.1515/stuf-2025-2021}, journal = {STUF-Language Typology and Universals}, author = {Kumari, Mamta and Koile, Ezequiel and Allassonnière-Tang, Marc}, year = {2025}, volume = {78}, number = {4}, pages = {699-724}, }
The phonology of letter shapes: Feature economy and informativeness in 43 writing systems

Yoolim Kim, Marc Allassonnière-Tang, Helena Miton, and 1 more author

Journal of Memory and Language 2025

Differentiating letter shapes accurately is a core competence for any reader. Are letter shapes as distinctive as they could be? The visual shapes of letters, contrary to the phonemes of spoken languages, lack a unified description, that is, an equivalent of the phonological features that describe most phonemes in the world’s languages. Using a gamified crowdsourcing approach, we elicited thousands of letter descriptions from lay people for the sets of letter shapes (the scripts) used in 43 diverse writing systems. Using 19,591 letter classifications, contributed by 1,683 participants, who were asked to sort the letters of each script repeatedly into two groups, we extracted a sufficient number of binary classifications (features) to provide a unique description for all letters in the 43 scripts. We show that scripts, compared to phoneme inventories, use more features to produce similar sets of distinct elements. Compared to the phoneme inventories of a large sample of the world’s languages (the P-base dataset, collected by another team), our 43 scripts have lower feature economy (fewer symbols for a given number of features) and lower feature informativeness (a less balanced distribution of feature values). Compared to phonemes, letter shapes require more binary features for a complete description. These features are also less informative in letters than in phonemes: the chances that two random letters in a script differ on any given feature are low. Letter shapes, which have more degrees of freedom than speech sounds, use those degrees of freedom less efficiently.
@article{kim_phonology_2025, title = {The phonology of letter shapes: {Feature} economy and informativeness in 43 writing systems}, volume = {142}, issn = {0749-596X}, shorttitle = {The phonology of letter shapes}, url = {https://linkinghub.elsevier.com/retrieve/pii/S0749596X25000130}, doi = {10.1016/j.jml.2025.104620}, language = {en}, urldate = {2025-02-03}, journal = {Journal of Memory and Language}, author = {Kim, Yoolim and Allassonnière-Tang, Marc and Miton, Helena and Morin, Olivier}, month = apr, year = {2025}, pages = {104620}, }
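As a rough illustration of the two notions named in the abstract above, the Python sketch below computes a simple economy ratio (symbols per feature) and, for one feature, the chance that two distinct symbols differ on it. The toy inventory, the function names, and the exact formulas are assumptions made for illustration only; the measures used in the paper may be defined differently.

```python
from itertools import combinations

# Hypothetical toy inventory: four symbols, each described by three binary features.
INVENTORY = {
    "A": (1, 0, 0),
    "B": (0, 1, 0),
    "C": (1, 1, 0),
    "D": (0, 0, 0),
}


def economy(inventory):
    """Symbols per feature: a higher value means fewer features do more work."""
    n_features = len(next(iter(inventory.values())))
    return len(inventory) / n_features


def differ_probability(inventory, feature):
    """Share of distinct symbol pairs whose values differ on the given feature."""
    pairs = list(combinations(inventory.values(), 2))
    differing = sum(1 for x, y in pairs if x[feature] != y[feature])
    return differing / len(pairs)


def is_unique_description(inventory):
    """True if no two symbols share the same feature vector."""
    return len(set(inventory.values())) == len(inventory)


if __name__ == "__main__":
    print("economy:", economy(INVENTORY))
    for f in range(3):
        print("feature", f, "differ probability:", differ_probability(INVENTORY, f))
    print("unique:", is_unique_description(INVENTORY))
```

In this toy case the third feature never varies, so it contributes nothing to telling symbols apart. This is the kind of unbalanced feature the abstract describes as less informative.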
Gendered and Gender-Neutral Naming Practices in Mandarin Chinese: Metaphoric and Non-Metaphoric Imagery Over Time in Taiwan

Pei-Ci Li, and Marc Allassonnière-Tang

Sociolinguistic Studies 2025

This study investigates gender-specific names, as well as gender-neutral names in Mandarin Chinese, and explores their naming strategies from both synchronic and diachronic perspectives. Synchronically, we examined the 100 most frequently used names for girls, boys, and gender-neutral names, using data from the full population of Taiwan (Ministry of the Interior of Taiwan, 2018). We analyzed the frequency of each character in accordance with metaphoric and nonmetaphoric qualities. Diachronically, we reviewed the most frequently used gender-specific names from the past 100 years in 10-year periods and then analyzed how they change over time. Both diachronically and synchronically, a higher diversity in boys’ names is observed. Such diversity aligns with traditional perspectives that assign greater social and public expectations to men. Over time, there appears to be a decrease in gender stereotypes, as reflected in naming practices. The most commonly used gender-neutral names often feature characters with less gendered connotations, which may involve the incorporation of functional words or metaphoric usage not conventionally associated with either gender.
@article{li_gendered_2025, title = {Gendered and {Gender}-{Neutral} {Naming} {Practices} in {Mandarin} {Chinese}: {Metaphoric} and {Non}-{Metaphoric} {Imagery} {Over} {Time} in {Taiwan}}, volume = {19}, issn = {1750-8649, 1750-8657}, shorttitle = {Gendered and {Gender}-{Neutral} {Naming} {Practices} in {Mandarin} {Chinese}}, url = {https://utppublishing.com/doi/10.3138/SS-2024-0022}, doi = {10.3138/SS-2024-0022}, language = {en}, number = {1-2}, urldate = {2025-06-07}, journal = {Sociolinguistic Studies}, author = {Li, Pei-Ci and Allassonnière-Tang, Marc}, month = apr, year = {2025}, pages = {128--152}, }
LiLA : Outil d’augmentation automatisée des données vocales participatives de Lingua Libre

Mathilde Hutin, Marc Allassonnière-Tang, Lucas Prégaldiny, and 1 more author

In Actes de CORIA-TALN-RJCRI-RECITAL 2025. Actes de l’atelier Science Participative pour les Données et Corpus Linguistiques 2025 (ParCol) 2025

Building speech corpora, which are needed to explore the phonetics and phonology of the world’s languages, raises many challenges. Building multi-dialect corpora, which make it possible to explore dialectal variation, or multilingual corpora, which make it possible to compare several languages, is all the more difficult because, for each dialect or language to be comparable with the others in the corpus, the data must have been recorded under the same conditions (same equipment, same protocol, ...). A solution to these challenges now seems within reach thanks to participatory data, which are by definition administered and recorded by volunteers and are therefore less costly in every respect for the scientific community. As of March 2025, Lingua Libre, the participatory linguistic media library of Wikimédia France, open since 2018, contains about 1.4 million recordings in 284 languages by 2,547 individuals around the world. Our project is to create a tool that makes these raw data usable by linguists.
@inproceedings{Hutin-Allassonniere-Tang-Pregaldiny-Leveque:CORIA-TALN:2025, author = {Hutin, Mathilde and Allassonni\`ere-Tang, Marc and Pr\'egaldiny, Lucas and L\'ev\^eque, Lucas}, title = {LiLA : Outil d'augmentation automatis\'ee des donn\'ees vocales participatives de Lingua Libre}, booktitle = {Actes de CORIA-TALN-RJCRI-RECITAL 2025. Actes de l'atelier Science Participative pour les Donn\'ees et Corpus Linguistiques 2025 (ParCol)}, month = jun, year = {2025}, address = {Marseille, France}, publisher = {Association pour le Traitement Automatique des Langues}, pages = {6-10}, note = {}, keywords = {Lingua Libre, Wikimedia, Donn\'ees participatives, Phon\'etique, Phonologie, Typologie}, }

2024

Evolutionary pathways of complexity in gender systems

Olena Shcherbakova, and Marc Allassonnière-Tang

Journal of Language Evolution 2024

Humans categorize the experience they encounter in various ways, which is mirrored, for instance, in grammatical gender systems of languages. In such systems, nouns are grouped based on whether they refer to masculine/feminine beings, (non-)humans, (in)animate entities, or objects with specific shapes. Languages differ greatly in how many gender assignment rules are incorporated in gender systems and how many word classes carry gender marking (gender agreement patterns). It has been suggested that these two dimensions are positively associated as numerous assignment rules are better sustained by numerous agreement patterns. We test this claim by analyzing the correlated evolution (Continuous method in BayesTraits) and making causal inferences about the relationships (phylogenetic path analysis) between these 2 dimensions in 482 languages from the global Grambank database. By applying these methods to linguistic data matched to phylogenetic trees (a world tree and individual families), we evaluate whether various types of gender assignment rules (semantic, phonological, and unpredictable) are causally linked to more gender agreement patterns on the global level and in individual language families. Our results on the world language tree suggest that semantic rules are weakly positively correlated with gender agreement and that the development of agreement patterns is facilitated by different rules in individual families. For example, in Indo-European languages, more agreement patterns are caused by the presence of phonological and unpredictable rules, while in Bantu languages, the driving force of agreement patterns is the variety of semantic rules. Our study shows that the relationships between agreement and rules are family-specific and lends support to the idea that more distinct rules and/or rule types might be more robust in languages with more pervasive gender agreement.
@article{gender_complexity_2024, author = {Shcherbakova, Olena and Allassonnière-Tang, Marc}, title = {{Evolutionary pathways of complexity in gender systems}}, journal = {Journal of Language Evolution}, volume = {8}, number = {2}, pages = {120-133}, year = {2024}, month = mar, issn = {2058-458X}, doi = {10.1093/jole/lzae001}, url = {https://doi.org/10.1093/jole/lzae001}, }
On the distribution and origin of sortal classifiers in Altaic languages

Shen-An Chen, Marc Allassonnière-Tang, Yung-Ping Liang, and 1 more author

Journal of Chinese Linguistics 2024

The grammatical feature of sortal classifiers, common in East and Southeast Asian languages, is also found in 15 of the 65 Altaic languages we have examined, though the classifiers are far fewer and used optionally. These observations suggest that the Altaic classifier systems are not indigenous. Based on the Single Origin Hypothesis that Chinese is the only language with an indigenous classifier system in Eurasia, we propose that the rise of classifiers in Altaic is due to the influence of neighboring classifier languages. Having first confirmed that the putative classifiers in these 15 languages are genuine classifiers, we then examine the phonological and semantic characteristics of the classifiers identified in each language and detect influence from either Chinese or Persian. Taking historical and geographical factors into consideration, we suggest that classifier languages east of Uyghur were influenced by Chinese, while those to the west were influenced by Persian; Uyghur itself was influenced by both. Assuming that Persian classifiers are not indigenous either, these findings suggest that the Single Origin Hypothesis is applicable to classifier languages in Altaic.
@article{altaic_classifier_2024, author = {Chen, Shen-An and Allassonnière-Tang, Marc and Liang, Yung-Ping and Her, One-Soon}, title = {{On the distribution and origin of sortal classifiers in Altaic languages}}, journal = {Journal of Chinese Linguistics}, volume = {52}, number = {2}, pages = {456-479}, year = {2024}, month = may, doi = {10.1353/jcl.2024.a929996}, url = {https://doi.org/10.1353/jcl.2024.a929996}, }
The evolutionary dynamics of grammatical gender in Torricelli languages

Jose A Jódar-Sánchez, and Marc Allassonnière-Tang

STUF - Language Typology and Universals 2024

Grammatical gender in New Guinea is an often-neglected area in typological research, even though it is extremely diverse. For example, in New Guinea, some languages have grammatical gender systems with two sex-based categories, more than four gender-indexing targets, and no gender marking on nouns, while some languages have grammatical gender systems with many more categories, which are only marginally sex-based. This paper infers the processes of development and change of grammatical gender in Torricelli languages from two perspectives. First, it synthesizes the available data in the existing literature and hypothesizes the evolutionary pathway of gender systems in Torricelli languages. Nineteen Torricelli languages are selected as a representative coverage of the 55 Torricelli languages listed in Glottolog within the limits of the available documentation. These languages are then coded based on 6 presence-absence features relating to gender marking on verbs, adjectives, nouns, numerals, pronouns, and demonstratives. Second, it conducts an analysis with phylogenetic comparative methods to provide a quantitative assessment of the evolutionary possibilities for gender systems in Torricelli languages. The preliminary results show that gender is likely marked at the root of Torricelli languages, with pronouns and verbs being at the core of the system. This is in agreement with trends reflecting the evolution of gender systems in languages across the world.
@article{gender_torricelli_2024, author = {Jódar-Sánchez, Jose A and Allassonnière-Tang, Marc}, title = {{The evolutionary dynamics of grammatical gender in Torricelli languages}}, journal = {STUF - Language Typology and Universals}, volume = {77}, number = {3}, pages = {353-369}, year = {2024}, month = sep, doi = {10.1515/stuf-2024-2010}, url = {https://doi.org/10.1515/stuf-2024-2010}, eprint = {https://doi.org/10.1515/stuf-2024-2010}, }
How network structure shapes languages: Disentangling the factors driving variation in communicative agents

Mathilde Josserand, Marc Allassonnière-Tang, François Pellegrino, and 2 more authors

Cognitive Science 2024

Languages show substantial variability between their speakers, but it is currently unclear how the structure of the communicative network contributes to the patterning of this variability. While previous studies have highlighted the role of network structure in language change, the specific aspects of network structure that shape language variability remain largely unknown. To address this gap, we developed a Bayesian agent-based model of language evolution, contrasting between two distinct scenarios: language change and language emergence. By isolating the relative effects of specific global network metrics across thousands of simulations, we show that global characteristics of network structure play a critical role in shaping interindividual variation in language, while intraindividual variation is relatively unaffected. We effectively challenge the long-held belief that size and density are the main network structural factors influencing language variation, and show that path length and clustering coefficient are the main factors driving interindividual variation. In particular, we show that variation is more likely to occur in populations where individuals are not well-connected to each other. Additionally, variation is more likely to emerge in populations that are structured in small communities. Our study provides potentially important insights into the theoretical mechanisms underlying language variation.
@article{network_shape_2024, author = {Josserand, Mathilde and Allassonnière-Tang, Marc and Pellegrino, François and Dediu, Dan and de Boer, Bart}, title = {{How network structure shapes languages: Disentangling the factors driving variation in communicative agents}}, journal = {Cognitive Science}, pages = {e13439}, year = {2024}, month = apr, issn = {1551-6709}, doi = {10.1111/cogs.13439}, url = {https://doi.org/10.1111/cogs.13439}, }
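The two network metrics highlighted in the abstract above, average path length and clustering coefficient, can be sketched in plain Python as follows. The toy graphs and function names are hypothetical and only illustrate the standard textbook definitions; they are not taken from the paper's model.

```python
from collections import deque
from itertools import combinations


def average_path_length(graph):
    """Mean shortest-path length over all reachable ordered pairs (BFS per node)."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total += sum(d for n, d in dist.items() if n != source)
        pairs += len(dist) - 1
    return total / pairs


def average_clustering(graph):
    """Mean local clustering: share of a node's neighbour pairs that are linked."""
    coefficients = []
    for neighbours in graph.values():
        k = len(neighbours)
        if k < 2:
            coefficients.append(0.0)
            continue
        links = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
        coefficients.append(2 * links / (k * (k - 1)))
    return sum(coefficients) / len(coefficients)


# Hypothetical toy populations of four agents (undirected, as adjacency sets).
RING = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}
CLIQUE = {n: {m for m in "abcd" if m != n} for n in "abcd"}

if __name__ == "__main__":
    for name, g in (("ring", RING), ("clique", CLIQUE)):
        print(name, average_path_length(g), average_clustering(g))
```

The ring has longer paths and no closed triangles, while the fully connected clique has the shortest possible paths and maximal clustering. Real communicative networks fall between these extremes.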
Early Segmental Production in Thai Preschool Children Learning Mandarin

I-Ping Wan, Marc Allassonnière-Tang, and Pu Yu

International Journal of Asian Language Processing 2024

The research aims to conduct a corpus-based and data-driven analysis of the early-stage Mandarin learning production of 11 Thai preschool children in Bangkok, Thailand, within the interlanguage system. These children consist of 8 boys and 3 girls, with an age range of 4;1-6;5 (M = 5.455, SD = 0.688; total tokens = 36,565). Data were extracted from a spoken corpus constructed between 2018 and 2022, which was time-stamped, phone-aligned, and multi-tiered using Praat [P. Boersma and D. Weenink, Praat: Doing phonetics by computer (2022), http://www.praat.org/]. The data were annotated and labeled through a semi-automatic process employing various applications in Hybrid-DNN-HMM. The findings indicate the following: (1) Most sound deviations in learning do not mirror the phonetic inventory of L1; (2) Sound deviations can be influenced by L2, with marked phones exhibiting more deviations between L1 and L2; (3) Interlanguage manifests as a self-organizing and self-adaptive system. The study delves into the Contrastive Analysis Hypothesis, Markedness Differential Hypothesis and Interlanguage theory. It compares the data with cross-linguistic universal trends in Mandarin acquisition and with a spoken corpus of adult Mandarin speakers. Segmental similarities regarding phonological distances are quantitatively measured through Levenshtein edit distance and Hamming distance based on multivalued distinctive features.
@article{wan_segmental_2024, author = {Wan, I-Ping and Allassonni\`{e}re-Tang, Marc and Yu, Pu}, title = {Early Segmental Production in Thai Preschool Children Learning Mandarin}, journal = {International Journal of Asian Language Processing}, volume = {0}, number = {0}, pages = {2450005}, year = {2024}, doi = {10.1142/S271755452450005X}, url = {https://doi.org/10.1142/S271755452450005X}, }
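The two distance measures mentioned at the end of the abstract above can be sketched in Python as follows. The segment strings and the three-feature table are hypothetical examples, not data from the corpus, and the paper's actual feature set and weighting may differ.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        current = [i]
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def hamming(x: Sequence[str], y: Sequence[str]) -> int:
    """Number of positions at which two equal-length feature vectors differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs vectors of equal length")
    return sum(1 for u, v in zip(x, y) if u != v)


# Hypothetical multivalued feature vectors: (voicing, place, manner).
FEATURES = {
    "t": ("voiceless", "alveolar", "stop"),
    "d": ("voiced", "alveolar", "stop"),
    "s": ("voiceless", "alveolar", "fricative"),
}

if __name__ == "__main__":
    target = ["t", "a", "n"]    # hypothetical target form
    produced = ["d", "a"]       # hypothetical child production
    print(levenshtein(target, produced))           # 2: one substitution, one deletion
    print(hamming(FEATURES["t"], FEATURES["d"]))   # 1: the segments differ in voicing only
```

Edit distance compares whole segment strings, while Hamming distance compares two individual segments feature by feature. This is why the two measures are often used together.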
LA80: A Lexical Database of 10 Bantu A80 Languages

Tessa Vermeir, Marc Allassonnière-Tang, and Guillaume Segerer

Journal of Open Humanities Data 2024

In this paper, we present LA80, a database containing lexical data of 10 Bantu A80 languages (Bekwel, Gyeli, Kol, Koonzime, Kwasio, Makaa, Mpiemo, Njyem, Shiwa and Sso). Data from existing fieldwork datasets have been compiled and formatted. We standardised French translations, corrected spelling mistakes, and merged overlapping data points, resulting in a database with 5,588 concepts. Furthermore, for a subset of 557 concepts available in at least six of the 10 languages, we did additional reformatting by separating prefixes from stems, something that is not done systematically in the source data. The LA80 database can be used for comparative linguistic analyses and diachronic reconstructions.
@article{la80_2024, author = {Vermeir, Tessa and Allassonnière-Tang, Marc and Segerer, Guillaume}, title = {{LA80: A Lexical Database of 10 Bantu A80 Languages}}, journal = {Journal of Open Humanities Data}, volume = {10}, number = {42}, pages = {1-10}, year = {2024}, month = jul, issn = {2059-481X}, doi = {10.5334/johd.218}, url = {https://doi.org/10.5334/johd.218}, eprint = {https://openhumanitiesdata.metajnl.com/articles/10.5334/johd.218}, }
The meaning of morphomes: distributional semantics of Spanish stem alternations

Borja Herce, and Marc Allassonnière-Tang

Linguistics Vanguard 2024

Romance stem alternations have been argued to represent exclusively morphological objects (or “morphomes”) independent from semantic and syntactic categories. This conclusion has been based on feature-value analyses of the inflected forms, and definitions of natural classes that are theoretically driven and about which no consensus exists. Individual examples of morphomes are thus frequently challenged, while their autonomously morphological nature has never been tested quantitatively or experimentally. This is the purpose of the present study. We use context-based embeddings to explore the semantic profile of Spanish verb stem alternations. At the paradigmatic level, our findings suggest that Spanish morphomes’ cells are characterized by significantly above-chance distributional-semantic similarity. At the lexical level, similarly, verbs that show more similar patterns of alternation have also been found to be closer in meaning. Both of these findings suggest that these structures may have an extramorphological function. Using gradient distributional-semantic similarity offers a way to objectively assess the degree of (un)naturalness of a set of forms and meanings, something which has been lacking from most discussions on the structure of features and the architecture of paradigms.
@article{HerceAllassonnièreTang+2024, url = {https://doi.org/10.1515/lingvan-2023-0010}, title = {The meaning of morphomes: distributional semantics of Spanish stem alternations}, author = {Herce, Borja and Allassonnière-Tang, Marc}, journal = {Linguistics Vanguard}, doi = {10.1515/lingvan-2023-0010}, year = {2024}, volume = {10}, number = {1}, pages = {115-128}, }
Vowel alternation with final i offers an easy-to-learn morphological option for a sex-blind grammatical gender in French

Marie-Claude Marsolier, Pris Touraille, and Marc Allassonnière-Tang

Frontiers in Psychology 2024

Like all modern Romance languages, French has a sex-based grammatical gender with two genders, feminine and masculine, and a lexicon that is highly sex-differentiated. These characteristics give rise to a number of issues, including the problematic generic use of the masculine grammatical gender, coupled with the challenge of sex categorization itself, and the epistemological difficulty of an adequate sociological description and analysis of what gender commonsense categories really are about. To remedy these concerns, several authors have proposed the creation of an additional, epicene grammatical gender. We have identified three such systematic proposals, or solutions, which specify various morphological options for new epicene nouns and gender markers on their satellite elements. These options include the use of non-standard or rarely used characters, the merging of feminine and masculine gender markers, as well as consonantal and vowel changes. In the simplest proposal, referred to as “solution I,” new epicene forms are mostly derived from feminine forms by systematically replacing with an i the final e that generally differentiates feminines from their masculine counterparts in written French. Although these solutions are used in some communities, their learnability has not been addressed so far, even though it could be a determining factor in their popularity and their eventual integration into standard French. In the present study, we provide a first assessment of this aspect by means of an online translation test. For each solution, French-speaking participants were instructed that they would be trained to learn an “alien” language that does not mark sex/gender categories (these alien languages correspond to standard French where only gendered words referring to people are replaced by the new epicene forms recommended by each solution). 
After a short learning-by-example phase, participants were required to translate into the alien language a set of 16 standard French sentences. The translations were analyzed as a function of several variables including the participants’ self-reported age and sex, the word categories and the solutions themselves. While all solutions proved quickly learnable, participants’ responses with solution I achieved the highest accuracy score, in particular with regard to the production of non-standard epicene forms.
@article{isolution_syllable_2024, title = {Vowel alternation with final i offers an easy-to-learn morphological option for a sex-blind grammatical gender in French}, volume = {15}, number = {1310475}, copyright = {All rights reserved}, issn = {1664-1078}, doi = {10.3389/fpsyg.2024.1310475}, journal = {Frontiers in Psychology}, author = {Marsolier, Marie-Claude and Touraille, Pris and Allassonnière-Tang, Marc}, year = {2024}, pages = {1-12}, }

Revisiting the automatic prediction of lexical errors in Mandarin

Marc Allassonnière-Tang and I-Ping Wan

Linguistics Vanguard 2024

@article{AllassonnièreTangWan_2024,
  title = {Revisiting the automatic prediction of lexical errors in Mandarin},
  author = {Allassonnière-Tang, Marc and Wan, I-Ping},
  journal = {Linguistics Vanguard},
  volume = {10},
  number = {1},
  pages = {527-535},
  doi = {10.1515/lingvan-2023-0036},
  year = {2024},
}

‘Reflexemes’ – a first cross-linguistic insight into how and why reflexive constructions encode emotions

Alex Stephenson, Maïa Ponsonnet, and Marc Allassonnière-Tang

STUF - Language Typology and Universals 2024

This article presents the first study on reflexive expressions having lexicalized an emotional meaning, as in the English example enjoy oneself. Such lexicalized forms, which we call ‘reflexemes’, occur in a number of genetically unrelated languages worldwide. Here we interrogate the cross-linguistic distribution and semantics of reflexemes, based on a sample of 58 languages from 6 genetic groups throughout Europe, Australia, and Asia. Reflexemes exhibit uneven distribution in this sample. Despite the presence of reflexemes across all three continents, European languages generally display much larger inventories. Based on our language sample’s contrasts, we hypothesize that these disparities could be driven by: the form of reflexive markers; their semantic range, including colexifications with anticausative constructions; and their longevity, with ancient, cognate European markers fostering accumulation of reflexemes via inheritance and borrowing. As for semantics, reflexemes target comparable emotions across languages. Specifically, categories labelled ‘Good feelings’, ‘Anger’, ‘Worry’, ‘Bad feelings’ and ‘Fear’ are consistently most prevalent. These tendencies apply across our sample, with no sign of family- or continent-specific semantic tendency. The observed semantic distribution may reflect universal lexicalization tendencies not specific to reflexemes, perhaps combined with an emphasis on self-evaluation and other social emotions imparted by reflexive semantics.
@article{ponssonnetetal_reflexemes_2024, title = {‘Reflexemes’ – a first cross-linguistic insight into how and why reflexive constructions encode emotions}, author = {Stephenson, Alex and Ponsonnet, Maïa and Allassonnière-Tang, Marc}, pages = {141--188}, volume = {77}, number = {1}, journal = {STUF - Language Typology and Universals}, doi = {10.1515/stuf-2024-2003}, year = {2024}, }
Early humans out of Africa had only base-initial numerals

One-Soon Her, Yung-Ping Liang, Eugene Chan, and 3 more authors

Humanities and Social Sciences Communications 2024

The vast majority of languages have numerals involving multiplication. Cross-linguistically, a numeral that involves a multiplier and a numeral base can be base-final, e.g., three hundred [three × hundred] in English, or base-initial, e.g., ikie ita [hundred × three] in Ibibio (Niger-Congo). A worldwide survey of 4099 languages reveals that 39% of the languages are base-initial, 48% are base-final, 4% use both orders, and 8% are without numeral bases. As the first step towards explaining this diversity and worldwide distribution, we offer convergent evidence to support the hypothesis that the languages of early humans in Africa had base-initial numerals. From a linguistic point of view, linearization is necessary for the verbal expression of multiplicative numerals. Between the two linear orders of multiplication, we demonstrate that the base-initial order has an initial advantage in communicative efficiency. We also offer typological evidence from the dominant head-initial word order in present-day numeral systems and nominal phrases in African languages. Finally, results from a phylogenetic analysis based on a global tree of human languages show that the base-initial order is more stable diachronically and more likely to be at the root of the reconstructed tree of languages in Africa between 100 and 150 thousand years ago. The dominant base-final order in non-African languages of modernity is thus likely to be a development after the Out-of-Africa exodus between 60 and 80 thousand years ago.
@article{allassonniere-tang_numerals_2024, title = {Early humans out of Africa had only base-initial numerals}, volume = {11}, issn = {2662-9992}, doi = {10.1057/s41599-023-02506-z}, language = {en}, number = {254}, urldate = {2024-02}, journal = {Humanities and Social Sciences Communications}, author = {Her, One-Soon and Liang, Yung-Ping and Chan, Eugene and Hsu, Hung-Hsin and Hsu, Anthony Chi-Pin and Allassonnière-Tang, Marc}, month = dec, year = {2024}, pages = {1-7}, }
Semantic and Phonological Distances in Free Word Association Tasks

Marc Allassonnière-Tang, I-Ping Wan, and Chainwu Lee

In Chinese Lexical Semantics 2024

Free word association tasks are used to evaluate different hypotheses proposed by interactive and cascade models of speech processing. The interactive model predicts a small semantic and phonological distance between the target and the response words, whereas the cascade model predicts that the responses are semantically close to the targets but are phonologically far from the targets. One hundred forty-five stimuli tested with 22 participants resulted in 2289 tokens available for testing. The phonological and semantic distances were automatically measured using Levenshtein distance and word embeddings; additional metadata over 10M drawn from the Academia Sinica Corpus in Taiwan was computed. The results show that the stimuli and the responses are closer than random semantically and phonologically, supporting the predictions from the interactive models. However, we also observe that the semantic distance is shorter than the phonological distance. A concomitant increase in chronometry is found with longer semantic distance.
@incollection{allassonniere-tang_speech_2024, author = {Allassonnière-Tang, Marc and Wan, I-Ping and Lee, Chainwu}, editor = {Dong, Minghui and Hong, Jia-Fei and Lin, Jingxia and Jin, Peng}, title = {Semantic and Phonological Distances in Free Word Association Tasks}, booktitle = {Chinese Lexical Semantics}, year = {2024}, publisher = {Springer Nature Singapore}, address = {Singapore}, pages = {91-100}, isbn = {978-981-97-0586-3}, doi = {10.1007/978-981-97-0586-3_8}, }

2023

Phylogenetic analyses for the origin of sortal classifiers in Mongolic, Tungusic, and Turkic languages

Marc Allassonnière-Tang, Zhong-Liang Gao, Shen-An Chen, and 1 more author

Concentric 2023

Numeral classifiers are one of the most common types of nominal classification systems. Their geographical distribution worldwide is concentrated in Asia, which infers a scheme of diffusion from a linguistic innovation. This study investigates the origin of classifier systems in the Mongolic, Tungusic, and Turkic languages in the Altaic region with a phylogenetic analysis based on data from 55 languages. The Single Origin Hypothesis suggests that Sinitic is the most probable original source of classifier systems found in Asia. Under this hypothesis, classifiers are unlikely to be an indigenous feature of the Altaic region, and indeed their phylogenetic signal turns out to be weak. We also conduct a qualitative analysis on the classifier inventory of the studied languages to assess the robustness of phylogenetic methods. The results also indicate that classifiers are most likely a borrowed feature in the Mongolic, Tungusic, and Turkic languages.
@article{allassonniere-tang_clftranseurasian_2023, title = {Phylogenetic analyses for the origin of sortal classifiers in Mongolic, Tungusic, and Turkic languages}, doi = {10.1075/consl.00031.her}, language = {en}, urldate = {2023-09-14}, journal = {Concentric}, volume = {49}, number = {2}, pages = {295-315}, author = {Allassonnière-Tang, Marc and Gao, Zhong-Liang and Chen, Shen-An and Her, One-Soon}, month = may, year = {2023}, }
Variation du genre des substantifs dans les dialectes gallo-romans. Étude exploratoire

Guylaine Brun-Trigaud, Maguelone Sauzet, and Marc Allassonnière-Tang

Géolinguistique 2023

This article analyzes a corpus of about 900 maps from the Atlas linguistique de la France (1902‑1910) in order to explore gender variation (masculine/feminine) of nouns in the Gallo-Romance dialects (Oïl, Occitan, Francoprovençal), compared with standard French, where this grammatical category has been strongly regularized by the norm. We used qualitative and quantitative methods (linear regression). The first results show an abundance of cases of variation, which semantic, etymological, and morpho‑phonological criteria inhibit or favor. The work of Platz (1918), a precursor in the study of gender in the ALF, provided interesting leads for our reflections.
@article{allassonniere-tang_genderfrance_2023, title = {Variation du genre des substantifs dans les dialectes gallo-romans. Étude exploratoire}, doi = {10.4000/geolinguistique.13891}, language = {fr}, urldate = {2023-12-13}, journal = {Géolinguistique}, volume = {23}, author = {Brun-Trigaud, Guylaine and Sauzet, Maguelone and Allassonnière-Tang, Marc}, month = dec, year = {2023}, }
Nominal classification in Asia and Oceania: Functional and diachronic perspectives

Marc Allassonnière-Tang and Marcin Kilarski

John Benjamins 2023

Linguists have long been interested in systems of nominal classification due to their diverse functions as well as cognitive and cultural correlates. Among others, ongoing research has focused on semantic, functional and morphosyntactic properties of complex systems such as co-occurring gender and numeral classifiers. Such approaches have typically focused on the languages of north-western South America and Papua New Guinea. This volume proposes to fill in a gap in existing research by focusing on Asia, based on case studies from languages belonging to a wide range of families, i.e., Austroasiatic, Austronesian, Dravidian, Hmong-Mien, Indo-European, Mongolic, Sino-Tibetan and Tai-Kadai as well as the language isolate Nivkh. Gender and classifiers in these languages are approached within several different perspectives, i.e., functional, typological and diachronic, thus revealing complex patterns in their lexical and pragmatic functions as well as origin, development and loss. Describing and analysing such properties is a unique and innovative contribution of the volume.
@book{allassonniere-tang_nominal_2023, address = {Amsterdam}, title = {Nominal classification in Asia and Oceania: Functional and diachronic perspectives}, isbn = {978-90-272-1437-9}, language = {eng}, publisher = {John Benjamins}, author = {Allassonnière-Tang, Marc and Kilarski, Marcin}, year = {2023}, }
Nominal classification in Assamese: An analysis of function

Pori Saikia and Marc Allassonnière-Tang

In Current Issues in Linguistic Theory 2023

We provide an analysis of the classifier system in Assamese (Indo-European) via the framework of functional typology. Assamese is located at the border of Indo-European and Sino-Tibetan language families, which are typically associated with grammatical gender and classifiers, respectively. Assamese represents an insightful example of an Indo-European language relying on classifiers rather than grammatical gender to fulfill the functions typical for a nominal classification system. Our analysis shows that classifiers in Assamese behave similarly to other classifier languages in terms of lexical and discourse functions, except for the functions of definiteness marking and individuation. The implications of such findings are connected to typology, research in human cognition, and language contact.
@incollection{allassonniere-tang_nominal_2024, address = {Amsterdam}, title = {Nominal classification in {Assamese}: {An} analysis of function}, volume = {362}, isbn = {978-90-272-1437-9 978-90-272-4924-1}, language = {en}, urldate = {2023-12-19}, booktitle = {Current {Issues} in {Linguistic} {Theory}}, publisher = {John Benjamins Publishing Company}, author = {Saikia, Pori and Allassonnière-Tang, Marc}, editor = {Allassonnière-Tang, Marc and Kilarski, Marcin}, month = dec, year = {2023}, doi = {10.1075/cilt.362.03sai}, pages = {30--55}, }
Why we need a gradient approach to word order

Natalia Levshina, Savithry Namboodiripad, Marc Allassonnière-Tang, and 12 more authors

Linguistics 2023

This article argues for a gradient approach to word order, which treats word order preferences, both within and across languages, as a continuous variable. Word order variability should be regarded as a basic assumption, rather than as something exceptional. Although this approach follows naturally from the emergentist usage-based view of language, we argue that it can be beneficial for all frameworks and linguistic domains, including language acquisition, processing, typology, language contact, language evolution and change, and formal approaches. Gradient approaches have been very fruitful in some domains, such as language processing, but their potential is not fully realized yet. This may be due to practical reasons. We discuss the most pressing methodological challenges in corpus-based and experimental research of word order and propose some practical solutions.
@article{levshina_order_2023, title = {Why we need a gradient approach to word order}, issn = {1613-396X}, doi = {10.1515/ling-2021-0098}, language = {en}, urldate = {2023-04-25}, journal = {Linguistics}, volume = {61}, number = {4}, pages = {825-883}, author = {Levshina, Natalia and Namboodiripad, Savithry and Allassonnière-Tang, Marc and Kramer, Mathew and Talamo, Luigi and Verkerk, Annemarie and Wilmoth, Sasha and Rodriguez, Gabriela Garrido and Gupton, Timothy Michael and Kidd, Evan and Liu, Zoey and Naccarato, Chiara and Nordlinger, Rachel and Panova, Anastasia and Stoynova, Natalia}, month = may, year = {2023}, }
L’apport des données participatives pour l’étude linguistique des français du monde : le cas de l’opposition /a∼ɑ/

Mathilde Hutin and Marc Allassonnière-Tang

Journal of French Language Studies 2023

French is a language spoken by hundreds of millions of speakers in Europe, Africa, and America. Such widespread use favours variation, yet large homogeneous corpora allowing to account for this variation worldwide are scarce and would in any case necessitate non-negligeable financial and human resources, as did for instance the project Phonologie du Français Contemporain. In this study, we present a possible alternative – crowdsourcing. We introduce Lingua Libre, Wikimedia’s open linguistic library, and use it to describe the variation of a phonemic opposition between two low vowels, /a/ and /ɑ/, in several varieties of French. The recordings of 38 speakers from 26 survey points are processed automatically and compared to values from past research. Results show that the platform has the potential to provide results mostly congruent with those of professional field recordings. The study concludes on the advantages and limitations of the platform and proposes suggestions for its improvement.
@article{hutin_allassonnière-tang_2023, title = {L’apport des données participatives pour l’étude linguistique des français du monde : le cas de l’opposition /a∼ɑ/}, doi = {10.1017/S0959269523000200}, journal = {Journal of French Language Studies}, author = {Hutin, Mathilde and Allassonnière-Tang, Marc}, year = {2023}, volume = {34}, number = {2}, pages = {249--272}, }
Idéer une catégorie épicène et la matérialiser cohéremment dans la langue. Une nécessité épistémologique autant que politique

Priscille Touraille and Marc Allassonnière-Tang

In Qu’est-ce qu’une femme ? Catégories homme/femme : débats contemporains 2023

Pris Touraille wrote this chapter and takes scientific responsibility for it. This work is the first part of a collaboration with Claude (Miki) Marsolier, a linguistics enthusiast (and genetics researcher at the CEA and the MNHN, Paris), and Arc Allassonnière-Tang (with whom the paper at the conference Qu’est-ce qu’une femme ? was presented in Nantes in 2022). This collaboration consists in developing an epicene grammatical solution for French, which Pris Touraille began to think about as a potential tool for an epistemological break in the social sciences. The second part of this collaboration, carried out largely by M. Marsolier, is a concrete proposal of epicene forms that make up a system, "the i solution" (« la solution en i »), which we also call "sex-free French" (« le français hors-sexe »). A third part of the project, largely led by Arc Allassonnière-Tang, is forthcoming and will be devoted to the test results and to the application created to develop this tool.
@incollection{Touraille_Women_2023, address = {Paris}, title = {Idéer une catégorie épicène et la matérialiser cohéremment dans la langue. Une nécessité épistémologique autant que politique}, isbn = {978-2-37361-408-4}, language = {fr}, urldate = {2023}, booktitle = {Qu’est-ce qu’une femme ? Catégories homme/femme : débats contemporains}, publisher = {Éditions Matériologiques}, author = {Touraille, Priscille and Allassonnière-Tang, Marc}, editor = {Lemarchand, Patricia and Salle, Muriel}, year = {2023}, pages = {167--223}, }
Investigating the Syntax-Discourse Interface in the Phonetic Implementation of Discourse Markers

Mathilde Hutin, Liesbeth Degand, and Marc Allassonnière-Tang

In Proc. INTERSPEECH 2023

Discourse markers (DMs) are (chunks of) words stemming from the diachronic development of other parts-of-speech that tag the discourse’s organization (ex. "well then", "innit"...). However, in synchrony, the formal accounts for the DM class vary from purely discourse-oriented definitions to models relying on a combination of lexico-grammatical and discursive information. We propose to bring new evidence into this debate by comparing the phonetic realizations of 4 DM types: stemming originally from adverbs, coordinators, subordinators and interjections. A discourse-only account would predict that the 4 types would be realized similarly, while a syntactic-discursive account predicts that subordinators would stand out, as they are less prone to syntactic independence. The analysis of various acoustic parameters (segment duration, F0, F1, F2 and HNR) in a finely-annotated 4-hour long corpus of French indicates that a hybrid approach may indeed be more accurate.
@inproceedings{hutin23_interspeech, author = {Hutin, Mathilde and Degand, Liesbeth and Allassonnière-Tang, Marc}, title = {{Investigating the Syntax-Discourse Interface in the Phonetic Implementation of Discourse Markers}}, year = {2023}, booktitle = {Proc. INTERSPEECH 2023}, pages = {2563--2567}, doi = {10.21437/Interspeech.2023-1619}, }
Intra- and inter-speaker variation in eight Russian fricatives

Natalja Ulrich, François Pellegrino, and Marc Allassonnière-Tang

The Journal of the Acoustical Society of America 2023

Acoustic variation is central to the study of speaker characterization. In this respect, specific phonemic classes such as vowels have been particularly studied, compared to fricatives. Fricatives exhibit important aperiodic energy, which can extend over a high-frequency range beyond that conventionally considered in phonetic analyses, often limited up to 12 kHz. We adopt here an extended frequency range up to 20.05 kHz to study a corpus of 15 812 fricatives produced by 59 speakers in Russian, a language offering a rich inventory of fricatives. We extracted two sets of parameters: the first is composed of 11 parameters derived from the frequency spectrum and duration (acoustic set) while the second is composed of 13 mel frequency cepstral coefficients (MFCCs). As a first step, we implemented machine learning methods to evaluate the potential of each set to predict gender and speaker identity. We show that gender can be predicted with a good performance by the acoustic set and even more so by MFCCs (accuracy of 0.72 and 0.88, respectively). MFCCs also predict individuals to some extent (accuracy = 0.64) unlike the acoustic set. In a second step, we provide a detailed analysis of the observed intra- and inter-speaker acoustic variation.
@article{ulrich_variation_2023, title = {Intra- and inter-speaker variation in eight Russian fricatives}, issn = {0001-4966}, doi = {10.1121/10.0017827}, language = {en}, urldate = {2023-04-17}, journal = {The Journal of the Acoustical Society of America}, volume = {153}, number = {4}, pages = {2285--2297}, author = {Ulrich, Natalja and Pellegrino, François and Allassonnière-Tang, Marc}, month = jan, year = {2023}, }
A corpus-based quantitative study of numeral classifiers in Nepali

Krishna Parajuli and Marc Allassonnière-Tang

Corpus Linguistics and Linguistic Theory 2023

Nepali is typologically rare in terms of nominal classification systems, as it is one of the few languages of the world having simultaneously two gender systems (human/non-human, masculine/feminine) and one numeral classifier system (distinguishing features such as human, round-shaped objects, and long objects among others). Such a rare co-occurrence of different nominal classification systems is highly relevant for investigating linguistic complexity, as languages generally do not have several systems of the same type fulfilling the same functions. However, no corpus-based quantitative analyses have been conducted on the productive use of nominal classification systems in Nepali. The current paper aims at filling this gap by providing a token-based study from the Nepali National Corpus (∼20 million words). Our preliminary results show that there is in fact little formal overlap between the classifier and the gender systems.
@article{parajuli_nepali_2023, title = {A corpus-based quantitative study of numeral classifiers in Nepali}, issn = {1613-7035}, doi = {10.1515/cllt-2022-0064}, language = {en}, urldate = {2023-02-13}, journal = {Corpus Linguistics and Linguistic Theory}, author = {Parajuli, Krishna and Allassonnière-Tang, Marc}, month = jan, year = {2023}, }

2022

Defining numeral classifiers and identifying classifier languages of the world

One-Soon Her, Harald Hammarström, and Marc Allassonnière-Tang

Linguistics Vanguard 2022

This paper presents a precise definition of numeral classifiers, steps to identify a numeral classifier language, and a database of 3,338 languages, of which 723 languages have been identified as having a numeral classifier system. The database, named World Atlas of Classifier Languages (WACL), has been systematically constructed over the last 10 years via a manual survey of relevant literature and also an automatic scan of digitized grammars followed by manual checking. The open-access release of WACL is thus a significant contribution to linguistic research in providing (i) a precise definition and examples of how to identify numeral classifiers in language data and (ii) the largest dataset of numeral classifier languages in the world. As such it offers researchers a rich and stable data source for conducting typological, quantitative, and phylogenetic analyses on numeral classifiers. The database will also be expanded with additional features relating to numeral classifiers in the future in order to allow more fine-grained analyses.
@article{her_defining_2022, title = {Defining numeral classifiers and identifying classifier languages of the world}, issn = {2199-174X}, doi = {10.1515/lingvan-2022-0006}, language = {en}, urldate = {2022-12-25}, journal = {Linguistics Vanguard}, volume = {8}, number = {1}, pages = {151--164}, author = {Her, One-Soon and Hammarström, Harald and Allassonnière-Tang, Marc}, month = nov, year = {2022}, }
The noncausal/causal alternation and genealogical affiliation: Quantitative testing in three Niger-Congo language families

Marc Allassonnière-Tang, Stéphane Robert, and Sylvie Voisin

Linguistique et Langues Africaines 2022

The noncausal/causal alternation is the pairing of two verb forms that refer to the same core event but differ in the absence vs. presence of a causer for this event (e.g. rise vs. raise, open (intr.) vs. open (tr.), die vs. kill). Languages differ in their overall preferences among the possible strategies for coding this alternation. This study uses machine-learning methods (clustering and tree-based computational classifiers) to investigate the predictive power of the noncausal/causal alternation for the genealogical affiliation of 38 languages belonging to the Atlantic, Mande and Mel families. The languages studied here belong to different contact areas in Senegal and its surroundings. The three families are all affiliated to the Niger-Congo phylum but display quite different typological profiles. The present paper elaborates on an earlier study that used a standard list of 18 verb pairs to establish the coding strategies in these languages. Apart from highlighting which coding strategies are favored in each family, our quantitative analyses show that the family affiliation of the 38 languages can be predicted with an accuracy above the majority baseline based on the information of the noncausal/causal alternation in the 18 verb pairs, but that the predictive power of verb pairs 1‑9 is generally lower than the one of verb pairs 10‑18. Our results confirm the hypothesis that the first group of verb pairs shows universal rather than lineage-specific tendencies concerning the noncausal/causal alternation. Furthermore, our analyses identify which of the 18 verb pairs (and their correlated coding strategies) have the highest predictive power. This study opens new avenues for identifying the relevant synchronic data for genealogical classification in historical linguistics. Future studies could replicate the same analysis in different language families to assess if our results are universal or specific to some language families.
@article{tang_valency_2022, title = {The noncausal/causal alternation and genealogical affiliation: Quantitative testing in three Niger-Congo language families}, issn = {2822-7468}, language = {en}, urldate = {2022-12-31}, journal = {Linguistique et Langues Africaines}, volume = {8}, number = {2}, pages = {1--20}, author = {Allassonnière-Tang, Marc and Robert, Stéphane and Voisin, Sylvie}, month = dec, year = {2022}, }
On Taiwanese Universities’ Two–One Academic Dismissal Policies: A Quantitative Fairness Analysis of the Four Policies of National Chengchi University

One-Soon Her, Jie-Wen Tsai, and Marc Allassonnière-Tang

Journal of Educational Research and Development 2022

Academic dismissal policies are used by universities worldwide for quality control purposes. Taiwanese universities base their policies solely on the credit fail rate (CFR) of individual semesters (S-CFR). The most common S-CFR is 50% and is called er-yi (two-one), which indicates half or more of the course credits of a semester were failed. Though actual policies vary among universities, their core designs generally rely on the concept of S-CFR. The present study first compares the dismissal policies among universities in the United States, the Netherlands, and Taiwan to demonstrate how the two–one design lacks consultation and review processes. We then argue that the disregard for cumulative grade point average, semester grade point average, and cumulative credit pass rate may lead to bias because it may lead to students with better overall academic performance being dismissed. We further validate the argument by conducting a quantitative analysis of data on the academic performance of students (N=22,703) from National Chengchi University over 11 years under four different policies. Our findings strongly indicate that the core design common in such policies, i.e., the S-CFR, should be reconsidered.
@article{her_taiwanese_2022, title = {On {Taiwanese} {Universities}’ {Two}–{One} {Academic} {Dismissal} {Policies}: {A} {Quantitative} {Fairness} {Analysis} of the {Four} {Policies} of {National} {Chengchi} {University}}, volume = {18}, url = {https://journal.naer.edu.tw/periodical_detail.asp?DID=vol071_03}, doi = {10.6925/SCJ.202212_18(4).0003}, number = {4}, journal = {Journal of Educational Research and Development}, author = {Her, One-Soon and Tsai, Jie-Wen and Allassonnière-Tang, Marc}, year = {2022}, pages = {79--112}, }
Predicting grammatical gender in Nakh languages: Three methods compared

Jesse Wichers Schreur, Marc Allassonnière-Tang, Kate Bellamy, and 1 more author

Linguistic Typology at the Crossroads 2022

The Nakh languages Chechen and Tsova-Tush each have a five-valued gender system: masculine, feminine, and three “neuter” genders named for their singular agreement forms: B, D and J. Gender assignment in languages is generally analysed as being dependent on both forms and semantics (e.g. Corbett, 1991), with semantics typically prevailing over form (e.g. Bellamy & Wichers Schreur, 2021, Allassonnière-Tang et al., 2021). Most previous studies have considered only binary or tripartite gender systems possessing masculine, feminine, and neuter values. The five-valued system of Nakh thus represents an innovative and insightful case study for analysing gender assignment. In this paper we build on the existing qualitative linguistic analyses of gender assignment in Tsova-Tush (Wichers Schreur, 2021) and apply three machine-learning methods to investigate the weight of form and semantics in predicting grammatical gender in Chechen and Tsova-Tush. The results show that while both form and semantics are helpful for predicting grammatical gender in Nakh, semantics is dominant, which supports findings from existing literature (Allassonnière-Tang, Brown & Fedden, 2021). However, the results also show that the coded semantic information could be further fine-grained to improve the accuracy of the predictions (see also Plaster et al., 2013). In addition, we discuss the implications of the output for our understanding of language-internal and family-internal processes of language change, including how loanwords are integrated from Russian, a three-gender language.
@article{wichers_schreur_predicting_2022, title = {Predicting grammatical gender in {Nakh} languages: {Three} methods compared}, volume = {2}, copyright = {Creative Commons Attribution 4.0 International}, shorttitle = {Predicting grammatical gender in {Nakh} languages}, doi = {10.6092/ISSN.2785-0943/14545}, language = {en}, number = {2}, urldate = {2022-12-25}, journal = {Linguistic Typology at the Crossroads}, author = {Wichers Schreur, Jesse and Allassonnière-Tang, Marc and Bellamy, Kate and Rochant, Neige}, month = dec, year = {2022}, pages = {93--126}, }
Operation LiLi: Using Crowd-Sourced Data and Automatic Alignment to Investigate the Phonetics and Phonology of Less-Resourced Languages

Mathilde Hutin and Marc Allassonnière-Tang

Languages 2022

Less-resourced languages are usually left out of phonetic studies based on large corpora. We contribute to the recent efforts to fill this gap by assessing how to use open-access, crowd-sourced audio data from Lingua Libre for phonetic research. Lingua Libre is a participative linguistic library developed by Wikimedia France in 2015. It contains more than 670k recordings in approximately 150 languages across nearly 740 speakers. As a proof of concept, we consider the Inventory Size Hypothesis, which predicts that, in a given system, variation in the realization of each vowel will be inversely related to the number of vowel categories. We investigate data from 10 languages with various numbers of vowel categories, i.e., German, Afrikaans, French, Catalan, Italian, Romanian, Polish, Russian, Spanish, and Basque. Audio files are extracted from Lingua Libre to be aligned and segmented using the Munich Automatic Segmentation System. Information on the formants of the vowel segments is then extracted to measure how vowels expand in the acoustic space and whether this is correlated with the number of vowel categories in the language. The results provide valuable insight into the question of vowel dispersion and demonstrate the wealth of information that crowd-sourced data has to offer.
@article{hutin_operation_2022, title = {Operation {LiLi}: {Using} {Crowd}-{Sourced} {Data} and {Automatic} {Alignment} to {Investigate} the {Phonetics} and {Phonology} of {Less}-{Resourced} {Languages}}, volume = {7}, issn = {2226-471X}, shorttitle = {Operation {LiLi}}, doi = {10.3390/languages7030234}, language = {en}, number = {3}, urldate = {2022-12-25}, journal = {Languages}, author = {Hutin, Mathilde and Allassonnière-Tang, Marc}, month = sep, year = {2022}, pages = {234}, }
Inferring case paradigms in Koalib with computational classifiers

Nicolas Quint and Marc Allassonnière-Tang

Corpus Linguistics and Linguistic Theory 2022

The object case inflection in Koalib (Niger-Congo) represents complex patterns that involve phoneme position, syllable structure, and tonal pattern. Few attempts have been made with qualitative and quantitative approaches to identify the rules of the object case paradigms in Koalib. In the current study, information on phonemes, tones, and syllables is automatically extracted from a Koalib sample of 2,677 lexemes. The data is then fed to decision-tree-based classifiers to predict the object case paradigms and extract the interactive patterns between the variables. The results improve on the predictive accuracy of existing studies and identify the case paradigms predicted by linguistic hypotheses. New case paradigms are also found by the computational classifiers and explained from a linguistic perspective. Our work demonstrates that the combination of linguistic theoretical knowledge with machine learning techniques can become one of the methodological approaches for linguistic analyses.
@article{quint_inferring_2022, title = {Inferring case paradigms in {Koalib} with computational classifiers}, issn = {1613-7027, 1613-7035}, doi = {10.1515/cllt-2021-0028}, language = {en}, urldate = {2022-02-02}, journal = {Corpus Linguistics and Linguistic Theory}, volume = {19}, number = {2}, pages = {237--269}, author = {Quint, Nicolas and Allassonnière-Tang, Marc}, month = jan, year = {2022}, }
The evolutionary trends of noun class systems in Atlantic languages

Neige Rochant, Marc Allassonnière-Tang, and Chundra Cathcart

In Proceedings of the Joint Conference on Language Evolution 2022

Nominal classification systems such as grammatical gender (e.g., the masculine/feminine distinction in French) and noun classes (e.g., Bantu noun classes based on fruits, plants, liquids, among others) provide a window on how the human brain perceives and categorizes objects and experiences it encounters. While the diachronic development of grammatical gender systems is well studied, noun class systems have received less attention. We use phylogenetic comparative methods to analyze where noun classes are marked (on nouns, pronouns, demonstratives, articles, adjectives, numbers, and verbs) in thirty-six Atlantic languages and how these markers change diachronically. Our results show that noun class marking is generally preferred and more stable within the noun phrase, i.e., on nouns, demonstratives, and adjectives.
@inproceedings{rochant_evolutionary_2022, title = {The evolutionary trends of noun class systems in {Atlantic} languages}, doi = {10.17617/2.3398549}, urldate = {2022-12-25}, booktitle = {Proceedings of the {Joint} {Conference} on {Language} {Evolution}}, author = {Rochant, Neige and Allassonnière-Tang, Marc and Cathcart, Chundra}, year = {2022}, pages = {624--631}, }
Investigating phonological theories with crowd-sourced data: The Inventory Size Hypothesis in the light of Lingua Libre

Mathilde Hutin and Marc Allassonnière-Tang

In Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology 2022

Data-driven research in phonetics and phonology relies massively on oral resources, and access thereto. We propose to explore a question in comparative linguistics using an open-source crowd-sourced corpus, Lingua Libre, Wikimedia’s participatory linguistic library, to show that such corpora may offer a solution to typologists wishing to explore numerous languages at once. For the present proof of concept, we compare the realizations of Italian and Spanish vowels (sample size = 5000) to investigate whether vowel production is influenced by the size of the phonemic inventory (the Inventory Size Hypothesis), by the exact shape of the inventory (the Vowel Quality Hypothesis) or by none of the above. Results show that the size of the inventory does not seem to influence vowel production, thus supporting previous research, but also that the shape of the inventory may well be a factor determining the extent of variation in vowel production. Most of all, these results show that Lingua Libre has the potential to provide valuable data for linguistic inquiry.
@inproceedings{hutin_investigating_2022, title = {Investigating phonological theories with crowd-sourced data: {The} {Inventory} {Size} {Hypothesis} in the light of {Lingua} {Libre}}, booktitle = {Proceedings of the 19th {SIGMORPHON} {Workshop} on {Computational} {Research} in {Phonetics}, {Phonology}, and {Morphology}}, author = {Hutin, Mathilde and Allassonnière-Tang, Marc}, year = {2022}, pages = {23--28}, }
Crowd-sourcing for Less-resourced Languages: Lingua Libre for Polish

Mathilde Hutin and Marc Allassonnière-Tang

In Proceedings of the International Conference on Language Resources and Evaluation 2022

Oral corpora for linguistic inquiry are frequently built based on the content of news, radio, and/or TV shows, sometimes also of laboratory recordings. Most of these existing corpora are restricted to languages with a large amount of data available. Furthermore, such corpora are not always accessible under a free open-access license. We propose a crowd-sourced alternative to fill this gap. Lingua Libre is the participatory linguistic media library hosted by Wikimedia France. It includes recordings from more than 140 languages. These recordings have been provided by more than 750 speakers worldwide, who voluntarily recorded word entries of their native language and made them available under a Creative Commons license. In the present study, we take Polish, a less-resourced language in terms of phonetic data, as an example, and compare our phonetic observations built on the data from Lingua Libre with the phonetic observations found by previous linguistic studies. We observe that the data from Lingua Libre partially matches the phonetic inventory of Polish as described in previous studies, but that the acoustic values are less precise, thus showing both the potential and the limitations of Lingua Libre to be used for phonetic research.
@inproceedings{hutin_crowd-sourcing_2022, title = {Crowd-sourcing for {Less}-resourced {Languages}: {Lingua} {Libre} for {Polish}}, booktitle = {Proceedings of the {International} {Conference} on {Language} {Resources} and {Evaluation}}, author = {Hutin, Mathilde and Allassonnière-Tang, Marc}, year = {2022}, pages = {41--47}, }

2021

Expansion by migration and diffusion by contact is a source to the global diversity of linguistic nominal categorization systems

Marc Allassonnière-Tang, Olof Lundgren, Maja Robbers, and 5 more authors

Humanities and Social Sciences Communications 2021

Languages of diverse structures and different families tend to share common patterns if they are spoken in geographic proximity. This convergence is often explained by horizontal diffusibility, which is typically ascribed to language contact. In such a scenario, speakers of two or more languages interact and influence each other’s languages, and in this interaction, more grammaticalized features tend to be more resistant to diffusion compared to features of more lexical content. An alternative explanation is vertical heritability: languages in proximity often share genealogical descent. Here, we suggest that the geographic distribution of features globally can be explained by two major pathways, which are generally not distinguished within quantitative typological models: feature diffusion and language expansion. The first pathway corresponds to the contact scenario described above, while the second occurs when speakers of genetically related languages migrate. We take the worldwide distribution of nominal classification systems (grammatical gender, noun class, and classifier) as a case study to show that more grammaticalized systems, such as gender, and less grammaticalized systems, such as classifiers, are almost equally widespread, but the former spread more by language expansion historically, whereas the latter spread more by feature diffusion. Our results indicate that quantitative models measuring the areal diffusibility and stability of linguistic features are likely to be affected by language expansion that occurs by historical coincidence. We anticipate that our findings will support studies of language diversity in a more sophisticated way, with relevance to other parts of language, such as phonology.
@article{allassonniere-tang_expansion_2021, title = {Expansion by migration and diffusion by contact is a source to the global diversity of linguistic nominal categorization systems}, volume = {8}, issn = {2662-9992}, doi = {10.1057/s41599-021-01003-5}, language = {en}, number = {1}, urldate = {2022-12-25}, journal = {Humanities and Social Sciences Communications}, author = {Allassonnière-Tang, Marc and Lundgren, Olof and Robbers, Maja and Cronhamn, Sandra and Larsson, Filip and Her, One-Soon and Hammarström, Harald and Carling, Gerd}, month = dec, year = {2021}, pages = {331}, }
Identifying the Russian voiceless non-palatalized fricatives /f/, /s/, and /ʃ/ from acoustic cues using machine learning

Natalja Ulrich, Marc Allassonnière-Tang, François Pellegrino, and 1 more author

The Journal of the Acoustical Society of America 2021

This paper shows that machine learning techniques are very successful at classifying the Russian voiceless non-palatalized fricatives [f], [s], and [ʃ] using a small set of acoustic cues. From a data sample of 6320 tokens of read sentences produced by 40 participants, temporal and spectral measurements are extracted from the full sound, the noise duration, and the middle 30 ms windows. Furthermore, 13 mel-frequency cepstral coefficients (MFCCs) are computed from the middle 30 ms window. Classifiers based on single decision trees, random forests, support vector machines, and neural networks are trained and tested to distinguish between these three fricatives. The results demonstrate that, first, the three acoustic cue extraction techniques are similar in terms of classification accuracy (93% and 99%) but that the spectral measurements extracted from the full frication noise duration result in slightly better accuracy. Second, the center of gravity and the spectral spread are sufficient for the classification of [f], [s], and [ʃ] irrespective of contextual and speaker variation. Third, MFCCs show a marginally higher predictive power over spectral cues (<2%). This suggests that both sets of measures provide sufficient information for the classification of these fricatives and their choice depends on the particular research question or application.
@article{ulrich_identifying_2021, title = {Identifying the {Russian} voiceless non-palatalized fricatives /f/, /s/, and /ʃ/ from acoustic cues using machine learning}, volume = {150}, issn = {0001-4966}, doi = {10.1121/10.0005950}, language = {en}, number = {3}, urldate = {2022-12-25}, journal = {The Journal of the Acoustical Society of America}, author = {Ulrich, Natalja and Allassonnière-Tang, Marc and Pellegrino, François and Dediu, Dan}, month = sep, year = {2021}, pages = {1806--1820}, }
Investigating the branching of Chinese classifier phrases: Evidence from speech perception and production

Marc Allassonnière-Tang, Ying-Chun Chen, Nai-Shing Yen, and 1 more author

Journal of Chinese Linguistics 2021

The formal structure of the construction formed by a numeral (Num), a sortal classifier (C) or mensural classifier (M), and a noun (N), is controversial, as both left-branching [[Num C/M] N] and right-branching [Num [C/M N]] structures have been argued for in the literature. In this paper we report two psycholinguistic experiments on speech production and perception in Mandarin to investigate this branching issue. First, we applied the syntax-phonology interface of tone 3 (T3) sandhi and performed a phonological analysis of native speakers’ tone sandhi patterns of [Num C/M N] phrases composed of T3 monosyllabic words. Second, we conducted a click-detection experiment to see how native speakers would perceive a click inserted in a C/M phrase composed of monosyllabic words, as compared to when it is inserted in other syntactic structures with attested left- or right-branching. Results from both experiments supported the left-branching structure of classifier phrases.
@article{allassonniere-tang_investigating_2021, title = {Investigating the branching of {Chinese} classifier phrases: {Evidence} from speech perception and production}, volume = {49}, copyright = {All rights reserved}, doi = {10.1353/jcl.2021.0003}, language = {en}, number = {1}, urldate = {2020-10-20}, journal = {Journal of Chinese Linguistics}, author = {Allassonnière-Tang, Marc and Chen, Ying-Chun and Yen, Nai-Shing and Her, One-Soon}, year = {2021}, pages = {71--105}, }
Syllable Complexity and Morphological Synthesis: A Well-Motivated Positive Complexity Correlation Across Subdomains

Shelece Easterday, Matthew Stave, Marc Allassonnière-Tang, and 1 more author

Frontiers in Psychology 2021

Relationships between phonological and morphological complexity have long been proposed in the linguistic literature, with empirical investigations often seeking complexity trade-offs. Positive complexity correlations tend not to be viewed in terms of motivations. We argue that positive complexity correlations can be diachronically well-motivated, emerging from crosslinguistically prevalent processes of language change. We examine the correlation between syllable complexity and morphological synthesis, hypothesizing that the process of grammaticalization motivates a positive relationship between the two features. To test this, we conduct a typological survey of 95 diverse languages and a corpus study of 21 languages with substantive (predominantly >10,000 words) corpora from the DoReCo project. The first study establishes a significant positive correlation between syllable complexity, measured in terms of maximal syllable patterns, and the index of synthesis (morpheme/word ratio). The second study tests the hypothesis that the relationship between syllable complexity and synthesis holds at local (word-initial and word-final) levels and within noun and verb types, as predicted by a grammaticalization account. While the findings of the corpus study are limited in their statistical power, the observed tendencies are consistent with our predictions. This study contributes important findings to the complexity literature, as well as a novel method which incorporates broad typological sampling and deep corpus analysis.
@article{easterday_syllable_2021, title = {Syllable {Complexity} and {Morphological} {Synthesis}: {A} {Well}-{Motivated} {Positive} {Complexity} {Correlation} {Across} {Subdomains}}, volume = {12}, copyright = {All rights reserved}, issn = {1664-1078}, doi = {10.3389/fpsyg.2021.638659}, journal = {Frontiers in Psychology}, author = {Easterday, Shelece and Stave, Matthew and Allassonnière-Tang, Marc and Seifart, Frank}, year = {2021}, pages = {583}, }
Testing Semantic Dominance in Mian Gender: Three Machine Learning Models

Marc Allassonnière-Tang, Dunstan Brown, and Sebastian Fedden

Oceanic Linguistics 2021

The Trans-New Guinea language Mian has a four-valued gender system that has been analyzed in detail as semantic. This means that the principles of gender assignment are based on the meaning of the noun. Languages with purely semantic systems are at one end of a spectrum of possible assignment types, while others are assumed to have both semantic and formal (i.e., phonology- or morphology-based) assignment. Given the possibility of gender assignment by both semantic and formal principles, it is worthwhile testing the empirical validity of the categorization of the Mian system as predominantly semantic. Here, we apply three machine learning models to determine independently what role semantics and phonology play in predicting Mian gender. Information about the formal and semantic features of nouns is extracted automatically from a dictionary. Different types of computational classifiers are trained to predict the grammatical gender of nouns, and the performance of the computational classifiers is used to assess the relevance of form and semantics in relation to gender prediction. The results show that semantics is dominant in predicting the gender of nouns in Mian. While it validates the original analysis of the Mian system, it also provides further evidence that claims of an equal contribution of form-based and semantic features in gender assignment do not hold for at least a proper subset of languages with gender.
@article{allassonniere-tang_testing_2021, title = {Testing {Semantic} {Dominance} in {Mian} {Gender}: {Three} {Machine} {Learning} {Models}}, volume = {60}, issn = {1527-9421}, shorttitle = {Testing {Semantic} {Dominance} in {Mian} {Gender}}, doi = {10.1353/ol.2021.0018}, language = {en}, number = {2}, urldate = {2022-12-25}, journal = {Oceanic Linguistics}, author = {Allassonnière-Tang, Marc and Brown, Dunstan and Fedden, Sebastian}, year = {2021}, pages = {302--334}, }
Interindividual Variation Refuses to Go Away: A Bayesian Computer Model of Language Change in Communicative Networks

Mathilde Josserand, Marc Allassonnière-Tang, François Pellegrino, and 1 more author

Frontiers in Psychology 2021

Treating the speech communities as homogeneous entities is not an accurate representation of reality, as it misses some of the complexities of linguistic interactions. Inter-individual variation and multiple types of biases are ubiquitous in speech communities, regardless of their size. This variation is often neglected due to the assumption that “majority rules,” and that the emerging language of the community will override any such biases by forcing the individuals to overcome their own biases, or risk having their use of language being treated as “idiosyncratic” or outright “pathological.” In this paper, we use computer simulations of Bayesian linguistic agents embedded in communicative networks to investigate how biased individuals, representing a minority of the population, interact with the unbiased majority, how a shared language emerges, and the dynamics of these biases across time. We tested different network sizes (from very small to very large) and types (random, scale-free, and small-world), along with different strengths and types of bias (modeled through the Bayesian prior distribution of the agents and the mechanism used for generating utterances: either sampling from the posterior distribution [“sampler”] or picking the value with the maximum probability [“MAP”]). The results show that, while the biased agents, even when being in the minority, do adapt their language by going against their a priori preferences, they are far from being swamped by the majority, and instead the emergent shared language of the whole community is influenced by their bias.
@article{josserand_interindividual_2021, title = {Interindividual {Variation} {Refuses} to {Go} {Away}: {A} {Bayesian} {Computer} {Model} of {Language} {Change} in {Communicative} {Networks}}, volume = {12}, issn = {1664-1078}, shorttitle = {Interindividual {Variation} {Refuses} to {Go} {Away}}, doi = {10.3389/fpsyg.2021.626118}, urldate = {2022-12-25}, journal = {Frontiers in Psychology}, author = {Josserand, Mathilde and Allassonnière-Tang, Marc and Pellegrino, François and Dediu, Dan}, month = jun, year = {2021}, pages = {2176}, }
The Diversity of Classifier Inventory in Mandarin Dialects: A Case Study of Baoding

Na Song and Marc Allassonnière-Tang

Faits de Langues 2021

Our study compares Standard Mandarin (the Beijing dialect used in spoken and written registers) with the Mandarin dialect of Baoding (one of the Mandarin dialects belonging to the Jì-lŭ Mandarin group, Hebei-Shandong). Standard Mandarin and Baoding are geographically and phylogenetically closely related, but they differ in terms of their classifier system, as Standard Mandarin resorts to a wide array of sortal classifiers whereas Baoding only uses one general classifier. We first provide a detailed analysis of the unconventional classifier system in Baoding. Then, we compare the lexical and discourse functions of sortal classifiers in Standard Mandarin and Baoding. We show that Standard Mandarin does present a certain level of convergence with its geographical neighbour Baoding. However, these varieties also display significant divergences, as several lexical and discourse functions typically associated with classifier systems cannot be fulfilled by the only classifier found in Baoding.
@article{song_diversity_2021, title = {The {Diversity} of {Classifier} {Inventory} in {Mandarin} {Dialects}: {A} {Case} {Study} of {Baoding}}, volume = {52}, issn = {1244-5460, 1958-9514}, shorttitle = {The {Diversity} of {Classifier} {Inventory} in {Mandarin} {Dialects}}, doi = {10.1163/19589514-05202001}, number = {2}, urldate = {2022-12-25}, journal = {Faits de Langues}, author = {Song, Na and Allassonnière-Tang, Marc}, month = nov, year = {2021}, pages = {115--132}, }
Topic modelling on archive documents from the 1970s: global policies on refugees

Philip Grant, Ratan Sebastian, Marc Allassonnière-Tang, and 1 more author

Digital Scholarship in the Humanities 2021

This study conducts a historical analysis of global policies on refugees within typewritten and digitally born documents (c. 55,000 pages) from international and national archives. The data originate from the 1970s and are stored in archives from the UK and US governments, plus the United Nations High Commissioner for Refugees (UNHCR). The overarching theme is to analyse the involvement of the UK, the USA, and the UNHCR in different refugee cases that occurred during the 1970s. To do so, we (1) identify the main topics in each document; (2) investigate the transmission of topics horizontally (between organizations) and vertically (through time); and (3) suggest targeted areas of the document set for further close reading by historians. Standard Optical Character Recognition and object detection are used to extract information from documents and categorize them. Then, natural language processing (NLP) methods like topic modelling and clustering are used to identify topics and the relationships between them across time. The results identify several main themes covered by different organizations and how the focus of each organization changes diachronically. Besides its academic contribution, this study also demonstrates how, through the use of existing techniques with limited customization, digital technologies in the hands of the historian can augment and complement qualitative methods in bringing to light the themes and trends demonstrated in large bodies of historical documents.
@article{grant_topic_2021, title = {Topic modelling on archive documents from the 1970s: global policies on refugees}, volume = {36}, issn = {2055-7671, 2055-768X}, shorttitle = {Topic modelling on archive documents from the 1970s}, doi = {10.1093/llc/fqab018}, language = {en}, number = {4}, urldate = {2022-12-25}, journal = {Digital Scholarship in the Humanities}, author = {Grant, Philip and Sebastian, Ratan and Allassonnière-Tang, Marc and Cosemans, Sara}, month = oct, year = {2021}, pages = {886--904}, }
What conditions tone paradigms in Yukuna: Phonological and machine learning approaches

Magdalena Lemus-Serrano, Marc Allassonnière-Tang, and Dan Dediu

Glossa: a journal of general linguistics 2021

Yukuna is an understudied Arawak language of North-West Amazonia with a privative tonal system. In this system, roots are underlyingly specified for tone, whilst affixes are toneless. However, affixation interacts with tone, leading to many variations in surface tonal patterns. This paper puts forth a qualitative analysis of Yukuna’s tonal system, and provides data-driven evidence in favor of this analysis using machine learning methods. More precisely, we use decision trees and random forests to assess quantitatively the predictions of the phonological analysis. A manually annotated corpus of verbal paradigms was split into a training and a testing set. We trained the computational classifiers on the first and tested their predictions on the second. We found that they predict the majority of the patterns and support the qualitative analysis. Additionally, they suggest avenues for enhancing the phonological analysis, by providing a ranking of the variables that highlight statistical tendencies within tonal patterns. Besides its contribution to understanding tonal systems in general and that of Yukuna in particular, our work also suggests that such machine learning approaches might become part of the complex theoretical and methodological toolkit needed for language description and linguistic theory development.
@article{lemus-serrano_what_2021, title = {What conditions tone paradigms in {Yukuna}: {Phonological} and machine learning approaches}, volume = {6}, issn = {2397-1835}, shorttitle = {What conditions tone paradigms in {Yukuna}}, doi = {10.5334/gjgl.1276}, number = {1}, urldate = {2022-12-25}, journal = {Glossa: a journal of general linguistics}, author = {Lemus-Serrano, Magdalena and Allassonnière-Tang, Marc and Dediu, Dan}, month = may, year = {2021}, pages = {1--22}, }
A corpus study of lexical speech errors in Mandarin

I-Ping Wan and Marc Allassonnière-Tang

Taiwan Journal of Linguistics 2021

We investigate a corpus of lexical substitution speech errors in Mandarin conversation data and present how Mandarin speakers produce erroneous lexical items and how these items are related to the intended words. The corpus includes 747 lexical speech errors from 100 participants and applies the part-of-speech definition of the Academia Sinica Corpus. Our results partially match with the observations in Germanic and Romance languages. As an example, the data from Mandarin native speakers shows that erroneously produced words and target words are almost always found in the same parts of speech. Moreover, noun substitutions are the most common type of substitution within the majority of content word pairs. However, the occurrence of verb errors is higher in Mandarin than in other languages, possibly reflecting a word frequency effect.
@article{wan_corpus_2021, title = {A corpus study of lexical speech errors in {Mandarin}}, volume = {19}, doi = {10.6519/TJL.202107_19(2).0003}, number = {2}, journal = {Taiwan Journal of Linguistics}, author = {Wan, I-Ping and Allassonnière-Tang, Marc}, year = {2021}, pages = {87--120}, }
An empirical study on the contribution of formal and semantic features to the grammatical gender of nouns

Ali Basirat, Marc Allassonnière-Tang, and Aleksandrs Berdicevskis

Linguistics Vanguard 2021

This study conducts an experimental evaluation of two hypotheses about the contributions of formal and semantic features to the grammatical gender assignment of nouns. One of the hypotheses (Corbett and Fraser 2000) claims that semantic features dominate formal ones. The other hypothesis, formulated within the optimal gender assignment theory (Rice 2006), states that form and semantics contribute equally. Both hypotheses claim that the combination of formal and semantic features yields the most accurate gender identification. In this paper, we operationalize and test these hypotheses by trying to predict grammatical gender using only character-based embeddings (that capture only formal features), only context-based embeddings (that capture only semantic features) and the combination of both. We performed the experiment using data from three languages with different gender systems (French, German and Russian). Formal features are a significantly better predictor of gender than semantic ones, and the difference in prediction accuracy is very large. Overall, formal features are also significantly better than the combination of form and semantics, but the difference is very small and the results for this comparison are not entirely consistent across languages.
@article{basirat_empirical_2021, title = {An empirical study on the contribution of formal and semantic features to the grammatical gender of nouns}, volume = {7}, issn = {2199-174X}, doi = {10.1515/lingvan-2020-0048}, language = {en}, number = {1}, urldate = {2022-12-25}, journal = {Linguistics Vanguard}, author = {Basirat, Ali and Allassonnière-Tang, Marc and Berdicevskis, Aleksandrs}, month = jan, year = {2021}, pages = {20200048}, }
Classifiers in Morphology

Marcin Kilarski and Marc Allassonnière-Tang

In Oxford Research Encyclopedia of Linguistics 2021

Classifiers are partly grammaticalized systems of classification of nominal referents. The choice of a classifier can be based on such criteria as animacy, sex, material, and function as well as physical properties such as shape, size, and consistency. Such meanings are expressed by free or bound morphemes in a variety of morphosyntactic contexts, on the basis of which particular subtypes of classifiers are distinguished. These include the most well-known numeral classifiers which occur with numerals or quantifiers, as in Mandarin Chinese yí liàng chē (one clf.vehicle car) ‘one car’. The other types of classifiers are found in contexts other than quantification (noun classifiers), in possessive constructions (possessive classifiers), in verbs (verbal classifiers), as well as with deictics (deictic classifiers) and in locative phrases (locative classifiers). Classifiers are found in languages of diverse typological profiles, ranging from the analytic languages of Southeast Asia and Oceania to the polysynthetic languages of the Americas. Classifiers are also found in other modalities (i.e., sign languages and writing systems).
@incollection{aronoff_classifiers_2021, address = {Oxford}, title = {Classifiers in {Morphology}}, isbn = {978-0-19-938465-5}, language = {en}, urldate = {2021-07-03}, booktitle = {Oxford {Research} {Encyclopedia} of {Linguistics}}, publisher = {Oxford University Press}, author = {Kilarski, Marcin and Allassonnière-Tang, Marc}, editor = {Aronoff, Mark}, year = {2021}, doi = {10.1093/acrefore/9780199384655.013.546}, pages = {1--28}, }
Classifiers in Southeast Asian languages

Alice Vittrant, and Marc Allassonnière-Tang

In The Languages and Linguistics of Mainland Southeast Asia 2021

Classifiers are one of the types of nominal classification systems that help speakers to identify discourse referents. They are commonly found in Southeast Asian languages, which motivates the geographical focus of this chapter. Given the semantic as well as the morphosyntactic overlap between the various systems, classifier devices are first presented in the context of all systems of nominal classification. Then, the analysis focuses on the different constructional subtypes of classifiers and discusses their origin along with how they are used by speakers in discourse.
@incollection{sidwell_classifiers_2021, address = {Berlin}, title = {Classifiers in {Southeast} {Asian} languages}, isbn = {978-3-11-055814-2}, urldate = {2022-12-25}, booktitle = {The {Languages} and {Linguistics} of {Mainland} {Southeast} {Asia}}, publisher = {De Gruyter}, author = {Vittrant, Alice and Allassonnière-Tang, Marc}, editor = {Sidwell, Paul and Jenny, Mathias}, month = aug, year = {2021}, doi = {10.1515/9783110558142-031}, pages = {733--772}, }
The Effect of Word Frequency and Position-in-Utterance in Mandarin Speech Errors: A Connectionist Model of Speech Production

I.-Ping Wan, and Marc Allassonnière-Tang

In Chinese Lexical Semantics 2021

The connectionist model of speech processing predicts that word frequency and position-in-utterance play a major role in the occurrence of speech errors. First, words that are not frequently used are more likely to result in speech errors since they generally receive less activation than frequently occurring words and require more activation to be chosen. Second, speech errors are more likely to occur near the end of utterances since, according to the given-before-new principle, utterance-final words convey new information that has not yet been activated in the preceding context. Information on word frequency and position-in-utterance is extracted automatically from 382 utterances of a Mandarin speech error corpus and fed to generalized linear mixed models and a decision-tree-based classifier. The results show that word frequency and position-in-utterance can predict the occurrence of speech errors with a performance above (but close to) the majority baseline. Therefore, additional information is required to improve the accuracy of the predictions.
@incollection{liu_effect_2021, address = {Cham}, title = {The {Effect} of {Word} {Frequency} and {Position}-in-{Utterance} in {Mandarin} {Speech} {Errors}: {A} {Connectionist} {Model} of {Speech} {Production}}, isbn = {978-3-030-81196-9 978-3-030-81197-6}, shorttitle = {The effect of word frequency and position-in-utterance in Mandarin speech errors}, language = {en}, urldate = {2022-12-25}, booktitle = {Chinese {Lexical} {Semantics}}, publisher = {Springer International Publishing}, author = {Wan, I.-Ping and Allassonnière-Tang, Marc}, editor = {Liu, Meichun and Kit, Chunyu and Su, Qi}, year = {2021}, doi = {10.1007/978-3-030-81197-6_42}, note = {Series Title: Lecture Notes in Computer Science}, pages = {491--500}, }
Keyword Spotting: A quick-and-dirty method for extracting typological features of language from grammatical descriptions

Harald Hammarström, One-Soon Her, and Marc Allassonnière-Tang

In Proceedings of the Swedish Language Technology Conference 2021

Starting from a large collection of digitized raw-text descriptions of languages of the world, we address the problem of extracting information of interest to linguists from them. We describe a general technique to extract properties of the described languages associated with a specific term. The technique is simple to implement, simple to explain, requires no training data or annotation, and requires no manual tuning of thresholds. The results are evaluated on a large gold standard database on classifiers with accuracy results that match or exceed human inter-coder agreement on similar tasks. Although accuracy is competitive, the method may still be enhanced by a more rigorous probabilistic background theory and the use of extant NLP tools for morphological variants, collocations and vector-space semantics.
@inproceedings{hammarstrom_keyword_2021, address = {Copenhagen}, title = {Keyword {Spotting}: {A} quick-and-dirty method for extracting typological features of language from grammatical descriptions}, copyright = {All rights reserved}, booktitle = {Proceedings of the {Swedish} {Language} {Technology} {Conference}}, publisher = {Northern European Journal of Language Technology}, author = {Hammarström, Harald and Her, One-Soon and Allassonnière-Tang, Marc}, year = {2021}, pages = {27--34}, }

2020

The evolutionary trends of grammatical gender in Indo-Aryan languages

Marc Allassonnière-Tang, and Michael Dunn

Language Dynamics and Change 2020

This paper infers the processes of development and change of grammatical gender in Indo-Aryan languages using phylogenetic comparative methods. 48 Indo-Aryan languages are coded based on 44 presence-absence features relating to gender marking on the verbs, adjectives, personal pronouns, demonstrative pronouns, and possessive pronouns. A Bayesian Reversible Jump Hyper Prior analysis, which infers the evolutionary dynamics of changes between feature values, gives results that are consistent with historical linguistic and typological studies on gender systems in Indo-Aryan languages and predicts the evolutionary trends of the features included in the dataset.
@article{allassonniere-tang_evolutionary_2020, title = {The evolutionary trends of grammatical gender in {Indo}-{Aryan} languages}, volume = {11}, issn = {2210-5824, 2210-5832}, doi = {10.1163/22105832-bja10011}, number = {2}, urldate = {2022-12-25}, journal = {Language Dynamics and Change}, author = {Allassonnière-Tang, Marc and Dunn, Michael}, month = jul, year = {2020}, pages = {211--240}, }
A Statistical Explanation of the Distribution of Sortal Classifiers in Languages of the World via Computational Classifiers

One-Soon Her, and Marc Tang

Journal of Quantitative Linguistics 2020

Previous studies demonstrate that morphosyntactic plural markers and the structure of numeral systems have individually strong predictive power with regard to the usage of sortal classifiers in languages. We use these two factors as explanatory variables to train the computational classifier of random forests and evaluate the accuracy of their predictive power when selecting the presence/absence of sortal classifiers as the response variable. Our results show that these two factors result in an excellent discrimination performance of random forests, even when taking into account sortal classifiers as an areal feature. However, the correlation between morphosyntactic plural markers and multiplicative bases is weaker than the correlation between sortal classifiers and plural markers plus multiplicative bases. We are thus able to provide novel insights with regard to probabilistic universals on sortal classifiers, and suggest an innovative cross-disciplinary approach to test the effect of implicational universals with computational methods.
@article{her_statistical_2020, title = {A {Statistical} {Explanation} of the {Distribution} of {Sortal} {Classifiers} in {Languages} of the {World} via {Computational} {Classifiers}}, volume = {27}, issn = {0929-6174, 1744-5035}, doi = {10.1080/09296174.2018.1523777}, language = {en}, number = {2}, urldate = {2022-12-25}, journal = {Journal of Quantitative Linguistics}, author = {Her, One-Soon and Tang, Marc}, month = apr, year = {2020}, pages = {93--113}, }
Functions of gender and numeral classifiers in Nepali

Marc Allassonnière-Tang, and Marcin Kilarski

Poznan Studies in Contemporary Linguistics 2020

We examine the complex nominal classification system in Nepali (Indo-European, Indic), a language spoken at the intersection of the Indo-European and Sino-Tibetan language families, which are usually associated with prototypical examples of grammatical gender and numeral classifiers, respectively. In a typologically rare pattern, Nepali possesses two gender systems based on the human/non-human and masculine/feminine oppositions, in addition to which it has also developed an inventory of at least ten numeral classifiers as a result of contact with neighbouring Sino-Tibetan languages. Based on an analysis of the lexical and discourse functions of the three systems, we show that their functional contribution involves a largely complementary distribution of workload with respect to individual functions as well as the type of categorized nouns and referents. The study thus contributes to the ongoing discussions concerning the typology and functions of nominal classification as well as the effects of long-term language contact on language structure.
@article{allassonniere-tang_functions_2020, title = {Functions of gender and numeral classifiers in {Nepali}}, volume = {56}, copyright = {All rights reserved}, issn = {0137-2459, 1897-7499}, doi = {10.1515/psicl-2020-0004}, number = {1}, urldate = {2020-03-29}, journal = {Poznan Studies in Contemporary Linguistics}, author = {Allassonnière-Tang, Marc and Kilarski, Marcin}, year = {2020}, pages = {113--168}, }
A simple introduction to programming and statistics with decision trees in R

Marc Tang

Teaching Statistics 2020

University students in other disciplines without prior knowledge of statistics and/or programming are introduced to the statistical method of decision trees in the programming language R during a 45-minute teaching and practice session. Statistics and programming skills are now frequently required within a wide variety of research fields and private industries. However, students unfamiliar with these subjects may be reluctant to join a full course because of time constraints, workload, other commitments, or a belief that it is not for them. The proposed session is short and can be used as an ice-breaker to give students a basic understanding of running statistical models in a programming language.
@article{tang_simple_2020, title = {A simple introduction to programming and statistics with decision trees in {R}}, volume = {42}, copyright = {All rights reserved}, issn = {0141-982X, 1467-9639}, doi = {10.1111/test.12210}, language = {en}, number = {2}, urldate = {2020-03-29}, journal = {Teaching Statistics}, author = {Tang, Marc}, month = feb, year = {2020}, pages = {36--40}, }
Numeral base, numeral classifier, and noun: Word order harmonization

Marc Allassonnière-Tang, and One-Soon Her

Language and Linguistics 2020

Greenberg (1990a: 292) suggests that classifiers (clf) and numeral bases tend to harmonize in word order, i.e. a numeral (Num) with a base-final [n base] order appears in a clf-final [Num clf] order, e.g. in Mandarin Chinese, san1-bai3 (three hundred) ‘300’ and san1 zhi1 gou3 (three clf.animal dog) ‘three dogs’, and a base-initial [base n] Num appears in a clf-initial [clf Num] order, e.g. in Kilivila (Eastern Malayo-Polynesian, Oceanic), akatu-tolu (hundred three) ‘300’ and na-tolu yena (clf.animal-three fish) ‘three fish’. In non-classifier languages, base and noun (N) tend to harmonize in word order. We propose that harmonization between clf and N should also obtain. A detailed statistical analysis of a geographically and phylogenetically weighted set of 400 languages shows that the harmonization of word order between numeral bases, classifiers, and nouns is statistically highly significant, as only 8.25% (33/400) of the languages display violations, which are mostly located at the meeting points between head-final and head-initial languages, indicating that language contact is the main factor in the violations to the probabilistic universals.
@article{allassonniere-tang_numeral_2020, title = {Numeral base, numeral classifier, and noun: {Word} order harmonization}, volume = {21}, copyright = {All rights reserved}, issn = {1606-822X, 2309-5067}, shorttitle = {Numeral base, numeral classifier, and noun}, doi = {10.1075/lali.00069.all}, language = {en}, number = {4}, urldate = {2020-09-30}, journal = {Language and Linguistics}, author = {Allassonnière-Tang, Marc and Her, One-Soon}, month = sep, year = {2020}, pages = {511--556}, }
Sociocultural gender in nominal classification: A study of grammatical gender

Marc Allassonnière-Tang, and Hiram Ring

Indian Linguistics 2020

We analyse how sociocultural gender can be reflected through grammatical gender and select Hindi (Indo-European) and Pnar (Austroasiatic) as case studies. We demonstrate that these grammatical gender systems share universal tendencies based on human cognition, i.e. associating long, thin, and vertical objects with masculine grammatical gender whereas round, flat, horizontal ones are associated with feminine grammatical gender. We also show that these grammatical gender systems reflect the sociocultural values of the language speakers. Speakers of Hindi maintain a patrilineal kinship system, and in their language objects of large size are generally assigned to the masculine gender. Pnar kinship is matrilineal and in the language large-sized objects tend to be associated with feminine gender. Similar asymmetries are observed with regard to generic gender and gender reversal. These results contribute to the understanding of the impact of universal cognitive principles and culture on grammatical structures by showing that both tendencies are not necessarily complementary and that they can co-exist in the same language.
@article{allassonniere-tang_sociocultural_2020, title = {Sociocultural gender in nominal classification: {A} study of grammatical gender}, volume = {81}, copyright = {All rights reserved}, number = {1-2}, journal = {Indian Linguistics}, author = {Allassonnière-Tang, Marc and Ring, Hiram}, year = {2020}, pages = {43--62}, }
Cross-lingual Embeddings Reveal Universal and Lineage-Specific Patterns in Grammatical Gender Assignment

Hartger Veeman, Marc Allassonnière-Tang, Aleksandrs Berdicevskis, and 1 more author

In Proceedings of the 24th Conference on Computational Natural Language Learning 2020

Grammatical gender is assigned to nouns differently in different languages. Are all factors that influence gender assignment idiosyncratic to languages or are there any that are universal? Using cross-lingual aligned word embeddings, we perform two experiments to address these questions about language typology and human cognition. In both experiments, we predict the gender of nouns in language X using a classifier trained on the nouns of language Y, and take the classifier’s accuracy as a measure of transferability of gender systems. First, we show that for 22 Indo-European languages the transferability decreases as the phylogenetic distance increases. This correlation supports the claim that some gender assignment factors are idiosyncratic, and as the languages diverge, the proportion of shared inherited idiosyncrasies diminishes. Second, we show that when the classifier is trained on two Afro-Asiatic languages and tested on the same 22 Indo-European languages (or vice versa), its performance is still significantly above the chance baseline, thus showing that universal factors exist and, moreover, can be captured by word embeddings. When the classifier is tested across families and on inanimate nouns only, the performance is still above baseline, indicating that the universal factors are not limited to biological sex.
@inproceedings{veeman_cross-lingual_2020, address = {Online}, title = {Cross-lingual {Embeddings} {Reveal} {Universal} and {Lineage}-{Specific} {Patterns} in {Grammatical} {Gender} {Assignment}}, copyright = {All rights reserved}, booktitle = {Proceedings of the 24th {Conference} on {Computational} {Natural} {Language} {Learning}}, publisher = {Association for Computational Linguistics}, author = {Veeman, Hartger and Allassonnière-Tang, Marc and Berdicevskis, Aleksandrs and Basirat, Ali}, month = nov, year = {2020}, pages = {265--275}, }

2019

Word order of numeral classifiers and numeral bases

One-Soon Her, Marc Tang, and Bing-Tsiong Li

STUF - Language Typology and Universals 2019

In a numeral classifier language, a sortal classifier (C) or a mensural classifier (M) is needed when a noun is quantified by a numeral (Num). Num and C/M are adjacent cross-linguistically, either in a [Num C/M] order or [C/M Num]. Likewise, in a complex numeral with a multiplicative composition, the base may follow the multiplier as in [n×base], e.g., san-bai ‘three hundred’ in Mandarin. However, the base may also precede the multiplier in some languages, thus [base×n]. Interestingly, base and C/M seem to harmonize in word order, i.e., [n×base] numerals appear with a [Num C/M] alignment, and [base×n] numerals, with [C/M Num]. This paper follows up on the explanation of the base-C/M harmonization based on the multiplicative theory of classifiers and verifies it empirically within six language groups in the world’s foremost hotbed of classifier languages: Sinitic, Miao-Yao, Austro-Asiatic, Tai-Kadai, Tibeto-Burman, and Indo-Aryan. Our survey further reveals two interesting facts: base-initial ([base×n]) and C/M-initial ([C/M Num]) orders exist only in Tibeto-Burman (TB) within our dataset. Moreover, the few scarce violations to the base-C/M harmonization are also all in TB and are mostly languages having maintained their original base-initial numerals but borrowed from their base-final and C/M-final neighbors. We thus offer an explanation based on Proto-TB’s base-initial numerals and language contact with neighboring base-final, C/M-final languages.
@article{her_word_2019, title = {Word order of numeral classifiers and numeral bases}, volume = {72}, copyright = {All rights reserved}, issn = {1867-8319, 2196-7148}, doi = {10.1515/stuf-2019-0017}, number = {3}, urldate = {2019-10-03}, journal = {STUF - Language Typology and Universals}, author = {Her, One-Soon and Tang, Marc and Li, Bing-Tsiong}, month = sep, year = {2019}, pages = {421--452}, }
Insights on the Greenberg-Sanches-Slobin generalization: Quantitative typological data on classifiers and plural markers

Marc Tang, and One-Soon Her

Folia Linguistica 2019

This paper offers quantitative typological data to investigate a revised version of the Greenberg-Sanches-Slobin generalization (GSSG), which states that (a) a language is unlikely to have both sortal classifiers and morphosyntactic plural markers, and (b) if a language does have both, then their use is in complementary distribution. Morphosyntactic plurals engage in grammatical agreement outside the noun phrase, while morphosemantic plurals that relate to collective and associative marking do not. A database of 400 phylogenetically and geographically weighted languages was created to test this generalization. The statistical test of conditional inference trees was applied to investigate the effect of areal, phylogenetic, and linguistic factors on the distribution of classifiers and morphosyntactic plural markers. The results show that the presence of classifiers is affected by areal factors as most classifier languages are concentrated in Asia. Yet, the low ratio of languages with both features simultaneously is still statistically significant. Part (a) of the GSSG can thus be seen as a statistical universal. We then look into the few languages that do have both features and tentatively conclude that part (b) also seems to hold but further investigation into some of these languages is needed.
@article{tang_insights_2019, title = {Insights on the {Greenberg}-{Sanches}-{Slobin} generalization: {Quantitative} typological data on classifiers and plural markers}, volume = {53}, copyright = {All rights reserved}, doi = {10.1515/flin-2019-2013}, number = {2}, journal = {Folia Linguistica}, author = {Tang, Marc and Her, One-Soon}, year = {2019}, pages = {297--331}, }
A typology of classifiers and gender: From description to computation

Marc Tang

In Acta Universitatis Upsaliensis 2019

Categorization is one of the most relevant tasks realized by humans during their life, as we consistently need to categorize the things and experiences that we encounter. Such a need is reflected in language via various mechanisms, the most prominent being nominal classification systems (e.g., grammatical gender such as the masculine/feminine distinction in French). Typological methods are used to investigate the underlying functions and structures of such systems, using a wide variety of cross-linguistic data to examine universality and variability. This analysis is itself a classification task, as languages are categorized and clustered according to their grammatical features. This thesis provides a cross-linguistic typological analysis of nominal classification systems and in parallel compares a number of quantitative methods that can be applied at different scales. First, this thesis provides an analysis of nominal classification systems (i.e., gender and classifiers) via the description of three languages with respectively gender, classifiers, and both. While the analyses of the first two languages are more descriptive in nature and align with findings in the existing literature, the third language provides novel insights into the typology of nominal classification systems by demonstrating how classifiers and gender may co-occur in one language in terms of distribution of functions. Second, the underlying logic of nominal classification systems is commonly considered difficult to investigate, e.g., is there a consistent logic behind gender assignment in language? Is it possible to explain the distribution of classifier languages of the world while taking into account geographical and genealogical effects?
This thesis addresses the lack of arbitrariness of nominal classification systems at three different scales: The distribution of classifiers at the worldwide level, the presence of gender within a language family, and gender assignment at the language-internal level. The methods of random forests, phylogenetics, and word embeddings with neural networks are selected since they are respectively applicable at three different scales of research questions (worldwide, family-internal, language-internal).
@book{tang_typology_2019, address = {Uppsala}, series = {Studia {Linguistica} {Upsaliensia}}, title = {A typology of classifiers and gender: {From} description to computation}, copyright = {All rights reserved}, isbn = {978-91-513-0507-3}, number = {23}, publisher = {Acta Universitatis Upsaliensis}, author = {Tang, Marc}, year = {2019}, keywords = {Språkvetenskap, Genus (språkvetenskap), Neuronnät (datorer)}, }
Linguistic Information in Word Embeddings

Ali Basirat, and Marc Tang

In Agents and Artificial Intelligence 2019

We study the presence of linguistically motivated information in the word embeddings generated with statistical methods. The nominal aspects of uter/neuter, common/proper, and count/mass in Swedish are selected to represent respectively grammatical, semantic, and mixed types of nominal categories within languages. Our results indicate that typical grammatical and semantic features are easily captured by word embeddings. The classification of semantic features required significantly fewer neurons than grammatical features in our experiments based on a single layer feed-forward neural network. However, semantic features also generated higher entropy in the classification output despite their high accuracy. Furthermore, the count/mass distinction posed difficulties for the model, even though the quantity of neurons was almost tuned to its maximum.
@incollection{van_den_herik_linguistic_2019, address = {Cham}, title = {Linguistic {Information} in {Word} {Embeddings}}, volume = {11352}, copyright = {All rights reserved}, isbn = {978-3-030-05452-6 978-3-030-05453-3}, urldate = {2019-09-25}, booktitle = {Agents and {Artificial} {Intelligence}}, publisher = {Springer International Publishing}, author = {Basirat, Ali and Tang, Marc}, editor = {van den Herik, Jaap and Rocha, Ana Paula}, year = {2019}, doi = {10.1007/978-3-030-05453-3_23}, pages = {492--513}, }
Predicting Speech Errors in Mandarin Based on Word Frequency

Marc Tang, and I-Ping Wan

In From Minimal Contrast to Meaning Construct 2019

This paper investigates the effect of word frequency on the occurrence of speech errors in Mandarin. A corpus of 390 speech errors along with their surrounding linguistic context was gathered. Word frequency information was extracted from the Academia Sinica Corpus. Our analysis with a computational classifier based on conditional inference trees shows that intended words having a frequency lower than words of the surrounding context are more likely to generate speech errors.
@incollection{su_predicting_2019, address = {Singapore}, title = {Predicting {Speech} {Errors} in {Mandarin} {Based} on {Word} {Frequency}}, copyright = {All rights reserved}, isbn = {978-981-329-239-0 978-981-329-240-6}, language = {en}, urldate = {2019-10-01}, booktitle = {From {Minimal} {Contrast} to {Meaning} {Construct}}, publisher = {Springer}, author = {Tang, Marc and Wan, I-Ping}, editor = {Su, Qi and Zhan, Weidong}, year = {2019}, doi = {10.1007/978-981-32-9240-6_20}, pages = {289--303}, }

Review of McGregor & Wichmann (2018) The diachrony of classification systems

Marc Tang

Linguistic Variation 2019

@article{tang_review_2019,
  title = {Review of {McGregor} \& {Wichmann} (2018) {The} diachrony of classification systems},
  volume = {19},
  copyright = {All rights reserved},
  issn = {2211-6834, 2211-6842},
  doi = {10.1075/lv.00012.tan},
  language = {en},
  number = {2},
  urldate = {2019-09-25},
  journal = {Linguistic Variation},
  author = {Tang, Marc},
  year = {2019},
  pages = {386--392},
}

2018

The lexical and discourse functions of grammatical gender in Marathi

Pär Eliasson, and Marc Tang

Journal of South Asian Languages and Linguistics 2018

We provide a functional analysis of the grammatical gender system of Marathi (Indo-Aryan) in Western India. The majority of the new Indo-Aryan languages typically classifies each noun of the lexicon according to biological gender as masculine and feminine. Only a few Indo-Aryan languages such as Marathi diverge in terms of agreement pattern by categorizing nouns as masculine, feminine, and neuter. Yet gender in Marathi has not been extensively described in terms of functions. We thus apply functional typology to analyze grammatical gender in Marathi and provide detailed examples of its lexical and discourse functions.
@article{eliasson_lexical_2018, title = {The lexical and discourse functions of grammatical gender in {Marathi}}, volume = {5}, copyright = {All rights reserved}, issn = {2196-0771, 2196-078X}, doi = {10.1515/jsall-2018-0012}, number = {2}, urldate = {2019-09-25}, journal = {Journal of South Asian Languages and Linguistics}, author = {Eliasson, Pär and Tang, Marc}, month = nov, year = {2018}, pages = {131--157}, }

The dynamics of nominal classification: Productive and lexicalised uses of gender agreement in Mawng by Ruth Singer

Marc Tang

Oceanic Linguistics 2018

@article{tang_dynamics_2018,
  title = {The dynamics of nominal classification: {Productive} and lexicalised uses of gender agreement in {Mawng} by {Ruth} {Singer}},
  volume = {57},
  copyright = {All rights reserved},
  issn = {1527-9421},
  shorttitle = {The dynamics of nominal classification},
  doi = {10.1353/ol.2018.0010},
  language = {en},
  number = {1},
  urldate = {2019-09-25},
  journal = {Oceanic Linguistics},
  author = {Tang, Marc},
  year = {2018},
  pages = {255--260},
}

Lexical and morpho-syntactic features in word embeddings: A case study of nouns in Swedish

Ali Basirat, and Marc Tang

In Proceedings of the 10th International Conference on Agents and Artificial Intelligence 2018

We apply real-valued word vectors combined with two different types of classifiers (linear discriminant analysis and feed-forward neural network) to scrutinize whether basic nominal categories can be captured by simple word embedding models. We also provide a linguistic analysis of the errors generated by the classifiers. The targeted language is Swedish, in which we investigate three nominal aspects: uter/neuter, common/proper, and count/mass. They represent respectively grammatical, semantic, and mixed types of nominal classification within languages. Our results show that word embeddings can capture typical grammatical and semantic features such as uter/neuter and common/proper nouns. Nevertheless, the model has difficulty identifying classes such as count/mass which not only combine both grammatical and semantic properties, but are also subject to conversion and shift. Hence, we answer the call of the Special Session on Natural Language Processing in Artificial Intelligence by approaching the topic of interfaces between morphology, lexicon, semantics, and syntax via interdisciplinary methods combining machine learning of language and general linguistics.
@inproceedings{basirat_lexical_2018, title = {Lexical and morpho-syntactic features in word embeddings: {A} case study of nouns in {Swedish}}, copyright = {All rights reserved}, doi = {10.5220/0006729606630674}, booktitle = {Proceedings of the 10th International Conference on Agents and Artificial Intelligence}, author = {Basirat, Ali and Tang, Marc}, year = {2018}, pages = {663--674}, }
The coalescence of grammatical gender and numeral classifiers in the general classifier wota in Nepali

Marcin Kilarski, and Marc Tang

In Proceedings of the Linguistic Society of America 2018

While nominal classification has received considerable attention, relatively little is known about cross-linguistically rare complex systems. An example is provided by Nepali (Indo-European, Indic), which possesses both grammatical gender and numeral classifiers. Our aim is to examine morphosyntactic and functional properties of the general classifier wota. Unusually, the classifier exhibits gender agreement both in its independent forms and as fused with a numeral, raising questions about its lexical and pragmatic functions. Our study contributes to the typology of nominal classification by proposing a functional approach to cases of complex co-occurrence of gender and classifiers.
@inproceedings{kilarski_coalescence_2018, title = {The coalescence of grammatical gender and numeral classifiers in the general classifier wota in {Nepali}}, copyright = {All rights reserved}, issn = {2473-8689}, doi = {10.3765/plsa.v3i1.4352}, urldate = {2019-09-25}, booktitle = {Proceedings of the Linguistic Society of America}, author = {Kilarski, Marcin and Tang, Marc}, year = {2018}, pages = {56}, }

2017

Explaining the acquisition order of classifiers and measure words via their mathematical complexity

Marc Tang

Journal of Child Language Acquisition and Development 2017

We provide a theoretical explanation for the acquisition of numeral classifiers (sortal classifiers) and measure words (mensural classifiers) in Mandarin Chinese. Previous research in various languages separately observed that the general classifier is acquired before specific classifiers and that classifiers are acquired prior to measure words. However, no theoretical discussion was fully developed and no study combined the general classifier, specific classifiers and measure words in one dataset. We propose to fill these gaps by combining semantic complexity (Brown, 1973) and a mathematical approach (Her, 2012): given that the relative complexity of x, y and z is unknown, x + y is more complex than either x or y, and x + y + z is more complex than any of them. By applying the mathematical approach, it is observed that the general classifier carries the mathematical value of times one, noted x, while specific classifiers possess x plus a semantic value of y, which highlights an inherent feature of the referent. Finally, measure words carry both x and y, along with new information on quantity, z. Therefore, the acquisition order is expected to start from the simplest semanticity and develop toward the most complex, i.e. general classifier (x) > specific classifier (x+y) > measure word (x+y+z). As supporting evidence, we gathered longitudinal data from CHILDES (Child Language Data Exchange System; Zhou, 2008). The participants included 110 children from 1-6 years old, providing a total of 110 conversations of 20 minutes each with 1851 tokens of numeral classifiers and measure words. Our methodology applied the definition of acquisition from Brown (1973) and the equation of Suppliance in Obligatory Context (SOC) cross-checked with Target-Like Usage (TLU) from Pica (1983). The results demonstrated that our model generated correct predictions, serving as a theoretical basis for future studies in the field of language acquisition.
@article{tang_explaining_2017, title = {Explaining the acquisition order of classifiers and measure words via their mathematical complexity}, volume = {5}, copyright = {All rights reserved}, number = {1}, urldate = {2017-05-07}, journal = {Journal of Child Language Acquisition and Development}, author = {Tang, Marc}, year = {2017}, pages = {31--52}, }