- Why we need a gradient approach to word order. Natalia Levshina, Savithry Namboodiripad, Marc Allassonnière-Tang, and 12 more authors. Linguistics 2023
This article argues for a gradient approach to word order, which treats word order preferences, both within and across languages, as a continuous variable. Word order variability should be regarded as a basic assumption, rather than as something exceptional. Although this approach follows naturally from the emergentist usage-based view of language, we argue that it can be beneficial for all frameworks and linguistic domains, including language acquisition, processing, typology, language contact, language evolution and change, and formal approaches. Gradient approaches have been very fruitful in some domains, such as language processing, but their potential is not fully realized yet. This may be due to practical reasons. We discuss the most pressing methodological challenges in corpus-based and experimental research of word order and propose some practical solutions.
- Intra- and inter-speaker variation in eight Russian fricatives. Natalja Ulrich, François Pellegrino, and Marc Allassonnière-Tang. The Journal of the Acoustical Society of America 2023
Acoustic variation is central to the study of speaker characterization. In this respect, specific phonemic classes such as vowels have been particularly studied, compared to fricatives. Fricatives exhibit important aperiodic energy, which can extend over a high-frequency range beyond that conventionally considered in phonetic analyses, often limited up to 12 kHz. We adopt here an extended frequency range up to 20.05 kHz to study a corpus of 15 812 fricatives produced by 59 speakers in Russian, a language offering a rich inventory of fricatives. We extracted two sets of parameters: the first is composed of 11 parameters derived from the frequency spectrum and duration (acoustic set) while the second is composed of 13 mel frequency cepstral coefficients (MFCCs). As a first step, we implemented machine learning methods to evaluate the potential of each set to predict gender and speaker identity. We show that gender can be predicted with a good performance by the acoustic set and even more so by MFCCs (accuracy of 0.72 and 0.88, respectively). MFCCs also predict individuals to some extent (accuracy = 0.64) unlike the acoustic set. In a second step, we provide a detailed analysis of the observed intra- and inter-speaker acoustic variation.
- A corpus-based quantitative study of numeral classifiers in Nepali. Krishna Parajuli and Marc Allassonnière-Tang. Corpus Linguistics and Linguistic Theory 2023
Nepali is typologically rare in terms of nominal classification systems, as it is one of the few languages of the world having simultaneously two gender systems (human/non-human, masculine/feminine) and one numeral classifier system (distinguishing features such as human, round-shaped objects, and long objects among others). Such a rare co-occurrence of different nominal classification systems is highly relevant for investigating linguistic complexity, as languages generally do not have several systems of the same type fulfilling the same functions. However, no corpus-based quantitative analyses have been conducted on the productive use of nominal classification systems in Nepali. The current paper aims at filling this gap by providing a token-based study from the Nepali National Corpus (∼20 million words). Our preliminary results show that there is in fact little formal overlap between the classifier and the gender systems.
- Defining numeral classifiers and identifying classifier languages of the world. One-Soon Her, Harald Hammarström, and Marc Allassonnière-Tang. Linguistics Vanguard 2022
This paper presents a precise definition of numeral classifiers, steps to identify a numeral classifier language, and a database of 3,338 languages, of which 723 languages have been identified as having a numeral classifier system. The database, named World Atlas of Classifier Languages (WACL), has been systematically constructed over the last 10 years via a manual survey of relevant literature and also an automatic scan of digitized grammars followed by manual checking. The open-access release of WACL is thus a significant contribution to linguistic research in providing (i) a precise definition and examples of how to identify numeral classifiers in language data and (ii) the largest dataset of numeral classifier languages in the world. As such it offers researchers a rich and stable data source for conducting typological, quantitative, and phylogenetic analyses on numeral classifiers. The database will also be expanded with additional features relating to numeral classifiers in the future in order to allow more fine-grained analyses.
- The noncausal/causal alternation and genealogical affiliation: Quantitative testing in three Niger-Congo language families. Marc Allassonnière-Tang, Stéphane Robert, and Sylvie Voisin. Linguistique et Langues Africaines 2022
The noncausal/causal alternation is the pairing of two verb forms that refer to the same core event but differ in the absence vs. presence of a causer for this event (e.g. rise vs. raise, open (intr.) vs. open (tr.), die vs. kill). Languages differ in their overall preferences among the possible strategies for coding this alternation. This study uses machine-learning methods (clustering and tree-based computational classifiers) to investigate the predictive power of the noncausal/causal alternation for the genealogical affiliation of 38 languages belonging to the Atlantic, Mande and Mel families. The languages studied here belong to different contact areas in Senegal and its surroundings. The three families are all affiliated to the Niger-Congo phylum but display quite different typological profiles. The present paper elaborates on an earlier study that used a standard list of 18 verb pairs to establish the coding strategies in these languages. Apart from highlighting which coding strategies are favored in each family, our quantitative analyses show that the family affiliation of the 38 languages can be predicted with an accuracy above the majority baseline based on the information of the noncausal/causal alternation in the 18 verb pairs, but that the predictive power of verb pairs 1‑9 is generally lower than the one of verb pairs 10‑18. Our results confirm the hypothesis that the first group of verb pairs shows universal rather than lineage-specific tendencies concerning the noncausal/causal alternation. Furthermore, our analyses identify which of the 18 verb pairs (and their correlated coding strategies) have the highest predictive power. This study opens new avenues for identifying the relevant synchronic data for genealogical classification in historical linguistics. Future studies could replicate the same analysis in different language families to assess if our results are universal or specific to some language families.
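The "majority baseline" mentioned in this abstract can be illustrated with a short sketch. It assumes the usual definition (the accuracy obtained by always predicting the most frequent class); the function name and the toy labels are hypothetical and are not taken from the study's data.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical toy labels (NOT the family counts from the study).
toy_labels = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
baseline = majority_baseline(toy_labels)
print(f"majority baseline accuracy: {baseline:.2f}")
```

A classifier is only informative about family affiliation if its accuracy on held-out languages exceeds this value.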
- On Taiwanese Universities’ Two–One Academic Dismissal Policies: A Quantitative Fairness Analysis of the Four Policies of National Chengchi University. One-Soon Her, Jie-Wen Tsai, and Marc Allassonnière-Tang. Journal of Educational Research and Development 2022
Academic dismissal policies are used by universities worldwide for quality control purposes. Taiwanese universities base their policies solely on the credit fail rate (CFR) of individual semesters (S-CFR). The most common S-CFR is 50% and is called er-yi (two-one), which indicates half or more of the course credits of a semester were failed. Though actual policies vary among universities, their core designs generally rely on the concept of S-CFR. The present study first compares the dismissal policies among universities in the United States, the Netherlands, and Taiwan to demonstrate how the two–one design lacks consultation and review processes. We then argue that the disregard for cumulative grade point average, semester grade point average, and cumulative credit pass rate may lead to bias because it may lead to students with better overall academic performance being dismissed. We further validate the argument by conducting a quantitative analysis of data on the academic performance of students (N=22,703) from National Chengchi University over 11 years under four different policies. Our findings strongly indicate that the core design common in such policies, i.e., the S-CFR, should be reconsidered.
- Predicting grammatical gender in Nakh languages: Three methods compared. Jesse Wichers Schreur, Marc Allassonnière-Tang, Kate Bellamy, and 1 more author. Linguistic Typology at the Crossroads 2022
The Nakh languages Chechen and Tsova-Tush each have a five-valued gender system: masculine, feminine, and three “neuter” genders named for their singular agreement forms: B, D and J. Gender assignment in languages is generally analysed as being dependent on both forms and semantics (e.g. Corbett, 1991), with semantics typically prevailing over form (e.g. Bellamy & Wichers Schreur, 2021, Allassonnière-Tang et al., 2021). Most previous studies have considered only binary or tripartite gender systems possessing masculine, feminine, and neuter values. The five-valued system of Nakh thus represents an innovative and insightful case study for analysing gender assignment. In this paper we build on the existing qualitative linguistic analyses of gender assignment in Tsova-Tush (Wichers Schreur, 2021) and apply three machine-learning methods to investigate the weight of form and semantics in predicting grammatical gender in Chechen and Tsova-Tush. The results show that while both form and semantics are helpful for predicting grammatical gender in Nakh, semantics is dominant, which supports findings from existing literature (Allassonnière-Tang, Brown & Fedden, 2021). However, the results also show that the coded semantic information could be further fine-grained to improve the accuracy of the predictions (see also Plaster et al., 2013). In addition, we discuss the implications of the output for our understanding of language-internal and family-internal processes of language change, including how loanwords are integrated from Russian, a three-gender language.
- Operation LiLi: Using Crowd-Sourced Data and Automatic Alignment to Investigate the Phonetics and Phonology of Less-Resourced Languages. Mathilde Hutin and Marc Allassonnière-Tang. Languages 2022
Less-resourced languages are usually left out of phonetic studies based on large corpora. We contribute to the recent efforts to fill this gap by assessing how to use open-access, crowd-sourced audio data from Lingua Libre for phonetic research. Lingua Libre is a participative linguistic library developed by Wikimedia France in 2015. It contains more than 670k recordings in approximately 150 languages across nearly 740 speakers. As a proof of concept, we consider the Inventory Size Hypothesis, which predicts that, in a given system, variation in the realization of each vowel will be inversely related to the number of vowel categories. We investigate data from 10 languages with various numbers of vowel categories, i.e., German, Afrikaans, French, Catalan, Italian, Romanian, Polish, Russian, Spanish, and Basque. Audio files are extracted from Lingua Libre to be aligned and segmented using the Munich Automatic Segmentation System. Information on the formants of the vowel segments is then extracted to measure how vowels expand in the acoustic space and whether this is correlated with the number of vowel categories in the language. The results provide valuable insight into the question of vowel dispersion and demonstrate the wealth of information that crowd-sourced data has to offer.
- Inferring case paradigms in Koalib with computational classifiers. Nicolas Quint and Marc Allassonnière-Tang. Corpus Linguistics and Linguistic Theory 2022
The object case inflection in Koalib (Niger-Congo) represents complex patterns that involve phoneme position, syllable structure, and tonal pattern. Few attempts have been made with qualitative and quantitative approaches to identify the rules of the object case paradigms in Koalib. In the current study, information on phonemes, tones, and syllables is automatically extracted from a Koalib sample of 2,677 lexemes. The data is then fed to decision-tree-based classifiers to predict the object case paradigms and extract the interactive patterns between the variables. The results improve the predicting accuracy of existing studies and identify the case paradigms predicted by linguistic hypotheses. New case paradigms are also found by the computational classifiers and explained from a linguistic perspective. Our work demonstrates that the combination of linguistic theoretical knowledge with machine learning techniques can become one of the methodological approaches for linguistic analyses.
- The evolutionary trends of noun class systems in Atlantic languages. Neige Rochant, Marc Allassonnière-Tang, and Chundra Cathcart. In Proceedings of the Joint Conference on Language Evolution 2022
Nominal classification systems such as grammatical gender (e.g., the masculine/feminine distinction in French) and noun classes (e.g., Bantu noun classes based on fruits, plants, liquids, among others) provide a window on how the human brain perceives and categorizes objects and experiences it encounters. While the diachronic development of grammatical gender systems is well studied, noun class systems have received less attention. We use phylogenetic comparative methods to analyze where noun classes are marked (on nouns, pronouns, demonstratives, articles, adjectives, numbers, and verbs) in thirty-six Atlantic languages and how these markers change diachronically. Our results show that noun class marking is generally preferred and more stable within the noun phrase, i.e., on nouns, demonstratives, and adjectives.
- Investigating phonological theories with crowd-sourced data: The Inventory Size Hypothesis in the light of Lingua Libre. Mathilde Hutin and Marc Allassonnière-Tang. In Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology 2022
Data-driven research in phonetics and phonology relies massively on oral resources, and access thereto. We propose to explore a question in comparative linguistics using an open-source crowd-sourced corpus, Lingua Libre, Wikimedia’s participatory linguistic library, to show that such corpora may offer a solution to typologists wishing to explore numerous languages at once. For the present proof of concept, we compare the realizations of Italian and Spanish vowels (sample size = 5000) to investigate whether vowel production is influenced by the size of the phonemic inventory (the Inventory Size Hypothesis), by the exact shape of the inventory (the Vowel Quality Hypothesis) or by none of the above. Results show that the size of the inventory does not seem to influence vowel production, thus supporting previous research, but also that the shape of the inventory may well be a factor determining the extent of variation in vowel production. Most of all, these results show that Lingua Libre has the potential to provide valuable data for linguistic inquiry.
- Crowd-sourcing for Less-resourced Languages: Lingua Libre for Polish. Mathilde Hutin and Marc Allassonnière-Tang. In Proceedings of the International Conference on Language Resources and Evaluation 2022
Oral corpora for linguistic inquiry are frequently built based on the content of news, radio, and/or TV shows, sometimes also of laboratory recordings. Most of these existing corpora are restricted to languages with a large amount of data available. Furthermore, such corpora are not always accessible under a free open-access license. We propose a crowd-sourced alternative to this gap. Lingua Libre is the participatory linguistic media library hosted by Wikimedia France. It includes recordings from more than 140 languages. These recordings have been provided by more than 750 speakers worldwide, who voluntarily recorded word entries of their native language and made them available under a Creative Commons license. In the present study, we take Polish, a less-resourced language in terms of phonetic data, as an example, and compare our phonetic observations built on the data from Lingua Libre with the phonetic observations found by previous linguistic studies. We observe that the data from Lingua Libre partially matches the phonetic inventory of Polish as described in previous studies, but that the acoustic values are less precise, thus showing both the potential and the limitations of Lingua Libre to be used for phonetic research.
- Expansion by migration and diffusion by contact is a source to the global diversity of linguistic nominal categorization systems. Marc Allassonnière-Tang, Olof Lundgren, Maja Robbers, and 5 more authors. Humanities and Social Sciences Communications 2021
Languages of diverse structures and different families tend to share common patterns if they are spoken in geographic proximity. This convergence is often explained by horizontal diffusibility, which is typically ascribed to language contact. In such a scenario, speakers of two or more languages interact and influence each other’s languages, and in this interaction, more grammaticalized features tend to be more resistant to diffusion compared to features of more lexical content. An alternative explanation is vertical heritability: languages in proximity often share genealogical descent. Here, we suggest that the geographic distribution of features globally can be explained by two major pathways, which are generally not distinguished within quantitative typological models: feature diffusion and language expansion. The first pathway corresponds to the contact scenario described above, while the second occurs when speakers of genetically related languages migrate. We take the worldwide distribution of nominal classification systems (grammatical gender, noun class, and classifier) as a case study to show that more grammaticalized systems, such as gender, and less grammaticalized systems, such as classifiers, are almost equally widespread, but the former spread more by language expansion historically, whereas the latter spread more by feature diffusion. Our results indicate that quantitative models measuring the areal diffusibility and stability of linguistic features are likely to be affected by language expansion that occurs by historical coincidence. We anticipate that our findings will support studies of language diversity in a more sophisticated way, with relevance to other parts of language, such as phonology.
- Identifying the Russian voiceless non-palatalized fricatives /f/, /s/, and /ʃ/ from acoustic cues using machine learning. Natalja Ulrich, Marc Allassonnière-Tang, François Pellegrino, and 1 more author. The Journal of the Acoustical Society of America 2021
This paper shows that machine learning techniques are very successful at classifying the Russian voiceless non-palatalized fricatives [f], [s], and [ʃ] using a small set of acoustic cues. From a data sample of 6320 tokens of read sentences produced by 40 participants, temporal and spectral measurements are extracted from the full sound, the noise duration, and the middle 30 ms windows. Furthermore, 13 mel-frequency cepstral coefficients (MFCCs) are computed from the middle 30 ms window. Classifiers based on single decision trees, random forests, support vector machines, and neural networks are trained and tested to distinguish between these three fricatives. The results demonstrate that, first, the three acoustic cue extraction techniques are similar in terms of classification accuracy (93% and 99%) but that the spectral measurements extracted from the full frication noise duration result in slightly better accuracy. Second, the center of gravity and the spectral spread are sufficient for the classification of [f], [s], and [ʃ] irrespective of contextual and speaker variation. Third, MFCCs show a marginally higher predictive power over spectral cues (<2%). This suggests that both sets of measures provide sufficient information for the classification of these fricatives and their choice depends on the particular research question or application.
- Investigating the branching of Chinese classifier phrases: Evidence from speech perception and production. Marc Allassonnière-Tang, Ying-Chun Chen, Nai-Shing Yen, and 1 more author. Journal of Chinese Linguistics 2021
The formal structure of the construction formed by a numeral (Num), a sortal classifier (C) or mensural classifier (M), and a noun (N), is controversial, as both left-branching [[Num C/M] N] and right-branching [Num [C/M N]] structures have been argued for in the literature. In this paper we report two psycholinguistic experiments on speech production and perception in Mandarin to investigate this branching issue. First, we applied the syntax-phonology interface of tone 3 (T3) sandhi and performed a phonological analysis of native speakers’ tone sandhi patterns of [Num C/M N] phrases composed of T3 monosyllabic words. Second, we conducted a click-detection experiment to see how native speakers would perceive a click inserted in a C/M phrase composed of monosyllabic words, as compared to when it is inserted in other syntactic structures with attested left- or right-branching. Results from both experiments supported the left-branching structure of classifier phrases.
- Syllable Complexity and Morphological Synthesis: A Well-Motivated Positive Complexity Correlation Across Subdomains. Shelece Easterday, Matthew Stave, Marc Allassonnière-Tang, and 1 more author. Frontiers in Psychology 2021
Relationships between phonological and morphological complexity have long been proposed in the linguistic literature, with empirical investigations often seeking complexity trade-offs. Positive complexity correlations tend not to be viewed in terms of motivations. We argue that positive complexity correlations can be diachronically well-motivated, emerging from crosslinguistically prevalent processes of language change. We examine the correlation between syllable complexity and morphological synthesis, hypothesizing that the process of grammaticalization motivates a positive relationship between the two features. To test this, we conduct a typological survey of 95 diverse languages and a corpus study of 21 languages with substantive (predominantly >10,000 words) corpora from the DoReCo project. The first study establishes a significant positive correlation between syllable complexity, measured in terms of maximal syllable patterns, and the index of synthesis (morpheme/word ratio). The second study tests the hypothesis that the relationship between syllable complexity and synthesis holds at local (word-initial and word-final) levels and within noun and verb types, as predicted by a grammaticalization account. While the findings of the corpus study are limited in their statistical power, the observed tendencies are consistent with our predictions. This study contributes important findings to the complexity literature, as well as a novel method which incorporates broad typological sampling and deep corpus analysis.
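The index of synthesis named in this abstract is a morpheme/word ratio, which can be illustrated with a minimal sketch. The hyphen-segmented input format, the function name, and the toy words are assumptions made for illustration; they do not reflect how the study's corpora are actually annotated.

```python
def index_of_synthesis(segmented_words):
    """Morphemes per word, for words whose morphemes are separated by '-'.

    The '-' segmentation convention is an assumption of this sketch.
    """
    if not segmented_words:
        raise ValueError("need at least one word")
    morpheme_count = sum(len(word.split("-")) for word in segmented_words)
    return morpheme_count / len(segmented_words)


# Hypothetical toy sample: 3 + 2 + 1 morphemes over 3 words.
toy_sample = ["un-believ-able", "cat-s", "run"]
print(index_of_synthesis(toy_sample))
```

Higher values indicate more morphemes per word, that is, a more synthetic profile; a value of 1 would mean every word is monomorphemic.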
- Testing Semantic Dominance in Mian Gender: Three Machine Learning Models. Marc Allassonnière-Tang, Dunstan Brown, and Sebastian Fedden. Oceanic Linguistics 2021
The Trans-New Guinea language Mian has a four-valued gender system that has been analyzed in detail as semantic. This means that the principles of gender assignment are based on the meaning of the noun. Languages with purely semantic systems are at one end of a spectrum of possible assignment types, while others are assumed to have both semantic and formal (i.e., phonology- or morphology-based) assignment. Given the possibility of gender assignment by both semantic and formal principles, it is worthwhile testing the empirical validity of the categorization of the Mian system as predominantly semantic. Here, we apply three machine learning models to determine independently what role semantics and phonology play in predicting Mian gender. Information about the formal and semantic features of nouns is extracted automatically from a dictionary. Different types of computational classifiers are trained to predict the grammatical gender of nouns, and the performance of the computational classifiers is used to assess the relevance of form and semantics in relation to gender prediction. The results show that semantics is dominant in predicting the gender of nouns in Mian. While it validates the original analysis of the Mian system, it also provides further evidence that claims of an equal contribution of form-based and semantic features in gender assignment do not hold for at least a proper subset of languages with gender.
- Interindividual Variation Refuses to Go Away: A Bayesian Computer Model of Language Change in Communicative Networks. Mathilde Josserand, Marc Allassonnière-Tang, François Pellegrino, and 1 more author. Frontiers in Psychology 2021
Treating the speech communities as homogeneous entities is not an accurate representation of reality, as it misses some of the complexities of linguistic interactions. Inter-individual variation and multiple types of biases are ubiquitous in speech communities, regardless of their size. This variation is often neglected due to the assumption that “majority rules,” and that the emerging language of the community will override any such biases by forcing the individuals to overcome their own biases, or risk having their use of language being treated as “idiosyncratic” or outright “pathological.” In this paper, we use computer simulations of Bayesian linguistic agents embedded in communicative networks to investigate how biased individuals, representing a minority of the population, interact with the unbiased majority, how a shared language emerges, and the dynamics of these biases across time. We tested different network sizes (from very small to very large) and types (random, scale-free, and small-world), along with different strengths and types of bias (modeled through the Bayesian prior distribution of the agents and the mechanism used for generating utterances: either sampling from the posterior distribution [“sampler”] or picking the value with the maximum probability [“MAP”]). The results show that, while the biased agents, even when being in the minority, do adapt their language by going against their a priori preferences, they are far from being swamped by the majority, and instead the emergent shared language of the whole community is influenced by their bias.
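The contrast between the "sampler" and "MAP" production mechanisms described in this abstract can be sketched with a toy agent. Everything here is an assumption made for illustration: the two-variant setup, the Beta prior, the class and method names, and the parameter values are hypothetical and should not be read as the authors' implementation.

```python
import random


class BetaAgent:
    """Toy agent holding a Beta(alpha, beta) belief about P(variant 1)."""

    def __init__(self, alpha, beta, strategy="sampler", rng=None):
        if alpha <= 1 or beta <= 1:
            # Restriction of this sketch: keeps the Beta mode well defined.
            raise ValueError("alpha and beta must both exceed 1")
        if strategy not in ("sampler", "map"):
            raise ValueError("strategy must be 'sampler' or 'map'")
        self.alpha = float(alpha)
        self.beta = float(beta)
        self.strategy = strategy
        self.rng = rng or random.Random()

    def observe(self, utterance):
        """Conjugate update after hearing variant 1 or variant 0."""
        if utterance == 1:
            self.alpha += 1
        else:
            self.beta += 1

    def map_estimate(self):
        """Mode of the current Beta belief (its maximum a posteriori value)."""
        return (self.alpha - 1) / (self.alpha + self.beta - 2)

    def produce(self):
        """Emit variant 1 or 0 using the agent's production strategy."""
        if self.strategy == "sampler":
            # Draw a probability from the belief, then an utterance from it.
            p = self.rng.betavariate(self.alpha, self.beta)
        else:
            # Use the single most probable value of the belief.
            p = self.map_estimate()
        return 1 if self.rng.random() < p else 0


# Hypothetical usage: a strongly biased speaker talks to a weakly biased listener.
speaker = BetaAgent(10, 2, strategy="map", rng=random.Random(0))
listener = BetaAgent(2, 2, strategy="sampler", rng=random.Random(1))
for _ in range(50):
    listener.observe(speaker.produce())
print(listener.map_estimate())
```

In this sketch the strength of a bias is encoded by how concentrated the initial Beta prior is; a network simulation would repeat such exchanges along the edges of a graph.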
- The Diversity of Classifier Inventory in Mandarin Dialects: A Case Study of Baoding. Na Song and Marc Allassonnière-Tang. Faits de Langues 2021
Our study compares Standard Mandarin (the Beijing dialect used in spoken and written registers) with the Mandarin dialect of Baoding (one of the Mandarin dialects belonging to the Jì-lŭ Mandarin group, Hebei-Shandong). Standard Mandarin and Baoding are geographically and phylogenetically closely related, but they differ in terms of their classifier system, as Standard Mandarin resorts to a wide array of sortal classifiers whereas Baoding only uses one general classifier. We first provide a detailed analysis of the unconventional classifier system in Baoding. Then, we compare the lexical and discourse functions of sortal classifiers in Standard Mandarin and Baoding. We show that Standard Mandarin does present a certain level of convergence with its geographical neighbour Baoding. However, these varieties also display significant divergences, as several lexical and discourse functions typically associated with classifier systems cannot be fulfilled by the only classifier found in Baoding.
- Topic modelling on archive documents from the 1970s: global policies on refugees. Philip Grant, Ratan Sebastian, Marc Allassonnière-Tang, and 1 more author. Digital Scholarship in the Humanities 2021
This study conducts a historical analysis of global policies on refugees within typewritten and digitally born documents (c. 55,000 pages) from international and national archives. The data originate from the 1970s and are stored in archives from the UK and US governments, plus the United Nations High Commissioner for Refugees (UNHCR). The overarching theme is to analyse the involvement of the UK, the USA, and the UNHCR in different refugee cases that occurred during the 1970s. To do so, we (1) identify the main topics in each document; (2) investigate the transmission of topics horizontally (between organizations) and vertically (through time); and (3) suggest targeted areas of the document set for further close reading by historians. Standard Optical Character Recognition and object detection are used to extract information from documents and categorize them. Then, natural language processing (NLP) methods like topic modelling and clustering are used to identify topics and the relationships between them across time. The results identify several main themes covered by different organizations and how the focus of each organization changes diachronically. Besides its academic contribution, this study also demonstrates how, through the use of existing techniques with limited customization, digital technologies in the hands of the historian can augment and complement qualitative methods in bringing to light the themes and trends demonstrated in large bodies of historical documents.
- What conditions tone paradigms in Yukuna: Phonological and machine learning approaches. Magdalena Lemus-Serrano, Marc Allassonnière-Tang, and Dan Dediu. Glossa: a journal of general linguistics 2021
Yukuna is an understudied Arawak language of North-West Amazonia with a privative tonal system. In this system, roots are underlyingly specified for tone, whilst affixes are toneless. However, affixation interacts with tone, leading to many variations in surface tonal patterns. This paper puts forth a qualitative analysis of Yukuna’s tonal system, and provides data-driven evidence in favor of this analysis using machine learning methods. More precisely, we use decision trees and random forests to assess quantitatively the predictions of the phonological analysis. A manually annotated corpus of verbal paradigms was split into a training and a testing set. We trained the computational classifiers on the first and tested their predictions on the second. We found that they predict the majority of the patterns and support the qualitative analysis. Additionally, they suggest avenues for enhancing the phonological analysis, by providing a ranking of the variables that highlight statistical tendencies within tonal patterns. Besides its contribution to understanding tonal systems in general and of that of Yukuna in particular, our work also suggests that such machine learning approaches might become part of the complex theoretical and methodological toolkit needed for language description and linguistic theory development.
- A corpus study of lexical speech errors in Mandarin. I-Ping Wan and Marc Allassonnière-Tang. Taiwan Journal of Linguistics 2021
We investigate a corpus of lexical substitution speech errors in Mandarin conversation data and present how Mandarin speakers produce erroneous lexical items and how these items are related to the intended words. The corpus includes 747 lexical speech errors from 100 participants and applies the part-of-speech definition of the Academia Sinica Corpus. Our results partially match with the observations in Germanic and Romance languages. As an example, the data from Mandarin native speakers shows that erroneously produced words and target words are almost always found in the same parts of speech. Moreover, noun substitutions are the most common type of substitution within the majority of content word pairs. However, the occurrence of verb errors is higher in Mandarin than in other languages, possibly reflecting a word frequency effect.
- An empirical study on the contribution of formal and semantic features to the grammatical gender of nounsAli Basirat, Marc Allassonnière-Tang, and Aleksandrs BerdicevskisLinguistics Vanguard 2021
This study conducts an experimental evaluation of two hypotheses about the contributions of formal and semantic features to the grammatical gender assignment of nouns. One of the hypotheses (Corbett and Fraser 2000) claims that semantic features dominate formal ones. The other hypothesis, formulated within the optimal gender assignment theory (Rice 2006), states that form and semantics contribute equally. Both hypotheses claim that the combination of formal and semantic features yields the most accurate gender identification. In this paper, we operationalize and test these hypotheses by trying to predict grammatical gender using only character-based embeddings (that capture only formal features), only context-based embeddings (that capture only semantic features) and the combination of both. We performed the experiment using data from three languages with different gender systems (French, German and Russian). Formal features are a significantly better predictor of gender than semantic ones, and the difference in prediction accuracy is very large. Overall, formal features are also significantly better than the combination of form and semantics, but the difference is very small and the results for this comparison are not entirely consistent across languages.
- Classifiers in MorphologyMarcin Kilarski, and Marc Allassonnière-TangIn Oxford Research Encyclopedia of Linguistics 2021
Classifiers are partly grammaticalized systems of classification of nominal referents. The choice of a classifier can be based on such criteria as animacy, sex, material, and function as well as physical properties such as shape, size, and consistency. Such meanings are expressed by free or bound morphemes in a variety of morphosyntactic contexts, on the basis of which particular subtypes of classifiers are distinguished. These include the most well-known numeral classifiers which occur with numerals or quantifiers, as in Mandarin Chinese yí liàng chē (one clf.vehicle car) ‘one car’. The other types of classifiers are found in contexts other than quantification (noun classifiers), in possessive constructions (possessive classifiers), in verbs (verbal classifiers), as well as with deictics (deictic classifiers) and in locative phrases (locative classifiers). Classifiers are found in languages of diverse typological profiles, ranging from the analytic languages of Southeast Asia and Oceania to the polysynthetic languages of the Americas. Classifiers are also found in other modalities (i.e., sign languages and writing systems).
- Classifiers in Southeast Asian languagesAlice Vittrant, and Marc Allassonnière-TangIn The languages and Linguistics of Mainland Southeast Asia 2021
Classifiers are one of the types of nominal classification systems that help speakers to identify discourse referents. They are commonly found in Southeast Asian languages, which motivates the geographical focus of this chapter. Given the semantic as well as the morphosyntactic overlap between the various systems, classifier devices are first presented in the context of all systems of nominal classification. Then, the analysis focuses on the different constructional subtypes of classifiers and discusses their origin along with how they are used by speakers in discourse.
- The Effect of Word Frequency and Position-in-Utterance in Mandarin Speech Errors: A Connectionist Model of Speech ProductionI.-Ping Wan, and Marc Allassonnière-TangIn Chinese Lexical Semantics 2021
The connectionist model of speech processing infers that word frequency and position-in-utterance play a major role in the occurrence of speech errors. First, words that are not frequently used are more likely to result in speech errors since they generally receive less activation than frequently occurring words and require more activation to be chosen. Second, speech errors are more likely to occur near the end of utterances since, according to the given-before-new-principle, utterance-final words convey new information that has not yet been activated in the preceding context. The information of word frequency and position-in-utterance is extracted automatically from 382 utterances of a Mandarin speech error corpus and fed to generalized linear mixed models and a decision-tree based classifier. The results show that word frequency and position-in-utterance can predict the occurrence of speech errors with a performance over (but close to) the majority baseline. Therefore, additional information is required to improve the accuracy in the predictions.
- Keyword Spotting: A quick-and-dirty method for extracting typological features of language from grammatical descriptionsHarald Hammarström, One-Soon Her, and Marc Allassonnière-TangIn Proceedings of the Swedish Language Technology Conference 2021
Starting from a large collection of digitized raw-text descriptions of languages of the world, we address the problem of extracting from them information of interest to linguists. We describe a general technique to extract properties of the described languages associated with a specific term. The technique is simple to implement, simple to explain, requires no training data or annotation, and requires no manual tuning of thresholds. The results are evaluated on a large gold standard database on classifiers with accuracy results that match or exceed human inter-coder agreement on similar tasks. Although accuracy is competitive, the method may still be enhanced by a more rigorous probabilistic background theory and usage of extant NLP tools for morphological variants, collocations and vector-space semantics.
- The evolutionary trends of grammatical gender in Indo-Aryan languagesMarc Allassonnière-Tang, and Michael DunnLanguage Dynamics and Change 2020
This paper infers the processes of development and change of grammatical gender in Indo-Aryan languages using phylogenetic comparative methods. 48 Indo-Aryan languages are coded based on 44 presence-absence features relating to gender marking on the verbs, adjectives, personal pronouns, demonstrative pronouns, and possessive pronouns. A Bayesian Reverse Jump Hyper Prior analysis, which infers the evolutionary dynamics of changes between feature values, gives results that are consistent with historical linguistic and typological studies on gender systems in Indo-Aryan languages and predicts the evolutionary trends of the features included in the dataset.
- A Statistical Explanation of the Distribution of Sortal Classifiers in Languages of the World via Computational ClassifiersOne-Soon Her, and Marc TangJournal of Quantitative Linguistics 2020
Previous studies demonstrate that morphosyntactic plural markers and the structure of numeral systems have individually strong predictive power with regard to the usage of sortal classifiers in languages. We use these two factors as explanatory variables to train the computational classifier of random forests and evaluate the accuracy of their predictive power when selecting the existence/absence of sortal classifiers as response variable. Our results show that these two factors result in an excellent discrimination performance of random forests, even when taking into account sortal classifiers as an areal feature. However, the correlation between morphosyntactic plural markers and multiplicative bases is weaker than the correlation between sortal classifiers and plural markers plus multiplicative bases. We are thus able to provide novel insights with regard to probabilistic universals on sortal classifiers, and suggest an innovative cross-disciplinary approach to test the effect of implicational universals with computational methods.
- Functions of gender and numeral classifiers in NepaliMarc Allassonnière-Tang, and Marcin KilarskiPoznan Studies in Contemporary Linguistics 2020
We examine the complex nominal classification system in Nepali (Indo-European, Indic), a language spoken at the intersection of the Indo-European and Sino-Tibetan language families, which are usually associated with prototypical examples of grammatical gender and numeral classifiers, respectively. In a typologically rare pattern, Nepali possesses two gender systems based on the human/non-human and masculine/feminine oppositions, in addition to which it has also developed an inventory of at least ten numeral classifiers as a result of contact with neighbouring Sino-Tibetan languages. Based on an analysis of the lexical and discourse functions of the three systems, we show that their functional contribution involves a largely complementary distribution of workload with respect to individual functions as well as the type of categorized nouns and referents. The study thus contributes to the ongoing discussions concerning the typology and functions of nominal classification as well as the effects of long-term language contact on language structure.
- A simple introduction to programming and statistics with decision trees in RMarc TangTeaching Statistics 2020
University students in other disciplines without prior knowledge of statistics and/or programming are introduced to the statistical method of decision trees in the programming language R during a 45-minute teaching and practice session. Statistics and programming skills are now frequently required within a wide variety of research fields and private industries. However, students unfamiliar with these subjects may be reluctant to join a full course because of time constraints, workloads, other commitments, or a belief that it is not for them. The proposed session is short and can be used as an ice-breaker to give students a basic understanding of running statistical models in a programming language.
- Numeral base, numeral classifier, and noun: Word order harmonizationMarc Allassonnière-Tang, and One-Soon HerLanguage and Linguistics 2020
Greenberg (1990a: 292) suggests that classifiers (clf) and numeral bases tend to harmonize in word order, i.e. a numeral (Num) with a base-final [n base] order appears in a clf-final [Num clf] order, e.g. in Mandarin Chinese, san1-bai3 (three hundred) ‘300’ and san1 zhi1 gou3 (three clf.animal dog) ‘three dogs’, and a base-initial [base n] Num appears in a clf-initial [clf Num] order, e.g. in Kilivila (Eastern Malayo-Polynesian, Oceanic), akatu-tolu (hundred three) ‘300’ and na-tolu yena (clf.animal-three fish) ‘three fish’. In non-classifier languages, base and noun (N) tend to harmonize in word order. We propose that harmonization between clf and N should also obtain. A detailed statistical analysis of a geographically and phylogenetically weighted set of 400 languages shows that the harmonization of word order between numeral bases, classifiers, and nouns is statistically highly significant, as only 8.25% (33/400) of the languages display violations, which are mostly located at the meeting points between head-final and head-initial languages, indicating that language contact is the main factor in the violations to the probabilistic universals.
- Sociocultural gender in nominal classification: A study of grammatical genderMarc Allassonnière-Tang, and Hiram RingIndian Linguistics 2020
We analyse how sociocultural gender can be reflected through grammatical gender and select Hindi (Indo-European) and Pnar (Austroasiatic) as case studies. We demonstrate that these grammatical gender systems share universal tendencies based on human cognition, i.e. long, thin, and vertical objects are associated with masculine grammatical gender, whereas round, flat, and horizontal ones are associated with feminine grammatical gender. We also show that these grammatical gender systems reflect the sociocultural values of the language speakers. Speakers of Hindi maintain a patrilineal kinship system, and in their language objects of large size are generally assigned to the masculine gender. Pnar kinship is matrilineal, and in the language large-sized objects tend to be associated with feminine gender. Similar asymmetries are observed with regard to generic gender and gender reversal. These results contribute to our understanding of the impact of universal cognitive principles and culture on grammatical structures by showing that both tendencies are not necessarily complementary and that they can co-exist in the same language.
- Cross-lingual Embeddings Reveal Universal and Lineage-Specific Patterns in Grammatical Gender AssignmentHartger Veeman, Marc Allassonnière-Tang, Aleksandrs Berdicevskis, and 1 more authorIn Proceedings of the 24th Conference on Computational Natural Language Learning 2020
Grammatical gender is assigned to nouns differently in different languages. Are all factors that influence gender assignment idiosyncratic to languages or are there any that are universal? Using cross-lingual aligned word embeddings, we perform two experiments to address these questions about language typology and human cognition. In both experiments, we predict the gender of nouns in language X using a classifier trained on the nouns of language Y, and take the classifier’s accuracy as a measure of transferability of gender systems. First, we show that for 22 Indo-European languages the transferability decreases as the phylogenetic distance increases. This correlation supports the claim that some gender assignment factors are idiosyncratic, and as the languages diverge, the proportion of shared inherited idiosyncrasies diminishes. Second, we show that when the classifier is trained on two Afro-Asiatic languages and tested on the same 22 Indo-European languages (or vice versa), its performance is still significantly above the chance baseline, thus showing that universal factors exist and, moreover, can be captured by word embeddings. When the classifier is tested across families and on inanimate nouns only, the performance is still above baseline, indicating that the universal factors are not limited to biological sex.
- Word order of numeral classifiers and numeral basesOne-Soon Her, Marc Tang, and Bing-Tsiong LiSTUF - Language Typology and Universals 2019
In a numeral classifier language, a sortal classifier (C) or a mensural classifier (M) is needed when a noun is quantified by a numeral (Num). Num and C/M are adjacent cross-linguistically, either in a [Num C/M] order or [C/M Num]. Likewise, in a complex numeral with a multiplicative composition, the base may follow the multiplier as in [ n×base ], e.g., san-bai ‘three hundred’ in Mandarin. However, the base may also precede the multiplier in some languages, thus [ base×n ]. Interestingly, base and C/M seem to harmonize in word order, i.e., [ n×base ] numerals appear with a [Num C/M] alignment, and [ base×n ] numerals, with [C/M Num]. This paper follows up on the explanation of the base-C/M harmonization based on the multiplicative theory of classifiers and verifies it empirically within six language groups in the world’s foremost hotbed of classifier languages: Sinitic, Miao-Yao, Austro-Asiatic, Tai-Kadai, Tibeto-Burman, and Indo-Aryan. Our survey further reveals two interesting facts: base-initial ([ base×n ]) and C/M-initial ([C/M Num]) orders exist only in Tibeto-Burman (TB) within our dataset. Moreover, the few scarce violations to the base-C/M harmonization are also all in TB and are mostly languages having maintained their original base-initial numerals but borrowed from their base-final and C/M-final neighbors. We thus offer an explanation based on Proto-TB’s base-initial numerals and language contact with neighboring base-final, C/M-final languages.
- Insights on the Greenberg-Sanches-Slobin generalization: Quantitative typological data on classifiers and plural markersMarc Tang, and One-Soon HerFolia Linguistica 2019
This paper offers quantitative typological data to investigate a revised version of the Greenberg-Sanches-Slobin generalization (GSSG), which states that (a) a language is unlikely to have both sortal classifiers and morphosyntactic plural markers, and (b) if a language does have both, then their use is in complementary distribution. Morphosyntactic plurals engage in grammatical agreement outside the noun phrase, while morphosemantic plurals that relate to collective and associative marking do not. A database of 400 phylogenetically and geographically weighted languages was created to test this generalization. The statistical test of conditional inference trees was applied to investigate the effect of areal, phylogenetic, and linguistic factors on the distribution of classifiers and morphosyntactic plural markers. The results show that the presence of classifiers is affected by areal factors as most classifier languages are concentrated in Asia. Yet, the low ratio of languages with both features simultaneously is still statistically significant. Part (a) of the GSSG can thus be seen as a statistical universal. We then look into the few languages that do have both features and tentatively conclude that part (b) also seems to hold but further investigation into some of these languages is needed.
- A typology of classifiers and gender: From description to computationMarc TangIn Acta Universitatis Upsaliensis 2019
Categorization is one of the most relevant tasks realized by humans during their life, as we consistently need to categorize the things and experiences that we encounter. Such need is reflected in language via various mechanisms, the most prominent being nominal classification systems (e.g., grammatical gender such as the masculine/feminine distinction in French). Typological methods are used to investigate the underlying functions and structures of such systems, using a wide variety of cross-linguistic data to examine universality and variability. This analysis is itself a classification task, as languages are categorized and clustered according to their grammatical features. This thesis provides a cross-linguistic typological analysis of nominal classification systems and in parallel compares a number of quantitative methods that can be applied at different scales. First, this thesis provides an analysis of nominal classification systems (i.e., gender and classifiers) via the description of three languages with respectively gender, classifiers, and both. While the analyses of the first two languages are more descriptive in nature and align with findings in the existing literature, the third language provides novel insights into the typology of nominal classification systems by demonstrating how classifiers and gender may co-occur in one language in terms of distribution of functions. Second, the underlying logic of nominal classification systems is commonly considered difficult to investigate, e.g., is there a consistent logic behind gender assignment in language? Is it possible to explain the distribution of classifier languages of the world while taking into account geographical and genealogical effects?
This thesis addresses the lack of arbitrariness of nominal classification systems at three different scales: The distribution of classifiers at the worldwide level, the presence of gender within a language family, and gender assignment at the language-internal level. The methods of random forests, phylogenetics, and word embeddings with neural networks are selected since they are respectively applicable at three different scales of research questions (worldwide, family-internal, language-internal).
- Linguistic Information in Word EmbeddingsAli Basirat, and Marc TangIn Agents and Artificial Intelligence 2019
We study the presence of linguistically motivated information in the word embeddings generated with statistical methods. The nominal aspects of uter/neuter, common/proper, and count/mass in Swedish are selected to represent respectively grammatical, semantic, and mixed types of nominal categories within languages. Our results indicate that typical grammatical and semantic features are easily captured by word embeddings. The classification of semantic features required significantly fewer neurons than that of grammatical features in our experiments based on a single-layer feed-forward neural network. However, semantic features also generated higher entropy in the classification output despite their high accuracy. Furthermore, the count/mass distinction posed difficulties for the model, even though the quantity of neurons was almost tuned to its maximum.
- Predicting Speech Errors in Mandarin Based on Word FrequencyMarc Tang, and I-Ping WanIn From Minimal Contrast to Meaning Construct 2019
This paper investigates the effect of word frequency on the occurrence of speech errors in Mandarin. A corpus of 390 speech errors along with their surrounding linguistic context was gathered. The information of word frequency was extracted from the Academia Sinica Corpus. Our analysis with a computational classifier based on conditional inference trees shows that intended words having a frequency lower than words of the surrounding context are more likely to generate speech errors.
- The lexical and discourse functions of grammatical gender in MarathiPär Eliasson, and Marc TangJournal of South Asian Languages and Linguistics 2018
We provide a functional analysis of the grammatical gender system of Marathi (Indo-Aryan) in Western India. The majority of the new Indo-Aryan languages typically classify each noun of the lexicon according to biological gender as masculine or feminine. Only a few Indo-Aryan languages such as Marathi diverge in terms of agreement pattern by categorizing nouns as masculine, feminine, and neuter. Yet gender in Marathi has not been extensively described in terms of functions. We thus apply functional typology to analyze grammatical gender in Marathi and provide detailed examples of its lexical and discourse functions.
- Lexical and morpho-syntactic features in word embeddings: A case study of nouns in SwedishAli Basirat, and Marc TangIn Proceedings of the 10th International Conference on Agents and Artificial Intelligence 2018
We apply real-valued word vectors combined with two different types of classifiers (linear discriminant analysis and feed-forward neural network) to scrutinize whether basic nominal categories can be captured by simple word embedding models. We also provide a linguistic analysis of the errors generated by the classifiers. The targeted language is Swedish, in which we investigate three nominal aspects: uter/neuter, common/proper, and count/mass. They represent respectively grammatical, semantic, and mixed types of nominal classification within languages. Our results show that word embeddings can capture typical grammatical and semantic features such as uter/neuter and common/proper nouns. Nevertheless, the model encounters difficulties in identifying classes such as count/mass, which not only combine both grammatical and semantic properties, but are also subject to conversion and shift. Hence, we answer the call of the Special Session on Natural Language Processing in Artificial Intelligence by approaching the topic of interfaces between morphology, lexicon, semantics, and syntax via interdisciplinary methods combining machine learning of language and general linguistics.
- The coalescence of grammatical gender and numeral classifiers in the general classifier wota in NepaliMarcin Kilarski, and Marc TangIn Proceedings of the Linguistic Society of America 2018
While nominal classification has received considerable attention, relatively little is known about cross-linguistically rare complex systems. An example is provided by Nepali (Indo-European, Indic), which possesses both grammatical gender and numeral classifiers. Our aim is to examine morphosyntactic and functional properties of the general classifier wota. Unusually, the classifier exhibits gender agreement both in its independent forms and as fused with a numeral, raising questions about its lexical and pragmatic functions. Our study contributes to the typology of nominal classification by proposing a functional approach to cases of complex co-occurrence of gender and classifiers.
- Explaining the acquisition order of classifiers and measure words via their mathematical complexityMarc TangJournal of Child Language Acquisition and Development 2017
We provide a theoretical explanation for the acquisition of numeral classifiers (sortal classifiers) and measure words (mensural classifiers) in Mandarin Chinese. Previous research in various languages separately observed that the general classifier is acquired before specific classifiers and that classifiers are acquired prior to measure words. However, no theoretical discussion was fully developed and no study combined the general classifier, specific classifiers and measure words in one dataset. We propose to fill these gaps by combining semantic complexity (Brown, 1973) and a mathematical approach (Her, 2012): given that the relative complexity of x, y and z is unknown, x + y is more complex than either x or y, and x + y + z is more complex than any of them. By applying the mathematical approach, it is observed that the general classifier carries the mathematical value of times one, noted x, while specific classifiers possess x plus a semantic value y, which highlights an inherent feature of the referent. Finally, measure words carry both x and y, along with additional information about quantity, z. Therefore, the acquisition order is expected to start from the simplest semanticity and develop toward the most complex, i.e. general classifier (x) > specific classifier (x + y) > measure word (x + y + z). As supporting evidence, we gathered longitudinal data from CHILDES (Child Language Data Exchange System; Zhou, 2008). The participants included 110 children aged 1 to 6 years, providing a total of 110 conversations of 20 minutes each with 1851 tokens of numeral classifiers and measure words. Our methodology applied the definition of acquisition from Brown (1973) and the equation of Suppliance in Obligatory Context (SOC) cross-checked with Target-Like Usage (TLU) from Pica (1983). The results demonstrated that our model generated correct predictions, serving as a theoretical basis for future studies in the field of language acquisition.