Daghestanian loans database

Authors: Ilya Chechuro, Michael Daniel, and Samira Verhees.

The DagLoans database contains the results of eliciting a list of 164 lexical meanings across the languages of Daghestan. Data collection was aimed at assessing the amount of lexical transfer between these languages. The database includes lists elicited from 125 speakers of minority languages and 21 lists based on dictionaries (15 of them dictionaries of East Caucasian languages). In addition to the East Caucasian data, it includes data from other languages relevant for the study of language contact in the area: Persian, Arabic, Azerbaijani, Georgian and Russian (6 dictionaries).

The general objective of the DagLoans project is the study of lexical borrowing in the languages of Daghestan at a level of granularity that is sensitive to differences between village varieties. For this purpose, we developed a method for obtaining comparable lexical data by eliciting a relatively short (146 concepts) wordlist that serves as litmus paper, a quick field probe for the amount of lexical transfer. Using a fixed list makes it possible to discover quantitative correlates of sociolinguistic differences between areas, such as the spread of a certain lingua franca or the presence and degree of contact with particular languages. In combination with the sociolinguistic data on multilingualism in Daghestan, our data show that the conditions and the degree of language contact vary from village to village and correlate with the bilingualism rates reported in another project of ours, the Atlas of Multilingualism in Daghestan.

The table shows the concepts (lexical meanings) and their translations into the target languages. Translations are grouped into similarity sets, that is, sets of words that look similar and were used as translations of the same concept. Whenever a similarity set is shared by different language families or by sufficiently distant branches, we take this as an indication that the lexical item might have been shared through language contact. Metadata includes the name of the village where the word was recorded and its location, the language spoken in the village, and the list ID. The ID corresponds to a particular speaker or, in some cases, to a written dictionary source. All data are accessible here. The dataset is also available here in the one-hot format.
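The contact heuristic described above can be sketched in a few lines. This is a minimal illustration in Python, not the project's own code: the record layout and the similarity set IDs (`pepper_1`, `pepper_2`) are assumptions made for the example, and the sketch only checks language families, not "sufficiently distant branches".

```python
from collections import defaultdict

# Hypothetical records: (concept, similarity set ID, language, family).
# The set IDs and the assignment of languages to sets are illustrative only.
records = [
    ("pepper", "pepper_1", "Azerbaijani", "Turkic"),
    ("pepper", "pepper_1", "Rutul", "East Caucasian"),
    ("pepper", "pepper_2", "Avar", "East Caucasian"),
    ("pepper", "pepper_2", "Karata", "East Caucasian"),
]

def contact_candidates(rows):
    """Return IDs of similarity sets attested in more than one family."""
    families = defaultdict(set)
    for _concept, set_id, _language, family in rows:
        families[set_id].add(family)
    return sorted(s for s, fams in families.items() if len(fams) > 1)

print(contact_candidates(records))
```

A set that stays within one family is not flagged here; treating distant branches of the same family as candidates would require an extra branch label per language.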

The DagLoans database was compiled by Ilya Chechuro and Samira Verhees. The field data were primarily collected by Ilya Chechuro, Michael Daniel, Nina Dobrushina, Samira Verhees and the students supervised by them (see Acknowledgements). The data are copyrighted by the Linguistic Convergence Laboratory, HSE University, Moscow, and may be used in other academic projects (see How to cite). The project was funded by the Basic Research Program at the National Research University Higher School of Economics (HSE) and supported within the framework of a subsidy by the Russian Academic Excellence Project ‘5-100’.

Contents:

target_words: 25809
languages: 36

How to cite this project

If you use data from the database in your research, please cite as follows:

Chechuro I., Daniel M., Dobrushina N., and Verhees S. 2019. Daghestanian loans database. Linguistic Convergence Laboratory, HSE. (Available online at https://lingconlab.github.io/Dagloan_database/DL_database.html, DOI, accessed on October 02, 2019.)

The database

For now, the table shows source Concepts and target Words. Each target word is assigned to a similarity Set, that is, a set of words that have the same meaning and look similar. In the future, data on borrowing sources will be added. Metadata includes the name of the Village where the word was recorded, the administrative District it is part of, the Language spoken there, and the List ID: these IDs correspond to a particular speaker or, in some cases, to a written source such as a dictionary. Data is accessible at: Github/LingConLab/DagloanDatabase.
The dataset in the one-hot format is available here.
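As a rough illustration of what a one-hot layout can look like, the sketch below turns long-format rows into a table with one row per list and one 0/1 column per similarity set. The input layout, the set IDs and the pairing of list IDs with sets are assumptions for the example; the actual file may be organised differently.

```python
# Hypothetical long-format rows: (list ID, similarity set ID).
rows = [
    ("Kina1", "pepper_1"),
    ("Kina1", "salt_2"),
    ("Khiv1", "pepper_1"),
    ("Khiv1", "salt_1"),
]

def one_hot(pairs):
    """Return (columns, table) where table maps list ID to a 0/1 row."""
    columns = sorted({set_id for _list_id, set_id in pairs})
    index = {set_id: i for i, set_id in enumerate(columns)}
    table = {}
    for list_id, set_id in pairs:
        # Create an all-zero row on first sight, then mark the attested set.
        table.setdefault(list_id, [0] * len(columns))[index[set_id]] = 1
    return columns, table

columns, table = one_hot(rows)
print(columns)
for list_id, row in table.items():
    print(list_id, row)
```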

The table below can be sorted and filtered; the resulting subset can be downloaded by pressing the “CSV” button.


Version: 2019-10-02. For questions or comments contact .


Map of the surveyed villages

Hover over and/or click on a dot on the map to learn more. The color of a dot corresponds to the number of lists collected in the village. Orange = dictionary data.

Sample lexical map

The map below shows the distribution of different stems for the concept ‘pepper’.

Sources of lexical influence

The four plots presented here show lexical influence from Turkic, Avar, Georgian and Chechen. The data are split by district on the X axis. The Y axis shows the percentage of loans found in the elicited samples. Dots represent elicitations (color shows the village), with a horizontal line showing the median value per district.
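The quantities plotted here (a loan percentage per elicitation and a median per district) can be computed along the following lines. This is a sketch under stated assumptions: the list IDs, district names and counts are invented placeholders, and the percentage is assumed to be loans divided by elicited items.

```python
from collections import defaultdict
from statistics import median

# Hypothetical per-list counts; IDs, districts and numbers are placeholders.
lists = [
    {"list_id": "A1", "district": "X", "items": 200, "loans": 50},
    {"list_id": "A2", "district": "X", "items": 200, "loans": 30},
    {"list_id": "B1", "district": "Y", "items": 200, "loans": 10},
]

def loan_percentages(entries):
    """Map each list ID to the percentage of its items that are loans."""
    return {e["list_id"]: 100 * e["loans"] / e["items"] for e in entries}

def district_medians(entries):
    """Median loan percentage over the lists of each district."""
    by_district = defaultdict(list)
    for e in entries:
        by_district[e["district"]].append(100 * e["loans"] / e["items"])
    return {d: median(values) for d, values in by_district.items()}

print(loan_percentages(lists))
print(district_medians(lists))
```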

Cluster dendrogram of foreign influence

The dendrogram presented here shows how the word lists collected from different speakers group according to their sets of loanwords of different origin. The tree is built as follows: two cells are at distance 0 only if both are non-empty and match; otherwise the distance is 1. Cells with missing values (NA) are not counted.
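The distance rule described above can be sketched as follows. This is an illustration, not the code behind the dendrogram: it assumes that each word list is a sequence of cells holding a loan-source label or a missing value (`None` standing in for NA), and that per-cell distances are simply summed.

```python
def lenient_distance(a, b):
    """Sum of per-cell distances, skipping cells where either value is missing."""
    total = 0
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # missing value (NA): not counted
        total += 0 if x == y else 1
    return total

# Hypothetical cell values for two word lists.
list_a = ["Turkic", "Avar", None, "Turkic"]
list_b = ["Turkic", "Georgian", "Avar", None]

print(lenient_distance(list_a, list_b))
```

Only the first two positions are compared here (one match, one mismatch); the last two are skipped because one of the cells is missing. A full pairwise matrix of such distances is what a hierarchical clustering routine would take as input.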

Columns of the table printed for this dendrogram (the 125 rows are omitted):

     Speaker Language Village District Alibeglo1 Arkhit1 Arkhit2 Arkhit3
     Arkhit4 Arkhit5 Arkhit6 Bezhta1 Darvag1 Darvag2 Darvag3 Darvag4
     Darvag5 Darvag6 Dyubek1 Dyubek2 Dyubek3 Dyubek4 Dzhavgat1 Dzhavgat2
     Dzhavgat3 Dzhavgat4 Dzhibakhni1 Dzhibakhni2 Dzhibakhni3 Dzhibakhni4
     Helmets1 Helmets2 Helmets3 Ikhrek1 Ikhrek2 Ikhrek3 Ikhrek4 Ilisu1
     Karata1 Karata2 Karata3 Karata4 Khapil1 Khapil2 Khapil3 Khapil4
     Khapil5 Khiv1 Khiv2 Khiv3 Khiv4 Khlut1 Khlut2 Khlut3 Khlut4 Khlut5
     Khoredzh1 Khoredzh2 Khoredzh3 Khoredzh4 Khoredzh5 Khoredzh6 Khutkhul1
     Khutkhul2 Khutkhul3 Khutkhul4 Kiche1 Kiche2 Kidero1 Kidero2 Kidero3
     Kina1 Kina2 Kina3 Kurag1 Kusur1 Laka1 Laka2 Laka3 Laka4 Laka5 Laka6
     Meshabash1 Meshabash2 Mikik1 Mikik2 Qax1 Qax2 Qax3 Qax4 Qax5 Qax6
     Qax7 Qax8 Qax9 Qum1 Qum2 Rikvani1 Rutul1 Tad-Magitl1 Tad-Magitl2
     Tatil1 Tatil2 Tatil3 Tatil4 Tatil5 Tlibisho1 Tlibisho2 Tlibisho3
     Tlibisho4 Tpig1 Tsinit1 Tsinit2 Tsinit3 Tsinit4 Tsinit5 Tukita1
     Yagdyg1 Yagdyg2 Yagdyg3 Yagdyg4 Yagdyg5 Yagdyg6 Yersi1 Yersi2 Yersi3
     Yersi4 Zilo1 Zilo2

Cluster dendrogram of foreign influence (strict distances)

The dendrogram presented here shows how the word lists collected from different speakers group according to their sets of loanwords of different origin. The tree is built as follows: two cells are at distance 0 only if both are non-empty and match; otherwise the distance is 1. Unlike in the previous dendrogram, missing values (NA) are counted, so the penalties are higher and the distances can be large even between similar speakers.
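A sketch of the strict variant is given below. As before, this is an illustration rather than the project's code: cells are assumed to hold a loan-source label or `None` (standing in for NA), and per-cell distances are assumed to be summed.

```python
def strict_distance(a, b):
    """Sum of per-cell distances; only two matching non-empty cells score 0."""
    total = 0
    for x, y in zip(a, b):
        matching_non_empty = x is not None and y is not None and x == y
        total += 0 if matching_non_empty else 1  # a missing value costs 1
    return total

# Hypothetical cell values for two word lists.
list_a = ["Turkic", "Avar", None, "Turkic"]
list_b = ["Turkic", "Georgian", "Avar", None]

print(strict_distance(list_a, list_b))
print(strict_distance(list_a, list_a))
```

Under this rule even a list compared with itself gets a non-zero distance whenever it contains a missing value, which is why the distances in this tree are larger.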

Mediation of Turkic influence (speakers)

The four plots presented here show how lexical influence from Turkic is mediated by the local major languages (Lezgian and Avar) across Daghestan. The data are split by district on the X axis. The Y axis shows the percentage of loans found in the elicited samples. Dots represent elicitations (color shows the village), with a horizontal line showing the median value per district. The plot indicates that Avar mediation may be high in the north but is not present in the south, while Lezgian mediation is not probable in either region.

Mediation of Turkic influence (villages)

The four plots presented here show how lexical influence from Turkic is mediated by the local major languages (Lezgian and Avar) across Daghestan. The data are split by district on the X axis. The Y axis shows the percentage of loans found in the elicited samples. Each dot represents the union of all elicitations from one village (color also shows the village), with a horizontal line showing the median value per district. The plot indicates that Avar mediation may be high in the north but is not present in the south, while Lezgian mediation is not probable in either region.
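Pooling elicitations per village, as described above, amounts to taking a set union. The sketch below assumes that each elicitation is reduced to the set of concepts for which a loan was recorded; the list IDs, village names and concepts are placeholders.

```python
from collections import defaultdict

# Hypothetical input: list ID -> (village, concepts with a recorded loan).
elicitations = {
    "V1_1": ("V1", {"pepper", "salt"}),
    "V1_2": ("V1", {"salt", "tea"}),
    "V2_1": ("V2", {"tea"}),
}

def village_unions(data):
    """Union of the loan sets of all elicitations from the same village."""
    unions = defaultdict(set)
    for village, loans in data.values():
        unions[village] |= loans
    return dict(unions)

print(village_unions(elicitations))
```

A loan attested by any one speaker thus counts for the whole village, so village-level values can only be equal to or higher than those of the individual lists.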

Mediation of total Turkic influence

The plot presented here illustrates the intersections of the sets of Turkic loans across villages (plot 1) and across regions (plot 2).
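One common way of tabulating such intersections is to count, for every loan, exactly which villages share it. The sketch below does this for placeholder data; the village names and items are invented, and whether the plots use exclusive counts of this kind is an assumption.

```python
from collections import Counter

# Hypothetical loan sets per village.
loans_by_village = {
    "V1": {"a", "b", "c"},
    "V2": {"b", "c"},
    "V3": {"c", "d"},
}

def exclusive_intersections(sets):
    """Count items by the exact combination of villages that contain them."""
    counts = Counter()
    for item in set().union(*sets.values()):
        key = tuple(sorted(name for name, s in sets.items() if item in s))
        counts[key] += 1
    return dict(counts)

print(exclusive_intersections(loans_by_village))
```

Each item is counted once, under the full combination of villages in which it occurs, so the counts add up to the total number of distinct loans.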

Mediation of Standard Azerbaijani influence

The plot presented here illustrates the intersections of the sets of loans from Standard Azerbaijani across villages (plot 1) and across regions (plot 2).

Mediation of Turkic influence via major languages

The first two plots (the first set) show how lexical influence from Turkic is mediated by Lezgian and by the Khlut (Akhty) dialect of Lezgian across Daghestan. These data are mostly relevant for the Rutul region, but are still comparable across Daghestan. The data are split by district on the X axis. The Y axis shows the percentage of loans found in the elicited samples. Dots represent elicitations (color shows the village), with a horizontal line showing the median value per district. The plot indicates that Avar mediation may be high in the north but is not present in the south, while Lezgian mediation is not probable in either region.

The second set of plots represents the same data for Chechen and Avar mediation. These data are relevant for the Botlikh region, but are still comparable across Daghestan.

The third set of plots represents the same data for Georgian and Avar mediation. These data are relevant for the Tsunta region, but are still comparable across Daghestan.

The last set of plots combines the first three for convenience.

R-Shiny version of the page

This page also exists as an R-Shiny application, which offers greater interactivity. The R-Shiny application was created by Ernest Shklyar and Vitaliy Balashov. The R-Shiny code currently works as a desktop application (it requires R and several additional packages), which can be downloaded from their GitHub page. The Shiny version of the web page will replace the HTML version as soon as possible. The code is also deployed at https://ilchec.shinyapps.io/dagloans-shiny/ and can be accessed with a web browser. Note that the R-Shiny app may be slightly behind the HTML page and may contain flaws and mistakes that have already been corrected on the HTML page (we are working on keeping the two as similar as possible, but our resources are limited).

Acknowledgements

The creators of the project owe a debt of gratitude to the HSE University students Arseniy Averin, Faina Daniel and Lilia Terekhina for the field work and Aleksandra Martynova, Anastasia Safonova, Anna Vishenkova and Anastasia Chasovskikh for annotating the dictionary data; to Zarina Kerimova from the Russian State University for Humanities for the transcription of elicitations; to Sejdul from Darvag, Said from Kina, Azim from Khiv and Abdurizag from Tpig for their hospitality and help during the field trips; and to Ernest Shklyar and Vitaliy Balashov for creating the R-Shiny version of this web page.
