Tuesday, September 20, 2016

Five years after the article "Quantitative analysis of culture using millions of digitized books" in Science. End of culturomics?


Number of citing "articles": 1000.

Languages cool as they expand: Allometric scaling and the decreasing need for new words
Nature 2013.

We study language evolution by analyzing the word frequencies of millions of distinct words in seven languages recorded in books from the past two centuries. For all languages and time spans we confirm that two scaling regimes characterize the word frequency distributions, with the more common words in each language obeying the Zipf law. We measure the allometric scaling relation between corpus size and vocabulary size, confirming recent theoretical predictions that relate the Heaps law to the Zipf law. We measure a decreasing trend in the annual growth fluctuations of word use with increasing corpus size suggesting that the rate of linguistic evolution decreases as the language expands, implying that new words have increasing marginal returns, and that languages can be said to “cool by expansion.” Counteracting this cooling are periods of political conflict which are not only characterized by decreases in literary productivity but also by a globalized media focus which may increase the mobility of concepts and words across political borders.  
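To make the "allometric scaling relation between corpus size and vocabulary size" concrete: Heaps' law says the vocabulary grows roughly as a power of the corpus size, V ~ N^b with b < 1, so each additional block of text contributes fewer new words. Below is a minimal Python sketch of how such an exponent could be estimated from a log-log fit. The function names and the synthetic numbers are illustrative assumptions, not the authors' code or data.

```python
import math

def vocabulary_growth(tokens):
    """Return (corpus_size, vocabulary_size) pairs after each token is read."""
    seen = set()
    sizes = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        sizes.append((n, len(seen)))
    return sizes

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. a power-law exponent."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var

# Synthetic data (an assumption for illustration): V = 50 * N**0.5 exactly.
corpus_sizes = [10 ** k for k in range(3, 8)]
vocab_sizes = [50 * n ** 0.5 for n in corpus_sizes]
heaps_exponent = loglog_slope(corpus_sizes, vocab_sizes)
print(f"estimated Heaps exponent b = {heaps_exponent:.3f}")
```

With real text, `vocabulary_growth` would supply the (N, V) pairs for the fit. An exponent below 1 is what the "decreasing need for new words" in the title refers to.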

When physicists do linguistics

Is English ‘cooling’? A scientific paper gets the cold shoulder

PLOS ONE 2015:

It is tempting to treat frequency trends from the Google Books data sets as indicators of the "true" popularity of various words and phrases. Doing so allows us to draw quantitatively strong conclusions about the evolution of cultural perception of a given topic, such as time or gender. However, the Google Books corpus suffers from a number of limitations which make it an obscure mask of cultural popularity. A primary issue is that the corpus is in effect a library, containing one of each book. A single, prolific author is thereby able to noticeably insert new phrases into the Google Books lexicon, whether the author is widely read or not. With this understood, the Google Books corpus remains an important data set to be considered more lexicon-like than text-like. Here, we show that a distinct problematic feature arises from the inclusion of scientific texts, which have become an increasingly substantive portion of the corpus throughout the 1900s. The result is a surge of phrases typical to academic articles but less common in general, such as references to time in the form of citations. We use information theoretic methods to highlight these dynamics by examining and comparing major contributions via a divergence measure of English data sets between decades in the period 1800-2000. We find that only the English Fiction data set from the second version of the corpus is not heavily affected by professional texts. Overall, our findings call into question the vast majority of existing claims drawn from the Google Books corpus, and point to the need to fully characterize the dynamics of the corpus before using these data sets to draw broad conclusions about cultural and linguistic evolution.
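The abstract above does not name its divergence measure, so the following is only a sketch of the general idea, assuming the Jensen-Shannon divergence: compare the word distributions of two decades and rank words by how much each contributes to the total. The toy counts and all function names are made up for illustration.

```python
import math
from collections import Counter

def normalize(counts):
    """Turn raw word counts into a probability distribution."""
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def jsd_contributions(p, q):
    """Per-word contributions (in bits) to the Jensen-Shannon divergence of p and q."""
    contributions = {}
    for word in set(p) | set(q):
        pw = p.get(word, 0.0)
        qw = q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # probability under the mixture distribution
        term = 0.0
        if pw > 0:
            term += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            term += 0.5 * qw * math.log2(qw / mw)
        contributions[word] = term
    return contributions

# Toy counts for two hypothetical decades (not real Google Books data).
decade_a = normalize(Counter({"time": 5, "love": 5}))
decade_b = normalize(Counter({"time": 5, "fig": 5}))

contributions = jsd_contributions(decade_a, decade_b)
total_jsd = sum(contributions.values())
top_words = sorted(contributions, key=contributions.get, reverse=True)[:2]
print(f"JSD = {total_jsd:.3f} bits; largest contributors: {top_words}")
```

Words with the same relative frequency in both decades contribute nothing, so the ranking surfaces exactly the terms whose usage shifted, such as citation-style phrases entering through scientific texts.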

PLOS Comput Biol 2016:
The Virtuous Cycle of a Data Ecosystem
Digital data of all types are being created at an ever-increasing rate, doubling approximately every two years. Annual data creation rates are estimated to reach 44 trillion gigabytes by 2020. Similarly, the rate at which primary scientific data are being collected is accelerating. This astounding growth in scientific data creation has led to the contemporary discussion of scientific data sharing policies. Many of the criticisms levied against data sharing have focused on practical issues such as the economics and logistics of data storage, technical challenges for doing so, or appropriate attribution of credit. In contrast, the arguments in favor of data sharing have focused largely on scientific replication, reproducibility, facilitation of collaborative research, and increased citations for publications that share data. This is largely an ethical argument wherein there is an obligation to share data collected using public funds.

