NLP Reveals How Scientific Writing Changed Since 1665
Hundreds or even thousands of new words are coined every year, and as we all know from daily experience, language changes significantly over time. New slang is introduced, cultural trends and historical events favor certain words, and others fall by the wayside. But can these shifts be measured with natural language processing (NLP) techniques?
The Royal Society of London publishes Philosophical Transactions, the oldest scientific journal still in publication, and over 17,000 papers appeared in its journals between 1665 and 1920. This corresponds to 78 million tokens (roughly, ‘words’) of scientific writing. That is far more than a human could sift through, but NLP techniques can extract insights from this expansive dataset that would otherwise be inaccessible.
Despite the availability and widespread use of NLP techniques, little research has examined long-term linguistic change in scientific writing specifically. Long-term linguistic analyses can shed light on language evolution and contribute to a deeper understanding of the history, culture, and society in which scientific ideas were communicated.
Tokens & Length:
The number of tokens entering the corpus grew over time as scientific writing became more popular. Papers themselves also grew longer: the average rose from about 2,000 tokens per paper in the 1600s to almost 8,000 tokens per paper in the 1900s.
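As a rough illustration of how an average like this can be computed (a sketch, not the pipeline used in the study), the Python below groups documents by century and averages their token counts. The regex tokenizer, the function names, and the toy documents are all assumptions made for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Crude tokenizer (an assumption): runs of word characters,
    # or any single non-space punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

def mean_tokens_per_period(docs, period=100):
    """docs: iterable of (year, text) pairs.
    Returns {period_start: mean tokens per document}."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for year, text in docs:
        start = (year // period) * period  # e.g. 1665 -> 1600
        totals[start] += len(tokenize(text))
        counts[start] += 1
    return {start: totals[start] / counts[start] for start in sorted(totals)}

# Toy documents, invented for illustration only.
docs = [
    (1665, "A new lens."),
    (1699, "On air."),
    (1905, "Cells divide, then rest."),
]
print(mean_tokens_per_period(docs))  # {1600: 3.5, 1900: 6.0}
```

Passing `period=10` would give per-decade averages instead of per-century ones.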
Sentence Length & Clarity:
Despite the increase in overall paper length, sentences became consistently shorter over time. This could signal a cultural shift towards clearer, more concise communication, which might in turn be attributed to increasing mass education, the need for more efficient dissemination of information, and the appeal to a wider audience. Titles also became significantly shorter, dropping from a high of 35 tokens on average in 1760 to about 12 tokens in the 1900s.
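Average sentence length can be approximated with very little code. The sketch below assumes a naive splitter that breaks after a period, exclamation mark, or question mark followed by whitespace. A real pipeline would need something more robust for abbreviations and historical punctuation, and the function names here are hypothetical.

```python
import re

def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def mean_sentence_length(text):
    """Mean number of word tokens per sentence; 0.0 for empty text."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    lengths = [len(re.findall(r"\w+", s)) for s in sentences]
    return sum(lengths) / len(lengths)

# Toy text, invented for illustration only.
print(mean_sentence_length("The sample was heated. It melted quickly at once."))  # 4.5
```

The same function applied to a title gives its length in tokens, so one helper covers both measurements.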
Predictability:
Surprisal scores measure unpredictability: the less expected a word is in its context, the higher its surprisal. In this study, I looked at surprisal through the lens of a rolling 10-year window, so the baseline was the language written around the same time as the word. Under this model, there was a statistically significant shift towards more predictable language over time. Potential reasons include the standardization of English, technological advancements, industrialization, the growth of the publishing industry, and broader education systems, all of which help to create a more uniform language style.
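To make the idea concrete, surprisal is usually defined as the negative log probability of a word given its context, measured in bits when the logarithm is base 2. The sketch below is a deliberately simple stand-in rather than the model from the study: it assumes a bigram model with add-one smoothing, whitespace tokenization, and a window covering the 10 years before a given year. All function names are hypothetical.

```python
import math
from collections import Counter

def window_token_lists(docs, year, width=10):
    # Assumed window definition: documents dated in [year - width, year).
    return [text.lower().split() for doc_year, text in docs
            if year - width <= doc_year < year]

def train_bigram(token_lists):
    """Count contexts and bigrams per document, so no bigram spans two documents."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for tokens in token_lists:
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab

def surprisal(prev, word, model):
    """-log2 P(word | prev) under add-one smoothing."""
    contexts, bigrams, vocab = model
    v = len(vocab) + 1  # +1 reserves probability mass for unseen words
    p = (bigrams[(prev, word)] + 1) / (contexts[prev] + v)
    return -math.log2(p)

def mean_surprisal(tokens, model):
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    return sum(surprisal(p, w, model) for p, w in pairs) / len(pairs)

# Toy documents, invented for illustration only.
docs = [(1700, "the cell divides"), (1705, "the cell grows"), (1720, "the air cools")]
model = train_bigram(window_token_lists(docs, 1710))
# A word pair seen in the window is less surprising than an unseen one.
print(surprisal("the", "cell", model) < surprisal("the", "air", model))  # True
```

Lower mean surprisal for a period's texts against a model of their own period would correspond to the more predictable language described above.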
Nominalization & Precision:
Sentences became more noun-heavy over time. To study this, I used the noun-verb ratio, a computational measure of how nominal a text is. This trend may reflect increased innovation and technological advancement, which brought a greater focus on the precise categorization and description of abstract concepts.
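A noun-verb ratio is straightforward to compute once a text has been part-of-speech tagged. The sketch below assumes the tokens already carry Universal POS tags (produced by whichever tagger is available) and that proper nouns count as nouns and auxiliaries as verbs. Both choices are assumptions for illustration, not necessarily the definition used in the study.

```python
from collections import Counter

# Assumed tag groupings, using Universal POS tag names.
NOUN_TAGS = {"NOUN", "PROPN"}
VERB_TAGS = {"VERB", "AUX"}

def noun_verb_ratio(tagged):
    """tagged: list of (word, pos_tag) pairs.
    Returns nouns / verbs, or None when the text has no verbs."""
    counts = Counter(tag for _, tag in tagged)
    nouns = sum(counts[t] for t in NOUN_TAGS)
    verbs = sum(counts[t] for t in VERB_TAGS)
    return nouns / verbs if verbs else None

# Toy tagged sentence, invented for illustration only.
tagged = [
    ("The", "DET"), ("measurement", "NOUN"), ("of", "ADP"),
    ("temperature", "NOUN"), ("requires", "VERB"), ("calibration", "NOUN"),
]
print(noun_verb_ratio(tagged))  # 3.0
```

A ratio that rises from one period to the next indicates a more nominal, noun-heavy style.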
Loanwords & Standardization:
Loanwords, or ‘foreign words’, fluctuated but never again reached the levels seen in the late 1700s. This may have been due to the growing prominence of English, or to pushes towards standardization of the language. Additionally, Samuel Johnson’s dictionary, often regarded as the first definitive English dictionary, was published in 1755 and might have played a role in this shift towards English within scientific writing.
Grammar’s Steadfastness:
Amidst these diverse shifts in language, the use of prepositions and subordinating conjunctions remained extremely stable over time, seeming to uphold the language’s grammatical structure. This suggests a lasting commitment to clear and structured scientific communication.
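One simple way to check this kind of stability is to compute the share of tokens carrying a given tag in each period and then look at how much that share varies. The sketch below assumes pre-tagged tokens with Universal POS tags (`ADP` for prepositions, `SCONJ` for subordinating conjunctions) and uses the population standard deviation as a rough spread measure. The names and the choice of measure are illustrative assumptions.

```python
from statistics import pstdev

FUNCTION_TAGS = frozenset({"ADP", "SCONJ"})  # assumed tag set

def tag_share(tagged, tags=FUNCTION_TAGS):
    """Fraction of tokens whose POS tag is in `tags`; 0.0 for empty input."""
    if not tagged:
        return 0.0
    return sum(1 for _, tag in tagged if tag in tags) / len(tagged)

def share_by_period(tagged_by_period, tags=FUNCTION_TAGS):
    """tagged_by_period: {period: list of (word, pos_tag)}."""
    return {p: tag_share(t, tags) for p, t in sorted(tagged_by_period.items())}

# Toy tagged snippets, invented for illustration only.
tagged_by_period = {
    1700: [("of", "ADP"), ("air", "NOUN"), ("in", "ADP"), ("vessels", "NOUN")],
    1900: [("that", "SCONJ"), ("cells", "NOUN"), ("divide", "VERB"), ("in", "ADP")],
}
shares = share_by_period(tagged_by_period)
print(shares)                   # {1700: 0.5, 1900: 0.5}
print(pstdev(shares.values()))  # 0.0
```

A spread close to zero across periods is what "extremely stable" would look like under this measure.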
Takeaways:
All of these trends point to the gradual evolution, widespread acceptance, and standardization of scientific language across the ages, with special emphasis on clarity and precision. These findings demonstrate how academic discourse has changed significantly over time and has demanded different linguistic structures as a result.
If you’re interested in learning more, this article is just a snippet of the findings from my full study.