Newspapers to Numbers
This was done for my English Honors class in mid 11th grade.
I have been interested for some time in how words change meaning over time. My English teacher gave us an amazing month-long assignment: we had two weeks to learn everything we could about a subject, and two weeks to synthesize what we learned into something we would present. I wanted to use methods from machine learning and AI (like the idea of quantifying the meanings of words and representing them as vectors) to quantitatively model how language changes. I looked into various datasets of historical text, and eventually settled on the Library of Congress’s collection of scanned newspapers from 1850-1950.
I wrote a program that trained a small, ~11-million-parameter BERT model on 300 megabytes of text. Each document I fed the model was prefixed with the year it was from, so the model would learn the associations between the tokens for the year and the rest of the newspaper excerpt. I found that the system was easier to train when it was given more documents that were each shorter, so I cut each newspaper article into 20 excerpts.

To look at trends in how words change over time, I used the document embeddings from the model’s last hidden layer, the same ones the BERT model uses to produce its distribution over its vocabulary. I took the embeddings the model gave me for a series of strings each containing a year and a term (e.g. “YEAR: 1893 | Spanish Empire”), and computed the cosine distance between each of them and a base vector, such as the distance between BERT-vector(“Spanish Empire”) and BERT-vector(“YEAR: $year | Spanish Empire”). I then plotted these cosine distances, as one vector stayed static through time while the other moved through time.
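The pipeline above can be sketched roughly as follows. This is a minimal, hypothetical reconstruction, not the original code: the `embed` function here is a deterministic stand-in for the trained BERT model’s last-hidden-layer document embedding, and the excerpt-splitting and distance math follow the description in the text.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in for the trained model's document embedding.

    In the real project this vector came from the ~11M-parameter BERT
    model's last hidden layer; here we derive a deterministic
    pseudo-embedding from a hash so the sketch runs on its own.
    """
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim)

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cos(a, b); 0 means identical direction, 2 means opposite."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def make_excerpts(article: str, year: int, n_excerpts: int = 20) -> list[str]:
    """Cut one article into n_excerpts chunks, each prefixed with its year."""
    words = article.split()
    step = max(1, len(words) // n_excerpts)
    return [f"YEAR: {year} | " + " ".join(words[i:i + step])
            for i in range(0, len(words), step)][:n_excerpts]

# Track how a term drifts relative to a static, year-free base vector.
term = "Spanish Empire"
base = embed(term)
distances = {year: cosine_distance(base, embed(f"YEAR: {year} | {term}"))
             for year in range(1850, 1951, 10)}
```

Plotting `distances` by year would give the kind of drift curve described above, with the base vector fixed and the year-prefixed vector moving through time.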
I trained a few such models, experimenting with different model sizes and amounts of training data. The largest one I tested had 110 million parameters and took nearly 16 hours of compute on my PC. I found that I got better results from these systems when they were given more training data, not more parameters.
Here is my final slideshow with the graphs I made for this.
Not every year is represented in the dataset, and I’ve checked that those unrepresented years account for most of the outliers.