Sunday, August 19, 2012

Find: Novel text analysis uses PageRank to identify influential Victorian authors

A literature professor has developed software using Google's PageRank algorithm that has identified Jane Austen and Walter Scott as the most influential authors of the 1800s.

Matthew Jockers of the University of Nebraska analysed 3,592 digitized novels published in the UK, Ireland and the US between 1780 and 1900. He used a combination of Google's algorithm, machine learning and techniques from computational text analysis, including stylometry, corpus linguistics and network analysis.

After ensuring the gender balance was split roughly evenly, Jockers used his software to extract thematic data, including the frequency of specific words or groups of words. Network software was then used to categorize and rank this data. Jockers began with a network consisting of 12,902,464 rows and three columns: a source book in the first column, a target book in the second, and in the third the distance between the two (i.e. a measure of how many similarities they share according to the thematic data). After this was narrowed down to 6,447,640 entries, the information was imported into the network analysis software Gephi, and PageRank was used to help identify the novels that had the most links to future tomes, as well as the strongest links to those tomes.
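
To make the three-column network more concrete, here is a minimal Python sketch of a weighted PageRank run over a tiny source/target/distance list. It is not Jockers' code (the article says he used Gephi), and everything in it is an assumption made for illustration: the book names and distances are invented, edges are assumed to point from a later book to an earlier one it resembles, and distance is turned into a link weight by taking its reciprocal. As an aside, the starting figure of 12,902,464 rows equals 3,592 × 3,592, which is consistent with one row for every ordered pair of novels.

```python
from collections import defaultdict

# Hypothetical toy edge list: (source_book, target_book, distance).
# Assumption: each edge points from a later book to an earlier book it
# resembles, and a smaller distance means the two books are more similar.
EDGES = [
    ("book_c", "book_a", 0.2),
    ("book_d", "book_a", 0.4),
    ("book_d", "book_b", 0.5),
    ("book_c", "book_b", 1.0),
    ("book_b", "book_a", 0.25),
]


def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration over (source, target, distance) rows."""
    nodes = sorted({name for s, t, _ in edges for name in (s, t)})
    n = len(nodes)

    # Assumption: convert distance to a link weight so closer books link more strongly.
    out_links = defaultdict(list)
    for source, target, distance in edges:
        out_links[source].append((target, 1.0 / distance))

    rank = {name: 1.0 / n for name in nodes}
    for _ in range(iterations):
        # Rank held by books with no outgoing links is shared out evenly.
        dangling = sum(rank[name] for name in nodes if name not in out_links)
        new_rank = {name: (1 - damping) / n + damping * dangling / n for name in nodes}
        for source, links in out_links.items():
            total_weight = sum(weight for _, weight in links)
            for target, weight in links:
                new_rank[target] += damping * rank[source] * weight / total_weight
        rank = new_rank
    return rank


ranks = pagerank(EDGES)
ranking = sorted(ranks, key=ranks.get, reverse=True)
print(ranking[0])  # the book that collects the most weighted links
```

In this toy network the earliest book, which every later book links to, ends up with the highest score. That is the sense in which PageRank can surface novels with many strong links to later ones.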