TechPolicy.ca Data mining politics and public policy. The politics of data mining.

4 Feb 2010

Graphs, Maps, and Trees

I just finished reading Graphs, Maps, and Trees by Franco Moretti. The book was recommended to me by a friend (thanks Tom!) and I must say I really enjoyed it.

While the book does not discuss information theory, machine learning, or data mining, it provides a very interesting argument for more rigour in literary studies. Furthermore, I believe it provides a great introduction to the possibilities that information theory holds for political science, business intelligence, and related fields. A particularly powerful example of this is when Moretti writes,

What do literary maps do... First, they are a good way to prepare text for analysis. You choose a unit--walks, lawsuits, luxury goods, whatever--find its occurrences, place them in space... Or in other words, you reduce the text to a few elements, and abstract them from the narrative flow, and construct a new, artificial object like the maps that I have been discussing. And with a little luck, these maps will be more than the sum of their parts: they will possess 'emerging' qualities, which were not visible at the lower level.

In this paragraph, Moretti specifically discusses the use of geographical representations of novels to study the patterns behind the stories therein. If we look beyond maps to graphs, trees, networks, and other abstract analytical tools, we can see how any of them may illuminate underlying patterns in literary works.

As Moretti discusses at the start of his book, a major challenge to literary research is that reading all the novels published in a specific period is impossible. There are simply too many of them. The use of graphs allows one to analyze such works in aggregate, working around the fact that no one can read as fast as content is produced. Social media and press tracking face a similar challenge. There are too many blog posts, articles, Tweets, status updates, and websites out there for a consultant or researcher to read and aggregate by hand. As such, one needs more abstract frameworks for dealing with the data.
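To make the idea concrete, here is a minimal sketch in Python of "choose a unit, find its occurrences" applied to a pile of short texts. This is my own illustration, not something from Moretti's book: the function name count_units, the toy documents, and the choice to group by year and match exact lowercase words are all assumptions made for the example.

```python
import re
from collections import Counter


def count_units(documents, units):
    """Count, per period, how many documents mention each chosen unit.

    documents: iterable of (period, text) pairs (hypothetical input format).
    units: words to look for, matched as exact lowercase words.
    """
    counts = {}
    for period, text in documents:
        # Reduce the text to its set of lowercase words.
        words = set(re.findall(r"[a-z']+", text.lower()))
        bucket = counts.setdefault(period, Counter())
        for unit in units:
            if unit in words:
                bucket[unit] += 1
    return counts


# Toy corpus invented for illustration: (year, text).
documents = [
    (2009, "The lawsuit over the estate dragged on."),
    (2009, "She took long walks by the river."),
    (2010, "Another lawsuit, and another walk to court."),
]
units = ["lawsuit", "walks"]

result = count_units(documents, units)
for period in sorted(result):
    print(period, dict(result[period]))
```

The per-period counts are the "new, artificial object": no single document is read closely, but the aggregate can be charted over time. A real version would likely need stemming (so that "walk" and "walks" count as the same unit) and far more data.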

If you are looking for a non-technical introduction to the possibilities held within information retrieval and data mining, this is a great book. While Moretti doesn't discuss automated or algorithmic approaches to his work, the mental leap from his work to automated strategies is short and easy.