TechPolicy.ca Data mining politics and public policy. The politics of data mining.

6 Sep 2010

Moving On

Well, it's time to say "goodbye". On August 30, I started a new job, and I am unlikely to have the time to update this blog or work on the projects discussed in other posts. So, dear readers, it's been nice writing for you, and hopefully I'll start something again in a couple of years.

Filed under: Admin

17 Aug 2010

Information Flow and Arbitrage in the Political Blogosphere

As some of you may know, I recently submitted my dissertation for the MSc in Social Science of the Internet at the Oxford Internet Institute. The dissertation is still being graded, so I don't want to post it here just yet. However, the title of the dissertation is "Information Flow and Arbitrage in the Political Blogosphere" and the abstract is below. E-mail me if you'd like to discuss it or get a copy of the dissertation.

Over the last decade, political blogging has grown significantly and is now a popular form of political engagement and information collection. This dissertation explores the political blogosphere in the context of the 2008 US Presidential election. From May 2008 to April 2009, 16,741 blogs were crawled on a daily basis, with their content and hyperlinks stored and analyzed. Using social network analysis, this dissertation analyzes how information flows through the blogosphere captured in this data. Through a number of network-based methodological approaches, it is shown that the political blogosphere is organized in a core-periphery structure, with popular, elite bloggers in the core. The core itself is fragmented: it is composed of tightly knit communities whose members are mutually aware of each other, and information does not easily flow between these communities. The periphery acts as a bridge, however, with information flowing to peripheral bloggers from multiple communities.

Through a node-level analysis, the dissertation further concludes that one can define different forms of influence and precursors to influence. Using a statistical model that controls for in- and out-degree distributions of the network, this dissertation is able to identify bloggers that act as information arbiters within their personal networks. A striking finding of the research is that the core is no better at information arbitrage and introducing their personal networks to new sources of information than the periphery; being a popular blogger entrenched in an elite community does not make it easier to promote new sources of information.

Furthermore, this thesis makes significant contributions to methodological work in social network analysis. It provides new approaches to analyzing longitudinal network data, and explores new statistical models for analyzing the significance of different social mechanisms in a network.

28 Jun 2010

Data Mining for Development: Sales Pitch

Over the last few weeks, I've been rekindling my interest in mathematics and international development. I studied both subjects during my undergraduate degree, and have kept trying to figure out a way to combine the two. I'm hoping to spend some time over the next two months running pilot projects in this area to see how well data mining, artificial intelligence, and statistics can be used to help development agencies, non-profits, and related institutions to do their work. Below is the rough draft of a short sales pitch I am working on in this area. E-mail me or comment if you are interested in learning more.

Data Mining for Development: Proposal

Organizations face two major data-related challenges: (1) how to collect meaningful data about their work, and (2) how to improve their operations and impact using that data. We’re here to help.

Today, non-profit organizations and social enterprises work in a challenging environment. With governments focusing on austerity measures and funding bodies receiving lower returns on their financial investments, non-profit organizations are continually facing an uphill battle. They are constantly being encouraged to improve how they operate and prove that they are having a positive effect.

Data Mining for Development (DM4D) is a non-profit project that aims to help organizations and social enterprises achieve their objectives. Composed of a team of graduate students, researchers, and professionals trained in mathematics, statistics, and computer science, DM4D helps organizations in a number of ways:

  1. DM4D helps organizations decide how to collect data and run project evaluations at a lower cost. We do this by finding ways to automate the data collection process, and by developing indicators for variables that are difficult to measure.
  2. DM4D researchers are familiar with extracting meaningful information from thousands of documents at a time. Organizations produce countless reports, e-mails, and articles, and we can help make sense of such data.
  3. DM4D builds mathematical models to help predict how the work environments of non-profit organizations are changing, so they can better prepare for unexpected events and changes coming in the future.

Collecting and understanding large amounts of information and data is difficult and expensive. However, doing so can help an organization achieve its mission and objectives. DM4D is here to provide services and advice to help organizations leverage their data.

For more information, please contact Wojciech Gryc at wojciech@gmail.com.

12 Jun 2010

Prototype: More Web-Friendly Visualizations in R

I've spent some more time thinking about how best to put together the package for creating web-friendly, interactive data visualizations in R. I have a pretty substantial JavaScript package that does a lot of basic visualizations now, and it's really exciting to see where this is going. With this in mind, I'm releasing a new version of the R package prototype I keep discussing in this blog.

The package includes a number of functions, among them wv.plot(), wv.lineplot(), wv.snaplot(), and wv.bargraph(). The documentation still needs a lot of work, and there are no interactive abilities yet (though they exist in the JavaScript code).

What is most exciting about this package is that many of the steps needed to build a complete graph have been split into individual functions. Thus, while one can make a scatterplot with wv.plot(), one can also build the same plot from wv.axis() and wv.points(). Each data visualization gets its own ID, or can be assigned one, so one can later pass parts of a visualization (e.g. the points in the scatterplot itself) as arguments to other functions, which opens the door to adding functions for interactivity.

A few examples of the visualizations are shown below, along with the R code needed to display them. Note that these are embedded into the blog through the use of an inline frame.

Basic Scatterplot

The code below will generate a basic scatterplot.
x = rnorm(30)
y = rnorm(30)
wv.plot(x, y, "~/Desktop/scatterplot", height=300, width=300, xlim=c(-2.5,2.5), ylim=c(-2.5,2.5), xbreaks=c(0), ybreaks=c(0))



Plot with Multiple Data Types

Suppose you want a scatterplot with multiple point types and a line. You can build this manually with the following code.

x = rnorm(30); y = rnorm(30); z = runif(30);
wv.open("~/Desktop/plot3/", height=300, width=300);
wv.axis(c(-3.5, 3.5), c(-3.5, 3.5), xbreaks=-2:2, ybreaks=-2:2);
wv.points(x, y, xlim=c(-3.5, 3.5), ylim=c(-3.5, 3.5));
wv.lines(sort(x), z, col="red", xlim=c(-3.5, 3.5), ylim=c(-3.5, 3.5));
wv.close();



Bar Graph

This is a new graph format.

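x = c(2.5, 7, 11);
cats = c("A", "B", "C");  # category labels, one per bar (placeholder names; the original labels were not given)
wv.bargraph(x, cats, "~/Desktop/barplot", ylim=c(0, 15), ybreaks=(1:5)*3);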



As always, comments are welcome.

27 May 2010

Canadian CPI: Visualization Brainstorm

After finishing the R prototype for data visualization, I've started abstracting the various methods necessary to create beautiful graphs. While there's no preliminary version of the R package yet, I think I've taken a number of exciting steps. These include:

  • Abstracting graph objects. Objects such as lines, scatter plots, and other graph types can all be treated in a similar fashion in JavaScript. I use this approach in the new version of the JavaScript graph presented below.
  • Including axes. The last graphs did not have axes, grid lines, and other information cues. These ones do. While they have to be manually set, this presents an advantage in that one can choose which grid lines and axis points to show.
  • Interactivity. The graph below actually has useful interactive features. Mousing over points provides information on the value of the point itself, while mousing over the line plot provides the title. Nothing too complex, but already fairly useful.

I chose to present data on the Canadian consumer price index (CPI). This is freely available data and serves as a reminder of the major political issue of our time... While I don't want to make this post political, the ultimate goal of this blog is to use such visualizations and mathematical models to better understand public policy and the role of data mining therein. Might as well start referencing useful data in this regard.

So without further ado, here's the graph...



The next step is fairly clear: making the above possible in R!

18 May 2010

Prototype: Web-Friendly Visualizations in R

Developing web-friendly data visualizations is not very difficult, though as far as I know, a package that allows one to do this directly in R does not exist (e-mail me if you know of one). As someone who develops lots of data-oriented software tools, I always find it nice to post visualizations online. To facilitate this task, I've been fooling around with creating a data visualization prototype in R. While the package is very limited in what it does, I hope it'll generate a discussion on the types of visualization tools that could help R users post their work on the web.

At this stage, the package has three functions to illustrate scatter plots, line graphs, and social networks. Each function creates a new directory with all the necessary JavaScript and HTML files. The HTML file could then be embedded using an inline frame (as done below) or used as a standalone website.
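To make the one-directory-per-visualization idea concrete, here is a minimal sketch in base R. It is not the prototype's code: the function name write_viz_dir(), the file name index.html, and the way the data is embedded as JavaScript arrays are all assumptions made for illustration, and the actual drawing code is left out.

```r
# Hypothetical helper (NOT part of the prototype package): creates a
# directory and writes a standalone HTML page with the data embedded
# as JavaScript arrays.
write_viz_dir <- function(x, y, dir, height = 300, width = 300) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)

  # Turn a numeric vector into a JavaScript array literal, e.g. "[1, 2]".
  js_array <- function(v) {
    paste0("[", paste(format(v, trim = TRUE), collapse = ", "), "]")
  }

  html <- c(
    "<!DOCTYPE html>",
    "<html><body>",
    sprintf("<div id=\"canvas\" style=\"width:%dpx;height:%dpx\"></div>",
            as.integer(width), as.integer(height)),
    "<script>",
    paste0("var xs = ", js_array(x), ";"),
    paste0("var ys = ", js_array(y), ";"),
    "// drawing code (e.g. calls to a JavaScript graphing library) would go here",
    "</script>",
    "</body></html>"
  )

  out <- file.path(dir, "index.html")
  writeLines(html, out)
  invisible(out)  # return the path of the generated page
}

# Example: write a page for 25 random points into a temporary directory.
page <- write_viz_dir(rnorm(25), rnorm(25), file.path(tempdir(), "wv-demo"))
print(page)
```

A page generated this way could be opened directly in a browser, or embedded in a blog post with an inline frame whose src attribute points at the generated index.html.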

You can download the prototype here, and below are some examples of visualizations.

Scatter Plot

x = rnorm(25)
y = rnorm(25)
wv.scatterplot(x, y, "/wv-scatterplot", height=300, width=300, marginsize=0.1)



Line Graph

x = (-100:100)/10
y = sin(x)
wv.lineplot(x, y, "/wv-lineplot", height=300, width=300, marginsize=0.1)



Social Network


library(igraph)
g <- erdos.renyi.game(15, 0.175)
wv.sna(g, "/wv-sna", rnorm(15, 2, 0.75), width=400, height=400)



Next Steps

I apologize in advance, as some of the code above may be buggy and it certainly isn't very customizable. The next step -- assuming there's interest -- is to abstract the graph drawing to individual functions so one can then produce multiple graphs in one canvas or frame. Making more options for interactivity, labels, and so on is also a must. Again, comments and suggestions are very welcome.

2 Mar 2010

Visualizing Networks in JavaScript

Continuing my exploration of JavaScript-based data visualization, I've created a basic network visualizer for the MP data I'm collecting. Below is a social network of all the Canadian federal ministers who have been mentioned together in various press and social media sources in the last week.

Note that the size of the node represents the number of articles mentioning the MP in the past week.

If you want the source code or if the visualization does not work, please e-mail me.

21 Feb 2010

Beautiful Web-Based Graphs

I regularly show charts on this website, and for the past few days, have been trying to find a good way to do this. Many of the charts so far have been shown as PDF or JPG files. These are fine, but they are not very responsive. Furthermore, many of the packages available for graphing are proprietary or not open source, and this is a problem for me. I decided to look for something I could live with when it comes to displaying charts and graphs.

Quite a few people have recommended Google Charts, which definitely has a lot to offer. However, I also want to customize my charts and make my own chart types (for example, social network illustrations). Another good package is Open Flash Chart, but I don't have a Flash license and prefer things to be a bit more open. Finally, there's Processing. This is a great language, but Java applets on a website bug me.

I'm quite picky, but have finally found a useful tool: Raphaël -- a library meant for representing vector graphics using JavaScript. While they have a graphing library, I decided to write my own code to play around with the library and customize the graphics. Overall, I must say that I am very impressed with the package.

As an example, the chart below shows a bubble plot. While fairly basic, I'm really happy with how easy it is to make interactive charts. Hovering over the bubbles changes their colour, and adding other features is fairly easy.

Another example is a line chart, shown below.

I'll do my best to improve these charts and make them more interactive and useful. Please e-mail me if you want the source code.

12 Feb 2010

Mobile World Congress 2010

In a few hours, I'm flying to Barcelona for the Mobile World Congress, an annual event that showcases pretty much everything related to mobile technologies. I'm quite excited about this event, as it's bringing together around 40,000 to 50,000 people interested in mobile technologies, business, and related areas.

If you're attending and interested in data mining, social network analysis, social media mining, and mobile technologies, feel free to e-mail me. I'm always open to meeting people!

7 Feb 2010

Tracking the Press: Minister Networks

About a week ago, I discussed tracking Canadian MPs based on the number of times they get mentioned in various news media, and who they get mentioned with. At the time, I only showed a chart of mentions, and discussed some shortcomings of the approaches used for tracking politicians -- or, for that matter, any brands.

I've been improving my tracking software and working on new visualizations. The work has culminated in the network below; a high-quality PDF version is also available.

This network tracks Canadian federal ministers in various blogs, magazines, and newspapers. The size of the circle with the minister's name represents the number of articles (i.e. the larger the circle, the more articles), while a connection exists between ministers if they have been mentioned together in at least one article or blog post over the last week.

Such a network representation provides very useful information about press coverage of Canadian ministers. A great example is that Prime Minister Stephen Harper gets mentioned very often relative to other ministers, but is not mentioned often with other ministers. Tony Clement or Jim Prentice, on the other hand, get mentioned with more ministers, but have fewer articles about them.

One thing the network does not show, however, is how often the co-mentions occur. It's possible, for example, that a set of five or six ministers was mentioned in one article, and this would create something like the dense set of connections with ministers Flaherty, Prentice, Clement, and others. More information would be necessary to analyze whether this is the case or not.
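For readers who want to see how such a network can be assembled, here is a rough sketch in base R. It is not the tracking software itself: the articles are toy placeholders rather than real press data, and the variable names are my own. It computes the node sizes (articles per minister), the connections (pairs mentioned together), and the co-mention count per pair that the picture above leaves out.

```r
# Toy input: each element lists the ministers named in one article.
# Names and articles are placeholders, not real coverage data.
articles <- list(
  c("Minister A", "Minister B"),
  c("Minister A"),
  c("Minister B", "Minister C", "Minister D"),
  c("Minister A"),
  c("Minister B", "Minister C")
)

# Node size: number of articles mentioning each minister
# (unique() guards against a name repeated within one article).
mentions <- table(unlist(lapply(articles, unique)))

# One row per unordered pair of ministers sharing an article.
pairs <- do.call(rbind, lapply(articles, function(a) {
  a <- sort(unique(a))
  if (length(a) < 2) return(NULL)  # a single name creates no connection
  t(combn(a, 2))
}))

# Edge weight: number of articles in which the pair appears together.
edges <- aggregate(
  list(weight = rep(1, nrow(pairs))),
  by = list(from = pairs[, 1], to = pairs[, 2]),
  FUN = sum
)

print(mentions)
print(edges)
```

A single article naming three ministers contributes three connections at once, which is exactly how one long article can produce a dense cluster. Keeping the weight column makes it possible to tell that case apart from ministers who are repeatedly mentioned together. If igraph is installed, a data frame like edges can be turned into a graph with graph_from_data_frame(edges, directed = FALSE).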

Stay tuned for more updates on the software. I also hope to have a website set up where this is all done automatically and people can peruse social media surrounding Canadian politics.