analyticjournalism.com

It's not "all about story" if you don't have anything to say. So go get some data.

How-to: Turning Netflix data into a map

Feb 8th, 2010 by analyticjournalism

From the Society for News Design via FlowingData:

The making of the NYT’s Netflix graphic

January 20th, 2010

By Kevin Quealy

One of The Times’ recent graphics, “A Peek Into Netflix Queues,” ended up being one of our more popular graphics of the past few months. (A good roundup of what people wrote is here.) Since then, there have been a few questions about how the graphic was made, and Tyson Evans, a friend and colleague, thought it might interest SND members. (I bother Tyson with questions about CSS and Ruby pretty regularly, so I owe him a few favors.)

Most readers are probably interested in the interactive graphic, although I will say that we also ran a lovely full-page graphic in print in the Metropolitan section, which goes out to readers in the New York region. That graphic had a lot of interesting statistical analysis – in fact, it would have been nice to get some analysis in the web version, more on that later – but for this I will focus mostly on the web version. If there are questions about the print graphic, I will make sure I get Amanda Cox to try to explain cluster analysis to me again.

First is the data itself. Jo Craven McGinty, a CAR reporter, was in contact with Netflix to obtain a database of the top 50 movies in each ZIP code for every ZIP in the country. That’s about 1.9 million records. The database did not include the number of people renting the movie – just the rank. (We [more here: http://www.snd.org/2010/01/nyt-netflix-graphic ]


The Heatmap

In case you don't know what a heatmap is, it's basically a table that has colors in place of numbers. Colors correspond to the level of the measurement. Each column can be a different metric like above, or it can be all the same like this one. It's useful for finding highs and lows and sometimes, patterns.

On to the tutorial.

Step 0. Download R

We're going to use R for this. It's a statistical computing language and environment, and it's free. Get it for Windows, Mac, or Linux. It's a simple one-click install for Windows and Mac. I've never tried Linux.

Did you download and install R? Okay, let's move on.

Step 1. Load the data

Like all visualization, you should start with the data. No data? No visualization for you.

For this tutorial, we'll use NBA basketball statistics from last season that I downloaded from databaseBasketball. I've made it available here as a CSV file. You don't have to download it though. R can do it for you.

I'm assuming you started R already. You should see a blank window.

[Image: the R console]

Now we'll load the data using read.csv().

nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv", sep=",")

We've read a CSV file from a URL and specified the field separator as a comma. The data is stored in nba.

Type nba in the window, and you can see the data.

[Image: the loaded data printed in the console]
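If the download is unavailable, the same call can be tried on inline text. This is a minimal sketch, assuming a made-up three-player table; the names and numbers are placeholders, not the real contents of ppg2008.csv.

```r
# Hypothetical stand-in for the CSV; the values are illustrative only.
csv_text <- "Name,G,PTS,AST
Player A,79,30.2,7.5
Player B,81,28.4,7.2
Player C,82,26.8,4.9"

# read.csv() accepts text= in place of a file path or URL.
# sep = "," is already the default, so it can be left out.
toy <- read.csv(text = csv_text)

str(toy)   # one row per player, one column per field
head(toy)  # first few rows
```

str() is a quick way to check that the numeric columns were actually read as numbers before moving on.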

Step 2. Sort data

The data is sorted by points per game, greatest to least. Let's make it the other way around so that it's least to greatest.

nba <- nba[order(nba$PTS),]

We could just as easily have chosen to order by assists, blocks, etc.
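Here is a small sketch of ordering by a different column and of reversing the direction. The table is made up for illustration; only order() and the bracket indexing come from the step above.

```r
# Hypothetical three-player table; values are placeholders.
toy <- data.frame(
  Name = c("Player A", "Player B", "Player C"),
  PTS  = c(30.2, 28.4, 26.8),
  AST  = c(4.9, 7.5, 7.2),
  stringsAsFactors = FALSE
)

# Least to greatest by assists instead of points.
by_ast <- toy[order(toy$AST), ]

# Greatest to least by points: ask order() for decreasing order.
by_pts_desc <- toy[order(toy$PTS, decreasing = TRUE), ]

by_ast
by_pts_desc
```

order() returns row positions rather than sorted values, which is why it goes inside the brackets as the row index.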

Step 3. Prepare data

As is, the column names match the CSV file's header. That's what we want.

But we also want to name the rows by player name instead of row number, so type this in the window:

row.names(nba) <- nba$Name

Now the rows are named by player, and we don't need the first column anymore so we'll get rid of it:

nba <- nba[,2:20]
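A minimal sketch of the same two moves on a made-up three-player table (the player names and numbers are placeholders). It uses a negative index, toy[, -1], which drops the first column however many columns there are. That is a swapped-in idiom, not the 2:20 range used above, which keeps exactly columns 2 through 20.

```r
# Hypothetical table standing in for the NBA data.
toy <- data.frame(
  Name = c("Player A", "Player B", "Player C"),
  PTS  = c(30.2, 28.4, 26.8),
  AST  = c(7.5, 7.2, 4.9),
  stringsAsFactors = FALSE
)

# Label the rows by player name instead of 1, 2, 3.
row.names(toy) <- toy$Name

# Drop the first column (Name), now that it lives in the row names.
toy <- toy[, -1]

toy
```

If you stick with a range like 2:20, it is worth checking ncol() first so you know which columns the range actually covers.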

Step 4. Prepare data, again

Are you noticing something here? It's important to note that a lot of visualization involves gathering and preparing data. Rarely do you get data exactly how you need it, so you should expect to do some data munging before the visuals. Anyway, moving on.

The data was loaded into a data frame, but it has to be a data matrix to make your heatmap. The difference between a frame and a matrix is not important for this tutorial. You just need to know how to change it.

nba_matrix <- data.matrix(nba)
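For the curious, a quick sketch of what the conversion changes, using a made-up two-column table. A data frame is a list of columns that may have different types; data.matrix() turns it into a single numeric matrix, which is what heatmap() expects.

```r
# Hypothetical data frame with player names as row names.
toy <- data.frame(
  PTS = c(30.2, 28.4),
  AST = c(7.5, 7.2),
  row.names = c("Player A", "Player B")
)

toy_matrix <- data.matrix(toy)

is.data.frame(toy)     # TRUE: still a frame
is.matrix(toy_matrix)  # TRUE: now a matrix
is.numeric(toy_matrix) # TRUE: every cell is a number
dim(toy_matrix)        # rows, then columns
rownames(toy_matrix)   # row names carry over and become the heatmap labels
```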

Step 5. Make a heatmap

It's time for the finale. In just one line of code, build the heatmap:

nba_heatmap <- heatmap(nba_matrix, Rowv=NA, Colv=NA, col = cm.colors(256), scale="column", margins=c(5,10))

You should get a heatmap that looks something like this:

[Image: the resulting heatmap]
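The scale="column" argument matters because the columns have different units: games, minutes, percentages. As I read the heatmap() documentation, it centers and scales the values within each column before colors are assigned, which is roughly what scale() does to a matrix. A small sketch on made-up numbers:

```r
# Hypothetical matrix: three players, three stats on very different scales.
m <- matrix(
  c(79, 81, 82,        # G
    30.2, 28.4, 26.8,  # PTS
    7.5, 7.2, 4.9),    # AST
  nrow = 3,
  dimnames = list(c("Player A", "Player B", "Player C"),
                  c("G", "PTS", "AST"))
)

# Center each column on its mean and divide by its standard deviation.
z <- scale(m)

round(colMeans(z), 10)  # each column now averages to about zero
apply(z, 2, sd)         # and has a standard deviation of one
z
```

Without that step, a column with large raw values would soak up the whole color range and the small-valued columns would all look the same.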

Step 6. Color selection

Maybe you want a different color scheme. Just change the argument to col, which is cm.colors(256) in the line of code we just executed. Type ?cm.colors for help on what colors R offers. For example, you could use more heat-looking colors:

nba_heatmap <- heatmap(nba_matrix, Rowv=NA, Colv=NA, col = heat.colors(256), scale="column", margins=c(5,10))

[Image: the heatmap drawn with heat colors]

For the heatmap at the beginning of this post, I used the RColorBrewer library. Really, you can choose any color scheme you want. The col argument accepts any vector of hexadecimal-coded colors.
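One way to build such a vector without installing anything is colorRampPalette(), which ships with R. The sketch below is an assumption-laden example, not the palette used for the graphic in this post: the three blue hex codes are picked for illustration and the matrix is made up.

```r
# Interpolate 256 shades between three illustrative blues.
blues <- colorRampPalette(c("#F7FBFF", "#6BAED6", "#08306B"))(256)

head(blues, 3)  # a plain character vector of hex color codes
length(blues)

# Hypothetical matrix to try the palette on.
m <- matrix(
  c(79, 81, 82, 30.2, 28.4, 26.8, 7.5, 7.2, 4.9),
  nrow = 3,
  dimnames = list(c("Player A", "Player B", "Player C"),
                  c("G", "PTS", "AST"))
)

# Same call as in Step 5, with the custom vector passed to col.
heatmap(m, Rowv = NA, Colv = NA, col = blues,
        scale = "column", margins = c(5, 10))
```

With RColorBrewer installed, a vector from that package can be passed to col in the same way.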

Step 7. Clean it up – optional

If you're using the heatmap to simply see what your data looks like, you can probably stop. But if it's for a report or presentation, you'll probably want to clean it up. You can fuss around with the options in R or you can save the graphic as a PDF and then import it into your favorite illustration software.
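Saving to PDF can be done from the R menus, or in code by opening a PDF device, drawing, and closing the device. A minimal sketch, assuming a made-up matrix and a temporary output path (both are placeholders; swap in nba_matrix and a path of your choice):

```r
# Hypothetical matrix standing in for nba_matrix.
m <- matrix(
  c(79, 81, 82, 30.2, 28.4, 26.8, 7.5, 7.2, 4.9),
  nrow = 3,
  dimnames = list(c("Player A", "Player B", "Player C"),
                  c("G", "PTS", "AST"))
)

# Placeholder output location; any writable path works.
out_file <- file.path(tempdir(), "heatmap.pdf")

pdf(out_file, width = 7, height = 7)  # open the PDF device (size in inches)
heatmap(m, Rowv = NA, Colv = NA, col = cm.colors(256),
        scale = "column", margins = c(5, 10))
invisible(dev.off())                  # close the device so the file is written

out_file
```

Because PDF is a vector format, the cells and labels stay editable as shapes and text once the file is opened in illustration software.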

I personally use Adobe Illustrator, but you might prefer Inkscape, the open source (free) solution. Illustrator is kind of expensive, but you can probably find an old version on the cheap. I still use CS2. Adobe's up to CS4 already.

For the final basketball graphic, I used a blue color scheme from RColorBrewer and then lightened the blue shades, added a white border, changed the font, and organized the labels in Illustrator. Voila.

[Image: the revised NBA heatmap]

Rinse and repeat to use with your own data. Have fun heatmapping.


So what ARE people talking about?

Jan 14th, 2010 by analyticjournalism

One of the things we've noticed about journalism operations that allow comments and discussion on their web pages is that few take the time to analyze that interchange and content. Partially, that's because of a lack of tools. The “tldr Project” is a step toward meeting that challenge.

tldr PROJECT – http://demaws.net/projects/tldr#about

Recent years have seen a proliferation of large-scale discussion spaces on the internet. With increasing user participation, it is not uncommon to find discussion spaces with hundreds to thousands of messages/participants. This phenomenon can be observed on a wide variety of websites – news outlets, blogs, social media websites, community websites and support forums. While most of these discussion spaces are able to support small discussions, their effectiveness is greatly reduced as the discussions grow larger. Users participating in these discussions are overwhelmed by the sheer amount of information presented, and the systems that support these conversations are lacking in functionality that lets users navigate to content of interest.

tldr is an application for navigating through large-scale online discussions. The application visualizes structures and patterns within ongoing conversations to let the user browse to content of most interest. In addition to visual overviews, it also incorporates features such as thread summarization, non-linear navigation, multi-dimensional filtering, and various other features that improve the experience of participating in large discussions.

The current version of the application is functional for discussions on Reddit. This application will be released shortly. Until the application can be released, here is a video that presents many of the unique features built into the application. For best results, watch the video with HD turned on, or download a high-resolution version from Vimeo. More soon!

VISUALIZATION GALLERY

Here is a sample of patterns seen with the visualizations built into the application. Each of these visualizations presents unique insight into the nature of the conversation and helps in discerning points of interest within a large conversation.

PUBLICATION

Narayan, Srikanth and Cheshire, Coye – “Not too long to read: The tldr Interface for Exploring and Navigating Large-Scale Discussion Spaces”. To appear in The 43rd Annual Hawaii International Conference on System Sciences – Persistent Conversations Track – Jan 2010


David Rumsey's website redesign

Dec 14th, 2009 by analyticjournalism

New davidrumsey.com Website Redesign

“For the first time since its launch in 1999, the www.davidrumsey.com website has been completely redesigned and updated. With better navigation and structure, users will find it easier to explore the site's many viewers and collection database with over 21,000 maps online. A new Blog has been added to the site, and includes entries for Recent Additions, News, Featured Maps, Related Sites, and Videos. Over 200 historic maps from the collection can be viewed in a new browser-based version of Google Earth, and users can enter the Second Life version of the map collection directly from a dedicated Second Life portal page on the site. And the collection ticker at the bottom of the home page shows the entire online map library in random order over about 10 hours. As always, all maps can be downloaded for free directly from the site at full resolution. And a new service from Pictopia allows purchase of reproductions of any map in the collection directly from the new LUNA viewing software.”


Cartography 2.0

Dec 11th, 2009 by analyticjournalism

From Internet Scout:

Cartography 2.0

http://cartography2.org/

“Professor Mark Harrower at the University of Wisconsin Madison's Department of Geography was frustrated with the “inability of traditional textbooks to keep pace with Web technologies.” So he and his colleagues set out to create Cartography 2.0, which is a “free knowledge base and e-textbook for students and professionals interested in interactive and animated maps.” First-time visitors might want to look over the “Purpose” section before diving into the separate “Chapters” of the book. All of the chapters can be found on the homepage, and they cover topics such as map animation, virtual globes, elements of design, and map interaction techniques. Each chapter contains descriptive essays, along with maps and diagrams that illustrate key principles. The “New Content” section on the homepage features the latest additions to the site, and overall this work is a model for educators who might be interested in crafting an engaging and dynamic online textbook.”


A fine how-to from FlowingData on making an "Interactive Area Graph"

Dec 9th, 2009 by analyticjournalism

Nathan, the guy behind the code at the FlowingData blog, offers up a good how-to for producing an interactive area graph.

How to Make an Interactive Area Graph with Flare

Posted by Nathan / Dec 9, 2009 to Tutorials / 3 comments

You've seen the NameExplorer from the Baby Name Wizard by Martin Wattenberg. It's an interactive area chart that lets you explore the popularity of names over time. Search by clicking on names or typing in a name in the prompt. It's simple. It's sexy. Everybody loves it.

This is a step-by-step guide on how to make a similar visualization in Actionscript/Flash with your own data and how to customize the design for whatever you need. We're after last week's graphic on consumer spending:

[Image: consumer spending graphic]

Audience

This tutorial is for people with at least a little bit of programming experience. I'll try to make it as straightforward as possible, but the concepts might be a little hard to grasp if you've never written a line of code. Just a heads up. Of course it never hurts to try.

If you don't care about customization or integration into an application and don't mind putting your data in the public domain, you could also just dump your data into Many Eyes, and use the Stack Graph.

Get Adobe Flex Builder

Like I said, this is all in Actionscript, so before we start anything, I strongly recommend you get Adobe Flex Builder if you don't already have it. You can buy it, get a trial version from the Adobe site, or if you're in education, you can get it for free.

There are ways to compile Actionscript without Flex Builder, but they are more complicated. [read more here]


Swimming in Data? Three Benefits of Visualization

Dec 4th, 2009 by analyticjournalism

Good piece on dataviz from Harvard Business Publishing.

John Sviokla, The Near Futurist

Swimming in Data? Three Benefits of Visualization

4:11 PM Friday December 4, 2009

Tags: Information & technology, Knowledge management

“A good sketch is better than a long speech…” — a quote often attributed to Napoleon Bonaparte

The ability to visualize the implications of data is as old as humanity itself. Yet due to the vast quantities, sources, and sinks of data being pumped around our global economy at an ever increasing rate, the need for superior visualization is great and growing. To give dimension to the size of the challenge, the EMC reports that the “digital universe” added 487 exabytes — or 487 billion gigabytes — in 2008. They project that in 2012, we will add five times as much digital information as we did last year.

I believe that we will naturally migrate toward superior visualizations to cope with this information ocean. Since the days of the cave paintings, graphic depiction has always been an integral part of how people think, communicate, and make sense of the world. In the modern world, new information systems are at the heart of all management processes and organizational activities.

About ten years ago, I vividly remember visiting the Cabinet War Rooms in the basement of Whitehall, where Churchill had his war room during WW II. The desks were full of phones, and the walls covered with maps and information about troop levels and movements. These used color coded pieces of string to help Churchill's team easily understand what was happening:

On the one hand, I was struck by how primitive their information environment was only sixty years ago. But on the other, I found it reassuring to see how similar their approach was to war fighting today. The mode, quality and speed of data capture have changed greatly from the 1940s, but the paradigm for visualization of the terrain, forces, and strategy is almost identical to that of WWII. So, the good news is that even in a world of information surplus, we can draw upon deep human habits on how to visualize information to make sense of a dynamic reality. [more]


Distributed Data Analysis at Facebook

Dec 1st, 2009 by analyticjournalism

This is a few months old, but we're wondering if any readers have used Hive or tried to deploy it in newsrooms, where “exploring and analyzing data…[is] everyone's responsibility.”

Distributed Data Analysis at Facebook

Tuesday, August 11, 2009 at 2:53pm

Exploring and analyzing data isn’t the responsibility of one team here at Facebook; it’s everyone’s responsibility. “Move fast” is one of our core values, and to facilitate fast data-driven decisions, the Data Infrastructure Team has created tools like Hive and its UI sidekick, HiPal, to make analyzing Facebook’s petabytes of data easy for anyone in the company. The Data Science team runs open tutorial sessions for groups eager to run their own analysis using these tools. And non-programmers on every team have fearlessly rolled up their sleeves to learn how to write Hive queries.

Today, Facebook counts 29% of its employees (and growing!) as Hive users. More than half (51%) of those users are outside of Engineering. They come from distinct groups like User Operations, Sales, Human Resources, and Finance. Many of them had never used a database before working here. Thanks to Hive, they are now all data ninjas who are able to move fast and make great decisions with data.

If you like to move fast and want to be a data ninja (no matter what team you are in), check out our Careers page.


Roll-your-own choropleth map with free tools

Nov 12th, 2009 by analyticjournalism

Nathan, honcho at FlowingData, has put together a fine tutorial on making a choropleth map using free tools. This is one bookmark you will want to save.

How to Make a US County Thematic Map Using Free Tools

Posted: 11 Nov 2009 10:57 PM PST

There are about a million ways to make a choropleth map. You know, the maps that color regions by some metric. The problem is that a lot of solutions require expensive software or have a steep learning curve…or both. What if you just want a simple map without all the GIS stuff? In this post, I'll show you how to make a county-specific choropleth map using only free tools.

The Result

Here's what we're after. It's the most recent unemployment map from last week.

Unemployment in the United States

Step 0. System requirements

Just as a heads up, you'll need Python installed on your computer. Python comes pre-installed on the Mac. I'm not sure about Windows. If you're on Linux, well, I'm sure you're a big enough nerd to already be fluent in Python.

We're going to make good use of the Python library Beautiful Soup, so you'll need that too. It's a super easy, super useful HTML/XML parser that you should come to know and love.

Step 1. Prepare county-specific data

The first step of every visualization is to get the data. You can't do anything without it. In this example we're going to use county-level unemployment data from the Bureau of Labor Statistics. However, you have to go through FTP to get the most recent numbers, so to save some time, download the comma-separated (CSV) file here.



More Visualization Links on Twitter

By: Jeff Clark Date: Sat, 23 Jan 2010

Top Collections of Data Visualization Links

Top Data Visualization Product Links Mentioned on Twitter

Top Data Visualization Websites Mentioned on Twitter

How to Make a Heatmap – a Quick and Easy Solution
