A Dabble in d3

by Willem Klumpenhouwer

Wednesday, January 13, 2021

Note: This document is a living work in progress. I plan to add to this guide as time allows.

As a researcher and data analyst, I have always placed a lot of value in finding interesting and engaging ways to share research findings and data results. Peer-reviewed publications are great and an important part of ensuring quality work, but they often suffer from being inaccessible, both metaphorically (due to the complexity of the minutiae of individual projects) and literally (due to the paywall-happy design of the journal system).

A couple of years ago, I was introduced to d3 (short for Data-Driven Documents) as a potential way to put together highly customizable, interactive visualizations on the web. Beautiful examples abound, but I was really sold when I saw it applied to transit in Boston to provide an annotated breakdown of a day in the life of MBTA's subway system.

I started with some basic plots, and was able to wrestle some rail tonnage information into an interactive chord diagram, but the code was not pretty, it didn't quite behave the way I wanted, and I didn't fully understand what I had done.

This became a recurring theme: Using the plethora of examples out there, I was able to get diagrams to work (after a lot of hair pulling), but I didn't really understand how exactly d3 was doing what I needed it to do, and more importantly I wasn't able to structure my projects in a way that was scalable, repeatable, and easily adjusted (say, when a requirement changed mid-project). Since then, I've gained a lot more experience working with d3 on larger projects; enough to look back on my previous work and see much room for improvement.

Which brings me to this article. I find myself going back to old, sub-par code or revisiting difficult-to-parse examples from the internet, which are often focused more on getting a single plot working quickly than on working within the context of a larger visualization project. In an effort to avoid repeating this cycle forever, I'm collecting my thoughts, code snippets, and "how-tos" into a single place.

The Essentials

This section is mainly here for comprehensiveness. There are oodles of "get started with d3" guides out there, each with its own approach and level of detail. You can use those to get the environment set up the way you want; depending on the web framework you're using that can all look very different, and I'm not going to touch that with a 30-foot pole. Suffice it to say that you need some basics for this to work. First, you need the d3 package itself. I've written this using v5, but I would say grab the latest version and iron out the problems later. Then be sure to include the file followed by your own script near the bottom of your HTML document:

<script type="text/javascript" src="path/to/d3.v5.min.js"></script>
<script type="text/javascript" src="path/to/projectScript.js"></script>

You're also going to need a CSS file to style all these beautiful plots, so go ahead and set one up now. I like to have a style sheet for each project.

Structuring a Project

After many haphazard iterations of various plotting approaches and data structures, I've settled on organizing my thinking and design around function. Breaking down the project into simple tasks that need to be executed allows us to organize our structure logically and in a way that can adapt to the differences between projects. These functions are:

  • Loading and manipulating the data

  • Putting the data into SVG elements

  • Styling those SVG elements

  • Interacting with those SVG elements

These are the basic functions that each project will need to perform, often multiple times and in multiple ways.

A lot of d3 plotting examples focus heavily on one-off plots. That is, the code works all right for a single example, but requires a fair amount of manual customization or manipulation to get the plot that you actually want. Further, many of these examples hard-code the shape and size of the plot, and even the responsive ones don't scale well unless your data and plot area are naturally 1:1 in ratio. In most cases, especially with dashboards or visual articles, they aren't, so the plot needs to respond to the space the page gives it.

It's also the case that there are seemingly infinite, subtly different ways to go about setting up your plot. Some folks use margins and translation, some don't. Some set the plot sizes in the JavaScript, some don't. I find that my JavaScript bloats no matter what I do, so for my own sanity I tend to prioritize making it easy to know where I am and what I'm working with over condensing my code to be as small as possible. The projects I do don't tend to get big enough that loading is an issue (and if they do, it's likely the data loading takes orders of magnitude longer than serving up the JavaScript).

In my HTML, I like to set up my page with a div for each chart that I'm going to be using, with an ID of course:

<!-- Narrative stuff -->
<div id="timeSeries"></div>
<!-- More narrative stuff -->
<div id="phaseSpace"></div>
<!-- Yet more narrative stuff -->

This makes it easy when I start putting together my JavaScript code; I can use the naming convention I've built in the page structure to initialize my SVGs. I do it this way so that they are readily accessible by various functions and events and can be updated within the code as needed. First, I like to define margins and bounding boxes in the code by pulling them from the page definition, and then create two objects: an svg object that holds the whole plot, and a g object which I transform according to the margin:

//Start by initializing the dimension variables
var phaseSpaceMargin = {top: 20, right: 50, bottom: 40, left: 30}
var phaseSpaceDIV = d3.select("#phaseSpace") // Keep the div handy
var phaseSpaceWidth = phaseSpaceDIV.node().getBoundingClientRect().width - phaseSpaceMargin.left - phaseSpaceMargin.right
var phaseSpaceHeight = phaseSpaceDIV.node().getBoundingClientRect().height - phaseSpaceMargin.top - phaseSpaceMargin.bottom

// It can be advantageous to hold the data outside of all the plotting
var phaseSpaceData = []

// Now attach the svg element and the graph itself
// The svg is good to keep separate from the g element
var phaseSpaceSVG = phaseSpaceDIV
	.append('svg')
	.attr('width',  phaseSpaceDIV.node().getBoundingClientRect().width)
	.attr('height', phaseSpaceDIV.node().getBoundingClientRect().height)

// The g element is what we draw our stuff on
var phaseSpaceG = phaseSpaceSVG
	.append('g')
	.attr("transform", "translate(" + phaseSpaceMargin.left + "," + phaseSpaceMargin.top + ")");

loadPhaseSpaceData() // Data loading to come next!

The next thing is to set up the base CSS for the container. This will depend on the visual look you're going for and the positioning of the plots in the page, but in this case I'm setting myself up for a plot that is responsive horizontally, but fixed vertically:

#phaseSpace{
	width: 100%;
	height: 150px;
}

I'll do this for each chart element on the page. Up to you whether you want to break them into separate files or keep it as one for the whole project. Whatever works for your brain.
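Since the same setup repeats for every chart, it can be wrapped in a small helper. What follows is only a sketch of one way to do that: `innerSize` and `initChart` are names made up for illustration, and `initChart` assumes d3 is already loaded on the page and that a div with the given ID exists.

```javascript
// Hypothetical helper: subtract the margins from an outer box to get the
// drawable area. Pure arithmetic, so it runs without d3 or a browser.
function innerSize(outerWidth, outerHeight, margin) {
	return {
		width: outerWidth - margin.left - margin.right,
		height: outerHeight - margin.top - margin.bottom
	};
}

// Hypothetical helper: build the svg and g elements for one chart div.
// Assumes the global d3 object is available and a div with this id exists.
function initChart(divId, margin) {
	var div = d3.select("#" + divId);
	var box = div.node().getBoundingClientRect();
	var svg = div.append("svg")
		.attr("width", box.width)
		.attr("height", box.height);
	var g = svg.append("g")
		.attr("transform", "translate(" + margin.left + "," + margin.top + ")");
	var size = innerSize(box.width, box.height, margin);
	return {div: div, svg: svg, g: g, margin: margin, width: size.width, height: size.height};
}

// Example with the margins used above and a 600 x 150 pixel div
var size = innerSize(600, 150, {top: 20, right: 50, bottom: 40, left: 30});
// size.width is 520 and size.height is 90
```

On the page, something like `var phaseSpace = initChart("phaseSpace", phaseSpaceMargin)` could then stand in for the block of per-chart variables. The trade-off is that the pieces live on one object instead of in separately named variables, which may or may not suit your brain.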

Loading and Manipulating Data

Ideally, pretty much all data processing takes place outside of d3; that is, the data you are working with in d3 have already been stripped of any excess information and are in a structure that is easy to work with. That may not always be possible, especially in cases where you may be pulling live data from another site. In either case, loading data (in CSV or JSON format) looks something like this:

// Pull CSV data using d3
d3.csv("url/to/file.csv")
	.then(function(data){
		// Manipulate and process the data as needed
		// Pass them along to make some charts
	});

// Pull JSON data using d3
d3.json("url/to/file.json")
	.then(function(data){
		// Manipulate and process the data as needed
		// Pass them along to make some charts
	});
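One thing to keep in mind is that d3.csv hands back every value as a string, so numeric columns need converting somewhere. d3.csv also accepts a row conversion function as its second argument, which is called once per row. Here is a minimal sketch, assuming made-up column names `date` and `cases` (yours will differ) and a hypothetical converter called `parseRow`:

```javascript
// Hypothetical row converter: the column names "date" and "cases" are
// assumptions for illustration, not taken from a real file.
function parseRow(d) {
	return {
		date: d.date,   // left as a string here; parse it however suits your data
		cases: +d.cases // unary plus turns the string "12" into the number 12
	};
}

// In the browser, with d3 loaded, it would be passed along like so:
// d3.csv("url/to/file.csv", parseRow).then(function(data){ /* ... */ });

// The converter is plain JavaScript, so it can be tried on its own
var row = parseRow({date: "1/22/20", cases: "12"});
// row.cases is the number 12
```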

For plotting purposes, I like to have my data as an array of objects (dictionaries). This seems like the cleanest way to hold the data but still give me the option of sorting things and adding fields. Ideally, we strip out whatever we don't need ahead of time, but if that's not possible we can do it here. Here, for example, I apply a general filter on two criteria, then use a jQuery loop (with Moment.js for the dates, so both libraries need to be loaded) to pull out only the entries whose key matches a strict date format and stick them into an array. Finally, I sort the data by ascending date using the array's sort method with d3.ascending as the comparator, and tack on a 7-day rolling average:

function loadPhaseSpaceData(){
	d3.csv("url/to/file.csv")
	.then(function(data){
		// Filtering the data down (logical AND, not the bitwise & operator)
		var filteredData = data.filter(d => (d['Attribute1'] === 'Filter1' && d['Attribute2'] === 'Filter2'));
		phaseSpaceData = []

		// Example iteration using jQuery
		// Walks the columns of the first matching row; keys that parse as dates are kept
		$.each(filteredData[0], function(key, val){
			if (moment(key, "M/D/YY", true).isValid()){
				if (+val > 1){
					phaseSpaceData.push({"date": moment(key, "M/D/YY", true).valueOf(), "cases": +val})
				}
			}
		});

		// Sort by date using d3.ascending as the comparator
		phaseSpaceData.sort((a, b) => d3.ascending(a.date, b.date))

		// Calculate a 7-day rolling average, leaving the first seven days empty
		// Example iteration using a straight up and down for loop
		for (var i = 0; i < phaseSpaceData.length; i++){
			if (i < 7){
				phaseSpaceData[i]['new'] = null;
			}
			else{
				phaseSpaceData[i]['new'] = (phaseSpaceData[i].cases - phaseSpaceData[i-7].cases)/7
			}
		}

		plotPhaseSpace(phaseSpaceData) // Plotting is next!
	});
}
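The averaging step is easier to check on a small array. Below is the same arithmetic pulled out into a stand-alone function and run on made-up numbers; the name `addRollingAverage` and the sample values are purely for illustration. Note the assumption baked into the subtraction: it only reads as an average of new cases per day if `cases` is a running total, in which case today's total minus the total seven days ago, divided by seven, is the daily average over that week.

```javascript
// Hypothetical stand-alone version of the rolling-average loop
function addRollingAverage(rows, windowSize) {
	for (var i = 0; i < rows.length; i++) {
		if (i < windowSize) {
			rows[i]['new'] = null; // not enough history yet
		} else {
			rows[i]['new'] = (rows[i].cases - rows[i - windowSize].cases) / windowSize;
		}
	}
	return rows;
}

// Made-up running totals that grow by 7 every day
var sample = [];
for (var day = 0; day < 9; day++) {
	sample.push({date: day, cases: 7 * day});
}
addRollingAverage(sample, 7);
// sample[6]['new'] is null; sample[7]['new'] is (49 - 0) / 7 = 7
```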

Notice that I've put all this into a function call. While it's not necessary in many cases, I find this approach allows for better data management and sets you up for easier handling of interactivity and events in the future. Separating the loading functionality from the plotting functionality means, for example, that you can feel free to update your plot on a window resize without worrying about the pain of re-loading the data. It can also allow for some comparisons to be made on other events to see if loading the data is really necessary (say, if the user selects the same option as is currently selected).
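One way to sketch that "is a reload really necessary?" check is to remember the last option that was loaded. Everything here is hypothetical: `makeSelectiveLoader`, the option names, and the stand-in loader are assumptions for illustration. In a real project the loader would be something like loadPhaseSpaceData.

```javascript
// Hypothetical wrapper: only call loadFn when the requested option changes
function makeSelectiveLoader(loadFn) {
	var currentOption = null;
	return function(option) {
		if (option === currentOption) {
			return false; // same selection as before, nothing to load
		}
		currentOption = option;
		loadFn(option);
		return true;
	};
}

// Stand-in loader that just counts how often it is called
var loadCount = 0;
var selectOption = makeSelectiveLoader(function(option) { loadCount += 1; });

selectOption("regionA"); // loads
selectOption("regionA"); // skipped, already showing regionA
selectOption("regionB"); // loads again
// loadCount is now 2
```

A window resize handler, by contrast, never needs the loader at all: it can go straight to the plotting function with the data already sitting in phaseSpaceData.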

Be careful of asynchronicity! All of the code that sits inside the .then() call happens after the data load, so you can be confident that the data you're working with is the stuff you just loaded. If you call loadPhaseSpaceData() and immediately call another function on the next line, the data may not have fully loaded and you will run into issues.
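Here is a small sketch of that pitfall, using a promise that resolves immediately in place of d3.csv so it can run anywhere. `fakeCsv`, `loadData`, and `chartData` are stand-ins made up for this example; d3.csv hands back a promise in the same way.

```javascript
var chartData = [];

// Stand-in for d3.csv: a promise that resolves with two made-up rows
function fakeCsv() {
	return Promise.resolve([{cases: 1}, {cases: 2}]);
}

// Returning the promise lets callers wait for the load if they need to
function loadData() {
	return fakeCsv().then(function(data) {
		chartData = data;
		return data;
	});
}

var pending = loadData();
var lengthRightAfterCall = chartData.length; // still 0: the .then() has not run yet

pending.then(function(data) {
	// Safe: this runs only after the data has arrived
	console.log("Loaded " + data.length + " rows");
});
```

Even with a promise that is already resolved, the .then() callback waits until the current code has finished running, which is why the array is still empty on the line right after the call. Anything that depends on the data belongs inside a .then(), either in the loading function itself or chained onto the promise it returns.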

Coming Up Next: Plotting the Data