A thought-provoking collection of essays about different aspects of data. The book was a bit slow to start but soon picked up and covered a wide variety of topics. Although it only scratches the surface and the whole isn't greater than the sum of its parts, the change of subject with each chapter worked to its benefit and kept me interested.

Your reading activity
Dates: 24 May 2012 – 08 November 2013
Time spent reading: 9 hours, 15 minutes
Highlights: 69
Comments: 28
App used: Readmill
Friends who read
  • William Melody

The royalties for this book are being donated to Creative Commons and the Sunlight Foundation, two organizations dedicated to making the world better by freeing data.

This is the case for YFD. PEIR users, on the other hand

Andrew Doran

Having the explanation of both systems intertwined in the text like this is a bit confusing.

The key method for collecting data from people online is, of course, through the use of the dreaded form. There is no artifact potentially more valuable to a business, or more boring and tedious to a participant.

while email faces its own set of gatekeepers—namely, automated junk mail filters—very few people, as of yet, hire others to read it for them.

Any time a survey requires an action from the respondent, you're inviting him to decide that the extra effort is not worth it, and to give up.

There is no designers' holy grail that can make people enthusiastic about filling out a form.

phishing expedition

Andrew Doran

What a great phrase!

Damien Ryan

I've never heard it used like that. Statisticians usually refer to running many analyses on data until you get the right results as a fishing expedition.

Andrew Doran

Exactly. The context here, however, is the design of online surveys and trying to get participants to trust you with their data. If they think you are phishing (even if you are not) then they are unlikely to complete the survey.

Damien Ryan

Oh, that's very interesting. Must pick up a copy of the book; I'm studying an MSc in Psychological Research Methods and my current module is all about data gathering - especially survey data. Survey attrition is one of the big problems with that form of data collection and anything to help increase response rates would be a boon.

Andrew Doran

Sounds completely relevant. There's a whole chapter of the book on a case study of survey design, which is where this highlight is from. You can get it DRM-free (and Readmill-friendly) from O'Reilly.

One of the secrets of email surveys is that the second mailing to the same list generally receives just as many responses as the first.

In embedded systems, the use of dynamic memory allocation is usually considered to be a Really Bad Idea.

the number of lines of code dedicated to error checking and fault handling was roughly equal to the lines of code that actually processed or otherwise handled the data.

when a user in California is trying to tag a photo with a keyword, she definitely does not want to wait for the system to commit that tag to the Singapore replica of the tag database

it is necessary to understand something about how the updates are distributed in the key space, and if necessary, preemptively split and move partitions to prepare for the upcoming onslaught of updates

the known trade-off between consistency, availability, and partition tolerance: only two of those three properties can be guaranteed at all times.

In a gossip protocol, an update is propagated to randomly chosen replicas, which in turn propagates it to other randomly chosen replicas.

Andrew Doran

Great name.
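
The gossip idea in the highlight above can be sketched in a few lines. This is only an illustrative simulation, not the system the book describes: the function name `gossip_rounds`, the `fanout` of 2, and the fixed random seed are all assumptions made for the example.

```python
import random


def gossip_rounds(n_replicas, fanout=2, seed=0):
    """Count rounds until every replica has seen an update (toy model).

    Assumptions for this sketch: replica 0 receives the update first, and in
    each round every informed replica forwards it to `fanout` randomly
    chosen replicas (which may already be informed).
    """
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n_replicas:
        newly_informed = set()
        for _ in informed:
            # Each informed replica picks its targets at random.
            newly_informed.update(rng.sample(range(n_replicas), fanout))
        informed |= newly_informed
        rounds += 1
    return rounds


if __name__ == "__main__":
    print("rounds needed for 50 replicas:", gossip_rounds(50))
```

Because the informed set can at most triple each round with a fanout of 2, the number of rounds grows roughly with the logarithm of the number of replicas, which is what makes the approach attractive.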

Consider, for example, Alice, who updates her status from "Sleeping" to "Busy" and then updates her location from "Home" to "Work." Because of the order of updates, the only valid states of the record (from Alice's perspective, which is what matters) are (Sleeping,Home), (Busy,Home), and (Busy,Work). Under eventual consistency, if the two updates are made at different replicas, some replicas might receive the update to "Work" first, meaning that those replicas show a state of (Sleeping,Work) temporarily.
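
Alice's example can be checked mechanically. The sketch below is a toy model, not code from the book: it replays her two updates in every possible arrival order and reports any record state that she could never have produced herself. The names `states_seen` and `anomalous_states` are made up for the illustration.

```python
from itertools import permutations

INITIAL = {"status": "Sleeping", "location": "Home"}
# The order in which Alice actually issued her updates.
UPDATES = [("status", "Busy"), ("location", "Work")]


def states_seen(order):
    """Return every (status, location) pair a replica shows while applying `order`."""
    record = dict(INITIAL)
    seen = [(record["status"], record["location"])]
    for field, value in order:
        record[field] = value
        seen.append((record["status"], record["location"]))
    return seen


def anomalous_states():
    """States visible under some arrival order but invalid from Alice's perspective."""
    valid = set(states_seen(UPDATES))
    observed = set()
    for order in permutations(UPDATES):
        observed.update(states_seen(order))
    return observed - valid


if __name__ == "__main__":
    print(anomalous_states())
```

The only state outside Alice's valid set is the one the highlight names, (Sleeping, Work), and it appears exactly when the location update arrives before the status update.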

All reasonably curious people wonder how their brain works. At 17, I was unreasonably curious.

A challenge to the Inmon orthodoxy came in 1996 when Ralph Kimball published The Data Warehouse Toolkit (Wiley).

Andrew Doran

I used this, along with The Data Warehouse Lifecycle Toolkit, when I project managed a large data warehouse initiative a few years ago. Excellent books.

the MySQL database had quite a bit of recovery to catch up on. It was three days before we had a working data warehouse again.

Andrew Doran

At what point do you terminate it vs letting it continue to run? Three days is a LOT of downtime.

our first full day sent 400 gigabytes of unstructured data rushing over the bow of our Oracle database.

we were unable to aggregate a day of clickstream data in less than 24 hours.

Andrew Doran

So interesting how often we come across limits like this. Requires more than just tuning - rearchitecting, more hardware or a completely different approach.

you can see the results for yourself at

Andrew Doran

It doesn't seem that I can.

the efficient use of space is an important aspect of beautiful visualization in that it supports the process of visual discovery of geographic (and other) patterns.

most color schemes are perceptually nonuniform; in other words, the perceived similarity of two colors a fixed distance apart from each other varies across the color space. We therefore chose to use the CIELab color model, which provides a more perceptually consistent color gamut.

Figure 6-9 provides an example in which displacement vectors are curved more sharply at their node position end than their geographic location end.

Andrew Doran

To my eyes, the figures are all getting a bit messy and incomprehensible. If you need a detailed key to understand the picture, I think you're doing it wrong.

offer the figures presented in this chapter as candidate beautiful depictions of beautiful data.

Andrew Doran

I have to disagree.

Federated search cannot support the "data finds data" mission, because it has no ability to deliver on enterprise discoverability at scale.

Semantically reconciled directories are directories that attempt to exploit synonyms

Risk-assessment engines, for example, must be configured to produce alarms appropriate to one's individualized risk, staffing, and ability to respond.

The major types of mainstream data that have been moving around over the past 15 years can be categorized as pornography and general consumer commerce

If something can't be done over HTTP, that something needs to be reconsidered.

"Winning data" is readable by a programmer. "Losing data" can be consumed only by machines.

For the first time, at scale, we have software being integrated with other software with extreme regularity.

Ever since the Jurassic period, these two execution-flow paradigms formed the foundation of software development.

The DiSo project is a major catalyst in bringing relevant parties to the table, across APIs, in order to distill more consumable social data

resin statues of horses

the URLs obtained from forms that use get are unique (and dependent on submitted values), whereas the ones obtained with post are not.
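
A small sketch of why that is: with GET the submitted values are encoded into the URL itself, while with POST they travel in the request body and the URL stays the same. The helper name `request_target` and the `/search` path are hypothetical; `urlencode` is the standard-library function.

```python
from urllib.parse import urlencode


def request_target(action, method, fields):
    """Return (url, body) for a form submission.

    GET puts the encoded fields in the query string; any other method
    (POST here) leaves the URL untouched and sends the fields as the body.
    """
    encoded = urlencode(fields)
    if method.lower() == "get":
        return f"{action}?{encoded}", None
    return action, encoded


if __name__ == "__main__":
    print(request_target("/search", "get", {"q": "open data"}))
    print(request_target("/search", "post", {"q": "open data"}))
```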

The definitive source for the video is the project's Google Code page:

Andrew Doran

No longer. Try

I highly recommend you visit and check it out. As far as I'm concerned, it is the best programming language for artists, designers, or anyone interested in dynamic data visualization.

Perfection is an admirable goal, but not always the most creative.

Partner up with people who are more talented than you are, and your project will benefit enormously.

it was important to make the information more findable by creating a data-first user interface. Data first means that it's possible to start with a broad visual overview, and narrow down search results by type, time, or geography.

Where there is potential ambiguity—for example, date-first "/crimes/2009-01-09/Robbery" versus type-first "/crimes/Robbery/2009-01-09" or singular "/crime/Robbery" versus plural "/crimes/Robbery"—we introduce an HTTP redirect to the proper, canonical form.
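
One way such canonicalization might look in code is sketched below. The highlight does not say which form is canonical, so this example assumes the plural, type-first form ("/crimes/Robbery/2009-01-09"); the function names and the choice of a 301 status are likewise assumptions, not the project's actual implementation.

```python
import re

DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def canonical_path(path):
    """Rewrite a crime URL path to the assumed canonical form.

    Assumed canonical form: plural "crimes" prefix, then the crime type,
    then the date. Paths outside /crime(s)/ are returned unchanged.
    """
    parts = [p for p in path.split("/") if p]
    if not parts or parts[0] not in ("crime", "crimes"):
        return path
    rest = parts[1:]
    types = [p for p in rest if not DATE.match(p)]
    dates = [p for p in rest if DATE.match(p)]
    return "/" + "/".join(["crimes"] + types + dates)


def redirect_for(path):
    """Return (status, location) if a redirect is needed, else None."""
    target = canonical_path(path)
    return None if target == path else (301, target)


if __name__ == "__main__":
    for p in ("/crimes/2009-01-09/Robbery", "/crime/Robbery", "/crimes/Robbery/2009-01-09"):
        print(p, "->", redirect_for(p))
```

Redirecting rather than serving the same page at several addresses keeps one shareable URL per view of the data.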

Anscombe's Quartet, a collection of statistically similar data sets illustrating the use of visualization to aid understanding.

Andrew Doran

This is excellent. Would be great to use this as part of an A-Level statistics course.
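
For anyone wanting to try this in a classroom, here is a minimal sketch that computes the summary statistics for the four data sets. The values are the commonly published ones from Anscombe's 1973 paper, typed in for this example rather than taken from the book, and `pearson` is a small helper written for the illustration.

```python
from statistics import mean

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


if __name__ == "__main__":
    for name, (xs, ys) in QUARTET.items():
        print(f"{name}: mean x={mean(xs):.2f}, mean y={mean(ys):.2f}, r={pearson(xs, ys):.3f}")
```

The means and correlations come out nearly identical across all four sets, which is the point: only plotting them reveals how different they are.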

The most useful tool was Tableau, a database visualization system.

familiar dimensions such as geography and time enable users to quickly look for themselves (or people like them) in the data and form narratives.

Stacked graphs show aggregate patterns clearly and comfortably support interactive filtering, but do so at the cost of obscuring individual trends—perception of a trend is biased by the contour of the series stacked beneath it.

An animation duration of ~1 second provided transitions that viewers could follow without slowing down the analysis process.

using ColorBrewer to determine color choices

The stacked graphs and population pyramid were built using the open source prefuse toolkit

The tendency to create a story out of noise is sometimes dubbed the narrative fallacy.

Andrew Doran

I see this every day and it annoys the hell out of me. "Asian markets fell on weak US manufacturing data..." – we can't prove cause and effect here, all we know is that (a) Asian markets fell and that (b) there was weak US manufacturing data. The financial press seems to always present possible relationships as fact I'm their commentaries.

Andrew Doran

In, not I'm!

Consider, also, the loose causal quips thrown around by financial journalists: "the Dow dropped 100 points on fear of rising unemployment."

Andrew Doran

Exactly! Annoying.

Karl Popper described an asymmetry in how we use data to answer questions: while no number of results in support of a hypothesis will ever confirm it, a single contradictory result will disprove it.

As the great ex-programmer Bill Gates once said, "Measuring programming progress by lines of code is like measuring aircraft building progress by weight."

we never want software to be the rate-limiting step in a workflow.

Sequencescape's project management and laboratory information management tools are open source, and available to download from

While the raw data is not backed up (restoring from tape would take three months),

Andrew Doran

Another example in the book of natural limits for current computing. I love these challenges where the team has to really question the usual way of doing things.

there is no accepted standard approach for saying "this number is a bit dodgy."

Second Life may seem an odd choice for a scientific visualization environment.

We love exploring big data sets. Rather than confirm prebaked hypotheses, we'll search for interesting patterns and correlations.

When there is an overload of data, scatterplots can be misleading. One way to deal with this is to smooth the data, by plotting an estimated distribution rather than the points themselves (see Figure 17-6). We use a standard technique called kernel density estimation
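
To make the idea concrete, here is a minimal one-dimensional sketch of kernel density estimation with a Gaussian kernel. The chapter applies the technique to scatterplots, so this is a simplification, and the sample values, the bandwidth of 0.5, and the name `gaussian_kde` are all assumptions chosen for the example.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples`.

    Each sample contributes a Gaussian bump of width `bandwidth`; the
    estimate at x is the average of those bumps.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


if __name__ == "__main__":
    samples = [1.0, 1.2, 3.5, 3.7, 3.9]  # made-up data with two clusters
    density = gaussian_kde(samples, bandwidth=0.5)
    for x in (0.0, 1.0, 2.5, 3.7):
        print(f"density({x}) = {density(x):.3f}")
```

A larger bandwidth gives a smoother curve that can hide real structure, while a smaller one follows individual points more closely, so the choice is a judgment call.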

It's perhaps surprising how extremely gendered words such as "handsome," "gamer," "Bubbly," and "slut" are. They appear with their gender almost all of the time.

Andrew Doran

Why is that surprising? I'm not surprised.

Inflation-adjusted house prices in 2003 dollars (black), and unadjusted prices (gray).

Andrew Doran

The ePub version shows red and blue.

even the word "statistics" reveals the connection of data collection for and about the state.

graphs not as beautiful standalone artifacts but rather as tools to help us understand beautiful reality.


Andrew Doran

Fascinating word. Never knew that was the definition (or what it really meant, if I'm honest!)

here's a pretty big list (but a very small sample) of what's out there:

Andrew Doran

Great list of publicly-available data sets.

consider what would happen if public data from hundreds of sources could be combined and we could search for connections between things. What would we find?

Andrew Doran

Reminds me of a blog post by JP Rangaswami (@jobsworth on Twitter) from a couple of years ago. He quotes Danah Boyd who says "Just because something is publicly accessible does not mean that people want it to be publicized. Making something that is public more public is a violation of privacy." Lots to think about when trying to join up big sets of data. See

This is often called the "information silo problem," referring to the fact that information is cleanly separated and largely inaccessible—like grain in a silo (yes, I always thought that metaphor was a bit of a stretch).

I believe that taking advantage of the full network of data is the key to solving this matching problem. The idea is embodied in a series of techniques called collective reconciliation or collective entity resolution.