Beautiful Data

The royalties for this book are being donated to Creative Commons and the Sunlight Foundation, two organizations dedicated to making the world better by freeing data.

24 May 2012 · at 1%

This is the case for YFD. PEIR users, on the other hand

25 May 2012 · at 3%

Andrew Doran

Having the explanation of both systems intertwined in the text like this is a bit confusing.

17:25 · 25 May 2012

The key method for collecting data from people online is, of course, through the use of the dreaded form. There is no artifact potentially more valuable to a business, or more boring and tedious to a participant.

25 May 2012 · at 6%

while email faces its own set of gatekeepers—namely, automated junk mail filters—very few people, as of yet, hire others to read it for them.

25 May 2012 · at 6%

Any time a survey requires an action from the respondent, you're inviting him to decide that the extra effort is not worth it, and to give up.

25 May 2012 · at 7%

There is no designers' holy grail that can make people enthusiastic about filling out a form.

26 May 2012 · at 7%

phishing expedition

26 May 2012 · at 8% · Henrik Berggren liked this

Andrew Doran

What a great phrase!

06:28 · 26 May 2012

Damien Ryan

I've never heard it used like that. Statisticians usually refer to running many analyses on data until you get the right results as a fishing expedition.

13:24 · 14 June 2012

Andrew Doran

Exactly. The context here, however, is the design of online surveys and trying to get participants to trust you with their data. If they think you are phishing (even if you are not) then they are unlikely to complete the survey.

13:45 · 14 June 2012

Damien Ryan

Oh, that's very interesting. Must pick up a copy of the book; I'm studying an MSc in Psychological Research Methods and my current module is all about data gathering - especially survey data. Survey attrition is one of the big problems with that form of data collection and anything to help increase response rates would be a boon.

16:33 · 14 June 2012

Andrew Doran

Sounds completely relevant. There's a whole chapter of the book on a case study of survey design, which is where this highlight is from. You can get it DRM-free (and Readmill-friendly) from O'Reilly.

17:16 · 14 June 2012

One of the secrets of email surveys is that the second mailing to the same list generally receives just as many responses as the first.

26 May 2012 · at 9%

In embedded systems, the use of dynamic memory allocation is usually considered to be a Really Bad Idea.

26 May 2012 · at 12%

the number of lines of code dedicated to error checking and fault handling was roughly equal to the lines of code that actually processed or otherwise handled the data.

26 May 2012 · at 13%

when a user in California is trying to tag a photo with a keyword, she definitely does not want to wait for the system to commit that tag to the Singapore replica of the tag database

30 May 2012 · at 16%

it is necessary to understand something about how the updates are distributed in the key space, and if necessary, preemptively split and move partitions to prepare for the upcoming onslaught of updates

30 May 2012 · at 17%

the known trade-off between consistency, availability, and partition tolerance: only two of those three properties can be guaranteed at all times.

30 May 2012 · at 17%

In a gossip protocol, an update is propagated to randomly chosen replicas, which in turn propagate it to other randomly chosen replicas.

02 June 2012 · at 19%

Andrew Doran

Great name.

13:25 · 02 June 2012
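
To make the gossip highlight concrete, here is a minimal simulation sketch. It assumes a simple push model in which every informed replica forwards the update to a fixed number of random peers each round; the names `gossip_rounds` and `fanout`, and the round-based timing, are illustrative assumptions and not taken from the book.

```python
import random

def gossip_rounds(n_replicas, fanout=2, seed=0, max_rounds=1000):
    """Count rounds until every replica has seen an update (push gossip sketch)."""
    rng = random.Random(seed)
    informed = {0}  # replica 0 receives the original update
    rounds = 0
    while len(informed) < n_replicas and rounds < max_rounds:
        newly_informed = set()
        for _ in informed:
            # Each informed replica pushes the update to `fanout` random peers.
            newly_informed.update(rng.sample(range(n_replicas), fanout))
        informed |= newly_informed
        rounds += 1
    return rounds

if __name__ == "__main__":
    print("rounds to reach 100 replicas:", gossip_rounds(100))
```

The informed set can at most triple per round with a fanout of 2, so the update spreads in a number of rounds that grows roughly with the logarithm of the replica count, with no replica needing to know the full membership order.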

Consider, for example, Alice, who updates her status from "Sleeping" to "Busy" and then updates her location from "Home" to "Work." Because of the order of updates, the only valid states of the record (from Alice's perspective, which is what matters) are (Sleeping,Home), (Busy,Home), and (Busy,Work). Under eventual consistency, if the two updates are made at different replicas, some replicas might receive the update to "Work" first, meaning that those replicas show a state of (Sleeping,Work) temporarily.

02 June 2012 · at 19%
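
The Alice example can be replayed in a few lines. This is only a sketch: it models a replica as a dictionary and an update as a (field, value) pair, which is an assumption made for illustration and not how any particular system stores records.

```python
INITIAL = {"status": "Sleeping", "location": "Home"}

# The states Alice could legitimately have been in, from her perspective.
VALID_STATES = {("Sleeping", "Home"), ("Busy", "Home"), ("Busy", "Work")}

def replay(updates):
    """Apply updates in the given order and return every state the replica showed."""
    state = dict(INITIAL)
    seen = [(state["status"], state["location"])]
    for field, value in updates:
        state[field] = value
        seen.append((state["status"], state["location"]))
    return seen

alice_order = [("status", "Busy"), ("location", "Work")]
reordered = list(reversed(alice_order))  # a replica that hears about "Work" first

if __name__ == "__main__":
    print(replay(alice_order))
    print(replay(reordered))
```

Both replicas end in (Busy, Work), which is the "eventual" part; the reordered replica passes through (Sleeping, Work) on the way, a state Alice was never in.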

All reasonably curious people wonder how their brain works. At 17, I was unreasonably curious.

06 June 2012 · at 20%

A challenge to the Inmon orthodoxy came in 1996 when Ralph Kimball published The Data Warehouse Toolkit (Wiley).

06 June 2012 · at 21%

Andrew Doran

I used this, along with The Data Warehouse Lifecycle Toolkit, when I project managed a large data warehouse initiative a few years ago. Excellent books.

19:53 · 06 June 2012

the MySQL database had quite a bit of recovery to catch up on. It was three days before we had a working data warehouse again.

06 June 2012 · at 21%

Andrew Doran

At what point do you terminate it vs letting it continue to run? Three days is a LOT of downtime.

19:54 · 06 June 2012

our first full day sent 400 gigabytes of unstructured data rushing over the bow of our Oracle database.

06 June 2012 · at 21%

Andrew Doran

Cripes.

19:54 · 06 June 2012

we were unable to aggregate a day of clickstream data in less than 24 hours.

06 June 2012 · at 22%

Andrew Doran

So interesting how often we come across limits like this. Requires more than just tuning - rearchitecting, more hardware or a completely different approach.

19:53 · 06 June 2012

you can see the results for yourself at http://facebook.com/lexicon.

06 June 2012 · at 22%

Andrew Doran

It doesn't seem that I can.

20:54 · 06 June 2012

the efficient use of space is an important aspect of beautiful visualization in that it supports the process of visual discovery of geographic (and other) patterns.

07 June 2012 · at 25%

most color schemes are perceptually nonuniform; in other words, the perceived similarity of two colors a fixed distance apart from each other varies across the color space. We therefore chose to use the CIELab color model, which provides a more perceptually consistent color gamut.

07 June 2012 · at 27%
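
A rough way to see the nonuniformity is to take two colour pairs that are the same distance apart in RGB and compare their distance in CIELab. The sketch below assumes the input colours are 8-bit sRGB and uses the D65 white point; the highlight does not say which source colour space or white point the authors used, so treat both as assumptions.

```python
import math

# D65 reference white (assumed).
WHITE = (0.95047, 1.0, 1.08883)

def _linear(channel):
    """Undo the sRGB transfer curve for one 8-bit channel."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def _f(t):
    """CIELab companding function."""
    delta = 6 / 29
    return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

def srgb_to_lab(rgb):
    """Convert an (R, G, B) triple in 0..255 to (L*, a*, b*)."""
    r, g, b = (_linear(v) for v in rgb)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b
    fx, fy, fz = (_f(v / w) for v, w in zip((x, y, z), WHITE))
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))

def lab_distance(rgb1, rgb2):
    """Euclidean distance in CIELab (the simple CIE76 colour difference)."""
    return math.dist(srgb_to_lab(rgb1), srgb_to_lab(rgb2))

if __name__ == "__main__":
    dark = ((0, 0, 0), (50, 50, 50))
    light = ((205, 205, 205), (255, 255, 255))
    for a, b in (dark, light):
        print(a, b, "RGB:", round(math.dist(a, b), 1), "Lab:", round(lab_distance(a, b), 1))
```

The two pairs are equally far apart in RGB, yet their CIELab distances differ, which is the kind of mismatch the highlight describes.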

Figure 6-9 provides an example in which displacement vectors are curved more sharply at their node position end than their geographic location end.

07 June 2012 · at 27%

Andrew Doran

To my eyes, the figures are all getting a bit messy and incomprehensible. If you need a detailed key to understand the picture, I think you're doing it wrong.

19:42 · 07 June 2012

offer the figures presented in this chapter as candidate beautiful depictions of beautiful data.

07 June 2012 · at 28%

Andrew Doran

I have to disagree.

19:41 · 07 June 2012

Federated search cannot support the "data finds data" mission, because it has no ability to deliver on enterprise discoverability at scale.

11 June 2012 · at 31%

Semantically reconciled[9] directories are directories that attempt to exploit synonyms

11 June 2012 · at 31%

Risk-assessment engines, for example, must be configured to produce alarms appropriate to one's individualized risk, staffing, and ability to respond.

11 June 2012 · at 32%

The major types of mainstream data that have been moving around over the past 15 years can be categorized as pornography and general consumer commerce

11 June 2012 · at 33%

If something can't be done over HTTP, that something needs to be reconsidered.

11 June 2012 · at 33%

"Winning data" is readable by a programmer. "Losing data" can be consumed only by machines.

11 June 2012 · at 34%

For the first time, at scale, we have software being integrated with other software with extreme regularity.

11 June 2012 · at 34%

Ever since the Jurassic period, these two execution-flow paradigms formed the foundation of software development.

11 June 2012 · at 35%

Andrew Doran

Really!

22:16 · 11 June 2012

The DiSo project is a major catalyst in bringing relevant parties to the table, across APIs, in order to distill more consumable social data

11 June 2012 · at 36%

resin statues of horses

12 June 2012 · at 37%

Andrew Doran

WTF.

09:03 · 12 June 2012

the URLs obtained from forms that use get are unique (and dependent on submitted values), whereas the ones obtained with post are not.

12 June 2012 · at 38%
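
A small sketch of why this is so: with get the submitted values are encoded into the query string, while with post they travel in the request body and the URL is just the form's action. The function name `submission_url` and the example.com address are hypothetical.

```python
from urllib.parse import urlencode

def submission_url(action, method, fields):
    """Return the URL a browser would request when the form is submitted."""
    if method.lower() == "get":
        # Field values become part of the URL itself.
        return f"{action}?{urlencode(fields)}"
    # For post, the values are sent in the body; the URL does not change.
    return action

if __name__ == "__main__":
    action = "http://example.com/search"  # hypothetical form action
    print(submission_url(action, "get", {"q": "beautiful data"}))
    print(submission_url(action, "post", {"q": "beautiful data"}))
```

A crawler can therefore record each get submission as a distinct, revisitable address, whereas every post submission collapses onto the same URL.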

The definitive source for the video is the project's Google Code page: http://code.google.com/radiohead

12 June 2012 · at 41%

Andrew Doran

No longer. Try http://www.youtube.com/watch?v=8nTFjVm9sTQ&feature=youtube_gdata_player

22:19 · 12 June 2012

I highly recommend you visit http://processing.org/ and check it out. As far as I'm concerned, it is the best programming language for artists, designers, or anyone interested in dynamic data visualization.

14 June 2012 · at 41%

Perfection is an admirable goal, but not always the most creative.

14 June 2012 · at 43%

Partner up with people who are more talented than you are, and your project will benefit enormously.

14 June 2012 · at 45% · addn2x, Marvin and Matthew Bostock liked this

it was important to make the information more findable by creating a data-first user interface. Data first means that it's possible to start with a broad visual overview, and narrow down search results by type, time, or geography.

14 June 2012 · at 47%

Where there is potential ambiguity—for example, date-first "/crimes/2009-01-09/Robbery" versus type-first "/crimes/Robbery/2009-01-09" or singular "/crime/Robbery" versus plural "/crimes/Robbery"—we introduce an HTTP redirect to the proper, canonical form.

14 June 2012 · at 48%
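
The redirect rule might look something like the sketch below. The highlight does not say which variant is canonical, so this assumes plural and type-first ("/crimes/Robbery/2009-01-09"); the function names and the choice of a 301 status are likewise assumptions.

```python
import re

DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def canonical_path(path):
    """Rewrite a crime URL path to the assumed canonical form (plural, type-first)."""
    parts = [p for p in path.split("/") if p]
    if not parts or parts[0] not in ("crime", "crimes"):
        return path  # not a crime URL; leave untouched
    rest = parts[1:]
    # Date-first becomes type-first.
    if len(rest) == 2 and DATE.match(rest[0]) and not DATE.match(rest[1]):
        rest = [rest[1], rest[0]]
    return "/" + "/".join(["crimes"] + rest)

def respond(path):
    """Return (status, path): a redirect if the path is not canonical, else 200."""
    target = canonical_path(path)
    if target != path:
        return 301, target
    return 200, path

if __name__ == "__main__":
    for p in ("/crimes/2009-01-09/Robbery", "/crime/Robbery", "/crimes/Robbery/2009-01-09"):
        print(p, "->", respond(p))
```

Redirecting instead of serving the same page at several addresses keeps one URL per resource, which helps with linking and caching.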

Anscombe's Quartet, a collection of statistically similar data sets illustrating the use of visualization to aid understanding.

14 June 2012 · at 50%

Andrew Doran

This is excellent. Would be great to use this as part of an A-Level statistics course.

18:47 · 14 June 2012
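
For anyone wanting to try this in a classroom, the sketch below computes summary statistics for the four data sets. The values are the commonly published ones from Anscombe's 1973 paper as best recalled here, so they are worth checking against a reference copy before use; the helper name `pearson` is illustrative.

```python
from statistics import mean

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

if __name__ == "__main__":
    for name, (xs, ys) in QUARTET.items():
        print(name, round(mean(xs), 2), round(mean(ys), 2), round(pearson(xs, ys), 3))
```

The means and correlations come out nearly identical across all four sets, and only a plot reveals how different they are.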

The most useful tool was Tableau, a database visualization system.

14 June 2012 · at 51%

familiar dimensions such as geography and time enable users to quickly look for themselves (or people like them) in the data and form narratives.

14 June 2012 · at 51%

Stacked graphs show aggregate patterns clearly and comfortably support interactive filtering, but do so at the cost of obscuring individual trends—perception of a trend is biased by the contour of the series stacked beneath it.

14 June 2012 · at 52%

An animation duration of ~1 second provided transitions that viewers could follow without slowing down the analysis process.

14 June 2012 · at 52%

using ColorBrewer (http://colorbrewer.org) to determine color choices

14 June 2012 · at 52%

The stacked graphs and population pyramid were built using the open source prefuse toolkit (http://prefuse.org).

14 June 2012 · at 53%

The tendency to create a story out of noise is sometimes dubbed the narrative fallacy.

15 June 2012 · at 56%

Andrew Doran

I see this every day and it annoys the hell out of me. "Asian markets fell on weak US manufacturing data..." – we can't prove cause and effect here, all we know is that (a) Asian markets fell and that (b) there was weak US manufacturing data. The financial press seems to always present possible relationships as fact I'm their commentaries.

08:19 · 15 June 2012

Andrew Doran

In, not I'm!

09:29 · 15 June 2012

Consider, also, the loose causal quips thrown around by financial journalists: "the Dow dropped 100 points on fear of rising unemployment."

15 June 2012 · at 56%

Andrew Doran

Exactly! Annoying.

08:19 · 15 June 2012

Karl Popper described an asymmetry in how we use data to answer questions: while no number of results in support of a hypothesis will ever confirm it, a single contradictory result will disprove it.

15 June 2012 · at 56% · Damien Ryan liked this

As the great ex-programmer Bill Gates once said, "Measuring programming progress by lines of code is like measuring aircraft building progress by weight."

18 June 2012 · at 64% · Henrik Berggren, Dirk Geurs, Christof Dorner and Daniel Schildt liked this

Henrik Berggren

Classic!

09:08 · 18 June 2012

we never want software to be the rate-limiting step in a workflow.

18 June 2012 · at 68%

Sequencescape's project management and laboratory information management tools are open source, and available to download from http://www.sanger.ac.uk.

18 June 2012 · at 69%

While the raw data is not backed up (restoring from tape would take three months),

18 June 2012 · at 69%

Andrew Doran

Another example in the book of natural limits for current computing. I love these challenges where the team has to really question the usual way of doing things.

20:03 · 18 June 2012

there is no accepted standard approach for saying "this number is a bit dodgy."

18 June 2012 · at 71%

Second Life may seem an odd choice for a scientific visualization environment.

18 June 2012 · at 73%

We love exploring big data sets. Rather than confirm prebaked hypotheses, we'll search for interesting patterns and correlations.

18 June 2012 · at 75%

When there is an overload of data, scatterplots can be misleading. One way to deal with this is to smooth the data, by plotting an estimated distribution rather than the points themselves (see Figure 17-6). We use a standard technique called kernel density estimation

18 June 2012 · at 77%
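
As a rough illustration of kernel density estimation, the sketch below places a Gaussian bump on every sample and averages them. It is one-dimensional for brevity, whereas the book's figure smooths a two-dimensional scatterplot; the Gaussian kernel, the bandwidth value and the sample data are assumptions chosen for the example.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` with a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, each centred on that sample.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

if __name__ == "__main__":
    data = [1.0, 1.2, 1.1, 3.0, 3.1]  # made-up sample values
    density = gaussian_kde(data, bandwidth=0.3)
    for x in (1.1, 2.0, 3.05):
        print(x, round(density(x), 3))
```

A larger bandwidth gives a smoother estimate that can hide structure, and a smaller one follows the individual points more closely, so the choice is a judgment call.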

It's perhaps surprising how extremely gendered words such as "handsome," "gamer," "Bubbly," and "slut" are. They appear with their gender almost all of the time.

19 June 2012 · at 79%

Andrew Doran

Why is that surprising? I'm not surprised.

07:55 · 19 June 2012

Inflation-adjusted house prices in 2003 dollars (black), and unadjusted prices (gray).

19 June 2012 · at 82%

Andrew Doran

The ePub version shows red and blue.

07:56 · 19 June 2012

even the word "statistics" reveals the connection of data collection for and about the state.

19 June 2012 · at 86%

graphs not as beautiful standalone artifacts but rather as tools to help us understand beautiful reality.

19 June 2012 · at 86%

gerrymandering

19 June 2012 · at 86%

Andrew Doran

Fascinating word. Never knew that was the definition (or what it really meant, if I'm honest!) http://www.etymonline.com/index.php?term=gerrymander&allowed_in_frame=0

17:38 · 19 June 2012

here's a pretty big list (but a very small sample) of what's out there:

19 June 2012 · at 89%

Andrew Doran

Great list of publicly-available data sets.

17:55 · 19 June 2012

consider what would happen if public data from hundreds of sources could be combined and we could search for connections between things. What would we find?

19 June 2012 · at 89%

Andrew Doran

Reminds me of a blog post by JP Rangaswami (@jobsworth on Twitter) from a couple of years ago. He quotes Danah Boyd who says "Just because something is publicly accessible does not mean that people want it to be publicized. Making something that is public more public is a violation of privacy." Lots to think about when trying to join up big sets of data. See http://confusedofcalcutta.com/2010/05/23/why-we-share-a-sideways-look-at-privacy/

19:58 · 19 June 2012

This is often called the "information silo problem," referring to the fact that information is cleanly separated and largely inaccessible—like grain in a silo (yes, I always thought that metaphor was a bit of a stretch).

19 June 2012 · at 90%

I believe that taking advantage of the full network of data is the key to solving this matching problem. The idea is embodied in a series of techniques called collective reconciliation or collective entity resolution.

19 June 2012 · at 91%