design


From O’Reilly’s “What is Data Science” by Mike Loukides.

Making data tell its story

A picture may or may not be worth a thousand words, but a picture is certainly worth a thousand numbers. The problem with most data analysis algorithms is that they generate a set of numbers. To understand what the numbers mean, the stories they are really telling, you need to generate a graph. Edward Tufte’s Visual Display of Quantitative Information is the classic for data visualization, and a foundational text for anyone practicing data science. But that’s not really what concerns us here. Visualization is crucial to each stage of the data scientist’s work. According to Martin Wattenberg (@wattenberg, founder of Flowing Media), visualization is key to data conditioning: if you want to find out just how bad your data is, try plotting it. Visualization is also frequently the first step in analysis. Hilary Mason says that when she gets a new data set, she starts by making a dozen or more scatter plots, trying to get a sense of what might be interesting. Once you’ve gotten some hints at what the data might be saying, you can follow it up with more detailed analysis.

There are many packages for plotting and presenting data. GnuPlot is very effective; R incorporates a fairly comprehensive graphics package; Ben Fry’s Processing is the state of the art, particularly if you need to create animations that show how things change over time. At IBM’s Many Eyes, many of the visualizations are full-fledged interactive applications.

Nathan Yau’s FlowingData blog is a great place to look for creative visualizations. One of my favorites is this animation of the growth of Walmart over time. And this is one place where “art” comes in: not just the aesthetics of the visualization itself, but how you understand it. Does it look like the spread of cancer throughout a body? Or the spread of a flu virus through a population? Making data tell its story isn’t just a matter of presenting results; it involves making connections, then going back to other data sources to verify them. Does a successful retail chain spread like an epidemic, and if so, does that give us new insights into how economies work? That’s not a question we could even have asked a few years ago. There was insufficient computing power, the data was all locked up in proprietary sources, and the tools for working with the data were insufficient. It’s the kind of question we now ask routinely.

Data scientists

Data science requires skills ranging from traditional computer science to mathematics to art. Describing the data science group he put together at Facebook (possibly the first data science group at a consumer-oriented web property), Jeff Hammerbacher said:

… on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization.

Where do you find the people this versatile? According to DJ Patil, chief scientist at LinkedIn (@dpatil), the best data scientists tend to be “hard scientists,” particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem. When you’ve just spent a lot of grant money generating data, you can’t just throw the data out if it isn’t as clean as you’d like. You have to make it tell its story. You need some creativity for when the story the data is telling isn’t what you think it’s telling.

Scientists also know how to break large problems up into smaller problems. Patil described the process of creating the group recommendation feature at LinkedIn. It would have been easy to turn this into a high-ceremony development project that would take thousands of hours of developer time, plus thousands of hours of computing time to do massive correlations across LinkedIn’s membership. But the process worked quite differently: it started out with a relatively small, simple program that looked at members’ profiles and made recommendations accordingly. Asking things like, did you go to Cornell? Then you might like to join the Cornell Alumni group. It then branched out incrementally. In addition to looking at profiles, LinkedIn’s data scientists started looking at events that members attended. Then at books members had in their libraries. The result was a valuable data product that analyzed a huge database — but it was never conceived as such. It started small, and added value iteratively. It was an agile, flexible process that built toward its goal incrementally, rather than tackling a huge mountain of data all at once.

This is the heart of what Patil calls “data jiujitsu” — using smaller auxiliary problems to solve a large, difficult problem that appears intractable. CDDB is a great example of data jiujitsu: identifying music by analyzing an audio stream directly is a very difficult problem (though not unsolvable — see midomi, for example). But the CDDB staff used data creatively to solve a much more tractable problem that gave them the same result. Computing a signature based on track lengths, and then looking up that signature in a database, is trivially simple.
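The track-length trick is easy to sketch. What follows is a minimal illustration, not the actual CDDB disc ID algorithm: the `disc_signature`, `register`, and `identify` names, the SHA-1 hash, and the in-memory dictionary standing in for the database are all assumptions made for the example.

```python
import hashlib

# Hypothetical in-memory "database" mapping signatures to album titles.
database = {}


def disc_signature(track_lengths_sec):
    """Return a short hex signature for a sequence of track lengths in whole seconds."""
    # Order matters: the same lengths in a different order describe a different disc.
    canonical = ",".join(str(int(n)) for n in track_lengths_sec)
    return hashlib.sha1(canonical.encode("ascii")).hexdigest()[:12]


def register(track_lengths_sec, title):
    """Store a title under the signature computed from the disc's track lengths."""
    database[disc_signature(track_lengths_sec)] = title


def identify(track_lengths_sec):
    """Look up a disc by its signature; returns None when the disc is unknown."""
    return database.get(disc_signature(track_lengths_sec))


register([215, 187, 243, 301], "Example Album")
print(identify([215, 187, 243, 301]))  # the registered disc
print(identify([215, 187, 243, 300]))  # one second off, so no match in this sketch
```

Note that this sketch only matches discs whose track lengths agree exactly. The point is the shape of the solution: no audio analysis at all, just a cheap fingerprint and a lookup.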

Hiring trends for data science

It’s not easy to get a handle on jobs in data science. However, data from O’Reilly Research shows a steady year-over-year increase in Hadoop and Cassandra job listings, which are good proxies for the “data science” market as a whole. This graph shows the increase in Cassandra jobs, and the companies listing Cassandra positions, over time.

Entrepreneurship is another piece of the puzzle. Patil’s first flippant answer to “what kind of person are you looking for when you hire a data scientist?” was “someone you would start a company with.” That’s an important insight: we’re entering the era of products that are built on data. We don’t yet know what those products are, but we do know that the winners will be the people, and the companies, that find those products. Hilary Mason came to the same conclusion. Her job as scientist at bit.ly is really to investigate the data that bit.ly is generating, and find out how to build interesting products from it. No one in the nascent data industry is trying to build the 2012 Nissan Stanza or Office 2015; they’re all trying to find new products. In addition to being physicists, mathematicians, programmers, and artists, they’re entrepreneurs.

Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: “here’s a lot of data, what can you make from it?”

The future belongs to the companies who figure out how to collect and use data successfully. Google, Amazon, Facebook, and LinkedIn have all tapped into their datastreams and made that the core of their success. They were the vanguard, but newer companies like bit.ly are following their path. Whether it’s mining your personal biology, building maps from the shared experience of millions of travellers, or studying the URLs that people pass to others, the next generation of successful businesses will be built around data. The part of Hal Varian’s quote that nobody remembers says it all:

The ability to take data — to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it — that’s going to be a hugely important skill in the next decades.

I’m very excited about this talk. OCAD is running a great series of speakers who give unconventional perspectives on design, for free! I’ve been reading about Natalie Jeremijenko and I am smitten! She has thorough training in many science and engineering practices but finds ways to use these skills to produce objects that delight in the way art and design can, and that also level the playing field for individuals who want to understand their information environment:

What I’m most interested in is: how do we characterize systems of which we know very little, and have very poor information? Knowledge is very partial, very incomplete, and yet decisions are made. So, I specifically try to design information systems that measure urban environmental interactions.

For instance, I put a camera in Fresh Kills landfill, just a little networked web cam. It went on whenever the background radiation flipped above the so-called safe level.

What was interesting was that Staten Island has a hospital on it, which was also measuring environmental radiation. Medical facilities are required to do that. So they had their dosimeter, I had my dosimeter. We’re both gathering the same data and it’s not that different.

But mine’s triggering a web cam. So instead of presenting me with information so that it looks like science, like a little graph, it’s clips. Every time the background radiation fluctuates above a certain level, you get two seconds of video.

When you look at that, you start to see things you were not looking for. Seagulls are always going past when this is being triggered. Something happens at sundown, there’s a truck going past. That becomes interesting.

This issue of radioactive seagulls: there’s only one other paper on it. I wasn’t looking for radioactive seagulls. I had no idea about radioactive seagulls, or the concentration of radioactive diets that go on within the gullet of a seagull. It has actually been partially documented by some Greenpeace science groups in England, in Sellafield. But there are no publications on it here.

So, I was seeing something I wasn’t expecting to see. That’s discovery. That’s what I call data mining. Not taking corporate databases, and going through people’s social security numbers, classic data mining. What is interesting is having open systems that can tell you something. You learn something.

- http://www.worldchanging.com/archives/001450.html
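The trigger she describes, a short clip every time the reading rises above a set level, is simple enough to sketch. This is only my guess at the logic, not her actual setup: the `triggered_clips` name, the reading values, and the threshold are made up for illustration, and the only detail taken from the interview is the two-second clip.

```python
def triggered_clips(readings, threshold, clip_seconds=2):
    """Return (timestamp, clip_seconds) for each time the level rises above threshold.

    readings is a sequence of (timestamp, level) pairs in time order.
    A clip is requested only on the upward crossing, not for every
    sample that stays above the threshold.
    """
    clips = []
    above = False  # whether the previous sample was above the threshold
    for timestamp, level in readings:
        if level > threshold and not above:
            clips.append((timestamp, clip_seconds))
        above = level > threshold
    return clips


# Hypothetical dosimeter samples: (seconds since start, reading in arbitrary units).
samples = [(0, 0.10), (1, 0.12), (2, 0.31), (3, 0.35), (4, 0.11), (5, 0.40)]
print(triggered_clips(samples, threshold=0.30))  # → [(2, 2), (5, 2)]
```

The data going in is the same as the hospital’s dosimeter log; the only change is that each crossing produces something to look at instead of a point on a graph.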

Another great interview here in Salon’s The Artist as Mad Scientist.

If I go to the talk (fingers crossed), I’ll report back here.


This is a really exciting example of how properly applied visualization techniques can help users make sense of, and use, government-collected data. All of it is based in DC, USA. Now we need some Canadian examples.

Do you have any?

Bike Map


I like simple aphorisms that make me happy when I say them out loud. I’m not alone in this. I’ve noticed that a few sites in my world, which started with a simple but idealistic precept (individual change can be fun, and in aggregate can be funner(tm)), have matured and borne fruit over the past few years. One is the self-explanatory site changeeverything.ca. Funded by the best damn credit union/bank in the world, Vancity, the evolution of this online community/blog was guided by Kate Dugas.

Then there is the Learning to Love You More website, associated with a book of the same name, instigated by Miranda July (the writer/actor of that excellent Me and You and Everyone We Know film). The exercises suggested by the site are designed to encourage participants to engage in simple but intimate ways with their neighbourhoods, the physical place which includes plants, animals and other people. Trish Mau introduced me to this site and she has been completing some of the exercises judiciously.

Then I went to the Creative Activism show last night which was also the inaugural opening of the Toronto Free Gallery (just down the street from me). I got to make a ‘city repair’ request of Urbane Repairs representative Martin Reis, who was sitting behind a desk typing up request slips, claiming to be able to make the ‘city fun’ and ‘do in a week what takes the city 5 years’. I hope my request for a bike lane and local traffic only on St. Clarens between College and Bloor gets some prompt attention.

And just this morning I ran across this lovely site (again associated with a book), Things I Have Learned in My Life So Far, which uses typography and video as a creative frame for recording social/environmental interventions that demonstrate what the site contributors have learned so far. Beautiful, thoughtful videos are the result.

So to conclude, there is a trend here: The pithy aphorism comes off the page or out of someone’s mouth. It becomes action, it touches others, touches a place. Then it gets recorded and uploaded. Finally it inspires someone else to try her own hand at a living action and to share what she has learned, adding to the community cultural bank.

A very good model indeed.


Khoi Vinh sums it up very well for me in his article.

Still, even these basics are the second step, in my view. The prerequisite for doing something meaningful with any of these skills — HTML, CSS, Flash or whatever — is first embracing the medium as something different from print. Indeed, there’s no point in learning these skills unless as a print designer you’ve made a prior shift in your understanding of how design works in digital media. Specifically, come to grips with the fact that, on the Web, design is not a method for implementing narrative, as it is in print, but rather it’s a method for making behaviors possible.

More often than not, the reflexive approach that I’ve seen print designers take on the Web is to treat it as a vehicle for print-based design practices: fixing type sizes, specifying typefaces, ignoring usability and expediency, and perhaps most notoriously making the assumption that, over time, users will come around to a print-focused way of consuming content.