From O’Reilly’s “What is Data Science?” by Mike Loukides.

Making data tell its story

A picture may or may not be worth a thousand words, but a picture is certainly worth a thousand numbers. The problem with most data analysis algorithms is that they generate a set of numbers. To understand what the numbers mean, the stories they are really telling, you need to generate a graph. Edward Tufte’s Visual Display of Quantitative Information is the classic for data visualization, and a foundational text for anyone practicing data science. But that’s not really what concerns us here. Visualization is crucial to each stage of the data scientist. According to Martin Wattenberg (@wattenberg, founder of Flowing Media), visualization is key to data conditioning: if you want to find out just how bad your data is, try plotting it. Visualization is also frequently the first step in analysis. Hilary Mason says that when she gets a new data set, she starts by making a dozen or more scatter plots, trying to get a sense of what might be interesting. Once you’ve gotten some hints at what the data might be saying, you can follow it up with more detailed analysis.

There are many packages for plotting and presenting data. GnuPlot is very effective; R incorporates a fairly comprehensive graphics package; Ben Fry’s Processing is the state of the art, particularly if you need to create animations that show how things change over time. At IBM’s Many Eyes, many of the visualizations are full-fledged interactive applications.

Nathan Yau’s FlowingData blog is a great place to look for creative visualizations. One of my favorites is this animation of the growth of Walmart over time. And this is one place where “art” comes in: not just the aesthetics of the visualization itself, but how you understand it. Does it look like the spread of cancer throughout a body? Or the spread of a flu virus through a population? Making data tell its story isn’t just a matter of presenting results; it involves making connections, then going back to other data sources to verify them. Does a successful retail chain spread like an epidemic, and if so, does that give us new insights into how economies work? That’s not a question we could even have asked a few years ago. There was insufficient computing power, the data was all locked up in proprietary sources, and the tools for working with the data were insufficient. It’s the kind of question we now ask routinely.

Data scientists

Data science requires skills ranging from traditional computer science to mathematics to art. Describing the data science group he put together at Facebook (possibly the first data science group at a consumer-oriented web property), Jeff Hammerbacher said:

… on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization.

Where do you find the people this versatile? According to DJ Patil, chief scientist at LinkedIn (@dpatil), the best data scientists tend to be “hard scientists,” particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem. When you’ve just spent a lot of grant money generating data, you can’t just throw the data out if it isn’t as clean as you’d like. You have to make it tell its story. You need some creativity for when the story the data is telling isn’t what you think it’s telling.

Scientists also know how to break large problems up into smaller problems. Patil described the process of creating the group recommendation feature at LinkedIn. It would have been easy to turn this into a high-ceremony development project that would take thousands of hours of developer time, plus thousands of hours of computing time to do massive correlations across LinkedIn’s membership. But the process worked quite differently: it started out with a relatively small, simple program that looked at members’ profiles and made recommendations accordingly. Asking things like, did you go to Cornell? Then you might like to join the Cornell Alumni group. It then branched out incrementally. In addition to looking at profiles, LinkedIn’s data scientists started looking at events that members attended. Then at books members had in their libraries. The result was a valuable data product that analyzed a huge database — but it was never conceived as such. It started small, and added value iteratively. It was an agile, flexible process that built toward its goal incrementally, rather than tackling a huge mountain of data all at once.
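The "start small, then add signals" approach described above can be sketched in a few lines. This is only an illustration of the idea, not LinkedIn's actual code: the `Member` record, the lookup tables, and the event name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """Hypothetical member record; the field names are illustrative only."""
    name: str
    schools: set = field(default_factory=set)
    events: set = field(default_factory=set)


# Hypothetical lookup tables mapping one profile signal to a group name.
SCHOOL_GROUPS = {"Cornell": "Cornell Alumni"}
EVENT_GROUPS = {"Example Conference": "Example Conference Attendees"}


def recommend_groups(member):
    """Return group suggestions, adding one simple signal at a time."""
    suggestions = []
    # First iteration: look only at the schools listed on the profile.
    for school in sorted(member.schools):
        if school in SCHOOL_GROUPS:
            suggestions.append(SCHOOL_GROUPS[school])
    # Later iteration: add events attended as a second signal.
    for event in sorted(member.events):
        if event in EVENT_GROUPS:
            suggestions.append(EVENT_GROUPS[event])
    return suggestions


if __name__ == "__main__":
    print(recommend_groups(Member("Ada", schools={"Cornell"})))
```

Each new signal (events, then books) is one more small loop, which is what lets a feature like this grow incrementally instead of starting with a massive correlation job.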

This is the heart of what Patil calls “data jiujitsu” — using smaller auxiliary problems to solve a large, difficult problem that appears intractable. CDDB is a great example of data jiujitsu: identifying music by analyzing an audio stream directly is a very difficult problem (though not unsolvable — see midomi, for example). But the CDDB staff used data creatively to solve a much more tractable problem that gave them the same result. Computing a signature based on track lengths, and then looking up that signature in a database, is trivially simple.
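To see why the track-length approach is so much easier than audio analysis, here is a minimal sketch. It is not the real CDDB disc ID algorithm: the hashing scheme, the track lengths, and the one-entry database are assumptions made up for illustration.

```python
import hashlib
from typing import List, Optional


def disc_signature(track_lengths_seconds: List[int]) -> str:
    """Build a short signature from the ordered track lengths of a disc.

    Illustration only: this is NOT the real CDDB disc ID algorithm.
    """
    joined = ",".join(str(length) for length in track_lengths_seconds)
    return hashlib.sha1(joined.encode("ascii")).hexdigest()[:12]


# Hypothetical database mapping signatures to album titles.
DATABASE = {disc_signature([215, 187, 242, 301]): "Example Album"}


def identify(track_lengths_seconds: List[int]) -> Optional[str]:
    """Look up a disc by its signature; returns None when it is unknown."""
    return DATABASE.get(disc_signature(track_lengths_seconds))


if __name__ == "__main__":
    print(identify([215, 187, 242, 301]))
```

The hard problem (recognising audio) is replaced by a dictionary lookup, at the cost of needing someone to have entered the disc into the database first.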

Hiring trends for data science

It’s not easy to get a handle on jobs in data science. However, data from O’Reilly Research shows a steady year-over-year increase in Hadoop and Cassandra job listings, which are good proxies for the “data science” market as a whole. This graph shows the increase in Cassandra jobs, and the companies listing Cassandra positions, over time.

Entrepreneurship is another piece of the puzzle. Patil’s first flippant answer to “what kind of person are you looking for when you hire a data scientist?” was “someone you would start a company with.” That’s an important insight: we’re entering the era of products that are built on data. We don’t yet know what those products are, but we do know that the winners will be the people, and the companies, that find those products. Hilary Mason came to the same conclusion. Her job as scientist at bit.ly is really to investigate the data that bit.ly is generating, and find out how to build interesting products from it. No one in the nascent data industry is trying to build the 2012 Nissan Stanza or Office 2015; they’re all trying to find new products. In addition to being physicists, mathematicians, programmers, and artists, they’re entrepreneurs.

Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: “here’s a lot of data, what can you make from it?”

The future belongs to the companies who figure out how to collect and use data successfully. Google, Amazon, Facebook, and LinkedIn have all tapped into their datastreams and made that the core of their success. They were the vanguard, but newer companies like bit.ly are following their path. Whether it’s mining your personal biology, building maps from the shared experience of millions of travellers, or studying the URLs that people pass to others, the next generation of successful businesses will be built around data. The part of Hal Varian’s quote that nobody remembers says it all:

The ability to take data — to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it — that’s going to be a hugely important skill in the next decades.

Have you received an email from a techie recently that looks like this?

“Had a hard time getting the X installed but is working fine now. Will be back in 10 minutes. Happy to hear responses then.”

Dropped your ‘I’? Here, let me get that for you.

There is a trend in techie communication where people are dropping the pronoun ‘I’. (They are also using plus and minus signs to indicate agreement, as in ‘++’, but that’s the subject of another blog post.)

I first noticed this trend in a friend of mine who is an early adopter of many things online. I was confused at first: it seemed that he – who I consider to be a very responsible worker – was abdicating responsibility for his words. I wondered if the pressure of delivering such consistently good work had gotten to him. But then I noticed it springing up in other email lists. This act of dropping the ‘I’ was suddenly attractive to many people, maybe because they noticed other people doing it, and it really annoyed me!

Then I watched a CBC documentary about how and why people lie. One computer program had been developed by someone at MIT to scan people’s emails to determine if they were truthful or not. It turns out people who lie drop their pronouns in personal correspondence! They are not taking responsibility for their words and it shows.

The same program explained micro-expressions – a phenomenon where the human face will express for a fraction of a second the true emotion someone is feeling, even if they are able to maintain a false expression the majority of the time. They showed a series of micro-expressions and I was able to guess them each time. So maybe I’m just acutely perceptive to sincerity cues?

And in closing, some imperative statements from This Blogger:

Life is short. Let’s stand behind our words. And if that’s too much of a burden, let us speak and write less.

Dawn

I’m very excited about this talk. OCAD is running a great series of speakers who give unconventional perspectives on design, for free! I’ve been reading about Natalie Jeremijenko and I am smitten! She has a thorough training in many science and engineering practices but finds ways to use these skills to produce objects that delight in the way art and design can, and that also level the playing field for individuals who want to understand their information environment:

What I’m most interested in is: how do we characterize systems of which we know very little, and have very poor information? Knowledge is very partial, very incomplete, and yet decisions are made. So, I specifically try to design information systems that measure urban environmental interactions.

For instance, I put a camera in Fresh Kills landfill, just a little networked web cam. It went on whenever the background radiation flipped above the so-called safe level.

What was interesting was that Staten Island has a hospital on it, which was also measuring environmental radiation. Medical facilities are required to do that. So they had their dosimeter, I had my dosimeter. We’re both gathering the same data and it’s not that different.

But mine’s triggering a web cam. So instead of presenting me with information so that it looks like science, like a little graph, it’s clips. Every time the background radiation fluctuates above a certain level, you get two seconds of video.

When you look at that, you start to see things you were not looking for. Seagulls are always going past when this is being triggered. Something happens at sundown, there’s a truck going past. That becomes interesting.

This issue of radioactive seagulls: there’s only one other paper on it. I wasn’t looking for radioactive seagulls. I had no idea about radioactive seagulls, or the concentration of radioactive diets that go on within the gullet of a seagull. It has actually been partially documented by some Greenpeace science groups in England, in Sellafield. But there are no publications on it here.

So, I was seeing something I wasn’t expecting to see. That’s discovery. That’s what I call data mining. Not taking corporate databases, and going through people’s social security numbers, classic data mining. What is interesting is having open systems that can tell you something. You learn something.

- http://www.worldchanging.com/archives/001450.html
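The webcam trigger Jeremijenko describes in the interview above is easy to picture in code. This is a rough sketch, not her implementation: the reading format, the threshold value, and the function name are assumptions, and only the two-second clip length comes from the interview.

```python
def triggered_clips(readings, threshold, clip_seconds=2.0):
    """Return (start, end) clip windows for each upward threshold crossing.

    readings: iterable of (timestamp_seconds, radiation_level) pairs,
    in time order. The units of radiation_level are left unspecified.
    """
    clips = []
    was_above = False
    for timestamp, level in readings:
        is_above = level > threshold
        # Trigger only when the level flips from below to above the threshold.
        if is_above and not was_above:
            clips.append((timestamp, timestamp + clip_seconds))
        was_above = is_above
    return clips


if __name__ == "__main__":
    sample = [(0, 0.1), (1, 0.5), (2, 0.6), (3, 0.2), (4, 0.7)]
    print(triggered_clips(sample, threshold=0.4))
```

The output is a list of video windows rather than a graph, which is the point of the interview: the same dosimeter data, presented as clips, shows you what else was happening when the reading spiked.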

Another great interview here in Salon’s The Artist as Mad Scientist.

If I go to the talk (fingers crossed) I’ll report back here.

Well, I’ve never trusted cell phones, and people who try to get hold of me using my cell number quickly realize I never recharge it and almost never take it out. It’s a resource of last resort – something I bought when I first moved to Toronto and needed a phone number, that modern stamp of legitimacy, before I even had a permanent home.

LG 150

So when I did occasionally use my $50 LG 150 (the cheapest I could get) – and my head felt warm from the cell phone – I was not surprised. What did surprise me was how even the most tech-suspicious people I knew eventually embraced constant cell phone use (getting rid of their very expensive landlines) and chided me for my non-conformance. Well, now LG has ‘voluntarily’ recalled this line of phones, which according to Health Canada are emitting too much radiation:

Testing by Industry Canada has revealed that the LG 150 does not meet the radiofrequency exposure limits established by the federal government, i.e. Safety Code 6, and referenced in the regulations of the Radiocommunications Act. An independent accredited certification body has revoked the certification for the LG 150 model and thus it is no longer eligible to be manufactured, imported and sold in Canada. Consequently, Industry Canada has removed the LG 150 from its Radio Equipment List.

Health Canada is of the opinion, based on the review of test results and its assessment of current science, that the past and current use of the LG 150 should not pose immediate or long-term health concerns. While test results exceeded the exposure limits of Safety Code 6, they were well below the threshold at which harmful health effects might occur. Nevertheless, Health Canada supports the recall and encourages all consumers to return LG 150 mobile phones to their service providers for a no-cost replacement.

It sure would be nice to know more about the test results and by how much they exceeded ‘the exposure limits of Safety Code 6’. So Virgin Mobile tells me I’ll be getting a free Samsung replacement, which I will continue to use for 5 minutes every three months. If you read the comment section below this article on the recall you’ll see my concerns are shared by other people. It’s just common sense to ask how safe these things are, and so far we haven’t heard any official replies.

by Alan on Wed 28 Jan 2009 12:42 AM EST | Permanent Link
The LG150’s seem to be the lower, if not the lowest priced and graded cell that LG has to offer (through Telus anyway), how can I be so sure that the next grade, the LG860 doesn’t have the same excessive exposure? I feel as if my health is at risk by continuing to use this products that cause tissue damage!

Wendy Mesley at CBC recently did some investigation into the risks of cell phone use for children, and this parenting site has further discussions on the issue. Health Canada is not taking a precautionary approach to cell phone use in Canada. The precautionary approach to cell phone use recommended by Ronald Herberman, director of the University of Pittsburgh Cancer Institute, is cited in both this Scientific American blog entry and this UK Observer article. “We shouldn’t wait for a definitive study to come out, but err on the side of being safe rather than sorry later,” said Dr Herberman in the UK Observer article, continuing:

“I am convinced that there are sufficient data to warrant issuing an advisory to share some precautionary advice on cell phone use.”

His warning came even though no major academic study has yet found any evidence that exposure to mobile phone signals affects brain function and the US Food and Drug Administration has said that, if there is a risk, it is probably very small.

Dr Herberman, however, said there was a “growing body of literature” which linked long-term mobile phone use with adverse health effects, including cancer.

Of course, officially some departments in the Canadian government still don’t admit there is much risk to human health from asbestos or nanotechnology. That’s why sunscreen and lipstick that contain nanoparticles carry no warning labels in Canada (although that might be changing in February 2009).

This other UK Observer article parallels the dangers between asbestos and nanoparticles, both (at times) highly profitable industries for Canada:

Professor Anthony Seaton, from the University of Aberdeen, said that titanium oxide was harmless in its ordinary form, but had been shown to have a toxic effect on cells in its very fine, nanoparticle form.

He did not predict however that the technology’s effects on health would be severe.

“If you burn toast you are producing nanoparticles. I produce them regularly,” he said.

But Brendan Barber, the TUC general secretary, said the danger to workers of breathing in particles and fibres was a real concern.

“Asbestos is still killing people 100 years on,” he said. “We must learn from this tragedy and ensure that a regulated nanotechnology industry can make products that are useful and innovative but safe to workers and consumers.”


Read the section15.ca original story here


I seem to have landed in the middle of some exciting, new (to me) theories around online news and network theory.

Last week while in Montreal I met Claude G. Théoret, who is part of Exvisu, which offers “Strategic Network Intelligence”. What is that? They make maps of conversations occurring on the web, noting the number of links between blogs and recurring terms. That’s what I understand at this preliminary phase. They are offering their services to companies and politicians who want to know what the hot button issues are among the people they need to please, as well as what kind of language is being used to talk about those issues. The outcome of this research aims to be similar to what pollsters claim to do. I imagine using both techniques together will produce the most fruitful results.
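As I understand it, the raw material for such a map is just two tallies: who links to whom, and which terms keep coming up. Here is a toy sketch of those tallies. It is my own guess at the general idea, not Exvisu's method, and the post format (`blog`, `links_to`, `text`) is made up for the example.

```python
import re
from collections import Counter


def link_counts(posts):
    """Count directed links between blogs as (source, target) pairs."""
    counts = Counter()
    for post in posts:
        for target in post["links_to"]:
            counts[(post["blog"], target)] += 1
    return counts


def recurring_terms(posts, min_posts=2):
    """Return terms that appear in at least min_posts different posts."""
    seen_in = Counter()
    for post in posts:
        # Use a set so a term counts once per post, however often it repeats.
        seen_in.update(set(re.findall(r"[a-z']+", post["text"].lower())))
    return {term: n for term, n in seen_in.items() if n >= min_posts}


if __name__ == "__main__":
    posts = [
        {"blog": "a", "links_to": ["b"], "text": "Bike lanes now"},
        {"blog": "a", "links_to": ["b", "c"], "text": "More bike lanes"},
        {"blog": "c", "links_to": [], "text": "Transit funding"},
    ]
    print(link_counts(posts))
    print(recurring_terms(posts))
```

The link counts give the edges of the map and their weights; the recurring terms hint at the language people are using around an issue.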

Claude lent me his copy of Linked by Albert-László Barabási – a well written explanation of how network theory developed and how it is being used by the ‘new cartographers’.

And today I came upon a Dutch site – Issuenetwork.org – that offers tools and information about issue networking on the web. I was particularly interested in this 2004 paper on the way news delivery will be (is being) transformed by the evolution of network technology, and particularly in the section “six arguments against news”, a provocative subheading in my circles, but one meant to criticise mainstream news delivery techniques.

Another article on the same site, The News about Networks 2: Making Issues into Rights, introduces a 2004 workshop whose aim is to get media activists using the Issue Crawler tools (which in their aims seem similar to the tools used by Exvisu) being developed at the de Balie Center for Culture and Politics, Amsterdam.

Much of the workshop will revolve around using the Issue Crawler, server-side software, developed with OneWorld International (London), Aguidel (Paris) and Recognos (Cluj-Napoca) that locates, analyses and visualises networks on the Web. We also will make use of novel techniques to monitor and analyse the news through Google News and RSS readers. Textual, semantic and other data analyses may be undertaken.

The questions they were asking at the conference were:

  • What are my networks? What is my relative standing within these networks?
  • Which types of organisations, agendas and terms dominate these networks?
  • Do the organisations in these networks recognise each other’s work and issues?
  • Which parts of the networks hold together if one takes out funders? Do they hold together if one takes out other agenda-setters, be it (big) media or intergovernmental organisations?

This is a really exciting example of ways properly applied visualization techniques can help users make sense and use of government collected data. All DC, USA based. Now we need some Canadian examples.

Do you have any?

Bike Map


I love this: rainbows, libraries and whimsy – and it’s a grad project by Valérie Madill, a master’s student at my old alma mater, Emily Carr University.

where is the magic
“Why is it then that the magic, mystery, adventure and knowledge is not sensed when entering a library? It is disgraceful that a library should be considered dull and stuffy. After observing Academic libraries, how they function, who uses them, how they are used, when they are used, what they look like, what expectations are, what frustrations are… I discovered many things, and decided the most notable design solution would be one applicable to all libraries, big, small, regardless of the physical shape and structure. In terms of library architecture, I came across some brilliant spaces, I even designed my own ultimate library in terms of placement, aesthetic and although I had some very important key factors established, I quickly realized the varying shape and infinite possibility of library architecture was part of the magic I did not want to lose.”

Valérie Madill

library bar code

(Also note she built her site with Indexhibit – lots of artists seem to like it).


Many people I know who read online are as annoyed as I am by websites that paginate their content. We all know this decision is usually motivated by a desire to create more pageviews for advertisers rather than by any regard for the reading pleasure of their public. Salon.com has done it since early on, and I often find myself leaving the article rather than clicking to continue, or if I do really like the story I will simply hit the print button to read the whole story sans images and ads.

But you don’t have to take the word of a web developer with 9 years of experience. Instead, recognize that if people are going to the trouble of creating browser plugins that recombine these pages into one, there is a market for single page articles. Here is a great lifehacker article on the subject. Also check out the revealing opinions in the comment section below.


I like simple aphorisms that make me happy when I say them out loud. I’m not alone in this. I’ve noticed a few sites in my world that started with a simple but idealistic precept – that individual change can be fun and in aggregate can be funner(tm) – have matured and borne fruit over the past few years. One is the self-explanatory site changeeverything.ca. Funded by the best damn credit union/bank in the world, Vancity, the evolution of this online community/blog was guided by Kate Dugas.

Then there is the Learning to Love You More website, associated with a book of the same name, instigated by Miranda July (the writer/actor of that excellent Me and You and Everyone We Know film). The exercises suggested by the site are designed to encourage participants to engage in simple but intimate ways with their neighbourhoods, the physical place which includes plants, animals and other people. Trish Mau introduced me to this site and she has been completing some of the exercises judiciously.

Then I went to the Creative Activism show last night which was also the inaugural opening of the Toronto Free Gallery (just down the street from me). I got to make a ‘city repair’ request of Urbane Repairs representative Martin Reis, who was sitting behind a desk typing up request slips, claiming to be able to make the ‘city fun’ and ‘do in a week what takes the city 5 years’. I hope my request for a bike lane and local traffic only on St. Clarens between College and Bloor gets some prompt attention.

And just this morning I ran across this lovely site (again associated with a book) Things I Have Learned in my Life So Far which uses typography and video as a creative frame for recording social/environmental interventions that demonstrate what the site contributors have learned so far. Beautiful, thoughtful videos are the result.

So to conclude, there is a trend here: The pithy aphorism comes off the page or out of someone’s mouth. It becomes action, it touches others, touches a place. Then gets recorded, uploaded. Finally it inspires someone else to try her own hand at a living action and to share what she has learned, adding to the community cultural bank.

A very good model indeed.



The impact of the Internet is that it’s pulling the froth of commentary and debate off the top of first-generation news gathering, leaving newspapers with only a first-generation role for themselves, which is not enough for them to sustain readers, and so they’re losing young readers. By and large, excusing the fact that there are some first-generation journalists going out and acquiring new information directly for the Web, the vast majority of the Internet is reaction and debate and commentary — some of it brilliant. But I don’t run into a lot of Internet reporters at council meetings and in courthouses.

David Simon interviewed in Salon.com


The first appearance of archy

One morning Don Marquis arrived in
his office to find the following
message on his typewriter, all in
lower case. Archy, a cockroach
reincarnated from a poet, had laboriously
typed the message to Don by climbing upon
the typewriter and jumping on the keys,
one at a time. The message is all in
lower case, because Archy could not
operate the shift key.

The Coming of Archy:

expression is the need of my soul
i was once a vers libre bard
but i died and my soul went
into the body of a cockroach
it has given me a new outlook on life

i see things from the under side now
thank you for the apple peelings in the wastepaper basket
but your paste is getting so stale i can’t eat it
there is a cat here called mehitabel i wish you would have
removed she nearly ate me the other night why don’t she
catch rats that is what she is supposed to be for
there is a rat here she should get without delay

most of these rats here are just rats
but this rat is like me he has a human soul in him
he used to be a poet himself
night after night i have written poetry for you
on your typewriter
and this big brute of a rat who used to be a poet
comes out of his hole when it is done
and reads it and sniffs at it
he is jealous of my poetry
he used to make fun of it when we were both human
he was a punk poet himself
and after he has read it he sneers
and then he eats it

i wish you would have mehitabel kill that rat
or get a cat that is onto her job
and i will write you a series of poems
showing how things look
to a cockroach
that rats name is freddy
the next time freddy dies i hope he won’t be a rat
but something smaller i hope i will be a rat
in the next transmigration and freddy a cockroach
i will teach him to sneer at my poetry then

don’t you ever eat any sandwiches in your office
i havent had a crumb of bread
for i dont know how long
or a piece of ham or anything but apple parings
and paste leave a piece of paper in your machine
every night you can call me archy

Khoi Vinh sums it up very well for me in his article.

Still, even these basics are the second step, in my view. The prerequisite for doing something meaningful with any of these skills — HTML, CSS, Flash or whatever — is first embracing the medium as something different from print. Indeed, there’s no point in learning these skills unless as a print designer you’ve made a prior shift in your understanding of how design works in digital media. Specifically, come to grips with the fact that, on the Web, design is not a method for implementing narrative, as it is in print, but rather it’s a method for making behaviors possible.

More often than not, the reflexive approach that I’ve seen print designers take on the Web is to treat it as a vehicle for print-based design practices: fixing type sizes, specifying typefaces, ignoring usability and expediency, and perhaps most notoriously making the assumption that, over time, users will come around to a print-focused way of consuming content.

Alan Rusbridger reports on a conversation between the creators of Facebook, Flickr and Slate and editors from the NY Times and MSN. A divide in understanding still exists:

A distinguished magazine editor finally broke through the cosy bonding by denying that we could all have “both/ and”. It was “either/or.” We couldn’t run away from the fact that there wasn’t yet a credible economic model for old media owners to be dabbling around with the new kids on the block. So choices had to be made.

Yes, well. Safer to talk about the “soft” issues of community and blogging. A blogging entrepreneur drew a useful distinction between old mainstream media (MSM) which had attention deficit disorder and the best bloggers, who were obsessive compulsive. Newspapers started out on stories or campaigns and then got bored. Bloggers never got bored of their own subjects.

Read: Davos 07: ADD vs OCD : The future of newspapers is a bit like climate change: there are now far fewer ‘old-media’ deniers.