Wednesday, 07 April 2010 02:26

Pete Warden vs. Facebook: a case of too much data access


Pete Warden had a really great idea: to map the friendship interactions of Facebook users to aid with geospatial analysis of user relationships.  Facebook's lawyers had a different view.

The entire saga is laid out as irregular snippets on Pete Warden's blog, but to assist readers, we have extracted the salient details, mostly in Pete's own words.  Our analysis is at the end of this report.

A couple of years ago, Pete Warden created a start-up company called Mailana (short for Mail Analysis).  "My goal has been to use the data sitting around in all our inboxes to help us in our day-to-day lives." 

Some time later, the focus changed.

"I'd already applied the same technology to Twitter to produce graphs showing who people talked to, and how their friends were clustered into groups. I set out to build that into a fully-fledged service, analyzing people's Twitter, Facebook and webmail communications to understand and help maintain their social networks.

"It offered features like identifying your inner circle so you could read a stream of just their updates, reminding you when you were falling out of touch with people you'd previously talked to a lot, and giving you information about people you'd just met."

On March 17th, Warden announced that due to the threat of legal action from Facebook, he had destroyed all accumulated Facebook-derived data linked to the analysis of social networks.

His summary of events continues: "As you can imagine I'm not very happy about this, especially since nobody ever alleged that my data gathering was outside the rules the web has operated by since crawlers existed. I followed their robots.txt directions, and was even helped by microformatting in the public profile pages. Literally hundreds of commercial search engines have followed the same path and have the same data. You can even pull identical information from Google's cache if you don't want to hit Facebook's servers. So why am I destroying the data? This area has never been litigated and I don't have enough money to be a test case."
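Warden's claim of compliance refers to the standard check a polite crawler performs before fetching any page: consult the site's robots.txt and only retrieve URLs it permits. A minimal sketch in Python using the standard library's urllib.robotparser (the robots.txt contents and URLs here are hypothetical, not Facebook's actual rules):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; a real site publishes its own at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the directives line by line
    return parser.can_fetch(user_agent, url)

# A compliant crawler calls this before every fetch and skips disallowed URLs.
print(allowed("mybot", "https://example.com/profile/123"))   # permitted
print(allowed("mybot", "https://example.com/private/data"))  # disallowed
```

The legal dispute described below turned precisely on whether honouring these directives confers any right to crawl; the mechanism itself has been a web convention since 1994.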

Despite the ill-will over the entire incident, Warden does express some sympathy for Facebook, noting that he was about to release data they had no idea was available.

"I know from my time at Apple that reaching for the lawyers is a tempting first option when there's a nasty surprise like that. If I had to do it all over again, I'd try harder not to catch them off-guard."

Despite this setback, he observes that it isn't particularly difficult for anyone to obtain such data via Google, and he is already working on a similar analysis of publicly available data linked to Google Buzz.

Getting back to the history of the saga: "I noticed Facebook were offering some other interesting information too, like which pages people were fans of and links to a few of their friends. I was curious what sort of patterns would emerge if I analyzed these relationships, so as a side project I set up a site to allow people to explore the data. I was getting more people asking about the data I was using, so before that went live I emailed Dave Morin at Facebook to give him a heads-up and check it was all kosher. We'd chatted a little previously, but I didn't get a reply, and he left the company a month later so my email probably got lost in the chaos."

Pete next describes the hell that broke loose upon him.

Once he had around six months' data, he started some simple analysis. Its initial publication was met with crickets and tumbleweeds (his wry description of the level of interest shown).

Next, in a fit of boredom, he took to the data with some colouring tools to highlight some of the simple patterns he'd noticed.

Within two days, his blog post had received something like 200,000 hits via YCombinator and Reddit.

Then his mobile phone rang, with Facebook's attorney on the other end.

"He was with the head of their security team, who I knew slightly because I'd reported several security holes to Facebook over the years. The attorney said that they were just about to sue me into oblivion, but in light of my previous good relationship with their security team, they'd give me one chance to stop the process.

"They asked and received a verbal assurance from me that I wouldn't publish the data, and sent me on a letter to sign confirming that. Their contention was robots.txt had no legal force and they could sue anyone for accessing their site even if they scrupulously obeyed the instructions it contained. The only legal way to access any web site with a crawler was to obtain prior written permission

"Obviously this isn't the way the web has worked for the last 16 years since robots.txt was introduced, but my lawyer advised me that it had never been tested in court, and the legal costs alone of being a test case would bankrupt me. With that in mind, I spent the next few weeks negotiating a final agreement with their attorney.

"They were quite accommodating on the details, such as allowing my blog post to remain up, and initially I was hopeful that they were interested in a supervised release of the data set with privacy safeguards. Unfortunately it became clear towards the end that they wanted the whole set destroyed. That meant I had to persuade the other startups I'd shared samples with to remove their copies, but finally in mid-March I was able to sign the final agreement."

"I'm just glad that the whole process is over. I'm bummed that Facebook are taking a legal position that would cripple the web if it was adopted (how many people would Google need to hire to write letters to every single website they crawled?), and a bit frustrated that people don't understand that the data I was planning to release is already in the hands of lots of commercial marketing firms, but mostly I'm just looking forward to leaving the massive distraction of a legal threat behind and getting on with building my startup.

"I really appreciate everyone's support; stay tuned for my next project!"


This has set a dangerous (albeit privately agreed) precedent, one diametrically opposed to the judgement handed down in Field v. Google. There, the court specifically referenced the robots.txt file, noting that the plaintiff (Blake Field) was fully aware of its proper use and yet chose not to use it to bar his copyrighted works from being crawled and cached by Google.

Field's legal action against Google was regarded rather poorly by the judge: "Field decided to manufacture a claim for copyright infringement against Google in the hopes of making money from Google's standard practice."  The judgement also noted that the "author granted operator implied licence to display 'cached' links to web pages containing his copyrighted works."

Although it appears to support Facebook's claims of doing all it can to protect the privacy of its subscribers, the action against Warden seems aimed more at protecting a valuable resource and revenue stream; exactly the same motivation behind the company's constant fiddling with its privacy dashboards in the hope users won't notice the changes that make more and more data available.

Facebook should also check with the courts to ensure they haven't misunderstood the current legal standing of the robots.txt file!





David Heath

David Heath has had a long and varied career in the IT industry having worked as a Pre-sales Network Engineer (remember Novell NetWare?), General Manager of IT&T for the TV Shopping Network, as a Technical manager in the Biometrics industry, and as a Technical Trainer and Instructional Designer in the industrial control sector. In all aspects, security has been a driving focus. Throughout his career, David has sought to inform and educate people and has done that through his writings and in more formal educational environments.




