Zyxt: You See.
Study of ~1.3 Billion URLs: ~22% of Web Pages Reference Facebook

Key Findings

Based on a study of ~1.3 billion URLs crawled by Common Crawl in 2012:

  • 22% of Web pages contain Facebook URLs
  • 8% of Web pages implement Open Graph tags
  • Among ~470 million hardcoded links to Facebook, only ~3.5 million are unique
  • These are primarily for simple social integrations


Intro

Over the past two months, I’ve become very excited about what I call “social discovery”: leveraging what a person’s social network already knows to help them find and discover new ideas, places and things; and new technologies that make social networking more useful for learning about the world.

Based on some of this work, I believe the nature of the Internet is undergoing a paradigm shift:

  • From unstructured to structured content
  • From Web sites/pages to entities
  • From links to connections

In the old world, Web pages and sites were “about” entities, and search engines used calculations based on the link as a key signal of network-wide relevance. In the new world, Web pages are becoming ever more incidental; instead, nodes in the network (Facebook pages, LinkedIn profiles, Freebase entities, Wikipedia articles, etc.) represent entities, and are connected to each other and to people through meaningful connections (“like”, “worked with”, “is accurate”, “recommended”, etc.).

Increasingly, people and organizations will seek to write themselves not to Web sites, but to the big “platforms” (APIs) like Facebook and Twitter. And more and more, Web sites are being rewoven into those social networks, whether by simple inclusions of “like” or “+1” buttons, or through more complex reflections of social connection.

When I look at the data, it’s pretty amazing. On Lucky Oyster, which is an alpha application for social discovery, it’s not uncommon for an occasional user of Facebook with only 20 friends to be intricately connected with upwards of 20 thousand entities. Active users with around a thousand friends are consistently connected to well over a hundred thousand entities. I believe that reading this graph is different than reading the Web; it requires both a new mental model and new technology.

Social networking isn’t solely about the people (although Wall Street may value it that way): it’s about the interconnections between structured content, entities, and people. And in the long term, that fabric will be the legacy of social networking and the battleground for digital meaning. I call this emergent state of affairs “the Web turned upside down”: the social fabric isn’t incidental to the graph of the Web, but will increasingly be ascendant over it in terms of the richness and utility it can provide to us. Its structure and metaphors will eventually reshape the Web into something completely different.


The Experiment

Yeah. All of this is cloudy, at best, even if it sounds cool. So I started with a very simple question:

To what extent is the Web interconnected with the social “fabric”?

We can look anywhere online to see the evidence of this, from on-site elements such as the “like” button, the +1 glyph, or on-site widgets from companies like Disqus. But what are the numbers? I started with the big guy on the block: Facebook.

These days, answering this question is MUCH easier than ever, especially for a lone engineer working out of a studio in his back yard. Using Amazon Web Services, the data set from Common Crawl, and a little custom code, I examined roughly 1.3 billion URLs from an ongoing 2012 Web crawl. Without embellishment, here are the results:

Summary
----------------------------------------------------------------
Crawl age:                              2012
URLs examined:                          1,297,373,773
TEXT/HTML examined:                     1,117,415,954
Pages that include Facebook URLs:       242,063,274 (21.66%*)
Pages that include Open Graph Tags:     84,453,177 (7.56%*)

* = percentage of TEXT/HTML documents
	
Detail - Facebook URLs
----------------------------------------------------------------
Pages referencing Facebook URLs:        242,063,274
Total Facebook URLs:                    471,013,958
Average Facebook URLs/page:             1.94
Unique Facebook URLs:                   3,556,668
Top Facebook URLs:
	/plugins/like                   15.69%
	/2008/fbml                      13.45%
	/sharer                         9.54%
	/share                          6.69%
	/plugins/likebox                4.60%
	/profile                        1.30%
	/home                           1.25%
	/group                          0.96%
	/widgets/like                   0.73%
	/plugins/activity               0.43%
	/plugins/recommendations        0.35%

Detail - Open Graph Tags
----------------------------------------------------------------
Pages with Open Graph Tags:             84,453,177
Top Open Graph Types:
	hotel                           1.55%
	movie                           1.31%
	activity                        1.14%
	song                            0.76%
	game                            0.72%
	book                            0.67%
	band                            0.66%
	restaurant                      0.52%
	actor                           0.40%
	profile                         0.38%
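
The two per-page checks behind these tables (does the page reference a Facebook URL, and does it carry Open Graph tags) could be sketched roughly as follows. This is only an illustration: the regular expressions, the helper names facebook_paths and open_graph_tags, the ".php" stripping, and the sample pages are all assumptions, not the code used for the study.

```ruby
# Hypothetical pattern: captures the path of any facebook.com URL in the markup.
FACEBOOK_URL = %r{https?://(?:www\.)?facebook\.com\b(/[^\s"'<>?#]*)?}i

# Returns the normalized path of every Facebook URL found in the HTML,
# e.g. "/plugins/like". Dropping ".php" and trailing slashes is an assumption.
def facebook_paths(html)
  html.scan(FACEBOOK_URL).flatten.map do |path|
    normalized = path.to_s.sub(/\.php\z/i, "").chomp("/")
    normalized.empty? ? "/" : normalized
  end
end

# Returns a Hash of Open Graph properties, e.g. { "type" => "hotel" }.
def open_graph_tags(html)
  html.scan(/<meta\b[^>]*>/i).each_with_object({}) do |tag, tags|
    property = tag[/\bproperty=["']og:([^"']+)["']/i, 1]
    content  = tag[/\bcontent=["']([^"']*)["']/i, 1]
    tags[property.downcase] = content if property && content
  end
end

# Made-up sample pages, standing in for documents read from an Arcfile.
SAMPLE_PAGES = [
  %(<html xmlns:fb="http://www.facebook.com/2008/fbml"><head>
    <meta property="og:type" content="hotel" /></head><body>
    <iframe src="http://www.facebook.com/plugins/like.php?href=x"></iframe>
    </body></html>),
  %(<html><body>
    <a href="http://www.facebook.com/sharer.php?u=http://example.com">Share</a>
    </body></html>),
  %(<html><body><p>No social markup here.</p></body></html>)
].freeze

# Tally pages and paths the same way the summary tables do.
PAGES_WITH_FACEBOOK = SAMPLE_PAGES.count { |html| facebook_paths(html).any? }
PAGES_WITH_OG       = SAMPLE_PAGES.count { |html| open_graph_tags(html).any? }
PATH_COUNTS = SAMPLE_PAGES.flat_map { |html| facebook_paths(html) }
                          .each_with_object(Hash.new(0)) { |path, counts| counts[path] += 1 }

puts format("Pages with Facebook URLs: %d (%.2f%%)",
            PAGES_WITH_FACEBOOK, 100.0 * PAGES_WITH_FACEBOOK / SAMPLE_PAGES.size)
puts format("Pages with Open Graph tags: %d", PAGES_WITH_OG)
PATH_COUNTS.each { |path, count| puts format("  %-20s %d", path, count) }
```

Regex matching is crude next to a real HTML parser, but for a one-pass count over a billion documents it is probably the cheaper trade-off.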

On a more mechanical note, here’s a bit about how I gathered the above data. After some tinkering with Amazon’s Elastic MapReduce (EMR), I determined that something even simpler would suit my needs. It was clear that writing a bit of worker coordination code would be far more efficient and easier to debug than learning the ins and outs of the EMR product (read: Hadoop). So I built a very simple boss/worker system. Workers were essentially AWS EC2 nodes with a worker AMI (machine image); on startup, each machine would create two worker processes, each of which would pull tasks from a master queue. For the queue, I used beanstalkd, as the number of Arcfile (Web crawl archive) paths to be served was relatively small. Finally, the workers would post statistics to the same master server that ran the queue, sending data, counts, and tons of extracted snippets (in JSON) to a Web service built on Sinatra. All of this was written in Ruby, in 296 lines of code.
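
The shape of that boss/worker loop can be sketched in a few lines. To keep the example self-contained, Ruby's thread-safe Queue stands in for beanstalkd, threads stand in for the worker processes, and a Mutex-guarded Hash stands in for the Sinatra stats service. The method name run_workers, the Arcfile paths, and the per-file counts are hypothetical; this is not the original 296 lines.

```ruby
# Runs worker_count workers that drain a shared queue of Arcfile paths.
# The block processes one path and returns a Hash of counts for that file.
def run_workers(arcfile_paths, worker_count: 2, &process)
  queue = Queue.new                      # stand-in for the beanstalkd queue
  arcfile_paths.each { |path| queue << path }
  queue.close                            # pop returns nil once the queue drains

  stats = Hash.new(0)                    # stand-in for the stats Web service
  lock  = Mutex.new

  workers = Array.new(worker_count) do
    Thread.new do
      while (path = queue.pop)
        counts = process.call(path)
        lock.synchronize do              # "post" this file's counts to the master
          counts.each { |key, count| stats[key] += count }
          stats[:arcfiles] += 1
        end
      end
    end
  end

  workers.each(&:join)
  stats
end

# Hypothetical paths and a dummy processor that pretends each file held 10 URLs.
PATHS  = %w[segment/a.arc.gz segment/b.arc.gz segment/c.arc.gz].freeze
TOTALS = run_workers(PATHS) { |_path| { urls: 10 } }
puts TOTALS.inspect
```

Because each worker only pulls the next path and pushes back counts, scaling up is a matter of starting more machines against the same queue.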

Huge thanks to Pete Warden for some of the Arcfile streaming code; to Web Data Commons for the queue suggestion; and to the folks at Common Crawl for providing an amazingly valuable service and data set. Here are a few tidbits about the work:

Queue service:                          beanstalkd
Web service:                            Sinatra
Cloud computing:                        Amazon Web Services (AWS)
Arcfiles processed:                     351,171
Total processing time:                  2,634 hours
Worker EC2 machine type:                c1.medium
Worker EC2 cost/hour:                   $0.165
Total worker cost:                      $434.61
Average processing time per Arcfile:    15.90 seconds
Maximum worker node count:              99
Total calendar time:                    ~1.5 days
Maximum coordinated throughput:         2.3 million URLs per minute 
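
The cost line can be checked from the figures in the table, and a couple of unit costs fall out of the same numbers. A small sketch; the constant names are mine, and the derived per-million and per-Arcfile values are simple divisions rather than figures reported above.

```ruby
WORKER_HOURS  = 2_634          # total processing time, in hours
COST_PER_HOUR = 0.165          # c1.medium, USD per hour
URLS_EXAMINED = 1_297_373_773
ARCFILES      = 351_171

# Total worker cost: hours multiplied by the hourly rate.
TOTAL_COST = (WORKER_HOURS * COST_PER_HOUR).round(2)

# Derived unit figures (not reported in the table above).
COST_PER_MILLION_URLS = TOTAL_COST / (URLS_EXAMINED / 1_000_000.0)
URLS_PER_ARCFILE      = URLS_EXAMINED.to_f / ARCFILES

puts format("Total worker cost:     $%.2f", TOTAL_COST)
puts format("Cost per million URLs: $%.3f", COST_PER_MILLION_URLS)
puts format("URLs per Arcfile:      %.0f", URLS_PER_ARCFILE)
```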

The key lesson I’ve learned from the exercise is that given the tools and data available today, either for free, or at very low cost, it’s possible for anyone to work with relatively Big Data without too much weeping and gnashing of teeth. I feel it’s also really important to note that from start to finish on this experiment, I spent a total of about 4 days, which is truly de minimis. From here, the system can be much improved, and the questions evolve in much more complex directions: working through the data is as simple as modifying the worker code, resetting the master server, filling the queue, and firing off a hundred machines….


Noodling

1. Although I’m tempted to say whether roughly 22% of the Web (based on this sample) referencing Facebook is high or low, to be honest, it’s beside the point. This experiment can be seen as the first of a set of snapshots I’d like to conduct every few months, to see whether that percentage is going up, down, or staying the same. My guess is that it’s rising, and fast. If nothing else, it’s taken roughly a decade for Facebook to not only accrue roughly a billion users, but to entangle itself in about a fifth of the Web. That’s pretty interesting by itself.

2. More significant, in my opinion, than the extent to which the Web references Facebook, is the number of Web pages that actually implement Open Graph tags (~8%). This is a deeper level of integration, not really one that’s required (anymore), and in essence provides a basis for Facebook to achieve indexable visibility into the broader Web, as well as to introduce to it some of the core concepts it uses to define entities. Much the same way that the Google Toolbar and its caching mechanism gave the search giant live glimpses of the Web as it was consumed by people, these snippets effectively position the Web as a live, visible extension of the entities Facebook is seeking to have users define (through pages and applications). If I had to guess, I’d say that someone realized that turning the broader Web into entity pages was futile, especially if people could be convinced to use the Facebook platform to make their own pages in addition to (or in lieu of) Web sites.

3. The integrations are, expectedly, filthy. Welcome to the Web, right? The Open Graph types are all over the landscape, with crazy variations rampant. This isn’t really surprising, but it does point to something Facebook is going to need to deal with in terms of doing more than name search for entities: categorization. Here’s a gem of an Open Graph type: “bestylish:shoes ecommerce”. The Open Graph type “article” has about twenty different variations, including some very interesting misspellings. As another example, consider that right now, Facebook allows people to classify a page as either an “author” or a “writer”. Let me know if you can help me understand that one….

4. Open Graph types are wildly diffuse. The largest category is “hotel”, at 918,205. This says a great deal about the well-known aggression of online travel publishers, as that number is many times more than the total number of hotels in the world. At some point, though, if Facebook is to build its value on the richness of the entity, they will need to solve the suturing problem: determining which of those pages (Web or other) actually point to the same thing in the world.

5. The most chronic users of creative Open Graph types were Mahalo and Zillow, followed by the cluster of Mapmyrun, Mapmyride, and Mapmywalk.

6. Some interesting (and amusing) top references to Facebook pages turned up (go Kev?):

/merriamwebster      676071 (0.14%)
/kevjumba            651389 (0.14%)	
/placeformusic       618963 (0.13%)	
/lyricskeeper        517999 (0.11%)	
/kayak               465179 (0.10%)	
/ugodotcom           335088 (0.07%)	
/kev                 325907 (0.07%)	
/pcmagazine          315192 (0.07%)	
/twitter             281882 (0.06%)	
/cnet                260189 (0.06%)

7. There are about a quarter of a million inbound references to the Facebook search function (/search).

8. Although about a fifth of the Web (based on this sample) references Facebook, and despite there being close to half a billion references to Facebook URLs, there are only 3.5 million unique URLs in the sample set. The bulk of these are for Facebook-specified integrations (those that add social dimension to a Web site), as opposed to specific inbound URLs. My key takeaway here is that although Facebook may know about a sizable portion of the Web, the Web barely knows anything about what’s inside of Facebook….
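
As a small illustration of the categorization problem raised in point 3, here is one way a first cleanup pass over raw Open Graph types might look. It is a sketch under assumptions: the alias table is hypothetical (only the “author”/“writer” pair comes from the observations above), the helper names are mine, and trimming plus lowercasing will not rescue values like “bestylish:shoes ecommerce” or genuine misspellings.

```ruby
# Hypothetical alias table; a real one would have to be built from the data.
TYPE_ALIASES = { "writer" => "author" }.freeze

# Trims, lowercases, and collapses repeated spaces, then applies known aliases.
def normalize_og_type(raw)
  cleaned = raw.to_s.strip.downcase.squeeze(" ")
  TYPE_ALIASES.fetch(cleaned, cleaned)
end

# Counts occurrences per normalized type.
def tally_types(raw_types)
  raw_types.each_with_object(Hash.new(0)) do |raw, counts|
    counts[normalize_og_type(raw)] += 1
  end
end

# Made-up raw values in the spirit of the variations seen in the crawl.
RAW_TYPES = ["Article", " article", "writer", "author",
             "hotel", "bestylish:shoes ecommerce"].freeze

tally_types(RAW_TYPES).each { |type, count| puts format("%-28s %d", type, count) }
```

Anything beyond this (misspellings, site-specific prefixes, types that mean the same thing in the world) needs fuzzier matching, which is the “suturing problem” from point 4 in miniature.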


What’s Next

This was just a first step. There are a number of improvements and other studies I’m itching to make. If anyone wants to contribute funds for processing time, I’d welcome it! Here are a few:

  • Study other social platforms. Compare and contrast penetration levels.
  • Look at the variety of Facebook integrations.
  • Index the Open Graph data in its own search engine, just for kicks.
  • Make the workers more efficient. My gut tells me there’s room for 1.5x improvement in throughput.
  • Run at lower cost by using spot instances. It’s already pretty cheap though.
  • Fix the character encoding errors that caused some of the processes to hiccup during the run.
  • Run this study semi-regularly, to look at changes over time.

For more information, feel free to contact me at matthew at zyxt dot com.