Thursday, December 08, 2005

Downloading the Web in an Afternoon

That's the equivalent of what Caltech recently accomplished.

Bandwidth is a measure of how much data can be sent across a network in a set amount of time. It's what you experience online when waiting for a web page to load.

But recently, Caltech reached a stunning milestone in how fast data can be sent, transferring 475 terabytes of data in 24 hours. That's fast enough to download the entire Web as indexed by Google in an afternoon. (Cool idea too. Can you imagine having your own personal cache of the web, updated every couple of hours in the background while you surf with no page load waits?)

OK, for math heads, here are the figures I used, rounded to the nearest sane numbers:

475 terabytes (the amount of data transferred in 1 day by Caltech), divided by 156 terabytes (the size of Google's web) = (roughly) 3. That's about three copies of the Web per day, or one copy every 8 hours.
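For anyone who wants to rerun that division, here is a minimal sketch. It assumes decimal terabytes (1 TB = 10**12 bytes) and a perfectly steady transfer over the full 24 hours. The variable names are made up for illustration.

```python
# Figures from the post; decimal terabytes are an assumption.
TB = 10**12
transferred_bytes = 475 * TB   # moved by Caltech in 24 hours
web_bytes = 156 * TB           # estimated size of Google's index

copies_per_day = transferred_bytes / web_bytes
hours_per_copy = 24 / copies_per_day
# Sustained rate, assuming the transfer was evenly spread over the day.
gigabits_per_second = transferred_bytes * 8 / (24 * 60 * 60) / 10**9

print(f"{copies_per_day:.2f} copies of the web per day")
print(f"{hours_per_copy:.1f} hours per copy")
print(f"{gigabits_per_second:.0f} Gbit/s sustained")
```

Under those assumptions it works out to a little under 8 hours per copy, which is where "an afternoon" comes from.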

More math? OK here is how I determined Google's size:

I did a Google search with my preferences set to return 100 pages. I searched for everything using *.* (Note: This search query doesn't work anymore. Use a creative search of your own with common words like "home" or "welcome".)
From those 100 hits, I added up the size of each page for a total of:
1,721 K
and then divided that by 100 to get the average size of each page.
That gave me an average of 17 K per page (17.21, rounded down).
Multiply that by the number of pages on Google (9,180,000,000)
and that gives 156,060,000,000,000 bytes, or 156 terabytes (counting 1 K as 1,000 bytes).
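The same estimate as a short script, in case you want to plug in your own sample. The numbers are the ones above. Treating 1 K as 1,000 bytes is the assumption that reproduces the 156,060,000,000,000 figure.

```python
total_kb_for_sample = 1721        # summed size of the 100 result pages, in K
sample_size = 100
pages_indexed = 9_180_000_000     # page count Google reported at the time

avg_kb = round(total_kb_for_sample / sample_size)   # 17.21 rounds to 17
index_bytes = avg_kb * 1000 * pages_indexed         # 1 K taken as 1,000 bytes
index_tb = index_bytes / 10**12

print(f"average page: {avg_kb} K")
print(f"index size: {index_bytes:,} bytes (about {index_tb:.0f} terabytes)")
```

A sample of 100 pages is tiny, so treat the result as a ballpark rather than a measurement.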

How to Visualize Data

Did you know that 1 CD can hold the entire Human genome? That a Gigabyte of text printed out would fill the bed of a pickup truck with paper? That it takes 20,000 trees to print out 400 Gigabytes on paper? That 100 Terabytes is a high-side guess of Human brain storage space, slightly less than one day's Web Page views from Google in 2009?

I've been collecting these data bits for a couple of years now, finding gems and doing my best math to see if they fall in line with everything else. It's pretty close. Close enough to give you some useful comparisons.

Bits


1 bit: Any 2 choice decision. Yes/No Run/Stop...
3 bits: A very basic color pixel. One bit each for Red, Green and Blue gives 8 colors; displays typically use 24 bits (8 per channel) for the full range.
8 bits: 1 byte
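A quick way to sanity-check entries like these: n bits can distinguish 2 to the power n different values. The sketch below prints a few cases. The 24-bit row is my own addition (the common 8-bits-per-channel color depth), not a figure from the list.

```python
def distinct_values(bits: int) -> int:
    """Number of different values that `bits` binary digits can encode."""
    return 2 ** bits

for bits in (1, 3, 8, 24):
    print(f"{bits} bits -> {distinct_values(bits):,} values")
```

So a single bit covers a yes/no choice, 3 bits cover 8 colors, and one byte covers 256 values.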

Bytes


1 byte: Any character on your computer keyboard (not including your cat).
5 bytes: The average English word

Kilobyte


1,024 bytes often rounded to 1,000 bytes
1 Kilobyte: A joke or a couple of paragraphs
2 Kilobytes: Typewritten page
3.2 Kilobytes: The amount of data in the H1N1 Swine Flu virus
3.5 Kilobytes: The size of the first web page
5 Kilobytes: A Desktop Icon
10 Kilobytes: A page out of an encyclopedia
17 Kilobytes: The size of an average Web Page
50 Kilobytes: A (roughly) 4 by 6 inch image
100 Kilobytes: A low-resolution photograph
750 Kilobytes: The file size necessary to categorize the entire range of human experience and interest (11 categories and about 450 unique sub-categories), as indicated by the Yahoo! directory on November 3rd 2007. (I wrote a small program to fetch all the headings and categories, saved them to a text file, then looked at the file size.)

Megabyte


1,048,576 bytes
1 Megabyte: Small novel, 3-1/2 inch diskette
2 Megabytes: 12 Megapixel Digital Photo, high resolution
3 Megabytes: The average mp3 song. (Rough rule of thumb: an mp3 plays about 1 megabyte per minute.)
4 Megabytes: A non-illustrated King James Bible. (I downloaded a text version from Project Gutenberg and viewed the file size.)
5 Megabytes: 1 minute of a YouTube High Quality video.
10 Megabytes: (Roughly) 1 minute MPEG movie.
20 Megabytes: Typical hard drive in the first desktop PCs
100 Megabytes: Roughly the text info contained in 1 meter (3 feet) of bookshelf
750 Megabytes: 1 CD. The Human genome (props to Max for his comment below)
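The mp3 rule of thumb above (about 1 megabyte per minute) makes it easy to turn file sizes into listening time. A rough sketch, where the rate is the post's approximation and not a property of the MP3 format, and 1 Terabyte is taken as 1,000,000 Megabytes:

```python
MB_PER_MINUTE = 1   # rule of thumb from the list; real bitrates vary

def mp3_minutes(size_mb: float) -> float:
    """Approximate playing time in minutes for `size_mb` megabytes of mp3."""
    return size_mb / MB_PER_MINUTE

song_minutes = mp3_minutes(3)               # the "average mp3 song"
terabyte_minutes = mp3_minutes(1_000_000)   # 1 TB taken as 1,000,000 MB
terabyte_years = terabyte_minutes / (60 * 24 * 365)

print(f"3 MB song: about {song_minutes:.0f} minutes")
print(f"1 TB of mp3s: about {terabyte_years:.1f} years non-stop")
```

That lands close to the "2 years non-stop" figure in the Terabyte entry further down.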

Gigabyte


1,073,741,824 bytes
1 Gigabyte: The bed of a pickup truck filled with paper.
7 Gigabytes: 1 DVD
10 Gigabytes: A 1 inch stack of CD's
28 Gigabytes: Tweets Per Day on Twitter as of Jan 1, 2011
30 Gigabytes: From My Life In A Terabyte - Roughly the entire collection of Gordon Bell's articles, books, correspondence (letters and email), CD's, memos, papers, photos, pictures, presentations, home movies, videotaped lectures, and voice recordings by 2003.
400 Gigabytes: 20,000 trees made into paper and printed.
500 Gigabytes: 100 DVD Movies
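The paper figures scale linearly, so the tree count for any size follows from the 400 Gigabyte entry. A small sketch, assuming the 20,000 trees per 400 Gigabytes ratio holds at every size (the function name is just for illustration):

```python
TREES_PER_GB = 20_000 / 400   # from the 400 Gigabytes entry: 50 trees per GB

def trees_to_print(gigabytes: float) -> float:
    """Estimated number of trees needed to print `gigabytes` of data."""
    return gigabytes * TREES_PER_GB

print(f"1 Gigabyte: {trees_to_print(1):,.0f} trees")
print(f"1 Terabyte: {trees_to_print(1000):,.0f} trees")
```

By that ratio the pickup truck bed of paper is about 50 trees' worth.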

Terabyte


1 Terabyte: 1000 gigabytes
An 8 foot stack of CD's or about 150 DVD's. It would hold all 350 episodes of The Simpsons or all 238 episodes of Friends. Roughly 250,000 MP3s (about 2 years of non-stop listening). About 2 weeks of non-stop DVD movies. 500,000 digital camera pictures. About 50,000 trees made into paper would be needed to print out a Terabyte of data: 250 million pages printed both sides, in a stack over 10 miles high.
4 Terabytes: The YouTube record of U.S. user names and IP addresses including every record of every video watched by them as of 2008.
10 Terabytes: Enough to store everything you look at for a year, and could include a heart monitor, personal GPS, everything you type and every move of your mouse. From Charlie's Diary Shaping the Future.
45 Terabytes: All the videos on YouTube as of Aug 2006
100 Terabytes: High guess of Human brain storage space. The monthly growth of The Internet Archive in 2009. From this Google search
122 Terabytes: The size of one day's Web Page views from Google in 2009: 7.2 Billion daily page views x 17 Kilobytes (the size of the average web page).
150 Terabytes: Estimated size of all Web pages indexed by Google on Dec 8th 2005 (not including databases or video). (See this article for the figures I used)
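The 122 Terabyte entry is just a multiplication, sketched here with the list's own figures. Counting 1 Kilobyte as 1,000 bytes is an assumption, and the brain figure is the high guess from the 100 Terabyte entry.

```python
daily_page_views = 7_200_000_000   # Google page views per day in 2009
avg_page_bytes = 17 * 1000         # 17 Kilobytes, with 1 K taken as 1,000 bytes

daily_bytes = daily_page_views * avg_page_bytes
daily_tb = daily_bytes / 10**12

brain_tb = 100                     # high guess of Human brain storage space
print(f"one day of page views: about {daily_tb:.0f} Terabytes")
print(f"that is {daily_tb / brain_tb:.2f} times the brain estimate")
```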

Petabyte


1 Petabyte: 1 thousand Terabytes. Storage at this level signals the dawn of a new era with powerful implications to the sciences and Artificial Intelligence.
About 100 years of television. The amount of data storage space the Internet Archive had in 2004.
Roughly the amount of new video added to YouTube every day in 2007
A stack of CD's 3 kilometers high
2 Petabytes: The amount of data Google processed every day in 2008
3.5 Petabytes: 2007 Estimated capacity of Google's Data centers in a box.
4 Petabytes: Estimated amount of Internet data stored in RAM by Google in 2006.
4.5 Petabytes: The capacity of The Internet Archive as of 2009. From this Google search
15 Petabytes: The amount of data the Large Hadron Collider (LHC) generates per year as of 2008.
20 Petabytes: Google daily total workload in January 2008. The storage capacity of all hard drives produced in 1995.
60 Petabytes: Estimated total size of Flickr photos by December 2011
200 Petabytes: The estimated amount of data contained within the Googleplex in 2006.

Exabyte


1000 Petabytes.
2.2 Exabytes: According to Charles Stross, all data recorded by our species in 2003
246 Exabytes: Total storage of the Internet in 2006.

Zettabyte


1000 Exabytes
1.8 Zettabytes: Estimated amount of total electronic data in existence by 2011

Yottabyte


1000 Zettabytes.
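To place any raw byte count on the ladder above, something like the following works. It is a minimal sketch: the function name and the default of 1,000 bytes per step are my assumptions. Pass base=1024 for the binary sizes quoted under Kilobyte, Megabyte and Gigabyte.

```python
UNITS = ["bytes", "Kilobytes", "Megabytes", "Gigabytes", "Terabytes",
         "Petabytes", "Exabytes", "Zettabytes", "Yottabytes"]

def human_size(num_bytes: float, base: int = 1000) -> str:
    """Express a byte count in the largest unit that keeps the value below `base`."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that fits, or at Yottabytes if nothing larger exists.
        if value < base or unit == UNITS[-1]:
            return f"{value:,.1f} {unit}"
        value /= base

print(human_size(156_060_000_000_000))     # the Google index estimate
print(human_size(1_048_576, base=1024))    # one binary Megabyte
```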