Roger Acuna - SEO Consultant on December 30th, 2009

Like most respectable geeks, we at Factual get pretty excited about data. And sometimes we get so excited about something that we want to make sure our data geek brethren are aware of it. Today we have something that falls into that category. CommonCrawl.org, a non-profit web crawler, has provided a data set of about 4 million websites (primarily hosted at top-level domains, plus some popular subdomains) with 30 attributes each. That’s about 350MB, not a shabby corpus of data to be made available to the public. The attributes on these 4 million websites include information on what’s on the page (e.g., “contains a Twitter link”), what technology was used (e.g., “server”), and what crawling rules are set up (e.g., “excludes GoogleBot”). The websites come from the CommonCrawl repository, which consists of over 3 billion URLs and is a reasonable representation of the Internet, not to mention an interesting slice of what’s happening on the Web.
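To give a feel for what attributes like these mean in practice, here is a minimal Python sketch of how the three examples above might be derived for a single site. The function names, the dictionary keys, and the detection logic are all assumptions made for illustration; this is not how CommonCrawl or Factual actually computed the table.

```python
from urllib.robotparser import RobotFileParser


def excludes_googlebot(robots_txt: str) -> bool:
    """Return True if the robots.txt text bars Googlebot from the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("Googlebot", "/")


def contains_twitter_link(html: str) -> bool:
    """Crude substring check for a twitter.com URL in the page HTML."""
    return "twitter.com/" in html.lower()


def site_attributes(html: str, robots_txt: str, server_header: str) -> dict:
    """Bundle the three example attributes into one row (hypothetical keys)."""
    return {
        "contains_twitter_link": contains_twitter_link(html),
        "server": server_header,
        "excludes_googlebot": excludes_googlebot(robots_txt),
    }


# Example inputs, made up for this sketch.
row = site_attributes(
    html='<a href="https://twitter.com/example">Follow us</a>',
    robots_txt="User-agent: Googlebot\nDisallow: /",
    server_header="Apache",
)
```

The robots.txt check leans on the standard library's parser rather than hand-rolled string matching, so a rule aimed at all crawlers (`User-agent: *`) that disallows the root also counts as excluding GoogleBot.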
