Things I'm Doing
Common Crawl Foundation
For the past 17 years, the Common Crawl Foundation has offered an invaluable, freely available crawl of the Internet to researchers, scientists, and innovators worldwide. Our mission to democratize access to high-quality web data has positioned Common Crawl as a linchpin in the AI ecosystem, especially as demand for extensive, diverse datasets has surged with the rise of generative AI.
Today, Common Crawl supplies 70-90% of the tokens in the training data for nearly all of the world’s large language models (LLMs), making us perhaps the most universally relied-upon resource for LLMs in production. With our open content dataset, we support a future where AI innovation remains accessible to everyone, not just a few large companies.