Want to be as smart as Google’s T5 or Facebook’s LLaMA? Well then, you should keep reading this blog, as it was used to help train them.

With so much attention being paid to the current generation of AI built on large language models, such as ChatGPT, most of us know little about the text used to train them.

Now, The Washington Post has lifted the cover off this black box. Working with the Allen Institute for AI, it analyzed Google’s C4 data set, “a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs,” including Google’s T5 and Facebook’s LLaMA.

It then categorized all of those websites (journalism, entertainment, etc.) and ranked them based on how many “tokens” each contributed to the data set, with tokens being the small bits of text used to process the disorganized information.
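
To make “tokens” a bit more concrete, here is a minimal sketch of how a sentence gets split into them, assuming the Hugging Face transformers library and the t5-small tokenizer (my choices for illustration; the Post does not describe the exact tooling behind its counts):

```python
# Illustrative only: split a sentence into subword tokens using the
# T5 tokenizer (T5 is one of the models trained on the C4 data set).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

text = "LawSites covers legal technology and innovation."
tokens = tokenizer.tokenize(text)

print(tokens)           # e.g. ['▁Law', 'Sites', '▁covers', '▁legal', ...]
print(len(tokens), "tokens")
```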

In addition to categorizing the sites, the Post created a searchable database of all the websites in Google’s C4 dataset. As it turns out, this blog is one of them.

The LawSites blog ranked 63,769th among all the sites in the dataset, contributing 290,000 tokens, or 0.0002% of all tokens in the dataset.
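
For a rough sense of scale, those two figures together imply the size of the whole dataset. A back-of-envelope sketch (both numbers are rounded in the article, so the implied total is only an estimate):

```python
# Back-of-envelope check of the figures above (rounded inputs,
# approximate output).
lawsites_tokens = 290_000      # tokens contributed by LawSites
lawsites_share = 0.0002 / 100  # 0.0002% expressed as a fraction

implied_total = lawsites_tokens / lawsites_share
print(f"Implied dataset size: {implied_total:,.0f} tokens")  # ~145 billion
```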

Of course, LawSites was hardly the only law-related site included in the training data. Based on searches for words such as law, legal, court, and case, I found some of the other legal sites that were used. Here is a sampling, listed by their ranks:

You can search the database for your favorite legal sites and see where they rank. But, clearly, the bottom line is that you should keep reading this blog.