April 21, 2023.

Inside the secret list of websites that make AI like ChatGPT sound smart — The Washington Post’s Kevin Schaul, Szu Yu Chen and Nitasha Tiku take a close, visual look at the sources of the works used to train many high-profile English-language large language models. Worth noting: “Also high on the list: No. 190, a notorious market for pirated e-books that has since been seized by the U.S. Justice Department. At least 27 other sites identified by the U.S. government as markets for piracy and counterfeits were present in the data set.” The article also includes a search function to see which individual websites were included in the dataset, so I had to check…

[Image: search prompt for the websites in Google's C4 dataset. The search results show one matching domain, with a rank of 48,596, 360k tokens, and 0.0002% of all tokens.]

What the Online Piracy Data Tells Us About Copyright Policymaking — In this article, researcher Michael D. Smith summarizes the peer-reviewed empirical literature on piracy, which he says supports three broad conclusions: “digital piracy harms creators by reducing their ability to make money from their creative efforts,” “digital piracy harms society by reducing the economic incentives for investment in creative output,” and “legislative interventions implemented worldwide have been effective in reversing these harms.”

Update: 4 Copyright Claims Board Cases to Watch — PlagiarismToday’s Jonathan Bailey reviews four of the more than 400 claims that have been filed at the newly created US copyright small claims tribunal, which is still under a year old. These cases present interesting or notable facts and parties.

At London Book Fair Tuesday: Copyright Under Attack — “Too many times, the best-intended publishing stalwarts—you may know some, yourself—have consoled themselves and others that no one in nearby industries (education, entertainment, communications) could possibly be willing to do anything that might undermine the essential value of copyright protection. What’s more, it’s easy to think that one market’s struggles with a rewritten piece of legislation or a foray into popular misconceptions about copyright will stay in that market.”

The US Supreme Court’s Warhol case; what is the fuss about? — Bill Patry on the anticipated decision: “In an era when partisan hyperbole passes for ordinary discussion, one must get used to headlines like ‘The Supreme Court may force us to rethink 500 years of art’. Given that the first American copyright law is from 1790 and did not even begin to take shape with respect to fair use until a judicial opinion in 1841, this seems a few centuries off even in hyperbole.”