The Atlantic creates searchable database of music used to train AI
Atlantic reporter Alex Reisner uncovered four datasets containing millions of tracks used to train AI models and made them publicly searchable. The datasets include artists like Lady Gaga, Radiohead, and Bruce Springsteen.

The Atlantic reporter Alex Reisner has identified four music datasets used to train artificial intelligence models and made them fully searchable for the public. Two of the sets are enormous, containing 12 million and 9 million tracks respectively. The other two are smaller but still significant, each with over 100,000 songs.
According to Reisner, the datasets have been downloaded thousands of times. While it's impossible to know exactly who has used them, both Google and Stability AI have confirmed in research papers that they used them. Some sources, such as the Free Music Archive dataset, are free to stream for personal use but require licensing for commercial applications.
Although the datasets are, in theory, freely available online, using them as training data is not as simple as downloading a ZIP file and feeding it to an AI model. As Reisner explains, three of the four datasets are distributed as lists of links to songs on YouTube or Spotify. AI developers download the actual audio using tools that automate the job, some of which allow developers to bypass logins, advertisements, and mechanisms that might generate revenue or subscribers for creators. Such tools violate the terms of service of these platforms.
The datasets include artists ranging from pop stars like Lady Gaga and Fred Again.. to Radiohead, Aphex Twin, Wu-Tang Clan, Bruce Springsteen, and experimental composer Hainbach. Anyone can visit The Atlantic's AI Watchdog site and search through the songs, books, and other media being used to train the world's AI models.

