The best web gems are the ones you find while looking for something else. While following some links on AI data sets I fell into Source.Plus PD12M, “a highly aesthetic image-text dataset with novel governance mechanisms.”
At 12.4 million image-caption pairs, PD12M is the largest public domain image-text dataset to date, with sufficient size to train foundation models while minimizing copyright concerns. Through the Source.Plus platform, we also introduce novel, community-driven dataset governance mechanisms that reduce harm and support reproducibility over time.
PD12M, and see also the reference paper
If I read the About pages correctly, Source.Plus evolved from the people behind Spawning.ai, an effort by musicians and artists to address the consent and rights issues of AI training through their Do Not Train Tools. “Source.Plus takes these goals even further, providing model trainers with an easy and consenting alternative to fine-tuning on copyrighted materials in the styles of working artists.”
Their suites of data models seem to bear clear indication of what’s in them, e.g. scan their Source by Publisher display. It looks like you can comb through the data and compile your own collections of public domain images to use for creating data models (?), say collections on Da Vinci or Art Nouveau.

If I understand correctly, the premise is that you explore Source.Plus to add images to your own collection (e.g. like the ones linked above) that can then be exported as some kind of data model for, say, a small LLM.
Now the PD12M is one of several datasets created by the Source.Plus folks, and it is the largest one in terms of content: yes, 12 million public domain images, all sourced.
On its own, this is a very useful search tool to find PD images. Like, say, Openverse, you can filter by source, by tags, and by some kind of “aesthetic score”. I honestly did not play much with the filters.
But for a search on Sonoran Cactus I got 1000+ results, a majority from iNaturalist and Wikimedia Commons. A sample result is a lovely Coryphantha macromeris. There’s plenty of metadata, but a key feature is under the 3 dot menu: copy an attribution (used below in the caption).
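For anyone who thinks better in code, the three filters can be pictured as simple tests applied to each image record. This is only a rough sketch under loud assumptions: the field names (`source`, `tags`, `aesthetic_score`), the `filter_records` helper, and the sample records with their scores are all made up for illustration. It is not the Source.Plus API or its actual data schema.

```python
from dataclasses import dataclass, field


@dataclass
class ImageRecord:
    """Hypothetical image record; the field names are assumptions."""
    title: str
    source: str
    tags: set[str] = field(default_factory=set)
    aesthetic_score: float = 0.0


def filter_records(records, source=None, tag=None, min_score=None):
    """Keep records that pass every filter that was actually supplied."""
    matches = []
    for rec in records:
        if source is not None and rec.source != source:
            continue
        if tag is not None and tag not in rec.tags:
            continue
        if min_score is not None and rec.aesthetic_score < min_score:
            continue
        matches.append(rec)
    return matches


# Made-up sample data, purely for illustration.
SAMPLE = [
    ImageRecord("cactus in bloom", "iNaturalist", {"cactus", "flower"}, 6.1),
    ImageRecord("old cactus engraving", "Wikimedia Commons", {"cactus"}, 5.4),
    ImageRecord("desert road", "Wikimedia Commons", {"landscape"}, 4.2),
]

if __name__ == "__main__":
    for rec in filter_records(SAMPLE, tag="cactus", min_score=5.0):
        print(rec.title, "|", rec.source, "|", rec.aesthetic_score)
```

Leaving a filter as `None` skips it, which mirrors how a search page treats a filter you never touched.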

The “Enriched Fields” I am sussing out include an AI generated caption: “The image shows a pink and yellow hedgehog cactus (Echinocereus xroetteri) in the Sonoran Desert, with its spines and leaves visible in the background.”
Uh-oh. I know the hedgehog cactus well and this is definitely not one. The source title defines the species as the nipple beehive cactus. Not very “enriching”. The Source Metadata does have very useful information, likely extracted from the image file’s EXIF data. And note the Flag button, which would be a means to let the site know of potential problems with the image (not public domain, ??).
Again, the purpose of the PD12M is for people developing perhaps truly “open” AI to make use of a data set that is assuredly in the clear on rights, rather than just bulldozing over intellectual property concerns, and where the sources are clear. That seems good.
But wait, there’s more.
I dabbled with creating a collection; this means that as you browse and search around, you can collect images to put together for any purpose. This seems really really useful. Staying on my pointy topic, I made one for images of cacti.

Now the fun thing is I found a few in Wikimedia Commons that are my own! That’s because there are people and/or bots that harvest my photos from flickr.
But note the one in the top left, the Flowering Barrel Cactus. That’s correct, you can upload images and add to this corpus, or use them for building out your collections. This photo came from my own flickr collection and I can do this because I share it under CC0.

Again, I was able to use the attribution text Source.Plus provides for a one-click copy/paste.
I only scratched the surface here. But if I grok this right, Source.Plus is doing something very OPEN (cough, Sam A, take notes) with GenAI data, and it seems very useful for other purposes too, like finding and sourcing public domain images from a large collection.
It’s like walking across a desert of hype and dead branches, ugly rocks of GenAI and finding a glorious small cactus in bloom. I think Mickey Mouse would definitely point to this in wonder and joy.
Featured Image: At the time I took this photo, the Mouse had 12 years to wait for its entry into the wondrous space of public domain.

