A common and controversial use of text-to-image models is to generate pictures by explicitly naming artists, such as “in the style of Greg Rutkowski”. We introduce a benchmark for prompted-artist recognition: predicting which artist names were invoked in the prompt from the image alone. The dataset contains 1.95M images covering 110 artists and spans four generalization settings: held-out artists, increasing prompt complexity, multiple-artist prompts, and different text-to-image models. We evaluate feature similarity baselines, contrastive style descriptors, data attribution methods, supervised classifiers, and few-shot prototypical networks. Generalization patterns vary: supervised and few-shot models excel on seen artists and complex prompts, whereas style descriptors transfer better when the artist’s style is pronounced; multi-artist prompts remain the most challenging. Our benchmark reveals substantial headroom and provides a public testbed to advance the responsible moderation of text-to-image models. We release the dataset and benchmark to foster further research.
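As a concrete illustration of the simplest family of baselines, the following Python sketch ranks candidate artists by cosine similarity between a query image's embedding and the mean embedding of each artist's reference images. The CLIP ViT-B/32 backbone, file paths, and reference-set layout are illustrative assumptions, not the exact configuration used in the benchmark.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed(paths):
    # Encode a list of image files into L2-normalized CLIP image features.
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def rank_artists(query_path, reference_sets):
    # reference_sets: dict mapping artist name -> list of reference image paths (hypothetical layout).
    query = embed([query_path])                                         # (1, d)
    prototypes = {a: torch.nn.functional.normalize(embed(p).mean(0), dim=-1)
                  for a, p in reference_sets.items()}
    scores = {a: float(query @ proto) for a, proto in prototypes.items()}
    return sorted(scores, key=scores.get, reverse=True)                 # most likely artist first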
We construct a structured dataset of 1.95M images to benchmark different methods on predicting prompted artist names from generated images. To help disentangle the effect of prompting with an artist name, we query a text-to-image model with the same content prompt (rows) but insert different artist names (columns). Our dataset consists of images generated by SDXL, SD1.5, PixArt-Σ, and Midjourney, with both complex and simple prompts. Each artist’s style tends to become less visually prominent as the prompt grows more complex, especially when the prompt calls for additional styles and adjectives or specifies content that may fall outside the distribution of the artist’s work (e.g., 2nd row, “a village carved into a red canyon rock wall”). The variation in the visibility of the artist’s style across prompts and models demonstrates the difficulty of the prompted artist identification task.
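A minimal sketch of how such a grid of artist-conditioned prompts could be assembled is shown below; the “in the style of” template and the placeholder artist names are assumptions for illustration, not the exact wording used to build the dataset.

content_prompts = [
    "a village carved into a red canyon rock wall",  # content example from the caption above
    "a lighthouse on a stormy coast",                # hypothetical additional content prompt
]
artists = ["Artist A", "Artist B", "Artist C"]       # placeholder artist names

def make_prompt(content, artist_names):
    # Append an artist clause to a content prompt; supports one or more names.
    return f"{content}, in the style of {' and '.join(artist_names)}"

# Rows share a content prompt; columns insert different artist names,
# mirroring the rows/columns layout described above.
grid = [[make_prompt(c, [a]) for a in artists] for c in content_prompts]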
The prompted artist identification benchmark uses a structured dataset of 1.95M images to evaluate vision methods across four axes of generalization. We collect 110 of the most frequently prompted artists, split into 100 seen and 10 held-out artists (1st chart). Next, we collect 1,000 complex prompts and 500 simple prompts into which artist names are inserted (2nd chart). For seen artists, we use a separate set of prompts for testing. For held-out artists, we further divide the test prompts to generate a set of reference images used during inference. Then, we generate single-artist prompted images with SDXL, SD1.5, and PixArt-Σ, and collect Midjourney images (3rd chart). Finally, we evaluate how well methods generalize to multiple artists in the prompt by generating datasets of SDXL images prompted with 2 and 3 artists (4th chart).
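The split structure described above can be summarized with the following sketch; the placeholder artist names, prompt strings, and split fractions are assumptions, and only the overall counts and groupings follow the description.

import random

rng = random.Random(0)
artists = [f"artist_{i:03d}" for i in range(110)]             # 110 most frequently prompted artists
rng.shuffle(artists)
seen_artists, heldout_artists = artists[:100], artists[100:]  # 100 seen / 10 held-out

complex_prompts = [f"complex prompt {i}" for i in range(1000)]  # 1,000 complex prompts
simple_prompts  = [f"simple prompt {i}" for i in range(500)]    # 500 simple prompts

def split_prompts(prompts, test_frac=0.2, ref_frac=0.5):
    # Seen artists: disjoint train/test prompts.
    # Held-out artists: test prompts further split into reference vs. query images.
    n_test = int(len(prompts) * test_frac)                    # fractions are assumptions
    test, train = prompts[:n_test], prompts[n_test:]
    n_ref = int(len(test) * ref_frac)
    return train, test[:n_ref], test[n_ref:]                  # train, reference, query

single_artist_models = ["SDXL", "SD1.5", "PixArt-Sigma", "Midjourney"]  # generated or collected
multi_artist_counts = [2, 3]                                            # SDXL only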
In the example images shown, we observe that the effect of adding each artist’s name to the prompt diminishes as more artists are added. Each row shows a set of images generated with the same prompt and generation seed, with the number of artists inserted into the prompt increasing from left to right. For each prompt, the image changes less with each additional artist, and images generated with complex prompts generally change less than those generated with simple prompts. This shows that identifying multiple prompted artist names can be even harder than identifying a single prompted artist name from each image.
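One way to quantify the observation that images change less with each additional artist is to measure the perceptual distance between consecutive images generated with the same prompt and seed; the sketch below uses LPIPS for this purpose with a hypothetical file layout. It is an illustrative measurement, not the analysis reported above.

import lpips
import torch
from PIL import Image
import torchvision.transforms as T

loss_fn = lpips.LPIPS(net="alex")
to_tensor = T.Compose([T.Resize((256, 256)), T.ToTensor()])

def load(path):
    # LPIPS expects tensors in [-1, 1] with shape (1, 3, H, W).
    return to_tensor(Image.open(path).convert("RGB")).unsqueeze(0) * 2 - 1

# Hypothetical filenames: same content prompt and seed, increasing artist count.
paths = ["seed0_1artist.png", "seed0_2artists.png", "seed0_3artists.png"]
with torch.no_grad():
    deltas = [float(loss_fn(load(a), load(b))) for a, b in zip(paths, paths[1:])]
print(deltas)  # distances are expected to shrink as more artist names are added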
We compare the prompted artist classification accuracy of various visual representation methods. We test on our seen-artist set (100-way classification, x-axis) and on held-out artists (10-way classification, y-axis). We also test on images generated with different text-to-image models (SDXL, SD1.5, PixArt-Σ, and Midjourney) and with both complex and simple prompts. Although prototypical networks, the trained vanilla classifier, and CSD surpass CLIP, DINOv2, and AbC, attaining high accuracy across all scenarios remains difficult.
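For reference, a minimal sketch of the kind of trained classifier evaluated here is given below: a linear head over frozen image features, trained with cross-entropy on the 100 seen artists. The feature dimension, optimizer, and learning rate are assumptions; held-out artists are instead handled with reference images, e.g., via feature similarity.

import torch
import torch.nn as nn

class ArtistClassifier(nn.Module):
    # 100-way linear head over frozen backbone features (feat_dim is an assumption).
    def __init__(self, feat_dim=768, num_artists=100):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_artists)

    def forward(self, feats):            # feats: (B, feat_dim)
        return self.head(feats)          # logits over the 100 seen artists

def train_step(model, feats, labels, optimizer):
    logits = model(feats)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = ArtistClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # hyperparameters are illustrative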
We evaluate visual representation methods on the multi-artist classification task, where the input image is generated from a prompt containing multiple artists’ names. We test on SDXL-generated images with 2 and 3 artists in the prompt, and report ranked mAP@10 on seen artists (x-axis) and held-out artists (y-axis). All methods generally perform worse on images generated from complex prompts than on those generated from simple prompts. As expected, the prototypical network achieves the highest performance across the datasets because it is trained on multi-artist prompted images. However, its performance remains far from saturated.
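For clarity, the sketch below shows one standard way to compute ranked mAP@10 in this multi-artist setting: each image has a set of ground-truth prompted artists, and each method returns artists sorted from most to least likely. The exact normalization used in the benchmark may differ.

def average_precision_at_k(ranked_artists, relevant, k=10):
    # Precision is accumulated only at ranks where a ground-truth artist appears.
    hits, score = 0, 0.0
    for rank, artist in enumerate(ranked_artists[:k], start=1):
        if artist in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k) if relevant else 0.0

def mean_ap_at_k(predictions, ground_truth, k=10):
    # predictions: list of ranked artist lists; ground_truth: list of sets of prompted artists.
    aps = [average_precision_at_k(p, g, k) for p, g in zip(predictions, ground_truth)]
    return sum(aps) / len(aps)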
We thank Maxwell Jones and Kangle Deng for their helpful discussion and comments. This work started when Grace Su was an Adobe intern. Grace Su is supported by the NSF Graduate Research Fellowship (Grant No. DGE2140739). This project was partially supported by NSF IIS-2403303, Adobe Research, and the Packard Fellowship. The website template is taken from Custom Diffusion (which was built on DreamFusion's project page).
@article{su2025identifying,
  author  = {Su, Grace and Wang, Sheng-Yu and Hertzmann, Aaron and Shechtman, Eli and Zhu, Jun-Yan and Zhang, Richard},
  title   = {Identifying Prompted Artist Names from Generated Images},
  journal = {arXiv preprint arXiv:2507.18633},
  year    = {2025},
}