AI-Generated Images for Speech Pathology — An Exploratory Application to Aphasia Assessment & Intervention Materials

John E. Pierce

With thanks to Jon Hunt and Robin Keech from Cuespeak for their input

Author PDF: doi.org/10.1044/2023_AJSLP-23-00142

Header image: four AI-generated images of shoes and apples in various positions

Images are a key part of aphasia assessment and treatment (and plenty of other areas of practice!), but it can be difficult and/or expensive to find high-quality images. Artificial intelligence can now generate high-quality images from a text prompt, so I was interested in whether this could be a way to create customised images quickly and affordably.

Method

I selected 80 nouns, 80 verbs and 40 sentences at random from aphasia assessments and treatment software, then used DALL-E 2 to try to create a suitable image for each target. I entered up to six prompts per target, stopping early if I judged that an image represented the target.

Results

You can view all of the prompts and resulting images on this page (takes a while to load them all).

Some of the good images

Some of the passable, but flawed, images

❗️Even though many images represented the target word, they looked a bit off. These imperfections could be distracting or unsettling for users.

Some of the frankly bizarre images

Obviously, these were not considered successful. But they are funny.
See the full set of strange images here.

Why is this work important?

If you're a clinician making materials for your own sessions, you can probably just use a Google image search without worrying about copyright. But for anything that will be published, such as assessments, treatment materials or aphasia-friendly information, finding the right images is hard.

This brief exploration suggests that AI image generation could become a low-cost, rapid source of copyright-free images that can be heavily customised. This is a very new technology that will only get better; in fact, in the short time since I ran this test, DALL-E and many other generators have improved substantially. Combining AI-generated images with manual editing (e.g. DALL-E 2 plus Photoshop) will probably become the new normal.

Current Limitations:

Complexity/frequency: Unfortunately, the most complex and lowest-frequency items were the most difficult to generate. It's easy to find an image of a glass of juice on the internet but very hard to find an image of 'the cat chases the dog', yet at present AI seems to struggle most with exactly these harder-to-find items.

Syntax: Part of the difficulty with more complex scenes was that DALL-E 2 did not accurately parse the syntax of the prompt. Prepositions and adjectives were applied inconsistently, and often to the wrong noun. This is a known limitation.

Results of the prompt 'A ballerina runs behind a policeman, high speed shutter'. The preposition is not consistently applied to the correct noun.

Bias: DALL-E 2 produced a diverse range of races without prompting, but only because it has been intentionally adjusted to do so; early models defaulted to white-looking people. AI is not biased in itself but reflects, and can exaggerate, biases in the human-generated data used to train it. Most AI image generators have been trained on English-language, Western-culture images and captions, meaning that generating images for other cultures may be substantially less accurate and efficient.

What's next?

We need to see what people think about these images, so I plan to investigate the acceptability and accuracy of AI-generated images compared with 'human-generated' images.

Pierce, J. E. (2023). AI-generated images for speech pathology—An exploratory application to aphasia assessment and intervention materials. American Journal of Speech-Language Pathology. https://doi.org/10.1044/2023_AJSLP-23-00142

If you'd like to try DALL-E 2 yourself, I recommend this excellent prompt guide.

You may also want to compare my prompts with the results to see what worked best.

Below, I have summarised what I learned about prompts. Note that each AI image generator (DALL-E, Midjourney, Photoshop, Stable Diffusion...) has its own quirks, and that DALL-E 2 involves a cost.

Tips for enhancing results in DALL-E 2

Expect randomness: the same prompt will produce better and worse results across multiple attempts (one way to request several candidates at once is sketched after these tips).

Picture the type of result you want before creating the prompt. This encourages a more specific prompt.

Prompt as if you are captioning an existing image in a newspaper. Read stock photograph descriptions to get a feel for wording and style, as DALL-E 2 was trained on image-caption pairs. Present tense seems to work best.

Multiple clauses can be used to specify additional requirements:

Medium:
A ballpoint pen lying on a desk, stock photograph
Portrait of a king wearing a golden crown, head and shoulders, renaissance painting

Source:
An astronaut spacesuit in a museum, tourist's photograph
Photograph of a family listening to the record player, wide shot, life magazine 1970

Lighting:
A whole green cucumber and slices, studio lighting
A croquet game on a green lawn, warm outdoor photograph, calm

Camera attributes:
Closeup of a wooden lattice, garden visible in background, shallow depth of field
A man bowling at a ten pin bowling alley, action shot

Specify camera zoom and angle, as DALL-E 2 often defaults to closeups:
Wide shot of a restaurant, diners and wait staff visible
Full shot of a man in fireman's uniform and hat, studio lighting, stock photograph
Chicken schnitzel closeup

Adjectives can be very effective but are not consistently applied to the correct noun - keep trying!

Duplication (repeating an idea in different words) has been reported to help focus on a particular description and improve its quality, e.g. 'A smiling girl is tickled, laughing, bright lighting, happy'.
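If you prefer to generate images through the API rather than the web interface, here is a minimal sketch only, assuming the openai Python package and an API key stored in the OPENAI_API_KEY environment variable (model names, sizes and pricing may have changed since this was written). It requests several candidates for a single prompt in one call, which is one way to work with the randomness mentioned above.

```python
# Minimal sketch: request several DALL-E 2 candidates for one prompt via the
# OpenAI API. Assumes the `openai` Python package is installed and an
# OPENAI_API_KEY environment variable is set; adjust model, size and n to taste.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "A ballpoint pen lying on a desk, stock photograph"

response = client.images.generate(
    model="dall-e-2",   # DALL-E 2 allows several images per request
    prompt=prompt,
    n=4,                # ask for four candidates to allow for randomness
    size="512x512",
)

# Each candidate comes back as a URL; print them for review.
for i, image in enumerate(response.data, start=1):
    print(f"Candidate {i}: {image.url}")
```

The returned URLs are temporary, so download and save any images you want to keep.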