Data poisoning: how artists are sabotaging AI to take revenge on image generators

Over the break we read and loved this article from The Conversation, originally published on 18 December 2023. We hope you do too!

T.J. Thomson, Author provided

T.J. Thomson, RMIT University and Daniel Angus, Queensland University of Technology

Imagine this. You need an image of a balloon for a work presentation and turn to a text-to-image generator, like Midjourney or DALL-E, to create a suitable image.

You enter the prompt: “red balloon against a blue sky” but the generator returns an image of an egg instead. You try again but this time, the generator shows an image of a watermelon.

What’s going on?

The generator you’re using may have been “poisoned”.

What is ‘data poisoning’?

Text-to-image generators work by being trained on large datasets that include millions or billions of images. Some generators, like those offered by Adobe or Getty, are only trained with images the generator’s maker owns or has a licence to use.

But other generators have been trained by indiscriminately scraping online images, many of which may be under copyright. This has led to a slew of copyright infringement cases where artists have accused big tech companies of stealing and profiting from their work.

This is also where the idea of “poison” comes in. Researchers who want to empower individual artists have recently created a tool named “Nightshade” to fight back against unauthorised image scraping.

The tool works by subtly altering an image’s pixels in a way that wreaks havoc on computer vision but leaves the image looking unaltered to human eyes.

If an organisation then scrapes one of these images to train a future AI model, its data pool becomes “poisoned”. This can result in the algorithm mistakenly learning to classify an image as something a human would visually know to be untrue. As a result, the generator can start returning unpredictable and unintended results.
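To make the idea of an invisible but disruptive pixel change concrete, here is a minimal sketch. It is a toy illustration only and not Nightshade’s actual algorithm: the “vision model” is an assumed linear classifier, and the weights, pixel values and the `epsilon` budget are all hypothetical. It shows how nudging every pixel by a tiny, bounded amount can flip what the model “sees”.

```python
# Toy illustration only: an assumed linear "vision model", NOT Nightshade's method.

def score(weights, pixels):
    """Linear classifier score; a positive score means 'balloon', otherwise 'egg'."""
    return sum(w * p for w, p in zip(weights, pixels))

def label(weights, pixels):
    return "balloon" if score(weights, pixels) > 0 else "egg"

def perturb(weights, pixels, epsilon):
    """Move each pixel by at most epsilon in the direction that lowers the score."""
    return [p - epsilon * (1 if w > 0 else -1) for w, p in zip(weights, pixels)]

# Hypothetical 1,000-pixel image (values in [0, 1]) and hypothetical model weights.
n = 1000
weights = [0.01 if i % 2 == 0 else -0.01 for i in range(n)]
pixels = [0.52 if i % 2 == 0 else 0.50 for i in range(n)]

poisoned = perturb(weights, pixels, epsilon=0.02)
max_change = max(abs(a - b) for a, b in zip(pixels, poisoned))

print(label(weights, pixels))    # label for the original image
print(label(weights, poisoned))  # label after the small perturbation
print(max_change)                # largest change made to any single pixel
```

In this sketch no pixel moves by more than about 2% of its range, yet the toy model’s label changes from “balloon” to “egg”. Real tools optimise their changes against far more complex models.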

Symptoms of poisoning

As in our earlier example, a balloon might become an egg. A request for an image in the style of Monet might instead return an image in the style of Picasso.

Some of the issues with earlier AI models, such as trouble accurately rendering hands, could return. The models could also introduce other odd and illogical features to images – think six-legged dogs or deformed couches.

The higher the number of “poisoned” images in the training data, the greater the disruption. Because of how generative AI works, the damage from “poisoned” images also affects related prompt keywords.

For example, if a “poisoned” image of a Ferrari is used in training data, prompt results for other car brands and for other related terms, such as vehicle and automobile, can also be affected.
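The relationship between the number of “poisoned” images and the size of the disruption can be sketched with a deliberately simple toy model. The assumptions here are ours, not the article’s: each image is reduced to a single hypothetical “appearance” number, and the model’s idea of a prompt is just the average of all images carrying that label.

```python
# Toy model (an assumption for illustration): the "concept" learned for a prompt
# is the average appearance value of every training image labelled with it.

def learned_concept(clean_features, poisoned_features):
    """Average feature value the toy model associates with a prompt."""
    samples = clean_features + poisoned_features
    return sum(samples) / len(samples)

BALLOON, EGG = 0.0, 10.0   # hypothetical 1-D "appearance" values
clean = [BALLOON] * 100    # 100 genuine balloon images

# Poisoned images are labelled "balloon" but look like eggs to the model.
for n_poisoned in (0, 25, 100, 300):
    concept = learned_concept(clean, [EGG] * n_poisoned)
    print(n_poisoned, concept)
# 0 0.0
# 25 2.0
# 100 5.0
# 300 7.5
```

As the count of poisoned samples grows, the toy model’s notion of a “balloon” drifts steadily towards “egg”, which mirrors the dose-dependent effect described above.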

Nightshade’s developer hopes the tool will make big tech companies more respectful of copyright, but it’s also possible users could abuse the tool and intentionally upload “poisoned” images to generators to try to disrupt their services.

Is there an antidote?

In response, stakeholders have proposed a range of technological and human solutions. The most obvious is paying greater attention to where input data are coming from and how they can be used. Doing so would result in less indiscriminate data harvesting.

This approach does challenge a common belief among computer scientists: that data found online can be used for any purpose they see fit.

Other technological fixes include the use of “ensemble modeling”, where different models are trained on many different subsets of data and compared to locate specific outliers. This approach can be used not only for training but also to detect and discard suspected “poisoned” images.
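One way to picture the ensemble idea is the sketch below. It is a simplified illustration under our own assumptions: images are reduced to one hypothetical feature value, each “model” is a basic nearest-centroid classifier, and the function names are invented for this example. A sample is flagged when every model that was not trained on it disagrees with its label.

```python
# Simplified sketch: several toy models, each trained on a different data subset,
# are compared to find samples whose labels look like outliers.

def centroids(samples):
    """Train a toy nearest-centroid model: the mean feature value per label."""
    sums, counts = {}, {}
    for feature, lab in samples:
        sums[lab] = sums.get(lab, 0.0) + feature
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: sums[lab] / counts[lab] for lab in sums}

def predict(model, feature):
    """Return the label whose centroid is closest to the feature."""
    return min(model, key=lambda lab: abs(model[lab] - feature))

def flag_suspects(samples, n_models=3):
    """Flag samples whose label is rejected by every model not trained on them."""
    folds = [samples[i::n_models] for i in range(n_models)]
    suspects = []
    for i, held_out in enumerate(folds):
        others = [centroids(f) for j, f in enumerate(folds) if j != i]
        for feature, lab in held_out:
            votes = [predict(m, feature) for m in others]
            if all(v != lab for v in votes):
                suspects.append((feature, lab))
    return suspects

# Hypothetical data: balloons sit near 0, eggs near 10; the last item is poisoned.
data = [(0.0, "balloon"), (10.0, "egg"), (0.5, "balloon"), (9.5, "egg"),
        (1.0, "balloon"), (9.0, "egg"), (0.2, "balloon"), (9.8, "egg"),
        (0.8, "balloon"), (9.7, "balloon")]

print(flag_suspects(data))  # [(9.7, 'balloon')]
```

Requiring every independent model to disagree keeps false alarms low in this toy setting. A production pipeline would need real image features and a more tolerant voting rule.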

Audits are another option. One audit approach involves developing a “test battery” – a small, highly curated, and well-labelled dataset – using “hold-out” data that are never used for training. This dataset can then be used to examine the model’s accuracy.
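A “test battery” audit of this kind can be sketched in a few lines. Everything here is assumed for illustration: the battery items, the one-number image features and the two toy models (one clean, one whose “balloon” concept has been dragged towards “egg”). The point is only that a fixed, well-labelled hold-out set gives a stable yardstick for spotting a drop in accuracy.

```python
# Minimal audit sketch: score models against a small, curated hold-out set
# that is never used for training. All values below are hypothetical.

def accuracy(predict, test_battery):
    """Fraction of test items for which the model returns the expected label."""
    correct = sum(1 for item, expected in test_battery if predict(item) == expected)
    return correct / len(test_battery)

def make_model(balloon_centre, egg_centre):
    """Build a toy nearest-centroid classifier from two concept centres."""
    centres = {"balloon": balloon_centre, "egg": egg_centre}
    return lambda feature: min(centres, key=lambda lab: abs(centres[lab] - feature))

TEST_BATTERY = [(0.3, "balloon"), (1.2, "balloon"), (2.0, "balloon"), (4.0, "balloon"),
                (6.5, "egg"), (7.5, "egg"), (9.1, "egg"), (9.9, "egg")]

clean_model = make_model(0.5, 9.5)
poisoned_model = make_model(6.0, 9.5)  # "balloon" concept shifted by poisoning

print(accuracy(clean_model, TEST_BATTERY))     # 1.0
print(accuracy(poisoned_model, TEST_BATTERY))  # 0.75
```

A fall in accuracy on the same battery between model versions is a signal to inspect the newly added training data.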

Strategies against technology

So-called “adversarial approaches” (those that degrade, deny, deceive, or manipulate AI systems), including data poisoning, are nothing new. They have also historically included using make-up and costumes to circumvent facial recognition systems.

Human rights activists, for example, have been concerned for some time about the indiscriminate use of machine vision in wider society. This concern is particularly acute for facial recognition.

Systems like Clearview AI, which hosts a massive searchable database of faces scraped from the internet, are used by law enforcement and government agencies worldwide. In 2021, Australia’s government determined Clearview AI breached the privacy of Australians.

In response to facial recognition systems being used to profile specific individuals, including legitimate protesters, artists devised adversarial make-up patterns of jagged lines and asymmetric curves that prevent surveillance systems from accurately identifying them.

There is a clear connection between these cases and the issue of data poisoning, as both relate to larger questions around technological governance.

Many technology vendors will consider data poisoning a pesky issue to be fixed with technological solutions. However, it may be better to see data poisoning as an innovative solution to an intrusion on the fundamental moral rights of artists and users.

T.J. Thomson, Senior Lecturer in Visual Communication & Digital Media, RMIT University and Daniel Angus, Professor of Digital Communication, Queensland University of Technology

This article is republished from The Conversation under a Creative Commons license. Read the original article.

Library strategy and Artificial Intelligence

by Dr Andrew M Cox, Senior Lecturer, the Information School, University of Sheffield.

This post was originally published in the National Centre for AI blog, owned by Jisc. It is re-printed with permission from Jisc and the author.

On April 20th 2023 the Information School, University of Sheffield invited five guest speakers from across the library sectors to debate “Artificial Intelligence: Where does it fit into your library strategy?”

The speakers were:

  1. Nick Poole, CEO of CILIP
  2. Neil Fitzgerald, Head of Digital Research, British Library
  3. Sue Lacey-Bryant, Chief Knowledge Officer; Workforce, Training and Education Directorate of NHS England
  4. Sue Attewell, Head of Edtech, JISC
  5. John Cox, University Librarian, University of Galway

A capacity audience of 250 people had signed up online, and there was a healthy audience in the room in Sheffield.

Slides from the event can be downloaded here. These included updated results from the pre-event survey, which had 68 responses.

This blog is a personal response to the event and summary written by Andrew Cox and Catherine Robinson.

Impact of generative AI

Andrew Cox opened the proceedings by setting the discussion in the context of our culture’s fascination with AI, from ancient Greece and movies from as early as the start of the 20th century through to current headlines in the Daily Star!

Later on in the event, in his talk, John Cox quoted several authors saying AI promised to produce a profound change to professional work. And it seemed to be agreed amongst all the speakers that we had entered a period of accelerating change, especially with ChatGPT and other generative AI.

These technologies offer many benefits. Sue Lacey-Bryant shared some examples of how colleagues were already experimenting with using ChatGPT in multiple ways: to search, organise content, design web pages, draft tweets and write policies. Sue Attewell mentioned JISC-sponsored AI pilots to accelerate grading, draft assessment tasks, and analyse open text NSS comments.

And of course wider uses of AI are potentially very powerful. For example, Sue Lacey-Bryant shared the example of how many hours of radiologists’ time AI was saving the NHS. Andrew Cox mentioned how ChatGPT functions would be realised within MS Office as Copilot. Specifically for libraries, from the pre-event survey it seemed that the most developed services currently were library chatbots and Text and Data Mining support; but the emphasis of future plans was “Promoting AI (and data) literacy for users”.

But it did mean uncertainty. Nick Poole compared the situation to the rise of Web 2.0 and suggested that many applications of generative AI were emerging and we didn’t know which might be the winners. User behaviour was changing and so there was a need to study this. As behaviour changed there would be side effects which required us to reflect holistically, Sue Attewell pointed out. For example, if generative AI can write bullet point notes, how does this impact learning if writing those notes was itself how one learned? She suggested that the new technology cannot be banned. It may also not be detectable. There was no choice but to “embrace” it.

Ethics

The ethics of AI is a key concern. In the pre-event survey, ethics was the most frequently identified key challenge. Nick Poole talked about several of the novel challenges from generative AI. What are its implications for intellectual freedom? What should be preserved from generative AI (especially as it answers differently to each iteration of a question)? Nick identified that professional ethics have to be:

  • “Inclusive – adopting an informed approach to counter bias
  • Informed & evidence-based – geared towards helping information users to navigate the hype cycle
  • Critical & reflective – understanding our own biases and their impact
  • Accountable – focused on trust, referencing and replicability
  • Creative – helping information users to maximise the positive benefits of AI augmented services
  • Adaptive – enabling us to refresh our skills and expertise to navigate change”

Competencies

In terms of professional competencies for an AI world, Nick said that there was now wider recognition that critical thinking and empathy were key skills. He pointed out that the CILIP Professional Knowledge and Skills Base (PKSB) had been updated to reflect the needs of an AI world, for example by including data stewardship and algorithmic literacy. Andrew Cox referred to some evidence that the key skills needed are social and influencing skills, not just digital ones. Skills that respondents to the pre-event survey thought that libraries needed were:

  • General understanding of AI
  • How to get the best results from AI
  • Open-mindedness and willingness to learn
  • Knowledge of user behaviour and need
  • Copyright
  • Professional ethics and having a vision of benefits

Strategy

John Cox pointed to evidence that most academic library strategies were not yet encompassing AI. He thought it was because of anxiety, hesitancy, ethics concerns, and inward-looking and linear thinking. But Neil explained how the British Library is developing a strategy. The process was challenging, akin to “flying a plane while building it”. Sue Attewell emphasised the need for the whole sector to develop a view. The pre-event survey suggested that the most likely strategic responses were: to upskill existing staff, study sector best practice and collaborate with other libraries.

Andrew Cox suggested that some key issues for the profession were:

  • How do we scope the issue: as being about data/AI, or as a wider digital transformation?
    • How does AI fit into our existing strategies – especially given the context of institutional alignment?
    • What constitutes a strategic response to AI? How does this differ between information sectors?
  • How do we meet the workforce challenge?
    • What new skills do we need to develop in the workforce?
    • How might AI impact equality and diversity in the profession?

Workshop discussions

Following the presentations from the speakers, those attending the event in person were given the opportunity to further discuss in groups the professional competencies needed for AI. Those attending online were asked to put any comments they had regarding this in the chat box. Some of the key discussion points were:

  • The need for professionals to rapidly upskill themselves in AI. This includes understanding what AI is and the concepts and applications of AI in individual settings (e.g. healthcare, HE etc.), along with understanding our role in supporting appropriate use. However, it was believed this should go beyond a general understanding to a knowledge of how AI algorithms work, how to use AI and actively adopting AI in our own professional roles in order to grow confidence in this area.
  • Horizon scanning and continuous learning – AI is a fast-paced area where technology is rapidly evolving. Professionals not only need to stay up-to-date with the latest developments, but also be aware of potential future developments to remain effective and ensure we are proactive, rather than reactive.
  • Upskilling should not focus only on professional staff; all levels of library staff (e.g. library assistants) will require some level of upskilling in the area of AI.
  • Importance of information literacy and critical thinking skills in order to assess the quality and relevance of AI outputs. AI should therefore be built into professional training around these skills.
  • Collaboration skills – As one group stated, this should be more ‘about people, not data’. AI requires collaboration with:
    • Information professionals across the sector to establish a consistent approach; 
    • Users (health professionals, students, researchers, public etc.) to establish how they are using AI and what for;
    • Other professionals (e.g. data scientists).
  • Recruitment problems were also discussed, with it noted that for some there had been a drop in people applying for library roles. This was impacting not only on the ability to bring new skillsets into the library (e.g. data scientists), but also on the ability to allow existing staff the time to upskill in the area of AI. It was suggested that there was a need to promote to applicants the lifestyle and wellbeing advantages of working in libraries.

Other issues that came up in the workshop discussions centred on how AI will impact the overall library service, with the following points made:

  • There is the need to expand library services around AI, as well as embed it in current services;
  • Need to focus on where the library can add value in the area of AI (i.e. USP);
  • Libraries need to make a clear statement to their institution regarding their position on AI;
  • AI increases the importance of and further incentivises open access, open licensing and digitisation of resources;
  • Questions over whether there is a need to rebrand the library.

The attendees also identified that the following would be useful to help prepare the sector for AI:

  • Sharing of job descriptions to learn about what AI means in practice and help with workforce planning. It was noted, however, that the RL (Research Libraries) Position Description Bank already contains almost 4000 position descriptions from research libraries, primarily from North America, though there are many examples from RLUK members;
  • A reading list and resource bank to help professionals upskill in AI;
  • Work shadowing;
  • Sharing of workshops delivered by professionals to users around the use of AI;
  • AI mailing lists (e.g. JISCmail);
  • Establishment of a Community of Practice to promote collaboration. It was noted, however, that AI would probably change different areas of library practice (such as collecting or information literacy) and so was likely to be discussed within the professional communities that already existed in these areas.

Workshop outcome

Following the workshop, Andrew Cox and Catherine Robinson worked on a draft working paper, which we invite you to comment on: Draft for comment: Developing a library strategic response to Artificial Intelligence: Working paper.