It took J.K. Rowling years to write the first Harry Potter book, but the next hit young-adult novel might be churned out in minutes by a computer. Right now, media companies use AI for sports scores, financial reports, and even to cover breaking news in the right voice for the right audience. Operators “train” the AI on linguistic and journalistic conventions by feeding tens or hundreds of thousands of stories into a database, so that the AI can learn to take in new information and report on it instantly. That same process could generate new works of more creative content like paintings, music, or novels. But what are the rules if an AI-powered assistant uploads hundreds of thousands of existing works of fiction into the AI database; analyzes popular trends, themes, characters, and other creative elements; and then produces a new work?
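To make the ingestion step concrete, here is a deliberately simplified, hypothetical sketch of what “feeding stories into a database” can look like in practice. The directory name and the word-frequency tally are stand-ins for illustration only; a real system would train a statistical language model on the corpus rather than merely count words:

```python
from collections import Counter
from pathlib import Path
import re

def ingest_corpus(corpus_dir: str) -> Counter:
    """Read every .txt file in corpus_dir and tally word frequencies.

    This word count is a placeholder for the analysis step: a real AI
    pipeline would fit a language model to the ingested text instead.
    """
    counts = Counter()
    for path in Path(corpus_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8").lower()
        counts.update(re.findall(r"[a-z']+", text))
    return counts
```

Note that even this toy version copies each work in its entirety into memory, which is exactly why the ingestion phase is the first pressure point for a fair use analysis.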

To protect an enterprise that ingests creative works for analysis from claims of copyright infringement, a good place to start is fair use. The Copyright Act identifies four key (but non-exclusive) factors for analysis: (1) the purpose and character of the use of copyrighted materials; (2) the nature of the copyrighted work; (3) the amount and substantiality of the taking in relation to the copyrighted work as a whole; and (4) the effect of the use on the potential market for, or value of, the underlying work, including potential licensing revenues. Creators and artists are indisputably inspired by one another, but even the bounds of permissible inspiration have recently come into question. How well would fair use cover an AI that ingests hundreds of thousands of existing works in order to create a new commercial work? The major legal risks arise at two pressure points: (i) when the existing works are ingested into the new database; and (ii) when the new works are created.

Ingesting Existing Copyrighted Works into an AI Database

The first high-risk point happens as the existing copyrighted works are brought into the AI database. A fair use defense at this stage would be extremely strong and heavily supported not only by the four-factor analysis, but also more specifically by recent cases upholding fair use with regard to the use of copyrighted works in similar large databases – most notably, the Google Books litigation.

  • For the first factor, a court would likely see the purpose and character of an AI database as highly transformative – the hallmark of fair use – just as the Second Circuit recently viewed the Google Books project, which made digital copies of millions of books to create a searchable database. Our hypothetical new database would be even more transformative than Google’s, essentially using existing works as the “raw material” to create completely different new works. The first factor also looks to whether the resulting work is used for commercial gain, so a fair use defense would be bolstered by letting the public use the end product free of charge.
  • The second factor analyzes the nature of the underlying work. Our hypothetical AI-powered author likely would be ingesting works that are creative, and therefore entitled to greater copyright protection than factual works. On the other hand, the AI author probably is analyzing works that have already been published, which weighs in favor of fair use. So this factor, which tends to receive little weight in the analysis, is not likely to fall squarely in favor of one side or the other.
  • The third factor looks at the amount and substantiality of the underlying work used in the new work. Our hypothetical AI-powered author would take in each underlying work in its entirety. But if the resulting work is transformative and new, based on the collective analysis of hundreds of thousands of existing works rather than on any particular, single existing work, a court probably would discount this factor, or even find it weighs in favor of fair use.
  • In analyzing the fourth factor, our hypothetical publisher likely would argue that a database inflicts no harm on the market for the existing works it uses: collecting the existing works for internal use in the database does not really affect the copyright owner’s ability to license the works for display, performances, or other potentially protected and creative uses. This is especially true if the database never displays the underlying works to its users. Beyond that, analysis of the fourth factor might depend on a variety of variables: Is the database available to third-party users who then have to subscribe or otherwise pay to use it? Is it advertising-supported? Is it restricted to non-commercial uses or available/intended for commercial uses as well? A more commercial model would weigh against fair use, but might ultimately be outweighed by the transformative nature of the technology.

Using Existing Works’ Protectable Elements in New, AI-Generated Works

Our hypothetical database also faces copyright risk when it generates the new works based on its analysis. There’s a chance these new works could inadvertently contain elements of protectable expression similar to those in some of the existing works the database ingested at the start – for example, maybe a new, AI-generated young adult novel contains a few sentences that strongly resemble a passage from Harry Potter and the Chamber of Secrets. In that case, the fair use argument may be weaker and would depend on the particular circumstances (like how much of the existing work is alleged to have been taken, how similar the two passages are, and how that element is alleged to have been used in the new work). But fair use still would be a viable defense, and we can make some general observations:

  • A court’s analysis of the first and second factors would consider many of the same points described above with respect to the ingestion phase. The hypothetical database’s use of the existing works would be highly transformative, presumably breaking down thousands or even hundreds of thousands of existing works to analyze selected components and create entirely new content not based on any specific existing work. This factor would weigh even more heavily in favor of the AI creator if third-party users could access the software and create new content at no charge, though charging a fee would not defeat the AI creator’s fair use defense. A court might find the second factor weighed in the AI creator’s favor where the existing works used by the database were previously published and/or factual.
  • Factor three weighs even more heavily in favor of fair use at this second risk point. Unlike ingesting existing works into our hypothetical AI database, creating new content would likely mean that only a minuscule amount of any one existing work’s protectable content overlaps with the new work, strongly supporting fair use.
  • In analyzing the fourth factor, our hypothetical publisher could argue persuasively that the newly created work doesn’t harm the market for the existing works: collecting them for the database doesn’t hurt the market to license the works for performances or other creative uses. This is especially true when the overlapping protectable expression between the pre-existing and new works is minimal. And like the other factors, a court’s analysis of the fourth factor may also be influenced by whether the AI creator or publisher chooses to provide the AI and new works for free or at a cost to users.

On the whole, fair use provides a significant defense to copyright lawsuits arising from the AI’s creation of new works. That’s not to say that companies can be certain the defense is strong, though, as courts are notoriously hard to predict in their interpretations of fair use. And sometimes, as technology moves forward, the law creates new rights, like synchronization licensing, or new protections, like the time-shifting safe harbor recognized in the Sony Betamax case. If you want to train AI to create the next Starry Night or Harry Potter series, proceed with caution and keep an eye on the courts.