Heart of the Matter: Demystifying Copying in the Training of LLMs - DATAVERSITY

Over the past 15 months, progress in generative AI and large language models (LLMs) following the public release of ChatGPT has dominated the headlines.

The building block for this progress was the Transformer model architecture outlined by a team of Google researchers in a paper entitled “Attention Is All You Need.” As the title suggests, a key feature of all Transformer models is the mechanism of attention, defined in the paper as follows:

“An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.”
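The definition quoted above can be sketched in a few lines of Python. This is a minimal illustration for a single query, assuming the scaled dot product as the compatibility function (the form the paper itself uses) and plain lists in place of tensors. The function name `attention` and the toy shapes are illustrative choices, not part of any library.

```python
import math


def attention(query, keys, values):
    """Return (output, weights) for one query over a set of key-value pairs."""
    d_k = len(query)
    # Compatibility of the query with each key: dot product scaled by sqrt(d_k).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    # Softmax turns the scores into non-negative weights that sum to 1.
    # Subtracting the maximum score first keeps exp() numerically stable.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights
```

A key that points in the same direction as the query receives a larger weight, so its value contributes more to the output. Production models apply the same computation to many queries at once using matrix operations and multiple attention heads.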

A characteristic of generative AI models is the massive consumption of data inputs, which could consist of text, images, audio files, video files, or any combination of these (a case usually referred to as “multi-modal”). From a copyright perspective, an important question (among many) is whether training materials are retained in the LLMs produced by various vendors. To help answer that question, we need to understand how textual materials are processed. Focusing on text, what follows is a brief, non-technical description of exactly that aspect of LLM training.

Humans communicate in natural language by placing words in sequences; the rules about the sequencing and specific form of a word are dictated by the specific language (e.g., English). An essential part of the architecture for all software systems that process text (and therefore for all AI systems that do so) is how to represent that text so that the functions of the system can be performed most efficiently. Therefore, a key step in the processing of textual input in language models is the splitting of that input into special “words” that the AI system can understand. Those special words are called “tokens.” The component responsible for this step is called a “tokenizer.” There are many types of tokenizers. For example, OpenAI and Azure OpenAI use a subword tokenization method called “Byte-Pair Encoding (BPE)” for their Generative Pretrained Transformer (GPT)-based models. BPE is a method that repeatedly merges the most frequently occurring pair of adjacent characters or bytes into a single token until a target number of merges or a vocabulary size is reached. The larger the vocabulary size, the more diverse and expressive the texts that the model can generate.
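The merging step described above can be sketched as follows. This is a simplified, character-level illustration of BPE on a toy word list, not the tokenizer used by any particular vendor. Production tokenizers work on bytes, pre-split the text, and learn tens of thousands of merges. The function names `learn_bpe_merges`, `merge_pair`, and `tokenize` are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts out as a sequence of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def tokenize(word, merges):
    """Split a word into tokens by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

For example, learning two merges from the toy list `["low", "low", "low", "lower", "lowest"]` first joins `o` and `w`, then joins `l` and `ow`. After that, `tokenize("lowest", merges)` returns `['low', 'e', 's', 't']`: the frequent stem has become a single token while the rarer suffix remains split into characters.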

Once the AI system has mapped the input text into tokens, it encodes the tokens as numbers and converts the sequences it has processed into vectors referred to as “word embeddings.” A vector is an ordered set of numbers; you can think of it as a row or column in a table. These vectors are representations of tokens that preserve the original natural language representation that was given as text. It is important to understand the role of word embeddings when it comes to copyright because the embeddings form representations (or encodings) of entire sentences, or even paragraphs, and therefore, in vector combinations, even entire documents in a high-dimensional vector space. It is through these embeddings that the AI system captures and stores the meaning and the relationships of words from the natural language.
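The path from tokens to numbers to vectors can be sketched as a simple table lookup. This is an illustration only: the vectors here are filled with random numbers, whereas in a trained model the values are learned during training. The names `build_vocab`, `init_embedding_table`, and `embed`, and the three-dimensional vectors, are assumptions made for this example.

```python
import random


def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def init_embedding_table(vocab_size, dim, seed=0):
    """Create one vector of `dim` numbers per token ID (random stand-in values)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def embed(tokens, vocab, table):
    """Map each token to its ID, then look up the vector stored for that ID."""
    return [table[vocab[token]] for token in tokens]


tokens = ["low", "e", "s", "t", "low"]
vocab = build_vocab(tokens)
table = init_embedding_table(len(vocab), dim=3)
vectors = embed(tokens, vocab, table)
```

Each row of `table` is the vector for one token, so the two occurrences of `"low"` in the sequence map to the same row. Real models use vectors with hundreds or thousands of dimensions.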

Embeddings are used in practically every task that a generative AI system performs (e.g., text generation, text summarization, text classification, text translation, image generation, code generation, and so on). Word embeddings are often stored in vector databases, but a detailed description of all the approaches to storage is beyond the scope of this post, as there is a wide variety of vendors, processes, and practices in use.

As mentioned, almost all LLMs are based on the Transformer architecture, which relies on the attention mechanism. Attention allows the AI technology to view entire sentences, and even paragraphs, as a whole rather than as mere sequences of characters. This allows the software to capture the various contexts within which a word can occur. Because these contexts are provided by the works used in training, including copyrighted works, they are not arbitrary. In this way, the original use of the words, the expression of the original work, is preserved in the AI system. It can be reproduced and analyzed, and can form the basis of new expressions (which, depending on the specific circumstances, may be characterized as “derivative work” in copyright parlance).

LLMs retain the expressions of the original works on which they have been trained. They form internal representations of the text in purpose-built vector spaces and, given the appropriate input as a trigger, they could reproduce the original works that were used in their training. AI systems derive perpetual benefits from the content, including copyrighted content, used to train the LLMs upon which they are based. LLMs recognize the context of words based on the expression of words in the original work, and this context cumulatively benefits the AI system across thousands, or millions, of copyrighted works used in training. These original works can be re-created by the AI system because they are stored as vectors, that is, as vector-space representations of tokens that preserve the original natural language representation of the copyrighted work. From a copyright perspective, determining whether training materials are retained in LLMs is at the heart of the matter, and it is clear that the answer to that question is yes.
