Inspirative Text Prediction
Introduction
Machine learning, and more specifically, deep learning, is shaping how we write. From academic papers to class materials, emails to text messages, we are constantly using technologies powered by deep learning to compose our texts. Moreover, studies have shown that predictive text influences what we write [1,2,4]. Currently, most text prediction technology uses a model that looks at the previously typed words and the surrounding text to generate a list of likely next words or phrases. It ranks each of them based on their probabilities and presents the most likely ones to users as suggestions. However, not only may those suggestions be biased, but they may also affect how users write and what they write, thereby taking away their authorship and autonomy. Could text prediction models instead serve as a source of inspiration for users, encouraging their writing process instead of suggesting what to write?
In this blog post, I will explore the possibility of using text prediction to inspire users to write more original texts. I will define what it means to be inspirational and then present a preliminary approach to collecting example data and evaluating the current large language models (LLMs) to determine their likelihood of predicting subordinating conjunctions. I will then discuss the challenges and opportunities of using text prediction to inspire users to write more original texts.
Exploratory Data Analysis
import plotly.io as pio
import plotly.express as px
import pandas as pd
import requests
import spacy

pio.templates.default = "plotly_white"

spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")
Collection
Let’s start by defining a function to download a book from Project Gutenberg. To accomplish this, we will use Gutendex to retrieve the book’s metadata and then download the book using the URL to the plain text version of the book provided in the metadata. For the purpose of this blog post, we will only download books in English.
#| code-fold: True
def download_book(book_id: int) -> tuple[str, str]:
    """Download a book from Project Gutenberg
    Arg:
        book_id: The Project Gutenberg ID of the book to download
    Returns:
        A tuple containing the book title and the book text
    """
    gutendex_url = f"https://gutendex.com/books/{book_id}/"

    try:
        response = requests.get(gutendex_url)
        response.raise_for_status()
        data = response.json()

        book_language = data["languages"]

        # Only download books in English
        if "en" in book_language:
            book_title = data["title"]

            # Only download books in plain text
            mime_types = ["text/plain", "text/plain; charset=us-ascii"]
            # Start with no URL so the check below works when no format matches
            book_url = None
            for mime_type in mime_types:
                if mime_type in data["formats"]:
                    book_url = data["formats"][mime_type]
                    break

            if book_url is None:
                raise Exception("The book is not available in plain text.")

            response = requests.get(book_url)
            response.raise_for_status()
            return book_title, response.text
        else:
            raise Exception("The book is not in English.")
    except requests.exceptions.HTTPError as err:
        raise Exception(err)
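The format lookup in the function above can be illustrated on a toy metadata dictionary. This is only a minimal sketch: the `pick_plain_text_url` helper and the example URLs are hypothetical, and it assumes the `formats` field maps MIME-type strings to URLs, as the function above does. It accepts any key whose MIME type is `text/plain`, which is a looser rule than the two exact keys checked above.

```python
from typing import Optional


def pick_plain_text_url(formats: dict) -> Optional[str]:
    """Return the URL of the first plain-text format, or None if there is none."""
    for mime_type, url in formats.items():
        # Accept "text/plain" with or without a charset suffix
        if mime_type.split(";")[0].strip() == "text/plain":
            return url
    return None


# Hypothetical metadata, shaped like the "formats" field used above
example_formats = {
    "text/html": "https://example.org/book.html",
    "text/plain; charset=us-ascii": "https://example.org/book.txt",
}

plain_url = pick_plain_text_url(example_formats)
no_url = pick_plain_text_url({"text/html": "https://example.org/book.html"})
```

Returning `None` instead of raising leaves the decision about a missing format to the caller.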
For this EDA, we will download The Strange Case of Dr. Jekyll and Mr. Hyde by Robert Louis Stevenson.
# Book ID for The Strange Case of Dr. Jekyll and Mr. Hyde
book_id = 43

# Download the book and store it in a DataFrame
book_data = [download_book(book_id)]
book_data = pd.DataFrame(book_data, columns=["title", "text"])
Wrangling
Let’s take a look at the downloaded text:
#| echo: false
# Print the first 256 characters of the book
print(book_data["text"].iloc[0][:256].strip() + "\n\n...\n")
# Print the last 256 characters of the book
print(book_data["text"].iloc[0][-256:].strip(), end="")
It looks like the text contains some extra information which we do not wish to include in our analysis. Let’s remove the extra information and save the cleaned text in a new column.
Specifically, we will use the markers provided by Project Gutenberg to remove the extra information. These markers appear as follows:
*** START OF THE PROJECT GUTENBERG EBOOK …
*** END OF THE PROJECT GUTENBERG EBOOK …
#| code-fold: true
def sanitize_text(text: str) -> str:
    """Remove extra information from the text
    Arg:
        text: The text to sanitize
    Returns:
        The sanitized text
    """
    start_marker = "***"
    end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"

    # Index of the second occurrence of the start marker
    start_index = text.find(start_marker, text.find(start_marker) + 1)

    # Index of the first occurrence of the end marker
    end_index = text.find(end_marker)

    # Remove the extra information based on the marker indices
    if start_index != -1 and end_index != -1:
        text = text[start_index + len(start_marker) : end_index].strip()

    return text

# Sanitize the text and store it in a new column
book_data["clean_text"] = book_data["text"].apply(sanitize_text)
Let’s take a look at the cleaned text:
#| echo: false
# Print the first 256 characters of the book
print(book_data["clean_text"].iloc[0][:256].strip() + "\n\n...\n")
# Print the last 256 characters of the book
print(book_data["clean_text"].iloc[0][-256:].strip())
This looks much better! Our next step is to split the text into sentences to analyze it at the sentence level. We will use spaCy to do this:
#| code-fold: true
def sentence_spliter(text: str) -> list[str]:
    """Split the text into sentences
    Arg:
        text: The text to split
    Returns:
        A list of sentences
    """
    pipe_disable = ["ner", "lemmatizer", "textcat"]

    # Remove line breaks and split the text into sentences
    docs = nlp.pipe([text.replace("\r\n", " ")], disable=pipe_disable)

    # Return a list of sentences without leading and trailing whitespace
    return [sent.text.strip() for doc in docs for sent in doc.sents]

# Split the text into sentences and store them in a DataFrame
sentences = sentence_spliter(book_data["clean_text"].iloc[0])
sentences = pd.DataFrame(sentences, columns=["sentence"])

sentences.tail()
How many sentences are there in the book?
#| echo: false
print(f"There are {len(sentences)} sentences in the book.")
How many sentences use subordinating conjunctions? To answer this question, we will use spaCy’s part-of-speech tagger to identify sentences that contain subordinating conjunctions:
#| code-fold: true
def doc_pipe(sentence: str):
    pipe_disable = ["ner", "lemmatizer", "textcat"]
    return list(nlp.pipe([sentence], disable=pipe_disable))

def has_sconj(sentence: str):
    """Check if a sentence contains a subordinating conjunction
    Arg:
        sentence: The sentence to check
    Returns:
        A Pandas Series containing a boolean value indicating whether the sentence contains a subordinating conjunction and the subordinating conjunction if it exists
    """
    doc = doc_pipe(sentence)

    # Check if the sentence contains a subordinating conjunction
    for token in doc[0]:
        if token.pos_ == "SCONJ":
            return pd.Series([True, token.text])
    return pd.Series([False, None])

#| code-overflow: wrap
# Check if the sentence contains a subordinating conjunction and store the result in a new column
sentences[["has_sconj", "sconj"]] = sentences["sentence"].apply(has_sconj)

# Sanity check
assert sentences["has_sconj"].value_counts().sum() == len(sentences)

sentences.tail()
How many of the sentences contain subordinating conjunctions? How many of the sentences do not contain subordinating conjunctions?
#| echo: false
print(
f"There are {len(sentences[sentences['has_sconj']])} sentences with a subordinating conjunction,\nand {len(sentences[~sentences['has_sconj']])} sentences without a subordinating conjunction."
)
Visualization
Let’s try visualizing one of the sentences that contains a subordinating conjunction:
Figure 1. Visualization of a Sentence That Contains a Subordinating Conjunction
#| code-fold: true
# Grab a sentence that contains a subordinating conjunction
sentence_id = 1149
doc = nlp(sentences["sentence"].iloc[sentence_id])

# Visualize the sentence using displaCy
spacy.displacy.render(doc, style="dep", jupyter=True, options={"distance": 110})
What about the distribution of subordinating conjunctions in the book?
#| code-fold: true
# Lower case the subordinating conjunctions and count them
sent_sconj = sentences["sconj"].str.lower().value_counts().reset_index()

# Plot the distribution of subordinating conjunctions
fig = px.bar(
    sent_sconj,
    x="sconj",
    y="count",
    title="<b>Figure 2.</b> Distribution of Subordinating Conjunctions",
    labels={"sconj": "Subordinating Conjunction", "count": "Count"},
    color_discrete_sequence=px.colors.qualitative.Safe,
)
fig.show()
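The counting step behind this chart can also be sketched without pandas. The list below is hypothetical toy data standing in for the `sconj` column, with `None` marking sentences that have no subordinating conjunction; it is not taken from the book.

```python
from collections import Counter

# Hypothetical conjunctions, as they might appear in the "sconj" column
toy_sconj = ["that", "That", "if", "because", "that", None, "If"]

# Lower-case the conjunctions, skip missing values, and count them
sconj_counts = Counter(s.lower() for s in toy_sconj if s is not None)

# The two most frequent conjunctions, most frequent first
most_common = sconj_counts.most_common(2)
```

Lower-casing before counting merges sentence-initial forms such as "That" with "that", which is why the chart above does the same.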
Analysis
This result is somewhat surprising to me. I did not expect “that” to be the most common subordinating conjunction in the book. I had expected “because” to be more common when compared to the other subordinating conjunctions since I personally use “because” frequently in my writing. This might suggest that there could be a different distribution of subordinating conjunctions that are more commonly used based on the writing context. Furthermore, this result does not provide any information about which subordinating conjunctions are more useful than others, particularly in the context of text prediction. Our next step is to evaluate the current large language models (LLMs) to determine their likelihood of predicting subordinating conjunctions.
Preliminary Modeling
from transformers import AutoTokenizer, AutoModelForCausalLM
from torch.nn.functional import softmax, cross_entropy
from datasets import load_dataset
import pandas as pd
import numpy as np
import random
import torch
import spacy
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")
Load the Model
We will use the Llama-2-7b-chat-hf model to evaluate an LLM’s likelihood of predicting subordinating conjunctions. Unfortunately, running the model is computationally expensive on most machines. Therefore, we used AutoAWQ to quantize the model into 4-bit precision. This reduces the amount of computational resources required to run inference on the model while still maintaining a high level of accuracy. We have provided our code for quantizing the model in the Appendix. In the meantime, you can access our quantized model here: CalvinU/Llama-2-7b-chat-hf-awq.
model_name = "CalvinU/Llama-2-7b-chat-hf-awq"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
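To give a rough sense of why 4-bit quantization helps, the following back-of-the-envelope sketch estimates the memory needed for the weights alone. It assumes roughly 7 billion parameters and ignores everything else (quantization scales and zero points, activations, and the key-value cache), so the numbers are approximate lower bounds rather than measurements. The `weight_memory_gb` helper is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to store the weights alone, in gigabytes."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


num_params = 7e9  # assumption: roughly 7 billion parameters

fp16_gb = weight_memory_gb(num_params, 16)  # 16-bit weights
int4_gb = weight_memory_gb(num_params, 4)   # 4-bit weights
```

Under these assumptions, the 4-bit weights take about a quarter of the space of 16-bit weights.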
Load the Data
We have also described in the Appendix a scalable approach to collecting and processing data from Project Gutenberg. In the meantime, you can access our dataset here: CalvinU/project-gutenberg.
dataset_name = "CalvinU/project-gutenberg"

dataset = load_dataset(dataset_name, split="train")
dataset_df = pd.DataFrame(dataset)
The dataset contains 10 random books downloaded from Project Gutenberg. These books have already been sanitized and split into sentences based on their book_id and title. Therefore, each row in the dataset represents an ordered sentence from one of the books. Let’s take a look at the dataset:
dataset_df.tail()
Wrangling
Since we already have an ordered list of sentences, we can apply the same approach we used in the EDA section to identify sentences that contain subordinating conjunctions:
#| code-fold: true
def doc_pipe(sentence: str):
    pipe_disable = ["ner", "lemmatizer", "textcat"]
    return list(nlp.pipe([sentence], disable=pipe_disable))

def has_sconj(sentence: str):
    """Check if a sentence contains a subordinating conjunction
    Arg:
        sentence: The sentence to check
    Returns:
        A Pandas Series containing a boolean value indicating whether the sentence contains a subordinating conjunction and the subordinating conjunction if it exists
    """
    doc = doc_pipe(sentence)

    # Check if the sentence contains a subordinating conjunction
    for token in doc[0]:
        if token.pos_ == "SCONJ":
            return pd.Series([True, token.text])
    return pd.Series([False, None])

# Check if the sentence contains a subordinating conjunction and store the result in a new column
dataset_df[["has_sconj", "sconj"]] = dataset_df["sentence"].apply(has_sconj)

# Sanity check
assert dataset_df["has_sconj"].value_counts().sum() == len(dataset_df)

dataset_df.tail()
Number of sentences and number of sentences that contain subordinating conjunctions for each book:
summary = dataset_df.groupby(["book_id", "title"], as_index=False).agg(
    num_sents=("sentence", "count"),
    num_sconj=("has_sconj", "sum"),
)
summary.head(10)
Analysis
Suppose the general structure of a sentence with a subordinating conjunction is:
<sentence-with-SCONJ> ::= <subordinate-clause> <independent-clause> |
<independent-clause> <subordinate-clause>
Note that a <subordinate-clause> is a dependent clause that contains a subordinating conjunction and cannot stand alone as a sentence, while an <independent-clause> is a main clause that can stand alone as a sentence.
In order to evaluate the likelihood of an LLM predicting subordinating conjunctions, we will investigate the following behaviors:
How do the cross-entropy and perplexity change when we provide the context exactly as it appears in the book, versus when we randomly shuffle the context?
And for each case, what is the probability spectrum at the subordinating conjunction? What are the cross-entropy and perplexity of the text after the subordinating conjunction?
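For readers less familiar with these two metrics, here is a small worked example. The cross-entropy of a token is the negative log-probability the model assigned to the token that actually appears, and the perplexity of a span is the exponential of the mean cross-entropy over its tokens. The probabilities below are hypothetical and chosen to keep the arithmetic simple.

```python
import math

# Hypothetical probabilities the model assigned to each observed token
token_probabilities = [0.5, 0.25, 0.125]

# Cross-entropy per token is the negative log-probability of the observed token
cross_entropies = [-math.log(p) for p in token_probabilities]
mean_cross_entropy = sum(cross_entropies) / len(cross_entropies)

# Perplexity is the exponential of the mean cross-entropy
perplexity = math.exp(mean_cross_entropy)
```

Here the mean cross-entropy is 2 ln 2, so the perplexity is 4: on average, the model is as uncertain as if it were choosing uniformly among four tokens.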
When Context Is Provided Exactly as It Appears in the Book
To get started, let’s select one of the books from the dataset, The Adventures of a Dog, and a Good Dog Too by Alfred Elwes:
#| code-fold: true
# Book ID for The Adventures of a Dog, and a Good Dog Too
book_id = 20741

selected_book = dataset_df[dataset_df["book_id"] == book_id].reset_index(drop=True)
selected_book.tail()
How many sentences are there in the book? How many of the sentences contain subordinating conjunctions?
#| echo: false
print(
f"There are {len(selected_book)} sentences in the book. There are {len(selected_book[selected_book['has_sconj']])} sentences with a subordinating conjunction"
)
It appears that this book uses a significant number of subordinating conjunctions! Let’s choose one of the last sentences that includes a subordinating conjunction and select a maximum of 100 sentences preceding it as the context:
last_sconj_index = selected_book[selected_book["has_sconj"]].index[-3]

context = selected_book.iloc[max(last_sconj_index - 100, 0) : last_sconj_index][
    "sentence"
].tolist()
context = " ".join(context)

sentence = selected_book.iloc[last_sconj_index]["sentence"]
Let’s take a small peek at the context:
#| echo: false
# Print the first 100 characters of the context
print(context[:100].strip() + " ... ", end="")
# Print the last 100 characters of the context
print(context[-100:].strip(), end="\n\n")
Let’s take a look at the sentence with the subordinating conjunction:
#| echo: false
# Print the sentence
print(sentence)
Let’s tokenize the context and the sentence, and then feed them into the model to get the predicted logits:
#| code-fold: true
# Tokenize the context (to be used later, not as an input sequence)
context_tokenized = tokenizer(context, return_tensors="pt").to(device)
context_input_ids = context_tokenized.input_ids

# Tokenize the sentence (to be used later, not as an input sequence)
sentence_tokenized = tokenizer(sentence, return_tensors="pt").to(device)
sentence_input_ids = sentence_tokenized.input_ids

# Tokenize the context and the sentence as an input sequence
prompt_tokenized = tokenizer(context + sentence, return_tensors="pt").to(device)
prompt_input_ids = prompt_tokenized.input_ids

# Get the predicted logits for the input sequence
model_logits = model(prompt_input_ids).logits
#| code-fold: true
# Get the subordinating conjunction from the book
sconj = selected_book.iloc[last_sconj_index]["sconj"]

# Decode the context as a string, excluding the first token
context_decoded = [tokenizer.decode(token) for token in context_input_ids[0]][1:]

# Decode the sentence as a string, excluding the first token
sentence_decoded = [tokenizer.decode(token) for token in sentence_input_ids[0]][1:]

# Index of the subordinating conjunction token in the input sequence
sconj_token_index = len(context_decoded) + sentence_decoded.index(sconj)

# Index of the subordinating conjunction in the sentence (not the input sequence, but from the book)
sconj_index = selected_book.iloc[last_sconj_index]["sentence"].find(sconj)

# Figure out which type of clause comes first
if sconj not in sentence[:sconj_index]:
    independent_clause = sentence[:sconj_index]
    subordinate_clause = sentence[sconj_index:]
else:
    independent_clause = sentence[sconj_index:]
    subordinate_clause = sentence[:sconj_index]

# Index of the independent clause in the sentence (not the input sequence, but from the book)
independent_clause_index = selected_book.iloc[last_sconj_index]["sentence"].find(
    independent_clause
)

# Index of the subordinate clause in the sentence (not the input sequence, but from the book)
subordinate_clause_index = selected_book.iloc[last_sconj_index]["sentence"].find(
    subordinate_clause
)

# Tokenize the independent clause
independent_clause_tokenized = tokenizer(
    independent_clause, return_tensors="pt"
).to(device)

# Tokenize the subordinate clause
subordinate_clause_tokenized = tokenizer(
    subordinate_clause, return_tensors="pt"
).to(device)
Given that we have fed in the context exactly as it appears in the book, let’s take a look at the top k probability spectrum at the subordinating conjunction:
#| code-fold: true
def probability_spectrum_at(logits, input_ids, i, k=6):
    """Given an input sequence, get the top k probability spectrum at the given index
    Args:
        logits: predicted logits for the input sequence
        input_ids: input sequence token IDs
        i: index to get the probability spectrum at
        k: top k, default is 6
    Returns:
        A Pandas DataFrame containing the top k probability spectrum at the given index
    """
    # Predicted logits for an input sequence, excluding the last element
    adjusted_logits = logits[0, :-1]

    # Input sequence, starting from the second element
    adjusted_input_ids = input_ids[0, 1:]

    # Get the probability distribution predicted by the model
    probability_distribution = softmax(adjusted_logits[i], dim=0)

    # Get the top k probabilities and their respective indices, default k=6
    top_probability_distribution, top_indices = probability_distribution.topk(k)

    # Get the top k probability spectrum as a DataFrame
    probability_spectrum = pd.DataFrame(
        {
            "token": [tokenizer.decode(token) for token in top_indices.tolist()],
            "probability": top_probability_distribution.tolist(),
        }
    )

    # Decode the input sequence as a string
    matching_token = tokenizer.decode(adjusted_input_ids[i])

    # Highlight the matching string in the probability spectrum
    def highlight_prompt_at(x):
        if x["token"] == matching_token:
            return ["background-color: #6495ED"] * len(x)
        else:
            return [""] * len(x)

    return probability_spectrum.style.apply(highlight_prompt_at, axis=1)

def cross_entropy_at(logits, input_ids, i):
    """Given an input sequence, get the cross entropy at the given index
    Args:
        logits: predicted logits for the input sequence
        input_ids: input sequence token IDs
        i: index to get the cross entropy at
    Returns:
        The cross entropy at the given index
    """
    # Predicted logits for an input sequence, excluding the last element
    adjusted_logits = logits[0, :-1]

    # Input sequence, starting from the second element
    adjusted_input_ids = input_ids[0, 1:]

    # Get the cross entropy per input sequence
    cross_entropy_seq = cross_entropy(
        adjusted_logits, adjusted_input_ids, reduction="none"
    )

    return cross_entropy_seq[i].item()

def cross_entropy_per_token(logits, input_ids, matching_sequence_tokenized):
    """Given a matching sequence, get the cross entropy for each token in the matching sequence
    Args:
        logits: predicted logits for the input sequence
        input_ids: input sequence token IDs
        matching_sequence_tokenized: tokenized matching sequence
    Returns:
        A Pandas DataFrame containing the cross entropy for each token in the matching sequence
    """
    # Predicted logits for an input sequence, excluding the last element
    adjusted_logits = logits[0, :-1]

    # Input sequence, starting from the second element
    adjusted_input_ids = input_ids[0, 1:]

    # Get the cross entropy per input sequence
    cross_entropy_seq = cross_entropy(
        adjusted_logits, adjusted_input_ids, reduction="none"
    )

    # Decode the tokenized matching sequence as a string
    matching_sequence_token = [
        tokenizer.decode(token) for token in matching_sequence_tokenized.input_ids[0]
    ]

    # Decoded matching sequence token, starting from the second element
    adjusted_matching_sequence_token = matching_sequence_token[1:]

    return pd.DataFrame(
        {
            "token": adjusted_matching_sequence_token,
            "cross_entropy": cross_entropy_seq[
                -len(adjusted_matching_sequence_token) :
            ].tolist(),
        }
    )
#| code-fold: true
prob_spectrum = probability_spectrum_at(
    model_logits, prompt_input_ids, sconj_token_index
)
prob_spectrum
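The helper functions above drop the last row of logits and the first input ID before indexing. The reason is that a causal language model's logits at position i score the token at position i + 1, so shifting both by one makes row i line up with the token it predicts. The sketch below illustrates that alignment and the top-k step in plain Python; the four-token vocabulary, the logits, and the `top_k` helper are all hypothetical.

```python
import math


def softmax(logits: list) -> list:
    """Convert logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probabilities: list, vocab: list, k: int) -> list:
    """Return the k most probable (token, probability) pairs."""
    ranked = sorted(zip(vocab, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical 4-token vocabulary and one row of logits per input position
vocab = ["that", "because", "if", "dog"]
input_ids = [3, 0, 2]  # "dog", "that", "if"
logits = [
    [0.0, 1.0, 0.0, 0.0],  # after "dog": scores the token at position 1
    [0.0, 0.0, 2.0, 0.0],  # after "that": scores the token at position 2
    [1.0, 0.0, 0.0, 0.0],  # after "if": scores a token beyond the input
]

# Drop the last logits row and the first input ID so that row i scores target i
adjusted_logits = logits[:-1]
adjusted_input_ids = input_ids[1:]

i = 1
probabilities = softmax(adjusted_logits[i])
spectrum = top_k(probabilities, vocab, k=2)
observed_token = vocab[adjusted_input_ids[i]]
```

In this toy case the most probable token at index 1 matches the observed token, which is the situation the highlighted row in the table above is meant to flag.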
Let’s also look at the cross-entropy and perplexity of the sentence with the subordinating conjunction:
#| code-fold: true
# Cross-entropy of the sentence with the subordinating conjunction (entire input sequence)
sentence_per_token_cross_entropy = cross_entropy_per_token(
    model_logits, prompt_input_ids, sentence_tokenized
)

# Mean cross-entropy of the sentence with the subordinating conjunction (entire input sequence)
mean_sentence_cross_entropy = sentence_per_token_cross_entropy["cross_entropy"].mean()

# Perplexity of the sentence with the subordinating conjunction (entire input sequence)
sentence_perplexity = np.exp(mean_sentence_cross_entropy)

# Cross-entropy of the independent clause
independent_clause_per_token_cross_entropy = cross_entropy_per_token(
    model_logits, prompt_input_ids, independent_clause_tokenized
)

# Mean cross-entropy of the independent clause
mean_independent_clause_cross_entropy = independent_clause_per_token_cross_entropy[
    "cross_entropy"
].mean()

# Perplexity of the independent clause
independent_clause_perplexity = np.exp(mean_independent_clause_cross_entropy)

# Cross-entropy of the subordinate clause
subordinate_clause_per_token_cross_entropy = cross_entropy_per_token(
    model_logits, prompt_input_ids, subordinate_clause_tokenized
)

# Mean cross-entropy of the subordinate clause
mean_subordinate_clause_cross_entropy = subordinate_clause_per_token_cross_entropy[
    "cross_entropy"
].mean()

# Perplexity of the subordinate clause
subordinate_clause_perplexity = np.exp(mean_subordinate_clause_cross_entropy)

# Cross-entropy of the subordinating conjunction
subordinating_conjunction_cross_entropy = cross_entropy_at(
    model_logits, prompt_input_ids, sconj_token_index
)

# Perplexity of the subordinating conjunction
subordinating_conjunction_perplexity = np.exp(subordinating_conjunction_cross_entropy)
#| echo: false
# Print in the order of the sentence structure
if independent_clause_index > subordinate_clause_index:
    print("Structure:")
    print("\t<sentence-with-SCONJ> ::= <subordinate-clause> <independent-clause>")
    print("\nMetrics:")
    print(f"\t<sentence-with-SCONJ> cross-entropy: {mean_sentence_cross_entropy}")
    print(f"\t<sentence-with-SCONJ> perplexity: {sentence_perplexity}\n")
    print(f"\t<subordinate-clause> cross-entropy: {mean_subordinate_clause_cross_entropy}")
    print(f"\t<subordinate-clause> perplexity: {subordinate_clause_perplexity}")
    print(f"\t<SCONJ> cross-entropy: {subordinating_conjunction_cross_entropy}")
    print(f"\t<SCONJ> perplexity: {subordinating_conjunction_perplexity}")
    print(f"\t<independent-clause> cross-entropy: {mean_independent_clause_cross_entropy}")
    print(f"\t<independent-clause> perplexity: {independent_clause_perplexity}")
else:
    print("Structure:")
    print("\t<sentence-with-SCONJ> ::= <independent-clause> <subordinate-clause>")
    print("\nMetrics:")
    print(f"\t<sentence-with-SCONJ> cross-entropy: {mean_sentence_cross_entropy}")
    print(f"\t<sentence-with-SCONJ> perplexity: {sentence_perplexity}\n")
    print(f"\t<independent-clause> cross-entropy: {mean_independent_clause_cross_entropy}")
    print(f"\t<independent-clause> perplexity: {independent_clause_perplexity}")
    print(f"\t<SCONJ> cross-entropy: {subordinating_conjunction_cross_entropy}")
    print(f"\t<SCONJ> perplexity: {subordinating_conjunction_perplexity}")
    print(f"\t<subordinate-clause> cross-entropy: {mean_subordinate_clause_cross_entropy}")
    print(f"\t<subordinate-clause> perplexity: {subordinate_clause_perplexity}")
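One detail worth keeping in mind when reading the clause-level numbers: the per-token helper above reads its values from the tail of the sequence, which lines up with a span only when that span ends the input sequence, as the full sentence and a trailing clause do. For a span that sits earlier, such as a clause that opens the sentence, the losses could instead be selected by an explicit offset. The following is a minimal sketch on plain lists; the `span_losses` helper and all numbers are hypothetical and are not part of the analysis above.

```python
def span_losses(losses: list, start: int, length: int) -> list:
    """Return the per-token losses for a span that begins at `start`."""
    if start < 0 or start + length > len(losses):
        raise ValueError("span falls outside the sequence")
    return losses[start : start + length]


# Hypothetical per-token losses: 3 context tokens followed by a 4-token sentence
losses = [0.9, 0.8, 0.7, 2.0, 1.5, 0.4, 0.3]
context_length = 3
first_clause_length = 2
second_clause_length = 2

# Select each clause by where it starts in the sequence
first_clause = span_losses(losses, context_length, first_clause_length)
second_clause = span_losses(
    losses, context_length + first_clause_length, second_clause_length
)

# Slicing from the tail matches the trailing clause, not the leading one
tail_slice = losses[-first_clause_length:]
```

In this toy sequence the tail slice equals the losses of the second clause, so using it for the first clause would report the wrong tokens.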
When Context Is Randomly Shuffled
Let’s shuffle the context and feed it into the model to get the predicted logits:
last_sconj_index = selected_book[selected_book["has_sconj"]].index[-3]

context = selected_book.iloc[max(last_sconj_index - 100, 0) : last_sconj_index][
    "sentence"
].tolist()

random.Random(42).shuffle(context)

context = " ".join(context)

sentence = selected_book.iloc[last_sconj_index]["sentence"]
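The shuffle above uses a dedicated `random.Random` instance with a fixed seed, so the shuffled order is reproducible and the global random state is left alone. A small sketch with hypothetical sentences shows the two properties that matter here: the same seed gives the same order, and shuffling only reorders the sentences without adding or dropping any.

```python
import random

# Hypothetical context sentences in their original order
sentences_in_order = ["First.", "Second.", "Third.", "Fourth."]

# Shuffle a copy in place with a dedicated, seeded generator
shuffled_a = list(sentences_in_order)
random.Random(42).shuffle(shuffled_a)

# The same seed produces the same order on a second copy
shuffled_b = list(sentences_in_order)
random.Random(42).shuffle(shuffled_b)
```

Because `shuffle` works in place and returns `None`, the list has to be shuffled before it is joined into a single string, as in the cell above.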
Let’s take a small peek at the context:
#| echo: false
# Print the first 100 characters of the context
print(context[:100].strip() + " ... ", end="")
# Print the last 100 characters of the context
print(context[-100:].strip(), end="\n\n")
Let’s take a look at the sentence with the subordinating conjunction:
#| echo: false
# Print the sentence
print(sentence)
Same steps as before:
#| code-fold: true
# Tokenize the context (to be used later, not as an input sequence)
context_tokenized = tokenizer(context, return_tensors="pt").to(device)
context_input_ids = context_tokenized.input_ids

# Tokenize the sentence (to be used later, not as an input sequence)
sentence_tokenized = tokenizer(sentence, return_tensors="pt").to(device)
sentence_input_ids = sentence_tokenized.input_ids

# Tokenize the context and the sentence as an input sequence
prompt_tokenized = tokenizer(context + sentence, return_tensors="pt").to(device)
prompt_input_ids = prompt_tokenized.input_ids

# Get the predicted logits for the input sequence
model_logits = model(prompt_input_ids).logits
#| code-fold: true
# Get the subordinating conjunction from the book
sconj = selected_book.iloc[last_sconj_index]["sconj"]

# Decode the context as a string, excluding the first token
context_decoded = [tokenizer.decode(token) for token in context_input_ids[0]][1:]

# Decode the sentence as a string, excluding the first token
sentence_decoded = [tokenizer.decode(token) for token in sentence_input_ids[0]][1:]

# Index of the subordinating conjunction token in the input sequence
sconj_token_index = len(context_decoded) + sentence_decoded.index(sconj)

# Index of the subordinating conjunction in the sentence (not the input sequence, but from the book)
sconj_index = selected_book.iloc[last_sconj_index]["sentence"].find(sconj)

# Figure out which type of clause comes first
if sconj not in sentence[:sconj_index]:
    independent_clause = sentence[:sconj_index]
    subordinate_clause = sentence[sconj_index:]
else:
    independent_clause = sentence[sconj_index:]
    subordinate_clause = sentence[:sconj_index]

# Index of the independent clause in the sentence (not the input sequence, but from the book)
independent_clause_index = selected_book.iloc[last_sconj_index]["sentence"].find(
    independent_clause
)

# Index of the subordinate clause in the sentence (not the input sequence, but from the book)
subordinate_clause_index = selected_book.iloc[last_sconj_index]["sentence"].find(
    subordinate_clause
)

# Tokenize the independent clause
independent_clause_tokenized = tokenizer(
    independent_clause, return_tensors="pt"
).to(device)

# Tokenize the subordinate clause
subordinate_clause_tokenized = tokenizer(
    subordinate_clause, return_tensors="pt"
).to(device)
Given that we have fed in a randomly shuffled context, let’s take a look at the top k probability spectrum at the subordinating conjunction:
prob_spectrum = probability_spectrum_at(
    model_logits, prompt_input_ids, sconj_token_index
)
prob_spectrum
Let’s also look at the cross-entropy and perplexity of the sentence with the subordinating conjunction:
#| code-fold: true
# Cross-entropy of the sentence with the subordinating conjunction (entire input sequence)
sentence_per_token_cross_entropy = cross_entropy_per_token(
    model_logits, prompt_input_ids, sentence_tokenized
)

# Mean cross-entropy of the sentence with the subordinating conjunction (entire input sequence)
mean_sentence_cross_entropy = sentence_per_token_cross_entropy["cross_entropy"].mean()

# Perplexity of the sentence with the subordinating conjunction (entire input sequence)
sentence_perplexity = np.exp(mean_sentence_cross_entropy)

# Cross-entropy of the independent clause
independent_clause_per_token_cross_entropy = cross_entropy_per_token(
    model_logits, prompt_input_ids, independent_clause_tokenized
)

# Mean cross-entropy of the independent clause
mean_independent_clause_cross_entropy = independent_clause_per_token_cross_entropy[
    "cross_entropy"
].mean()

# Perplexity of the independent clause
independent_clause_perplexity = np.exp(mean_independent_clause_cross_entropy)

# Cross-entropy of the subordinate clause
subordinate_clause_per_token_cross_entropy = cross_entropy_per_token(
    model_logits, prompt_input_ids, subordinate_clause_tokenized
)

# Mean cross-entropy of the subordinate clause
mean_subordinate_clause_cross_entropy = subordinate_clause_per_token_cross_entropy[
    "cross_entropy"
].mean()

# Perplexity of the subordinate clause
subordinate_clause_perplexity = np.exp(mean_subordinate_clause_cross_entropy)

# Cross-entropy of the subordinating conjunction
subordinating_conjunction_cross_entropy = cross_entropy_at(
    model_logits, prompt_input_ids, sconj_token_index
)

# Perplexity of the subordinating conjunction
subordinating_conjunction_perplexity = np.exp(subordinating_conjunction_cross_entropy)
#| echo: false
# Print in the order of the sentence structure
if independent_clause_index > subordinate_clause_index:
    print("Structure:")
    print("\t<sentence-with-SCONJ> ::= <subordinate-clause> <independent-clause>")
    print("\nMetrics:")
    print(f"\t<sentence-with-SCONJ> cross-entropy: {mean_sentence_cross_entropy}")
    print(f"\t<sentence-with-SCONJ> perplexity: {sentence_perplexity}\n")
    print(f"\t<subordinate-clause> cross-entropy: {mean_subordinate_clause_cross_entropy}")
    print(f"\t<subordinate-clause> perplexity: {subordinate_clause_perplexity}")
    print(f"\t<SCONJ> cross-entropy: {subordinating_conjunction_cross_entropy}")
    print(f"\t<SCONJ> perplexity: {subordinating_conjunction_perplexity}")
    print(f"\t<independent-clause> cross-entropy: {mean_independent_clause_cross_entropy}")
    print(f"\t<independent-clause> perplexity: {independent_clause_perplexity}")
else:
    print("Structure:")
    print("\t<sentence-with-SCONJ> ::= <independent-clause> <subordinate-clause>")
    print("\nMetrics:")
    print(f"\t<sentence-with-SCONJ> cross-entropy: {mean_sentence_cross_entropy}")
    print(f"\t<sentence-with-SCONJ> perplexity: {sentence_perplexity}\n")
    print(f"\t<independent-clause> cross-entropy: {mean_independent_clause_cross_entropy}")
    print(f"\t<independent-clause> perplexity: {independent_clause_perplexity}")
    print(f"\t<SCONJ> cross-entropy: {subordinating_conjunction_cross_entropy}")
    print(f"\t<SCONJ> perplexity: {subordinating_conjunction_perplexity}")
    print(f"\t<subordinate-clause> cross-entropy: {mean_subordinate_clause_cross_entropy}")
    print(f"\t<subordinate-clause> perplexity: {subordinate_clause_perplexity}")
Results and Conclusion
Our analysis shows that the cross-entropy and perplexity of the sentence with the subordinating conjunction change with the context provided to the model, as does the probability assigned to the subordinating conjunction itself. This suggests that context matters when the model predicts subordinating conjunctions. We also observed that, despite the change in context, the cross-entropy and perplexity of the subordinate clause changed less than those of the independent clause. Although this warrants more thorough investigation, it suggests that certain subordinating conjunctions may be more useful than others, even to the LLM, which was still likely to construct the same subordinate clause after the contextual change.
Limitations
Our work is not without limitations. First, we evaluated the LLM on only one book, which is not representative of the many contexts in which people write. Second, our approach can currently only parse sentences in which the subordinating conjunction appears in the middle of the sentence. Other placements of the subordinating conjunction, such as at the start of a sentence, are edge cases that we have not considered.
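To make the positional limitation concrete, here is a small sketch that classifies where the first subordinating conjunction falls in a sentence. It assumes each token arrives as a (text, part-of-speech) pair, as a tagger such as spaCy could provide. The helper name sconj_position and the hand-assigned tags below are hypothetical and only serve the illustration.

```python
def sconj_position(tagged_tokens):
    """Classify the position of the first SCONJ token in a tagged sentence.

    Returns "initial" if the sentence starts with it, "medial" if it
    appears later, and "none" if there is no SCONJ at all.
    """
    for index, (_, pos) in enumerate(tagged_tokens):
        if pos == "SCONJ":
            return "initial" if index == 0 else "medial"
    return "none"


# Hand-assigned part-of-speech tags (not the output of a real tagger)
medial = [("We", "PRON"), ("walked", "VERB"), ("because", "SCONJ"),
          ("it", "PRON"), ("was", "AUX"), ("sunny", "ADJ"), (".", "PUNCT")]
initial = [("Although", "SCONJ"), ("it", "PRON"), ("rained", "VERB"),
           (",", "PUNCT"), ("we", "PRON"), ("walked", "VERB"), (".", "PUNCT")]

print(sconj_position(medial))
print(sconj_position(initial))
```

Only the "medial" case corresponds to the structure handled in this post. Sentences classified as "initial" would need the clause order reversed before the same metrics could be computed.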
Future Work
A natural extension of this work is to evaluate the LLM’s likelihood of predicting subordinating conjunctions on a more diverse and representative sample of data, including other writing contexts such as academic writing rather than books. The same evaluation could also be applied to other kinds of conjunctions, such as coordinating conjunctions. Another direction is to study the relationship between the LLM’s hyperparameters and its likelihood of predicting subordinating conjunctions.
Appendix
AutoAWQ Quantization
In this section, we document our approach to quantizing the Llama-2-7b-chat-hf model to 4-bit precision using AutoAWQ. Quantization reduces the computational resources required to run inference on the model, with the aim of preserving most of its accuracy.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, AwqConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"
quantized_model_path = "Llama-2-7b-chat-hf-awq"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoAWQForCausalLM.from_pretrained(model_name, **{"low_cpu_mem_usage": True})

# Setup AutoAWQ quantization configuration
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

# Quantize the model
model.quantize(tokenizer, quant_config=quant_config)

# Setup Transformer compatible quantization configuration
quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

# Pass the new quantization configuration to the model
model.model.config.quantization_config = quantization_config

# Save the quantized model weights
model.save_quantized(quantized_model_path)
tokenizer.save_pretrained(quantized_model_path)
To promote reproducibility of this work, we have uploaded our quantized model to Hugging Face repositories. You can access our quantized model here: CalvinU/Llama-2-7b-chat-hf-awq.
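For readers unfamiliar with the settings in quant_config, the following sketch illustrates what group-wise, zero-point quantization to w_bit bits means for a flat list of weights. It is a simplified illustration in plain Python, not AutoAWQ's implementation: AWQ additionally rescales weights based on activation statistics, which is not shown here. The function names and example weights are hypothetical.

```python
def quantize_group(weights, w_bit=4):
    """Asymmetric (zero-point) quantization of one group of weights."""
    q_max = 2**w_bit - 1  # 15 for 4-bit integers
    w_min, w_max = min(weights), max(weights)
    # One scale per group; fall back to 1.0 when all weights are equal
    scale = (w_max - w_min) / q_max or 1.0
    # Integer that represents the real value 0.0, kept inside the integer range
    zero_point = min(q_max, max(0, round(-w_min / scale)))
    quantized = [min(q_max, max(0, round(w / scale) + zero_point)) for w in weights]
    return quantized, scale, zero_point


def dequantize_group(quantized, scale, zero_point):
    """Map the stored integers back to approximate real-valued weights."""
    return [(q - zero_point) * scale for q in quantized]


def quantize(weights, q_group_size=128, w_bit=4):
    """Split the weights into groups and quantize each group separately."""
    groups = [weights[i : i + q_group_size] for i in range(0, len(weights), q_group_size)]
    return [quantize_group(group, w_bit) for group in groups]


# Hypothetical weights; a tiny group size keeps the example readable
weights = [-0.8, -0.3, 0.1, 0.7, -0.2, 0.05, 0.4, -0.6]
for quantized, scale, zero_point in quantize(weights, q_group_size=4):
    print(quantized, dequantize_group(quantized, scale, zero_point))
```

Smaller groups track the local range of the weights more closely at the cost of storing more scales and zero points, which is the trade-off that q_group_size controls.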
Scalable Data Collection
In the EDA, we have only looked at one book. However, in a language modeling task, we would likely need a sample of data that is diverse and representative of different kinds of writing. In this section, we have documented a scalable approach to collecting and processing data from Project Gutenberg.
import pandas as pd
import requests
import random
import spacy

nlp = spacy.load("en_core_web_sm")
Here are the functions we have used. All of them were already defined in the EDA section, except for download_books, which is a wrapper function for download_book that downloads multiple books instead of just one:
def download_book(book_id: int) -> tuple[str, str]:
    """Download a book from Project Gutenberg

    Args:
        book_id: The Project Gutenberg ID of the book to download

    Returns:
        A tuple containing the book title and the book text
    """
    gutendex_url = f"https://gutendex.com/books/{book_id}/"

    try:
        response = requests.get(gutendex_url)
        response.raise_for_status()
        data = response.json()

        book_language = data["languages"]

        # Only download books in English
        if "en" in book_language:
            book_title = data["title"]

            # Only download books in plain text
            mime_types = ["text/plain", "text/plain; charset=us-ascii"]

            # Stays None if no plain-text format is listed for the book
            book_url = None
            for mime_type in mime_types:
                if mime_type in data["formats"]:
                    book_url = data["formats"][mime_type]
                    break

            if book_url is None:
                raise Exception("The book is not available in plain text.")

            response = requests.get(book_url)
            response.raise_for_status()

            return book_title, response.text
        else:
            raise Exception("The book is not in English.")
    except requests.exceptions.HTTPError as err:
        raise Exception(err)
def download_books(n: int) -> list[tuple[int, str, str]]:
    """Download n books from Project Gutenberg

    Args:
        n: The number of books to download

    Returns:
        A list of downloaded books as (book_id, title, text) tuples
    """
    max_book_count = requests.get("https://gutendex.com/books/").json()["count"]

    books = []

    i = 0
    while i < n:
        book_id = random.randint(1, max_book_count)

        try:
            book_title, book_text = download_book(book_id)
            books.append((book_id, book_title, book_text))
            i += 1
        except Exception:
            # Skip books that cannot be downloaded and try another random ID
            continue

    return books
def sanitize_text(text: str) -> str:
    """Remove extra information from the text

    Args:
        text: The text to sanitize

    Returns:
        The sanitized text
    """
    start_marker = "***"
    end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"

    # Index of the second occurrence of the start marker
    start_index = text.find(start_marker, text.find(start_marker) + 1)

    # Index of the first occurrence of the end marker
    end_index = text.find(end_marker)

    # Remove the extra information based on the marker indices
    if start_index != -1 and end_index != -1:
        text = text[start_index + len(start_marker) : end_index].strip()

    return text
def sentence_spliter(text: str) -> list[str]:
    """Split the text into sentences

    Args:
        text: The text to split

    Returns:
        A list of sentences
    """
    nlp.max_length = len(text)

    pipe_disable = ["ner", "lemmatizer", "textcat"]

    # Remove line breaks and split the text into sentences
    docs = nlp.pipe([text.replace("\r\n", " ")], disable=pipe_disable)

    # Return a list of sentences without leading and trailing whitespace
    return [sent.text.strip() for doc in docs for sent in doc.sents]
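The format lookup in download_book matches two exact MIME type keys. If a book's metadata lists its plain-text file under a different charset parameter, that lookup would miss it. A more tolerant variant could compare only the media type and ignore any parameters. The sketch below is one way to do this. The helper name pick_plain_text_url, the charset values, and the example URLs are assumptions for illustration, not part of the original code or a statement about what Gutendex returns for any particular book.

```python
from typing import Optional


def pick_plain_text_url(formats: dict[str, str]) -> Optional[str]:
    """Return the URL of the first plain-text format, ignoring MIME parameters."""
    for mime_type, url in formats.items():
        # "text/plain; charset=utf-8" -> "text/plain"
        media_type = mime_type.split(";")[0].strip()
        if media_type == "text/plain":
            return url
    return None


# Hypothetical "formats" mapping in the shape used by download_book
example_formats = {
    "text/html": "https://example.org/book.html",
    "text/plain; charset=utf-8": "https://example.org/book.txt",
}

print(pick_plain_text_url(example_formats))
print(pick_plain_text_url({"application/epub+zip": "https://example.org/book.epub"}))
```

Inside download_book, the result could replace the loop over mime_types, with the existing "not available in plain text" error raised when the helper returns None.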
Download 10 random books from Project Gutenberg:
n_books = 10

books10 = pd.DataFrame(
    download_books(n_books), columns=["book_id", "title", "text"]
)

assert len(books10) == n_books
Clean the texts:
books10["clean_text"] = books10["text"].apply(sanitize_text)
Split the texts into sentences:
books10_sentences = []

# For each book, split the text into sentences
for i in range(0, len(books10)):
    books10_sentences.append(
        (
            books10["book_id"].iloc[i],
            books10["title"].iloc[i],
            sentence_spliter(books10["clean_text"].iloc[i]),
        )
    )
Create a new DataFrame with the sentences:
# For each sentence in each book, create a new row
books10_sentences = [
    (id, title, sent) for id, title, sents in books10_sentences for sent in sents
]

books10_sentences = pd.DataFrame(
    books10_sentences, columns=["book_id", "title", "sentence"]
)
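The flattening step above turns one entry per book into one row per sentence. The toy example below shows the same nested comprehension on hand-written data, so the resulting shape is easy to inspect without downloading anything. The book IDs, titles, and sentences are invented for the illustration.

```python
# One entry per book: (book_id, title, list of sentences)
toy_books = [
    (1, "Book A", ["First sentence.", "Second sentence."]),
    (2, "Book B", ["Only sentence."]),
]

# One row per sentence, repeating the book metadata on every row
toy_rows = [
    (book_id, title, sent)
    for book_id, title, sents in toy_books
    for sent in sents
]

for row in toy_rows:
    print(row)
```

Each row keeps its book_id and title, so sentences can still be grouped by book after flattening.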
To promote reproducibility of this work, we have saved the data we have collected and processed using this approach as a parquet file. You can view and access our dataset here: CalvinU/project-gutenberg.