Evaluating generic phrases using Granite models
Using different prompts with different contexts to analyze how LLMs behave when asked generic-related questions
Large language models (LLMs) have a hard time interpreting generic phrases (linguistic generics). Humans might judge a generic phrase such as "Sharks attack beachgoers" to be true, while also knowing that not all sharks attack humans and that, of the ones that do, only a very few attack beachgoers. When an LLM encounters such a phrase in its training corpus, it establishes a relationship between “sharks” and “attacking beachgoers,” which is not necessarily true. (To learn more about the problem of linguistic generics in LLM training, read this research article or this one.)
In this article, we evaluate this generic by posing the question “Do sharks attack beachgoers?” to the model with five different prompt templates, similar to the WikiContradict evaluation pipeline described in this research article (see Figure 3 of that paper). In particular, we provide different prompts with different contexts to analyze how LLMs behave when asked generic-related questions. (We chose the shark example to commemorate the 50th anniversary of JAWS this year.)
Diving into the code
You can review the Jupyter notebook in Google Colab, or continue reading the walkthrough of the Python code in this article.
Before diving into the code (pun intended), we first need to install a few dependencies.
!pip install langchain_community \
langchain-huggingface \
langchain-milvus \
datasets \
transformers \
wikipedia --quiet
Next, we load an existing retriever. We use the WikipediaRetriever, which retrieves Wikipedia articles, and query it on the topic of “Shark”, keeping only the first two documents. For evaluation purposes, it is important that the two documents are distinctly different from each other.
from langchain_community.retrievers import WikipediaRetriever
retriever = WikipediaRetriever(top_k_results=2)
docs = retriever.invoke("Shark")
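Before looking at the contents, we can quickly confirm what was retrieved by printing each document's metadata. This is a small sanity check, not part of the original notebook, and it assumes the retriever attaches the article title and source URL to each document (which WikipediaRetriever does).
# Sanity check: list the title and source URL of each retrieved document.
for i, doc in enumerate(docs):
    print(i, doc.metadata.get("title"), "-", doc.metadata.get("source"))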
Now, let us see what the first document is about.
print(docs[0].page_content[:100])
Output:
Sharks are a group of elasmobranch cartilaginous fishes characterized by a ribless endoskeleton...
We see that the first document is indeed about the animal, which is exactly the context we need to answer our question “Do sharks attack beachgoers?”.
Now let’s see what the second document contains.
print(docs[1].page_content[:100])
Output:
In cryptography, SHARK is a block cipher identified as one of the predecessors of Rijndael...
Here, we see that the second document is related to cryptography and the page retrieved is on SHARK, a block cipher.
Next, we load the tokenizer and the model from Hugging Face. We use the granite-3.3-2b-instruct model for evaluation.
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.3-2b-instruct")
model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3.3-2b-instruct")
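If you are running on a memory-constrained runtime such as a free Colab GPU, you can optionally load the weights in half precision. This is a hedged variant, not required by the walkthrough; torch_dtype is a standard from_pretrained argument, but check that your hardware supports bfloat16 before using it.
import torch
from transformers import AutoModelForCausalLM

# Optional: load the weights in bfloat16 to roughly halve memory use
# (skip this if your hardware does not support bfloat16).
model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.3-2b-instruct",
    torch_dtype=torch.bfloat16,
)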
We start off by creating a HuggingFace pipeline.
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFacePipeline
import torch
from transformers import pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    device=0 if torch.cuda.is_available() else -1,  # use GPU if available
)
llm = HuggingFacePipeline(pipeline=pipe)
Prompt Template 1 (Vanilla Prompt)
Next, we build a PromptTemplate and create an LLM chain.
Then, we ask the model to respond to the question “Do sharks attack beachgoers?”.
We restrict the answer to four quantifiers (“all”, “no”, “some”, or “not all”), the basic quantifier expressions studied formally in Aristotle's syllogistics.
prompt = PromptTemplate.from_template("""Select one of the basic quantifier expressions from the below list for answering the question and provide a reason based on your internal knowledge : [“all”, “no”, “some” or “not all”].
Do not print anything else apart from the answer and the reason.
Question: {question}""")
chain = LLMChain(llm=llm, prompt=prompt)
response = chain.invoke("Do sharks attack beachgoers?")
Finally, we print the response text.
print(response['text'])
Output:
c. some
Reason: Sharks that attack beachgoers are a minority event, not a universal one.
Therefore, the statement "some sharks attack beachgoers" is accurate, but not all.
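If you want to compare runs programmatically, a small helper can pull the chosen quantifier out of the free-form response text. This is a minimal sketch, not part of the original notebook; it assumes the quantifier appears verbatim in the answer, and the priority ordering is only a heuristic.
import re

# Check "not all" before "all" so the longer expression wins when both appear.
QUANTIFIERS = ["not all", "all", "some", "no"]

def extract_quantifier(text):
    """Return the first quantifier expression found in the model's answer, if any."""
    lowered = text.lower()
    for q in QUANTIFIERS:
        if re.search(rf"\b{q}\b", lowered):
            return q
    return None

print(extract_quantifier(response["text"]))  # e.g. "some"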
Prompt Template 2 (Vanilla Prompt + Context from First document)
Similar to the first prompt template, the only change is that we add a context line to the prompt and pass in formatted_context.
Here, formatted_context is produced by a helper function that returns the page content of the first document only. We change this helper accordingly in the other prompt templates, following the evaluation pipeline described above.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    device=0 if torch.cuda.is_available() else -1,  # use GPU if available
)
llm = HuggingFacePipeline(pipeline=pipe)
prompt = PromptTemplate.from_template(
"""Select one of the basic quantifier expressions from the below list for answering the question and provide a reason based on your internal knowledge : [“all”, “no”, “some” or “not all”].
Do not print anything else apart from the answer and the reason.
Question: {question}
Context: {context}"""
)
chain_with_context = LLMChain(llm=llm, prompt=prompt)
def format_docs_0(docs):
    return docs[0].page_content if docs else ""

formatted_context = format_docs_0(docs)
response = chain_with_context.invoke(
    {"context": formatted_context, "question": "Do sharks attack beachgoers?"}
)
The response:
print(response['text'])
Output:
Reason: "some"
Some shark species are apex predators and attack beachgoers, especially the larger species such as the great white shark, tiger shark, and bull shark.
However, it's important to note that not all sharks attack humans, and most sharks have a diet consisting of fish and marine mammals.
The majority of shark attacks on humans are fatal, but the probability of being attacked by a shark is relatively low.
Prompt Template 3 (Vanilla Prompt + Context from Second document)
Here, we use the same code as above with a single change: we now select the second document as the context.
def format_docs_1(docs):
    return docs[1].page_content if docs else ""

formatted_context = format_docs_1(docs)
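As in Prompt Template 2, we re-run the chain with the new context before printing the response; the invocation itself is unchanged.
response = chain_with_context.invoke(
    {"context": formatted_context, "question": "Do sharks attack beachgoers?"}
)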
The response:
print(response['text'])
Output:
Note: The provided context is about the cryptographic block cipher SHARK, not about sharks or beachgoers.
Therefore, there's no relevant information to form a quantifier expression for the question.
The answer is "none" and the reason is: The context does not contain any information about sharks attacking beachgoers.
Prompt Template 4 (Vanilla Prompt + Context from both documents)
Here, we include both documents in the context.
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

formatted_context = format_docs(docs)
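Again, we re-run the chain with the combined context before printing the response, reusing the same invocation as before.
response = chain_with_context.invoke(
    {"context": formatted_context, "question": "Do sharks attack beachgoers?"}
)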
The response:
print(response['text'])
Output:
The context provided does not explicitly discuss sharks attacking beachgoers.
However, it does offer information about shark species, their sizes, habitats, and behaviors.
To answer the question "Do sharks attack beachgoers?", we can infer a basic quantifier expression from the context.
Given the context, we can express it with the expression "some" because sharks do attack humans, though it's a rare occurrence.
This inference is based on the fact that several shark species are apex predators, including bull sharks, tiger sharks, great white sharks, mako sharks, thresher sharks, and hammerhead sharks.
Although they typically do not live in freshwater, some sharks, like the bull shark, can be found in both seawater and freshwater.
Therefore, the answer is "some", and the reason is based on the existence of shark species capable of attacking humans, despite the rarity of such incidents.
Prompt Template 5 (Detailed Prompt + Context from both documents)
Lastly, in Prompt Template 5, we use a detailed prompt with context from both documents (as above).
prompt = PromptTemplate.from_template(
"""Select one of the basic quantifier expressions from the below list for answering the question and provide a reason based on your internal knowledge : [“all”, “no”, “some” or “not all”].
Carefully investigate the given context and provide a concise response that reflects the comprehensive view of the context, even if the answer contains contradictory information reflecting the heterogenous nature of the context.
Question: {question}
Context: {context}"""
)
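As with the earlier templates, we rebuild the chain with the new prompt and invoke it with both documents as context before printing the response.
chain_with_context = LLMChain(llm=llm, prompt=prompt)
response = chain_with_context.invoke(
    {"context": formatted_context, "question": "Do sharks attack beachgoers?"}
)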
The response:
print(response['text'])
Output:
In this context, the basic quantifier expressions to be chosen are "all", "no", "some", or "not all".
The context describes sharks as large predatory fish found in all seas, with some species capable of living in freshwater, and some being apex predators in marine environments.
Sharks are also noted to be caught by humans for various purposes, with many populations threatened due to overfishing and finning practices.
Given these details, the most appropriate quantifier expression is "some." Here's the reasoning:
1. The statement "Do sharks attack beachgoers?" is about whether certain shark species engage in behavior that could pose a risk to humans.
2. The context reveals that sharks are predators, with some species like the great white, tiger, and hammerhead sharks being noted as apex predators.
3. However, the context does not universally assert that all shark species attack beachgoers.
4. While there are documented attacks, the context primarily focuses on the variety, diversity, and ecological roles of sharks rather than providing an exhaustive list of attacks.
5. In fact, the context highlights human threats to shark populations, implying that not all shark attacks are due to human-induced aggression or proximity.
6. Therefore, the answer "some" aligns best with the context, acknowledging that some shark species are capable of, and do, attack beachgoers.
Conclusion
Through this experiment, we see that the Granite 3.3 2B model is good at identifying the quantifiers associated with generic phrases, and that its responses are context-driven. We also see that the model relies on context-dependent information, which is why Prompt Template 3, given only the cryptography document as context, correctly answered “none” to our question “Do sharks attack beachgoers?”.
Feel free to create a copy of the notebook here, and play around with your own linguistic generics.
Acknowledgements
Thanks to my IBM mentors, Dr. Alessandra Pascale and Susan Malaika, for guiding me throughout this project! I would also like to thank Dr. Alexander Tolbert for his guidance and introducing me to this field.