October 18, 2023

Leveraging Large Language Models to make businesses around the world more sustainable

How we built NLP models that classify customer product descriptions to estimate their environmental impact.

With rising global temperatures, the world is facing more and more natural disasters in the form of extreme drought and subsequent fires, as well as extreme rainfall and flooding. Amidst this climate crisis, it is the responsibility of every single organization to implement sustainable business practices. One company that aids organizations in becoming a more sustainable version of themselves is Metabolic. 

Metabolic supports companies in a variety of ways, one of which is systems mapping, where the entire business and its impacts on the environment are mapped. As part of systems mapping, large quantities of product descriptions (with different levels of detail, different languages, and different contexts) are processed to estimate their environmental impact. Central Product Classification (CPC) codes and Life Cycle Assessment (LCA) classes are used as standardized methods for environmental impact assessment. This process, however, involves a lot of manual work and is often difficult due to the variety in the level of detail, the languages, and the context of the product description.

This raises the question: how can we automatically translate our diverse product descriptions into CPC codes and LCA classes? One option FruitPunch AI was eager to try out was leveraging the emerging field of LLMs and zero-shot learning. Curious to see how? Read on!

AI for Impact Assessment

To carry out this challenge, three teams were established with a total of 20 AI engineers. Each team set out to explore a different part of the solution, and all teams tried out different applications of LLMs to see how they could be most valuable and effective.

The following teams worked on this challenge:

  • Data enrichment: enrich the product texts, translate texts, and design features that are useful for the prediction models.
  • Supervised learning: apply supervised learning, using descriptions labeled by Metabolic consultants, to the enriched dataset to predict CPC codes and LCA classes, potentially providing Metabolic with the top 5 prediction scores.
  • Zero-shot learning: apply zero-shot learning to the enriched dataset to predict CPC codes and LCA classes.

We worked with three different subsets of data, divided into separate sheets, containing information extracted from client invoices. Two of the subsets had columns with the text representing the product listed in the invoice and the matching CPC code or LCA name; the third had not been matched beforehand.

fig 1. Example of a dataset with CPC and LCA matches
fig 2. Example of CPC taxonomy

The Central Product Classification (CPC) uses a hierarchical and decimal-based coding system. It's structured into different levels: sections (identified by the first digit), divisions (identified by the first two digits), groups (identified by the first three digits), classes (identified by the first four digits), and subclasses (identified by all five digits together).

For instance, the codes for the sections range from 0 to 9. Each section can be divided into up to nine divisions. At the second digit of the code, each division can, in turn, be further divided into up to nine groups. This pattern can continue with groups being divided into classes and classes into subclasses.

In total, there are 10 sections, 71 divisions, 329 groups, 1,299 classes, and 2,887 subclasses in the CPC system.
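The hierarchy described above can be read directly off the digits of a code. As a minimal sketch (the example code below is illustrative, not taken from the actual CPC tables):

```python
def cpc_levels(code: str) -> dict:
    """Split a 5-digit CPC subclass code into its hierarchical levels.

    Section = 1st digit, division = first 2 digits, group = first 3,
    class = first 4, subclass = all 5 digits together.
    """
    if not (code.isdigit() and len(code) == 5):
        raise ValueError("expected a 5-digit CPC code")
    return {
        "section": code[:1],
        "division": code[:2],
        "group": code[:3],
        "class": code[:4],
        "subclass": code,
    }
```

For an illustrative code "01112", this yields section "0", division "01", group "011", class "0111", and subclass "01112".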

Data Enrichment

Data Preprocessing

The first step in any data science project is data preprocessing. Data preprocessing is crucial to ensure the dataset's cleanliness and uniformity. This involved:

  • Cleaning: irrelevant data was removed, ensuring that the text was free from unnecessary clutter, which could have otherwise impacted the quality of our features.
  • Transformation: All text within the dataset was transformed to lowercase. This not only standardized the text but also simplified subsequent processing steps, ensuring consistency in the feature extraction process.
  • Expansion: To enhance the comprehensibility of the dataset, we expanded acronyms and abbreviations. This step aimed to make the product descriptions more informative and standardized, thus providing a more robust foundation for feature extraction.
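The three steps above can be sketched as a small pipeline. Note that the cleaning rule and the abbreviation table here are illustrative assumptions, not the exact rules used in the challenge:

```python
import re

# Hypothetical abbreviation table; the real expansion list was domain-specific.
ABBREVIATIONS = {"pvc": "polyvinyl chloride", "ss": "stainless steel"}

def preprocess(description: str) -> str:
    """Clean, lowercase, and expand a raw product description."""
    text = description.lower()                               # transformation
    text = re.sub(r"[^a-z0-9\s]", " ", text)                 # cleaning: drop clutter
    words = [ABBREVIATIONS.get(w, w) for w in text.split()]  # expansion
    return " ".join(words)
```

For example, `preprocess("PVC pipe, 32mm")` returns `"polyvinyl chloride pipe 32mm"`.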

Feature Extraction

With the dataset preprocessed, the team moved on to feature extraction: valuable dimensions were added to the dataset using ChatGPT-generated content (GPT-3.5 Turbo). The following features were extracted:

  1. Translations: one of the features involved translating the product descriptions into multiple languages. This expansion served two purposes: it diversified the dataset linguistically, potentially revealing cross-language insights, and broadened the dataset's reach to a global audience.
  2. Keywords: keywords were extracted from the product descriptions. These keywords encapsulated the essence of each product, enabling advanced categorization and search capabilities. A maximum of 8 keywords were generated per product description.
  3. Descriptions: to provide richer context, descriptions were added to the dataset. These new descriptions aimed to offer comprehensive overviews, potentially uncovering subtle details missed in the original product descriptions.
  4. Combined Feature: finally, all these elements, including the original product descriptions, were combined into a single feature. This composite feature brought together the richness of translations, keywords, and descriptions. This comprehensive feature was then used to match CPC codes.

An example of the resulting dataset can be found in figure 3.

Fig. 3 an example of the enriched dataset
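One way such an enrichment request and the final composite feature could be structured is sketched below; the prompt wording and the output format are assumptions, not the exact setup the team used:

```python
def build_enrichment_prompt(description: str, max_keywords: int = 8) -> str:
    """Ask the LLM for translations, keywords (at most max_keywords),
    and a richer description of one product text."""
    return (
        "For the following product description:\n"
        f'"{description}"\n'
        "1. Translate it into English, German, and French.\n"
        f"2. Extract at most {max_keywords} keywords.\n"
        "3. Write a one-paragraph richer description.\n"
        "Return the answer as JSON with keys: translations, keywords, description."
    )

def combine_features(original, translations, keywords, description):
    """Concatenate all enrichments into the single composite feature
    that is later matched against CPC codes."""
    return " | ".join([original, *translations, ", ".join(keywords), description])
```

The prompt would be sent to GPT-3.5 Turbo via the chat API, and the parsed response fed into `combine_features`.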

Enriching Target Labels

In addition to enhancing our dataset's features, we also enriched our target labels, specifically the Central Product Classification v2.1 taxonomy, with explanatory notes. This taxonomy comprises over 4,000 labels, and for many of them explanatory notes were added. These notes offered detailed context and clarifications about the target labels, giving our machine-learning models a better understanding of the classification system. We expected this enrichment not only to improve model interpretability but also to aid in more accurate predictions and analyses.

Supervised learning

Once the dataset was enriched, it could be fed into supervised learning methods to predict the CPC codes. There is a variety of potential approaches, such as Claude.AI, ChatGPT Code Interpreter, and hierarchical models. However, due to the time constraints of this challenge, the team focused their experiments mainly on Claude.AI.

Claude.AI is a language model with one remarkable feature - a large context window. This unique characteristic allows Claude.AI to process and understand large files, making it an invaluable tool for businesses dealing with extensive textual data. To put its capabilities to the test, three intriguing experiments were conducted: 

  • Zero-shot Classification: Can Claude.AI classify client text into CPC (Central Product Classification) codes without any added context? In this experiment, client text from the test file is uploaded to Claude.AI without any prior information.
  • Train-test Classification: This experiment seeks to determine whether Claude.AI can classify unseen client text after being trained on different examples from a similar problem. The training file is uploaded first, followed by client text from the test file.
  • Standards Document -> Test Classification: Can Claude.AI classify client text when provided with the description of what the CPC standard is, without examples of how to match them to client text? This experiment involves uploading the CPC standards document followed by client text from the test file.
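The three experiments differ only in what context precedes the client text. A sketch of how the uploaded context could be assembled (the prompt wording is an assumption; the actual experiments uploaded files through the Claude.AI interface):

```python
def build_experiment_prompt(experiment: str, client_text: str,
                            train_examples: str = "", standard_doc: str = "") -> str:
    """Assemble the context given to Claude for each of the three experiments."""
    if experiment == "zero_shot":
        context = ""  # no prior information at all
    elif experiment == "train_test":
        context = f"Labelled examples from a similar problem:\n{train_examples}\n\n"
    elif experiment == "standards_doc":
        context = f"The CPC standard, without matched examples:\n{standard_doc}\n\n"
    else:
        raise ValueError(f"unknown experiment: {experiment}")
    return context + f"Classify each client text into a CPC code:\n{client_text}"
```

In the zero-shot variant the model sees only the task and the client text; the other two prepend either labelled examples or the standards document.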

Claude.AI achieved an impressive 80% accuracy in matching client text to the correct CPC title when using the train/test classification. When using the zero-shot method, it may provide reasonable information but is prone to making errors. The good news is that it can handle large batches of data, ranging from 100 to 200 records, efficiently. Interestingly, the response size doesn't have a significant impact on errors and hallucinations; it's the method that plays the more crucial role.

Zero-shot learning

Besides the more traditional supervised method, our teams also experimented with zero-shot learning. In zero-shot learning, the AI is fed data that belongs to classes that were not observed during training and is asked to predict these classes. The zero-shot team took a model-centric focus: rather than modifying the data, they focused on finding the best ways to use existing models to process untreated data. This approach typically makes sense at the beginning of a project because it requires the fewest changes to the data and model. In a typical project lifecycle, this would be followed by data enrichment and fine-tuning. In our experiments, we looked at two approaches: prompt engineering GPT-3.5 and text retrieval using embeddings.


We used GPT-3.5-turbo, a conversational LLM, with prompts that describe the task and the desired output and provide an example. The biggest limitation of this approach was the number of classes at the lower CPC levels: in our experiments, the level-3 classes already exceeded the context window limit. As context windows in future models increase, this may become less of an issue.
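One way around the context-window limit is to split the class list into chunks and prompt once per chunk. A hedged sketch (the chunk size and prompt wording are assumptions, not the team's exact prompt):

```python
def chunk_classes(classes: list, chunk_size: int = 50) -> list:
    """Split the full class list into prompt-sized chunks so each
    prompt stays inside the model's context window."""
    return [classes[i:i + chunk_size] for i in range(0, len(classes), chunk_size)]

def classification_prompt(description: str, class_chunk: list) -> str:
    """One zero-shot prompt: the task, the desired output, and the candidates."""
    options = "\n".join(class_chunk)
    return (
        f"Product: {description}\n"
        "Pick the single best matching CPC class from the list below, "
        "or answer NONE if none fits.\n"
        f"{options}"
    )
```

The per-chunk winners could then be compared in a final prompt, at the cost of one extra round of API calls.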

fig 4. Prompting approach using ChatGPT 3.5 turbo
Figure 5 shows the results of the zero-shot classification efforts. The results are split out by CPC level as well as by whether a first, second, or partial match was made. The findings demonstrate that prompting may face challenges when dealing with a high number of classes and limited context windows. The potential for built-in data enrichment is promising, and further refinement of prompt engineering could lead to more robust results.
fig 5. Results of the Zero-shot experiments


Some variations of turning the CPC classes into vector embeddings and comparing their cosine similarity with the client texts were also attempted. The results of this section could still be useful as a baseline and alternative methodology. The experiments revealed that matching text to all classes outperforms matching only to level 1 classes. The observed discrepancies between client text and class embeddings underscore the need for more precise alignment strategies, such as splitting embeddings by level. Data augmentation was shown to enhance correct embedding matching, particularly in fine-grained classifications.

Lastly, while averaging the more frequently occurring classes in the top matches showed promise in selecting the closest match from a pool of candidates, it was unable to rectify higher-level misclassifications.
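The core of the embedding approach can be sketched as follows; the embeddings themselves would come from some encoder model (an assumption here, since the text does not name the one used), but the ranking step is just cosine similarity:

```python
import numpy as np

def best_class_match(text_emb: np.ndarray, class_embs: np.ndarray, top_k: int = 5):
    """Rank CPC classes by cosine similarity to a client-text embedding.

    text_emb: shape (d,), embedding of the client text.
    class_embs: shape (n_classes, d), embeddings of the class descriptions.
    Returns the indices of the top_k closest classes, best match first.
    """
    sims = class_embs @ text_emb / (
        np.linalg.norm(class_embs, axis=1) * np.linalg.norm(text_emb)
    )
    return np.argsort(-sims)[:top_k]
```

The top-k candidates could then be aggregated, for instance by taking the most frequent higher-level class among them, as in the averaging strategy described above.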

Potential impact

The extensive experimentation will help the engineers from Metabolic in the development of their new Impact Assessment platform: Link. Automating impact assessment will provide quick and meaningful insights to companies that really want to implement change and contribute to a sustainable future. We are very happy that we could contribute to this cause.

Special thanks…

We could not have done this Challenge without the amazing efforts from Alma Liezenga, Bram Cals, Rizdi Aprilian, Vivek V, Sabelo Makhanya, Jathin SN, Saloni Sharma, Dave Parr, Genrry Hernández, Mariano Lazarte, Obiageli Umeugochukwu, Yuri Shlyakhter, Graeme Harris, Muhammad Yahiya, Teodora Bujaroska, Dennis Beemsterboer, Freek Boelders, Justin Zarb
