Case Study: Using AI as a research assistant to develop Milton's Climate Action Plan

As a member of Milton’s Climate Action Planning Committee, I’m working with a team of committed, passionate volunteers and town staff to develop the roadmap for Milton to rapidly reduce its town-wide fossil fuel pollution and prevent the most damaging impacts of climate change.

Massachusetts’s climate goal to achieve net-zero emissions by 2050 is at once breathtakingly ambitious and barely adequate to prevent dire consequences for future generations. Writing a climate action plan is challenging because it necessarily involves bold, unprecedented action across every aspect of the local economy, as almost everything we do uses energy from fossil fuels in one way or another. Because these goals are unprecedented, no community has yet fully implemented such a plan, and there is no well-blazed path to follow. But there are plenty of solutions and strategies for us to consider: we are aware of at least 20 other Massachusetts communities that have adopted Climate Action Plans, with many more currently developing them.

Even for volunteers who are passionate about this topic, reading and digesting dozens of other towns' plans is not an easy task. Generally, we want to focus on one topic area at a time (e.g. reducing residential building emissions). We have questions like, “What are towns near us doing to increase the supply of renewable electricity for their residents?”, but the way the plans are published makes it time-consuming to extract answers to such questions. They are typically available as slickly formatted PDFs downloaded from each town's website. I was desperate for a single spreadsheet of all the towns' actions, strategies, and goals that I could filter and search quickly, but I quickly ran out of patience trying to assemble one by hand.

After the release of ChatGPT last year and the rapid development of new programming frameworks for integrating LLMs into software, I saw an opportunity to automate my way out of this difficulty. These models can extract structured data from prose documents like climate action plans, making it easier to review and evaluate them systematically. As a Data Scientist, I was also looking for an opportunity to get my hands dirty with the latest developments in AI, and this seemed like a good use case for learning.

Long story short, this was indeed a good task for AI: I was able to extract a dataset of 617 goals and 2,437 action items from 21 climate action plans published by municipalities in the Northeast. Feel free to use it! Compared to the tiring hours it took to do this for a single document by hand, the pipeline took at most a few minutes per document and cost under $20 in total. I had to overcome a few common technical challenges, which are discussed in more detail below. The full project code is available on GitHub under an MIT license.

Developing the Solution

My goal was to build a data pipeline that would process a collection of PDF documents into a table of goals and actions.
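Conceptually, the end of the pipeline flattens the goals and actions extracted from each document into rows of one table. Here is a minimal, stdlib-only sketch of that final step (the field names and records are illustrative assumptions, not the project's actual schema):

```python
import csv
import io

def records_to_table(records, fieldnames=("town", "type", "description")):
    """Flatten extracted goal/action records into a single CSV table."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    for rec in records:
        # Attributes missing from a record become empty cells, not errors.
        writer.writerow({k: rec.get(k, "") for k in fieldnames})
    return buf.getvalue()

# Hypothetical records of the kind the extraction step might produce.
records = [
    {"town": "Milton", "type": "goal", "description": "Net-zero emissions by 2050"},
    {"town": "Milton", "type": "action", "description": "Retrofit municipal buildings"},
]
print(records_to_table(records))
```

A table like this is what makes it possible to filter and search all the towns' goals and actions at once, rather than paging through PDFs.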

As I do for all my projects, I used the reproducible data science project structure I’ve written about elsewhere. To integrate LLMs into my pipeline, I used LangChain, a general-purpose programming framework that connects to a variety of LLM providers (e.g. OpenAI, Hugging Face) and offers a range of utilities for incorporating LLM results into an application.

The Prompt

Much of the introductory content about LLMs focuses on chatbots or Q&A workloads, where the LLM is expected to produce a brief, relevant answer to a statement from the user. This information extraction use case is a little different, in that I wanted the LLM to exhaustively review lengthy source material and provide a comprehensive list of all the relevant details it contains (in this case, goals and actions).

I expressed my use case to the LLM with the following prompt:

You are an expert municipal climate action planner.

Read the provided Climate Action Plan document and extract the following information
described in the text:

1. Goals;

2. Action items to achieve the goals;

If the text does not explicitly describe a goal or action, do not make up an answer. Only
extract relevant information from the text. However, be exhaustive, and extract all goals
and actions that are explicitly mentioned in the text.

There will often be multiple goals and actions mentioned on each page.

If the value of an attribute you are asked to extract is not present in the text, return
null for the attribute's value.

The instructions to “extract all goals and actions that are explicitly mentioned in the text; there will often be multiple goals and actions mentioned on each page” guided the LLM to produce an exhaustive list of goals and actions in its output. I further reinforced this through the output data structure definitions described below.
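Because plans are processed a page (or chunk) at a time, these fixed instructions are combined with each page's text before being sent to the model. In the actual pipeline, LangChain's prompt templates handle this assembly; the sketch below is a simplified stand-in (the abbreviated instruction text and function name are assumptions for illustration):

```python
# Abbreviated version of the extraction instructions shown above.
EXTRACTION_INSTRUCTIONS = """\
You are an expert municipal climate action planner.

Read the provided Climate Action Plan document and extract all goals and the
action items to achieve them. Do not make up an answer; be exhaustive."""

def build_request(page_text: str) -> str:
    """Combine the fixed extraction instructions with one page of text."""
    return f"{EXTRACTION_INSTRUCTIONS}\n\nDocument text:\n{page_text}"

# One such request is sent to the model per page of the source PDF.
request = build_request("Goal 1: Reduce building emissions 50% by 2030.")
```

Keeping the instructions constant and varying only the page text means every page is evaluated against the same extraction criteria.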

Structured Information Extraction

LangChain’s information extraction capability seemed perfectly suited to my task and was my key reason for selecting it. It transforms user-defined pydantic data structures into LLM prompts that guide the model to return results in structured formats like JSON or YAML, which can then be analyzed systematically. This feature lets you define detailed attributes of any entity you want to extract, and also provide semantic guidance to the LLM on what to look for. For example, here is how I directed the LLM to identify a climate action: