Hey there! If you’ve been following my blog, you know how fascinated I am with Generative AI and its growing applications. Recently, I stumbled upon a tricky problem: parsing unstructured data automatically. Despite my best efforts to get cleaner data from the source, I had to find another solution. That’s when Generative AI came to mind.
In this post, I’ll take you through my journey of creating a proof of concept to tackle this challenge. I’ll run through my choice of Semantic Kernel; I’ve dabbled with various Generative AI frameworks in my spare time, and although the OpenAI SDK impressed me, I wanted something more robust for production. I’ll show you the steps I took and my thought process behind them, and I’ll touch on some advanced features like using multiple models and creating and calling functions.
Here are the goals I set for my proof of concept:
- Extract structured data from tough-to-parse text.
- Make the process fast, accurate, and cost-effective.
- Bonus: Use real-time currency conversion functions to streamline imports.
Curious to see how it all worked out? Stick around, and I’ll walk you through it. All the code samples are available in this GitHub repository.
The Problem: Parsing a Difficult File Format
We’ve all been there—assigned to integrate with a third-party system, only to discover that the data we need is in a format that’s a nightmare to parse. Let me show you an example similar to what I was dealing with:
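To give you a feel for it, here’s a made-up stand-in with the same flavour: makes, models, mileages, manufacture dates, prices, and seller details all jumbled into free text with no consistent layout (the real file looked different, but was equally messy):

```
*** WEEKLY STOCK EXPORT *** seller: Jansen Cars, ask for Piet (tel 06-12345678)
>> VOLVO ~ XC60 ~ km: 89.500 ~ built 03-2019 ~ asking EUR 32.950,-
skoda OCTAVIA kombi / 142.000 km / 06-2016 / € 14.750 (negotiable!)
FORD Fiesta 1.0 -- 57k miles -- first reg. 2018 -- £7,995 -- call Piet
```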
Well, 💩, I had no idea how to handle this. My first instinct was to contact a developer on the other end and ask if we could make the data more manageable.
So, I reached out to the third party, hoping they could provide the data in a structured format. Unfortunately, they’re a small company, manually creating these files from various internal systems. They had no plans to change the format or help me parse it.
Here I was, stuck with an almost unparseable file that I needed to extract structured data from. While a human could easily understand this file, I needed to automate the process. I had to extract the make, model, mileage, manufacture date, and price of each car, along with the seller’s contact information. Doing this manually wasn’t an option since the file was growing in size and I needed to do this regularly. Automation was the only way forward.
So, I turned to Generative AI to see if it could help me out. I had some experience with the OpenAI SDK, but I needed something more robust for production. That’s where Microsoft Semantic Kernel came in.
Why Choose Microsoft Semantic Kernel
As developers, we often search for tools that simplify our work without adding complexity. While the OpenAI SDK is quite effective, there are times when a more robust solution tailored for production environments is necessary. This is where the Microsoft Semantic Kernel comes into play.
What I really like about the Microsoft Semantic Kernel is how it improves on the Azure OpenAI client library by simplifying interactions. It streamlines API communication, reducing repetitive coding and helping maintain a clean codebase.
The flexibility of the tool is also a major plus. It’s straightforward to adapt it to your own models, or to integrate multiple models, as I’ve done here. You can also easily incorporate your own data sources and workflows, making it ideal for customized project needs. It includes support for advanced features such as agents, functions (which you can see in action here), caching, and memory management, all of which can significantly ease development challenges.
Support is another strong point. There’s a wealth of documentation and tutorials available. Being part of the Microsoft ecosystem also means access to a vast community of developers and a plethora of tools and libraries, which is incredibly beneficial.
With its robust features and solid support, Microsoft Semantic Kernel provides a reliable foundation for building and deploying AI solutions in production environments. After recognizing its capabilities, I decided to use it for my project on parsing unstructured data, and it has proven to be a dependable choice.
Implementing the Solution
If you have some familiarity with GPT models, you’ll know there are a few considerations to keep in mind when implementing this solution. Firstly, the file may be large, so avoiding the token limit is crucial. Secondly, choosing the right model is important. At the time of writing, the best and largest model is GPT-4-Turbo with a 128,000-token context at €0.010 per 1,000 tokens, potentially costing up to €1.28 per request. That’s quite expensive. However, for this task, the full power of GPT-4-Turbo may not be necessary. I’ll start with a more cost-effective model, GPT-3.5-Turbo-0125, at €0.0005 per 1,000 tokens, but I need to be mindful of its 16,000-token limit.
First, to access the GPT models through our cloud provider, Azure, you’ll need to request access for your Azure subscription. You can apply on Azure’s OpenAI Service page, and requests are usually approved within 24 hours.
Once you have access, create a new Azure OpenAI resource in the Azure portal. After setup, you’ll receive an API key and an endpoint URL. Don’t forget to deploy the model and note down the deployment name.
Now, let’s start by creating a new project and adding the Microsoft.SemanticKernel package:
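If you’re starting from scratch, it’s just a couple of CLI commands (the project name is arbitrary):

```
dotnet new console -n CarListingParser
cd CarListingParser
dotnet add package Microsoft.SemanticKernel
```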
Next, set up the kernel in your Program.cs and configure it to use the Azure OpenAI chat completion service:
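A minimal sketch of that setup, with placeholder values you’d swap for your own deployment name, endpoint, and key:

```csharp
using Microsoft.SemanticKernel;

var builder = Kernel.CreateBuilder();

// Register the Azure OpenAI chat completion service on the kernel.
builder.AddAzureOpenAIChatCompletion(
    deploymentName: "<your-deployment-name>",
    endpoint: "<your-endpoint>",
    apiKey: "<your-api-key>");

var kernel = builder.Build();
```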
Replace placeholders with your actual deployment name, endpoint, and API key.
Please be aware that replacing these values with your actual credentials works for examples like this, but it’s not recommended for production. Instead, consider using environment variables or a configuration file to store sensitive information securely. If you are unsure about how to handle sensitive information in your application, consult the Microsoft documentation for best practices. Here is a good example.
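For instance, pulling the values from environment variables keeps them out of source control (the variable names here are just my own convention):

```csharp
// Read the secrets from environment variables instead of hard-coding them.
var endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")
               ?? throw new InvalidOperationException("AZURE_OPENAI_ENDPOINT is not set.");
var apiKey = Environment.GetEnvironmentVariable("AZURE_OPENAI_API_KEY")
             ?? throw new InvalidOperationException("AZURE_OPENAI_API_KEY is not set.");
```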
As a start, let’s try splitting the file into individual car listings. We’ll read the file contents into a string for simplicity (note, this is not recommended for production):
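A first attempt could look roughly like this, with the file contents interpolated straight into the prompt (the prompt wording is my approximation):

```csharp
// Read the whole import file into memory. Fine for a proof of concept,
// not recommended for production.
var fileContents = await File.ReadAllTextAsync("importFile.txt");

// Ask the model to split the text into individual car listings.
var result = await kernel.InvokePromptAsync(
    $"""
    Split the following text into individual car listings.
    Keep every detail of each listing, and present all listings in the same layout.

    {fileContents}
    """);

Console.WriteLine(result);
```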
In this setup, importFile.txt should be included in the project with the CopyToOutputDirectory property set to Always, ensuring it’s always available at runtime.
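In the project file, that looks something like this:

```xml
<ItemGroup>
  <None Update="importFile.txt">
    <CopyToOutputDirectory>Always</CopyToOutputDirectory>
  </None>
</ItemGroup>
```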
Upon running, the output displays each car listing in a uniform structure.
Great! The file was successfully split into individual car listings and formatted uniformly. However, to achieve consistent automated parsing, we need to revise the prompt. This adjustment not only caters to the non-deterministic nature of GPT models but also provides more grounding, helping to stabilize the outputs.
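Here’s a sketch of that revision as an ExtractListings helper; the {{$input}} template variable and KernelArguments are standard Semantic Kernel features, while the exact prompt wording and the JSON-array convention are my own choices:

```csharp
using System.Text.Json;
using Microsoft.SemanticKernel;

public static async Task<List<string>> ExtractListings(Kernel kernel, string fileContents)
{
    const string prompt =
        """
        The input below contains multiple car listings in an inconsistent format.
        Split it into individual car listings.
        Respond with a JSON array of strings, one string per listing,
        and output nothing except the JSON array.

        Input:
        {{$input}}
        """;

    // The file contents are passed as an argument instead of being
    // interpolated into the prompt string itself.
    var result = await kernel.InvokePromptAsync(prompt, new KernelArguments
    {
        ["input"] = fileContents
    });

    return JsonSerializer.Deserialize<List<string>>(result.ToString()) ?? new();
}
```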
As you can see, we’ve structured the prompt a bit more by moving the file contents from string interpolation to an argument, enhancing clarity and order in the code. Additionally, by refining the prompt we ensure the output can be easily parsed automatically using System.Text.Json.JsonSerializer. Requesting JSON as the output format from Large Language Models (LLMs) through a framework like Microsoft’s Semantic Kernel is highly effective because it provides a structured, predictable format and a clear task definition, which is crucial for obtaining accurate and useful responses from the model. However, I encourage you to experiment and find what works best for your specific needs.
The resulting output is now cleanly segmented and easy to work with.
Nice, we now have a list of car listings, each presented as a separate string. You might wonder why I introduced this intermediate step. The reason is straightforward: parsing individual car listings is expected to be more complex, and limiting the context for an LLM like this reduces the risk of the model deviating and producing inaccurate content, known as “hallucinating.” Additionally, breaking down a large task into smaller segments allows for parallel processing of each listing, which speeds up the overall process. Another significant advantage is that this method nearly eliminates the risk of exceeding the model’s token limit.
Now, let’s dive into parsing an individual car listing. The process mirrors the initial file splitting, but we’ll refine our prompt to ensure we capture all necessary details:
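Sketched as a wrapper around a refined prompt, it could look something like this (the field names and prompt text are my approximation):

```csharp
public static async Task<Listing?> ExtractListing(Kernel kernel, string listingText)
{
    const string prompt =
        """
        Extract the following fields from the car listing in the input:
        make, model, mileage, manufactureDate, price, sellerName, sellerPhone.
        Respond with a single JSON object containing exactly these fields
        and output nothing except the JSON object.

        Input:
        {{$input}}
        """;

    var result = await kernel.InvokePromptAsync(prompt, new KernelArguments
    {
        ["input"] = listingText
    });

    // Deserialize the model's JSON answer into our record.
    return JsonSerializer.Deserialize<Listing>(
        result.ToString(),
        new JsonSerializerOptions { PropertyNameCaseInsensitive = true });
}
```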
This structured approach ensures that each listing is meticulously parsed, maintaining the integrity of the data while streamlining the extraction process.
As we want to call this prompt for each individual car listing, I created a wrapper function that takes the contents of a single car listing and returns a Listing object. The Listing object is a simple record that represents the structured data we want to extract from the car listing. The Listing record is defined as follows:
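A minimal version of the record; the exact property names are my guess, and I’m keeping every field as a string here since conversions to proper types come later:

```csharp
public record Listing(
    string Make,
    string Model,
    string Mileage,
    string ManufactureDate,
    string Price,
    string SellerName,
    string SellerPhone);
```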
This ensures a clean and organized structure, making each listing easy to handle and further process or display.
I also encapsulated the entire execution in an Execute method, which orchestrates the parsing and processing of car listings from unstructured text data:
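A sketch of that Execute method, building on the ExtractListings and ExtractListing helpers above:

```csharp
public static async Task Execute(Kernel kernel)
{
    var fileContents = await File.ReadAllTextAsync("importFile.txt");

    // Step 1: split the raw text into individual car listings.
    var rawListings = await ExtractListings(kernel, fileContents);

    // Step 2: parse each listing into structured data in parallel.
    var tasks = rawListings
        .AsParallel()
        .Select(raw => ExtractListing(kernel, raw))
        .ToList();

    var listings = await Task.WhenAll(tasks);

    // Step 3: print the structured results.
    foreach (var listing in listings)
    {
        Console.WriteLine(listing);
    }
}
```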
In this code, we start by calling the ExtractListings function to split the text file into individual car listings. We then enhance performance by initiating parallel processing of each listing to convert them into structured data using the AsParallel method. This approach utilizes multiple threads, speeding up the process significantly. Once all parallel tasks are completed, synchronized with Task.WhenAll, the structured data is then printed to the console.
When you run this code, it produces a neatly formatted output, displaying detailed information about each car.
The final step in our data processing pipeline involves ensuring we do not hit the token limit for the ExtractListings method, having already managed this risk for the ExtractListing method by limiting each invocation to a single listing. To achieve this, we split the initial large file into manageable chunks before sending each to the model. This is a simple yet crucial process and requires a bit of code to implement correctly. Let me show you how I tackled this:
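Here’s roughly how such a GetChunks helper can be written; the default chunk and overlap sizes are placeholders you’d tune to your file and model:

```csharp
public static IEnumerable<string> GetChunks(string fileContents, int chunkSize = 200, int overlapLines = 10)
{
    var lines = fileContents.Split('\n');

    for (var start = 0; start < lines.Length; start += chunkSize)
    {
        // Start each chunk a few lines early so a listing that straddles a
        // chunk boundary is fully contained in at least one of the chunks.
        var from = Math.Max(0, start - overlapLines);
        var count = Math.Min(chunkSize + (start - from), lines.Length - from);

        yield return string.Join('\n', lines.Skip(from).Take(count));
    }
}
```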
The GetChunks function breaks down the file contents into chunks based on a specified size. The chunkSize parameter sets the number of lines in each chunk, and overlapLines adds a buffer by overlapping lines between chunks. This overlapping ensures that no car listing is inadvertently split across two chunks, maintaining the integrity of the data at the cost of potential duplicate entries.
The updated implementation of the Execute function is outlined below:
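A sketch of the chunked version, reusing the helpers from before:

```csharp
public static async Task Execute(Kernel kernel)
{
    var fileContents = await File.ReadAllTextAsync("importFile.txt");

    // Split the file into overlapping chunks that stay well below the token limit.
    var chunks = GetChunks(fileContents).ToList();

    // Extract raw listings from every chunk in parallel.
    var chunkTasks = chunks
        .AsParallel()
        .Select(chunk => ExtractListings(kernel, chunk))
        .ToList();

    var rawListings = (await Task.WhenAll(chunkTasks)).SelectMany(listings => listings);

    // Parse every raw listing into a structured Listing, again in parallel.
    var listingTasks = rawListings
        .AsParallel()
        .Select(raw => ExtractListing(kernel, raw))
        .ToList();

    foreach (var listing in await Task.WhenAll(listingTasks))
    {
        Console.WriteLine(listing);
    }
}
```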
In this approach, we first split the file into chunks using the GetChunks function. These chunks are then processed in parallel with ExtractListings to extract individual car listings. Finally, Task.WhenAll synchronizes the parallel tasks, ensuring that all data is processed before printing the structured listings to the console. This method not only minimizes the risk of exceeding token limits but also maintains a high processing speed.
By executing this code, you can achieve the same detailed output as before while effectively managing larger data volumes. We have successfully transformed unstructured data into structured data, ready for further processing or analysis using more traditional methods. The system is not only efficient but also cost-effective, as it uses the right model for the right task, ensuring optimal performance without unnecessary expense. There are a few more things we can do to enhance it further, so if you are curious about the additional capabilities of Microsoft’s Semantic Kernel, let’s explore some advanced features next.
Creating and calling functions
As previously discussed, to make the structured data from car listings more usable, we need further processing. For instance, standardizing the price to a common currency, parsing dates into DateTime objects, or converting the odometer readings from various units to a uniform integer in kilometers are all enhancements that could greatly improve usability. To facilitate these transformations, we can create functions dedicated to handling each specific task.
However, integrating these functions directly into the flow controlled by the LLM poses a challenge. Currently, there is no direct method to have the model invoke external functions as part of its processing pipeline using the basic model invocation methods we’ve used so far. To address this, we need to utilize the IChatCompletionService provided with the Microsoft Semantic Kernel. This service is typically designed to simplify the creation of chat applications but can be adapted for our purposes to orchestrate complex workflows involving external function calls.
Here’s how we can set up the IChatCompletionService:
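The kernel is built exactly as before; the only addition is resolving the chat completion service from it:

```csharp
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

var builder = Kernel.CreateBuilder();
builder.AddAzureOpenAIChatCompletion(
    deploymentName: "<your-deployment-name>",
    endpoint: "<your-endpoint>",
    apiKey: "<your-api-key>");

var kernel = builder.Build();

// Resolve the chat completion service registered above.
var chatCompletionService = kernel.GetRequiredService<IChatCompletionService>();
```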
In this setup, we begin by constructing the kernel builder, just as we did previously, and add the Azure OpenAI chat completion service. However, this time we also integrate the IChatCompletionService provided by the Semantic Kernel. This service is a higher-level abstraction designed specifically for creating chatbot-like interactions, similar to those managed by ChatGPT. Unlike traditional methods that handle single prompts, IChatCompletionService manages a continuous conversation history, allowing the model to maintain context over the course of an interaction.
Here’s how you can implement the ExtractListing method using this approach:
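A sketch of the chat-based version, with the same assumed field names as before; the system prompt wording is my own:

```csharp
public static async Task<Listing?> ExtractListing(
    Kernel kernel,
    IChatCompletionService chatCompletionService,
    string listingText)
{
    var history = new ChatHistory();

    // The system message describes the task and the expected JSON shape.
    history.AddSystemMessage(
        """
        You extract structured data from a single car listing.
        Respond with one JSON object containing the fields:
        make, model, mileage, manufactureDate, price, sellerName, sellerPhone.
        Output nothing except the JSON object.
        """);

    // The user message carries the actual listing text.
    history.AddUserMessage(listingText);

    var response = await chatCompletionService.GetChatMessageContentAsync(history, kernel: kernel);

    return JsonSerializer.Deserialize<Listing>(
        response.Content ?? "{}",
        new JsonSerializerOptions { PropertyNameCaseInsensitive = true });
}
```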
In this setup, we use the ChatHistory object to simulate a conversation with the model. The conversation starts with a system-generated message that outlines the task for the model, explaining exactly how the car listing should be processed. This is followed by the user (our application) providing the actual car listing text.
- AuthorRole.System: This role is used to provide instructions or context to the model, guiding its response pattern.
- AuthorRole.User: This role represents the input from the user, which in this case is the car listing that needs processing.
The GetChatMessageContentAsync method is then called with the structured conversation history. This method sends the entire conversation to the model and retrieves the structured JSON output, which we then deserialize into a Listing object.
With the transition to using IChatCompletionService and further structuring our code, we’ve introduced a more sophisticated system that allows for expandable functionality through what the Semantic Kernel refers to as Plugins. Plugins enable us to extend the capabilities of our application seamlessly by integrating custom functions directly into the processing workflow of the Semantic Kernel.
To illustrate how plugins can be utilized, let’s consider a practical example of a currency conversion plugin. This plugin will allow our system to convert various currency amounts into US dollars (USD) based on a simplified random conversion rate. Below is the implementation of such a plugin:
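Here’s a sketch of such a plugin; the random rate is purely a stand-in for a real exchange-rate lookup:

```csharp
using System.ComponentModel;
using Microsoft.SemanticKernel;

public class CurrencyPlugin
{
    private static readonly Random Random = new();

    [KernelFunction]
    [Description("Converts an amount in the given currency to US dollars (USD).")]
    public double ConvertToDollar(
        [Description("The amount to convert")] double amount,
        [Description("The currency of the amount, for example EUR or GBP")] string currency)
    {
        // Demo only: a random rate between 0.5 and 1.5 stands in for a call
        // to a genuine currency conversion service.
        var rate = Random.NextDouble() + 0.5;
        return Math.Round(amount * rate, 2);
    }
}
```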
In this example, we’re using a random multiplier to simulate the conversion process. This is purely for demonstration; in a practical setting, you’d replace this with a call to a genuine currency conversion service that provides real-time exchange rates. The KernelFunction attribute marks ConvertToDollar as a plugin function, signaling to the Semantic Kernel that it can be called as part of its operational workflow. The Description attributes ensure that the purpose of the function and its parameters are clear to the LLM.
To integrate our newly created CurrencyPlugin into the Semantic Kernel’s workflow, we need to register the plugin with the kernel. This is done by adding the plugin to the kernel builder. Here’s how you can do it:
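Registering the plugin is a single extra call on the builder:

```csharp
var builder = Kernel.CreateBuilder();
builder.AddAzureOpenAIChatCompletion(
    deploymentName: "<your-deployment-name>",
    endpoint: "<your-endpoint>",
    apiKey: "<your-api-key>");

// Expose the currency conversion function to the kernel as a plugin.
builder.Plugins.AddFromType<CurrencyPlugin>();

var kernel = builder.Build();
```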
With the CurrencyPlugin now part of our kernel configuration, it’s ready to be invoked as needed. The next step involves adjusting how we call the IChatCompletionService to leverage this new capability.
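The call to the chat completion service changes roughly like this:

```csharp
using Microsoft.SemanticKernel.Connectors.OpenAI;

var executionSettings = new OpenAIPromptExecutionSettings
{
    // Let the model invoke registered kernel functions (our plugins) on its own.
    ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions
};

var response = await chatCompletionService.GetChatMessageContentAsync(
    history,
    executionSettings,
    kernel);
```

In my sketch, the system prompt also needs to ask for the price in USD, so the model has a reason to call ConvertToDollar in the first place.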
By modifying the OpenAIPromptExecutionSettings, specifically setting the ToolCallBehavior property to AutoInvokeKernelFunctions, we instruct the system to automatically call our plugin functions when certain conditions within the chat content are met.
With these changes implemented, when you run the code, the interaction will not only parse the car listings but also dynamically convert the price of each listing into USD using the CurrencyPlugin. This random conversion demonstrates the plugin’s functionality, although in a live environment, you would likely use a more deterministic method tied to actual currency exchange rates. This enhancement makes the output not only more uniform but also more adaptable to varying international inputs, illustrating a significant leap in the system’s capability to handle diverse data types and requirements.
Using the CurrencyPlugin, we can now observe the transformation in how our data is processed. Take a listing whose price is quoted in its original currency: in the resulting JSON, the price is now shown in USD thanks to the CurrencyPlugin. This example demonstrates how plugins can neatly organize data and convert values. Yet, implementing these plugins can sometimes bring up challenges.
Even with the advanced technology behind the Semantic Kernel and its plugins, there are times when the model might not react to the prompts as we expect. This issue might require us to make some adjustments to the prompts to help the model perform better. Another option is to consider using more advanced models like GPT-4, known for better understanding and responding to detailed instructions. However, switching to GPT-4 could be more expensive, so it’s important to think about whether the additional cost is justified by the need for more precise outcomes.
The use of plugins, as shown in the currency conversion, significantly boosts the functionality of our models. This currency plugin is just one example. You can create plugins for various tasks, such as parsing dates, converting measurement units, or even calling external services, which greatly broadens what your applications can do. The ability to add these plugins opens up many possibilities, allowing for the creation of more powerful and adaptive applications. Whether you decide to tweak the prompts for better accuracy or upgrade to a more robust model, these tools offer valuable ways to improve how your system works and how accurately it performs.
Using multiple models
If, like me, you run into scenarios where you want to restrict the use of the expensive GPT-4 to certain tasks, you’ll need to configure the kernel to include the models you want to use.
First, set up the kernel to handle multiple models:
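For example, with two deployments it could look like this; the serviceId values are labels I chose, and you’d fill in your own deployment names:

```csharp
var builder = Kernel.CreateBuilder();

builder.AddAzureOpenAIChatCompletion(
    deploymentName: "<your-gpt35-deployment>",
    modelId: "gpt-35-turbo",
    endpoint: "<your-endpoint>",
    apiKey: "<your-api-key>",
    serviceId: "gpt-35-turbo");

builder.AddAzureOpenAIChatCompletion(
    deploymentName: "<your-gpt4-deployment>",
    modelId: "gpt-4",
    endpoint: "<your-endpoint>",
    apiKey: "<your-api-key>",
    serviceId: "gpt-4");

var kernel = builder.Build();
```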
In this setup, we add multiple AddAzureOpenAIChatCompletion calls with different modelId and serviceId values. This way, the kernel registers different models that you can use for various parts of your application.
To use a specific model with IChatCompletionService, pass the serviceId when retrieving the service:
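With keyed registrations in place, retrieving one specifically looks like this:

```csharp
// Resolve the chat completion service registered with serviceId "gpt-4";
// this throws if no matching registration exists.
var gpt4Chat = kernel.GetRequiredService<IChatCompletionService>("gpt-4");
```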
When you want to call InvokePromptAsync, you can specify which model to use by adjusting the arguments with PromptExecutionSettings:
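A sketch, assuming the ServiceId property that recent Semantic Kernel versions expose on PromptExecutionSettings:

```csharp
var arguments = new KernelArguments(new PromptExecutionSettings
{
    // Route this prompt to the model registered under the "gpt-4" service id.
    ServiceId = "gpt-4"
})
{
    ["input"] = fileContents
};

var result = await kernel.InvokePromptAsync(prompt, arguments);
```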
By setting up your kernel like this, you can easily switch between models as needed, optimizing both performance and cost for different parts of your process. This flexibility allows you to tailor the application’s AI capabilities to specific tasks, ensuring you get the best results without unnecessary expense.
Conclusion
I find myself somewhat torn over using something as powerful and unpredictable as a Generative AI model to parse data. Honestly, I can’t think of another method that achieves this level of complexity and capability. If you know of any alternatives, I’d love to hear about them! I have been testing this approach with a variety of unstructured data, and the results have been promising. It just takes this seemingly impossible task and makes it possible. With the parallel processing and plugins, the system was able to handle large volumes of data and perform complex transformations with ease. However, I’ve encountered some challenges, particularly with hallucinations and inaccuracies when calling functions or plugins. Sometimes it works flawlessly; other times, it doesn’t respond as expected, even with identical inputs. This inconsistency leads me to conclude that while this tool is indeed powerful, it isn’t quite ready for autonomous use in production settings without oversight.
To effectively use this system in production, verifying the accuracy of the data is crucial, hence the need for a human-in-the-loop system. In my case, a job queue where users can review and compare the source to the generated output has been sufficient. However, as the volume of parsing grows and the demand for human oversight becomes unsustainable, we might need to consider mitigation strategies like employing LLMs in different roles to validate and verify each other’s outputs. I plan to go into this idea in a future post. If you’re encountering similar challenges, I’d love to connect and brainstorm solutions together.
Despite these challenges, I’m quite satisfied with my decision to use Microsoft Semantic Kernel. As this article has shown, it offers a robust and flexible framework that simplifies the development of AI solutions, providing features that enhance its readiness for production environments and setting it apart from the OpenAI SDK.