Automate Data Extraction Like a Pro: Step-by-Step Guide to Building an AI Pipeline

Kshitij Kutumbe
4 min read · Aug 29, 2024


Introduction

In this blog, we’ll walk through building a data extraction pipeline that leverages LangChain, SERPAPI, and OpenAI. The pipeline automatically searches the web for data, processes the extracted text with a language model, and returns the relevant information in a structured format. As a running example, we’ll extract company revenue, which is crucial for many business use cases such as competitor analysis, sales insights, and lead generation.

Prerequisites

Before diving into the code, make sure you have the following prerequisites:

  • Python 3.8 or later
  • API keys for OpenAI and SERPAPI
  • Installed Python packages: google-search-results (the SerpApi client that provides the serpapi module), langchain, langchain-community, langchain-openai, kor, and python-dotenv

You can install the necessary packages using pip:

pip install google-search-results langchain langchain-community langchain-openai kor python-dotenv

Step 1: Setting Up Environment Variables

To keep your API keys out of the source code, we’ll use the python-dotenv package to load environment variables from a .env file.

.env File

Create a .env file in the root directory of your project and add the following:

OPENAI_API_KEY=your_openai_api_key
SERPAPI_API_KEY=your_serpapi_api_key

Loading Environment Variables in Python

In your Python script, load the environment variables using the following code:

from dotenv import load_dotenv

# load_dotenv() reads the .env file and injects its variables into
# os.environ, so no manual os.environ assignments are needed (assigning
# os.getenv() results would also crash with a TypeError if a key is missing).
load_dotenv()

Step 2: Initializing the Language Model (LLM)

We’ll be using OpenAI’s gpt-3.5-turbo model through the langchain_openai library. The model is configured to process text and extract relevant revenue information.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0,  # deterministic output for extraction tasks
    max_tokens=2000,
    model_kwargs={"frequency_penalty": 0, "presence_penalty": 0, "top_p": 1.0},
)
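If you want to confirm the model is reachable before building the extraction chain, a one-line smoke test (purely illustrative, not part of the original pipeline) looks like this:

# Should print a short reply if the API key and model name are valid.
response = llm.invoke("Reply with the single word OK.")
print(response.content)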

Step 3: Defining the Extraction Schema

Using the Kor library, we define a schema that the language model will use to extract specific information (in this case, revenue data) from the text.

from kor import create_extraction_chain, Object, Text

_schema = Object(
    id="revenue_info",
    description="Extract the revenue amount of the company from text",
    attributes=[
        Text(
            id="extracted_revenue",
            description=(
                "Extract the revenue amount of the company from the text; "
                "only extract the revenue amount, do not extract the funding amount"
            ),
        ),
        Text(
            id="source_domain",
            description="Extract the domain of the URL",
        ),
    ],
    many=False,
)
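Before wrapping the schema in a class, you can exercise it directly on a hand-written snippet. The sample text below is fabricated for illustration, and the exact output shape may vary slightly across Kor versions:

chain = create_extraction_chain(llm, _schema, encoder_or_encoder_class="json")

sample = "url:example.com  Acme Corp reported annual revenue of $12 million in 2023."
out = chain.invoke(input=sample)

# Kor nests the parsed fields under out['text']['data'], e.g.:
# {'revenue_info': {'extracted_revenue': '$12 million', 'source_domain': 'example.com'}}
print(out["text"]["data"])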

Step 4: Creating the Revenue Extraction Class

We’ll create a Revenue class that uses the defined schema and the language model to process text and extract the revenue information.

class Revenue:
    """Extracts revenue information from text using the Kor schema above."""

    def __init__(self, llm):
        self.llm = llm

    def analyze(self, text):
        out_dict = {
            'extracted_revenue': '',
            'source_domain': '',
            'flag_revenue': 1,  # 1 = extraction failed, 0 = success
        }

        if not text:
            return out_dict

        chain = create_extraction_chain(self.llm, _schema, encoder_or_encoder_class='json')
        out = chain.invoke(input=text)

        try:
            result = out['text']['data']['revenue_info']
            out_dict['extracted_revenue'] = result.get('extracted_revenue', '')
            out_dict['source_domain'] = result.get('source_domain', '')
            out_dict['flag_revenue'] = 0
            out_dict = self._flag_and_wrap(out_dict)
        except (KeyError, TypeError):
            # The model returned nothing usable; keep the failure flag set.
            pass

        return out_dict

    def _flag_and_wrap(self, out_dict):
        # Normalize "NA" placeholders to empty strings, then re-flag the
        # record if the revenue field ended up empty.
        for key in out_dict:
            if out_dict[key] in ("NA", "Na", "na"):
                out_dict[key] = ""
        if not out_dict.get('extracted_revenue'):
            out_dict['flag_revenue'] = 1
        return out_dict

Explanation:

  • analyze(): This method takes the scraped text as input, processes it through the language model, and returns the extracted revenue information.
  • _flag_and_wrap(): This helper method ensures the extracted data is clean and flags any missing information.
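A quick usage sketch (the input string is fabricated for illustration):

revenue = Revenue(llm)
info = revenue.analyze("url:example.com  Acme Corp reported annual revenue of $12 million in 2023.")
print(info)
# e.g. {'extracted_revenue': '$12 million', 'source_domain': 'example.com', 'flag_revenue': 0}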

Step 5: Scraping Company Revenue Data with SERPAPI

Now, let’s use SERPAPI to gather information about a company’s revenue from Google search results.

Setting Up SERPAPI

First, install the SerpApi Python client, if you skipped the prerequisites:

pip install google-search-results

Then, use the following code to perform a Google search:

from serpapi import GoogleSearch

def search_company_revenue(company_name):
    params = {
        "engine": "google",
        "q": f"{company_name} company revenue",
        "api_key": os.getenv("SERPAPI_API_KEY"),
        "num": 5,
    }

    search = GoogleSearch(params)
    results = search.get_dict()
    # Guard against queries that return no organic results.
    return results.get("organic_results", [])

# Example usage
comp_name = input('Enter company name: ')
search_results = search_company_revenue(comp_name)
description = search_results[0]['snippet']
displayed_url = search_results[0]['link']

Explanation:

  • search_company_revenue(): This function sends a query to Google using SERPAPI and returns the top 5 organic search results.
  • description / displayed_url: We take the first result’s snippet (a brief description) and its URL and pass them to our revenue extraction pipeline.
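Since a query can return no organic results, and a result can lack a snippet, a slightly more defensive version of the last two lines (a sketch, not from the original post) avoids an IndexError or KeyError:

first = search_results[0] if search_results else {}
description = first.get("snippet", "")
displayed_url = first.get("link", "")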

Step 6: Running the Revenue Extraction

Finally, we’ll pass the scraped data to our Revenue class and extract the revenue information.

revenue_info = Revenue(llm).analyze(f'url:{displayed_url}  {description}')
print(revenue_info)

Explanation:

  • revenue_info: This variable stores the extracted revenue information as a dictionary. It includes the revenue amount, source domain, and a flag indicating whether the extraction was successful.
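Putting it all together, here is an end-to-end sketch, assuming the imports, llm, _schema, Revenue, and search_company_revenue from the previous steps (extract_company_revenue is a hypothetical helper name, not from the original post):

def extract_company_revenue(company_name: str) -> dict:
    # Search Google via SerpApi and run the top result through the
    # revenue extraction chain.
    results = search_company_revenue(company_name)
    if not results:
        return {'extracted_revenue': '', 'source_domain': '', 'flag_revenue': 1}
    snippet = results[0].get('snippet', '')
    link = results[0].get('link', '')
    return Revenue(llm).analyze(f'url:{link}  {snippet}')

if __name__ == "__main__":
    print(extract_company_revenue(input('Enter company name: ')))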

Conclusion

In this blog, we explored how to build an automated revenue extraction pipeline using LangChain, SERPAPI, and OpenAI. With these tools, we can automate the gathering and extraction of revenue-related information from the web, a valuable capability for business intelligence and data analysis.

Future Work

  • Improved Data Handling: Implement better error handling and data validation to improve the robustness of the pipeline.
  • Expand Extraction Capabilities: Customize the schema and extraction logic to extract other financial or business-related information.
  • Scalability: Optimize the pipeline for large-scale extraction tasks and multiple simultaneous queries (see the sketch below).
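For the scalability point, here is a minimal sketch of running several queries concurrently with a thread pool. It reuses the hypothetical extract_company_revenue helper from Step 6, and in practice you would also need to respect SerpApi and OpenAI rate limits:

from concurrent.futures import ThreadPoolExecutor

def extract_many(company_names, max_workers=4):
    # Each extraction is I/O-bound (two network round trips), so a small
    # thread pool gives a decent speed-up without extra infrastructure.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(extract_company_revenue, company_names))
    return dict(zip(company_names, results))

# Example: extract_many(["Acme Corp", "Globex"])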

By following this tutorial, you should now be equipped to build your own data extraction pipelines, tailored to your specific needs. Happy coding!

For discussion on similar work:

Email: kshitijkutumbe@gmail.com
