Automate Data Extraction Like a Pro: Step-by-Step Guide to Building an AI Pipeline
Introduction
In this blog, we’ll guide you through creating a data extraction pipeline that leverages LangChain, SERPAPI, and OpenAI. The pipeline will automatically search for data on the web, process the retrieved text using a language model, and return the relevant data in a structured format. As a working example, we’ll take the case of revenue extraction, which is crucial for many business use cases such as competitor analysis, sales insights, and lead generation.
Prerequisites
Before diving into the code, make sure you have the following prerequisites:
- Python 3.8 or later
- API keys for OpenAI and SERPAPI
- Installed Python packages: google-search-results, langchain, langchain-community, langchain-openai, kor, and python-dotenv
You can install the necessary packages using pip (note that the GoogleSearch client used below ships in the google-search-results package, not the newer serpapi package):
pip install google-search-results langchain langchain-community langchain-openai kor python-dotenv
Step 1: Setting Up Environment Variables
To keep your API keys secure, we’ll use the python-dotenv package to load environment variables from a .env file.
.env File
Create a .env file in the root directory of your project and add the following:
OPENAI_API_KEY=your_openai_api_key
SERPAPI_API_KEY=your_serpapi_api_key
Loading Environment Variables in Python
In your Python script, load the environment variables using the following code:
from dotenv import load_dotenv
import os

# load_dotenv() reads the .env file and populates os.environ,
# so the keys are available via os.getenv() afterwards
load_dotenv()
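If a key is missing from the .env file, the pipeline will otherwise only fail at the first API call. As a defensive measure, you can fail fast at startup instead. A minimal sketch (the require_env helper is hypothetical, not part of python-dotenv):

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

Calling require_env("OPENAI_API_KEY") right after load_dotenv() turns a silent misconfiguration into an immediate, descriptive error.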
Step 2: Initializing the Language Model (LLM)
We’ll be using OpenAI’s gpt-3.5-turbo model through the langchain_openai library. The model is configured to process text and extract the relevant revenue information.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,  # deterministic output for reproducible extraction
    max_tokens=2000,
    model_kwargs={"frequency_penalty": 0, "presence_penalty": 0, "top_p": 1.0}
)
Step 3: Defining the Extraction Schema
Using the Kor library, we define a schema that the language model will use to extract specific information (in this case, revenue data) from the text.
from kor import create_extraction_chain, Object, Text

_schema = Object(
    id="revenue_info",
    description="Extract Revenue amount of the company from text",
    attributes=[
        Text(
            id="extracted_revenue",
            description=(
                "Extract revenue amount of the company from the text, "
                "remember only extract revenue amount, do not extract funding amount"
            ),
        ),
        Text(
            id="source_domain",
            description="Extract domain of the url",
        ),
    ],
    many=False,
)
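Under this schema, the chain parses the model’s JSON reply into a nested dictionary keyed by the schema id (the nesting below matches the access pattern used later in the Revenue class). A small validator can catch malformed responses early; this is an illustrative sketch, not part of Kor:

```python
def valid_revenue_payload(payload):
    """Check that a parsed chain output contains a usable revenue_info record."""
    try:
        record = payload['text']['data']['revenue_info']
    except (KeyError, TypeError):
        return False
    return isinstance(record, dict) and 'extracted_revenue' in record

# A well-formed payload, shaped like the chain output consumed below
sample = {'text': {'data': {'revenue_info': {
    'extracted_revenue': '$1.2B', 'source_domain': 'example.com'}}}}
```

valid_revenue_payload(sample) returns True, while an empty dict or a response missing revenue_info returns False.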
Step 4: Creating the Revenue Extraction Class
We’ll create a Revenue class that uses the defined schema and the language model to process text and extract the revenue information.
class Revenue:
    def __init__(self, llm):
        self.llm = llm

    def analyze(self, text):
        # Default output: empty fields with the failure flag set
        out_dict = {
            'extracted_revenue': '',
            'source_domain': '',
            'flag_revenue': 1
        }
        if not text:
            return out_dict
        chain = create_extraction_chain(self.llm, _schema, encoder_or_encoder_class='json')
        out = chain.invoke(input=text)
        try:
            result = out['text']['data']['revenue_info']
            out_dict['extracted_revenue'] = result.get('extracted_revenue', '')
            out_dict['source_domain'] = result.get('source_domain', '')
            out_dict['flag_revenue'] = 0
            out_dict = self._flag_and_wrap(out_dict)
        except (KeyError, TypeError):
            # The model returned no parseable revenue_info; keep the defaults
            pass
        return out_dict

    def _flag_and_wrap(self, out_dict):
        # Normalize "NA"-style placeholders to empty strings, then re-flag
        for key in out_dict:
            if out_dict[key] in ["NA", "Na", "na"]:
                out_dict[key] = ""
        if not out_dict.get('extracted_revenue'):
            out_dict['flag_revenue'] = 1
        return out_dict
Explanation:
- analyze(): This method takes the scraped text as input, processes it through the language model, and returns the extracted revenue information.
- _flag_and_wrap(): This helper method ensures the extracted data is clean and flags any missing information.
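The cleanup step can be illustrated in isolation. This standalone sketch mirrors the behavior of _flag_and_wrap without requiring an LLM call:

```python
def clean_extraction(out_dict):
    # Treat "NA"-style placeholder values as empty
    for key in out_dict:
        if out_dict[key] in ["NA", "Na", "na"]:
            out_dict[key] = ""
    # An empty revenue field means the extraction failed
    if not out_dict.get('extracted_revenue'):
        out_dict['flag_revenue'] = 1
    return out_dict

# A placeholder revenue is cleared and the failure flag is raised
result = clean_extraction(
    {'extracted_revenue': 'NA', 'source_domain': 'example.com', 'flag_revenue': 0}
)
```

Here result ends up as {'extracted_revenue': '', 'source_domain': 'example.com', 'flag_revenue': 1}: the "NA" placeholder is emptied, so the flag flips back to 1 even though the chain call itself succeeded.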
Step 5: Scraping Company Revenue Data with SERPAPI
Now, let’s use SERPAPI to gather information about a company’s revenue from Google search results.
Setting Up SERPAPI
First, install the Python client for SERPAPI (the GoogleSearch class is provided by the google-search-results package):
pip install google-search-results
Then, use the following code to perform a Google search:
from serpapi import GoogleSearch

def search_company_revenue(company_name):
    params = {
        "engine": "google",
        "q": f"{company_name} company revenue",
        "api_key": os.getenv("SERPAPI_API_KEY"),
        "num": 5
    }
    search = GoogleSearch(params)
    results = search.get_dict()
    # Return an empty list rather than raising if no organic results came back
    return results.get("organic_results", [])

# Example usage
comp_name = input('Enter company name: ')
search_results = search_company_revenue(comp_name)
description = search_results[0]['snippet']
displayedUrl = search_results[0]['link']
Explanation:
- search_company_revenue(): This function sends a query to Google using SERPAPI and returns the top 5 organic search results.
- description: We extract the first result’s snippet (a brief description) and URL to pass it to our revenue extraction pipeline.
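Indexing search_results[0] directly will raise if the search returned nothing or a result lacks a snippet. A defensive variant (first_snippet is a hypothetical helper, not part of the SERPAPI client):

```python
def first_snippet(organic_results):
    """Return (snippet, link) from the first organic result, or empty strings."""
    if not organic_results:
        return "", ""
    first = organic_results[0]
    return first.get("snippet", ""), first.get("link", "")

# Results shaped like SERPAPI's organic_results entries
sample_results = [{"snippet": "Acme Corp reported revenue of $1.2B in 2023.",
                   "link": "https://example.com/acme-revenue"}]
```

With this helper, an empty result set yields empty strings, which the Revenue class already treats as a failed extraction via its flag.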
Step 6: Running the Revenue Extraction
Finally, we’ll pass the scraped data to our Revenue class and extract the revenue information.
revenue_info = Revenue(llm).analyze(f'url:{displayedUrl} {description}')
print(revenue_info)
Explanation:
- revenue_info: This variable stores the extracted revenue information as a dictionary. It includes the revenue amount, source domain, and a flag indicating whether the extraction was successful.
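Since the source URL is already known before the LLM call, the source domain does not strictly need to be asked of the model; it can be derived deterministically with the standard library and used as a cross-check on the extracted source_domain field. A sketch:

```python
from urllib.parse import urlparse

def domain_of(url):
    """Extract the network location (domain) from a URL."""
    return urlparse(url).netloc

domain = domain_of("https://example.com/acme-revenue")  # "example.com"
```

Comparing domain against revenue_info['source_domain'] gives a cheap sanity check that the model attributed the figure to the right source.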
Conclusion
In this blog, we explored how to build an automated revenue extraction pipeline using LangChain, SERPAPI, and OpenAI. By using these tools, we can automate the process of gathering and extracting revenue-related information from the web, making it a valuable asset for business intelligence and data analysis.
Future Work
- Improved Data Handling: Implement better error handling and data validation to improve the robustness of the pipeline.
- Expand Extraction Capabilities: Customize the schema and extraction logic to extract other financial or business-related information.
- Scalability: Optimize the pipeline for large-scale extraction tasks and multiple simultaneous queries.
By following this tutorial, you should now be equipped to build your own data extraction pipelines, tailored to your specific needs. Happy coding!
For discussion on similar work:
kshitijkutumbe@gmail.com