Generate Customer Purchase Data With Python For Recommendations
Hey guys! Let's dive into how we can generate some realistic synthetic data for customer purchases using Python. This is super important because it's going to power our co-occurrence-based recommendation engine. We’ll be simulating invoice data that mirrors real-world scenarios, which will help us build a killer recommendation system. Think of it as creating a digital twin of customer transactions to fuel our machine learning models. So, buckle up, and let’s get started!
Why Generate Synthetic Data?
Before we jump into the nitty-gritty, let's quickly chat about why we’re doing this. Generating synthetic data is crucial for a couple of big reasons:
- Privacy: Real customer data is sensitive stuff. By generating synthetic data, we can develop and test our systems without exposing any actual customer information. It’s all about keeping things secure and compliant.
- Scalability: Sometimes, we just don’t have enough real data to build and test our pipeline. Synthetic data lets us scale the dataset up to whatever size we need to exercise the code, keeping in mind that it can only reflect the patterns we program into it.
- Flexibility: With synthetic data, we can control the types of scenarios and edge cases we want to test. This means we can fine-tune our recommendation engine to handle all sorts of situations.
The Invoice Data Structure
Okay, so what exactly are we going to generate? Each invoice entry will have these key fields:
- invoice_id: A unique identifier for each invoice. Think of it as the receipt number.
- date: The date the invoice was created. This helps us track trends over time.
- customer_id: A unique identifier for each customer. This is how we link purchases to specific users.
- customer_contact_info: Details like email and phone number. Useful for future communications and analysis.
- billing_address: Where the customer is located. This can help with regional analysis and targeted promotions.
- salesperson: Who made the sale? This can help us evaluate sales performance.
- item_description: A description of the item purchased (e.g., “Wooden Door”).
- part_number: A unique identifier for the part. This is crucial for tracking specific items.
- quantity: How many of the item were purchased.
- unit_price: The price of a single item.
- total_amount: The total cost for that item (quantity * unit_price).
Each invoice can contain multiple items, just like a real-life shopping cart. For example, a customer might buy a door, a knob, and some hinges all in one go. Plus, some parts have variants, like different styles of doors (Door 456, Door 567), adding another layer of complexity and realism to our data.
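To make the structure concrete, here is a minimal sketch of one invoice with three line items. The InvoiceLine dataclass and every sample value are illustrative assumptions; only the field names come from the list above.

```python
import datetime as dt
from dataclasses import dataclass


@dataclass
class InvoiceLine:
    """One row of invoice data (hypothetical helper, not part of the generator)."""
    invoice_id: str
    date: dt.date
    customer_id: int
    customer_contact_info: str
    billing_address: str
    salesperson: str
    item_description: str
    part_number: str
    quantity: int
    unit_price: float

    @property
    def total_amount(self) -> float:
        # Derived field: quantity * unit_price, rounded to cents
        return round(self.quantity * self.unit_price, 2)


# Invoice-level fields are repeated on every line of the same invoice
header = dict(
    invoice_id="INV-0001",
    date=dt.date(2024, 3, 15),
    customer_id=1234,
    customer_contact_info="jane@example.com",
    billing_address="1 Main St, Springfield",
    salesperson="Alex Smith",
)

lines = [
    InvoiceLine(**header, item_description="Door 456", part_number="P-DOOR-456", quantity=1, unit_price=189.99),
    InvoiceLine(**header, item_description="Knob A", part_number="P-KNOB-A", quantity=2, unit_price=24.50),
    InvoiceLine(**header, item_description="Hinge 1", part_number="P-HINGE-1", quantity=3, unit_price=7.25),
]

invoice_total = round(sum(line.total_amount for line in lines), 2)
print(invoice_total)
```

Storing one row per line item, with the invoice-level fields repeated, is what lets us group by invoice_id later to see which items were bought together.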
Setting Up the Python Environment
Alright, let's get our hands dirty with some code! First things first, we need to make sure we have the right tools installed. Fire up your terminal and let’s install the necessary Python libraries. We’ll be using Faker to generate realistic-looking data and pandas to handle our data efficiently.
pip install Faker pandas
Faker is a fantastic library for generating all sorts of fake data, from names and addresses to dates and product descriptions. pandas will help us organize our data into a nice, clean table.
Generating Basic Invoice Data
Now, let’s start with the basics. We’ll generate some simple invoice data and then build on that. Here’s a Python snippet to get us going:
import pandas as pd
from faker import Faker
import random
fake = Faker()
def generate_invoice_data(num_invoices=100):
    data = []
    for i in range(num_invoices):
        # Invoice-level fields, shared by every line item on this invoice
        invoice_id = fake.uuid4()
        date = fake.date_between(start_date='-1y', end_date='today')
        customer_id = fake.random_int(min=1000, max=9999)
        customer_contact_info = fake.email()
        billing_address = fake.address()
        salesperson = fake.name()
        num_items = random.randint(1, 5)  # Each invoice can have 1-5 items
        for _ in range(num_items):
            # Line-item fields
            item_description = fake.sentence(nb_words=4)
            part_number = fake.ean13()
            quantity = random.randint(1, 10)
            unit_price = round(random.uniform(10, 200), 2)
            # Round to cents to avoid floating-point noise such as 59.970000000000006
            total_amount = round(quantity * unit_price, 2)
            data.append({
                'invoice_id': invoice_id,
                'date': date,
                'customer_id': customer_id,
                'customer_contact_info': customer_contact_info,
                'billing_address': billing_address,
                'salesperson': salesperson,
                'item_description': item_description,
                'part_number': part_number,
                'quantity': quantity,
                'unit_price': unit_price,
                'total_amount': total_amount
            })
    return pd.DataFrame(data)
invoice_data = generate_invoice_data(num_invoices=1000)
print(invoice_data.head())
In this code:
- We import pandas, Faker, and random.
- We initialize Faker to generate fake data.
- The generate_invoice_data function creates a specified number of invoices.
- For each invoice, it generates basic details like ID, date, customer info, and salesperson.
- It then adds 1 to 5 items per invoice, generating item descriptions, part numbers, quantities, unit prices, and total amounts.
- Finally, it returns a pandas DataFrame containing all the generated data.
This gives us a good starting point, but we need to make the data a bit more realistic.
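A quick sanity check on the generated table can catch mistakes early. The check_invoice_frame helper below is a hypothetical sketch, not part of the original generator: it verifies the 1 to 5 items-per-invoice range, the total_amount arithmetic, and that invoice-level fields stay constant within an invoice. The tiny demo frame is made up, with one deliberately wrong total.

```python
import pandas as pd


def check_invoice_frame(df, max_items=5):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []

    # Every invoice should have between 1 and max_items line items
    items_per_invoice = df.groupby("invoice_id").size()
    if not items_per_invoice.between(1, max_items).all():
        problems.append("unexpected number of line items on an invoice")

    # total_amount should equal quantity * unit_price to within half a cent
    expected_total = (df["quantity"] * df["unit_price"]).round(2)
    if not ((df["total_amount"] - expected_total).abs() < 0.005).all():
        problems.append("total_amount does not match quantity * unit_price")

    # Invoice-level fields must not vary between lines of the same invoice
    header_cols = ["date", "customer_id", "salesperson"]
    if (df.groupby("invoice_id")[header_cols].nunique() > 1).any().any():
        problems.append("invoice-level fields differ within an invoice")

    return problems


demo = pd.DataFrame([
    {"invoice_id": "A", "date": "2024-01-05", "customer_id": 1001, "salesperson": "Kim",
     "quantity": 2, "unit_price": 10.50, "total_amount": 21.00},
    {"invoice_id": "A", "date": "2024-01-05", "customer_id": 1001, "salesperson": "Kim",
     "quantity": 1, "unit_price": 99.99, "total_amount": 99.99},
    {"invoice_id": "B", "date": "2024-02-10", "customer_id": 1002, "salesperson": "Lee",
     "quantity": 3, "unit_price": 5.00, "total_amount": 14.00},  # deliberately wrong
])

print(check_invoice_frame(demo))
```

The same function can be called on the DataFrame returned by generate_invoice_data, where it should return an empty list.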
Adding Complexity: Part Variants and Co-Purchased Items
To make our data truly useful for building a recommendation engine, we need to simulate the relationships between items. This means adding part variants and ensuring that compatible items are often co-purchased. Let's tweak our code to include these features.
import pandas as pd
from faker import Faker
import random
fake = Faker()
# Define some item categories and parts
item_categories = {
    'Door': ['Door 456', 'Door 567', 'Door 789'],
    'Knob': ['Knob A', 'Knob B', 'Knob C'],
    'Hinge': ['Hinge 1', 'Hinge 2'],
    'Lock': ['Lock X', 'Lock Y']
}
# Give each variant one stable part number, so the same item always maps to the same part
part_numbers = {
    item: fake.ean13()
    for variants in item_categories.values()
    for item in variants
}
# Define co-purchase probabilities
co_purchase_probs = {
    'Door': {'Hinge': 0.8, 'Knob': 0.7, 'Lock': 0.6},
    'Knob': {'Door': 0.7},
    'Hinge': {'Door': 0.8},
    'Lock': {'Door': 0.6}
}
def generate_invoice_data_enhanced(num_invoices=1000):
    data = []
    for i in range(num_invoices):
        # Invoice-level fields, shared by every line item on this invoice
        invoice_id = fake.uuid4()
        date = fake.date_between(start_date='-1y', end_date='today')
        customer_id = fake.random_int(min=1000, max=9999)
        customer_contact_info = fake.email()
        billing_address = fake.address()
        salesperson = fake.name()
        num_items = random.randint(1, 5)
        purchased_items = []  # Keep track of purchased item categories
        for _ in range(num_items):
            category = random.choice(list(item_categories.keys()))
            item_description = random.choice(item_categories[category])
            part_number = part_numbers[item_description]
            quantity = random.randint(1, 3)
            unit_price = round(random.uniform(50, 300), 2)
            total_amount = round(quantity * unit_price, 2)  # Round to cents
            data.append({
                'invoice_id': invoice_id,
                'date': date,
                'customer_id': customer_id,
                'customer_contact_info': customer_contact_info,
                'billing_address': billing_address,
                'salesperson': salesperson,
                'item_description': item_description,
                'part_number': part_number,
                'quantity': quantity,
                'unit_price': unit_price,
                'total_amount': total_amount
            })
            purchased_items.append(category)
        # Add co-purchased items (one draw per purchased line and related category)
        for purchased_category in purchased_items:
            if purchased_category in co_purchase_probs:
                for co_purchase_category, probability in co_purchase_probs[purchased_category].items():
                    if random.random() < probability:
                        co_purchase_item = random.choice(item_categories[co_purchase_category])
                        co_purchase_part_number = part_numbers[co_purchase_item]
                        co_purchase_quantity = random.randint(1, 2)
                        co_purchase_unit_price = round(random.uniform(20, 150), 2)
                        co_purchase_total_amount = round(co_purchase_quantity * co_purchase_unit_price, 2)
                        data.append({
                            'invoice_id': invoice_id,
                            'date': date,
                            'customer_id': customer_id,
                            'customer_contact_info': customer_contact_info,
                            'billing_address': billing_address,
                            'salesperson': salesperson,
                            'item_description': co_purchase_item,
                            'part_number': co_purchase_part_number,
                            'quantity': co_purchase_quantity,
                            'unit_price': co_purchase_unit_price,
                            'total_amount': co_purchase_total_amount
                        })
    return pd.DataFrame(data)
invoice_data_enhanced = generate_invoice_data_enhanced(num_invoices=1000)
print(invoice_data_enhanced.head())
Here’s what we’ve added:
- item_categories: A dictionary defining item categories (Door, Knob, Hinge, Lock) and their variants.
- co_purchase_probs: A dictionary defining the probabilities of items being co-purchased. For example, a Door has an 80% chance of being purchased with a Hinge.
- The enhanced function now randomly selects item categories and their variants.
- It keeps track of purchased item categories and adds co-purchased items based on the defined probabilities. This makes our data more realistic and useful for recommendation engines.
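To confirm that the co-purchase logic leaves a visible signal, we can measure how often two categories share an invoice. The co_purchase_rate helper below is a hypothetical sketch that matches categories by the prefix of item_description, and the demo rows are made up. On generated data, the observed Door-to-Hinge rate need not equal 0.8: Hinges are also drawn as primary items, and an invoice with several Doors gets several chances to add one.

```python
import pandas as pd


def co_purchase_rate(df, anchor_prefix, other_prefix):
    """Share of invoices with an anchor_prefix item that also contain an other_prefix item."""
    descriptions = df["item_description"]
    # One boolean per invoice: does the invoice contain an item from each category?
    has_anchor = descriptions.str.startswith(anchor_prefix).groupby(df["invoice_id"]).any()
    has_other = descriptions.str.startswith(other_prefix).groupby(df["invoice_id"]).any()
    if has_anchor.sum() == 0:
        return float("nan")
    return float((has_anchor & has_other).sum() / has_anchor.sum())


demo = pd.DataFrame({
    "invoice_id": ["A", "A", "B", "B", "C", "C", "C", "D"],
    "item_description": [
        "Door 456", "Hinge 1",            # invoice A
        "Door 567", "Knob A",             # invoice B
        "Door 789", "Hinge 2", "Lock X",  # invoice C
        "Knob B",                         # invoice D
    ],
})

print(co_purchase_rate(demo, "Door", "Hinge"))  # 2 of the 3 Door invoices include a Hinge
print(co_purchase_rate(demo, "Door", "Knob"))   # 1 of the 3 Door invoices includes a Knob
```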
Saving the Data to SQLite
Now that we’ve generated our realistic invoice data, let’s save it to an SQLite database. This will make it easy to query and use in our recommendation engine. First, we need to install the SQLAlchemy library, which is a powerful tool for interacting with databases in Python.
pip install SQLAlchemy
Once that’s installed, we can use the following code to save our data:
from sqlalchemy import create_engine
# Create an SQLite engine
engine = create_engine('sqlite:///customer_purchases.db')
# Save the DataFrame to the database
invoice_data_enhanced.to_sql('invoices', engine, if_exists='replace', index=False)
This code does the following:
- Imports create_engine from SQLAlchemy.
- Creates an SQLite engine that connects to a database file named customer_purchases.db.
- Saves the invoice_data_enhanced DataFrame to a table named invoices in the database. The if_exists='replace' argument tells pandas to replace the table if it already exists, and index=False prevents the DataFrame index from being written to the database.
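It’s worth reading the table back once to confirm the write worked. The sketch below is a self-contained round trip on a made-up three-row frame. As an assumption for illustration, it uses Python’s built-in sqlite3 module with an in-memory database instead of the SQLAlchemy engine, so it never touches customer_purchases.db. The same to_sql and read_sql_query calls should work with the engine as well.

```python
import sqlite3
from contextlib import closing

import pandas as pd

sample = pd.DataFrame({
    "invoice_id": ["A", "A", "B"],
    "item_description": ["Door 456", "Hinge 1", "Knob A"],
    "quantity": [1, 2, 1],
    "unit_price": [189.99, 7.25, 24.50],
})

# ":memory:" creates a throwaway database that disappears when the connection closes
with closing(sqlite3.connect(":memory:")) as conn:
    sample.to_sql("invoices", conn, if_exists="replace", index=False)

    # Count rows with plain SQL, then load the table back into pandas
    row_count = conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
    restored = pd.read_sql_query(
        "SELECT * FROM invoices ORDER BY invoice_id, item_description", conn
    )

print(row_count)
print(restored)
```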
Using the Data for Offline Similarity Computation
With our data safely stored in an SQLite database, we can now use it to compute item-to-item similarities offline. This is a crucial step for building our co-occurrence-based recommendation engine. We’ll be using these similarities to suggest items that customers might like based on their past purchases.
Here’s a basic outline of how we can do this:
- Connect to the Database: First, we need to connect to our SQLite database using SQLAlchemy.
- Query the Data: We’ll write SQL queries to retrieve the data we need, such as item purchases per invoice.
- Build a Co-occurrence Matrix: We’ll create a matrix that shows how often items are purchased together. Each cell in the matrix will represent the number of times two items have been purchased in the same invoice.
- Compute Similarity Scores: We’ll use a similarity metric (like cosine similarity) to calculate how similar items are based on their co-occurrence patterns.
- Store Similarity Scores: We’ll store these similarity scores, so we can quickly retrieve them when making recommendations.
Let's look at some code snippets to illustrate these steps.
1. Connect to the Database
We already did this in the previous section, but here’s a reminder:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///customer_purchases.db')
2. Query the Data
We’ll use pandas to execute SQL queries and load the results into a DataFrame:
import pandas as pd
# Query item purchases per invoice
query = """
SELECT invoice_id, item_description
FROM invoices
"""
invoice_items = pd.read_sql_query(query, engine)
print(invoice_items.head())
3. Build a Co-occurrence Matrix
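We’ll use pandas and scikit-learn (install it with pip install scikit-learn) to build the matrix. Strictly speaking, MultiLabelBinarizer gives us an invoice-by-item indicator matrix: one row per invoice, one column per item, with a 1 wherever the item appears on that invoice. The item-to-item co-occurrence counts can be derived from it, and the cosine similarity in the next step works on its columns directly:
from sklearn.preprocessing import MultiLabelBinarizer
# Group items by invoice
invoice_items_grouped = invoice_items.groupby('invoice_id')['item_description'].apply(list)
# Use MultiLabelBinarizer to create the invoice-by-item indicator matrix
mlb = MultiLabelBinarizer()
co_occurrence_matrix = mlb.fit_transform(invoice_items_grouped)
# Keep invoice_id as the row index so each row can be traced back to its invoice
co_occurrence_df = pd.DataFrame(co_occurrence_matrix, index=invoice_items_grouped.index, columns=mlb.classes_)
print(co_occurrence_df.head())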
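If we want the actual item-by-item counts described in the outline (how many invoices contain both items), one way is to multiply the indicator matrix by its own transpose. This is a minimal sketch on made-up data; it uses pd.crosstab instead of MultiLabelBinarizer only to stay dependency-free, and the names indicator and co_counts are assumptions.

```python
import pandas as pd

invoice_items = pd.DataFrame({
    "invoice_id": ["A", "A", "B", "B", "C", "C", "C"],
    "item_description": [
        "Door 456", "Hinge 1",            # invoice A
        "Door 456", "Knob A",             # invoice B
        "Door 456", "Hinge 1", "Knob A",  # invoice C
    ],
})

# Invoice-by-item indicator: 1 if the item appears on the invoice, else 0
indicator = (
    pd.crosstab(invoice_items["invoice_id"], invoice_items["item_description"]) > 0
).astype(int)

# Item-by-item counts: entry (i, j) is the number of invoices containing both i and j.
# The diagonal holds the number of invoices containing each item.
co_counts = indicator.T.dot(indicator)

print(co_counts)
```

Clipping the crosstab to 0/1 matters because the generator can put the same item on an invoice more than once, and we want to count invoices, not rows.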
4. Compute Similarity Scores
We’ll use cosine similarity to compute the similarity between items:
from sklearn.metrics.pairwise import cosine_similarity
# Compute cosine similarity
similarity_matrix = cosine_similarity(co_occurrence_df.T)
similarity_df = pd.DataFrame(similarity_matrix, index=co_occurrence_df.columns, columns=co_occurrence_df.columns)
print(similarity_df.head())
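For 0/1 indicator columns, cosine similarity has a simple reading in terms of counts: the number of invoices containing both items, divided by the square root of the product of the number of invoices containing each one. The sketch below works this out by hand with NumPy on a made-up matrix, as a way to check intuition rather than a replacement for cosine_similarity.

```python
import numpy as np

# Rows are invoices, columns are items; the values are hypothetical
items = ["Door 456", "Hinge 1", "Knob A"]
X = np.array([
    [1, 1, 0],  # invoice A: Door + Hinge
    [1, 0, 1],  # invoice B: Door + Knob
    [1, 1, 1],  # invoice C: Door + Hinge + Knob
])

counts = X.T @ X                      # counts[i, j] = invoices containing both i and j
norms = np.sqrt(np.diag(counts))      # for 0/1 columns, the norm is sqrt(invoice count)
cosine = counts / np.outer(norms, norms)

for i, name in enumerate(items):
    print(name, np.round(cosine[i], 3))
```

Here Hinge 1 and Knob A share one invoice and each appears on two, so their similarity is 1 / sqrt(2 * 2) = 0.5. The division assumes every item appears on at least one invoice, which holds when the item list is taken from the purchase data itself.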
5. Store Similarity Scores
We can store the similarity scores in a dictionary or a DataFrame for easy access:
# Example: Store similarity scores in a dictionary
similarity_scores = {}
for item1 in similarity_df.index:
    similarity_scores[item1] = similarity_df[item1].sort_values(ascending=False).drop(item1)
# Print top 5 similar items for 'Door 456'
print(similarity_scores['Door 456'].head(5))
This is a simplified example, but it gives you the basic idea of how to compute item-to-item similarities using our generated data.
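If the scores need to outlive the Python session, one option is to write them to SQLite in long format, one row per item pair. This is only a sketch: the item_similarity table name, the column names, and the similarity values are all assumptions, and it uses an in-memory sqlite3 connection to stay self-contained.

```python
import sqlite3
from contextlib import closing

import pandas as pd

items = ["Door 456", "Hinge 1", "Knob A"]
# Hypothetical similarity values, standing in for the computed similarity_df
similarity_df = pd.DataFrame(
    [[1.00, 0.82, 0.67],
     [0.82, 1.00, 0.50],
     [0.67, 0.50, 1.00]],
    index=items, columns=items,
)

# Reshape to long format: one row per ordered (item, similar_item) pair
pairs = (
    similarity_df.rename_axis(index="item", columns="similar_item")
    .stack()
    .rename("score")
    .reset_index()
)
pairs = pairs[pairs["item"] != pairs["similar_item"]]  # drop self-similarity rows

with closing(sqlite3.connect(":memory:")) as conn:
    pairs.to_sql("item_similarity", conn, if_exists="replace", index=False)

    # Look up the most similar items for one product with a parameterized query
    top = pd.read_sql_query(
        "SELECT similar_item, score FROM item_similarity "
        "WHERE item = ? ORDER BY score DESC LIMIT 2",
        conn,
        params=("Door 456",),
    )

print(top)
```

A long table like this can be queried per item without loading the whole matrix, which becomes more useful as the catalog grows.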
Next Steps: Building the Recommendation Engine
So, what’s next? Now that we have our synthetic data and a way to compute item similarities, we can start building our recommendation engine. Here’s a quick overview of the steps involved:
- Implement a Recommendation Function: We’ll create a function that takes a customer’s purchase history as input and returns a list of recommended items based on the similarity scores we computed earlier.
- Evaluate the Engine: We’ll need to evaluate how well our recommendation engine is performing. We can use metrics like precision and recall to measure the accuracy of our recommendations.
- Integrate with the UI: Finally, we’ll integrate our recommendation engine with the user interface, so customers can see personalized recommendations while they shop.
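To give a feel for the first two steps, here is a rough sketch of a recommendation function and a precision/recall check. Both recommend and precision_recall_at_k are hypothetical names, the scoring rule (summing similarities across purchased items) is just one simple choice, and the similarity values are made up.

```python
import pandas as pd


def recommend(purchased, similarity_df, top_n=3):
    """Rank items the customer has not bought by summed similarity to what they have bought."""
    # Deduplicate the history and ignore items missing from the similarity table
    known = [item for item in dict.fromkeys(purchased) if item in similarity_df.index]
    if not known:
        return []
    scores = similarity_df.loc[known].sum(axis=0)
    scores = scores.drop(labels=known, errors="ignore")  # never recommend owned items
    return scores.sort_values(ascending=False).head(top_n).index.tolist()


def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommendations against a set of relevant items."""
    hits = len(set(recommended[:k]) & set(relevant))
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


items = ["Door 456", "Hinge 1", "Knob A", "Lock X"]
similarity_df = pd.DataFrame(
    [[1.0, 0.8, 0.7, 0.6],
     [0.8, 1.0, 0.3, 0.2],
     [0.7, 0.3, 1.0, 0.4],
     [0.6, 0.2, 0.4, 1.0]],
    index=items, columns=items,
)

print(recommend(["Door 456"], similarity_df, top_n=2))
print(recommend(["Hinge 1", "Knob A"], similarity_df, top_n=2))
print(precision_recall_at_k(["Hinge 1", "Knob A"], {"Hinge 1", "Lock X"}, k=2))
```

For evaluation, a common setup is to hide part of each invoice, recommend from the remaining items, and treat the hidden items as the relevant set.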
This whole process is a journey, guys, but generating realistic synthetic data is a huge first step. It allows us to iterate and improve our recommendation engine without worrying about privacy issues or data scarcity.
In conclusion, we've successfully walked through generating synthetic customer purchase data using Python, focusing on realistic scenarios like part variants and co-purchased items. We've also stored this data in an SQLite database and explored how to compute item-to-item similarities, laying the groundwork for our co-occurrence-based recommendation engine. Keep experimenting and refining your approach, and you'll be well on your way to building a powerful recommendation system!