Generate Customer Purchase Data With Python For Recommendations

Sep 15, 2025 by Square

Hey guys! Let's dive into how we can generate some realistic synthetic data for customer purchases using Python. This is super important because it's going to power our co-occurrence-based recommendation engine. We’ll be simulating invoice data that mirrors real-world scenarios, which will help us build a killer recommendation system. Think of it as creating a digital twin of customer transactions to fuel our machine learning models. So, buckle up, and let’s get started!

Why Generate Synthetic Data?

Before we jump into the nitty-gritty, let's quickly chat about why we’re doing this. Generating synthetic data is crucial for a few big reasons:

Privacy: Real customer data is sensitive stuff. By generating synthetic data, we can develop and test our systems without exposing any actual customer information. It’s all about keeping things secure and compliant.
Scalability: Sometimes, we just don’t have enough real data to train our models effectively. Synthetic data lets us scale up our dataset to get better results. The more data, the merrier, right?
Flexibility: With synthetic data, we can control the types of scenarios and edge cases we want to test. This means we can fine-tune our recommendation engine to handle all sorts of situations.

The Invoice Data Structure

Okay, so what exactly are we going to generate? Each invoice entry will have these key fields:

invoice_id: A unique identifier for each invoice. Think of it as the receipt number.
date: The date the invoice was created. This helps us track trends over time.
customer_id: A unique identifier for each customer. This is how we link purchases to specific users.
customer_contact_info: Contact details such as an email address or phone number (the code in this post generates an email only). Useful for future communications and analysis.
billing_address: Where the customer is located. This can help with regional analysis and targeted promotions.
salesperson: Who made the sale? This can help us evaluate sales performance.
item_description: A description of the item purchased (e.g., “Wooden Door”).
part_number: A unique identifier for the part. This is crucial for tracking specific items.
quantity: How many of the item were purchased.
unit_price: The price of a single item.
total_amount: The total cost for that item (quantity * unit_price).

Each invoice can contain multiple items, just like a real-life shopping cart. For example, a customer might buy a door, a knob, and some hinges all in one go. Plus, some parts have variants, like different styles of doors (Door 456, Door 567), adding another layer of complexity and realism to our data.
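
To make the row layout concrete, here is a minimal sketch of one invoice with three line items, using only the standard library. The field names follow the list above, but the LineItem class, the sample values, and the part numbers are made up for illustration, and a few header fields (contact info, billing address, salesperson) are left out to keep it short.

```python
from dataclasses import dataclass, asdict


@dataclass
class LineItem:
    """One row of the dataset: the invoice header repeated for each item."""
    invoice_id: str
    date: str
    customer_id: int
    item_description: str
    part_number: str
    quantity: int
    unit_price: float

    @property
    def total_amount(self) -> float:
        # Round to cents so float arithmetic does not leak into the data
        return round(self.quantity * self.unit_price, 2)


# Hypothetical invoice: one door, one knob, and three hinges bought together
header = {"invoice_id": "INV-0001", "date": "2025-09-15", "customer_id": 1001}
rows = [
    LineItem(**header, item_description="Door 456", part_number="P-456",
             quantity=1, unit_price=189.99),
    LineItem(**header, item_description="Knob A", part_number="P-KA",
             quantity=1, unit_price=24.50),
    LineItem(**header, item_description="Hinge 1", part_number="P-H1",
             quantity=3, unit_price=6.25),
]

for row in rows:
    # asdict() skips properties, so total_amount is added explicitly
    print({**asdict(row), "total_amount": row.total_amount})

invoice_total = round(sum(r.total_amount for r in rows), 2)
print("Invoice total:", invoice_total)
```

All three rows share the same invoice_id, which is what lets us group items into baskets later on.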

Setting Up the Python Environment

Alright, let's get our hands dirty with some code! First things first, we need to make sure we have the right tools installed. Fire up your terminal and let’s install the necessary Python libraries. We’ll be using Faker to generate realistic-looking data and pandas to handle our data efficiently.

pip install Faker pandas

Faker is a fantastic library for generating all sorts of fake data, from names and addresses to dates and product descriptions. pandas will help us organize our data into a nice, clean table.

Generating Basic Invoice Data

Now, let’s start with the basics. We’ll generate some simple invoice data and then build on that. Here’s a Python snippet to get us going:

import pandas as pd
from faker import Faker
import random

fake = Faker()

def generate_invoice_data(num_invoices=100):
    data = []
    for i in range(num_invoices):
        invoice_id = fake.uuid4()
        date = fake.date_between(start_date='-1y', end_date='today')
        customer_id = fake.random_int(min=1000, max=9999)
        customer_contact_info = fake.email()
        billing_address = fake.address()
        salesperson = fake.name()

        num_items = random.randint(1, 5)  # Each invoice can have 1-5 items
        for _ in range(num_items):
            item_description = fake.sentence(nb_words=4)
            part_number = fake.ean13()
            quantity = random.randint(1, 10)
            unit_price = round(random.uniform(10, 200), 2)
            total_amount = round(quantity * unit_price, 2)  # round to cents to avoid float artifacts

            data.append({
                'invoice_id': invoice_id,
                'date': date,
                'customer_id': customer_id,
                'customer_contact_info': customer_contact_info,
                'billing_address': billing_address,
                'salesperson': salesperson,
                'item_description': item_description,
                'part_number': part_number,
                'quantity': quantity,
                'unit_price': unit_price,
                'total_amount': total_amount
            })

    return pd.DataFrame(data)

invoice_data = generate_invoice_data(num_invoices=1000)
print(invoice_data.head())

In this code:

We import pandas, Faker, and random.
We initialize Faker to generate fake data.
The generate_invoice_data function creates a specified number of invoices.
For each invoice, it generates basic details like ID, date, customer info, and salesperson.
It then adds 1 to 5 items per invoice, generating item descriptions, part numbers, quantities, unit prices, and total amounts.
Finally, it returns a pandas DataFrame containing all the generated data.

This gives us a good starting point, but we need to make the data a bit more realistic.
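
One realism gap in the snippet above: customer_id is drawn fresh for every invoice and the contact details are regenerated each time, so the same ID can end up with several different emails. A possible fix is to build a fixed pool of customers first and sample from it. This is only a sketch. The build_customer_pool and assign_customers helpers, the pool size, and the placeholder email and address formats are all assumptions (swap in fake.email() and fake.address() if you like).

```python
import random


def build_customer_pool(num_customers=50, seed=0):
    """Create a fixed set of customers so repeat purchases share one identity."""
    rng = random.Random(seed)
    ids = rng.sample(range(1000, 10000), num_customers)  # unique IDs
    return [
        {
            "customer_id": cid,
            # Placeholder contact details, decided once per customer
            "customer_contact_info": f"customer{cid}@example.com",
            "billing_address": f"{rng.randint(1, 999)} Example St",
        }
        for cid in ids
    ]


def assign_customers(num_invoices, pool, seed=0):
    """Pick one customer per invoice; a customer can appear many times."""
    rng = random.Random(seed)
    return [rng.choice(pool) for _ in range(num_invoices)]


pool = build_customer_pool()
invoice_customers = assign_customers(200, pool)

# Check: every invoice for a given customer_id carries the same email
emails_by_id = {}
for customer in invoice_customers:
    emails_by_id.setdefault(customer["customer_id"], set()).add(
        customer["customer_contact_info"]
    )
consistent = all(len(emails) == 1 for emails in emails_by_id.values())
print("Consistent contact info:", consistent)
```

With 200 invoices spread over 50 customers, most customers show up more than once, which gives us the repeat-purchase history a recommender needs.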

Adding Complexity: Part Variants and Co-Purchased Items

To make our data truly useful for building a recommendation engine, we need to simulate the relationships between items. This means adding part variants and ensuring that compatible items are often co-purchased. Let's tweak our code to include these features.

import pandas as pd
from faker import Faker
import random

fake = Faker()

# Define some item categories and parts
item_categories = {
    'Door': ['Door 456', 'Door 567', 'Door 789'],
    'Knob': ['Knob A', 'Knob B', 'Knob C'],
    'Hinge': ['Hinge 1', 'Hinge 2'],
    'Lock': ['Lock X', 'Lock Y']
}

# Define co-purchase probabilities
co_purchase_probs = {
    'Door': {'Hinge': 0.8, 'Knob': 0.7, 'Lock': 0.6},
    'Knob': {'Door': 0.7},
    'Hinge': {'Door': 0.8},
    'Lock': {'Door': 0.6}
}


def generate_invoice_data_enhanced(num_invoices=1000):
    data = []
    for i in range(num_invoices):
        invoice_id = fake.uuid4()
        date = fake.date_between(start_date='-1y', end_date='today')
        customer_id = fake.random_int(min=1000, max=9999)
        customer_contact_info = fake.email()
        billing_address = fake.address()
        salesperson = fake.name()

        num_items = random.randint(1, 5)
        purchased_items = []  # Keep track of purchased item categories
        for _ in range(num_items):
            category = random.choice(list(item_categories.keys()))
            item_description = random.choice(item_categories[category])
            part_number = fake.ean13()
            quantity = random.randint(1, 3)
            unit_price = round(random.uniform(50, 300), 2)
            total_amount = round(quantity * unit_price, 2)  # round to cents to avoid float artifacts

            data.append({
                'invoice_id': invoice_id,
                'date': date,
                'customer_id': customer_id,
                'customer_contact_info': customer_contact_info,
                'billing_address': billing_address,
                'salesperson': salesperson,
                'item_description': item_description,
                'part_number': part_number,
                'quantity': quantity,
                'unit_price': unit_price,
                'total_amount': total_amount
            })
            purchased_items.append(category)

        # Add co-purchased items once per invoice, after the item loop, so each
        # purchased category is rolled a single time against co_purchase_probs.
        for purchased_category in dict.fromkeys(purchased_items):  # de-duplicate, keep order
            for co_purchase_category, probability in co_purchase_probs.get(purchased_category, {}).items():
                if random.random() < probability:
                    co_purchase_item = random.choice(item_categories[co_purchase_category])
                    co_purchase_part_number = fake.ean13()
                    co_purchase_quantity = random.randint(1, 2)
                    co_purchase_unit_price = round(random.uniform(20, 150), 2)
                    co_purchase_total_amount = round(co_purchase_quantity * co_purchase_unit_price, 2)

                    data.append({
                        'invoice_id': invoice_id,
                        'date': date,
                        'customer_id': customer_id,
                        'customer_contact_info': customer_contact_info,
                        'billing_address': billing_address,
                        'salesperson': salesperson,
                        'item_description': co_purchase_item,
                        'part_number': co_purchase_part_number,
                        'quantity': co_purchase_quantity,
                        'unit_price': co_purchase_unit_price,
                        'total_amount': co_purchase_total_amount
                    })

    return pd.DataFrame(data)

invoice_data_enhanced = generate_invoice_data_enhanced(num_invoices=1000)
print(invoice_data_enhanced.head())

Here’s what we’ve added:

item_categories: A dictionary defining item categories (Door, Knob, Hinge, Lock) and their variants.
co_purchase_probs: A dictionary defining the probabilities of items being co-purchased. For example, a Door has an 80% chance of being purchased with a Hinge.
The enhanced function now randomly selects item categories and their variants.
It keeps track of purchased item categories and adds co-purchased items based on the defined probabilities. This makes our data more realistic and useful for recommendation engines.
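
One caveat: the function above calls fake.ean13() and draws a random price for every row, so the same item (say, Door 456) gets a different part_number and unit_price each time it appears. If you want part numbers that behave like real identifiers, one option is to build a small catalog once and look items up in it. This is a sketch under assumptions: build_catalog, make_line, the part-number format, and the price ranges are all invented for illustration.

```python
import random

item_categories = {
    "Door": ["Door 456", "Door 567", "Door 789"],
    "Knob": ["Knob A", "Knob B", "Knob C"],
    "Hinge": ["Hinge 1", "Hinge 2"],
    "Lock": ["Lock X", "Lock Y"],
}

# Assumed price range per category (illustrative, not based on real data)
price_ranges = {"Door": (150, 300), "Knob": (10, 40), "Hinge": (3, 15), "Lock": (25, 90)}


def build_catalog(categories, prices, seed=0):
    """Give each variant one part number and one unit price, decided once."""
    rng = random.Random(seed)
    catalog = {}
    for category, variants in categories.items():
        low, high = prices[category]
        for n, variant in enumerate(variants, start=1):
            catalog[variant] = {
                "category": category,
                "part_number": f"{category[:2].upper()}-{n:03d}",  # made-up format
                "unit_price": round(rng.uniform(low, high), 2),
            }
    return catalog


catalog = build_catalog(item_categories, price_ranges)


def make_line(item_description, quantity):
    """Build the item part of an invoice row from the catalog."""
    part = catalog[item_description]
    return {
        "item_description": item_description,
        "part_number": part["part_number"],
        "quantity": quantity,
        "unit_price": part["unit_price"],
        "total_amount": round(quantity * part["unit_price"], 2),
    }


print(make_line("Door 456", 1))
print(make_line("Door 456", 2))  # same part_number and unit_price as the line above
```

Inside the generator you would then replace the fake.ean13() and random.uniform() calls with a catalog lookup like make_line.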

Saving the Data to SQLite

Now that we’ve generated our realistic invoice data, let’s save it to an SQLite database. This will make it easy to query and use in our recommendation engine. First, we need to install the SQLAlchemy library, which is a powerful tool for interacting with databases in Python.

pip install SQLAlchemy

Once that’s installed, we can use the following code to save our data:

from sqlalchemy import create_engine

# Create an SQLite engine
engine = create_engine('sqlite:///customer_purchases.db')

# Save the DataFrame to the database
invoice_data_enhanced.to_sql('invoices', engine, if_exists='replace', index=False)

This code does the following:

Imports create_engine from SQLAlchemy.
Creates an SQLite engine that connects to a database file named customer_purchases.db.
Saves the invoice_data_enhanced DataFrame to a table named invoices in the database. The if_exists='replace' argument tells pandas to replace the table if it already exists, and index=False prevents the DataFrame index from being written to the database.
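
To check that a write like this worked, you can read the table back and aggregate it. The sketch below runs on its own, so it uses a tiny hand-made DataFrame and an in-memory SQLite database through the standard-library sqlite3 module (pandas accepts a sqlite3 connection directly). For the real file, you would pass the engine from above instead of conn.

```python
import sqlite3

import pandas as pd

# Tiny stand-in for invoice_data_enhanced (values are made up)
sample = pd.DataFrame({
    "invoice_id": ["inv-1", "inv-1", "inv-2"],
    "item_description": ["Door 456", "Hinge 1", "Knob A"],
    "quantity": [1, 2, 1],
    "unit_price": [200.0, 5.5, 20.0],
    "total_amount": [200.0, 11.0, 20.0],
})

conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
sample.to_sql("invoices", conn, if_exists="replace", index=False)

# Count line items and sum the totals per invoice
summary = pd.read_sql_query(
    """
    SELECT invoice_id,
           COUNT(*)          AS line_items,
           SUM(total_amount) AS invoice_total
    FROM invoices
    GROUP BY invoice_id
    ORDER BY invoice_id
    """,
    conn,
)
print(summary)
conn.close()
```

If the row counts and totals match what you see in the DataFrame, the table is ready to query.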

Using the Data for Offline Similarity Computation

With our data safely stored in an SQLite database, we can now use it to compute item-to-item similarities offline. This is a crucial step for building our co-occurrence-based recommendation engine. We’ll be using these similarities to suggest items that customers might like based on their past purchases.

Here’s a basic outline of how we can do this:

Connect to the Database: First, we need to connect to our SQLite database using SQLAlchemy.
Query the Data: We’ll write SQL queries to retrieve the data we need, such as item purchases per invoice.
Build a Co-occurrence Matrix: We’ll create a matrix that shows how often items are purchased together. Each cell in the matrix will represent the number of times two items have been purchased in the same invoice.
Compute Similarity Scores: We’ll use a similarity metric (like cosine similarity) to calculate how similar items are based on their co-occurrence patterns.
Store Similarity Scores: We’ll store these similarity scores, so we can quickly retrieve them when making recommendations.

Let's look at some code snippets to illustrate these steps.

1. Connect to the Database

We already did this in the previous section, but here’s a reminder:

from sqlalchemy import create_engine

engine = create_engine('sqlite:///customer_purchases.db')

2. Query the Data

We’ll use pandas to execute SQL queries and load the results into a DataFrame:

import pandas as pd

# Query item purchases per invoice
query = """
SELECT invoice_id, item_description
FROM invoices
"""
invoice_items = pd.read_sql_query(query, engine)
print(invoice_items.head())

3. Build a Co-occurrence Matrix

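We’ll use pandas and scikit-learn (pip install scikit-learn if you don’t have it yet) to turn the invoices into a binary invoice-by-item matrix: one row per invoice, one column per item, and a 1 wherever the item appears on that invoice. The co-occurrence information lives in the item columns of this matrix, which the similarity step compares: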

from sklearn.preprocessing import MultiLabelBinarizer

# Group items by invoice
invoice_items_grouped = invoice_items.groupby('invoice_id')['item_description'].apply(list)

# One-hot encode each invoice's item list: rows are invoices, columns are items
mlb = MultiLabelBinarizer()
co_occurrence_matrix = mlb.fit_transform(invoice_items_grouped)
co_occurrence_df = pd.DataFrame(co_occurrence_matrix, index=invoice_items_grouped.index, columns=mlb.classes_)

print(co_occurrence_df.head())
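
The matrix above has one row per invoice. If you want the actual pair counts (how many invoices contain both item A and item B), you can count them directly. Here is a small standard-library sketch on made-up baskets. The baskets dictionary and its contents are purely illustrative.

```python
from collections import Counter
from itertools import combinations

# Toy baskets: invoice_id -> items on that invoice (made-up data)
baskets = {
    "inv-1": ["Door 456", "Hinge 1", "Knob A"],
    "inv-2": ["Door 456", "Hinge 1"],
    "inv-3": ["Door 567", "Knob A"],
    "inv-4": ["Hinge 1", "Hinge 1", "Door 456"],  # duplicate rows count once
}

pair_counts = Counter()
for items in baskets.values():
    unique_items = sorted(set(items))  # de-duplicate and fix the pair order
    for a, b in combinations(unique_items, 2):
        pair_counts[(a, b)] += 1

for pair, count in pair_counts.most_common():
    print(pair, count)
```

With the DataFrame version, co_occurrence_df.T.dot(co_occurrence_df) gives the same counts as an item-by-item table, with each item's invoice count on the diagonal.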

4. Compute Similarity Scores

We’ll use cosine similarity to compute the similarity between items:

from sklearn.metrics.pairwise import cosine_similarity

# Transpose so each row is one item's vector across invoices, then compare items pairwise
similarity_matrix = cosine_similarity(co_occurrence_df.T)
similarity_df = pd.DataFrame(similarity_matrix, index=co_occurrence_df.columns, columns=co_occurrence_df.columns)

print(similarity_df.head())
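
If cosine similarity feels like a black box, it helps to compute it by hand once. For binary purchase vectors it boils down to: invoices containing both items, divided by the square root of (invoices with item A times invoices with item B). The vectors below are made up, and the cosine helper is just a plain-Python illustration of the same calculation.

```python
import math

# Binary purchase vectors over five invoices (made-up data): 1 = item on invoice
vectors = {
    "Door 456": [1, 1, 0, 1, 0],
    "Hinge 1":  [1, 1, 0, 1, 1],
    "Knob A":   [1, 0, 1, 0, 0],
}


def cosine(u, v):
    """dot(u, v) / (|u| * |v|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


door_hinge = cosine(vectors["Door 456"], vectors["Hinge 1"])
door_knob = cosine(vectors["Door 456"], vectors["Knob A"])
print(round(door_hinge, 3), round(door_knob, 3))  # prints: 0.866 0.408
```

The door and the hinge share three invoices, so they score much higher than the door and the knob, which share only one.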

5. Store Similarity Scores

We can store the similarity scores in a dictionary or a DataFrame for easy access:

# Example: Store similarity scores in a dictionary
similarity_scores = {}
for item1 in similarity_df.index:
    similarity_scores[item1] = similarity_df[item1].sort_values(ascending=False).drop(item1)

# Print top 5 similar items for 'Door 456'
print(similarity_scores['Door 456'].head(5))

This is a simplified example, but it gives you the basic idea of how to compute item-to-item similarities using our generated data.

Next Steps: Building the Recommendation Engine

So, what’s next? Now that we have our synthetic data and a way to compute item similarities, we can start building our recommendation engine. Here’s a quick overview of the steps involved:

Implement a Recommendation Function: We’ll create a function that takes a customer’s purchase history as input and returns a list of recommended items based on the similarity scores we computed earlier.
Evaluate the Engine: We’ll need to evaluate how well our recommendation engine is performing. One way is to hold out part of each invoice, recommend from the rest, and use metrics like precision and recall to measure how often the held-out items show up in our recommendations.
Integrate with the UI: Finally, we’ll integrate our recommendation engine with the user interface, so customers can see personalized recommendations while they shop.
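
To give a feel for the first two steps, here is a rough sketch of a recommendation function and a precision/recall check. It is not the final engine: the recommend and precision_recall helpers, the scoring rule (summing similarities over the purchase history), and all of the numbers are assumptions for illustration.

```python
def recommend(purchased, similarity, top_n=3):
    """Score unseen items by summing their similarity to each purchased item."""
    scores = {}
    for item in purchased:
        for other, sim in similarity.get(item, {}).items():
            if other not in purchased:
                scores[other] = scores.get(other, 0.0) + sim
    # Highest score first; ties are broken alphabetically for stable output
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


def precision_recall(recommended, relevant):
    """Precision = hits / recommended, recall = hits / relevant."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up scores, shaped like similarity_scores: item -> {other item: score}
similarity = {
    "Door 456": {"Hinge 1": 0.87, "Knob A": 0.41, "Lock X": 0.30},
    "Knob A": {"Door 456": 0.41, "Lock X": 0.50, "Hinge 1": 0.20},
}

history = ["Door 456", "Knob A"]   # what the customer already bought
held_out = ["Hinge 1"]             # what they actually bought next (made up)

recs = recommend(history, similarity, top_n=2)
precision, recall = precision_recall(recs, held_out)
print(recs, precision, recall)
```

Because a pandas Series also has an .items() method, the similarity_scores dictionary from step 5 should plug into recommend as-is, though that pairing is not tested here.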

This whole process is a journey, guys, but generating realistic synthetic data is a huge first step. It allows us to iterate and improve our recommendation engine without worrying about privacy issues or data scarcity.

In conclusion, we've successfully walked through generating synthetic customer purchase data using Python, focusing on realistic scenarios like part variants and co-purchased items. We've also stored this data in an SQLite database and explored how to compute item-to-item similarities, laying the groundwork for our co-occurrence-based recommendation engine. Keep experimenting and refining your approach, and you'll be well on your way to building a powerful recommendation system!