E-Commerce Profitability Analytics — Amazon India Dataset

Role: Independent project · Period: Spring 2025 – Oct 2025

View on GitHub →

Overview

An end-to-end e-commerce analytics and machine learning project on the Amazon India transaction dataset, carried out in R and R Markdown, with Google Sheets used lightly for manual category refinement. The deliverable was an R Markdown report that pairs narrative interpretation with multi-panel visualizations: the kind of artifact a business analytics team would actually circulate.

What I did

  • Cleaned and merged messy multi-source CSV files representing 1.2M+ transactions, standardized inconsistent labels, and validated dataset quality.
  • Engineered features including profit margin %, discount ratios, and category-level aggregates that turned raw transaction rows into business-meaningful signals.
  • Performed comprehensive EDA using tidyverse and ggplot2 to identify revenue trends, high-performing product groups, and pricing anomalies.
  • Built multiple predictive and segmentation models:
    • k-means clustering for customer/product segmentation
    • Linear regression for baseline profitability drivers
    • Tree-based models (CART, Random Forest, Gradient Boosting) for nonlinear interactions
  • Used the clustering and regression results to inform pricing strategy, producing recommendations on price bands and on which high-margin categories to prioritize.
  • Delivered a polished R Markdown report combining narrative, code, and visuals: a document a stakeholder can read, not just a notebook.
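
To make the feature engineering, segmentation, and baseline regression steps concrete, here is a minimal sketch on toy data. It is an illustration, not the project's actual code: the column names (category, list_price, sale_price, unit_cost, quantity) and every value are invented, and it sticks to base R so it runs without extra packages, whereas the project itself used tidyverse.

```r
# Toy transactions. Column names and values are hypothetical,
# not the real schema of the Amazon India dataset.
tx <- data.frame(
  category   = c("Electronics", "Electronics", "Apparel", "Apparel", "Home", "Home"),
  list_price = c(1000, 800, 500, 400, 300, 250),
  sale_price = c(900, 800, 350, 300, 270, 250),
  unit_cost  = c(700, 650, 200, 180, 200, 190),
  quantity   = c(2, 1, 3, 4, 1, 2),
  stringsAsFactors = FALSE
)

# Row-level features: revenue, profit, profit margin %, discount ratio.
tx$revenue        <- tx$sale_price * tx$quantity
tx$profit         <- (tx$sale_price - tx$unit_cost) * tx$quantity
tx$margin_pct     <- 100 * tx$profit / tx$revenue
tx$discount_ratio <- (tx$list_price - tx$sale_price) / tx$list_price

# Category-level aggregates: total revenue and profit, then margin and
# average discount per category.
by_cat <- aggregate(cbind(revenue, profit) ~ category, data = tx, FUN = sum)
by_cat$margin_pct   <- 100 * by_cat$profit / by_cat$revenue
by_cat$avg_discount <- as.numeric(
  tapply(tx$discount_ratio, tx$category, mean)[by_cat$category]
)

# k-means segmentation on standardized features. Two clusters is an
# arbitrary choice for this toy example.
set.seed(42)
features <- scale(tx[, c("margin_pct", "discount_ratio")])
km <- kmeans(features, centers = 2, nstart = 10)
tx$segment <- km$cluster

# Baseline linear regression: how margin varies with discount depth.
fit <- lm(margin_pct ~ discount_ratio, data = tx)

print(by_cat)
print(table(tx$segment))
print(coef(fit))
```

Standardizing with scale() before k-means keeps margin percentages (tens of points) from swamping discount ratios (fractions between 0 and 1) in the distance calculation. On real data, the number of clusters would need to be chosen deliberately, for example with an elbow plot, rather than fixed at two.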

Why it matters

E-commerce decision-making is dominated by pricing, product mix, and promotion choices. This project shows how a single analyst can take messy multi-source data and produce both the predictive models and the human-readable report a category manager would actually use.

Tech stack

R · R Markdown · tidyverse · ggplot2 · Google Sheets · k-means · linear regression · CART · Random Forest · Gradient Boosting