Complete project: https://github.com/syedimranmurtaza/Data-Science-Models.
Overview
This project focuses on building a predictive model to classify individuals as heavy gamers based on personal and behavioral data. The goal is to identify patterns that indicate whether someone spends 3 or more hours daily playing online games.
Why It Matters
Understanding gamer behavior can help improve game design, marketing strategies, and tools for healthier digital habits. For instance, identifying high-engagement user profiles can assist in designing features that balance fun with time management and awareness of potential overuse.
Dataset
The dataset includes data from 118 online gamers in Pakistan, covering demographic and gaming-related features such as:
- Age, gender, education, income, occupation
- Daily play time, frequency, game difficulty
- Motivations like stress relief, achievement, or social interaction
After cleaning and preprocessing, 12 meaningful features were selected for model training. These steps included handling missing values, encoding categorical data, converting time units, and standardizing textual entries.
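The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names, the sample values, the `to_hours` helper, and the choice of mode/median imputation are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw survey rows; column names and values are illustrative only.
raw = pd.DataFrame({
    "gender": [" Male", "female ", "MALE", None],
    "daily_play_time": ["2 hours", "90 minutes", "4 hours", "30 minutes"],
    "age": [21, None, 25, 19],
})


def to_hours(text: str) -> float:
    """Convert strings like '90 minutes' or '2 hours' to a number of hours."""
    value, unit = text.split()
    hours = float(value)
    if unit.lower().startswith("min"):
        hours /= 60
    return hours


clean = raw.copy()

# Standardize textual entries: trim whitespace and unify letter case.
clean["gender"] = clean["gender"].str.strip().str.lower()

# Handle missing values (assumed strategy: mode for text, median for numbers).
clean["gender"] = clean["gender"].fillna(clean["gender"].mode()[0])
clean["age"] = clean["age"].fillna(clean["age"].median())

# Convert mixed time units into a single numeric column measured in hours.
clean["daily_hours"] = clean["daily_play_time"].apply(to_hours)

print(clean[["gender", "age", "daily_hours"]])
```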
Feature Engineering
Key steps in preparing the dataset included:
- Target Variable Creation: A binary column, heavy_gamer, was added, marking users who play ≥3 hours daily as 1.
- Categorical Encoding: Text fields like gender, education, and occupation were transformed into numerical values.
- Motivation Columns: Multi-label responses (e.g., “Stress Relief, Achievement”) were separated into binary columns for more precise analysis.
- Clean-up: Duplicate or unnecessary columns were removed for a refined dataset.
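The feature-engineering steps above might look like this in pandas. Only the `heavy_gamer` column name and the 3-hour threshold come from the write-up. The other column names, the sample rows, the `motiv_` prefix, and the use of integer category codes (rather than, say, one-hot encoding) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical cleaned rows; only heavy_gamer and the 3-hour rule come from the text.
df = pd.DataFrame({
    "daily_hours": [1.0, 3.0, 4.5, 2.5],
    "gender": ["male", "female", "male", "female"],
    "motivation": [
        "Stress Relief, Achievement",
        "Social Interaction",
        "Achievement",
        "Stress Relief",
    ],
})

# Target variable: 1 when the respondent plays 3 or more hours per day, else 0.
df["heavy_gamer"] = (df["daily_hours"] >= 3).astype(int)

# Categorical encoding (assumed scheme): integer codes in alphabetical order.
df["gender_code"] = df["gender"].astype("category").cat.codes

# Multi-label motivations: split on ", " into one binary column per label.
motivations = df["motivation"].str.get_dummies(sep=", ")
motivations.columns = [
    "motiv_" + name.lower().replace(" ", "_") for name in motivations.columns
]

# Clean-up: drop the original text columns now that encoded versions exist.
prepared = pd.concat(
    [df.drop(columns=["gender", "motivation"]), motivations], axis=1
)
print(prepared)
```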
Modeling and Results
Model 1: Logistic Regression
A classic binary classification model was used first. After splitting the data (80% training / 20% testing) and applying feature scaling, the model achieved:
- Accuracy: 91.67%
- AUC Score: 0.84
- Performance Metrics: Evaluated using ROC curve and confusion matrix
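A workflow of this shape (80/20 split, feature scaling, logistic regression, then accuracy, AUC, and a confusion matrix) can be sketched with scikit-learn as below. The data here is synthetic, generated with `make_classification` to match the 118-row, 12-feature shape, so the printed numbers will not reproduce the figures reported above. The random seeds, stratified split, and `max_iter` value are also assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the survey data: 118 rows, 12 features (assumption).
X, y = make_classification(n_samples=118, n_features=12, random_state=0)

# 80% training / 20% testing, stratified so both classes appear in each part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The pipeline fits the scaler on the training data only, then the classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

accuracy = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)
cm = confusion_matrix(y_test, y_pred)

print(f"Test size: {len(y_test)}")
print(f"Accuracy:  {accuracy:.4f}")
print(f"AUC:       {auc:.4f}")
print("Confusion matrix:")
print(cm)
```

Wrapping the scaler and the classifier in one pipeline keeps the test rows out of the scaling statistics, which matters more than usual with so few samples.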
Model 2: Decision Tree Classifier
This model delivered perfect classification results on the test data, achieving:
- Accuracy: 100%
- AUC Score: 1.00
- Performance Metrics: The ROC curve reached the ideal top-left corner, and the confusion matrix showed zero errors
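For reference, a decision tree evaluated the same way might be set up as follows. Again the data is a synthetic stand-in, and the default, unpruned `DecisionTreeClassifier` is an assumption, since the write-up does not state the tree's hyperparameters. Printing training accuracy and tree depth next to the test metrics is an addition here, meant to make memorization of the training set easier to spot.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey data: 118 rows, 12 features (assumption).
X, y = make_classification(n_samples=118, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Default settings: the tree grows until every training leaf is pure.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

train_accuracy = accuracy_score(y_train, tree.predict(X_train))
test_pred = tree.predict(X_test)
test_accuracy = accuracy_score(y_test, test_pred)
test_auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
cm = confusion_matrix(y_test, test_pred)

print(f"Tree depth:     {tree.get_depth()}")
print(f"Train accuracy: {train_accuracy:.4f}")
print(f"Test accuracy:  {test_accuracy:.4f}")
print(f"Test AUC:       {test_auc:.4f}")
print("Confusion matrix:")
print(cm)
```

An unpruned tree fits its training data perfectly almost by construction, so the gap between training and test accuracy is the more informative number.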
Comparison
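While both models performed well on the held-out data, the Decision Tree classifier's perfect score may reflect overfitting: with only 118 records, a 20% test split contains roughly 24 examples, so each misclassification shifts accuracy by about 4 percentage points. Logistic Regression scored slightly lower, but as a simpler linear model it is more likely to generalize to new data.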
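One common way to check whether a result on such a small test set holds up is stratified k-fold cross-validation, which scores each model on several different splits. The write-up does not say this was done; the sketch below is only a suggestion, run on synthetic stand-in data with an assumed 5 folds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey data: 118 rows, 12 features (assumption).
X, y = make_classification(n_samples=118, n_features=12, random_state=0)

# Five stratified folds; every row is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

# One accuracy score per fold for each model.
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

A large spread across folds would suggest that a single 80/20 split is too noisy to rank the two models reliably.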
Bonus: LLM-Powered Data Analyst Bot
To enhance usability, an AI-powered data analysis bot was built using LangChain and OpenAI GPT-3.5. This feature allows users to interact with the dataset through natural language, with no coding needed.
Capabilities
- Ask questions like:
  - “Which gender has more heavy gamers?”
  - “Show average hours by education level”
- Generate visual insights using plain-language prompts:
  - “Plot heavy gamers by occupation”
  - “Bar chart of average playtime by age”
- Outputs include text insights and real-time plots using matplotlib and seaborn.
This smart assistant bridges the gap between technical data science and user-friendly data exploration, showing how LLMs can support more accessible analysis in practical scenarios.
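The bot's own code is not shown in this write-up, so the LangChain and GPT-3.5 wiring is left out here. The sketch below only illustrates the kind of pandas and matplotlib operations that prompts like the ones listed under Capabilities would need to resolve to. The DataFrame, its column names, and its values are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical prepared data; column names and values are illustrative only.
df = pd.DataFrame({
    "gender": ["male", "female", "male", "female", "male"],
    "education": ["Bachelor", "Master", "Bachelor", "Bachelor", "Master"],
    "occupation": ["student", "employed", "student", "student", "employed"],
    "daily_hours": [4.0, 1.5, 3.0, 2.0, 5.0],
})
df["heavy_gamer"] = (df["daily_hours"] >= 3).astype(int)

# "Which gender has more heavy gamers?"
heavy_by_gender = df.groupby("gender")["heavy_gamer"].sum()
top_gender = heavy_by_gender.idxmax()
print(f"Gender with more heavy gamers: {top_gender}")

# "Show average hours by education level"
avg_hours_by_education = df.groupby("education")["daily_hours"].mean()
print(avg_hours_by_education)

# "Plot heavy gamers by occupation"
heavy_by_occupation = df.groupby("occupation")["heavy_gamer"].sum()
fig, ax = plt.subplots()
heavy_by_occupation.plot(kind="bar", ax=ax)
ax.set_xlabel("Occupation")
ax.set_ylabel("Heavy gamers")
fig.tight_layout()
fig.savefig("heavy_gamers_by_occupation.png")
```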