End-to-End Data Science Case Study: Leveraging Microsoft Fabric and Python for Real-World Machine Learning Solutions

By Abayomi Abiodun – Microsoft MVP and Data Science Consultant, Onyx Data 

Are you exploring how to use Microsoft Fabric and Python to solve complex data science problems using machine learning algorithms? 

In this hands-on case study, I applied Microsoft Fabric’s Lakehouse and Python’s machine learning ecosystem to a real-world data analytics challenge drawn from the DataDNA Challenge. Each month, participants submitted feedback after completing a dataset challenge. I analysed these submissions and implemented both supervised and unsupervised machine learning algorithms to uncover actionable insights and develop intelligent models. 

Overview of the DataDNA Challenge 

Each month, participants filled out a form after completing a dataset challenge. My goal was to analyse these submissions and implement both supervised and unsupervised machine learning algorithms to answer business questions across six case studies: 

  1. Predicting Challenge Engagement 
  2. Participant Clustering for Persona Discovery 
  3. Sentiment & Theme Analysis of Feedback 
  4. Predicting Dataset Ratings 
  5. Recommendation Engine for Dataset Topics 
  6. Anomaly Detection in Submissions 


Data Collection and Loading 
Step 1: Upload data to OneLake 

Monthly Excel files were organized under Files/DataDNA Challenge in the Microsoft Fabric Lakehouse. 
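As a minimal sketch of how the monthly workbooks could be read in a Fabric notebook, the helper below globs a folder for .xlsx files and loads each one with pandas. It assumes the notebook has a default Lakehouse attached (Fabric mounts it under /lakehouse/default); the function name load_monthly_files is hypothetical and not taken from the original notebook.

```python
from pathlib import Path

import pandas as pd

# Assumption: a default Lakehouse is attached to the notebook, so its Files
# area is reachable under /lakehouse/default. The folder name is from the text.
LAKEHOUSE_DIR = Path("/lakehouse/default/Files/DataDNA Challenge")


def load_monthly_files(folder):
    """Read every .xlsx workbook in `folder` into a dict keyed by file stem."""
    frames = {}
    for path in sorted(Path(folder).glob("*.xlsx")):
        # pd.read_excel needs the openpyxl package to parse .xlsx files
        frames[path.stem] = pd.read_excel(path)
    return frames
```

Inside the notebook this would be called as load_monthly_files(LAKEHOUSE_DIR); an empty folder simply yields an empty dict.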


Step 2: Combine and Preprocess 

All datasets were cleaned, standardized, and concatenated into a single pandas DataFrame, df1. 
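The combine step might look roughly like the sketch below. The column names and the two toy frames are invented for illustration, and the cleaning shown (normalizing header text and dropping fully empty rows) is an assumption about what "cleaned and standardized" involved, not the original notebook code.

```python
import pandas as pd


def standardize_columns(df):
    """Return a copy with trimmed, lower-case, underscore-separated headers."""
    out = df.copy()
    out.columns = [str(c).strip().lower().replace(" ", "_") for c in out.columns]
    return out


def combine_months(frames):
    """Standardize each monthly frame, tag its month, and stack them."""
    parts = [
        standardize_columns(df).dropna(how="all").assign(challenge_month=month)
        for month, df in frames.items()
    ]
    return pd.concat(parts, ignore_index=True)


# Toy stand-ins for two monthly files with inconsistent headers (invented data)
frames = {
    "2024-01": pd.DataFrame({"Completion Time": [30, 45], " BI Tool ": ["Power BI", "Tableau"]}),
    "2024-02": pd.DataFrame({"completion time": [20], "BI tool": ["Excel"]}),
}
df1 = combine_months(frames)
```

Tagging each row with its source month keeps the origin of every submission visible after concatenation.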

1. Predicting Challenge Engagement

Objective: 

Predict whether a user submitted a valid LinkedIn post (as a proxy for engagement) using several supervised machine learning algorithms. 

Features: 

  • Completion time 
  • BI tool used 
  • Dataset rating 
  • Feedback presence 

Target: 

  • Binary indicator of a valid LinkedIn post 

Model Approach: 

Used several classifiers (Decision Tree, Random Forest, SVM, XGBoost) to compare accuracy. 
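A comparison of this kind can be sketched with scikit-learn pipelines and cross-validation. Everything below runs on synthetic data: the column names, the way the target is generated, and the 5-fold setup are assumptions for illustration. XGBoost is left out so the sketch depends on scikit-learn only; an XGBClassifier could be added to the models dict in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four features listed above (hypothetical names)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "completion_minutes": rng.uniform(10, 240, n),
    "bi_tool": rng.choice(["Power BI", "Tableau", "Excel"], n),
    "dataset_rating": rng.integers(1, 6, n),
    "has_feedback": rng.integers(0, 2, n),
})
# Synthetic binary target standing in for "valid LinkedIn post"
signal = demo["completion_minutes"] + 40 * demo["has_feedback"] + rng.normal(0, 30, n)
y = (signal > 140).astype(int)

# Scale numeric columns and one-hot encode the BI tool
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["completion_minutes", "dataset_rating", "has_feedback"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["bi_tool"]),
])

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(),
}

# Mean 5-fold cross-validated accuracy per classifier
scores = {
    name: cross_val_score(make_pipeline(preprocess, model), demo, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
```

Putting the preprocessing inside each pipeline means the scaler and encoder are refit on every training fold, so no information leaks from the validation fold.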

Result: The best models revealed that longer completion times, Power BI usage, and thoughtful feedback correlate with higher engagement. 

2. Participant Clustering for Persona Discovery Using Unsupervised Machine Learning

Objective: 

Uncover behavioural personas using clustering (unsupervised learning). 

Features: 

  • BI tool used (encoded) 
  • Dataset rating 
  • Feedback length 
  • Completion time 

Model Used: KMeans and DBSCAN  


DBSCAN produced better-defined clusters than KMeans on this data. 
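One way to put the two algorithms side by side is to fit both on the same scaled features and compare silhouette scores. The sketch below uses synthetic blobs in place of the four participant features, and the eps, min_samples, and n_clusters values are illustrative assumptions rather than the settings used in the case study.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four participant features (invented data)
X_raw, _ = make_blobs(n_samples=300, centers=3, n_features=4, cluster_std=1.0, random_state=42)
X = StandardScaler().fit_transform(X_raw)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)


def silhouette_without_noise(X, labels):
    """Silhouette score over clustered points only; DBSCAN marks noise as -1."""
    mask = labels != -1
    if len(set(labels[mask])) < 2:
        return float("nan")  # silhouette is undefined for fewer than 2 clusters
    return silhouette_score(X[mask], labels[mask])


kmeans_silhouette = silhouette_score(X, kmeans_labels)
dbscan_silhouette = silhouette_without_noise(X, dbscan_labels)
noise_share = float(np.mean(dbscan_labels == -1))
```

Because DBSCAN drops noise points before scoring, its silhouette is not directly comparable to the KMeans value; the share of points labelled as noise is worth reporting alongside it.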


Result: Identified distinct user groups: 

  • Power Users – Detailed, slow, high-rating 
  • Fast Explorers – Quick, minimal feedback 
  • Analysts – Moderate speed, in-depth feedback 

3. Sentiment & Theme Analysis of Feedback Using Machine Learning

Objective: 

Understand feedback sentiment and discover dominant themes using TextBlob for sentiment scoring and LDA (Latent Dirichlet Allocation), a topic-modelling technique. 

Tools: 

  • TextBlob for sentiment 
  • LDA for topic modelling 


Result: Positive feedback clustered around dataset quality, while negative feedback often pointed to tool compatibility and instructions.

4. Predicting Dataset Ratings Using Random Forest Regressor, a machine learning technique

Objective: 

Estimate how a user might rate a dataset based on their feedback and behaviour. 

Features: 

  • Text embeddings from feedback 
  • Completion time 
  • BI tool used 
  • LinkedIn post status 

Model Used: RandomForestRegressor 
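A rough sketch of the regression setup follows. To stay lightweight it swaps in TF-IDF vectors as a stand-in for the sentence embeddings named above, and every row of the demo table is invented. Fitting and predicting on the same rows is done only to show the API; a real evaluation would score held-out data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented rows with hypothetical column names
demo = pd.DataFrame({
    "feedback": [
        "Excellent dataset with rich detail, really enjoyed the analysis",
        "Too messy",
        "Good challenge, clear brief and interesting questions",
        "Confusing instructions",
        "Loved it, very realistic data and a fun story to tell",
        "Average, nothing special",
        "Great variety of fields, learned a lot building the report",
        "File would not load",
    ],
    "completion_minutes": [180, 25, 120, 30, 200, 60, 150, 15],
    "bi_tool": ["Power BI", "Excel", "Tableau", "Excel", "Power BI", "Tableau", "Power BI", "Excel"],
    "linkedin_post": [1, 0, 1, 0, 1, 0, 1, 0],
    "dataset_rating": [5, 2, 4, 2, 5, 3, 5, 1],
})

# Stand-in for text embeddings: TF-IDF vectors (NOT SentenceTransformer output)
text_features = TfidfVectorizer().fit_transform(demo["feedback"]).toarray()
tool_dummies = pd.get_dummies(demo["bi_tool"]).to_numpy(dtype=float)
numeric = demo[["completion_minutes", "linkedin_post"]].to_numpy(dtype=float)

X = np.hstack([text_features, tool_dummies, numeric])
y = demo["dataset_rating"].to_numpy()

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
predictions = model.predict(X)  # in-sample, for demonstration only
```

After fitting, model.feature_importances_ gives a first indication of which inputs drive the predicted rating.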


Result: High ratings were associated with long, positive feedback and verified LinkedIn posts.

5. Recommendation Engine for Dataset Topics Using Machine Learning

Objective: 

Recommend future topics for each participant based on their past feedback. 

Method: 

  • TF-IDF + cosine similarity for feedback comparison 
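The matching idea can be sketched as follows: represent each candidate topic and each participant's feedback as TF-IDF vectors, then recommend the topic with the highest cosine similarity. The topic names echo the themes mentioned in the result; the keyword descriptions, participant ids, and feedback texts are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical keyword descriptions for candidate topics
topics = {
    "health data": "health hospitals patients disease outcomes",
    "elections": "elections voting polls candidates turnout",
    "finance": "finance banking stocks markets budgets",
}
# Invented feedback, keyed by made-up participant ids
feedback = {
    "participant_a": "I would love a dataset about hospitals and patients",
    "participant_b": "More challenges on stocks and banking, please",
}

# Fit one shared vocabulary so both sets of vectors live in the same space
vectorizer = TfidfVectorizer(stop_words="english")
vectorizer.fit(list(topics.values()) + list(feedback.values()))
topic_matrix = vectorizer.transform(list(topics.values()))
feedback_matrix = vectorizer.transform(list(feedback.values()))

# Rows are participants, columns are topics
similarity = cosine_similarity(feedback_matrix, topic_matrix)

topic_names = list(topics)
recommendations = {
    name: topic_names[int(row.argmax())]
    for name, row in zip(feedback, similarity)
}
```

With real feedback a participant may match no topic at all (an all-zero row), so a minimum-similarity threshold before recommending would be a sensible addition.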

 


Result: Feedback-driven topic personalization helped identify demand for themes like health data, elections, and finance.

6. Anomaly Detection in Submissions Using Machine Learning Techniques

Objective:
Detect suspicious or bot-like entries to improve data quality. 

Features: 

  • Extremely short/long completion time 
  • Repeated email or LinkedIn profiles 
  • Empty feedback 

Model Used: Isolation Forest 
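The features above could be turned into model inputs roughly as in this sketch. The data is synthetic with a few planted suspicious rows, the column names are hypothetical, and contamination=0.1 is an assumption chosen to mirror a flag rate of roughly 10%, not a confirmed setting from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic submissions (invented data, hypothetical column names)
rng = np.random.default_rng(1)
n = 200
demo = pd.DataFrame({
    "email": [f"user{i}@example.com" for i in range(n)],
    "completion_minutes": rng.normal(90, 20, n).clip(5, None),
    "feedback": ["Useful challenge"] * n,
})
# Plant a few suspicious rows: near-instant completion with empty feedback,
# and one email address reused across several submissions
demo.loc[:4, "completion_minutes"] = 1.0
demo.loc[:4, "feedback"] = ""
demo.loc[5:9, "email"] = "repeat@example.com"

features = pd.DataFrame({
    "completion_minutes": demo["completion_minutes"],
    "repeated_email": demo["email"].duplicated(keep=False).astype(int),
    "empty_feedback": (demo["feedback"].str.strip() == "").astype(int),
})

# Assumption: contamination sets the expected share of anomalies
model = IsolationForest(contamination=0.1, random_state=0)
demo["is_anomaly"] = model.fit_predict(features) == -1  # -1 marks outliers

flagged_share = demo["is_anomaly"].mean()
```

Since contamination fixes the share of rows that get flagged, the flag rate reflects that parameter rather than being discovered by the model; the flagged rows are candidates for manual review.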


Result: Approximately 10% of records were flagged, enabling targeted data cleansing. 

 

Final Thoughts 

Using Microsoft Fabric’s Lakehouse and Notebooks with Python and machine learning, I created a powerful, end-to-end data science solution that: 

  • Predicted user behaviour 
  • Discovered personas 
  • Analysed sentiment 
  • Made smart recommendations 
  • Detected anomalies 

Tools Used (tool: purpose): 

  • Microsoft Fabric: Unified analytics, storage, and compute 
  • pandas, scikit-learn, xgboost, Logistic Regression, Decision Trees, Random Forest, Support Vector Machine, KNeighbors: Machine learning & preprocessing 
  • TextBlob, LDA: NLP & topic modelling 
  • SentenceTransformer: Feedback embeddings 
  • pyLDAvis: Topic interpretation 
  • Lakehouse: Unified data access