By Abayomi Abiodun – Microsoft MVP and Data Science Consultant, Onyx Data
Are you exploring how to use Microsoft Fabric and Python to solve complex data science problems using machine learning algorithms?
In this hands-on case study, I applied Microsoft Fabric’s Lakehouse and Python’s machine learning ecosystem to a real-world data analytics challenge drawn from the DataDNA Challenge. Each month, participants submitted feedback after completing a dataset challenge. I analysed these submissions and implemented both supervised and unsupervised machine learning algorithms to uncover actionable insights and develop intelligent models.
Overview of the DataDNA Challenge
Each month, participants filled out a form after completing a dataset challenge. My goal was to analyse these submissions and implement both supervised and unsupervised machine learning algorithms to answer business questions across six case studies:
- Predicting Challenge Engagement
- Participant Clustering for Persona Discovery
- Sentiment & Theme Analysis of Feedback
- Predicting Dataset Ratings
- Recommendation Engine for Dataset Topics
- Anomaly Detection in Submissions

Data Collection and Loading
Step 1: Upload data to OneLake
Monthly Excel files were organized under Files/DataDNA Challenge in the Microsoft Fabric Lakehouse.
Load data from Microsoft Fabric Lakehouse
Step 2: Combine and Preprocess
All datasets were cleaned, standardized, and concatenated into a single pandas DataFrame, df1.
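The two steps above could be sketched roughly as follows. This is a minimal illustration, not the notebook code used in the project: the mount path /lakehouse/default/Files/ assumes the Lakehouse is attached as the notebook's default, and the helper names load_monthly_files and combine_frames, the source_file column, and the demo column names are all hypothetical.

```python
from pathlib import Path

import pandas as pd

# Assumption: the Lakehouse is attached as the notebook's default Lakehouse,
# so its Files area is mounted under /lakehouse/default/Files/.
LAKEHOUSE_FOLDER = "/lakehouse/default/Files/DataDNA Challenge"


def load_monthly_files(folder):
    """Read every .xlsx file in `folder` into a dict of {file stem: DataFrame}."""
    frames = {}
    for path in sorted(Path(folder).glob("*.xlsx")):
        frames[path.stem] = pd.read_excel(path)
    return frames


def combine_frames(frames):
    """Standardize column names, tag rows with their source file, and concatenate."""
    cleaned = []
    for name, frame in frames.items():
        frame = frame.copy()
        # e.g. " Completion Time " -> "completion_time"
        frame.columns = [str(c).strip().lower().replace(" ", "_") for c in frame.columns]
        frame["source_file"] = name
        cleaned.append(frame)
    if not cleaned:
        return pd.DataFrame()
    return pd.concat(cleaned, ignore_index=True)


# In a Fabric notebook: df1 = combine_frames(load_monthly_files(LAKEHOUSE_FOLDER))
# Small in-memory stand-in (made-up values) so the sketch runs anywhere:
demo = {
    "january": pd.DataFrame({"Completion Time": [30, 45], "BI Tool": ["Power BI", "Tableau"]}),
    "february": pd.DataFrame({" completion time ": [20], "BI Tool": ["Excel"]}),
}
df1 = combine_frames(demo)
print(df1)
```

Normalizing the column names before concatenating matters because monthly form exports often differ in spacing or capitalization, which would otherwise produce duplicate, half-empty columns.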

1. Predicting Challenge Engagement
Objective:
Predict whether a user submitted a valid LinkedIn post (as a proxy for engagement) using several supervised machine learning algorithms.
Features:
- Completion time
- BI tool used
- Dataset rating
- Feedback presence
Target:
- Binary indicator of a valid LinkedIn post
Model Approach:
Used several classifiers (Decision Tree, Random Forest, SVM, XGBoost) to compare accuracy.
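A comparison like this could be set up along the following lines. This is only a sketch on synthetic data: the column names, the rule that generates the valid_post target, and the hyperparameters are assumptions made for illustration. XGBoost is left out to keep the example dependent on scikit-learn only; an xgboost.XGBClassifier could be added to the models dictionary in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the challenge data (all values are made up).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "completion_time": rng.integers(10, 180, n),
    "bi_tool": rng.choice(["Power BI", "Tableau", "Excel"], n),
    "dataset_rating": rng.integers(1, 6, n),
    "has_feedback": rng.integers(0, 2, n),
})
# Hypothetical target rule, used only so the models have something to learn.
data["valid_post"] = ((data["completion_time"] > 90) | (data["has_feedback"] == 1)).astype(int)

# One-hot encode the categorical BI tool column.
X = pd.get_dummies(data.drop(columns="valid_post"), columns=["bi_tool"], dtype=int)
y = data["valid_post"]

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # SVMs are sensitive to feature scale, so standardize first.
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

# Mean 5-fold cross-validated accuracy per model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Cross-validated accuracy is used here rather than a single train/test split so that the comparison is less sensitive to how a small dataset happens to be partitioned.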

Result: The best-performing models indicated that longer completion times, Power BI usage, and thoughtful feedback correlate with higher engagement.
2. Participant Clustering for Persona Discovery Using Unsupervised Machine Learning
Objective:
Uncover behavioural personas using clustering (unsupervised learning).
Features:
- BI tool used (encoded)
- Dataset rating
- Feedback length
- Completion time
Model Used: KMeans and DBSCAN


DBSCAN produced better clusters than KMeans on this data.
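One common way to compare two clustering algorithms is the silhouette score. The sketch below shows the idea on synthetic data; the article does not state which metric was used, so the metric, the eps and min_samples values, and the helper silhouette_or_none are all assumptions for illustration.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four participant features.
raw, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
# Both algorithms are distance-based, so put features on a common scale.
X = StandardScaler().fit_transform(raw)


def silhouette_or_none(X, labels):
    """Silhouette score over non-noise points; None if fewer than two clusters."""
    mask = labels != -1  # DBSCAN marks noise points with the label -1
    if len(set(labels[mask])) < 2:
        return None
    return silhouette_score(X[mask], labels[mask])


kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

kmeans_score = silhouette_or_none(X, kmeans_labels)
dbscan_score = silhouette_or_none(X, dbscan_labels)
print("KMeans silhouette:", kmeans_score)
print("DBSCAN silhouette:", dbscan_score)
```

Note that DBSCAN's score here ignores the points it labels as noise, so the two numbers are not measured on exactly the same set of participants; the share of noise points is worth reporting alongside the score.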


Result: Identified distinct user groups:
- Power Users – Detailed, slow, high-rating
- Fast Explorers – Quick, minimal feedback
- Analysts – Moderate speed, in-depth feedback
3. Sentiment & Theme Analysis of Feedback Using Machine Learning
Objective:
Understand feedback sentiment and discover dominant themes using TextBlob for sentiment scoring and LDA (Latent Dirichlet Allocation) for topic modelling.
Tools:
- TextBlob for sentiment
- LDA for topic modelling


Result: Positive feedback clustered around dataset quality, while negative feedback often pointed to tool compatibility and instructions.
4. Predicting Dataset Ratings Using a Random Forest Regressor
Objective:
Estimate how a user might rate a dataset based on their feedback and behaviour.
Features:
- Text embeddings from feedback
- Completion time
- BI tool used
- LinkedIn post status
Model Used: RandomForestRegressor
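A regression setup with these features could look something like the sketch below. Everything here is synthetic: the random vectors stand in for feedback embeddings (which in practice could come from a SentenceTransformer model's encode method), and the rule that generates the rating, the embedding size, and the hyperparameters are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n, dim = 300, 16

# Placeholder embeddings: random vectors standing in for real text embeddings.
embeddings = rng.normal(size=(n, dim))
completion_time = rng.integers(10, 180, size=n)
tool_code = rng.integers(0, 3, size=n)   # BI tool, label-encoded
posted = rng.integers(0, 2, size=n)      # 1 if a LinkedIn post was made

# Hypothetical rating rule (1 to 5), used only to give the model a signal.
rating = np.clip(np.round(3 + embeddings[:, 0] + 0.5 * posted), 1, 5)

# Stack embeddings and tabular features into one feature matrix.
X = np.column_stack([embeddings, completion_time, tool_code, posted])

X_train, X_test, y_train, y_test = train_test_split(
    X, rating, test_size=0.25, random_state=42
)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
print(f"Mean absolute error: {mae:.2f} rating points")
```

The fitted model's feature_importances_ attribute gives a first indication of which inputs drive the predicted rating, although importances spread across many embedding dimensions are hard to interpret individually.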

Result: High ratings were associated with long, positive feedback and verified LinkedIn posts.
5. Recommendation Engine for Dataset Topics Using Machine Learning
Objective:
Recommend future topics for each participant based on their past feedback.
Method:
- TF-IDF + cosine similarity for feedback comparison
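The method above can be illustrated with a small example. This is a sketch, not the project's implementation: the topic descriptions, the participant names, and the feedback text are invented, and matching feedback against short topic descriptions is just one way to apply TF-IDF with cosine similarity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical topic descriptions (keywords that characterize each theme).
topics = {
    "health data": "hospital patients health disease healthcare outcomes",
    "elections": "election votes voters polling candidates",
    "finance": "stock market banking investment revenue",
}

# Invented participant feedback.
feedback = {
    "participant_a": "I would love a dataset about hospital admissions and patient health",
    "participant_b": "More stock market and banking data please",
}

# Fit the vocabulary on the topic descriptions, then project feedback into it.
vectorizer = TfidfVectorizer(stop_words="english")
topic_matrix = vectorizer.fit_transform(list(topics.values()))
feedback_matrix = vectorizer.transform(list(feedback.values()))

# Rows are participants, columns are topics.
similarity = cosine_similarity(feedback_matrix, topic_matrix)

topic_names = list(topics)
recommendations = {
    participant: topic_names[row.argmax()]
    for participant, row in zip(feedback, similarity)
}
print(recommendations)
```

Because TF-IDF matches exact tokens, "patient" and "patients" are treated as different words here; stemming or embeddings would be needed to capture that kind of overlap.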



Result: Feedback-driven topic personalization helped identify demand for themes like health data, elections, and finance.
6. Anomaly Detection in Submissions Using Machine Learning Techniques
Objective:
Detect suspicious or bot-like entries to improve data quality.
Features:
- Extremely short/long completion time
- Repeated email or LinkedIn profiles
- Empty feedback
Model Used: Isolation Forest
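An Isolation Forest over features like these could be sketched as follows. The data is synthetic with a few bot-like rows injected by hand, and the column names are hypothetical. The contamination value of 0.10 is an assumption chosen to illustrate flagging roughly one record in ten; it is not taken from the project's code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
n = 200

# Synthetic submissions (all values are made up).
submissions = pd.DataFrame({
    "email": [f"user{i}@example.com" for i in range(n)],
    "completion_time": rng.normal(60, 15, n).clip(5, None),
    "feedback": ["Nice dataset"] * n,
})

# Inject five bot-like rows: extreme times, shared email, empty feedback.
submissions.loc[:4, "completion_time"] = [0.5, 0.4, 600, 0.3, 700]
submissions.loc[:4, "feedback"] = ""
submissions.loc[:4, "email"] = "bot@example.com"

# Numeric features derived from the raw columns.
features = pd.DataFrame({
    "completion_time": submissions["completion_time"],
    "duplicate_email": submissions["email"].duplicated(keep=False).astype(int),
    "empty_feedback": (submissions["feedback"].str.strip() == "").astype(int),
})

# contamination sets the expected share of anomalies (assumed here).
model = IsolationForest(contamination=0.10, random_state=7)
submissions["is_anomaly"] = model.fit_predict(features) == -1  # -1 means anomaly

print(submissions["is_anomaly"].mean())
print(submissions[submissions["is_anomaly"]].head(10))
```

Since contamination fixes the share of flagged records in advance, the flagged rows are best treated as candidates for manual review rather than confirmed bots.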



Result: Approximately 10% of records were flagged, enabling targeted data cleansing.
Final Thoughts
Using Microsoft Fabric’s Lakehouse and Notebooks with Python and machine learning, I created a powerful, end-to-end data science solution that:
- Predicted user behaviour
- Discovered personas
- Analysed sentiment
- Made smart recommendations
- Detected anomalies
Tools Used:
| Tool | Purpose |
| --- | --- |
| Microsoft Fabric | Unified analytics, storage, and compute |
| pandas, scikit-learn (Logistic Regression, Decision Tree, Random Forest, Support Vector Machine, KNeighbors), xgboost | Machine learning & preprocessing |
| TextBlob, LDA | NLP & topic modelling |
| SentenceTransformer | Feedback embeddings |
| pyLDAvis | Topic interpretation |
| Lakehouse | Unified data access |