MetroPulse
Build an end-to-end analytics system that tells investors which city micro-markets give the best mix of rental yield and capital appreciation and how confident we are in those signals.
Real estate decisions are emotional and noisy. Brokers shout about "hot" localities, spreadsheets contain dozens of uncomparable metrics, and investors struggle to trade off income (rent) vs appreciation (price growth). I built MetroPulse to cut through that noise: a transparent pipeline that turns price, rent and infrastructure signals into a ranked, explainable list of micro-markets, plus a Power BI / Streamlit dashboard you can actually use in meetings.
Below I explain what I built, why, how the data is prepared, feature engineering and business logic, why clustering matters, and what the dashboard explains.
Project Walkthrough
1. Why this project: the problem and the solution
The Problem
- Investors must balance two competing goals: cash flow now (rental yield) vs price growth later (capital appreciation).
- Lots of markets look good on one metric and poor on another.
- Infrastructure, liquidity and volatility matter, but are rarely combined into one decision tool.
- Backtesting (did the signal actually predict future growth?) is often skipped.
How MetroPulse helps
- Merges price, rent, transaction and infrastructure data into a single master dataset.
- Creates investment signals (rental yield, YoY growth, infra momentum, liquidity, volatility).
- Produces an explainable investment score that trades off yield, growth and risk.
- Segments markets with HDBSCAN so outliers and transitional markets are not forced into wrong buckets.
- Backtests scores vs real future returns and attaches confidence bands so you know how reliable each signal is.
- Displays all of this via interactive dashboards (Power BI / Streamlit) so non-technical stakeholders can make decisions.
2. Preparing the data: what datasets and what features
Datasets used
- property_prices.csv: monthly price per sqft, rent per sqft, unit size, transaction_count, date, micro_market, property_type
- infrastructure_scores.csv: metro distance, it_park distance, counts of schools/hospitals/malls, infra_score, date, micro_market
- market_metadata.csv: static attributes: market_category (prime/mid/emerging), zone_type (res/mixed), dominant_buyer, risk_category, remarks
- (optional) market_metrics.csv: derived metrics if you want to reuse them, but I prefer recomputing for reproducibility
Core features you should have in the master dataset
3. Feature engineering and business logic
Engineering steps (practical)
- Merge property prices with infra on `(city, micro_market, date)` and with `market_metadata`.
- Compute rental yield as annual rent / property price.
- Compute YoY growth with `.pct_change(periods=12)`.
- Liquidity score: normalize transaction counts (min-max).
- Volatility: rolling 12-month standard deviation of price.
- Infra score & momentum: the score is based on distances (metro, IT park) and amenity counts; momentum is the YoY change in that score.
- Backtest label: `future_12m_growth` computed by shifting forward.
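The engineering steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual pipeline: the column names `price_per_sqft` and `rent_per_sqft`, the assumption that rent is quoted monthly, and the function name `engineer_features` are all hypothetical, and the merge uses only `micro_market` and `date` on the assumption that the data covers a single city.

```python
import pandas as pd


def engineer_features(prices: pd.DataFrame, infra: pd.DataFrame) -> pd.DataFrame:
    """Merge monthly price and infra data, then derive the investment signals."""
    # Assumption: one city, so (micro_market, date) identifies a row.
    df = prices.merge(infra, on=["micro_market", "date"], how="left")
    df = df.sort_values(["micro_market", "date"]).reset_index(drop=True)

    # Rental yield: annual rent / price (assumes rent_per_sqft is monthly).
    df["rental_yield"] = df["rent_per_sqft"] * 12 / df["price_per_sqft"]

    # YoY growth: change versus the same month one year earlier, per market.
    df["yoy_growth"] = df.groupby("micro_market")["price_per_sqft"].pct_change(periods=12)

    # Infra momentum: YoY change in the infra score, per market.
    df["infra_momentum"] = df.groupby("micro_market")["infra_score"].pct_change(periods=12)

    # Liquidity: min-max normalized transaction counts.
    tx = df["transaction_count"]
    df["liquidity_score"] = (tx - tx.min()) / (tx.max() - tx.min())

    # Volatility: rolling 12-month standard deviation of price, per market.
    df["volatility"] = df.groupby("micro_market")["price_per_sqft"].transform(
        lambda s: s.rolling(12).std()
    )

    # Backtest label: price growth over the following 12 months.
    future_price = df.groupby("micro_market")["price_per_sqft"].shift(-12)
    df["future_12m_growth"] = future_price / df["price_per_sqft"] - 1
    return df
```

Grouping by `micro_market` before every lagged or rolling operation keeps one market's history from leaking into another's. The first 12 months of each market have no YoY growth and the last 12 have no backtest label, so both columns contain NaN at the edges.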
The Investment Score (Explainable)
Normalize features to the 0-1 range to avoid scale bias. Use business weights like:
investment_score = 0.30 * yield_norm + 0.30 * growth_norm + 0.20 * infra_norm + 0.10 * liquidity_norm - 0.10 * risk_norm
Map `risk_category` (Low/Medium/High) to additive penalties. Apply Cluster Adjustment: Final score = investment_score × cluster_multiplier.
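As a rough sketch of how the scoring could be wired together: the weights below come from the formula above, but the penalty values, the cluster multipliers, the `cluster_label` column and the choice of normalized volatility as `risk_norm` are illustrative assumptions, not the project's calibrated settings.

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the mechanics.
RISK_PENALTY = {"Low": 0.00, "Medium": 0.05, "High": 0.10}
CLUSTER_MULTIPLIER = {
    "Balanced": 1.10,
    "High Growth": 1.05,
    "Yield Focused": 1.00,
    "Outlier / Speculative": 0.90,
}


def min_max(s: pd.Series) -> pd.Series:
    """Scale a series to the 0-1 range; a constant series maps to 0."""
    span = s.max() - s.min()
    if span == 0:
        return pd.Series(0.0, index=s.index)
    return (s - s.min()) / span


def score_markets(df: pd.DataFrame) -> pd.DataFrame:
    """Compute investment_score and final_score for one snapshot of markets."""
    out = df.copy()
    out["investment_score"] = (
        0.30 * min_max(df["rental_yield"])
        + 0.30 * min_max(df["yoy_growth"])
        + 0.20 * min_max(df["infra_score"])
        + 0.10 * min_max(df["liquidity_score"])
        - 0.10 * min_max(df["volatility"])  # assumption: risk_norm = normalized volatility
    )
    # Additive penalty from the categorical risk rating.
    out["investment_score"] -= df["risk_category"].map(RISK_PENALTY)
    # Cluster adjustment: final score = investment_score x cluster_multiplier.
    out["final_score"] = out["investment_score"] * df["cluster_label"].map(CLUSTER_MULTIPLIER)
    return out
```

Because every term is a weight times a 0-1 feature, each market's score can be decomposed term by term, which is what keeps the ranking explainable.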
4. Why we cluster, how, and which clusters we expect
Clustering reveals natural groups like income-centric or growth-centric. HDBSCAN is preferred as it discovers cluster counts automatically and marks noise/outliers.
Typical Cluster Labels
- Balanced: high yield + high growth + low volatility
- Yield Focused: higher rental yields, lower growth
- High Growth: strong growth, higher infra momentum
- Outlier / Speculative: inconsistent signals, low liquidity
5. Other important components used in the project
- Validation & backtesting: Testing how the score at time t correlates with realized returns 12 months later (`future_12m_growth`).
- Confidence bands: Built from inverse volatility and liquidity to create High/Medium/Low bands.
- Explainability: SHAP-style explanation to show which features pushed a market into its cluster.
- Tools & stack: Python (pandas, scikit-learn, HDBSCAN), Power BI and Streamlit.
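The backtest and confidence-band ideas can be sketched as below. This is one plausible reading rather than the project's exact method: the rank (Spearman) correlation, the equal 50/50 blend of inverse volatility and liquidity, and the 1/3 and 2/3 band cut-offs are all assumptions, as are the function names.

```python
import pandas as pd


def min_max(s: pd.Series) -> pd.Series:
    """Scale a series to the 0-1 range; a constant series maps to 0."""
    span = s.max() - s.min()
    if span == 0:
        return pd.Series(0.0, index=s.index)
    return (s - s.min()) / span


def backtest_rank_corr(
    df: pd.DataFrame,
    score_col: str = "final_score",
    label_col: str = "future_12m_growth",
) -> float:
    """Spearman correlation between the score at t and growth over t..t+12."""
    valid = df[[score_col, label_col]].dropna()
    # Spearman = Pearson correlation of the ranks.
    return valid[score_col].rank().corr(valid[label_col].rank())


def confidence_band(df: pd.DataFrame) -> pd.Series:
    """Band each market as Low/Medium/High from inverse volatility and liquidity."""
    confidence = 0.5 * (1 - min_max(df["volatility"])) + 0.5 * min_max(df["liquidity_score"])
    return pd.cut(
        confidence,
        bins=[0, 1 / 3, 2 / 3, 1],  # hypothetical cut-offs
        labels=["Low", "Medium", "High"],
        include_lowest=True,
    )
```

A rank correlation is used here because the score is meant to order markets, not to forecast the size of their returns; rows without a 12-month-ahead label are dropped before the comparison.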