MetroPulse

Build an end-to-end analytics system that tells investors which city micro-markets give the best mix of rental yield and capital appreciation and how confident we are in those signals.

Real estate decisions are emotional and noisy. Brokers shout about “hot” localities, spreadsheets contain dozens of metrics that cannot be compared directly, and investors struggle to trade off income (rent) against appreciation (price growth). I built MetroPulse to cut through that noise: a transparent pipeline that turns price, rent and infrastructure signals into a ranked, explainable list of micro-markets, plus a Power BI / Streamlit dashboard you can actually use in meetings.

Below I explain what I built and why, how the data is prepared, how the feature engineering and business logic work, why clustering matters, and what the dashboard shows.

Project Walkthrough

1. Why this project — the problem and the solution

The Problem

Investors must balance two competing goals: cash flow now (rental yield) vs price growth later (capital appreciation).
Lots of markets look good on one metric and poor on another.
Infrastructure, liquidity and volatility matter, but are rarely combined into one decision tool.
Backtesting (did the signal actually predict future growth?) is often skipped.

How MetroPulse helps

Merges price, rent, transaction and infrastructure data into a single master dataset.
Creates investment signals (rental yield, YoY growth, infra momentum, liquidity, volatility).
Produces an explainable investment score that trades off yield, growth and risk.
Segments markets with HDBSCAN so outliers and transitional markets are not forced into wrong buckets.
Backtests scores vs real future returns and attaches confidence bands so you know how reliable each signal is.
Displays all of this via interactive dashboards (Power BI / Streamlit) so non-technical stakeholders can make decisions.

                    

2. Preparing the data — what datasets and what features

Datasets used

property_prices.csv — monthly price per sqft, rent per sqft, unit size, transaction_count, date, micro_market, property_type
infrastructure_scores.csv — metro distance, it_park distance, counts of schools/hospitals/malls, infra_score, date, micro_market
market_metadata.csv — static attributes: market_category (prime/mid/emerging), zone_type (res/mixed), dominant_buyer, risk_category, remarks
(optional) market_metrics.csv — derived metrics if you want to reuse them, but I prefer recomputing for reproducibility

Core features you should have in the master dataset

`avg_price_sqft` (numeric)

`avg_rent_sqft` (numeric)

`rental_yield_pct` = (avg_rent_sqft * unit_size * 12) / (avg_price_sqft * unit_size) * 100, which simplifies to avg_rent_sqft * 12 / avg_price_sqft * 100 because unit_size cancels

`yoy_price_growth` (percent change vs same month last year)

`infra_score` (composite, scaled 0–1) & `infra_momentum`

`liquidity_score` & `price_volatility`

3. Feature engineering and business logic

Engineering steps (practical)

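Merge property prices with infra on `(city, micro_market, date)` and with `market_metadata`. The column lists above do not show a `city` field, so include it in the key only if your files carry one.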
Compute rental yield as annual rent / property price.
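Compute YoY growth with `.pct_change(periods=12)`, applied per micro-market on monthly data sorted by date so that values from different markets are never compared.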
Liquidity score: normalize transaction counts (min–max).
Volatility: rolling 12-month standard deviation of price.
Infra score & momentum: based on distances and YoY change.
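Backtest label: `future_12m_growth`, computed by shifting the price series by 12 months so each row is paired with the price one year ahead. The last 12 months of each market have no label.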
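
The steps above can be sketched in pandas roughly as follows. This is a minimal illustration on synthetic data, not the project's actual pipeline: the function name `build_features`, the two made-up markets, and the choice to define `infra_momentum` as the 12-month change in `infra_score` are all assumptions, and the merge key here is `(micro_market, date)` only.

```python
import numpy as np
import pandas as pd


def build_features(prices: pd.DataFrame, infra: pd.DataFrame) -> pd.DataFrame:
    """Merge price and infra data, then derive the signals listed above."""
    keys = ["micro_market", "date"]
    df = (
        prices.merge(infra, on=keys, how="left")
        .sort_values(keys)
        .reset_index(drop=True)
    )
    price = df.groupby("micro_market")["avg_price_sqft"]

    # Rental yield: annual rent / price (unit_size cancels out of the ratio).
    df["rental_yield_pct"] = df["avg_rent_sqft"] * 12 / df["avg_price_sqft"] * 100
    # YoY growth: change vs the same month one year earlier, per market.
    df["yoy_price_growth"] = price.pct_change(periods=12) * 100
    # Volatility: rolling 12-month standard deviation of price, per market.
    df["price_volatility"] = price.transform(lambda s: s.rolling(window=12).std())
    # Infra momentum (assumed definition): 12-month change in infra_score.
    df["infra_momentum"] = df.groupby("micro_market")["infra_score"].diff(periods=12)
    # Liquidity: min-max normalised transaction counts.
    tx = df["transaction_count"]
    df["liquidity_score"] = (tx - tx.min()) / (tx.max() - tx.min())
    # Backtest label: shift(-12) pulls the price 12 months ahead onto each row.
    df["future_12m_growth"] = (price.shift(-12) / df["avg_price_sqft"] - 1) * 100
    return df


# Synthetic demo: market A grows 1% per month, market B grows linearly.
dates = pd.date_range("2021-01-01", periods=36, freq="MS")
months = np.arange(36)
prices = pd.DataFrame({
    "micro_market": ["A"] * 36 + ["B"] * 36,
    "date": dates.tolist() * 2,
    "avg_price_sqft": np.concatenate([5000 * 1.01 ** months, 8000 + 10.0 * months]),
    "avg_rent_sqft": [25.0] * 36 + [30.0] * 36,
    "transaction_count": np.concatenate([10 + months, 50 + months]),
})
infra = pd.DataFrame({
    "micro_market": ["A"] * 36 + ["B"] * 36,
    "date": dates.tolist() * 2,
    "infra_score": np.concatenate([np.linspace(0.4, 0.6, 36), np.full(36, 0.7)]),
})
features = build_features(prices, infra)
```

The first 12 months of each market have no YoY value and the last 12 have no backtest label, so any model or score fitted on these columns should drop or mask those rows.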

The Investment Score (Explainable)

Normalize features to 0–1 to avoid scale bias. Use business weights like:

investment_score = 0.30 * yield_norm + 0.30 * growth_norm + 0.20 * infra_norm + 0.10 * liquidity_norm - 0.10 * risk_norm

Map `risk_category` (Low/Medium/High) to additive penalties that are subtracted from the score. Then apply the cluster adjustment: final_score = investment_score × cluster_multiplier.
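
To make the scoring concrete, here is a small sketch of the weighted score with the risk penalty and cluster multiplier applied. The weights come from the formula above; everything else is an assumption for illustration: the penalty sizes in `RISK_PENALTY`, the values in `CLUSTER_MULTIPLIER`, the use of `price_volatility` as the source of `risk_norm`, and the decision to subtract the penalty before multiplying.

```python
import pandas as pd

# Weights from the formula above; the risk term enters with a negative sign.
WEIGHTS = {
    "yield_norm": 0.30,
    "growth_norm": 0.30,
    "infra_norm": 0.20,
    "liquidity_norm": 0.10,
    "risk_norm": -0.10,
}
# Hypothetical values: the text does not specify penalty sizes or multipliers.
RISK_PENALTY = {"Low": 0.00, "Medium": 0.05, "High": 0.10}
CLUSTER_MULTIPLIER = {
    "Balanced": 1.10,
    "Yield Focused": 1.00,
    "High Growth": 1.00,
    "Outlier / Speculative": 0.85,
}


def min_max(series: pd.Series) -> pd.Series:
    """Scale a column to 0-1; a constant column maps to 0."""
    span = series.max() - series.min()
    return (series - series.min()) / span if span else series * 0.0


def score_markets(df: pd.DataFrame) -> pd.DataFrame:
    norm = pd.DataFrame({
        "yield_norm": min_max(df["rental_yield_pct"]),
        "growth_norm": min_max(df["yoy_price_growth"]),
        "infra_norm": min_max(df["infra_score"]),
        "liquidity_norm": min_max(df["liquidity_score"]),
        "risk_norm": min_max(df["price_volatility"]),  # assumed risk proxy
    })
    weighted = sum(weight * norm[name] for name, weight in WEIGHTS.items())
    out = df.copy()
    # Additive penalty from the categorical risk label (assumed to be subtracted).
    out["investment_score"] = weighted - df["risk_category"].map(RISK_PENALTY)
    out["final_score"] = out["investment_score"] * df["cluster"].map(CLUSTER_MULTIPLIER)
    return out


markets = pd.DataFrame({
    "micro_market": ["A", "B", "C"],
    "rental_yield_pct": [3.0, 4.0, 5.0],
    "yoy_price_growth": [10.0, 6.0, 2.0],
    "infra_score": [0.8, 0.5, 0.2],
    "liquidity_score": [1.0, 0.5, 0.0],
    "price_volatility": [300.0, 100.0, 200.0],
    "risk_category": ["Medium", "Low", "High"],
    "cluster": ["High Growth", "Balanced", "Yield Focused"],
})
scored = score_markets(markets)
```

Because the positive weights sum to 0.90 and the risk weight is -0.10, the weighted score before penalties lies between -0.10 and 0.90, so it is a relative ranking signal and not a percentage.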

4. Why we cluster, how, and which clusters we expect

Clustering reveals natural groups of markets, such as income-centric or growth-centric ones. HDBSCAN is preferred because it infers the number of clusters from the data and labels noise/outliers explicitly instead of forcing every market into a cluster.

Typical Cluster Labels

Balanced: high yield + high growth + low volatility
Yield Focused: higher rental yields, lower growth
High Growth: strong growth, higher infra momentum
Outlier / Speculative: inconsistent signals, low liquidity

5. Other important components used in the project

Validation & backtesting: Test how well the score at time t correlates with realized price growth over the following 12 months (t+12).
Confidence bands: Built from inverse volatility and liquidity to create High/Medium/Low bands.
Explainability: SHAP-style explanation to show which features pushed a market into its cluster.
Tools & stack: Python (Pandas, Scikit-learn, HDBSCAN), with Power BI / Streamlit for the dashboards.
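
One possible way to turn inverse volatility and liquidity into the High/Medium/Low confidence bands mentioned above is sketched here. The equal 50/50 blend, the scaling by the column maximum, and the cut points at 1/3 and 2/3 are assumptions chosen for illustration; the sketch also assumes every volatility value is strictly positive.

```python
import pandas as pd


def add_confidence_bands(df: pd.DataFrame) -> pd.DataFrame:
    """Blend inverse volatility and liquidity into a 0-1 confidence value."""
    stability = 1.0 / df["price_volatility"]       # inverse volatility
    stability_norm = stability / stability.max()   # most stable market -> 1.0
    liquidity_norm = df["liquidity_score"] / df["liquidity_score"].max()

    out = df.copy()
    # Assumed equal weighting of the two ingredients.
    out["confidence"] = 0.5 * stability_norm + 0.5 * liquidity_norm
    # Assumed cut points: thirds of the 0-1 range.
    out["confidence_band"] = pd.cut(
        out["confidence"],
        bins=[0.0, 1 / 3, 2 / 3, 1.0],
        labels=["Low", "Medium", "High"],
        include_lowest=True,
    )
    return out


sample = pd.DataFrame({
    "micro_market": ["A", "B", "C"],
    "price_volatility": [100.0, 200.0, 400.0],
    "liquidity_score": [1.0, 0.5, 0.1],
})
banded = add_confidence_bands(sample)
print(banded[["micro_market", "confidence", "confidence_band"]])
```

A market that is both stable and actively traded lands in the High band, while a volatile, thinly traded one lands in Low, which matches the intuition that signals from such markets rest on less evidence.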

6. The dashboard

KPI Row: Avg Yield, Growth, Infra, and Final Score at a glance.

Yield vs Growth: Scatter plot to find "Balanced" opportunities.

Risk vs Return: Identifies markets that combine high returns with low risk.

Market Drill-down: Time series of price and rent momentum.

Backtest: Bar chart showing future growth by score quintile.
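
The numbers behind the backtest chart could be produced along these lines. This is a sketch on synthetic data in which future growth is deliberately constructed to rise with the score, so it demonstrates the mechanics only and says nothing about how the real score performs. The function name `backtest_by_quintile` and the column name `final_score` are assumptions. The rank correlation is computed as a Pearson correlation of ranks, which is the definition of Spearman's coefficient.

```python
import numpy as np
import pandas as pd


def backtest_by_quintile(df: pd.DataFrame):
    """Mean future growth per score quintile, plus a rank correlation."""
    valid = df.dropna(subset=["final_score", "future_12m_growth"]).copy()
    # Quintile 1 holds the lowest scores, quintile 5 the highest.
    valid["score_quintile"] = pd.qcut(valid["final_score"], q=5, labels=[1, 2, 3, 4, 5])
    by_quintile = valid.groupby("score_quintile", observed=True)["future_12m_growth"].mean()
    # Spearman correlation expressed as Pearson correlation of the ranks.
    rank_corr = valid["final_score"].rank().corr(valid["future_12m_growth"].rank())
    return by_quintile, rank_corr


# Synthetic data: growth is built to increase with the score, plus noise.
rng = np.random.default_rng(42)
score = rng.uniform(0.0, 1.0, size=100)
history = pd.DataFrame({
    "final_score": score,
    "future_12m_growth": 10.0 * score + rng.normal(scale=2.0, size=100),
})
by_quintile, rank_corr = backtest_by_quintile(history)
print(by_quintile)
print(f"rank correlation: {rank_corr:.2f}")
```

If the score carries real signal, mean future growth should rise from quintile 1 to quintile 5; a flat or non-monotonic pattern suggests the weights need revisiting.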