MetroPulse

Build an end-to-end analytics system that tells investors which city micro-markets give the best mix of rental yield and capital appreciation and how confident we are in those signals.


Real estate decisions are emotional and noisy. Brokers shout about “hot” localities, spreadsheets contain dozens of incomparable metrics, and investors struggle to trade off income (rent) against appreciation (price growth). I built MetroPulse to cut through that noise: a transparent pipeline that turns price, rent and infrastructure signals into a ranked, explainable list of micro-markets, plus a Power BI / Streamlit dashboard you can actually use in meetings.

Below I explain what I built, why, how the data is prepared, feature engineering and business logic, why clustering matters, and what the dashboard explains.

Project Walkthrough

1. Why this project: the problem and the solution

The Problem

  • Investors must balance two competing goals: cash flow now (rental yield) vs price growth later (capital appreciation).
  • Lots of markets look good on one metric and poor on another.
  • Infrastructure, liquidity and volatility matter, but are rarely combined into one decision tool.
  • Backtesting (did the signal actually predict future growth?) is often skipped.

How MetroPulse helps

  • Merges price, rent, transaction and infrastructure data into a single master dataset.
  • Creates investment signals (rental yield, YoY growth, infra momentum, liquidity, volatility).
  • Produces an explainable investment score that trades off yield, growth and risk.
  • Segments markets with HDBSCAN so outliers and transitional markets are not forced into wrong buckets.
  • Backtests scores vs real future returns and attaches confidence bands so you know how reliable each signal is.
  • Displays all of this via interactive dashboards (Power BI / Streamlit) so non-technical stakeholders can make decisions.

2. Preparing the data: what datasets and what features

Datasets used

  • property_prices.csv: monthly price per sqft, rent per sqft, unit size, transaction_count, date, micro_market, property_type
  • infrastructure_scores.csv: metro distance, it_park distance, counts of schools/hospitals/malls, infra_score, date, micro_market
  • market_metadata.csv: static attributes such as market_category (prime/mid/emerging), zone_type (res/mixed), dominant_buyer, risk_category, remarks
  • (optional) market_metrics.csv: derived metrics if you want to reuse them, but I prefer recomputing for reproducibility

Core features you should have in the master dataset

`avg_price_sqft` (numeric)
`avg_rent_sqft` (numeric)
`rental_yield_pct` = (avg_rent_sqft * unit_size * 12) / (avg_price_sqft * unit_size) * 100
`yoy_price_growth` (percent change vs same month last year)
`infra_score` (composite 0-1) & `infra_momentum`
`liquidity_score` & `price_volatility`
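
As a quick sanity check on the `rental_yield_pct` formula above, here is a minimal sketch. It assumes `avg_rent_sqft` is a monthly figure (which the factor of 12 implies); the function name and the sample numbers are hypothetical and only illustrate the arithmetic.

```python
def rental_yield_pct(avg_rent_sqft: float, avg_price_sqft: float) -> float:
    """Gross annual rental yield in percent.

    unit_size appears in both the numerator and the denominator of the
    formula, so it cancels and only the per-sqft figures are needed.
    """
    if avg_price_sqft <= 0:
        raise ValueError("avg_price_sqft must be positive")
    return avg_rent_sqft * 12 / avg_price_sqft * 100


# Hypothetical figures: rent of 25 per sqft per month, price of 6000 per sqft.
example_yield = rental_yield_pct(25.0, 6000.0)
print(round(example_yield, 2))
```

Because `unit_size` cancels, the yield can be compared across markets with different typical unit sizes without any extra adjustment.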

3. Feature engineering and business logic

Engineering steps (practical)

  1. Merge property prices with infra on `(city, micro_market, date)` and with `market_metadata`.
  2. Compute rental yield as annual rent / property price.
  3. Compute YoY growth with `.pct_change(periods=12)`.
  4. Liquidity score: normalize transaction counts (min–max).
  5. Volatility: rolling 12-month standard deviation of price.
  6. Infra score & momentum: based on distances and YoY change.
  7. Backtest label: `future_12m_growth` computed by shifting forward.
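
The time-series steps above (3, 4, 5 and 7) could be sketched in pandas roughly as follows. This is an illustration on synthetic data, not the project's actual code: the two markets, their prices and the transaction counts are made up, and the merge in step 1 is assumed to have already happened.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged master dataset: 2 markets x 36 months.
dates = pd.date_range("2021-01-01", periods=36, freq="MS")
rng = np.random.default_rng(0)
frames = []
for market, start_price, monthly_drift in [("A", 5000.0, 0.005), ("B", 8000.0, 0.002)]:
    frames.append(pd.DataFrame({
        "micro_market": market,
        "date": dates,
        "avg_price_sqft": start_price * (1 + monthly_drift) ** np.arange(36),
        "transaction_count": rng.integers(20, 200, size=36),
    }))
df = pd.concat(frames, ignore_index=True).sort_values(["micro_market", "date"])

# Group by market so that no calculation leaks across market boundaries.
price_by_market = df.groupby("micro_market")["avg_price_sqft"]

# Step 3: YoY growth in percent (first 12 months of each market are NaN).
df["yoy_price_growth"] = price_by_market.pct_change(periods=12) * 100

# Step 5: rolling 12-month standard deviation of price, per market.
df["price_volatility"] = price_by_market.transform(lambda s: s.rolling(12).std())

# Step 7: backtest label, the growth realised over the NEXT 12 months
# (last 12 months of each market are NaN because the future is unknown).
df["future_12m_growth"] = (price_by_market.shift(-12) / df["avg_price_sqft"] - 1) * 100

# Step 4: min-max normalise transaction counts into a 0-1 liquidity score.
tc = df["transaction_count"]
df["liquidity_score"] = (tc - tc.min()) / (tc.max() - tc.min())

print(df.tail(3))
```

Grouping by `micro_market` before calling `pct_change`, `rolling` or `shift` matters: without it, the first months of one market would be compared against the last months of another.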

The Investment Score (Explainable)

Normalize features to 0–1 to avoid scale bias. Use business weights like:

investment_score = 0.30 * yield_norm + 0.30 * growth_norm + 0.20 * infra_norm + 0.10 * liquidity_norm - 0.10 * risk_norm

Map `risk_category` (Low/Medium/High) to additive penalties. Then apply the cluster adjustment: final score = investment_score × cluster_multiplier.
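
A minimal sketch of the scoring logic follows. The weights come from the formula above; everything else is an assumption made for illustration: the penalty sizes, the multiplier values, the `cluster_label` column name, the use of normalized `price_volatility` as `risk_norm`, and the choice to subtract the penalty before applying the multiplier.

```python
import pandas as pd

# Weights taken from the investment_score formula.
WEIGHTS = {"yield": 0.30, "growth": 0.30, "infra": 0.20, "liquidity": 0.10, "risk": 0.10}
# Hypothetical values: the text says these mappings exist but not their sizes.
RISK_PENALTY = {"Low": 0.00, "Medium": 0.05, "High": 0.10}
CLUSTER_MULTIPLIER = {
    "Balanced": 1.10,
    "Yield Focused": 1.00,
    "High Growth": 1.00,
    "Outlier / Speculative": 0.85,
}


def min_max(s: pd.Series) -> pd.Series:
    """Scale a column to 0-1; a constant column maps to 0.5."""
    span = s.max() - s.min()
    if span == 0:
        return pd.Series(0.5, index=s.index)
    return (s - s.min()) / span


def score_markets(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["investment_score"] = (
        WEIGHTS["yield"] * min_max(out["rental_yield_pct"])
        + WEIGHTS["growth"] * min_max(out["yoy_price_growth"])
        + WEIGHTS["infra"] * min_max(out["infra_score"])
        + WEIGHTS["liquidity"] * min_max(out["liquidity_score"])
        - WEIGHTS["risk"] * min_max(out["price_volatility"])  # assumed risk_norm
    )
    penalty = out["risk_category"].map(RISK_PENALTY)
    multiplier = out["cluster_label"].map(CLUSTER_MULTIPLIER)
    # Assumed order: subtract the additive penalty, then apply the multiplier.
    out["final_score"] = (out["investment_score"] - penalty) * multiplier
    return out


# Three made-up markets, one row each.
markets = pd.DataFrame({
    "micro_market": ["M1", "M2", "M3"],
    "rental_yield_pct": [2.0, 4.0, 6.0],
    "yoy_price_growth": [10.0, 5.0, 0.0],
    "infra_score": [0.2, 0.6, 0.4],
    "liquidity_score": [0.1, 0.9, 0.5],
    "price_volatility": [100.0, 300.0, 200.0],
    "risk_category": ["Low", "Medium", "High"],
    "cluster_label": ["High Growth", "Balanced", "Yield Focused"],
})
scored = score_markets(markets)
print(scored[["micro_market", "investment_score", "final_score"]])
```

Keeping the weights, penalties and multipliers in plain dictionaries is one way to keep the score explainable: a stakeholder can read off exactly why one market outranks another.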

4. Why we cluster, how, and which clusters we expect

Clustering reveals natural groups like income-centric or growth-centric. HDBSCAN is preferred as it discovers cluster counts automatically and marks noise/outliers.

Typical Cluster Labels

  • Balanced: high yield + high growth + low volatility
  • Yield Focused: higher rental yields, lower growth
  • High Growth: strong growth, higher infra momentum
  • Outlier / Speculative: inconsistent signals, low liquidity

5. Other important components used in the project

  • Validation & backtesting: Testing how the score at time t correlates with returns at t+12.
  • Confidence bands: Built from inverse volatility and liquidity to create High/Medium/Low bands.
  • Explainability: SHAP-style explanation to show which features pushed a market into its cluster.
  • Tools & stack: Python (Pandas, Scikit-learn, HDBSCAN) and Power BI / Streamlit for the dashboards.
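
The confidence bands described above could be built along these lines. This is a sketch under stated assumptions: the equal 50/50 blend of inverse volatility and liquidity, the band cut-offs at one third and two thirds, and the function name are all hypothetical, since the text does not specify them.

```python
import pandas as pd


def confidence_bands(price_volatility: pd.Series, liquidity_score: pd.Series) -> pd.DataFrame:
    """Blend inverse volatility and liquidity into a 0-1 confidence value and a band."""

    def min_max(s: pd.Series) -> pd.Series:
        span = s.max() - s.min()
        if span == 0:
            return pd.Series(0.5, index=s.index)
        return (s - s.min()) / span

    inverse_vol = 1 - min_max(price_volatility)  # calmer prices give higher confidence
    liquidity = min_max(liquidity_score)         # more transactions give higher confidence
    confidence = 0.5 * inverse_vol + 0.5 * liquidity  # assumed equal weights

    band = pd.cut(
        confidence,
        bins=[0, 1 / 3, 2 / 3, 1],  # assumed cut-offs
        labels=["Low", "Medium", "High"],
        include_lowest=True,
    )
    return pd.DataFrame({"confidence": confidence, "confidence_band": band})


# Three made-up markets.
result = confidence_bands(
    price_volatility=pd.Series([100.0, 200.0, 300.0], index=["M1", "M2", "M3"]),
    liquidity_score=pd.Series([0.9, 0.5, 0.1], index=["M1", "M2", "M3"]),
)
print(result)
```

A market with a high score but a Low band is a signal to treat with caution: the number may be driven by a handful of transactions or by erratic prices.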

6. The dashboard

KPI Row: Avg Yield, Growth, Infra, and Final Score at a glance.
Yield vs Growth: Scatter plot to find "Balanced" opportunities.
Risk vs Return: Identifying high returns with low risk.
Market Drill-down: Time series of price and rent momentum.
Backtest: Bar chart showing future growth by score quintile.
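
The data behind the backtest chart could be prepared roughly like this. The sketch uses synthetic scores and a made-up relationship between score and future growth, so the numbers it prints say nothing about real markets; the column name `final_score` is also an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 200

# Synthetic data, purely for illustration: growth rises with score, plus noise.
score = rng.uniform(0, 1, size=n)
future = 10 * score + rng.normal(0, 2, size=n)
bt = pd.DataFrame({"final_score": score, "future_12m_growth": future})

# Bucket markets into five equal-sized groups by score (Q1 = lowest scores).
bt["score_quintile"] = pd.qcut(
    bt["final_score"], q=5, labels=["Q1", "Q2", "Q3", "Q4", "Q5"]
)

# Average realised 12-month growth per quintile: the values behind the bar chart.
by_quintile = bt.groupby("score_quintile", observed=True)["future_12m_growth"].mean()

# Rank correlation between score at time t and growth at t+12
# (Spearman, computed as the Pearson correlation of the ranks).
rank_corr = bt["final_score"].rank().corr(bt["future_12m_growth"].rank())

print(by_quintile)
print(f"rank correlation: {rank_corr:.2f}")
```

If the score carries real signal, average future growth should rise from Q1 to Q5. A flat or inverted pattern would suggest the weights need revisiting.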