My Projects
Data Science Project: Predictive Analysis for Bank of England
I worked in a team of five data scientists to develop a predictive analysis model, built on large language models (LLMs), for the Bank of England.
Objective
The Bank of England sought to detect early warning signals of financial distress among banks in the UK financial sector. Identifying these signals enables policymakers to take proactive measures and mitigate systemic risk.
Data & Challenges
The primary data source was quarterly earnings reports from banks, each with unique formatting and structure. My responsibilities included:
- Query Engineering: Extracting unstructured data from reports, standardizing it into a structured format, and separating presentation sections from Q&A sessions.
- Indicator Mapping: Leveraging a JSON file containing 46 predefined indicators of financial distress and associated keywords (e.g., “key staff leaving”). I used this to design automated queries that interrogated each report.
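As a rough illustration of the two responsibilities above, the sketch below splits a report into presentation and Q&A text and then checks each part against indicator keywords. It is a simplification: plain keyword matching stands in for the LLM-based queries used in the project. The JSON layout, the indicator name and category, the Q&A marker pattern, and the sample report text are all assumptions made for the example; only the keyword "key staff leaving" comes from the project.

```python
import json
import re

# Hypothetical indicator file content; the real file held 46 indicators
# and its exact schema is not shown here.
INDICATORS_JSON = """
[
  {"indicator": "Loss of Key Personnel",
   "category": "Governance",
   "keywords": ["key staff leaving", "executive departure"]}
]
"""

# Assumed marker for the start of the Q&A session; real reports vary.
QA_MARKER = re.compile(
    r"^\s*(question[- ]and[- ]answer|q\s*&\s*a)\b",
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(text):
    """Split a report into (presentation, qa) at the first Q&A marker."""
    match = QA_MARKER.search(text)
    if match is None:
        return text, ""
    return text[:match.start()], text[match.start():]


def find_keywords(text, indicators):
    """Return {indicator name: [keywords found]}, matching case-insensitively."""
    lowered = text.lower()
    hits = {}
    for item in indicators:
        found = [kw for kw in item["keywords"] if kw.lower() in lowered]
        if found:
            hits[item["indicator"]] = found
    return hits


indicators = json.loads(INDICATORS_JSON)
report = (
    "Prepared remarks: revenue was stable.\n"
    "Q&A\n"
    "Analyst: Are key staff leaving the trading desk?"
)
presentation, qa = split_sections(report)
print("Presentation hits:", find_keywords(presentation, indicators))
print("Q&A hits:", find_keywords(qa, indicators))
```

Keeping the two sections separate means each hit can be traced back to either the prepared remarks or the analyst questions.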
Solution
The system produced detailed risk reports per bank. Each report highlighted:
- High-risk, medium-risk, and low-risk indicators
- Confidence scores
- Supporting reasoning and evidence
Example Extract – JP Morgan Report
=== EXECUTIVE SUMMARY ===
🔴 High Risk Indicators (70-100): 6
🟡 Medium Risk Indicators (30-69): 14
🟢 Low Risk Indicators (0-29): 34
=== HIGH RISK INDICATORS ===
🔴 Indicator: Increased Reliance on Financing Activities
📊 Category: Business Model and Strategic Shifts
📈 Confidence Score: 70.0
📁 Source: output\JPM_2025_Q1_presentation.txt
🔍 Keywords Found: Increased reliance on external funding, higher leverage, higher deposit balances, lower balances, elevated refinancing activity, Leveraged Finance
💭 Reasoning: The text shows signs of increased reliance on financing activities. The bank has seen an increase in fees driven by elevated refinancing activity, particularly in Leveraged Finance. This suggests that the bank is relying more on external funding. Additionally, the bank's payments revenue has increased due to higher deposit balances, indicating a higher leverage. The bank's expenses have also increased, which could be a sign of financial distress. However, the bank's loans have decreased, which could suggest that the bank is trying to reduce its reliance on external funding.
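The score ranges in the executive summary above (0-29, 30-69, 70-100) can be expressed as a small banding function. This is a minimal sketch, assuming the score shown for each indicator is what places it in a band; the function names and the sample scores are hypothetical and are not taken from the project code or results.

```python
from collections import Counter


def risk_band(score):
    """Map a 0-100 score to the band used in the executive summary."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 70:
        return "high"
    if score >= 30:
        return "medium"
    return "low"


def summarise(scores):
    """Count indicators per band, given {indicator name: score}."""
    counts = Counter(risk_band(value) for value in scores.values())
    return {band: counts.get(band, 0) for band in ("high", "medium", "low")}


# Made-up scores, for illustration only.
example_scores = {
    "Indicator A": 70.0,
    "Indicator B": 45.0,
    "Indicator C": 10.0,
    "Indicator D": 29.0,
}
print(summarise(example_scores))
```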
---

Data Science Project: Negative reviews to improve the business
Businesses thrive on positive reviews, but negative feedback can be just as valuable, often highlighting opportunities for real improvement.
For a project with PureGym, I analysed negative reviews from Trustpilot and Google. Using BERTopic, a topic-modelling library built on transformer embeddings, I grouped review content into key themes. To see if I could improve the insights, I then filtered specifically for negative reviews expressing anger, on the hypothesis that angry customers are more likely to state their true frustrations.
I then tested a different approach: applying the Phi generative LLM to extract the top three topics from each review.
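One way to turn per-review topics into an overall ranking is to count how often each topic appears across reviews. The sketch below assumes the model output has already been parsed into a list of up to three topic strings per review; the function names and the sample topics are made up for illustration and are not findings from the project.

```python
from collections import Counter


def normalise(topic):
    """Lower-case a topic and collapse repeated whitespace."""
    return " ".join(topic.lower().split())


def rank_topics(per_review_topics, top_n=3):
    """Rank topics by the number of reviews that mention them."""
    counts = Counter()
    for topics in per_review_topics:
        # Use at most three topics per review, each counted once per review.
        counts.update({normalise(t) for t in topics[:3]})
    return counts.most_common(top_n)


# Hypothetical parsed output for three reviews.
sample = [
    ["Broken equipment", "Overcrowding", "Cleanliness"],
    ["overcrowding", "Staff attitude", "Broken  equipment"],
    ["Overcrowding", "Cancellation fees", "Cleanliness"],
]
for topic, n_reviews in rank_topics(sample):
    print(f"{topic}: {n_reviews} reviews")
```

Normalising the strings first matters because a generative model rarely phrases the same topic identically twice.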
Across all three methods, the results consistently pointed to the most important issues PureGym should address to strengthen customer satisfaction and improve its business.
This project taught me that there are several techniques for extracting meaning from the same data.

Time Series Project: Investigating book sales trends
Weekly book sales data was obtained from a spreadsheet. An initial exploration revealed that sales typically peak shortly after publication before gradually declining. For some titles, a seasonal pattern was evident, with sales rising in November and December due to the Christmas market.
This project focused on two titles: The Alchemist and The Very Hungry Caterpillar. Both display noticeable seasonal patterns in their sales.
To analyse these patterns, I first examined autocorrelation using the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF). I then applied several time series forecasting models, including SARIMA, XGBoost, and LSTM, along with two hybrid approaches: a sequential model combining LSTM with Auto-ARIMA, and a parallel model combining LSTM with SARIMA.
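To show what the ACF measures, here is a minimal pure-Python sketch of the sample autocorrelation, applied to a synthetic weekly series with a 52-week cycle. The series is generated for illustration and is not the book sales data; in practice a library routine would normally be used instead of this hand-written function.

```python
import math


def acf(series, max_lag):
    """Sample autocorrelation of a series for lags 0..max_lag."""
    n = len(series)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the series length")
    mean = sum(series) / n
    centred = [x - mean for x in series]
    variance = sum(c * c for c in centred)
    if variance == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return [
        sum(centred[t] * centred[t + lag] for t in range(n - lag)) / variance
        for lag in range(max_lag + 1)
    ]


# Synthetic weekly "sales" with a yearly cycle (illustrative only).
weeks = 208
sales = [100 + 40 * math.cos(2 * math.pi * t / 52) for t in range(weeks)]

rho = acf(sales, 52)
# Approximate 95% band for a white-noise series of this length.
threshold = 1.96 / math.sqrt(weeks)
for lag in (1, 26, 52):
    flag = "outside band" if abs(rho[lag]) > threshold else "inside band"
    print(f"lag {lag:2d}: {rho[lag]: .3f} ({flag})")
```

A strong positive value at lag 52 together with a strong negative value at lag 26 is the kind of signature that suggests a yearly seasonal term is worth including in a model such as SARIMA.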
Model performance was evaluated by comparing predicted sales against actual sales beyond a chosen cutoff date, in order to identify which method delivered the most accurate forecasts.
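The evaluation described above can be sketched as a date-based split followed by an error metric on the held-out period. Mean absolute error (MAE) is used here as an assumed metric, since the write-up does not name one; the cutoff date, sales figures, and model names are hypothetical.

```python
from datetime import date


def split_at_cutoff(observations, cutoff):
    """Split (date, value) pairs into training (<= cutoff) and test (> cutoff)."""
    train = [(d, v) for d, v in observations if d <= cutoff]
    test = [(d, v) for d, v in observations if d > cutoff]
    return train, test


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up weekly observations, for illustration only.
observations = [
    (date(2024, 1, 1), 110),
    (date(2024, 1, 8), 115),
    (date(2024, 1, 15), 120),
    (date(2024, 1, 22), 135),
    (date(2024, 1, 29), 150),
]
train, test = split_at_cutoff(observations, cutoff=date(2024, 1, 8))
actual = [value for _, value in test]

# Hypothetical forecasts for the held-out weeks from two models.
forecasts = {
    "model_a": [118, 140, 149],
    "model_b": [100, 120, 170],
}
scores = {name: mae(actual, preds) for name, preds in forecasts.items()}
best = min(scores, key=scores.get)
print(scores, "best:", best)
```

Splitting by date rather than at random keeps future observations out of the training set, which is what makes the comparison a fair test of forecasting.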
