
MRP Abstracts

Al-Falahy, Raed – Predicting Patient Readmissions in Hospitals: A Machine Learning Approach (Supervisor: Ayse Basar Bener)

Patient readmissions pose significant challenges to healthcare systems and patient outcomes. Reducing unnecessary readmissions can lead to improved patient care, cost savings, and better allocation of resources. This project addresses the problem of predicting the likelihood of patient readmission in hospitals using machine learning models. Leveraging a comprehensive dataset of diabetic patients, encompassing demographics, medical history, medications, and encounter information, I conduct rigorous data preprocessing, feature engineering, and model selection to develop robust prediction models capable of accurately identifying patients at risk of readmission and the factors influencing readmission. The models achieve high accuracy, with Logistic Regression reaching 0.97, Random Forest 0.98, LightGBM 0.97, and XGBoost 0.97, demonstrating the effectiveness of the approach in accurately predicting patient readmissions. The findings from this project contribute to the growing body of literature on readmission prediction and provide insights into the factors influencing patient outcomes and healthcare resource utilization. By helping to understand and prevent unnecessary readmissions, this work aims to contribute to improved patient outcomes and reduced healthcare costs.

Anand, Gurjyot Singh – Credit Card Fraud Detection Using Machine Learning (Supervisor: Amira Ghenai)

With the rise in electronic financial transactions in recent years, credit card fraud has become an increasingly pressing issue, posing significant threats to both financial institutions and their customers. This research offers a thorough machine learning approach for detecting suspicious credit card activities. Through an in-depth analysis, we identified challenges posed by imbalanced datasets in fraud detection. We utilized various machine learning models like Random Forest, KNN, SVM, and Logistic Regression, optimizing them through hyperparameter adjustments and data resampling techniques such as SMOTE. Performance evaluation metrics such as precision, recall, and ROC-AUC score were used for benchmarking. The StratifiedKFold method was employed to ensure uniform representation of classes during data splits. Our study not only underscores the importance of addressing dataset imbalance but also paves the way for enhanced electronic transaction security.
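
A minimal Python sketch of the resampling-plus-cross-validation workflow described above, assuming scikit-learn and imbalanced-learn are available; the random placeholder data and the single Random Forest model are illustrative stand-ins, not the project's exact pipeline.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from imblearn.over_sampling import SMOTE

# Placeholder data: X (features) and y (0 = legitimate, 1 = fraud) stand in for the real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.05).astype(int)   # heavily imbalanced labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Resample only the training fold so the test fold keeps the true class ratio.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X[train_idx], y[train_idx])
    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_res, y_res)
    proba = model.predict_proba(X[test_idx])[:, 1]
    pred = (proba >= 0.5).astype(int)
    print(f"fold {fold}: precision={precision_score(y[test_idx], pred, zero_division=0):.3f}",
          f"recall={recall_score(y[test_idx], pred, zero_division=0):.3f}",
          f"roc_auc={roc_auc_score(y[test_idx], proba):.3f}")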

Awan, Hina – Comparing Several Machine Learning and Deep Learning Models to Detect Health Care Providers’ Fraud in Insurance Claims (Supervisor: Farid Shirazi)

The potential for financial losses due to false claims is one of the most serious and well-known challenges that all insurance businesses face. Healthcare fraud is increasingly acknowledged as a severe social issue. Health insurance fraud uses intentional deceit or misrepresentation to obtain a financial benefit in the form of health expenses. Such claims are difficult to identify, and numerous traditional approaches have been shown to be ineffective.
In this paper, exploratory data analysis is first conducted to identify the critical characteristics of potentially fraudulent providers' behavior. The second part covers the training and evaluation of various supervised machine learning algorithms, such as Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and XGBoost, as well as deep learning models, applied to a dataset publicly available on Kaggle. The confusion matrix and the precision, recall, AUC, and F1-score metrics are used to evaluate model performance. The study found that a neural network model trained on over-sampled data outperformed the other models.

Chanana, Gazal – Automobile Insurance Fraud Analysis and Detection (Supervisor: Ravi Vatrapu)

The report documents the journey undertaken to build an effective predictive model capable of distinguishing fraudulent claims from legitimate ones. Leveraging a real-world dataset, the project employs feature engineering, model selection, performance evaluation, and interpretation of results. I began this project by meticulously addressing data preprocessing challenges, including outlier handling, feature selection, and scaling. The report elaborates on the rationale behind each step, emphasizing the importance of effective feature engineering in model performance. The heart of the report lies in the careful selection and training of machine learning models, such as Random Forest, Logistic Regression, Gradient Boosting, SVM, Decision Tree, and XGBoost. A detailed exposition delves into the mechanisms and parameters of each model, analyzing their performance across multiple metrics, including accuracy, precision, recall, F1-score, and ROC AUC. The comprehensive analysis of ROC AUC values and the trade-offs between precision and recall further refines the model selection process.

Choudhry, Soheer – Predicting US Wireless Carrier Operator Performance Using Social Media Data and Google Trends (Supervisor: Ozgur Turetken)

This study proposes to analyze data from the U.S. telecom sector, specifically focusing on the major telcos, which include AT&T, Sprint, Verizon, and T-Mobile, and to develop a means of providing predictive insights on financial performance using sentiment data.
The objective is to combine sentiment and popularity data as a means of effectively predicting directional future performance. The developed score will be tested against identified performance indicators to predict future performance by using analytics techniques, including simple regression (baseline), neural networks and ARIMA models. 
The findings suggest that while neural networks consistently outperform other predictive models – i.e., regression and ARIMA – these other models also produce relatively consistent results when predicting operative efficiency. A key consideration is the required accuracy: while the neural network model outperforms the others, it is also resource- and time-intensive. Notably, social media sentiment data combined with a popularity index performed satisfactorily. When the volume and scale of data were simplified, the neural network model's performance decreased significantly while the regression model improved across several response variables. This may be explained by the fact that significant information is lost when aggregating data points, which is a significant disadvantage for the neural network model, while conversely the simplification of the datasets was a significant advantage for the regression model. Across all models, some response variables consistently did not perform as well as others.

Cobbinah, Maame – A Deep Learning Approach to Forecast the Canadian Consumer Price Index (CPI) Using Encoder-Decoder Attention Mechanisms with Teacher Forcing Techniques (Supervisor: Aliaa Alnaggar)

In the wake of the COVID-19 pandemic, the global economy witnessed a significant upsurge in inflation rates, causing a substantial erosion of consumer purchasing power and raising concerns about an impending recession. A key metric for assessing inflation is the Consumer Price Index (CPI), which tracks the average price change of different baskets of goods and services over time.
Accurate forecasting of CPI is crucial for policymakers to strategically manage inflation as well as make well-informed decisions around fiscal and monetary policy. This paper proposes a novel deep learning approach that utilizes an RNN encoder-decoder attention model with teacher forcing to forecast the Canadian Consumer Price Index (CPI). Although this method is quite complex in its implementation, it performs well with macroeconomic data, such as the CPI, due to its ability to model long-term dependencies in inflationary variables and reduce forecasting errors.
The experimental part of this paper evaluates the performance of the proposed method against traditional statistical methods, supervised learning techniques, and other deep learning approaches. The results show that the RNN encoder-decoder attention model with teacher forcing outperforms existing methodologies in the literature, generating lower errors and improving R-squared values across all CPI indicators.
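
The sketch below illustrates the teacher forcing mechanism at the heart of the proposed model: a minimal PyTorch recurrent encoder-decoder (attention omitted for brevity) in which, at each decoding step, the ground-truth value is fed back with some probability instead of the model's own previous prediction. Layer sizes, forecast horizon, and tensor shapes are illustrative assumptions, not the project's actual architecture.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, n_features=1, hidden=64, horizon=12):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.decoder = nn.GRUCell(1, hidden)
        self.head = nn.Linear(hidden, 1)
        self.horizon = horizon

    def forward(self, x, target=None, teacher_forcing_ratio=0.5):
        _, h = self.encoder(x)            # h: (1, batch, hidden)
        h = h.squeeze(0)
        step_in = x[:, -1, :1]            # last observed value starts the decoding
        outputs = []
        for t in range(self.horizon):
            h = self.decoder(step_in, h)
            pred = self.head(h)
            outputs.append(pred)
            if target is not None and torch.rand(1).item() < teacher_forcing_ratio:
                step_in = target[:, t:t + 1]   # teacher forcing: feed the ground truth
            else:
                step_in = pred.detach()        # free running: feed the model's own prediction
        return torch.cat(outputs, dim=1)

model = Seq2Seq(n_features=1, hidden=32, horizon=6)
x_hist = torch.randn(8, 24, 1)            # 8 series, 24 past monthly observations
y_future = torch.randn(8, 6)              # ground truth, used only for teacher forcing
y_hat = model(x_hist, target=y_future, teacher_forcing_ratio=0.5)
print(y_hat.shape)                        # torch.Size([8, 6])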

Da Silva, Brian Jones – Enhancing Time Series Forecasting Accuracy With Ensemble Methods for Multiple Seasonality in Sales and Retail (Supervisor: Ravi Vatrapu)

Time series forecasting plays a crucial role in the retail industry, as accurate sales predictions enable businesses to optimize inventory management, supply chain operations, and resource allocation. Retail sales data often exhibit multiple seasonal patterns, such as daily, weekly, and annual cycles, which can be challenging to capture using traditional forecasting models. This research aims to explore the effectiveness of ensemble methods in improving the accuracy of time series forecasts by combining the strengths of multiple models to capture various seasonal patterns and other complexities in retail sales data.

Ekundayo, Matthias – Predicting House Prices Using Advanced Regression Techniques (Supervisor: Saman Hassanzadeh Amin)

This project explores different advanced regression models for predicting house prices. It involves the use of Linear Regression, Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), and Multi-Layer Perceptron (MLP) to predict the price of houses in a given dataset. Part of the experiments carried out includes using only features that have a significant correlation with the label. The selected features include both numerical features and categorical features, the latter one-hot encoded so the models can work better with the data. At the end of the different experiments carried out in this project, XGBoost was the algorithm that performed best on the test dataset. The ability to predict the appropriate price of houses is beneficial to both the seller and the buyer of a house, and to the overall economy at large. With the increase in data around the world and better access to these data, there is huge potential for the prediction of house prices and the real estate market in general.

Elahi, Muzammil – Household Temporal Space Heating Demand Prediction Using Machine Learning Modelling (Supervisor: Alan Fung)

The purpose of this research is to develop a generalized model that ideally does not rely on time series data to predict the space heating demand of a household on an hourly basis. The data were collected from Ecobee Smart thermostats, which provide readings at 5-minute intervals. In the exploratory data analysis phase, the data first had to be cleaned of missing values and aggregated into hourly intervals. From there, a variety of quantitative and visual techniques were used to find patterns, insights, and relationships in and between the variables. Furthermore, new features were created through feature engineering to help gain additional insights. The most significant feature was found to be the difference between the indoor temperature and the heating setpoint temperature. In the modelling phase, four different algorithms were considered and implemented: an Artificial Neural Network, Linear Regression, a Generalized Linear Model, and Random Forest, each repeated with different parameters for model comparison. The metrics used to evaluate the models were mean squared error, root mean squared error, mean absolute error, and R2. Based on the first three metrics the Neural Network model performs best, whereas based on the last the Random Forest model performs best. In the future, models such as Long Short-Term Memory should be implemented, as they can take advantage of the time series features in the data, and all months should be used in training and testing instead of only the winter months. The data themselves are limited to Toronto and will not scale well to other parts of the world.

Farrukh, Fatima Faiza – Improving Access to Hospital Beds Through Time Series Forecasting and Mathematical Optimization (Supervisor: Aliaa Alnaggar)

The COVID-19 pandemic has underscored the urgent need to effectively manage and allocate hospital resources in response to unforeseen surges in demand. This research paper introduces a resource management framework to address the challenges of allocating hospital resources during crises, such as pandemics and epidemics. The study focuses on hospital bed demand patterns in 20 Texas cities but the approach can be adapted to various settings and resource types.
The framework comprises two key phases:

• Time Series Forecasting: Predicting future bed demands using meticulous forecasting techniques.

• Mathematical Optimization: Developing optimal strategies to handle demand surges beyond a hospital’s capacity. This may involve creating expansion centers within the hospital or transferring patients to higher-capacity Trauma Service Areas nearby.

The paper underscores the practicality and importance of a comprehensive system for resource allocation during crises, enhancing healthcare system resilience and responsiveness.
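
As a rough illustration of the optimization phase, the sketch below formulates a single-city surge-allocation problem as a small linear program using the PuLP library; all capacities and costs are hypothetical, and the actual model in the paper is considerably richer.

import pulp

# Hypothetical single-city example: forecast demand exceeds base capacity, and the
# surplus must be split between on-site expansion beds and transfers to a nearby
# Trauma Service Area. All numbers below are illustrative assumptions.
forecast_demand = 540      # beds needed (output of the forecasting phase)
base_capacity = 400        # existing staffed beds
expansion_limit = 100      # maximum expansion-centre beds
transfer_limit = 200       # beds available in the neighbouring TSA
cost_expand, cost_transfer = 1.0, 2.5   # relative cost per bed

prob = pulp.LpProblem("surge_allocation", pulp.LpMinimize)
expand = pulp.LpVariable("expansion_beds", lowBound=0, upBound=expansion_limit)
transfer = pulp.LpVariable("transferred_patients", lowBound=0, upBound=transfer_limit)

prob += cost_expand * expand + cost_transfer * transfer          # objective: minimize cost
prob += expand + transfer >= forecast_demand - base_capacity     # cover the excess demand

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status], expand.value(), transfer.value())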

Go, Carlos – Synthrad2023 – Synthesizing Computed Tomography for Radiotherapy (Supervisor: Naimul Khan)

In radiotherapy, two of the most widely used medical imaging modalities are Magnetic Resonance Imaging (MRI) and Computed Tomography (CT). Each modality has its own benefits and advantages over the other; however, it is often important to have both modalities to make a proper diagnosis in radiotherapy. As such, a challenge hosted on https://grand-challenge.org/ by several universities around the world aims to generate synthetic CT (sCT) images from either an MRI or cone-beam computed tomography (CBCT). In this paper, I tackle the first task of generating sCTs from MRI images. I propose to use an emerging deep learning framework based on diffusion and score-matching models. My results show that while the images generated by the model are not a perfect match with the target images, the approach presents an opportunity for better performance with a setup that has more computational power available.

Gong, Li – Combining Graph Convolutional Networks and Generative Adversarial Networks for Robust Recommendation Systems (Supervisor: Pawel Pralat)

Recommendation systems play a key role in delivering personalized user experiences, yet they face persistent challenges such as sparse interaction data and the cold-start problem. This study investigates the effectiveness of the Light Graph Convolutional Network (LightGCN) model in generating user and item embeddings from sparse interaction data and evaluates its potential to provide diverse and effective recommendations. Furthermore, the study explores the integration of Generative Adversarial Networks (GANs) to mitigate the cold-start problem by generating synthetic item embeddings, expanding the pool of recommendations for new or rarely interacted-with users and items. LightGCN surpasses Matrix Factorization in handling data sparsity, while GANs enhance suggestions for users with limited interaction history. The findings collectively contribute to the advancement of recommendation systems, offering a robust framework to tackle challenges arising from sparse data and the cold-start problem.

 

Gupta, Snigdha – StreetSmart – A Legal Traffic-Bot Using Generative Language Models (Supervisor: Mucahit Cevik)

It is not easy for the general public to understand the complex legal language of traffic rules and regulations. This chatbot attempts to make people more aware of their rights and responsibilities. "StreetSmart" aims to improve public awareness by taking in user queries and providing answers in accordance with the laws while understanding the context. While a pre-trained T5 model was considered initially, the decision shifted to the GPT-2 model due to its more natural and engaging conversational style. This model was fine-tuned to suit the chatbot's requirements. Research data were gathered from reputable online sources such as the Government of Ontario website and legalline.ca. Our numerical results show that this chatbot has the potential to help prevent unintended violations of traffic laws by promoting a better understanding of these laws and spreading awareness among the people, ultimately improving the level of safety and order on the road.

Hossain, Nusrat – Analysis of the Effectiveness of MRI Image Data Vs. Tabular Data When Training Classifiers for Dementia Diagnosis (Supervisor: Sharareh Taghipour)

Background: Machine learning algorithms make dementia diagnosis easier than the traditional approach. However, along with the classifiers, the data used to train such algorithms are also important and must be chosen wisely. Aim: To compare several algorithms trained on two different data types and examine which datatype is most suitable. Methodology: MRI image data and tabular data are being compared. The algorithms being trained are deep neural networks, logistic regression, decision tree, and random forest. A series of experiments are conducted in which the data and algorithms are modified to obtain the best results. Results: The deep neural network trained on image data performed best with an accuracy of 96.96%. However, after undersampling the image data, the classifiers with the best performances were the deep neural network, random forest, and decision tree, all trained on tabular data with accuracies of 83.53%, 80.84%, and 78.87%, respectively. Conclusions: Tabular data containing demographic, diagnostic, and anatomical information is better to use when training algorithms for dementia classification. However, this type of data can be sparse whereas MRI image data is more abundant.

Kaski, Alicia – Transfer Learning Computer Vision Techniques for Early Dementia Diagnosis Using Brain MRI (Supervisor: Farid Shirazi)

Early detection of dementia is crucial for facilitating well-informed care management and improving patient quality of life. Past literature has explored the potential of applying transfer learning computer vision techniques to diagnose dementia using brain MRIs with promising results. However, the datasets used to train these models have been limited. This study aims to investigate transfer learning computer vision techniques for dementia diagnosis using the novel OASIS-4 dataset, which contains 3D brain MRI scans of 663 subjects with varying stages of dementia. The proposed methodology involves modifying and comparing the performance of three convolutional neural networks with pre-trained ImageNet weights (Inception-v3, ResNet50, VGG19) to automatically detect dementia from 3D brain MRIs. Additionally, this study explores the prediction times of the proposed models to assess their inference latencies. According to the findings, the VGG19 model achieved the highest accuracy (83.08%), recall (90.52%), and F1-score (90.52%), while the Inception-v3 model demonstrated the highest precision (93.07%) of all three models. The VGG19 model was also significantly faster at classifying 3D brain images than the other two models (p<0.0001). In conclusion, our findings suggest that transfer learning with pre-trained models, especially VGG19, can effectively classify dementia from 3D brain MRIs, emphasizing the potential of these techniques for enhancing diagnostic processes in clinical settings.
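
A minimal transfer-learning sketch in Keras showing how a VGG19 backbone with frozen ImageNet weights can be extended with a small classification head. It operates on 2D inputs, whereas the study works with 3D MRI volumes, so it should be read as a simplified, slice-level illustration rather than the paper's actual setup; the head layers and placeholder datasets are assumptions.

import tensorflow as tf

# Load the pre-trained VGG19 feature extractor and freeze its weights.
base = tf.keras.applications.VGG19(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # dementia vs. no dementia
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(train_ds, validation_data=val_ds, epochs=10)  # train_ds/val_ds are placeholders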

Li, Ang – Adaptive Learning Management Systems Through Deep Learning (Supervisor: Ravi Vatrapu)

Most modern Learning Management Systems rely on instructor input for student feedback. For computer science (CS) courses, automated unit testing can assist with overall grading but cannot provide individualized or issue-specific feedback. Furthermore, code beyond the erroneous section is not assessed. This could result in both students and instructors being unaware of shortfalls, leading to repeated mistakes and learner dissatisfaction. This MRP aims to use a deep learning model to augment unit testing by first accurately predicting code correctness on a university CS course submission repository [1], and then attempting to use Local Interpretable Model-agnostic Explanations (LIME) to provide code correctness saliency. The results show that a trained neural network can identify incorrect code and errors with a high level of accuracy for a diverse array of Python tasks. Furthermore, the feedback is suitable for refinement into dashboard-based analytics for rapid and wide dissemination down to the individual level.

Li, Shijie – Comparing Supervised Learning and Reinforcement Learning Algorithms for Real-Time Inventory Management (Supervisor: Mucahit Cevik)

The problem setting considered in this paper involves a grocery store that orders its inventory from two available sources. One costs less but requires more days for delivery, and the other costs more but guarantees one-day delivery. The grocery store must optimize the order amounts between the two available sources to avoid stockouts and overstocking. This paper proposes three approaches to managing the inventory: a naive approach, a supervised learning approach with LightGBM, and a reinforcement learning approach with DQN. Numerical results show that the supervised learning approach outperforms both the naive approach and the reinforcement learning approach in single-product, single-store inventory management problems. Some key advantages of supervised learning are higher flexibility, faster training speed, and robust output. The reinforcement learning approach with DQN may achieve adequate performance but requires significant work on hyperparameter tuning and model training.

Marcogliese, James – Culinary Puzzle Pieces: Higher-Order Ingredient Pairings in World Cuisines (Supervisor: Ravi Vatrapu)

This study applies data-driven techniques to explore the culinary structure of world cuisines, identifying correlations between shared and unique flavour compounds and frequent ingredient combinations. Analyzing food pairing and bridging, the research categorizes cuisines into four classes based on their flavour compound pairing and bridging characteristics. Frequent pattern mining is employed to identify common ingredient tuples across cuisines. It further utilizes decision tree classifiers to predict whether a cuisine's tendency towards flavour pairing or bridging can be inferred from its ingredient combinations. The findings can inform recipe creation, enhance culinary tradition understanding, and advance computational gastronomy by aiding AI recipe generators to generate palatable and tradition-respecting recipes.

Naupada, Lakshmi – Wealthbot: Development of Chatbot for Investment Management (Supervisor: Sharareh Taghipour)

In the ever-evolving landscape of financial services, integrating cutting-edge technologies has proven instrumental. This report presents "Investment Bot," a sophisticated chatbot enriching user experiences and informed decision-making in wealth management. 
Powered by the adaptable Rasa interface, Investment Bot offers comprehensive assistance across mutual funds, ETFs, and stocks, empowering users to make informed investment choices with ease. It addresses common investment challenges by leveraging real-time financial data and advanced analytics, tailoring recommendations to individual risk profiles and goals.
The development process included rigorous testing and iterative refinement, enhancing the chatbot's accuracy and responsiveness. In conclusion, Investment Bot exemplifies the potential of AI-driven chatbots in reshaping wealth management. It signifies the convergence of AI, finance, and user-centric design, promising a more accessible and informed future for investment management in the ever-evolving financial services landscape, benefiting users seeking to optimize their investments.

Nguyen, Gia Viet Huy – Machine Learning Approach on Ethereum Transaction Classification (Supervisor: Ravi Vatrapu)

Ethereum, the world's second-largest cryptocurrency, serves multiple purposes, including financial transactions, smart contracts, and data storage. However, the lack of regulation attracts bad actors, leading to calls for bans. To counter this, this paper aims to classify Ethereum transactions into categories such as 'DeFi', 'Token Contract', 'DAO', and 'Verified Contract' using supervised machine learning techniques on a comprehensive dataset. This classification could help detect suspicious transactions and potentially aid the decision-making of policymakers seeking to regulate the new blockchain sector. A public dataset on Kaggle and Google BigQuery is utilized for the scope of this paper. Machine learning models such as XGBoost, LightGBM, multi-layer perceptron, and LSTM are leveraged for classification. This paper also suggests suitable models for crypto investigation or live monitoring based on performance metrics and training time.

 

Noor, Faiza Sabiha – Time Series Forecasting Methods of Toronto’s Housing Market (Supervisor: Aliaa Alnaggar)

In recent years, significant surges in housing prices in Toronto have disrupted historical patterns of steady seasonality and trends, presenting challenges for accurate price forecasting. An evaluation of underlying causes, such as external macroeconomic impacts, could aid in better forecasting of future trends. In this paper, time series forecasting of Toronto's median house price is evaluated by implementing statistical and deep learning models that can incorporate exogenous macroeconomic factors. These models include ARIMA, SARIMA, SARIMAX, and LSTM. The data consist of monthly median house prices and exogenous variables such as inflation and several different bank rates. Models are trained to make quarterly and yearly predictions for the years 2020 to 2022. We compare the performance of the various forecasting models and assess the value of incorporating exogenous variables. The results indicate that considering the exogenous variables in yearly forecasts can generally reduce the mean absolute percent error of the predicted prices.
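
A minimal statsmodels sketch of a SARIMAX model with exogenous regressors, in the spirit of the approach above. The synthetic monthly series, the column names (inflation, bank_rate), and the (p,d,q)(P,D,Q,s) orders are illustrative assumptions, not the study's actual data or configuration.

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic stand-in for a monthly median-price series and two exogenous variables.
idx = pd.date_range("2010-01-01", periods=144, freq="MS")
rng = np.random.default_rng(1)
exog = pd.DataFrame({"inflation": rng.normal(2, 0.5, 144),
                     "bank_rate": rng.normal(1.5, 0.3, 144)}, index=idx)
trend = np.linspace(400, 900, 144)
season = 20 * np.sin(2 * np.pi * idx.month / 12)
values = trend + season + 30 * exog["inflation"].to_numpy() + rng.normal(0, 10, 144)
price = pd.Series(values, index=idx, name="median_price")

train, test = price[:-12], price[-12:]
model = SARIMAX(train, exog=exog[:-12], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
res = model.fit(disp=False)
forecast = res.forecast(steps=12, exog=exog[-12:])   # future exogenous values must be supplied
mape = (abs(forecast - test) / test).mean() * 100
print(f"MAPE over the held-out year: {mape:.2f}%")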

Raj, Ritik – Predicting Impact of Marketing Channels and Digital Marketing KPIs with Machine Learning (Supervisor: Ravi Vatrapu)

In this project, we use machine learning to address contemporary challenges in assessing the impact of diverse marketing channels on revenue and digital KPIs. Our goal is to predict the effects of marketing avenues, like email, social media, SEO, PPC, and display advertising, on revenue and digital KPIs. We analyze historical data using regression, decision trees, and marketing mix algorithms, considering various time frames (daily, weekly, monthly, quarterly). 
We leverage the "DT Mart Dataset" to create models like Marketing Mix Modeling (MMM) and Brand Equity models, assessing both short-term and long-term impacts of marketing efforts on sales and brand valuation. Key research questions include the effectiveness of different channels and using machine learning to predict impact and set achievable targets.

Rutkowska, Agnieszka – Deep Reinforcement Learning for Disaster Recovery Orchestration in Data Centers (Supervisor: Mucahit Cevik)

Ensuring the continuous operation of IT infrastructure is of paramount importance for contemporary businesses. Recognizing the evolving complexities and inherent limitations of traditional disaster recovery measures, this research pioneers the use of Deep Reinforcement Learning. By leveraging a Double Deep Q-Network model, the study identifies optimal recovery sequences in the event of comprehensive data center failures. The methodology is tailored to meet three main objectives: minimize breaches in recovery targets, optimize the operational ratio, and expedite recovery durations. The results suggest that while the DDQN model exhibits proficiency in adapting to primary patterns, its effectiveness remains on par with existing methodologies. In summation, this research highlights both the potential benefits and challenges of integrating DRL into DR strategies for data centers, offering insights for future advancements in the field.

Sanford, Griffin – Analysis of Explainable AI Techniques on House Market Prices (Supervisor: Alexey Rubtsov)

In this paper, explainable artificial intelligence techniques were used to interpret an XGBoost model. The technique used was SHAP values, which are based on cooperative game theory and assign each feature a weight reflecting its contribution to the final prediction. SHAP was compared to other popular feature importance techniques such as correlation, gain, and weight, and it simplified the feature importance process. SHAP was also used to identify weak features that could be removed from training to reduce the complexity of the model and somewhat reduce unnecessary noise. Removing these features neither improved nor reduced performance, which showed that SHAP was correct in identifying them as uninformative. SHAP also proved extremely useful in explaining individual prediction instances and how each result was reached. A few instances with large errors were examined using SHAP values to identify them as outliers.
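
A short sketch of the SHAP workflow described above: a tree explainer attributes each XGBoost prediction to individual features, and averaging absolute SHAP values gives a global importance ranking. The synthetic data and feature names below are placeholders, not the project's dataset.

import numpy as np
import pandas as pd
import shap
import xgboost

# Synthetic stand-in for housing data; feature names are hypothetical placeholders.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 5)),
                 columns=["lot_area", "year_built", "rooms", "garage", "basement"])
y = 3 * X["lot_area"] + 2 * X["rooms"] + rng.normal(scale=0.5, size=500)

model = xgboost.XGBRegressor(n_estimators=200, max_depth=3).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)          # shape: (n_samples, n_features)

# Global importance: mean absolute SHAP value per feature.
importance = np.abs(shap_values).mean(axis=0)
for name, score in sorted(zip(X.columns, importance), key=lambda p: -p[1]):
    print(f"{name:10s} {score:.3f}")

# Local explanation for a single (e.g., high-error) prediction.
print(dict(zip(X.columns, shap_values[0].round(3))))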

 

Singh, Bhupinder – Predicting Drought Susceptibility in the US Using Meteorological Data (Supervisor: Alan Fung)

Global temperatures have unequivocally become hotter, and hotter conditions lead to extreme weather, including severe drought. Regionally, the driest parts of the earth are getting drier. This research aims to investigate whether droughts in the US can be predicted using meteorological data alone. First, an exploratory analysis of the dataset, comprising open data offered by the NASA POWER Project and the US Drought Monitor, was conducted to determine characteristics and points of interest. Following this, the dataset was preprocessed to generate suitable feature and response matrices for machine learning. Four regression algorithms, namely K-Nearest Neighbors, Decision Tree, Random Forest, and Support Vector Regression, were trained to predict the drought scores and were evaluated using the R2, MAE, MSE, and RMSE metrics. Random Forest was found to be the best model considering all evaluation metrics and run-time. The results show that suitable meteorological data indicators can be used to predict drought susceptibility with reasonable accuracy.

Singh, Rashmi – An Ensemble Approach to Loan Default Prediction (Supervisor: Alexey Rubtsov)

The purpose of this study is twofold: to thoroughly examine loan default prediction and credit scoring, understanding who is likely to be a good loan payer and which loans carry risk, and to present future directions for researchers in the domain of loan default prediction. An ensemble approach is used, combining two machine learning tools: decision trees and logistic regression. Logistic regression models the probability of a discrete outcome given input variables; despite its name, it is a classification model rather than a regression model, and it is a simple, efficient method for binary, linearly separable classification problems that is widely employed in industry. Decision trees are a non-parametric supervised learning method used for classification and regression; the goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features, and a tree can be seen as a piecewise constant approximation. The goal of the ensemble model is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability and robustness over a single estimator. Both logistic regression and decision tree models offer advantages in prediction accuracy and interpretability; their strengths are complementary, and they can be applied in various domains depending on the characteristics of the dataset and the desired trade-off between model complexity and performance.
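
A minimal scikit-learn sketch of the proposed ensemble, combining logistic regression and a decision tree in a soft-voting classifier; the synthetic, imbalanced dataset stands in for real loan data, and the hyperparameters are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a loan dataset (1 = default, 0 = repaid).
X, y = make_classification(n_samples=2000, n_features=12, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ],
    voting="soft",   # average predicted probabilities from both base models
)
ensemble.fit(X_tr, y_tr)
print(classification_report(y_te, ensemble.predict(X_te)))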

Patel, Smit – Improving Customer Churn Prediction Through Segmentation (Supervisor: Ozgur Turetken)

Customer churn prediction models have been proven to provide companies with the ability to determine whether a customer will churn based on a variety of factors. However, this space still has room for improvement, and this paper aims to determine whether customer churn prediction accuracy can be improved by using customer segmentation to cluster customers into groups based on similarities in their buying patterns and company engagement behaviors. Through this process, companies will also gain the ability to better understand which variables lead to a higher churn rate while also being able to use these clusters for targeted marketing initiatives.

So, Brandon – Attentional Bidirectional LSTM for Instrumental Music Synthesis (Supervisor: Sridhar Krishnan)

This paper aims to create a lightweight attentional Bi-LSTM model to generate instrumental music. Two datasets are used: a collection of jazz MIDI files from the Weimar Jazz Database and an assortment of classical piano MIDI files. We use the Python toolkit music21 to extract relevant music data and to generate new MIDI files from predicted outputs. An exploratory analysis of the MIDI dataset is conducted to find patterns in the music data that assist in data preprocessing. Five models with varying inputs, outputs, and architectures are proposed and trained. The generated music is compared using a mean opinion score (MOS) from surveyed individuals. We find that music continuations generated by an attentional bidirectional LSTM model achieve a MOS close to that of original music scores, while maintaining or lowering the computational complexity of previous similar LSTM music generation models.

Sun, Zeyuan – Enhancing Medical Transcription Information Extraction using Transformer Models and Doccano Annotation (Supervisor: Mucahit Cevik)

Medical transcription, pivotal to healthcare documentation, transforms oral clinical narratives into detailed written records. However, the intricate medical lexicon often introduces ambiguities, potentially compromising the accuracy of information extraction. This study’s primary objective is to ascertain the proficiency of prominent Natural Language Processing (NLP) models—Bidirectional Encoder Representations from Transformers (BERT), Robustly Optimized BERT Pretraining Approach (RoBERTa), and Distilled version of BERT (DistilBERT)—in surmounting these challenges, utilizing the Mprepver Kaggle dataset as the testbed. Initial annotation of the data was facilitated using the Doccano tool. Our evaluations reveal disparate performances across the models: BERT achieved a modest accuracy of 0.11, while both RoBERTa and DistilBERT demonstrated an impressive accuracy of 0.98. This stark divergence underscores the criticality of model selection in tasks as imperative as medical transcription, with RoBERTa and DistilBERT emerging as potential frontrunners in the domain. The insights garnered from this study could set new benchmarks in medical transcription processing, emphasizing the strategic deployment of NLP models for optimal outcomes.

Wu, Yu Heng – Sentiment Analysis of Online Reviews From YELP Open Dataset (Supervisor: Farid Shirazi)

In the digital age, online reviews have become a central component in driving consumer choices. This study focuses on sentiment analysis of Yelp reviews, juxtaposing traditional machine learning (ML) algorithms (Naïve Bayes, Logistic Regression, Random Forest, Support Vector Machines) against the contemporary BERT model. Drawing from a vast dataset of over 6 million reviews, a balanced training set was derived by undersampling prevalent 5-star reviews. Our key objectives encompass both categorizing reviews into positive or negative sentiments and predicting precise star ratings. Remarkably, while conventional ML models demonstrated a range of accuracy levels, BERT stood out with its proficiency, particularly in positive/negative sentiment classification, reaching a flawless accuracy rate. These findings underscore BERT's potential in complex sentiment tasks, even as traditional models showcase notable abilities. The performance of each model is evaluated based on classification reports and a confusion matrix.

 

Zmytrakov, Yurii – Exploring Deep Learning and Machine Learning Techniques to Prevent Payment Frauds with the Help of Ensemble Learning (Supervisor: Farid Shirazi)

The purpose of this research project is to develop a generalized model for preventing fraudulent online transactions. This binary classification model can be implemented in the online payment platforms used by online retailers and ultimately help them reduce losses caused by fraudulent activities. Furthermore, this paper identifies which factors contribute to the accuracy and effectiveness of fraud-transaction classification algorithms and how those algorithms can be optimized to reduce false positives and false negatives. To detect fraudulent activities, features are constructed based on the domain knowledge of industry experts and on best practices in machine and deep learning.

Ahmad, Agha Anas – Anomaly Detection in Financial Accounting Data Using Deep Learning (Supervisor: Farid Shirazi)

In financial statement audits, detecting fraud in accounting data has long been a difficulty. The bulk of currently used approaches are based on hand-crafted rules derived from well-known fraud cases. While effective, these rules fail to generalise beyond recognised fraud scenarios, allowing fraudsters to progressively learn ways to get around them. More sophisticated techniques built on deep learning's recent success, on the other hand, typically lack interpretability of the discovered findings. The main objective of the current research work is to develop a framework for building an anomaly detection algorithm. The research proposes the use of adversarial autoencoder networks to address this problem, such that artificial neural networks can learn to represent real-world journal entries in a semantically meaningful way. The learnt representation gives a comprehensive perspective of a group of journal entries and boosts performance dramatically. This research also demonstrates how such a representation, along with the network's reconstruction error, may be used to perform an unsupervised, highly adaptable anomaly evaluation.

Azoulay, Adam – Reinforcement Learning-based Peak Shaving for Home Energy Management (Supervisor: Mucahit Cevik)

This paper aims to assess the performance of deep reinforcement learning (RL) in shaving peaks in power consumption to reduce user costs and strain on power generators. Conventionally, forecasting and rule-based policies have been used to implement peak shaving strategies. Deep RL has been used to control home energy power usage to reduce costs, but not to explicitly shave peaks in power usage. We propose using deep RL to generate a peak shaving policy by using a simulated power consumption environment and assess its relative performance compared to the traditional forecasting-based policy. The deep RL policies significantly outperform the forecasting-based policy on the testing data, reducing the costs and removing the need for a custom algorithm. Using deep RL is a valid approach to solving the peak shaving problem and a good alternative to traditional forecasting-based peak shaving methods.

Bera, Swayami – Data Augmentation Strategies To Improve Natural Language Processing for Software Requirement Texts (Supervisor: Mucahit Cevik)

This work aims to study the effects of using data augmentation techniques on the performance of different natural language processing (NLP) tasks such as text classification and named entity recognition using different text datasets from the requirement engineering domain. Software requirement specification (SRS) documents lay the groundwork for the development of products and outline the precise functional and non-functional requirements of a certain piece of software. Hence it is deemed crucial to have an automated system that is able to sort and identify critical entities and anomalies from these documents. However, one of the major obstacles in using machine learning approaches to automate these tasks is the lack of labeled or annotated textual data available. We show that the use of simple data augmentation techniques can combat this issue. We also introduce a structured combination of individual augmentation techniques that shows improved performance for the considered NLP tasks.
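
As a concrete example of the kind of simple augmentation operations discussed above, the sketch below implements two EDA-style transformations (random swap and random deletion) on a requirement-like sentence; these are illustrative operations and an invented example sentence, not necessarily the exact combination used in this work.

import random

def random_swap(tokens, n_swaps=1):
    # Swap the positions of two random words n_swaps times.
    tokens = tokens[:]
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    # Drop each word with probability p, always keeping at least one word.
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]

random.seed(7)
requirement = "The system shall encrypt all stored user credentials at rest".split()
for _ in range(3):
    print(" ".join(random_swap(random_deletion(requirement))))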

Bhatt, Bhakti – Cyber Resilience Topic Modelling (Supervisor: Atefeh Mashatan)

Cyber resilience is receiving increasing attention today. By definition, cyber resilience enables business continuity by preparing for, responding to, and recovering from cyber incidents. The aim of this study is to investigate the themes and trends in cyber resilience research using text mining tools. We applied bibliometric analysis and Latent Dirichlet Allocation (LDA) topic modelling to text from 1252 academic articles published between 1995 and 2022 to empirically identify key research topics and their respective trends. The findings reveal promising interest in cyber resilience. We also extrapolate and present the topics in the academic literature before and after the pandemic to draw comparisons. Thus, we present 29 topics and two high-level themes in cyber resilience research, their evolution over time, and a comparison of the topics in the literature pre- and post-pandemic. The results highlight research gaps in the domain, which will aid researchers in making informed decisions on future studies.
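
A minimal scikit-learn sketch of LDA topic modelling on a bag-of-words matrix; a public newsgroup corpus and five topics stand in for the 1252 cyber-resilience articles and 29 topics analyzed in the study, and the vectorizer settings are illustrative assumptions.

from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Public newsgroup posts stand in for the corpus of article abstracts.
docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:2000]

vectorizer = CountVectorizer(max_df=0.9, min_df=5, stop_words="english")
dtm = vectorizer.fit_transform(docs)          # document-term matrix

lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(dtm)

# Print the top words for each discovered topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-8:][::-1]]
    print(f"topic {k}: {', '.join(top)}")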

Chaurah, Abhishek – Bike Share Toronto (Supervisor: Ravi Vatrapu)

Bike Share Toronto, launched in 2011, is Canada's second-largest public bike share system and is heavily used by locals for different purposes. The aim is to implement various machine learning models and compare their performance metrics. Performing Exploratory Data Analysis (EDA) will also help the Toronto Parking Authority gain insights about bike share usage at different stations, helping them exercise good judgment and improve their service so users can commute locally without having to worry about the availability of bikes at any given time.

Czobit, Melania – Exploring the Effect of Daytime Physical Activity on Sleep Quality (Supervisor: Sharareh Taghipour)

Fitness trackers are wearable activity-tracking devices that monitor health metrics such as steps per day, heart rate, and sleep patterns. This capability enables one to track these metrics over a period of time and has the potential to unveil other health indicators. Sleep duration and quality of sleep, or lack thereof, have been linked to health conditions such as type 2 diabetes and heart disease. Using the data available, the following models will be used to predict whether activity during the day affects sleep quality: multiple logistic regression, random forest, and LSTM. This project will use the Fitbit activity and sleep dataset (NetHealth, 2019).

Darad, Simran – Sentimental Analysis of Covid-19 Twitter Data (Supervisor: Sridhar Krishnan)

The novel coronavirus disease (COVID-19) is an ongoing pandemic with large global attention. However, the spread of fake news on social media sites like Twitter is creating unnecessary anxiety and panic among people towards this disease. In this paper, we apply machine learning techniques to predict the sentiment of people on social media such as Twitter during the COVID-19 peak in April 2021. The data contain tweets collected between 16 April 2021 and 26 April 2021, where the text of the tweets has been labelled as positive, negative, or neutral by training the models on an already labelled dataset of coronavirus tweets. Sentiment analysis was conducted with a deep learning model, Bidirectional Encoder Representations from Transformers (BERT), and various machine learning models for text analysis, whose performance was then compared. The ML models used are Naïve Bayes, Logistic Regression, Random Forest, Support Vector Machines, Stochastic Gradient Descent, and Extreme Gradient Boosting. Accuracy for each sentiment was calculated separately. The accuracies of the ML models were 66.4%, 77.7%, 74.5%, 74.7%, 78.6%, and 75.5%, respectively, and the BERT model achieved 84.2%. Each sentiment classification model has an accuracy around or above 75%, which is quite significant for text mining algorithms.

De Silva, Sidda Marakkala – Adaptive Network-Based Fuzzy Inference Systems (ANFIS) For Explainable Yield Curve Modelling (Supervisor: Alexey Rubtsov)

Artificial Neural Networks (ANNs) are a standard tool in machine learning, but they pose unique challenges in the finance industry. ANNs are susceptible to yielding biased results, and their lack of explainability introduces a legal risk in addition to the operational risk. A Fuzzy Inference System (FIS) is interpretable and can incorporate human expert knowledge, but lacks learnability. Adaptive Network-Based Fuzzy Inference Systems (ANFIS) are explored to train an FIS with an ANN for yield curve modelling. The yield curve is used to predict changes in economic output and growth. In this paper, an ANFIS system is developed to answer the following question: "How does the yield curve for 8 maturities move when only one point on the curve moves by a given amount?"

Garg, Yashi – Analysis of Creditworthiness – All Lending Club Loan Data (Supervisor: Farid Shirazi)

Today, many individuals seek credit from banks and other financial institutions for personal or professional needs, and lenders such as Lending Club (a peer-to-peer lending institution that also operates like a bank, offering secured and unsecured loans to its clients) must determine the creditworthiness of their clients. The task for data analysts is to identify this risk by developing a data-analytics-based strategy for Lending Club that helps determine the creditworthiness of applicants. This protects banks from defaulters and, at the same time, spares the economy the burden of bailing out financial institutions to prevent a collapse like the sub-prime mortgage crisis of 2007 in the US. The motivation for this project comes from the current boom in the Canadian housing market, which puts undue pressure on banks to process loans at a quicker pace. The first task was to identify a dataset appropriate for the analysis that could yield meaningful outcomes; the Lending Club dataset presents a perfect opportunity to better understand how data analytics can be applied to one of the most complex and responsible tasks in the financial industry. Any data is meaningless without properly analyzing the meaning of its fields, so effort was put into understanding the basics of the lending industry. This is followed by exploratory data analysis and feature selection; a study of feature correlations is also essential to determine the outcome. Another critical step is to visualize the features in relation to the target, followed by normalizing the data. To effectively evaluate the model, the dataset is split into training and test sets. Finally, we build the model to predict defaulters for future loan applications.

Ghosh, Biraj – Diabetic Foot Ulcer Detection and Segmentation Using U-Net (Supervisor: Naimul Khan)

The aim of this work is to detect and localize diabetic foot ulcers. Diabetic foot ulcers are a frequent complication of diabetes, a global epidemic, and this research uses deep learning to segment ulcer regions from pictures of patients' feet. The network proposed in this research is a U-Net, an encoder-decoder network that is end-to-end trainable. The first part of the network is a contracting path of convolutions that extracts image features, and the second part is a symmetric expanding path that helps with pixel-level reconstruction of the masked image. When trained on data from the Diabetic Foot Ulcer (DFU) challenge, this network yields strong results, with a Dice score of 0.7339 and an AUC of 0.8804 using a focal Tversky loss function. A detailed analysis of these metrics has been conducted using different loss functions, such as Dice loss and Tversky loss.
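
The Tversky and Dice quantities mentioned above can be written compactly in PyTorch. The sketch below shows a plain Tversky loss (the focal variant used here additionally raises the loss to a power) and a Dice score computed on toy masks; the smoothing constants and alpha/beta values are illustrative assumptions.

import torch

def tversky_loss(pred, target, alpha=0.7, beta=0.3, smooth=1.0):
    # Tversky loss on predicted probabilities; alpha > beta penalizes false negatives more.
    # With alpha = beta = 0.5 this reduces to the Dice loss.
    pred, target = pred.reshape(-1), target.reshape(-1)
    tp = (pred * target).sum()
    fn = ((1 - pred) * target).sum()
    fp = (pred * (1 - target)).sum()
    index = (tp + smooth) / (tp + alpha * fn + beta * fp + smooth)
    return 1.0 - index

def dice_score(pred, target, threshold=0.5, smooth=1.0):
    # Dice coefficient between a binarized prediction and the ground-truth mask.
    pred = (pred.reshape(-1) > threshold).float()
    target = target.reshape(-1).float()
    inter = (pred * target).sum()
    return ((2 * inter + smooth) / (pred.sum() + target.sum() + smooth)).item()

# Toy masks: a random "probability map" against a random binary ground truth.
pred = torch.rand(1, 1, 64, 64)
mask = (torch.rand(1, 1, 64, 64) > 0.8).float()
print(tversky_loss(pred, mask).item(), dice_score(pred, mask))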

Huang, Shansong – Credit Card Fraud Detection Using Machine Learning (Supervisor: Farid Shirazi)

In recent years, the growth of electronic banking has boosted financial fraud, which results in billions of dollars of losses worldwide each year. Several machine learning techniques for detecting fraud have evolved and been applied to detect credit card fraud. This paper aims to provide a comparative analysis and comprehensive review of different approaches to detecting fraud. It investigates the performance of logistic regression, decision trees, XGBoost, k-nearest neighbour (k-NN), random forest, and LightGBM on highly unbalanced credit card fraud data. The dataset contains 284,807 credit card transactions made in September 2013 by European cardholders. The performance of each model is evaluated based on classification reports and a confusion matrix.

Islam, Masuk Ul – Energy Demand and Price Forecasting By Taking Account of Weather Data (Supervisor: Saman Hassanzadeh Amin)

Meeting the basic energy needs of the population at a reasonable cost is an important policy objective at any national level. It is therefore necessary to identify and analyze how current energy needs are being met while predicting the nature of future demand. Delivering sustainable energy to end users requires increased efficiency on both the generation and demand sides, which can only be optimized if the numbers are known in advance. Energy demand is also influenced by different parameters, such as seasonality and weather conditions. Due to the unique nature of the industry, many studies are in progress to better predict future energy consumption. The aim of this project is to explore hourly energy demand and price data from ENTSOE for five major cities in Spain. Different forecasting models, namely ARIMA, SARIMAX, and LSTM, were then applied to forecast energy demand and price. Finally, the models were compared based on the evaluation metrics RMSE (Root Mean Squared Error) and sMAPE (Symmetric Mean Absolute Percentage Error).
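
For reference, the two evaluation metrics can be computed with a few lines of NumPy; the hourly demand values below are illustrative placeholders, not taken from the ENTSOE data.

import numpy as np

def smape(actual, forecast):
    # Symmetric Mean Absolute Percentage Error, expressed in percent.
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    denom = (np.abs(actual) + np.abs(forecast)) / 2.0
    return 100.0 * np.mean(np.abs(forecast - actual) / denom)

def rmse(actual, forecast):
    # Root Mean Squared Error.
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.sqrt(np.mean((forecast - actual) ** 2)))

demand_actual = [620, 640, 700, 710, 690]      # illustrative hourly loads (MW)
demand_forecast = [600, 655, 690, 730, 680]
print(f"RMSE = {rmse(demand_actual, demand_forecast):.1f} MW, "
      f"sMAPE = {smape(demand_actual, demand_forecast):.2f}%")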

Jiwani, Farzeem – Machine Learning Approaches to Investigate Fundamentals That Impact the Trends in Foreign Exchange Rates (Supervisor: Wahab Mohamed Ismail)

The aim of this paper is to examine the disparate economic factors that influence the US–Canadian dollar exchange rate using machine learning techniques. Exchange rate forecasting is one of the most attractive and challenging issues in international economics, as many uncertain factors come into play concurrently. Therefore, improving the accuracy of exchange rate forecasts and analyzing the fundamentals is of great interest to market participants, policymakers, and academics. One of the prominent issues for exchange rate prediction is the choice of features driving it. The challenge of choosing the optimal forecasting model is mostly a result of the frequent changes among the factors impacting exchange rates. By taking into account a variety of alternative fundamentals that affect the Canadian dollar, we statistically evaluate the forecasting performance of the models and focus on understanding what drives movements in the Canadian dollar and which indicators influence it the most. Diverse attributes could prospectively impact the exchange rate, e.g., the unemployment rate, purchasing power, oil price volatility, and money supply. We forecast the exchange rate using traditional models, in particular Random Forests, Extra Trees, XGBoost, Support Vector Machines, Lasso, and Ridge, as well as deep learning models, namely LSTM and GRU. Evaluation is carried out using out-of-sample testing against standard regression metrics, specifically normalized deviation (ND), mean absolute error (MAE), root mean squared error (RMSE), and normalized root mean squared error (NRMSE). In addition to modeling the selected rates and indices mentioned above as potential features, the study interprets the results using techniques such as SHAP to quantify how much significance each macroeconomic component attributes to changes in the exchange rate. Our results indicate that the linear regression models, namely Lasso and Ridge regression, perform better than the others on spatial information, while the deep learning models outperform the others when trained on temporal information. Purchasing power parity, unemployment, money supply stocks, and oil prices were generally major factors driving exchange rate movements, while the S&P 500 and commodity price indexes came out as significant for the deep learning models. A model that can predict the direction of the foreign exchange rate while accounting for the relationship between these macroeconomic fundamentals and the exchange rate would be an outstanding contribution, as it helps clarify which economic factors drive fluctuations in exchange rates.

Joseph Santhana Raj, Rinaldo Sonia – Hourly Space Heating Energy Demand of Residential Houses Using AI/ML Technique with High Frequency Data from Smart Thermostat (Supervisor: Alan Fung)

The aim of this research project is to work towards building a generalized data-driven model that can predict the space heating demand of a household on an hourly basis. The data are gathered from Ecobee Smart thermostats, which provide high-frequency but imprecise data, and we try to use such a data source to predict each house's hourly space heating demand for real-time predictive control at large scale. First, an exploratory analysis is conducted using descriptive analytics and data visualization to find patterns or relationships that could give insight into the data. Further, multiple approaches and techniques, such as data aggregation and the inclusion of time-lag and weather information, are applied to model and predict the space heating demand of any house using only basic, easy-to-obtain features. Several data analysis and predictive modelling techniques using neural networks have been tried, among which a 60-minute time lag applied to the independent variables, along with external weather information, seems to help achieve the desired results. For future explorations, to obtain more precise predictions and better results, it would be valuable to add various data sources, including features such as family size, and other relevant information using domain knowledge.

Maqbool, Aliya – Left Ventricle Segmentation of Echocardiograms Through Deep Learning (Supervisor: Naimul Khan)

An echocardiogram creates ultrasound images of heart structure and produces an accurate assessment of the blood flowing through the heart. This assessment comprises both normal and abnormal blood flow through the heart, to visualize any abnormal communications between the left and right ventricles, any leaking of blood through the valves, and how well the valves open (or do not open, in the case of valvular stenosis). Echocardiography is specifically targeted at visualizing the blood flow from the left ventricle of the heart. Previously, ultrasound images of the left ventricle were analyzed manually by cardiologists for segmentation of the left ventricle. This project proposes a deep learning model that analyzes ultrasound videos to perform semantic segmentation of the left ventricle on the Echo Dynamic dataset. The Echo Dynamic dataset was obtained through a standard full resting echocardiogram study and consists of echocardiogram videos that visualize the heart through various angles, positions, and image acquisition techniques. The dataset consists of 10,030 apical-4-chamber echocardiography videos of unique individuals who underwent cardiac assessment between 2016 and 2018 as part of routine clinical care at Stanford University Hospital. Each frame of each echocardiogram video will be semantically analyzed, and left ventricle segmentation will be performed on each video and saved for later analysis. The segmentation model used for the project is DeepLabv3 with a ResNet-50 backbone architecture. The performance of the model will be evaluated by the Dice similarity coefficient.
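
A minimal torchvision sketch of the segmentation model named above, DeepLabv3 with a ResNet-50 backbone, configured for two classes (background and left ventricle) and applied to a dummy frame; weight initialization, the keyword names in recent torchvision versions, and the exact training pipeline are assumptions rather than the project's actual setup.

import torch
from torchvision.models.segmentation import deeplabv3_resnet50

# Two classes: background and left ventricle. Weights are left uninitialized here.
model = deeplabv3_resnet50(weights=None, num_classes=2)
model.eval()

frame = torch.randn(1, 3, 512, 512)          # one echocardiogram frame as a 3-channel tensor
with torch.no_grad():
    logits = model(frame)["out"]             # shape: (1, 2, 512, 512)
mask = logits.argmax(dim=1)                  # per-pixel class labels (0 = background, 1 = LV)
print(mask.shape, mask.unique())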

Middela, Lakshmi Nitisha – Twitter Sentiment Analysis of Blockchain Technology (Supervisor: Farid Shirazi)

Bitcoin is a decentralized digital currency, without a central bank or single administrator, that can be sent from user to user on the peer-to-peer bitcoin network without the need for intermediaries. Bitcoin is one of the most talked-about cryptocurrencies due to its price fluctuations. At the same time, researchers now recognize the power of predicting sentiment from tweets for different events, political crises, and even the economy. In January 2021, Elon Musk placed #Bitcoin in his Twitter profile and tweeted “In retrospect, it was inevitable”, which caused the price to briefly rise by about $6,000 in an hour. The sentiment of tweets on Twitter about cryptocurrency directly or indirectly reflects the overall behaviour of Bitcoin. This project trains a deep learning model for sentiment prediction on tweets about Bitcoin. The main objective of the proposed work is to achieve better accuracy and authenticity in our prediction model.

Mohammed Abdul, Faheem – Propaganda Text Classification Analysis News Based on Their Propagandistic Contents (Supervisor: Farid Shirazi)

“Propaganda is a mechanism to influence public opinion, which is inherently present in extremely biased and fake news.” For example, if a celebrity or public figure gives a personal opinion about an event organized by a government, then political parties, organizations, or individuals who dislike that opinion can introduce extreme bias by manipulating the original statement. Here, I propose a model to automatically assess the level of propagandistic content in an article based on different representations, from writing style to certain keywords. I experiment thoroughly with different variations of such a model on a new publicly available corpus, and I show that character n-grams and other style features outperform existing alternatives based on word n-grams for identifying propaganda. I make sure that the test data come from news sources that were unseen during training, thus penalizing learning algorithms that model the news sources used at training time rather than solving the actual task. This allows users to quickly explore different perspectives on the same story, and it enables investigative journalists to dig further into how different media use stories and propaganda to pursue their agendas.
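
A minimal sketch of a character n-gram style baseline of the kind described above, using scikit-learn; the example texts and labels are placeholders.

```python
# Minimal sketch, assuming a bag-of-character-n-grams baseline; texts/labels are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["sample propagandistic article text ...", "sample neutral article text ..."]
labels = [1, 0]                                    # 1 = propagandistic, 0 = not

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams capture style
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["another unseen article ..."]))
```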

Monira, Serazam – Pneumonia Detection from Chest X-Ray Images Using Deep Learning Algorithms (Supervisor: Farid Shirazi)

The risk of pneumonia is immense for many, especially in developing nations where billions face energy poverty and rely on polluting forms of energy. The WHO estimates that over 4 million premature deaths occur annually from household air pollution-related diseases, including pneumonia. Over 150 million people are infected with pneumonia every year, especially children under 5 years old. In such regions, the problem can be further aggravated by the dearth of medical resources and personnel. The goal is to build an algorithm that automatically identifies whether a patient is suffering from pneumonia by looking at chest X-ray images. The algorithm has to be extremely accurate because people's lives are at stake. We present a novel approach using deep learning algorithms to detect pneumonia from chest X-ray images, and we propose a model to identify pneumonia after evaluating a series of deep learning algorithms.

Myers, Mitchell – Disambiguating Medical Abbreviations with Token Classification Methods (Supervisor: Mucahit Cevik)

Abbreviations are an unavoidable and critical part of medical text, especially in clinical patient notes; for the person writing them, they can save time and space, protect sensitive information, and avoid repetition. However, because most abbreviations can have multiple senses and there is no standardized mapping system, accurately disambiguating them becomes a difficult and time-consuming yet important task. The goal of this study is to examine the feasibility of token classification methods for medical abbreviation disambiguation and to explore this method's ability to deal with multiple unique abbreviations in a single text. We use two datasets to compare and contrast the performance of several transformer-based BERT models pre-trained on different scientific and medical corpora, examine the effect of reducing the number of instances of frequently occurring labels, and investigate the impact of a unique prediction post-processing method. Through our experiments, we show that downsampling highly frequent labels has a very limited impact on performance but drastically reduces training time. In our comparative tests, we find that pre-training on text that is specifically relevant to the downstream task dataset helps performance but does not guarantee strong results, and that being resistant to class imbalance allows for more consistent performance overall. We also show that post-processing has a positive but limited effect on results and mainly impacts infrequent label prediction.
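
The token-classification setup can be sketched with Hugging Face Transformers as below; the checkpoint name, label count, and example note are illustrative assumptions, and a fine-tuned, domain-specific model would be used in practice.

```python
# Sketch of the general token-classification setup (not the study's actual code):
# each token receives a label, which can encode an abbreviation sense.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "bert-base-uncased"                  # a medical/scientific checkpoint could be swapped in
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=5)

text = "pt c/o sob after pe"                      # clinical note with multiple abbreviations
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits               # one label distribution per token
predicted = logits.argmax(dim=-1)                 # predicted sense label id for every token
print(predicted)
```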

Nam, Hee Kyong – Pneumonia Detection in Chest X-ray Images Using Deep Convolutional Network (Supervisor: Farid Shirazi)

Pneumonia is a life-threatening disease that can lead to death within a short period due to the flow of fluid in the lungs. Early diagnosis and treatment are very important to avoid the progression of the disease, and the chest X-ray is the most widely used method for radiologists to diagnose pneumonia. However, human-assisted approaches have drawbacks such as expert availability, treatment cost, and availability of diagnostic tools. Hence the need for an automated system that operates on chest X-ray images. This project proposes deep convolutional neural network and transfer learning approaches that automatically detect pneumonia from X-ray images. Moreover, the impact of image augmentation techniques on the deep learning framework and the visualization of feature maps and class activation maps are examined in this study. According to the findings of this project, a deep convolutional neural network can detect pneumonia from chest X-ray images. The proposed Xception-based CNN with transfer learning using ImageNet parameters obtained the following results: accuracy (0.95), recall (0.94), precision (0.96), F1-score (0.95), and AUC score (0.95). These positive results allow us to consider the model as an alternative that can be useful in low-resource countries with a shortage of radiology experts and equipment.
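
A hedged sketch of the transfer-learning configuration described above, using Keras with ImageNet-initialized Xception weights; the head layers and hyperparameters are assumptions.

```python
# Illustrative transfer-learning setup: Xception as a frozen feature extractor plus a small
# binary head (pneumonia vs normal). Hyperparameters are assumptions, not the study's values.
import tensorflow as tf

base = tf.keras.applications.Xception(weights="imagenet", include_top=False,
                                      input_shape=(299, 299, 3))
base.trainable = False                               # freeze the pre-trained feature extractor

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary output: pneumonia vs normal
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
```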

Noor, Fozia – Multimodal Stress Detection (Supervisor: Naimul Khan)

Due to the continuous increase in patients suffering from stress, researchers' attention has been diverted toward developing automatic stress detection systems. Stress is a response to anything that causes pressure or is threatening, and the human body responds to it with physiological changes, such as changes in heart rate, respiration, and electrodermal activity. Combining these physiological signals can provide complementary information for a robust stress detection model. Therefore, multi-modal approaches are essential for building an automated system that can accurately and automatically detect stress. This study proposes a framework to build a multi-modal detection model. The main challenges in building an accurate model are the choice of features, the choice of classification algorithm, and the classification strategy. The study uses different approaches to maximise the performance of the detection system: it examines the influence of features on classification, compares the role of features from different modalities, and applies multiple feature selection methods to find the best combination of features while reducing time and space requirements. The study also employs different machine learning algorithms and a designed neural network approach to achieve higher performance. The performance of the proposed model is evaluated using different testing strategies; an average classification accuracy of 93% is achieved using the neural network, and a classification accuracy of 96% is achieved using AutoML. This shows the real-world applicability and robustness of the designed algorithm.

Ojo, Ayodotun – A Predictive Analysis of Credit Card Customer Churn Using Machine Learning and Ensemble Techniques (Supervisor: Farid Shirazi)

Customer relationship management and customer churn prediction have received growing attention over the past decade. In this research paper, we conducted a predictive analysis leveraging Machine Learning techniques such as Random Forest, Support Vector Machine, K-Nearest Neighbours, XGBoost and Artificial Neural Networks, as well as Ensemble techniques such as Stacking using Logistic Regression as a meta-learner and Voting to classify credit card customer churn. Since foregone profits of attrited customers and the cost of attracting new customers can be significant, an increase in retention rate can be very profitable to banks and service providers. By ascertaining potential churners, service providers can design a more effective client relationship management strategy and offer more tailored services to customers. An exploratory data analysis is conducted, and we also determine which attributes are more relevant in predicting customer churn to support the interpretability of the analysis.
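
A minimal sketch of the stacking idea with a Logistic Regression meta-learner in scikit-learn; the base estimators' hyperparameters are illustrative, and X_train/y_train are assumed to be prepared elsewhere.

```python
# Hedged sketch of stacking base classifiers under a Logistic Regression meta-learner.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=7)),
        ("svm", SVC(probability=True)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-learner over base predictions
    cv=5,                                                # out-of-fold predictions for stacking
)
# stack.fit(X_train, y_train); churn_pred = stack.predict(X_test)
```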

Pingali, Krishnamohan – Keyword Extraction Using NLP Tools (Supervisor: Farid Shirazi)

Natural Language Processing has become an integral part of this modern digital age. It has many real-life applications across a broad variety of fields, ranging from search engine optimization and language translation to customer-friendly chatbots and more. Text analysis has been one of the leading fields of research in NLP. The aim of this project is to extract keywords from a collection of research papers and to determine their association with the themes of those papers. The research papers used for this project relate to AI, Business, Healthcare, IoT, and Smart Contracts. Various NLP methods such as YAKE, KeyBERT, and Sentence-BERT will be used to identify the keywords.
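
A minimal usage sketch of KeyBERT, one of the tools named above; the document text and parameters are placeholders.

```python
# Illustrative KeyBERT usage; the abstract text is a placeholder, not a paper from the corpus.
from keybert import KeyBERT

doc = "Smart contracts on blockchain platforms enable automated, trustless transactions ..."
kw_model = KeyBERT()                                  # defaults to a sentence-transformers backbone
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 2),                     # single words and bigrams
    stop_words="english",
    top_n=5,
)
print(keywords)                                       # list of (keyphrase, similarity score) pairs
```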

Rahman, Md Mozahidur – Empirical Evaluation of the Sales Prediction Using Data From Dataco’s Supply Chain Management and the Addition of Economic Indicators (Supervisor: Saman Hassanzadeh Amin)

Sales forecasts determine how much product is needed to meet targets. An economic indicator is a macroeconomic metric that researchers use to assess current and future economic activity and opportunities; it is also a key factor in the 'systematic risk' of local and global organizations. This research establishes an empirical comparison of future sales forecasts with and without economic indicators. The population, unemployment rate, and GDP per capita of the customer's country were selected as representative economic indicators. After the deployment and examination of the data model, some intriguing findings emerged: there is no significant improvement in the sales forecast after incorporating economic indicators. In addition, the most accurate predictor was XGBoost, which achieved a mean absolute error (MAE) of 0.014, and even after accounting for economic indicators, the Random Forest predictor still produced the same result.

Rovinska, Svetlana – Stress Detection From Multimodal Wearable Sensor Data (Supervisor: Naimul Khan)

The aim of this study is to build and train a well-performing, generalizable model to automatically detect mental stress based on physiological signals obtained from wearable sensor devices. Two main trends for mental stress detection are found in recent literature: traditional machine learning methods with manual feature extraction from time sequences, and deep learning methods for automatic extraction of features from time sequences. In this study we adopt the latter approach due to its advantage of not requiring underlying knowledge of how to process signal sequences from various modalities. We applied the novel idea of using unsupervised convolutional autoencoders to automatically extract features from time-series data, which are then fed to a supervised classifier to classify people's affective state. The open-source WESAD dataset was used to train and test our models.
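
A sketch of a 1D convolutional autoencoder of the kind described, written in Keras; the window length, channel count, and layer sizes are assumptions rather than the study's actual architecture.

```python
# Sketch: convolutional autoencoder for unsupervised feature extraction from signal windows.
# Window length, channel count, and layer sizes are illustrative assumptions.
import tensorflow as tf

window_len, n_channels = 256, 3                    # e.g., ECG, EDA, respiration channels
inputs = tf.keras.Input(shape=(window_len, n_channels))
x = tf.keras.layers.Conv1D(16, 5, activation="relu", padding="same")(inputs)
x = tf.keras.layers.MaxPooling1D(2)(x)
x = tf.keras.layers.Conv1D(8, 5, activation="relu", padding="same")(x)
encoded = tf.keras.layers.MaxPooling1D(2)(x)       # compressed representation for the classifier

x = tf.keras.layers.Conv1D(8, 5, activation="relu", padding="same")(encoded)
x = tf.keras.layers.UpSampling1D(2)(x)
x = tf.keras.layers.Conv1D(16, 5, activation="relu", padding="same")(x)
x = tf.keras.layers.UpSampling1D(2)(x)
decoded = tf.keras.layers.Conv1D(n_channels, 5, activation="linear", padding="same")(x)

autoencoder = tf.keras.Model(inputs, decoded)      # trained to reconstruct the input windows
encoder = tf.keras.Model(inputs, encoded)          # reused to extract features for the classifier
autoencoder.compile(optimizer="adam", loss="mse")
```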

Sarp, Gorkem – Emotion Categories of Tweets About Human-Induced Climate Change (Supervisor: Farid Shirazi)

Climate change is a major worldwide issue. Multiple studies have used Twitter data to conduct sentiment and emotion analyses to gauge public opinion about this issue and gain insights. This study looked at emotion change in climate change-related tweets over a 10-year period. Logistic regression was used for emotion classification, trained on February-April 2022 tweets containing hashtags corresponding to six basic emotions. Climate change-related tweets sent during the first week of every year from 2013 to 2022 were collected to predict their emotion categories. There was a significant difference in the proportions of emotion categories over the years, but no apparent linear trend. The largest predicted emotion category was fear. Potential explanations for the results and suggestions for future studies were noted.

Sharma, Richa – Predicting Carbon Dioxide Emission From Sustainable Smart City Environment Perspective (Supervisor: Farid Shirazi)

The biggest challenge for our smart sustainable cities is handling climate change, which is driven by various factors, one of them being CO2 emissions from vehicles. This project aims to predict the CO2 emissions produced by new vehicles over the years. We propose and compare various machine learning and deep learning models incorporating feature selection, such as Multilinear Regression, Random Forest Regression, Lazy Regressor, an ANN model, and an LSTM recurrent model, to predict CO2 emissions in Canada. Both deep learning models, the ANN and the LSTM, outperformed the machine learning models in terms of accuracy but require more extensive and powerful processors. To further classify the vehicles with the greatest emissions impact for our sustainable smart cities, we used KNN and Random Forest models. To demonstrate the performance of the models, the RMSE (Root Mean Square Error), MAPE (Mean Absolute Percentage Error), and MAE (Mean Absolute Error) measures are used. The Random Forest model comes much closer to the target variable in the classification task.

Sunil, Neha – Prediction and Classification of Breast Cancer Using Machine Learning Techniques (Supervisor: Farid Shirazi)

Breast cancer, one of the most frequent malignancies among women, accounts for nearly one in three cancers diagnosed among women in the United States. For specific care and medical treatment, diagnostic procedures with improved performance are crucial in this domain. In this research, the main aim is to classify whether a breast tumour is malignant or benign and to help predict the recurrence or non-recurrence of cases classified as malignant. The data were obtained from the public Wisconsin Breast Cancer dataset. An exploratory analysis is conducted using descriptive analytics and data visualization to help identify patterns or relationships that could shed light on hidden information in the available data. The second part of the analysis stems from training different classifier models that classify the tumour as benign or malignant and predict the recurrence of cases classified as malignant by observing the most important features. The performance of the different classifiers was compared using precision, recall, and Area Under the Curve.

Sunil, Nikhil – Customer Review Classification Using Machine Learning and Deep Learning Techniques (Supervisor: Farid Shirazi)

In today’s world, the online e-commerce industry has become very competitive, and it continues to grow at a rapid pace. Companies generate a lot of data that contain customer feedback, such as reviews of their products and services. Online customer reviews play a very important role in helping a company improve its sales and grow its customer base. In this paper, we take customer reviews of women’s clothing and classify them as good or bad, which can help in deciding whether a product or service is doing well in the market. To achieve this, I use various machine learning and deep learning classification models to classify the customer reviews and compare their accuracies and other parameters, to decide which model is the best fit for this task. The machine learning models proposed here are Logistic Regression, AdaBoost, Decision Tree, Support Vector Machine, and Random Forest, and the deep learning models are LSTM, Bi-LSTM, and GRU. The models are evaluated using various metrics, such as the confusion matrix, the AUROC curve, and the classification report.

Watts, Benjamin – Diabetes Blood Glucose Management With Deep Learning Algorithms (Supervisor: Farid Shirazi)

This research explores the use of deep learning models to forecast Blood Glucose (BG) values at 30- and 60-minute prediction horizons. The research explores Gated Recurrent Units (GRU), simple Recurrent Neural Networks, and Long Short-Term Memory (LSTM) networks, and empirically determines the best architecture for an accurate forecast with minimal training time. We also explore the benefits of feature engineering and model personalization and compare them with base learner performance. Overall, I suggest a best-performing GRU architecture that achieves an RMSE of 1.69 mmol/L, along with the data structure needed to train a model given a time series of BG readings. Managing BG is important for the health outcomes of people living with diabetes. With the increasing prevalence of data sources in addition to CGM, research will continue to help those living with diabetes leverage deep learning to improve health outcomes.

Yu, Qinyun – Prediction and Forecasting of Specific Local Outdoor Temperature for Advanced Predictive Control of HVAC Systems Using Local Geospatial and Weather Data (Supervisor: Alan Fung)

Due to the scarcity and cost of precise meteorological sensors, and factors like the urban heat island effect, temperature cannot be accurately measured in dense urban environments. Local weather station data were gathered, examined, and cleaned. These data were fed into various time series forecasting models (SARIMAX, Prophet, LightGBM) and evaluated using several error metrics by comparing observed outdoor temperatures with forecasted temperatures. The forecasting capabilities were then examined through backtesting by repeatedly forecasting n periods ahead and retraining. Given its explainability, scalability, and hyperparameter tuning, LightGBM was the evident winner. With a mean absolute error of 0.43 when forecasting 24 hours ahead, relatively accurate outdoor temperature forecasts could be made given exogenous features like solar radiation and nearby station data. For future studies, it is recommended to build several deep learning models to compare with the current ensemble method and to fine-tune the current model with further hyperparameter tuning.

Al-Husban, Raad – Deep Learning-based Text Clustering Over Software Requirement Datasets and Detection of Conflicting Software Requirements (Supervisor: Mucahit Cevik)

This paper tackles the problem of detecting conflicting software requirements via deep learning-based text clustering methods. We propose a novel approach in which transformer-based embeddings are combined with TF-IDF, a more conventional and probabilistic method. The former provides the capability of mapping synonyms to nearby points in Euclidean space, while the latter is an older yet reliable term frequency–inverse document frequency embedding. In the proposed model, software requirements are embedded (vectorized) using three different methods: TF-IDF, Universal Sentence Encoder (USE), and BERT-TFIDF (a merger of the two embedding vectors). For the transformer-based methods, pre-trained models from the Hugging Face and TensorFlow libraries were used; specifically, we employed a distilled BERT model for its excellent performance and agility. A cosine similarity matrix is generated for each of these embeddings and forms the basis for finding similar and conflicting software requirements. We note that conflicting requirements are most likely similar requirements, hence they will have a high cosine similarity. Therefore, our approach attempts to define an optimal cut-off threshold where a specific cosine similarity value indicates that a given requirement is likely to be a conflicting one. To validate our approach, multiple annotators manually annotated three different datasets extracted from the PURE requirements dataset. After identifying optimal cut-off thresholds, these values are used on future unseen sets of software requirements. The approach performed remarkably well on the UAV dataset, with less satisfactory results on the OpenCoss and WorldVista datasets. Despite some shortcomings, our approach showed promising results and could be developed further for possible practical use in automating the identification of conflicting software requirements.
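
The similarity-threshold idea can be sketched with TF-IDF embeddings alone (the full approach also uses transformer embeddings); the requirements and cut-off value below are illustrative.

```python
# Sketch of flagging candidate conflicts via a cosine similarity threshold over TF-IDF vectors.
# The requirement texts and the 0.6 threshold are placeholders; the optimal cut-off is tuned
# on annotated data in the approach described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

requirements = [
    "The UAV shall transmit telemetry every second.",
    "The UAV must send telemetry data once per second.",
    "The system shall log all operator commands.",
]
vectors = TfidfVectorizer().fit_transform(requirements)
sim = cosine_similarity(vectors)                     # pairwise cosine similarity matrix

threshold = 0.6
for i in range(len(requirements)):
    for j in range(i + 1, len(requirements)):
        if sim[i, j] >= threshold:
            print(f"Potential conflict/overlap: requirement {i} vs {j} (sim={sim[i, j]:.2f})")
```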

Azadeh, Sara – Classifying COVID-19 Related Tweets Based on Gender and Sentiment (Supervisor: Ebrahim Bagheri)

News and awareness about COVID-19 spread like the pandemic itself on social media. During the resulting lockdowns, people used social media to express their feelings and to find and share COVID-19 related information.
Several COVID-19 tweet datasets are available; we explored some of them and finally chose one to study in depth. In this study, we first performed sentiment analysis based on LIWC and VADER. We then classified tweets based on the sentiments generated by VADER with different machine learning models such as Naive Bayes, Support Vector Machine, Random Forest, and XGBoost, obtaining the highest accuracy with the XGBoost model. Furthermore, to predict gender from the tweet text, we used one of the pre-trained BERT models from Hugging Face together with the Naive Bayes model, which had the highest accuracy among the other models. We then classified the tweets based on gender and sentiment.

Berry, Sean – Comparative Analysis on Time-Series Clustering (Supervisor: Mucahit Cevik)

The COVID-19 pandemic began changing lifestyles, and with the developing technology, online grocery delivery applications became an indispensable part of life. The demand for such applications has increased to a point where new competitors are disrupting the market. This competition has forced companies to restructure their product pricing strategies. Companies in the marketplace can identify the change patterns in product prices, giving them a competitive advantage. This paper investigates alternative approaches to popular time-series clustering methodologies, with the aim of clustering products based on historic price and sales volumes.
In this paper we propose a novel distance metric that considers how product prices move together, rather than simply measuring numerical distance between the data points. Our approach is then compared with more popular distance metrics and clustering algorithms. Image clustering is also assessed as an alternative for time series clustering based on visual patterns. Using a combination of our custom evaluation metric along with the Calinski-Harabasz and Davies-Bouldin indices, which are commonly used internal validity metrics, we evaluate the performance of the different clustering algorithms. Our proposed approach, as well as image clustering, is found to perform well for the task of grouping products with similar pricing and sales volumes.

Bhardwaj, Bhawna – Sentiment Analysis of Reviews Submitted For Research Papers on Openreviews.net (Supervisor: Ebrahim Bagheri)

Sentiment analysis, or opinion mining, is used to automate the detection of subjective information such as opinions, attitudes, emotions, and feelings. Many researchers spend a long time searching for suitable papers for their research, and online reviews of papers are an essential resource that can save researchers time, effort, and cost. In this MRP, the aim is to perform sentiment analysis on reviews of research papers submitted to openreview.net. Sentiment analysis is chosen as the appropriate method because it does not require going through all the papers, but rather captures the sentiment of the reviews in terms of positive or negative opinions. The primary objectives of this research are:

• Identification of the algorithms and metrics for evaluating the performance of Machine Learning and Deep Learning Classifiers.

• Comparison of the metrics from the identified algorithms depending on the size of the dataset that affects the performance of the best-suited algorithm for sentiment analysis.

Through the analysis of various machine and deep learning algorithms, the aim of this research is achieved by identifying the best-suited algorithm for sentiment analysis on reviews submitted to openreview.net with respect to the selected dataset, ICLR 2021.

Bhowmic, Biddut – Twitter Sentiment Prediction Analysis of Blockchain Technology in the Financial Sector by Machine Learning Models (Supervisor: Farid Shirazi)

The financial services industry is showing more and more interest in blockchain technology and is making investments in blockchain-based applications. The objective of this project is to build machine learning models that capture the sentiment of the general population regarding blockchain technology in the financial sector by analyzing tweets, which will help develop valuable insights for blockchain adoption by financial institutions. The Twitter data were collected using the Twitter streaming API as well as from a Twitter archive dataset. After pre-processing and cleaning the tweets, the polarity of each tweet was measured with the “nltk” library to determine the baseline labelled (positive, neutral, negative) data. In the second part, I ran several machine learning models and compared their accuracy in predicting the positive, neutral, and negative sentiment of the tweets. Among the three learning models, Logistic Regression (0.863) and Random Forest classification (0.863) show better accuracy than Multinomial Naive Bayes (0.807). This study contributes to the blockchain literature by i) providing some important insights into current public attitudes toward blockchain, and ii) developing models to predict blockchain sentiment that could be useful for financial institutions in their future decision-making and planning for blockchain adoption.
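
A minimal sketch of the VADER-based labelling step using NLTK; the example tweet and the ±0.05 thresholds are common defaults assumed for illustration.

```python
# Illustrative sketch of labelling tweet polarity with NLTK's VADER analyzer.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()

def label_tweet(text, pos=0.05, neg=-0.05):
    score = sia.polarity_scores(text)["compound"]    # compound score in [-1, 1]
    if score >= pos:
        return "positive"
    if score <= neg:
        return "negative"
    return "neutral"

print(label_tweet("Blockchain will transform cross-border payments!"))
```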

Chhabra, Kritika – Exploring the Effectiveness of Deep Learning and Transformer-Based Models for Software Requirement Classification (Supervisor: Mucahit Cevik)

The software requirements specification (SRS) is the foundation of a software project. It establishes the framework that each team involved in development will follow and provides the information multiple teams need to ensure everyone is on the same page. It also describes the functionality the product needs to fulfill and other kinds of non-functional requirements for the product or software. Software requirements must be classified for later use in the design and implementation phases. This classification can be done manually, which takes a lot of time and effort and can lead to misinterpretation of requirements because of the ambiguous nature of natural language; redundant information in an SRS document costs even more time and labour. Thus, automating software requirement identification and classification can smooth the software development process and save the teams time and effort. Software requirement classification is a form of text classification. In our research, we performed requirement classification in two parts: multi-class text classification with an imbalanced dataset (involving functional and different types of non-functional requirements) and binary text classification with a balanced dataset (combining all requirements except functional under one label, 'non-functional'). We conducted a detailed comparative study of five transformer-based models (BERT and its variants) and deep learning models (LSTM and BiLSTM) with and without word embeddings (BiLSTM+GloVe, BiLSTM+ELMo) for software requirement classification. We tested the performance of these models using two datasets: the NFR and PURE datasets. Our results show that the transformer-based models performed better in most cases and produced highly accurate results for binary classification. In addition, the word embeddings proved useful for multi-class text classification, improving BiLSTM performance by almost 20%. We also report the performance improvement that can be attributed to data balancing strategies.

Deol, Guneet – Textual Analysis of NSERC’s Awards Data Using Topic Modeling (Supervisor: Ebrahim Bagheri)

Latent Dirichlet Allocation (LDA) is a probabilistic generative model widely used to identify hidden semantic structure in a large text corpus through unsupervised learning. In this work, experiments were conducted with an LDA implementation to perform topic modeling on textual data. A dataset containing 607,565 unlabeled rows was used, and an LDA model was trained to identify the distribution of topics throughout the text corpus. We evaluate both the qualitative and quantitative results generated by the model through various measures. Based on the results obtained from LDA, we quantify how closely the project summaries are related to other attributes of the dataset, such as the funding amount. The work provides insights into research area trends and highlights characteristics of NSERC Awards Data across Canada.

Deshmukh, Saket – Probability Graphical Model Based Approaches in Human Classification and Human Activity Detection (Supervisor: Farid Shirazi)

In this Major Research Project (MRP), the first chapter provides details on the domain, topic, dataset(s), and research questions, with a detailed explanation of the datasets and their associated questions. The second chapter presents a literature review of Probabilistic Graphical Models (PGM), along with other machine learning algorithms used for classification. The third chapter provides a detailed explanation of different graphical models, from the simplest to the most complicated, and their applications in answering the project's questions. The performance results are documented in the fourth chapter, followed by future work in the fifth chapter.

Gurumoorthy, Subramanian Kaushik – Generating Realistic Synthetic Data Using Deep Learning Models (Supervisor: Farid Shirazi)

In today's competitive landscape, every organization spends millions of dollars to stay compliant with stringent regulations like GDPR or CCPA. Ongoing compliance is a complex subject, and it requires time and money for organizations to implement.
On October 2, 2006, Netflix, the world's largest online DVD rental service, announced the $1-million Netflix Prize for improving its movie recommendation service. The data were anonymized by replacing names with random numbers to remove personally identifiable information and protect the privacy of the users who provided ratings. In less than 18 months, researchers de-anonymized some of the Netflix data by comparing rankings and timestamps with a second dataset taken from the Internet Movie Database (IMDb). The researchers showed that it was not hard to re-identify individuals and their opinions or habits, which makes it easy to understand and profile individuals without much data.
At the same time, organizations need to be more agile in accelerating the adoption of artificial intelligence within each department to gain competitive advantage and increase profits. One of the important inputs for accelerating artificial intelligence is high-quality, realistic data. Organizations have data at their disposal for analytics, collected from their own data stores, but it contains personally identifiable information. The masking or anonymization process is a laborious and time-consuming task, and it has a further downside: there remains a chance of re-identification attacks.
So, there is a need for a strategy that allows realistic data to be used for analytics while also addressing privacy concerns.

Jayaraju, Kumara Prasana – Design, Development and Evaluation of Domain-Specific Topic Models and Classifiers for Public Health Using Big Social Data (Supervisor: Ravi Vatrapu)

In this paper, we focus on Reddit posts related to COVID-19 vaccination. Descriptive analytics is applied in an exploratory analysis to verify whether any temporal or linguistic patterns, correlations, or relationships can be found that generalize the posting patterns and behaviour of Reddit users. This research is motivated by studies started at the Digital Enterprise, Analytics and Leadership centre at Toronto Metropolitan University. The dataset used to train and evaluate the classifiers was coded manually. This work forms the second, extended part of that research, covering text analytics with keyword analysis, topic modelling, and domain-specific classifier modelling using the MUTATO method. We propose five labels for the posts collected from Reddit: Medical Health, Personal, Social, Conspiracy, and None.

Jewitt, James – Emotion Detection Guided by Audio and Textual Information (Supervisor: Sridhar Krishnan)

Detecting emotions in speech has a wide range of uses, such as aiding human-computer interaction and helping identify issues in patients dealing with mental health problems. This paper applies audio signals and text to predict different emotions within the data. Exploratory analysis is performed to learn more about the data, including bias, text and audio length, distributions, and the demographics of the recordings. For my experimental design, 2D CNN models that take MFCC features extracted from the audio are used, and a RoBERTa language model is fine-tuned on the text data. The models are combined through model weights on datasets that have both audio and text modalities. The 2D CNN shows promising results in predicting emotions in aligned data, while RoBERTa performs well for detecting emotion in text from online conversations and speech.
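
A sketch of the MFCC extraction step for the audio branch, using librosa; the file path, coefficient count, and frame padding are assumptions.

```python
# Illustrative MFCC extraction for a 2D CNN input; file path and sizes are placeholders.
import librosa
import numpy as np

audio, sr = librosa.load("utterance.wav", sr=16000)          # mono waveform at 16 kHz
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40)       # shape: (40, n_frames)

# Pad or truncate to a fixed number of frames so every clip has the same 2D shape.
target_frames = 300
if mfcc.shape[1] < target_frames:
    mfcc = np.pad(mfcc, ((0, 0), (0, target_frames - mfcc.shape[1])))
else:
    mfcc = mfcc[:, :target_frames]
cnn_input = mfcc[np.newaxis, ..., np.newaxis]                # (batch, 40, 300, 1) for a 2D CNN
```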

Joukova, Alexandra – Identifying Discriminative Attributes for Differentiation Between Depressed and Non-Depressed Social Media Users (Supervisor: Morteza Zihayat)

Social media data present an opportunity for analyzing users' personal posts and potentially detecting their mental state for clinical purposes. This paper focuses on contrast pattern detection in Facebook post text data with the purpose of discovering discriminative attributes that differentiate depressed social media users from non-depressed users. A labeled dataset is used that allows the mental status of each user to be classified with regard to clinical depression.
Relative frequency ratios are used to find discriminative n-grams, and LDA is used to determine discriminative topics that differentiate the two groups. Machine learning is then applied to compare the results of depression detection models. Linguistic Inquiry and Word Count (LIWC) user-level features are used as the baseline model. Results indicate that discriminative topic features improve depression prediction in comparison to the baseline model. Depending on the method, a combined model with LIWC features, discriminative topics, unigrams, and bigrams can also perform better than the baseline model. Topic modeling can provide important discriminative features that can be useful in depression detection and in understanding the differences between depressed and non-depressed social media users.

Klasnja, Anja – Impacts of Query Expansion in Information Retrieval Gender Bias (Supervisor: Ebrahim Bagheri)

This study explored gender bias arising from query expansion (QE) in the context of search engines and information retrieval. The analysis was conducted on Text REtrieval Conference (TREC) test collections - query and document collections of varying sizes and document types. Query reformulations by an assortment of QE algorithms were obtained with ReQue software. Ranked documents were retrieved in response to queries with Anserini software. The impact of different QE methods on gender bias in retrieval results was quantified using RaB/ARaB gender bias metrics and characterized using Linguistic Inquiry and Word Count (LIWC2015) scales of psychological constructs. Gender bias fluctuations were analyzed with respect to query genders, test collections, RaB/ARaB variations, numbers of top documents considered, QE classes and related fluctuations in LIWC scales.

Kodeih, Maya – A Deep Reinforcement Learning Approach Using DQN and DDQN to Optimize Job Shop Scheduling (Supervisor: Mucahit Cevik)

With reinforcement learning, agents are able to learn what actions to select in order to achieve their goal.
This paper aims to apply this automated learning approach to the single-machine dispatching rule selection problem. As such, we aim to show the potential of RL for agent-based production scheduling. In order to maintain enterprise profitability, it is essential to optimize machine productivity through better scheduling, and translating detailed process plans into the shop floor schedule is among the most crucial processes in manufacturing systems. Although scheduling problems are known to be NP-complete, reinforcement learning algorithms have been shown to perform well on various scheduling problems. In this project, we focus specifically on deep reinforcement learning algorithms to explore new ways of solving the job shop problem for manufacturing production scheduling, and we introduce a double deep Q network as an improvement.
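
A minimal sketch of the double deep Q-network target computation in PyTorch; the networks, replay buffer, and scheduling environment are assumed to exist elsewhere.

```python
# Sketch: double DQN targets, where the online network selects the greedy action
# and the target network evaluates it. Tensors are assumed to come from a replay buffer.
import torch

def ddqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # action selection
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # action evaluation
        return rewards + gamma * next_q * (1.0 - dones)

# loss = F.smooth_l1_loss(online_net(states).gather(1, actions).squeeze(1), targets)
```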

Lane, Donald – Semantic Segmentation of Building Rooftops from Aerial Images Using Transformers (Supervisor: Naimul Khan)

Accurate building rooftop extraction from high-resolution aerial imagery is of crucial importance in a wide range of applications. While automatic mapping of buildings has been limited by insufficient detection accuracy, recent advances in deep learning have steadily improved the performance of image segmentation. The primary objective of this project is to evaluate the use of a transformer-based model to perform automatic building rooftop segmentation, a semantic segmentation task, from high-resolution aerial imagery. The results of the transformer-based model are compared to Fully Convolutional Network (FCN) results generated by the authors of the dataset used in this project. The AIRS (Aerial Image Roof Segmentation) dataset was used to train a transformer-MLP model, SegFormer, a simple yet effective semantic segmentation framework that combines transformers with a lightweight Multilayer Perceptron (MLP) decoder. Applying inference on the test dataset, the results were compared to state-of-the-art results generated by the authors of the AIRS dataset. IoU, F1-score, precision, and recall were used to evaluate and compare the performance of the SegFormer model against the FCN architectures. Overall, the SegFormer results did not exceed those obtained by the FCN models: SegFormer produced a mean IoU on the test images of 0.828, whereas the FCN approaches achieved IoU of 0.882, 0.888, and 0.899. However, upon examining each individual image in the test dataset, the SegFormer results surpassed those of the FCN models in 3 of the 5 geographic regions (the central business area, the industrial area, and the complex area), which primarily consist of a mix of large and small buildings of varying complexity. These results illustrate the promise and potential of using transformers for vision-related problems, specifically semantic segmentation of building rooftops.

Majeed, Bilal – Deriving Corporate Image from Textual Data Using Sentiment Analysis (Supervisor: Ozgur Turetken)

In this paper, we focus on attempting to understand if a corporation’s success can be explained using various online sources: star ratings, reviews, news, social media, and other platforms. From each of these sources, we can extract consumer opinions that may have an impact on a corporation’s success. Therefore, we first defined the appropriate sources to extract these opinions, and then determined an effective sampling technique to obtain a representative sample. Once the opinions were extracted, we defined corporate sentiment using different combinations of the extracted sources. Lastly, we used the derived corporate sentiments to measure the degree to which it has an impact on a corporation’s popularity and financial performance.

Milley, Lesley – Predicting Wound Etiologies with Deep Learning (Supervisor: Naimul Khan)

In this paper we applied deep learning to predict four wound etiology groupings from two different datasets. Transfer learning of three architectures was explored and further tuning of two of those architectures was conducted using Optuna. The four etiology groupings include Leg Ulcers, Pressure Ulcers, Diabetic Foot Ulcers and Other. Leg Ulcers had the highest F1 score at 0.75 with an average of 0.71 across all of the etiology groupings. GRAD-CAM was used to explore where the model misclassified the data. Improved datasets with examples from all lower leg regions for all of the etiologies would improve this line of research.

Naahid, Shams – Identifying the Most Influential Nodes in Complex Networks Using Various Centrality Measures (Supervisor: Pawel Pralat)

Social networks, or complex networks in general, are complicated networks made up of members and the connections between them. These networks are rapidly expanding due to the daily addition of new members. Since members of such large networks are not all equally significant, finding prominent members becomes a practical challenge. Centrality measures are designed so that the corresponding scores reflect the importance of members within a network; these scores are very useful for understanding and maintaining a social network. In this project, we investigate several centrality measures from the literature on a large citation graph. While the results focus on a citation network, the same tools could be used to mine other complex networks, especially social networks. The results obtained were compared to the ground truth (the current number of citations) for the top-scoring papers of each centrality measure. Three correlation coefficients were then applied to these results along with the ground truth to further understand which centrality measures were the most effective and which were not to be preferred when analysing large networks.
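
A minimal sketch of computing centrality scores and comparing a ranking to ground truth with NetworkX and SciPy; the toy graph and ground-truth proxy are placeholders for the citation data.

```python
# Illustrative sketch: centrality measures on a graph and a rank correlation against
# a ground-truth score. The karate club graph stands in for the citation network.
import networkx as nx
from scipy.stats import spearmanr

G = nx.karate_club_graph()                       # placeholder for the citation graph

degree = nx.degree_centrality(G)
pagerank = nx.pagerank(G)
betweenness = nx.betweenness_centrality(G)

ground_truth = dict(G.degree())                  # stand-in for observed citation counts
nodes = list(G.nodes())
rho, _ = spearmanr([pagerank[n] for n in nodes], [ground_truth[n] for n in nodes])
print(f"Spearman correlation between PageRank and ground truth: {rho:.2f}")
```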

Patel, Meghna – Assessment of Time Series Forecasting and Classification Methods for Predicting Patient Arrivals (Supervisor: Mucahit Cevik)

Online medical consulting, referred to as telehealth or telemedicine, is becoming the norm globally. It is crucial to forecast patient demand accurately for successful resource management and a good consumer experience. Such forecasting problems can be solved using various statistical and machine learning algorithms. A time series, in simple terms, is a series of data points collected at predetermined, equidistant time steps, and a time series forecasting problem involves building models on historical data points to predict future ones. For this research, patient volume data provided by an online medical consulting platform are used. Multiple statistical and deep learning forecasting algorithms are assessed in an attempt to isolate the best approach for forecasting patient arrivals over a specific period of time. Specifically, six regression models and four classification models are evaluated. We show that SARIMA performs best in the case of regression and XGB performs best in the case of classification. This paper compares the performance of each approach and presents the challenges faced.

Priya, Sarika – Credit Card Fraud Detection (Anonymized Credit Card Transactions Labeled as Fraudulent or Genuine) (Supervisor: Farid Shirazi)

In the digital era of transactions, the credit system has enabled millions to fulfil every sort of necessity with ease. With the rise in credit card usage, the fraudulent instances associated with it have also risen, and some innocent people have lost their honest credit to criminals. Sifting through such a huge amount of data to find suspicious cases is not something even a team of highly skilled humans can handle. To prevent such cases, we propose machine learning-based and deep learning-based classification models to help classify whether a transaction is fraudulent or genuine (non-fraudulent). We took data covering two days of transactions from some European countries, which exhibit an extreme level of class imbalance, and used stratified sampling for the distribution of the data. The models we propose, i.e., Logistic Regression, Random Forest, Gradient Boosting, Ridge Classifier, MLP, and LSTM, are first tuned to give the best results and then compared with each other on various metrics, such as the confusion matrix, the Area Under the Precision-Recall Curve, and gain and lift charts, to arrive at the best possible model that deals with the high level of imbalance while keeping the scores at a productive level. Throughout this research, we implemented all the machine learning classification algorithms and compared them with deep learning models like MLP and LSTM to find the best performing model, with particular emphasis on the treatment of imbalanced data.

Saha, Milan – Consumer Opinion Classification for Major Canadian Telecom Operators (Supervisor: Farid Shirazi)

The telecom market in Canada is highly competitive: three main operators, along with several smaller ones, provide service in the Canadian market. Social media is a good source of data for measuring how a company is performing, since customers post their opinions and reviews online themselves, and operators also use social media to reach out to customers.
This project measures the competitive performance of mobile phone operators from Twitter data by creating a machine learning classification model to investigate consumer opinions and their ways of interacting via tweets.
A total of 116,375 tweets were collected from the official accounts of the top Canadian telecom operators (Bell, Rogers, TELUS, Freedom Mobile). After processing and cleaning the dataset, an exploratory analysis was done to find hidden patterns, and a classifier model was developed to analyze sentiment scoring using a few ML algorithms.
Of the four Canadian telecoms, TELUS has the highest percentage of positive tweets, with Rogers in second position at a 70% score. For the text classifier model, linear SVM with a count vectorizer initially had the highest accuracy; after fine-tuning with random oversampling, the TF-IDF vectorizer produced the highest accuracy.
This solution will help wireless telecom operators learn about negative customer experiences and improve the positive experience of their services.

Sandhya – Speech Emotion Recognition (SER) (Supervisor: Wahab Mohamed Ismail)

In this project, I apply machine learning and deep learning-based techniques to automatically detect emotions in speech. I began by performing an exploratory analysis of four different datasets to see how the different speech samples appear when visualized and whether any patterns, correlations, or relationships could be discovered.
In the second section, I built machine learning-based classifiers, starting with decision tree techniques and followed by the random forest algorithm, a multilayer perceptron, 1D convolutional neural networks (1D CNN), and 2D convolutional neural networks (2D CNN), to reach the best possible results.

Sidhu, Tanveer – Rossman Sales Prediction (Supervisor: Youcef Derbal)

In this paper, I discuss sales predictions for Rossman, one of the largest German drugstore companies. Several factors affect sales, including promotions, school and state holidays, seasonality, competition, and the day of the week. Using exploratory data analysis, correlations and relationships between features are explored to identify underlying patterns in the data, and feature engineering is used to identify interactions between different independent variables. The data cover 1,115 stores; however, for this project I randomly chose store 627 for the sales prediction analysis. Various predictive analytical algorithms were compared based on performance metrics such as root mean squared error and mean absolute error.

Somisetty, Kusumanjali – Online Detection of User’s Anomalous Activities Using Logs (Supervisor: Pawel Pralat)

Securing confidential information is always a concern for any organization. This paper implements a machine learning approach to monitor user activities and detect anomalous data. The term anomalous data refers to data that differ from what is expected or normally occurs. Detecting anomalies is important in most industries. For example, in network security, anomalous packets or requests can be flagged as errors or potential attacks; in customer security, anomalous online behavior can be used to identify fraud; and in manufacturing and the Internet of Things, anomaly detection is useful for identifying machine failures.

Stavrinou, Nicholas – Estimating the Effect of Pathway Alterations on Breast Cancer Survival: Cox Proportional Hazard Model (Supervisor: Youcef Derbal)

In this project, the impact of pathway alteration probabilities on breast cancer survival was analyzed using data from The Cancer Genome Atlas. The pathways evaluated are Cell Cycle, DNA Repair, MAPK, and PI3K/AKT1/MTOR. Clinical and microarray data were combined, and survival was modeled using Cox Proportional Hazards. Initial exploratory analysis on the microarray and clinical data was performed, followed by data preprocessing. Pathway alteration probabilities were calculated using tumour and normal tissue microarray readings, with missing normal tissue values imputed using medians. Feature selection was attempted using LASSO regression, which failed to identify known breast cancer survival features such as hormone receptor status. The final evaluation only included the pathway alteration probabilities and subtype-stratified data. The models illustrated that the effect of pathway alteration may vary across the different subtypes. Dataset limitations with respect to the number of patients, variation in tissue sample collection, and clinical data sparsity impact the interpretation of the results.

Sun, Jia Yong – Combined Model for Individual Household Electric Power Consumption Forecasting Based on the Arima and LSTM (Supervisor: Sharareh Taghipour)

Most machine learning algorithms and models mainly focus on improving forecast accuracy; this paper presents a new combined forecasting method. ARIMA and LSTM models are each used to fit and predict the time-series data, and a combined forecast is obtained by merging the two with suitable weight coefficients. The results of this research show that the new combined prediction model can describe the linear time-series data on individual household electric power consumption, and its prediction accuracy is higher than that of each single model. This method of combining models can be extended to include more forecast models based on different target data.
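
A minimal sketch of the weight-combination step only, assuming the ARIMA and LSTM forecasts are already available as arrays; the validation values and the grid search over weights are illustrative.

```python
# Sketch: choose a weight w that minimizes squared error of the combined forecast
# w * ARIMA + (1 - w) * LSTM on a validation window. All values are placeholders.
import numpy as np

def combine_forecasts(y_true, f_arima, f_lstm):
    """Grid-search w in [0, 1] minimizing mean squared error of the combination."""
    grid = np.linspace(0, 1, 101)
    errors = [np.mean((y_true - (w * f_arima + (1 - w) * f_lstm)) ** 2) for w in grid]
    w = grid[int(np.argmin(errors))]
    return w, w * f_arima + (1 - w) * f_lstm

y_val = np.array([1.2, 1.5, 1.4, 1.6])
w, combined = combine_forecasts(y_val, np.array([1.1, 1.4, 1.5, 1.7]), np.array([1.3, 1.6, 1.3, 1.5]))
print(w, combined)
```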

Tahrirchi, Fatemeh – Comparing Machine Learning Algorithms with Ordinary Least Squares Methods to Forecast Housing Prices (Supervisor: Murtaza Haider)

House price valuation plays a critical role in real estate markets. Homebuyers and sellers use valuation estimates to judge the adequacy of the list or asking price, and lenders use valuation models to determine the risks associated with mortgages and other underwriting. This study compares house price valuation modelling algorithms and explores the factors (variables) that are instrumental in explaining housing prices. The project describes the data cleaning, processing, and model training carried out on observed real estate transactions for dwellings sold at prices ranging from $120,000 to $3 million in the City of Toronto; the dataset covers sales between January 2016 and January 2018. This study predicts housing values using six machine learning models and one neural network model and compares their output with the traditional OLS algorithm. A review of the estimation results revealed that the XGBoost estimation returned the lowest prediction error.

Tang, Bill – Bull Put Option Vertical Spread Trading Using Neural Networks (Supervisor: Alexey Rubtsov)

This study applies neural networks to generate trade signals for the Bull Put Vertical Spread strategy on the stock Apple (AAPL). Options provide their buyer with the right, but not the obligation, to buy (Calls) or sell (Puts) 100 shares of an underlying stock at a strike price up until an expiry date. The Bull Put Vertical Spread involves selling a Put and buying another one with the same expiry but a lower strike price. This strategy is suitable for retail investors because it requires low investment, has limited downside, and can profit even when the underlying stock does not move. The neural network was structured as a binary classifier, predicting whether AAPL is expected to stay above 95% of the current price over the next 14 days. Applying a set of trading guidelines to the network's outputs, the recommended trades successfully outperformed the buy-and-hold strategy from January 2020 to August 2021.

Thakar, Romaben Anandkumar – Credit Card Fraud Detection (Supervisor: Farid Shirazi)

With the growth of e-commerce and the accompanying increase in online payments, fraud detection has become an important issue for banks. Fraud in financial transactions can cause heavy damage and endanger a bank's reputation among its customers. Thus, focusing on a variety of fraud detection methods, as well as new ways to tackle and prevent fraud, is becoming increasingly essential. It is important that credit card companies can recognize fraudulent credit card transactions so that customers are not charged for items they did not purchase.

Wilson, Raaid – Bayesian Deep Learning and Intrusion Detection Systems (Supervisor: Farid Shirazi)

Intrusion detection systems are part of the cyber-security measures that determine whether a computer system has been breached. Many machine learning approaches have been explored for detecting different types of security attacks; however, most of these experiments have not tested the model under zero-day attack circumstances. A zero-day attack comprises attack patterns that the models in question have not yet been trained on. The purpose of this MRP is to evaluate the effect of Bayesian regularization on a deep neural network for detecting zero-day attacks. The tests show that Bayesian regularization offers no advantage over the early stopping method in the proposed scenarios: the regular-test error rates of the Bayesian model are higher than those of the deep neural network, while the differences in detecting zero-day attacks are negligible.

Yarlagadda, Yesaswini Susrita – An Inexpensive End to End Deep Learning Model to Predict the Steering Angle of the Vehicle (Supervisor: Alex Ferworn)

Google, Tesla, and Uber are pioneers in automation technology, and automobile manufacturers have adopted intensive algorithms from Google and Tesla to develop autonomous vehicles. These autonomous vehicles have hardware that can support intensive algorithms; however, complex self-driving algorithms developed by Google or Tesla cannot be embedded into small-scale hardware. In this paper, an inexpensive deep learning model for an autonomous self-driving vehicle is presented. The main objective of this study was to develop an inexpensive end-to-end model to predict the steering angle of a vehicle. Since the autonomous systems present in self-driving vehicles cannot be embedded into commodity hardware, this study presents a solution: a lightweight model that can be embedded into low-cost hardware. When embedded into robotic systems that transport goods, these models can avoid 90% of manual handling accidents. Another major contribution of this study is modifying the existing AlexNet architecture and reusing it to solve an autonomous driving problem. Additionally, an implementation of NVIDIA's PilotNet model is also presented. Finally, a comparison was carried out in terms of size, performance, and complexity between the three models. The model proposed in this study is 3 times smaller than PilotNet and 15 times smaller than AlexNet, and while reducing the complexity of the neural network, it maintained performance comparable to NVIDIA's model.

Zazay, Muhammad Ayaz – Applying Transfer Learning for Binary Classification of Breast Cancer Using Tissue Images (Supervisor: Farid Shirazi)

In this paper, deep learning techniques such as transfer learning are applied to a dataset for binary classification of breast cancer. Breast cancer is the second most common cancer in women. Pre-trained models such as EfficientNet and MobileNet were applied to automate the process of classifying images into malignant or benign classes, using model weights pre-trained on the ImageNet dataset. The dataset contains histopathological images at four different magnification factors: 40x, 100x, 200x, and 400x. Both models were applied to each magnification factor and the results were compared on the basis of F1-score, sensitivity, and specificity. The dataset used to train and evaluate the models, the BreakHis dataset, was obtained from the Kaggle website.

Zhao, Yuheng – Identifying the Busiest Intersections in Markham Using Betweenness Centrality Method with Different Models (Supervisor: Pawel Pralat)

Traffic in the City of Markham has worsened in recent years, and roads are frequently congested, in part because many people from across the GTA have moved to Markham; a large number of people live in Markham but do not work there. The goal of this project is to find the busiest intersections on routes between commuters' homes and workplaces. The road network of Markham is a complex network, and many graph mining methods can be applied to analyze such networks. In this research, the betweenness centrality measure is used to analyze the road network and identify the busiest intersections in Markham under three different models. Three correlation coefficients are then used to quantify the differences between the models.
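
As a rough illustration of the betweenness centrality step, the sketch below uses NetworkX on a hypothetical edge list of Markham intersections; the file name and column names are placeholders, not the project's actual data.

```python
import networkx as nx
import pandas as pd

# Hypothetical edge list of the Markham road network:
# columns from_id, to_id (intersection ids) and length_m (edge weight).
edges = pd.read_csv("markham_roads.csv")

G = nx.Graph()
G.add_weighted_edges_from(edges[["from_id", "to_id", "length_m"]].itertuples(index=False))

# Intersections lying on many shortest routes between origin-destination
# pairs receive the highest betweenness scores.
bc = nx.betweenness_centrality(G, weight="weight", normalized=True)

busiest = sorted(bc.items(), key=lambda kv: kv[1], reverse=True)[:10]
for node, score in busiest:
    print(node, round(score, 4))
```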

Zhou, Jiangning – Prediction of Residential House’s Space Heating Demand Using Artificial Neural Network Models (Supervisor: Alan Fung)

This research applies artificial intelligence to predict the space heating demand of households in Toronto, Ontario, Canada, using data gathered as part of the Ecobee Donate Your Data program. First, statistics and data visualization are used for exploratory data analysis (EDA) to build a preliminary understanding of the data and explore relationships within it. Based on the EDA results, space heating is treated as a continuous variable and a data-driven regression prediction model is built. A variety of methods and techniques are then used, including data aggregation and time-lag information in feature extraction, and both simple machine learning regression algorithms and a deeper artificial neural network (ANN) regression model are applied to predict household space heating demand. Experiments were conducted on each scheme in turn to evaluate the practical feasibility of developing a prediction model from readily available data. The experimental results show that a space heating demand model can be successfully built using an artificial neural network. The model uses features that are simple and easy to gather and incorporates time-lag characteristics from the past hour. In future research on space heating, domain knowledge could be used to select only relevant data before applying the model for prediction, in order to achieve better results.

Abdolali-Senejani, Ali – Investigating the Challenges of Building a Robust Network Intrusion Detection System Through Assessment of Features and Machine Learning Models (Supervisor: Andriy Miranskyy)

Background: Existing literature on network intrusion detection systems has focused on testing transfer learning models on different portions of the same dataset. The efficacy of transfer learning needs to be assessed on different sources of data. Aim: Use a combination of feature selection and transfer learning strategies to evaluate the performance of machine learning models on distinct network datasets. Methodology: Select common features identified as important across the different datasets and use them to perform transfer learning experiments, training models on one dataset and evaluating performance on the others. Results: Feature importance tools showed that the models were using irrelevant features to make decisions. Transfer learning experiments yielded poor results when tested on two distinct datasets. Conclusion: Dropping irrelevant features improved the performance of the models. The poor transfer learning results could be associated with factors such as large differences in the datasets' creation dates, leading to a significant difference in the workloads under study.

Ahmed, Sabbir – Identifying White Blood Cell Sub Type from Blood Cell Images Using Deep Learning Algorithms (Supervisor: Farid Shirazi)

White blood cell (WBC) differential counting is an established clinical routine for assessing a patient's immune system status. The current state-of-the-art method for determining WBC differential counts requires fluorescent markers and a flow cytometer; however, this process involves several sample preparation steps and may adversely disturb the cells. We present an approach that uses deep learning algorithms to identify the white blood cell subtype from a blood cell image. Two deep learning classifiers were evaluated on stain-free imagery using stratified 5-fold cross-validation, and the best result obtained on the white blood cell dataset was 84% accuracy. Based on this evaluation, we propose a model for identifying white blood cell subtypes.

Anumanchineni, Harish – Cardiovascular Risk Detection Using Machine Learning and Artificial Neural Networks (Supervisor: Farid Shirazi)

In this paper, I applied machine learning and artificial neural networks to predict an individual's risk of cardiovascular disease. An individual's day-to-day living habits were used as the features of the dataset, which contains 70,000 records collected from kaggle.com. Major classification machine learning algorithms were applied for risk prediction, and the dataset was also used to train an artificial neural network for more accurate prediction. I first conducted exploratory data analysis to understand how the data is distributed, and reviewed the literature to gain useful insights before proceeding with the research. I then applied machine learning and neural networks to the dataset and experimented with a reduced feature set containing the key indicators of cardiovascular risk. TensorBoard was used to monitor training at each iteration of the neural network. The models successfully predicted the risk with an accuracy of 73 percent.

Araujo, Gregory – Transport Mode Detection Using Deep Learning Networks (Supervisor: Bilal Farooq)

With the advancement of technology, large amounts of global positioning system (GPS) trajectory data are being produced and recorded by many devices and products. This has made learning transportation modes a relevant area of research, given its applications to everyday life. Depending on the mode of transportation, factors such as location and weather can have an impact on the information collected from GPS trajectories. This project focuses on combining features extracted from GPS trajectories collected via the MTL Trajet Project with geospatial and temporal attributes provided by the Government of Canada to detect transport mode (bike, car, public transportation, walking). The project examines the effectiveness of Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) networks, and Convolutional LSTM networks (ConvLSTM) in transport mode detection.

Bagheri, Moeen – Multi-Step Forecasting of Walmart Products (Supervisor: Konstantinos Georgiou)

Forecasting future sales is important to retailers for managing inventory and making marketing decisions. Product sales are affected by many external factors, which must be considered when forecasting future sales. In this paper, the effects of these factors were directly taken into account in the four models created: three singular models, consisting of LSTM, MLP, and LGBM, as well as a hybrid model. The hyperparameters of the singular models were optimized using Bayesian optimization. Furthermore, we aimed to provide 28-day-ahead sales forecasts by forecasting one day at a time. The LGBM model achieved the best performance, followed by the hybrid model. The outstanding results of the LGBM model show the potential of boosting methods in improving overall performance. Moreover, the LSTM model was able to outperform the MLP model, which demonstrates the ability of LSTM networks to learn from time-series data.

Beilis Banelis, Aleksander – Loan Outcome Prediction in P2P Lending (Supervisor: Farid Shirazi)

This study uses the book of loans from a Peer-to-Peer lending platform, Lending Club, to determine whether machine learning can be applied to predict the final outcome of already-approved loans so that losses from bad loans can be minimized. Three machine learning algorithms are applied: Logistic Regression, Random Forest, and a Feedforward Neural Network. The dataset presents the challenges of imbalanced classes and high class overlap among the features; these are addressed through under-sampling for model training and cost-sensitivity for model optimization. Results indicate that the benefits of the prediction models are almost entirely eliminated by the costs of incorrect predictions on good loans. The same approach could yield a different result where error costs differ.

Chane, Gagandip – Smart Reply for Online Patient-Doctor Chats (Supervisor: Mucahit Cevik)

Telehealth is an evolving field that enables remote medical services through the use of technology. In this study, we propose a smart reply system for the medical domain that supports online doctors with automated reply suggestions for patient messages, built from historical online chats between doctors and patients. We first describe a data labeling exercise that transforms the raw data into a usable format. We then design a triggering model, a binary classification task that detects patient messages that are good candidates for reply suggestion, and a reply suggestion model, a multinomial classification task that predicts the top responses for a given patient message. For the triggering model, a feedforward neural network outperformed the random forest classifier. For reply suggestion, LSTM-based models slightly outperformed feedforward neural networks on average but had a significantly longer run-time. The results show that it is possible to build a high-performing smart reply system in the medical domain.

Chow, Roger – IDC Prediction in Breast Cancer Histopathology Images (Supervisor: April Khademi)

Invasive ductal carcinoma (IDC) is the most common form of breast cancer, accounting for 80% of all breast cancer diagnoses. Manual diagnosis of IDC by examining histopathology slides is a tedious, time-consuming process for pathologists. With advances in whole slide image scanner technology, there is growing interest in the automation of IDC detection. In this paper, experiments were conducted to develop a deep convolutional neural network model for IDC classification on breast cancer histopathology image patches. An open-source IDC dataset containing 277,524 labelled image patches (70% IDC negative and 30% IDC positive) was used for this study. The highest performing model built in this study achieved an accuracy of 91.05%, a balanced accuracy of 90.12%, and a recall rate of 87.98% using the DenseNet201 architecture with cyclic learning.

Chowdhury, Mohsena – Investigating Organizational Crisis Using Text Mining Technique (Supervisor: Farid Shirazi)

In any organization, crisis handling is a crucial capability; a crisis can be critical and may even cause the business to collapse. Our study suggests that strong communication can help an organization manage a crisis sensibly. For this project we use the Enron Corp. email corpus. The aim of this research project is to find relational patterns in organizational email logs that could identify a crisis before it appears, using text mining methods. Our approach utilizes Latent Dirichlet Allocation (LDA), a popular topic modeling technique, to determine the topical distributions of emails. We use two different packages (Gensim and Mallet) to compute the LDA models and also compare the results with LSA/LSI. Finally, we evaluate these three models by measuring their coherence scores and select the optimal model based on the score. We also analyze the resulting topics to determine whether any patterns indicate a potential crisis.
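
To make the pipeline concrete, here is a minimal Gensim sketch of fitting an LDA model and scoring it with c_v coherence; the toy token lists stand in for the preprocessed Enron emails, and the number of topics is an arbitrary illustration.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Toy stand-in for the tokenized Enron emails.
emails = [
    ["budget", "meeting", "deadline", "quarter"],
    ["trade", "risk", "report", "exposure"],
    ["deadline", "report", "quarter", "review"],
    ["exposure", "risk", "budget", "trade"],
]

dictionary = Dictionary(emails)
corpus = [dictionary.doc2bow(doc) for doc in emails]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)

# c_v coherence is the score used above to compare candidate models.
coherence = CoherenceModel(model=lda, texts=emails, dictionary=dictionary,
                           coherence="c_v").get_coherence()
print("coherence:", coherence)
```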

Cohen, Rory – Predicting the Profitability of Canada’s Big Five Banks: Predictive Analysis With Google Trends and Twitter (Supervisor: Ozgur Turetken)

Google Trends and Twitter sentiment analysis have been used by analytics researchers in a variety of studies to predict sales, market specific data, and customer consumption. This study aims to estimate the quarterly profitability ratios of Canada's Big Five Banks using the aforementioned data. Differing from similar reports that attempt to use consumer behaviour in financial projections, we also use the Google Trends of each company's direct competitors to generate our predictions. A linear regression model was created as a baseline for metric prediction. Two subsequent neural networks were tuned and applied to the same datasets in an attempt to improve performance. Two cost-related metrics showed strong results using the trend and sentiment data. Ultimately, the findings suggest that factors outside of consumer behaviour have an outsized impact on most quarterly financial metrics. Company resource allocation and management interests play a large role in the ratios within our scope.

Dhall, Ankit – Household Space Heating Demand Modelling Using Simplified Black-Box Models (Supervisor: Alan Fung)

This research applies a novel idea of utilizing an ANN (Artificial Neural Network) black-box model to predict the space heating demand of households in Toronto, Ontario, Canada. The data used is gathered as part of the Ecobee Donate Your Data program. First, an exploratory analysis is conducted using descriptive analytics and data visualization to find patterns or relationships that could give insight into the data. Further, multiple approaches and techniques, such as data aggregation and the inclusion of time-lag information, are applied to model and predict the space-heating demand of any house using only basic, easy-to-record features. In addition, experiments are conducted to gauge the practical viability of the black-box model developed. This research was conducted as a continuation of an ongoing study at the university's Centre for Sustainable Energy Systems (CSES). Despite a few issues with the data being modelled, the space-heating demand was successfully predicted using black-box ANN models with simple, easy-to-observe features and time-lag information for the past half-hour. In addition, the model demonstrated a practical learning capability as additional data was added. For future studies predicting space heating with this data, it is recommended to apply data aggregation techniques and additional feature engineering, and to use domain knowledge to filter the data down to relevant records, in order to achieve better prediction results.

Emamidoost, Maryam – Application of Deep Learning in the Segmentation of the Brain Regions to Predict Alzheimer’s Disease (Supervisor: Ayse Bener)

Among brain diseases, Alzheimer's Disease is ranked as the third leading cause of death in older adults, after heart disease and cancer, and the sixth leading cause of death in the United States overall. In this research, we aim to predict Alzheimer's Disease based on the structural components of the human brain. To this end, we create two Convolutional Neural Network models: the first for segmentation of brain regions based on the Harvard-Oxford Atlas, and the second for prediction of Alzheimer's Disease from the segmented MRI images. The results of predicting AD from the segmentation model indicate that a link can be established between the structure of the brain and the appearance of Alzheimer's Disease.

Ewen, Nicolas – Self Supervision for Classification on Small Medical Imaging Datasets (Supervisor: Naimul Khan)

Traditionally, convolutional neural networks need large amounts of data labelled by humans to train. Self supervision has been proposed as a method of dealing with small amounts of labelled data. The aim of this study is to determine whether self supervision can increase classification performance on small medical imaging datasets. This study also aims to determine whether the proposed self supervision strategy is a viable option for small medical imaging datasets. A total of 8 experiments are run comparing the classification performance of the proposed method of self supervision with the performance of basic transfer learning. The experiments run with the proposed self supervision strategy perform significantly better than their non-self supervision counterparts. The results suggest that self supervision can improve classification performance on small medical imaging datasets. They also suggest that the proposed self supervision strategy is a viable option for small medical imaging datasets.

Ghavifekr, Amin – Machine Learning Approach in Forex (Foreign Exchange) Market Forecasting (Supervisor: Farid Shirazi)

In recent years, applying machine learning techniques to historical Foreign Exchange market data has gained a lot of attention. We contribute to the published literature by applying comparable methods to the four major currency pairs (EUR/USD, GBP/USD, USD/CHF, and USD/JPY), concentrating on time series analysis for trend and momentum predictions. We used Long Short-Term Memory (LSTM) networks, a form of recurrent neural network, to build our model and tested two methods of prediction: point-by-point prediction and multi-sequence prediction. Furthermore, we examined the use of more than one input dimension. Our results show that the multi-sequence prediction method, combined with multi-dimensional inputs, gives a clearer, though not perfect, indication of future price trends.
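
The sketch below illustrates the two prediction modes in Keras, assuming a pre-saved array of closing prices; the window length, layer sizes, and file name are illustrative choices rather than the study's configuration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

series = np.load("eurusd_close.npy")           # hypothetical closing-price series
window = 30

X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., np.newaxis]                         # (samples, timesteps, features)

model = keras.Sequential([
    layers.Input(shape=(window, 1)),
    layers.LSTM(50),
    layers.Dense(1),                           # next-step price (point-by-point)
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=10, batch_size=64)

# Multi-sequence prediction: feed each prediction back in as input to roll
# the forecast forward several steps.
last = series[-window:].reshape(1, window, 1)
preds = []
for _ in range(14):
    nxt = model.predict(last, verbose=0)[0, 0]
    preds.append(float(nxt))
    last = np.append(last[:, 1:, :], [[[nxt]]], axis=1)
print(preds)
```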

Gupta, Vatsla – Automated Hate Speech Detection Using Deep Learning Models (Supervisor: Anatoliy Gruzd)

Social media is an interactive online platform where people express their opinions on various subjects, and it has become a prominent ground for toxic online hate behavior. Online hate speech detection has received significant attention due to the rise in cyberbullying across social media platforms, and it poses key challenges such as understanding the context in which words are used. To address this concern, we explore combinations of word embedding models (Keras word embeddings, Word2Vec, and GloVe) with deep learning models (BiLSTM, CNN, and CNN-BiLSTM) to capture the deeper semantics and syntactic structure of tweets, helping the models understand context and thus aiding hate speech detection. First, the words in the tweets are converted into word vectors using the embedding models; these word vectors are then fed into the deep learning models to learn the context for hate speech classification. The experimental results show that using the different embedding matrices significantly improved accuracy and F1 score for each model. Evaluated with 10-fold cross-validation, CNN-BiLSTM with GloVe word embeddings performed the best. Finally, to explain the predictions of the deep learning classifiers, a LIME analysis is performed to validate the models' credibility.

Hemel, Tahseen Amin – A Deep Learning Approach in Detecting Financial Fraud (Supervisor: Farid Shirazi)

Being able to detect fraudulent transactions among all credit card transactions in real time is extremely important for financial institutions. According to McKinsey, worldwide losses from card fraud could be close to $44 billion by 2025. It is therefore challenging for financial institutions to quickly identify fraudulent transactions without hampering legitimate ones, while providing a superior customer experience to all stakeholders. In this project, I used a dataset containing transactions made by European cardholders in September 2013 (openly available on Kaggle) and conducted four experiments with different classification approaches to identify fraudulent and legitimate transactions. To measure the performance of the classifiers across experiments, I used classification reports and confusion matrices from the 'sklearn' library as well as the RMSE score, and compared which experimental setup is more effective at identifying fraud.

Houshmand, Bita – Facial Expression Recognition Under Partial Occlusion (Supervisor: Naimul Khan)

Facial expressions of emotion are a major channel in our daily communications, and they have been the subject of intense research in recent years. To automatically infer facial expressions, convolutional neural network based approaches have become widely adopted due to their proven applicability to the Facial Expression Recognition (FER) task. Meanwhile, Virtual Reality (VR) has gained popularity as an immersive multimedia platform where FER can provide enriched media experiences. However, recognizing facial expressions while wearing a head-mounted VR headset is a challenging task because the upper half of the face is completely occluded. In this project we attempt to overcome these issues and focus on facial expression recognition in the presence of severe occlusion, where the user is wearing a head-mounted display in a VR setting. We propose a geometric model to simulate the occlusion produced by a Samsung Gear VR headset, which can be applied to existing FER datasets. We then adopt a transfer learning approach, starting from two pretrained networks, VGG and ResNet, and fine-tune them on the FER+, AffectNet, and RAF-DB datasets. Experimental results show that our approach achieves results comparable to existing methods while training on three modified benchmark datasets that reflect realistic occlusion from wearing a commodity VR headset.

Ilic, Igor – Explainable Boosted Linear Regression (Supervisor: Mucahit Cevik)

Time series forecasting is a continuously growing research field. In recent times, there has been a growth in the number of deep learning-based models. While these models are highly accurate, a trade-off is made in terms of model interpretability. Not only do deep learning methods present problems with interpretability, but they are also more difficult to compare on single time-series datasets. This work alleviates these problems by presenting two novel ideas. First, a new approach to time series model comparison is introduced. This new approach allows for robust time series comparison in cases of lengthy model training time. Then, a new time series forecasting model, Explainable Boosted Linear Regression (EBLR), is presented. EBLR is compared to other ensemble methods and can retain accuracy while reducing complexity in its formulation.

Ioi, Kevin – Dirichlet Multinomial Mixture Models for the Automated Annotation of Financial Commentaries (Supervisor: Mucahit Cevik)

Supply chain managers require detailed reports to be written whenever performance significantly deviates from forecasts. These reports summarize and explain the key drivers of the variance, which are then used to inform future business decisions and forecasting. This paper proposes an automated system for the annotation of the variance commentaries and applies machine learning models to classify time series instances of performance data with the topic derived labels. Class labels manually annotated by an industry analyst are compared against the output of three topic modeling methodologies, namely LDA, GSDMM and GPUDMM. Various machine learning models are applied for the classification task, including LSTM, FCN, XGB and KNN-DTW. The numerical study shows that topic derived labels achieve higher levels of performance in the classification task when compared to the baseline labels. The proposed system could save time and provide valuable insights for business management.

Ionno, Anthony – Benchmarking Machine Learning Prediction Methods on an Open Dataset of Hourly Electricity Data for a Set of Non-Residential Buildings (Supervisor: Thomas Duever)

Roughly 64.7% of Canada's annual electricity consumption was attributed to non-residential (commercial and industrial) buildings in 2016 [1]. Smart meters provide companies and building managers with an opportunity to track electricity consumption at an hourly or sub-hourly granularity. Efficient management of a non-residential building's electricity consumption benefits the building manager through bill savings; benefits the electricity system, because demand reduction in peak hours or demand shifting can defer or even prevent significant infrastructure investment [2]; and benefits the environment in the form of reduced carbon emissions. The aim of this paper is to present a variety of supervised and unsupervised machine learning methods that may allow companies or building managers to better predict future electricity consumption and make more informed decisions on building operations. We train six machine learning models on one year's worth of hourly electricity data for each of the 828 non-residential buildings in our dataset. Randomised-search time-series cross-validation was used to determine optimal hyperparameters for each building and model combination. We also present a cluster analysis model as an exploratory technique for understanding how daily electricity load profiles can be grouped and compared in a variety of circumstances. Our test results show that Mean Absolute Percentage Error (MAPE) varied considerably across our sample and across the models tested, likely due to significant differences in a building's electricity consumption patterns between the training and test sets. We also found that Gradient Boosting Decision Trees (GBDT) outperformed all the other machine learning models we tested by a significant margin.
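
A minimal sketch of randomized hyperparameter search with time-series cross-validation for one building, using scikit-learn; the synthetic feature matrix and the parameter ranges are placeholders for the actual building data and search space.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(8760, 10))   # toy stand-in: one year of hourly lag/calendar features
y = rng.normal(size=8760)         # toy stand-in: hourly consumption

param_dist = {
    "n_estimators": [100, 300, 500],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.01, 0.05, 0.1],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(),
    param_distributions=param_dist,
    n_iter=10,
    cv=TimeSeriesSplit(n_splits=5),   # folds respect temporal order
    scoring="neg_mean_absolute_percentage_error",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```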

Kabe, Devika – Text Highlighting to Improve Quality of Online Medical Services (Supervisor: Mucahit Cevik)

The medical domain is one which is often subject to information overload. With the constant updates to online medical repositories and the increasing availability of biomedical datasets, it is difficult to analyze the data in a structured way. This creates additional work for medical professionals who are heavily dependent on medical data to complete their research. This paper applies different text highlighting techniques to capture relevant medical context. This would reduce doctors' cognitive load and response time to patients by helping them make faster decisions, thus improving the overall quality of online medical services. Two text highlighting methods are evaluated. The first uses Local Interpretable Model-Agnostic Explanations (LIME), applied to a number of classification models. The second applies binary classification models to n-grams. These models are applied to different vector embeddings, including word2vec and BERT. The results of these experiments show that unigram classification models outperform LIME and can successfully be used to highlight medically relevant words. The results also show that performance drops when the models highlight bigrams and trigrams, so segment highlighting needs to be analyzed further.
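
As an illustration of the LIME-based highlighting approach, the sketch below wraps a TF-IDF logistic regression pipeline with LimeTextExplainer; the training sentences and labels are toy stand-ins for the medical corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from lime.lime_text import LimeTextExplainer

# Toy stand-ins for the medical corpus: 1 = medically relevant, 0 = not.
texts = [
    "patient reports chest pain and shortness of breath",
    "appointment rescheduled to next tuesday",
    "elevated blood pressure and irregular heart rate observed",
    "please call the front desk to confirm your visit",
]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

explainer = LimeTextExplainer(class_names=["irrelevant", "relevant"])
exp = explainer.explain_instance(
    "patient complains of chest pain after exercise",
    clf.predict_proba,          # LIME perturbs the text and queries this function
    num_features=5,
)
# Words with large positive weights are candidates for highlighting.
print(exp.as_list())
```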

Kamei, Josephine – Predicting the Remaining Useful Life of the C-MAPSS Turbofan Engine Simulation Dataset FD001 (Supervisor: Sharaeh Taghipour)

Prognostics is employed in machinery maintenance where degradation patterns due to various mechanical problems are observed; it continuously monitors the current state of the machinery, which helps predict the time remaining before a likely machinery or system failure, referred to as the remaining useful life (RUL). This report focuses on the C-MAPSS turbofan engine simulation dataset FD001, where Regression Decision Trees, Random Forests, and Gradient Boosting Regressor algorithms were implemented to predict RUL values. The results indicate that Random Forests produced the most accurate prediction model compared to the other two algorithms.

Karami, Zahra – Cluster Analysis of Stock For Efficient Portfolio Management (Supervisor: Farid Shirazi)

Stocks are a common kind of financial time series. In this project, I present a new similarity measure for time series clustering and then select a set of stocks to create an efficient portfolio, a process of crucial importance in portfolio management. Clustering-based selection reduces the number of candidate stocks considered each time an efficient portfolio is constructed: only a subset of stocks drawn from different groups is selected each time, making it easier to obtain the portfolio with the lowest risk at a given level of return. S&P index stocks were used in this work, and Ward hierarchical clustering was used to cluster the stocks. Compared with other selection methods, the results show that this approach can largely reduce the effort of constructing efficient portfolios.
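
A minimal sketch of Ward hierarchical clustering on a correlation-based distance between stock return series; the random returns and the specific distance formula are illustrative assumptions, not necessarily the similarity measure proposed in the study.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy stand-in for daily return series (rows: days, columns: tickers).
rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.normal(size=(250, 6)),
                       columns=["AAPL", "MSFT", "XOM", "CVX", "JPM", "GS"])

corr = returns.corr()
dist = np.sqrt(0.5 * (1 - corr))                   # correlation -> distance
Z = linkage(squareform(dist.values, checks=False), method="ward")

# Cut the dendrogram into a fixed number of groups; one stock per group can
# then be selected when building the portfolio.
clusters = fcluster(Z, t=3, criterion="maxclust")
print(dict(zip(returns.columns, clusters)))
```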

Lalonde, Rebecca – Direct Marketing Modelling: Comparing Accuracy and True Positive Rates of Classification Models (Supervisor: Brennan Thompson)

There are many classification models available: when predicting the result of a marketing campaign, which is the best to use? Metrics such as accuracy and true positive rate must be considered in order to maximize profit. This paper compares these metrics across various classification models. The dataset used is a Portuguese bank’s campaign, found at UCI’s Machine Learning Repository (Moro et al., 2014), targeting customers through phone calls to encourage them to subscribe to a term deposit. The classification goal is to predict whether a customer will accept or decline. Analysis is coded in RStudio; the models looked at include Naïve Bayes, logistic regression, decision trees, and SVM. Accuracy and true positive rates are compared with confusion matrices and ROC curves. The runtimes and interpretability of each model are also discussed.

Li, Vivian – Predicting Stock Market Volume Changes with News Article Topics (Supervisor: Alexey Rubtsov)

In this project, news articles from Kaggle's "All the News" dataset are used to predict changes in S&P 500 index trading volumes. The dataset was split into subsets to create 5 cross-validation sets and 3 test sets. To build the model, we started by extracting 100 main topics on peak days, where a peak day is defined as a day on which the change in trading volume falls outside the 95% confidence interval in the training set. From the 100 topics, a LASSO regression model was used to select the most relevant topics for predicting volume changes and to generate predictions. For the 3 test datasets, model performance was mainly evaluated on the number of peaks explained and the RMSE, with varied success. Compared to a time series prediction, however, the LASSO regression was better able to predict the timing of the volume fluctuations. On the peak days that were explained, many of the top topics were related to finance, business, and politics.

Malik, Garima – Predicting Financial Commentaries Using Deep Neural Networks (Supervisor: Mucahit Cevik)

Companies generate financial reports to measure business performance and assess deviations from forecasts, and analysts comment on these reports to explain the causes of the deviations. In this research paper, we propose a deep learning based approach to predict these commentaries from the financial data generated by a company. We formulate the problem as a time series classification task in which the variance, derived from the difference between forecast and actual figures, is represented as a monthly time series. The data is manually labeled by financial experts into financial commentary classes. We considered various deep learning models for the prediction task, including Long Short-Term Memory (LSTM) networks and Fully Convolutional Networks (FCN). To demonstrate the capabilities of the neural network architectures, we also created synthetic time series data, and classification is performed on the industry data as well as on the rule-based data. We treat AI interpretability as an additional component of the project, to better explain the predictions to business users. Our numerical study shows that the FCN provides higher performance and natural, better explainability with Class Activation Maps compared to the other methods. The proposed approach leverages management information systems to offer significant insights for managers and financial experts on key financial issues, including sales and demand forecasting.

Milacic, Dejan – Neural Style Transfer of Environmental Audio Spectrograms (Supervisor: Sridhar Krishnan)

Neural Style Transfer is a technique which uses a Convolutional Neural Network to extract features from two input images and generates an output image which has the semantic content of one of the inputs and the “style” of the other. This project applies Neural Style Transfer to visual representations of audio called spectrograms to generate new audio signals. Audio inputs to the style transfer algorithm are sampled from the Dataset for Environmental Sound Classification (ESC-50). Generated audio is compared on the basis of input spectrogram type (STFT vs. CQT) and pooling type (max vs. average). Comparison is done using Mean Opinion Scores (MOS) calculated from ratings of perceptual quality given by human subjects. The study finds that STFT spectrogram inputs achieve high MOS when subjects are given a description of the style audio. The audio generated using CQT spectrogram inputs raises concerns about using visual domain techniques to generate audio.

Murad, Mohammad Wahidul Islam – Demand Forecasting For Wholesale Sales by Industry Considering Seasonality Demand (Supervisor: Saman Hassanzadeh Amin)

Demand forecasting is the basis for planning supply chain activities, and it is important to choose an effective forecasting technique appropriate to the specific dataset. An appropriate forecasting technique helps management use this information to maintain the flow of materials, products, and information in supply chain management. Research on different demand forecasting techniques has been active for many years. The aim of this research project is to study and implement effective forecasting techniques on a time-series dataset of wholesale products grouped by industry type under the North American Industry Classification System. The objective is to forecast demand for wholesale products by industry based on historical time-series data, and to evaluate and compare forecast accuracy using performance evaluation metrics. Three time-series forecasting models, ARIMA, SARIMAX, and Seasonal Decomposition, were used to predict the demand for 23 different wholesale product categories. The evaluation metrics Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE) were used to quantify the accuracy between actual and predicted data. The outcome of this project is a comparison of the forecasting models and the identification of the most suitable technique for predicting wholesale product demand.
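
The sketch below shows a seasonal SARIMAX fit and a MAPE evaluation for one product series with statsmodels; the synthetic monthly series and the model orders are illustrative, not the tuned values from the project.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Toy monthly series standing in for one wholesale product category.
idx = pd.date_range("2015-01-01", periods=72, freq="MS")
rng = np.random.default_rng(0)
sales = pd.Series(100 + 10 * np.sin(np.arange(72) * 2 * np.pi / 12)
                  + rng.normal(scale=5, size=72), index=idx)

train, test = sales[:-12], sales[-12:]

model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
fit = model.fit(disp=False)

forecast = fit.forecast(steps=12)
mape = (abs(forecast - test) / test).mean() * 100   # one of the metrics used above
print(f"MAPE: {mape:.1f}%")
```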

Oyetola, Oyindamola – Predicting Housing Prices Using Deep Neural Networks (Supervisor: Tony Hernandez)

Housing price prediction is an important aspect of real estate as it can improve the efficiency and stability of the real estate market for buyers, sellers and the government. Regression analysis is the traditional approach to hedonic price-prediction models. Using a detailed dataset of house sale transactions in Ames, Iowa, this research compares the predictive power of four analytical methods (deep neural networks, principal component regression, support vector regression and regression trees). Results show that the deep neural network approach is the most effective for predicting house prices. The deep neural network also performs better with large amounts of data compared to other machine learning algorithms.

Parker, Megan – Predicting Stages of Dementia: An Exploration of Feature Selection and Ensemble Methods (Supervisor: April Khademi)

Dementia is a syndrome which affects 50 million patients worldwide with symptoms ranging from forgetfulness to difficulty walking [1]. Diagnosis of dementia is a challenging and important problem as no test exists today which can easily classify the type of dementia for a patient [3]. The objective of this research is to build a pipeline, which uses imaging and non-imaging features to predict the stages of dementia for a given patient. The first aim of this research is to determine whether grouping features into subsets can improve model performance. The second aim of this research is to determine whether the results for individual classifiers can be improved using ensemble methods. The ADNI Dataset will be used in this experiment [2]. The highest performing model was an ensemble which used a combination of deep learning and traditional classifiers trained separately on imaging and non-imaging data, with an accuracy of 89.12%.

Patel, Kshirabdhi – Insight Extraction from Regulatory Documents Using Text Summarization Techniques (Supervisor: Saman Hassanzadeh Amin)

Legal documents are hard to understand and generally require specialized knowledge to read and extract information from. This makes it difficult to find and follow the acts and regulations relevant to a particular business, job, or other activity, and hiring someone who understands them can cost hundreds of dollars. There is therefore a need for technology that can recommend a list of relevant acts and regulations and provide summaries that make these legal texts understandable. To address this, we develop an NLP framework that automatically retrieves relevant documents according to the user's requirements and produces a summary report of the regulations. The dataset used here is the Canadian Government Regulations and Acts dataset, which was made public by the Canadian Government in 2018 for use by the data science community.

Percival, Dougall – Experiments in Human-Interpretable Feature Extraction for Medical Narrative Classification (Supervisor: Mucahit Cevik)

Statistics Canada's Canadian Coroners and Medical Examiners Data is a database containing coroners' reports: unstructured text describing the results of their findings. Statistics Canada is searching for improved methods of identifying relevant information and classifying these reports. Due to COVID-19-imposed constraints, a Medical Transcriptions dataset was used to mimic this data. To address the problem, seven experiments were conducted using rule-based and machine learning based techniques for information extraction and text classification. The results indicate that custom Named Entity Recognition, a subset of Natural Language Processing, is the most promising method for extracting key information that can further help classify unstructured text narratives. As a government agency, Statistics Canada requires transparency in its methods, and the best method offers not only a strong classifier but also one that is transparent and easily interpretable.

Rezwan, Asif – Analysis of Daily Weather Data in Toronto to Predict Climate Change Using Bayesian Approach (Supervisor: Farid Shirazi)

Daily weather data for the City of Toronto from 1840 to 2017 was used to assess whether the pattern of rainfall and snowfall occurrence in this region has changed over the years, using a Bayesian analysis procedure. Markov Chain Monte Carlo (MCMC) methods were used to find the posterior; the No-U-Turn Sampler, a recent MCMC method, generated approximate posterior distributions of the lengths of wet and dry spells for the rainfall and snowfall data over the 177-year period. A comparison based on time series plots of the posterior found that the probabilities of wet spells have changed significantly over time for both the rainfall and snowfall data: the trend is upward for rainfall and downward for snowfall.

Saleem, Muhammad Saeed – Speech Recognition on English and French Dataset (Supervisor: Konstantinos Georgiou)

Emotions are a basic part of human nature and carry additional insight into human actions. In this paper, I attempt to create a model that helps classify basic human emotions. The initial model is built on an English-language dataset, the TMU Audio-Visual Database of Emotional Speech and Song (RAVDESS). Based on recent studies, the Mel-spectrogram helps extract important features from audio data; those features are used in four different models, SVM, deep neural network, CNN, and CNN+LSTM, to test which provides the best result. The best model is also tested on a French dataset to examine whether basic emotions differ depending on the language.
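
As an illustration of the Mel-spectrogram feature extraction step, the sketch below uses librosa on a single clip and summarizes each mel band for a classical classifier; the file path and the mean/std summary are illustrative choices.

```python
import librosa
import numpy as np

# Hypothetical path to one RAVDESS clip.
y, sr = librosa.load("ravdess_sample.wav", sr=22050)

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)

# Simple fixed-length summary: mean and std of each mel band over time,
# giving one 256-dimensional row of the feature matrix for the classifiers.
features = np.concatenate([log_mel.mean(axis=1), log_mel.std(axis=1)])
print(features.shape)
```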

Saleem, Waleed – Recognizing Pattern Based Maneuvers of Traffic Accidents in Toronto (Supervisor: Saman Hassanzadeh Amin)

This paper discusses the use of machine learning techniques to detect patterns in traffic accidents in Toronto. The primary purpose of this research is to identify and analyze driving patterns and behaviours, with Toronto, Canada as the main sample. The aims are to examine the factors that contribute to road accidents and to statistically evaluate the effect of certain personal characteristics of drivers on road accidents. The paper proposes a model that trains a multi-class classification dataset using machine learning algorithms. Four supervised classifiers are implemented on the dataset and their accuracies are compared to identify the best option for the given data. The best algorithm achieved more than 95% accuracy, indicating a strong fit to the given dataset.

Seerala, Pranav Kumar – Classification of Chest X-Ray Images of Pneumonia Patients (Supervisor: Sridhar Krishnan)

In this work, an attempt has been made to develop a neural network with a limited number of parameters, with the goal of classifying chest X-rays of pneumonia patients versus healthy patients. The intended applications are edge devices such as cellular phones, Raspberry Pis, and other computing devices that could be used in developing countries where the hardware to deploy and update large models may be lacking. The dataset used consists primarily of pediatric patients, and the work demonstrates the use of image segmentation, image de-noising, and training data selection to train on the images with the most meaningful information rather than the entire dataset. The results of the hyperparameter-tuned model show a dramatic improvement in overall test-set accuracy when compared to other Kaggle kernels.

Silina, Eugenia – Knowing the Targets When Inoculating Against an Infodemic: Classifying COVID-19 Related News Claims (Supervisor: Anatoliy Gruzd)

In this paper, news claims related to Covid-19 were classified into multiple mutually exclusive, pre-defined categories based on the text content of each claim. The goal of the study was to automate this classification, which was previously performed manually. For this purpose, Naïve Bayes (NB), "one-vs-one" (OVO), and "one-vs-all" (OVA, also known as "one-vs-the-rest", OvR) approaches were used. While results for OVO were inconclusive, NB and OVA produced similar results, though the overall performance metrics for both were not very high. However, due to the particulars of the dataset, including its class imbalance, predictions for some claim types were more successful than for others, with performance metrics varying significantly.
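
A minimal sketch of the one-vs-all setup on TF-IDF features with scikit-learn; the example claims and class labels are toy stand-ins for the annotated claim dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the annotated claims and their categories.
claims = [
    "vaccine causes condition X, experts say otherwise",
    "lockdown dates announced for the spring",
    "miracle cure spreads on social media",
    "new travel restrictions take effect monday",
]
labels = ["health_misinformation", "policy", "health_misinformation", "policy"]

# One binary classifier is fit per category ("one vs the rest").
ova = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
ova.fit(claims, labels)
print(ova.predict(["another claim about a miracle cure"]))
```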

Song, Tianci – Damaged Property Detection With Convolutional Neural Networks (Supervisor: Alexander Ferworn)

Recent studies have shown good results using the VGG network architecture to automatically detect damaged buildings from satellite imagery after natural disasters, including hurricanes and tsunamis. The purpose of this project is to enhance the damaged property detection process using post-hurricane satellite images with different convolutional neural network architectures. The image dataset used in this study covers the affected area in the Greater Houston Area, Texas, before and after Hurricane Harvey in 2017. Two architectures, ResNet and Inception, were used in the project. For each architecture, three configurations were trained with 3-fold cross-validation, and the best configuration was chosen to develop the final model. The results showed that ResNet provided higher predictive power with an accuracy of around 98%, while Inception had a slightly lower accuracy of around 94%. In conclusion, ResNet outperformed Inception, and the configuration with data augmentation and reduced adaptive learning rates yielded an improvement for both architectures.

Thanabalasingam, Mathusan – Using ProtoPNet to Interpret Alzheimer’s Disease Classification over MRI images (Supervisor: Mucahit Cevik)

Alzheimer's Disease is one of the most common diseases today with no cure. It is important to be able to identify the disease in an effective manner, and machine learning has made major strides in doing so. Deep neural networks are difficult to interpret; the inputs and outputs can be understood by humans, but not the actual decision process. This paper considers a relatively new interpretable deep learning architecture called ProtoPNet, which allows a model to make its classifications based on prototypical parts of the input image. The models are trained on the OASIS MRI brain image dataset. The results of this experiment show that ProtoPNet is able to classify images with relatively high accuracy, while also providing a level of interpretability not present in most deep learning models.

Tsang, Leo – Predicting NBA Draft Candidates Using College Statistics (Supervisor: Pawel Pralat)

In this paper, we applied machine learning algorithms to identify potential NBA draft candidates. We began with descriptive analytics to identify trends among drafted NBA players and to build an understanding of today's game and of patterns among recently drafted players. The second part of our analysis handles imbalanced data using Synthetic Minority Over-sampling and random under-sampling. The last part creates strong attributes by feature engineering the existing data and applies XGBoost, Logistic Regression, and a Multi-Layer Perceptron to identify potential draft candidates. Lastly, we discuss the importance of this type of model and how it can be used by the front office of NBA teams.

Uddin, Md Rokon – Demands and Sales Forecasting for Retailers by Analyzing Google Trends and Historical Data (Supervisor: Saman Hassanzadeh Amin)

The objective of this project is to create forecasting models for retailers using Artificial Neural Networks (ANNs) so that they can make business decisions by visualizing future data. Two forecasting models are introduced: a sales model that predicts future sales and a demand model that predicts future demand. To achieve this objective, a CNN-LSTM model is used for both sales and demand prediction, because this hybrid model can learn from a very long range of historical data and predict the future efficiently.
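
A minimal sketch of a CNN-LSTM hybrid in Keras, assuming daily sales have been windowed into supervised samples; the window length, layer sizes, and file name are illustrative assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

window = 28
series = np.load("daily_sales.npy")            # hypothetical daily sales history
X = np.array([series[i:i + window] for i in range(len(series) - window)])[..., np.newaxis]
y = series[window:]

model = keras.Sequential([
    layers.Input(shape=(window, 1)),
    layers.Conv1D(32, kernel_size=3, activation="relu"),  # local sales patterns
    layers.MaxPooling1D(2),
    layers.LSTM(50),                                       # longer-range dependence
    layers.Dense(1),                                       # next-day forecast
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=10, batch_size=32)
```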

Xu, Shaofang – Credit Risk Rating Model Development via Machine Learning (Supervisor: Alexey Rubtsov)

Credit rating is a fundamental piece of credit risk management for financial institutions. Recently, researchers and practitioners at financial institutions have started applying machine learning methods to credit rating problems. Using historical issuer ratings as the target and issuers' information and performance histories as predictive features, the credit rating problem can be solved as a binary or multi-class classification problem under supervised learning. This article adopts four approaches, logistic regression (LR), decision trees (DT), gradient boosting regression trees (GB), and random forests (RF), and proposes a simple framework that utilizes the ordinal characteristics embedded in credit ratings. Many popular binary classification algorithms can also be incorporated into the proposed framework. Empirical results on US listed companies indicate that the decision tree based ensemble algorithms, GB and RF in this article, outperformed the other two approaches as well as the traditional statistical model in all performance measurements, including discriminatory power and rating match-rate.

Yeasmin, Nilufa – A Prediction Model for Chest Radiology Reports and Capturing Uncertainties of Radiograph Using Convolutional Neural Network (Supervisor: Naimul Khan)

Chest radiography is the most common imaging examination worldwide and is critical for the screening, diagnosis, and management of many life-threatening diseases. The purpose of this project is to create a model that automatically predicts the observations recorded in chest radiology reports while capturing the uncertainties inherent in the radiograph, using a Convolutional Neural Network (CNN). The idea is to investigate different approaches to using the uncertainty labels when training convolutional neural networks that output the probability of each observation given the available frontal and lateral radiographs. The trained models take a single-view chest radiograph as input and output the probability of each of the 14 observations. I used DenseNet121 and DenseNet169 for training on the dataset and compared the performance of the different uncertainty approaches on a validation set. The model performed as well as radiologists in detecting different pathologies in chest X-rays.

Zhang, Dongrui – Predicting Exchange Rate of Currency by LSTM Model (Supervisor: Alexey Rubtsov)

Using information technology to forecast international foreign exchange rates can help investors earn greater profits and policy-makers craft better policies. Machine learning algorithms are widely used in the prediction of financial time series. The LSTM (Long Short-Term Memory) neural network, one of the classic models in machine learning, is well suited to capturing long-term dependencies in sequential data. Based on an analysis of CAD/USD exchange rate prediction, this project examines the feasibility of short-term directional forecasting of the exchange rate using an LSTM neural network, how accuracy is affected by different time steps, and how models trained on exchange rate data alone compare with models trained on exchange rate data augmented with macroeconomic features. The results show that the LSTM model performs best at predicting the direction of the exchange rate for the following week, and that adding macroeconomic features yields no obvious improvement.

 

Afsar, Tazin – Chest X-Ray Segmentation Utilizing Convolutional Neural Network (CNN) (Supervisor: Naimul Khan)

The purpose of this project is to analyze the chest X-ray segmentation process using an improved Attention Gate (AG) U-Net architecture. The model suppresses irrelevant regions and highlights the features useful for the target task. It also requires fewer computational resources and adapts automatically to the different sizes and shapes in the target chest X-ray images. Adding AG to U-Net increased the model's sensitivity and accuracy. This experiment is a continuation of existing work from the "RSNA Pneumonia Detection Challenge" [27]. The proposed architecture is evaluated using two well-known chest X-ray datasets: Montgomery County and Shenzhen Hospital. The experimental results show improvements in Dice score and accuracy of 1.0% and 4.0%, respectively, compared with the standard U-Net architecture.

Ahmed, Sayed – Effect of Dietary Patterns on Chronic Kidney Disease (CKD) Measures (ACR), and on the Mortality of CKD Patients (Supervisor: Youcef Derbal)

Chronic Kidney Disease (CKD) leading to End-Stage Renal Disease (ESRD) is very prevalent today; over 37 million Americans have CKD, and CKD/ESRD and interrelated diseases cause a majority of early deaths. Many research studies have investigated the effects of drugs on CKD, but less attention has been given to the effect of dietary patterns. This study uncovered significant correlations between dietary patterns and CKD mortality and examined diagnostic markers for CKD such as the Albumin to Creatinine Ratio (ACR). Dietary surveys from NHANES and the CKD mortality dataset from USRDS were used to study the correlation between dietary patterns and the mortality of CKD patients. Principal Component Analysis and regression were used to estimate these effects, and machine learning approaches including regression and Bayesian methods were applied to predict ACR values. Grains and other vegetables showed positive correlations with mortality, whereas alcohol, sugar, and nuts showed negative correlations. ACR values were not found to be strongly correlated with dietary patterns. For ACR value prediction, 10-fold cross-validation with polynomial regression showed 95% accuracy.

Barolia, Imran – Synonym Detection with Knowledge Bases (Supervisor: Pawel Pralat)

This study presents distributed and pattern-based approaches to identifying similar words in a corpus of tweets, using low-dimensional word embeddings in a vector space model. In the distributed approach, a bilinear scoring function is computed: score(u, v) = x_u W x_v^T, where x_u is the embedding of the candidate source word and x_v is the embedding of a knowledge-base seed. Synonym seeds are taken from an existing knowledge base (WordNet), and additional synonyms that are not present in the knowledge base but are potential synonyms in the given corpus are generated. A term-relevance computation algorithm is also used to identify synonyms that are specific to the corpus. The second approach is pattern-based: a co-occurrence matrix is built to estimate the probability of x_u and x_v occurring within a window of size 10, and low-dimensional embeddings are learned from the conditional probability of x_u given x_v. Results are presented for both approaches, with the best result achieved by combining them. The approaches are evaluated by regenerating known synonyms from the dataset and comparing against the existing knowledge base. Using the distributed and pattern-based approaches with the bilinear scoring function and conditional probabilities, precision and recall were 74% and 55% respectively, which compares favourably with another study that reported 60% precision and lower recall.
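
To make the bilinear scoring function concrete, the sketch below evaluates score(u, v) = x_u W x_v^T on toy embeddings; in the study W would be learned from WordNet seed pairs, whereas here it is simply set to the identity for illustration.

```python
import numpy as np

dim = 50
rng = np.random.default_rng(0)

# Toy embeddings standing in for the learned low-dimensional word vectors.
embeddings = {
    "car": rng.normal(size=dim),
    "automobile": rng.normal(size=dim),
    "banana": rng.normal(size=dim),
}
W = np.eye(dim)   # placeholder for the learned bilinear matrix

def score(u, v):
    """Bilinear synonym score: x_u W x_v^T."""
    return embeddings[u] @ W @ embeddings[v]

for candidate in ["automobile", "banana"]:
    print("car ~", candidate, ":", round(score("car", candidate), 3))
```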

Boland, Daniel – Battery Dispatching for Peak Shaving Using Reinforcement Learning Approaches (Supervisor: Mucahit Cevik)

Economic dispatch of energy resources such as batteries is an important and current problem. We apply three reinforcement learning algorithms, the Monte Carlo on-policy and off-policy algorithms and the DynaQ planning algorithm, to a load-connected battery with time-of-use charges and a demand rate, to study the agent's ability to converge towards a least-cost policy that includes peak shaving. In two simple cases we use a fixed daily load profile, and in a third case we use 31 days of data to reflect uncertainty in the demand. In the simple cases, the Monte Carlo agents converge more quickly and achieve better savings than the DynaQ agent, but all agents typically yield savings of only 40-50% of what is demonstrated to be possible after a 10,000-episode training time. The DynaQ agent significantly outperforms the Monte Carlo agents in the 31-day case, highlighting planning behaviour by reserving some charge and consistently achieving a higher degree of peak load reduction.

Cai, Yutian - Musculoskeletal Disorders Detection With Convolutional Neural Network (Supervisor: Naimul Khan)

Musculoskeletal disorders are a common cause of chronic pain and movement impairment and are diagnosed with medical imaging technologies such as X-rays. Due to the limited supply of skilled radiologists, detection is expensive and time-consuming. In this project, we propose a model using machine learning or neural network techniques to perform the same task as radiologists in detecting abnormalities in musculoskeletal X-rays. Musculoskeletal radiographs (MURA) is a large open-source radiograph image dataset that is used to develop and test our model. It contains labelled images for training and validation, as well as a hidden test set to evaluate the model. Python will be employed in the project as it offers a variety of packages from statistical analysis to data visualization. We hope that the model can distinguish between normal and abnormal X-ray studies and lead to significant advances in medical imaging technologies.

Choi, Claudia – Using Deep Learning and Satellite Imagery to Predict Road Safety (Supervisor: Tony Hernandez)

This paper expands on previous work combining satellite imagery and deep learning to predict road safety. Studies have shown support for the hypothesis that features of the built environment have an impact on city-scale issues and can be observed through satellite imagery. In this paper, a labelled dataset of satellite imagery was generated for the City of Toronto. Class balancing techniques were then used to mitigate model bias, and the best technique was used for the experiments. A Convolutional Neural Network (CNN) was trained for each of overall road accidents, pedestrian accidents, and cyclist accidents. Each CNN followed the ResNet50 architecture pre-trained on ImageNet. The resulting high accuracy scores and low macro F1 scores indicate model sensitivity towards the majority class. The models were able to use observable features of the built environment to predict 'highly safe' regions but showed poor performance on regions labelled as 'highly risky'.
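
As a rough illustration of the transfer learning setup described above, the sketch below builds a ResNet50 backbone pre-trained on ImageNet with a new classification head in Keras; the input size, head layers, and data pipeline are assumptions for illustration, not the Toronto satellite imagery dataset or the exact configuration used in the paper.

```python
# Hypothetical sketch: ResNet50 pre-trained on ImageNet with a new binary head.
# Input size, head layers, and the train_ds/val_ds datasets are placeholders.
import tensorflow as tf
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, models

base = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                                  # freeze the pre-trained backbone

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),              # e.g. 'highly safe' vs 'highly risky'
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)   # hypothetical tf.data datasets
```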

Chowdhury, Md Mushfique – Forecasting Sales and Return Products For Retail Corporation and Bridging Among Them (Supervisor: Saman Hassanzadeh Amin)

The purpose of this work is to show how sales and return forecast data can be bridged for each product of a retail store by using the best model among several forecasting models, and how management can use this information to improve customer satisfaction, inventory management, or after-sales support policies for specific products. Multi-product sales and return forecasting is performed by choosing the best forecasting model for every product. Several forecasting methods were used: ARIMA, Holt-Winters, STLF, the Bagged Model, timetk, and Prophet. For every product, the best forecasting model was chosen after comparing all of these models, and the resulting sales and return forecasts were then used to classify every product as "Profitable", "Risky", or "Neutral". Experiments showed that 3% of all products were identified as "Risky" items for the future. Management can use this information to make crucial decisions. This paper shows how to compare different models and dynamically choose the best one for each product to generate sales and return forecast data, without focusing heavily on optimizing the individual models. This is a new approach to using sales and return forecast data to give management a unique insight for making informed decisions on the crucial aspects identified above.

Ensafi, Yasaman – Neural Network Approach For Seasonal Items Forecasting of a Retail Store (Supervisor: Saman Hassanzadeh Amin)

In recent years, there has been growing interest in the field of neural networks. However, for the task of seasonal time-series forecasting, which has many real-world applications, different studies have reported varied results. In this paper, the performance of neural network methods in seasonal time-series forecasting is compared with other methods. First, classical time-series forecasting methods such as seasonal ARIMA and triple exponential smoothing are used; then, more recent methods are applied, including the recently published Prophet model, Long Short-Term Memory (LSTM) networks, and Convolutional Neural Networks (CNN). The dataset is public and consists of the sales history of a retail store. The performance of the different models is compared using accuracy measures such as RMSE and MAPE. The results show the superiority of the stacked LSTM over the other methods and also indicate the good performance of the Prophet and CNN models.
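
One of the classical baselines mentioned above, seasonal ARIMA, can be fit with statsmodels as in the sketch below; the synthetic monthly series and the chosen (p, d, q)(P, D, Q, s) orders are illustrative assumptions, not the retail sales data or tuned orders from the study.

```python
# Hypothetical sketch: fitting a seasonal ARIMA baseline with statsmodels.
# The synthetic monthly series and the chosen orders are placeholders.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

idx = pd.date_range("2015-01-01", periods=60, freq="MS")
rng = np.random.default_rng(0)
sales = pd.Series(100 + 10 * np.sin(2 * np.pi * np.arange(60) / 12) + rng.normal(0, 2, 60), index=idx)

train, test = sales[:-12], sales[-12:]
model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)
forecast = model.forecast(steps=12)
rmse = float(np.sqrt(((forecast - test) ** 2).mean()))
print(f"RMSE on the last 12 months: {rmse:.2f}")
```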

Etwaroo, Rochelle - A Non-Factoid Question Answering System for Prior Art Search (Supervisor: Morteza Zihayat)

A patent gives the owner of an invention the exclusive rights to make, use, and sell their invention. Before a new patent application is filed, patent lawyers are required to engage in prior art research to determine the likelihood that an invention is novel or valid, or simply to make sense of the domain. To perform this search, existing platforms use keywords and Boolean logic, which disregard the syntax and semantics of natural language, making the search extremely difficult. Studies using neural embeddings to capture semantics exist, but these consider only a narrow, unidirectional context of words. As a solution, we present a framework that considers bidirectional semantics, syntax, and the thematic nature of natural language. The contribution of this paper is two-fold: a BERT pre-trained embedding is used to address the semantics and syntax of language, followed by a second component that uses topic modelling to return a diverse combination of answers covering all themes across domains.

Hosmani, Chaitra – User Interest Detection in Social Media Using Dynamic Link Prediction (Supervisor: Ebrahim Bagheri)

Social media provides a platform for users to interact freely and share their opinions and ideas. Several studies have been conducted to predict user interests in social media, but because of the dynamic nature of social media, user interests change over time. In this paper, given a set of emerging topics and a user's interest profile over these topics, we aim to predict the user's interest profile in the future. We conducted this experiment on Twitter data captured over two months, from 1 November 2011 to 1 January 2012. We use a temporal latent space to infer characteristics of users and then predict users' future interests over the given topics. We evaluate the results with ranking metrics such as MAP and nDCG, and compare our results with those of Zhu et al.'s temporal latent space model, which uses the same methodology on a different dataset.

House-Senapati, Kristie - The Use of Recommender Systems for Defect Rediscoveries (Supervisor: Andriy Miranskyy)

Software defects are a known issue in the world of technological advancement. Software defects lead to the disruption of services for a customer, which in turn results in customer dissatisfaction. It is not feasible for all customers to install a fix for every known defect, as this requires extra resources. Our goal is to predict which future defects a customer may discover, so that a fix can be put into place before the customer discovers that defect. We use recommender systems to build a predictive model and evaluate our approach with publicly available datasets mined from Bugzilla (Apache, Eclipse, and KDE). The datasets contain information about approximately 914,000 defects over a period of 18 years. From our experiments, we find that the "popular" algorithm performs best, with an average Matthews Correlation Coefficient of 0.051. We also observe that the Funk SVD, apriori, eclat, and random algorithms perform poorly.


Husna, Asma - Demand Forecasting in Supply Chain Management Using Different Deep Learning Methods (Supervisor: Saman Hassanzadeh Amin)

Supply Chain Management (SCM) is a fast-growing and widely studied field of research that is gaining in popularity and importance. Most organizations focus on cost optimization and maintaining optimum inventory levels for consumer satisfaction, and machine learning techniques can help companies with both. The main goal of this paper is to forecast the unit sales of thousands of items sold at different chain stores located in Ecuador. Three deep learning approaches, the Artificial Neural Network (ANN), the Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM), are adopted here for better predictions on the Corporación Favorita Grocery Sales Forecasting dataset collected from the Kaggle website. Finally, the performances are evaluated and compared. The results show that the LSTM network tends to outperform the other two approaches. All experiments have been conducted in Python using the Keras and TensorFlow deep learning libraries.

Islam, Samiul – Product Backorder Prediction Using Machine Learning Techniques to Minimize Revenue Loss With Efficient Inventory Control (Supervisor: Saman Hassanzadeh Amin)

Predicting product backorders can boost a company's revenue in many ways. In this work, we predicted product backorders using two machine learning models, Distributed Random Forest (DRF) and Gradient Boosting Machine (GBM), on the H2O platform and compared their performance. We observed that the GBM successfully identified approximately 94 out of every 100 products that go on backorder. We noticed that the current stock level and the lead time of products are the deciding factors for backorders in approximately 45% of cases. We show how this model can be used to predict probable backorder products before an actual backorder happens and to visualize the impact on inventory management. Moreover, we identified that a decision threshold below 0.3 for high-probability backorder products, and a threshold between 0.2 and 0.8 for low-probability backorder products, maximizes organizational profit.
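
A minimal sketch of training a Gradient Boosting Machine on the H2O platform is shown below; the file path, column names, and hyperparameters are hypothetical placeholders, not the actual backorder dataset or the tuned settings from the project.

```python
# Hypothetical sketch: H2O GBM for a binary backorder label.
# The CSV path, 'went_on_backorder' column name, and hyperparameters are placeholders.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
df = h2o.import_file("backorders.csv")                 # hypothetical file
df["went_on_backorder"] = df["went_on_backorder"].asfactor()
train, valid = df.split_frame(ratios=[0.8], seed=42)

features = [c for c in df.columns if c != "went_on_backorder"]
gbm = H2OGradientBoostingEstimator(ntrees=200, max_depth=6, learn_rate=0.05, seed=42)
gbm.train(x=features, y="went_on_backorder", training_frame=train, validation_frame=valid)
print(gbm.auc(valid=True))
```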

Lee, Veson – Estimating Volatility Using A LSTM Hybrid Neural Network (Supervisor: Alexey Rubtsov)

Volatility estimates of market-traded financial instruments are used in risk management models and portfolio selection. Hybrid neural networks combine a traditional parametric model such as GJR-GARCH with a neural network component and have been shown to improve volatility predictions. This paper examines hybrid neural networks incorporating two different neural architectures, one with an LSTM, without exogenous explanatory variables, and measures their performance using data from the Toronto Stock Exchange and the S&P 500. We found that, in a neural network without exogenous explanatory variables, hybridizing the network by incorporating the parameters from a GJR-GARCH(1,1,1) model does appear to offer some benefit.
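
The parametric component of such a hybrid, a GJR-GARCH model, can be fit with the Python arch package as in the sketch below; the simulated return series is a placeholder for the TSX and S&P 500 returns used in the paper, and feeding the fitted conditional volatility to the neural component is only indicated in a comment.

```python
# Hypothetical sketch: fitting GJR-GARCH(1,1,1) with the 'arch' package.
# In a hybrid network, the fitted parameters / conditional volatility would be
# passed to the LSTM as additional inputs. Returns here are simulated placeholders.
import numpy as np
import pandas as pd
from arch import arch_model

rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0, 1, 2000))            # placeholder daily % returns

# p=1, o=1, q=1 gives the asymmetric GJR-GARCH(1,1,1) specification
am = arch_model(returns, vol="GARCH", p=1, o=1, q=1, dist="normal")
res = am.fit(disp="off")
cond_vol = res.conditional_volatility                   # series to hybridize with the LSTM
print(res.params)
```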

Matta, Rafik - Deep Learning to Enhance Momentum Trading Strategies using LSTM on the Canadian Stock Market (Supervisor: Alexey Rubtsov)

Applying machine learning techniques to historical stock market data has recently gained traction, with most work focusing on the American stock market. We add to the literature by applying similar methods to the Canadian stock market, focusing on time-series analysis for basic momentum as a starting point. We apply long short-term memory networks (LSTMs), a type of recurrent neural network, and compare the results of the LSTM to a logistic regression (LOG) approach as well as a basic momentum strategy for portfolio formation. Our results show that the LSTM financially outperforms both the LOG and the basic momentum strategy; however, the area under the receiver operating characteristic curve shows that the results do not outperform a random selection. We conclude that there might not be enough data in monthly returns for the LSTM in its current configuration.

Natarajan, Rajaram - Road Networks – Intersections and Traffic (Supervisor: Pawel Pralat)

This MRP investigates traffic and congestion in a road network. The goal is to identify the most critical intersections and to look for ways to improve traffic flow as well as factors that make traffic worse. I build a simulation model of the city road network and its simulated traffic, show which areas have the highest and lowest congestion based on the simulation, and identify which areas have the most efficient traffic flows. I also evaluate the impact on traffic when a critical node is brought down, both on high-congestion areas and on the overall network, and discuss the connections and relationships between the most critical nodes. Based on this research, recommendations are provided on changes to the road network, including new road or bridge construction, that would reduce traffic and congestion. The Ottawa-Gatineau dataset and the Julia language are used.

Ozyegen, Ozan – Experimental Results on the Impact of Memory in Neural Networks for Spectrum Prediction in Land Mobile Radio Bands (Supervisor: Ayse Bener)

Land-mobile radio (LMR) systems support many vital communication functions for government and private operators, some related to public safety and mission-critical operations. The models produced will help in understanding usage patterns at different time periods to predict occupancy and demand across channels in the spectrum. CRC (Canadian Research Corporation) is providing Layer 3 data sampled every hour, which is further explored and processed in this MRP. Long Short-Term Memory (LSTM) networks, a powerful learning algorithm, are used to predict the occupancy of LMR bands over multiple time horizons. The results are compared with a seasonal ARIMA model and a Time Delay Neural Network. Results show that LSTM prediction models, which remember long-term dependencies and are thus designed for time-series data, provide a better alternative for accurately predicting spectrum occupancy in bands that exhibit characteristics similar to LMR channels, especially as the forecast horizon gets longer.

Patel, Eisha – Generating Stylistic Images Using Convolutional Neural Networks (Supervisor: Sridhar Krishnan)

Fine art has long been considered a mastery reserved for a talented minority in society. The ability to create paintings using unique visual components such as colour, stroke, and theme is currently beyond the reach of computer algorithms. However, there exist algorithms capable of imitating an artist's painting style and stamping it onto virtually any image to create a one-of-a-kind piece. This paper introduces the concept of using a convolutional neural network (ConvNet or CNN) to separate and recombine the style and content of arbitrary images to generate perceptually striking "art" [2]. Given a content and a style image as reference, a pre-trained VGG-16 ConvNet can extract feature maps from various layers; these feature maps hold semantic information about both reference images. Loss functions can be defined for content and style by minimizing the mean squared error between the feature maps used, and these loss functions can be additively combined and optimized to render a stylistic image [6]. This technique is called Neural Style Transfer (NST) and was originally developed by Leon Gatys in his 2015 research paper, "A Neural Algorithm of Artistic Style". My MRP research attempts to replicate and improve upon the work done by Leon Gatys. The purpose of this research is to experiment with a variety of feature maps and to tweak the loss function to identify visually appealing results. A total variation loss factor is also included to minimize pixelation and sharpen feature formation. The generated images have been assigned a Mean Opinion Score (MOS) by a group of unbiased individuals to affirm the attractiveness of the results.
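
A rough sketch of the content and style losses described above is given below, using VGG16 feature maps in Keras; the layer choices and loss weights are illustrative assumptions, and the optimization loop over the generated image (and the total variation term) is omitted for brevity.

```python
# Hypothetical sketch: content and style losses for Neural Style Transfer using
# VGG16 feature maps. Layer names and loss weights are illustrative choices;
# the optimization loop over the generated image is omitted.
import tensorflow as tf
from tensorflow.keras.applications import VGG16

vgg = VGG16(weights="imagenet", include_top=False)
vgg.trainable = False
content_layer, style_layer = "block4_conv2", "block1_conv1"
extractor = tf.keras.Model(vgg.input, [vgg.get_layer(content_layer).output,
                                       vgg.get_layer(style_layer).output])

def gram_matrix(feats):
    # Style is captured by correlations between feature channels.
    f = tf.reshape(feats, (-1, feats.shape[-1]))
    return tf.matmul(f, f, transpose_a=True) / tf.cast(tf.shape(f)[0], tf.float32)

def nst_loss(generated, content_img, style_img, alpha=1.0, beta=1e3):
    gen_c, gen_s = extractor(generated)
    ref_c, _ = extractor(content_img)
    _, ref_s = extractor(style_img)
    content_loss = tf.reduce_mean(tf.square(gen_c - ref_c))
    style_loss = tf.reduce_mean(tf.square(gram_matrix(gen_s) - gram_matrix(ref_s)))
    return alpha * content_loss + beta * style_loss
```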

Peachey Higdon, Ben – Time-Series-Based Classification of Financial Forecasting Discrepancies (Supervisor: Ayse Bener)

We aim to classify financial discrepancies between actual and forecasted performance into categories of commentaries that an analyst would write when describing the variation. We propose analyzing the financial time series leading up to the discrepancy in order to perform the classification, and we investigate which models are best suited to this problem. Two simple time-series classification algorithms, 1-nearest neighbour with dynamic time warping (1-NN DTW) and time series forest, as well as long short-term memory (LSTM) networks, are compared to common machine learning algorithms. We perform our experiments for two cases: binary and multiclass classification. We examine the effect of including supporting datasets such as customer sales data and inventory, and we also consider augmenting the data with noise as an alternative to random oversampling. The LSTM and 1-NN DTW models are found to be the strongest, suggesting that the time-series approach is appropriate. Including the inventory dataset improves multiclass classification, and data augmentation grants a slight improvement for some models over random oversampling.
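
One of the simple baselines above, 1-nearest neighbour with dynamic time warping, can be written in a few lines of NumPy as in this sketch; the series and labels are synthetic placeholders rather than the financial time series and commentary categories used in the study.

```python
# Hypothetical sketch: 1-nearest-neighbour classification with dynamic time warping.
# The training and query series (and their labels) are synthetic placeholders.
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) DTW with a full cost matrix."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def predict_1nn_dtw(train_series, train_labels, query):
    dists = [dtw_distance(query, s) for s in train_series]
    return train_labels[int(np.argmin(dists))]

# Toy usage with placeholder data
train_series = [np.sin(np.linspace(0, 6, 50)), np.linspace(-1, 1, 50)]
train_labels = ["seasonal_miss", "trend_miss"]
print(predict_1nn_dtw(train_series, train_labels, np.cos(np.linspace(0, 6, 50))))
```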

Postma, Cassandra - Netflix Movie Recommendation Using Hybrid Learning Algorithms and Link Prediction (Supervisor: Pawel Pralat)

Netflix, a streaming service that allows customers to watch a wide variety of movies, is constantly updating and optimizing its search and recommendation algorithms to improve user experience. The aim of this paper is to recommend movies to users using different link prediction methods and to predict a user's movie rating by comparing various learning algorithms. First, an exploratory analysis was conducted to find any correlations between variables and users. Then several algorithms, such as KNN, SVM, and hybrid learning algorithms, were used to predict a user's movie rating. Finally, the data was represented as a graph and several link prediction algorithms were run to compare different recommender systems.

Ragbeer, Julien – Peak Tracker (Supervisor: Bilal Farooq)

The IESO is the Crown corporation responsible for operating the electricity market in the province of Ontario, Canada. The IESO publishes (ever-changing) forecasts of what it expects Ontario electrical demand to be in the near future. In this paper, we focus on short-term time-series forecasting (within 24 hours). The solution aims to forecast better than the IESO, so that large commercial customers can be more confident about upcoming demand and about when to shave power if they are Class A customers. The solution combines and aggregates multiple data sources (weather forecast data, historical weather data, and historical demand data). The project uses numerous regressors (both linear and non-linear) on the aggregated data to produce a prediction, which is compared to the IESO's forecast using three metrics: the coefficient of determination, the mean absolute error, and the number of times it correctly predicts the hour of the highest daily demand. The results of this paper (10%-40% more accurate than the IESO in some cases) show that there is value in out-predicting the IESO's free forecast, since being more accurate can have a positive effect on the bottom line.
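
The three evaluation metrics mentioned above can be computed as in the following sketch; the forecast and actual demand series are synthetic placeholders, not the project's aggregated dataset or the IESO's published forecasts.

```python
# Hypothetical sketch: scoring a demand forecast on the three metrics named above
# (R^2, MAE, and the peak-hour hit rate). All series are placeholder data.
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score, mean_absolute_error

idx = pd.date_range("2019-07-01", periods=24 * 7, freq="h")
rng = np.random.default_rng(0)
actual = pd.Series(15000 + 3000 * np.sin(2 * np.pi * idx.hour / 24) + rng.normal(0, 300, len(idx)), index=idx)
forecast = actual + rng.normal(0, 400, len(idx))        # stand-in for the model's forecast

r2 = r2_score(actual, forecast)
mae = mean_absolute_error(actual, forecast)
# Peak-hour hit rate: fraction of days where the forecast's busiest hour matches the actual one
peak_hits = (forecast.groupby(idx.date).idxmax().dt.hour == actual.groupby(idx.date).idxmax().dt.hour).mean()
print(f"R^2={r2:.3f}  MAE={mae:.0f} MW  peak-hour hit rate={peak_hits:.2f}")
```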

Raja, Abdur Rehman – Rating Prediction of Movielens Dataset (Supervisor: Tony Hernandez)

Convenience has become one of the biggest factors in modern life. Given the overwhelming number of choices each consumer faces, there is a need to filter, prioritize, and efficiently deliver recommendations to consumers. This project looks at one of the most famous datasets provided by GroupLens for research purposes, MovieLens 20M. GroupLens is a research lab that aims to advance the theory and practice of social computing, and it has collected and made available rating datasets from the MovieLens website, a free movie recommendation service. The project looks for the best solution for predicting movie ratings to be used in a recommender system. One of the main algorithms discussed in detail is BellKor's solution, the algorithm used by the winner of the Netflix Prize competition to predict movie ratings. BellKor's solution is compared with other algorithms to find the one best suited to this dataset.

Roginsky, Sophie – Radio Coverage Prediction in Urban Environment Using Regression-Based Machine Learning Models (Supervisor: Ayse Bener)

Having a reliable predictive model of radio signal strength is an essential tool for planning and designing a radio network. The propagation model is often used to determine the optimal location of radio transmitters in order to optimize power coverage in a geographic area of interest. This research proposes a Generalized Linear Model (GLM) for radio signal strength prediction. Using feature engineering methods, the performance of the linear model was optimized to offer predictive accuracy comparable to the more complex regression models found in the existing literature, such as Multi-Layer Perceptrons and Support Vector Regressors. Beyond computational efficiency, the advantage of the GLM is that it is linear in its parameters, making it a viable option for coverage optimization applications.

Saeed, Usman - Digital Text Forensic: Bots and Gender Classification on Twitter Data (Supervisor: Farid Shirazi)

This research describes the contribution of the Data Science department of Toronto Metropolitan University, Canada, to the bots and gender profiling task at the CLEF PAN-19 evaluation lab. The goal of this paper is to detect (A) whether the author of a tweet is a bot or a human and, (B) if human, the gender of that author. The dataset was made available by the PAN lab, and we participated in the English-language dataset only. In the proposed approach, before applying machine learning models, we used different word vectorization techniques after applying various preprocessing steps (stemming, stop word removal, lowercasing, etc.) to the dataset. In the independent evaluation on the PAN lab test dataset, we achieved best accuracies of 79.51 on task A (binary) using Multinomial Naive Bayes and 56.55 on task B (multi-class) using a Decision Tree classifier.

Sokalska, Iwona - Boosting Bug Localization with Visual Input and Self-Attention (Supervisor: Andriy Miranskyy)

Deep Learning (DL) methods have been shown to achieve higher Mean Reciprocal Rank (MRR) scores in bug localization than Information Retrieval (IR) methods alone; a combination of both can boost scores by 6%, to an MRR of 48%. The DL model consists of a recurrent neural network (RNN). In natural language research, it has been demonstrated that RNNs with visual input and an attention mechanism are more robust at tasks that require incorporating distant information. The objective is to examine whether an RNN with an attention mechanism that uses images of code snippets can achieve higher scores than an RNN alone, and whether the improvement is in a similar range to the improvement between a standalone RNN and RNN + IR. Using data gathered from the open-source Spring-Boot project, covering 2013-2018, a baseline RNN model was compared to an enhanced RNN model with a supporting convolutional neural network that analyses images of the source code. A 5-fold experiment was conducted to compare the baseline model with the two test models, which differ only in the use of self-attention in the convolutional branch. The test model with self-attention had the highest mean accuracy across the 5 folds, 61.98 compared to 60.70 for the baseline model. The two-tailed Welch t-test reveals that this difference between the means is not statistically significant. In contrast, the IR methods on average provided a 6% boost to the scores.

Tabassum, Anika – Developing a Confidence Measure Based Evaluation Metric for Breast Cancer Screening Using Bayesian Neural Networks (Supervisor: Naimul Khan)

Screening mammography is the gold standard for detecting breast cancer early. While a good amount of work has been performed on mammography image classification, and many recent studies have successfully used deep neural networks, there has not been much exploration of the confidence or uncertainty of the classification, especially with Bayesian neural networks. In this paper, we propose a new evaluation criterion based on confidence measurement for breast cancer mammography image classification, so that in addition to classification accuracy, it provides a few numeric parameters that can be tuned to adjust the confidence level of the classification. We demonstrate the use of Bayesian neural networks and transfer learning in achieving this, demonstrate the expected behaviour resulting from tuning the parameters, and conclude that the approach is extendable to any domain and any number of classes.

Zhang, Shulin – Artificial Neural Networks in Modelling the Term Structure of Interest Rates (Supervisor: Alexey Rubtsov)

In this paper, we applied Artificial Neural Networks (ANNs) to model the term structure of interest rates. In the exploratory analysis, we observed the trend of the yield curve since 1991 to understand the underlying pattern. Principal Component Analysis (PCA) was employed to construct the input dataset and to serve as a baseline model. We used different hyper-parameters, a customized loss function, and regularization to tune the ANN model. The results section discusses the selection of the best model and the prediction differences between PCA and the ANN. The ANN can match PCA results only in the limited case of strong regularization; it has the potential to replace PCA, but careful design is needed. This project is a continuation of an existing study that Dr. Alexey Rubtsov started for the Global Risk Institute, and the predictive analysis is used to provide more insight into financial applications of ANNs.
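
A minimal sketch of the PCA baseline described above is shown below using scikit-learn; the simulated matrix of yield curves (dates by maturities) is a placeholder for the historical curves since 1991 used in the paper.

```python
# Hypothetical sketch: PCA baseline for the term structure of interest rates.
# The simulated yield matrix (dates x maturities) is a placeholder for real data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
maturities = np.array([0.25, 0.5, 1, 2, 5, 10, 30])
level = rng.normal(3, 1, size=(1000, 1))
slope = rng.normal(0, 0.5, size=(1000, 1))
yields = level + slope * np.log(maturities) + rng.normal(0, 0.05, size=(1000, len(maturities)))

pca = PCA(n_components=3)
factors = pca.fit_transform(yields)                      # level / slope / curvature factors
reconstructed = pca.inverse_transform(factors)           # PCA reconstruction of the curves
print(pca.explained_variance_ratio_)
```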

Zhao, Xin – Station Based Bike Sharing Demand Prediction (Supervisor: Bilal Farooq)

Bike sharing has been increasing in popularity in recent years due to its usage flexibility and its reduction of traffic congestion and carbon footprint. Being able to accurately predict each bike-sharing station's demand at any given hour is crucial for inventory management. This report first combined Bike Share Toronto ridership data with Toronto City Centre weather data from 2016 Quarter 4 to 2017 Quarter 4, and then implemented machine learning algorithms, in particular regression trees, Random Forest, and the Gradient Boosting Machine (GBM), to forecast station-based hourly bike-sharing demand in the City of Toronto. The results indicated that the Random Forest based prediction model was the most accurate when comparing the Root Mean Square Error (RMSE) across all bike-sharing stations.

Amadou, Angelina – Geospatial Simulation and Modelling of Out-Of-Home Advertising Viewing Opportunity (Supervisor: Pawel Pralat)

Companies use out-of-home (OOH) advertising to promote their products. The purpose of this project is to build an integrated multi-source simulation model that allows Environics Analytics to establish optimal locations for marketing campaigns. The study area is located in the province of Manitoba in Canada and concerns mainly the Winnipeg Metropolitan Area, which includes the city of Winnipeg and its surrounding municipalities. A simulation algorithm is developed using Dijkstra's algorithm for finding shortest paths, and the top ten busiest intersections are retrieved and used as recommended locations for OOH advertising. Additionally, a Wilcoxon signed-rank test is used to validate the simulation output against empirical data; in general, there is no statistically significant difference between the simulated data and the empirical set. The study has shown that multi-agent-based models, although in their infancy, represent a viable approach to modelling population dynamics. Results from the simulation can be used to develop a new model that may include demographic profiles of the population for further studies.
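
A simplified sketch of the shortest-path step in such a simulation, using NetworkX's Dijkstra routines on a toy weighted road graph, is shown below; the graph, trips, and edge weights are placeholders, not the Winnipeg road network or the Environics Analytics data.

```python
# Hypothetical sketch: routing simulated trips with Dijkstra shortest paths and
# counting how often each intersection is traversed. The toy graph and trips are
# placeholders for the Winnipeg road network and the simulated population.
from collections import Counter
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([                      # (intersection, intersection, travel time)
    ("A", "B", 2.0), ("B", "C", 1.5), ("A", "D", 4.0),
    ("D", "C", 1.0), ("C", "E", 2.5), ("B", "E", 3.0),
])
trips = [("A", "E"), ("D", "B"), ("A", "C")]     # simulated origin-destination pairs

visit_counts = Counter()
for origin, dest in trips:
    path = nx.dijkstra_path(G, origin, dest, weight="weight")
    visit_counts.update(path)

busiest = visit_counts.most_common(10)           # candidate OOH advertising locations
print(busiest)
```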

Arabi, Aliasghar – Text Classification Using Deep Learning in Reddit Reply/Comments (Supervisor: Anatoliy Gruzd)

In this paper, I implemented several deep learning models to automatically classify posts from the "askhistorians" subreddit into defined classes using pre-trained word embedding vectors. The training data is taken from research done at the Toronto Metropolitan University Social Media Lab. I used a one-vs-rest (OvR) strategy to train a separate model for each of the eight classes. The Keras library in Python is used to develop the deep learning models, starting from individual architectures such as CNN and LSTM and then combining them into more complex versions such as CNN+LSTM and LSTM+CNN. When compared with previous work using traditional models and n-grams as features, improvement in accuracy, recall, and precision is observed. However, considering all evaluation metrics, the stability and range of results across iterations and folds, and run time, the best model was found to be the CNN for all categories.

Arjumand, Isra – Stock Market Prediction Using Machine Learning (Supervisor: Kaamran Raahemifar)

This project focuses on efficiently predicting Apple Inc. (AAPL) stock price movements to support effective investment decisions, generating trading signals, comparing the SVM, KNN, and RF machine learning algorithms, and comparing the profits they produce. Research shows that using machine learning algorithms together with technical analysis gives good results. Technical analysis was applied to the data to generate trading signals, and the algorithms were trained on them to predict future stock trends. By applying trading rules, decision points (buy, sell, and hold) are generated. SVM performed better in experiment 1 and RF was more efficient in experiment 2. Performance was evaluated using profit percentage, and adding more technical indicators improved the profit percentage. In conclusion, better profits are generated when technical indicators are used along with machine learning techniques than with technical indicators alone.

Beqaj, Inela – Diabetic Retinopathy Detection Using Convolutional Neural Networks (Supervisor: Naimul Khan)

Machine learning techniques are becoming more and more helpful in many areas of everyday life such as education and healthcare. One of the main applications of these techniques in healthcare is computer-aided diagnosis: systems that assist doctors in the interpretation of medical images. This project focuses on the analysis of retinal images to identify Diabetic Retinopathy, an eye disorder that is the leading cause of blindness among people diagnosed with diabetes. Both supervised and semi-supervised techniques are used to classify the images. The two convolutional neural network architectures applied in supervised learning are VGG16 and DenseNet121, while the architecture used in the semi-supervised setting is an Adversarial Autoencoder. The semi-supervised techniques achieve the same accuracy as the supervised ones but are more efficient, because they achieve that accuracy using only 10 percent of the labelled data.

Chowdhury, Kakoli – Binary Classification on Clustered Data (Supervisor: Ayse Bener)

Land-mobile radio (LMR) systems support many vital communication functions for government and private operators, some related to public safety and mission-critical operations. The models produced will help in understanding usage patterns at different time periods to predict occupancy and demand across channels in the spectrum. CRC (Canadian Research Corporation) is providing Layer 1 data sampled every three milliseconds, which is further explored and processed in this MRP. Subsetting of the data is conducted based on clustering and descriptive statistical analysis designed to differentiate between channels exhibiting different occupancy percentage patterns. Applying algorithms to the clustered data is expected to reveal distinct behaviours that are then used to find the best prediction model for spectrum availability.

Fadel, Fady – Organizing Web Search Results Using Best Clusters Separation (Supervisor: Ozgur Turetken)

In this paper, we applied a novel idea of using machine learning techniques to automatically organize web search results from search engine queries. Using text mining analytics, we first conducted an analysis to identify features that can be used for clustering. The second part of the analysis was an evaluation of the best cluster separation method and a performance comparison of the selected features across different clustering algorithms.

Gupta, Vasudev – Predicting Gold Prices Using Neural Networks (Supervisor: Tony Hernandez)

The aim of this study is to predict gold futures prices using neural networks. Gold prices change rapidly in real time across the globe, making price prediction interesting and challenging; it stresses machine learning algorithms and technology and is therefore a good test case. A North American perspective on gold price prediction was used within this study. Gold is used as an investment vehicle by a large number of investors across the world, and successful predictions can be very helpful. Five input variables were used to predict the price: the silver futures price, the copper futures price, the Dow Jones Industrial Average, the US Dollar Index, and the VIX volatility index. Two types of neural network models were used to predict gold prices: a Feedforward Neural Network (FNN) and a Recurrent Neural Network with Long Short-Term Memory (RNN-LSTM). Different variations of the training data (weekly/daily, short/long term) were also tried, and experimentation was undertaken with USD/CNH (the US dollar to Chinese renminbi exchange rate) as an additional input variable.

He, Xin – Movie Recommender System: Using Ratings and Reviews (Supervisor: Cherie Ding)

Because of information overload, it is becoming increasingly difficult for users to find the content that they are interested in. Usually, the actual ratings are used to implement a Recommender model. Currently, many item evaluation systems not only have the ratings but also the reviews. In this report, we mainly describe how to use both ratings and reviews to implement a recommender model. Additionally, the project investigates the relationship between the ratings and reviews.

Hyder, Md Khaled – Sentiment Analysis of Twitter Data For Top Canadian Retailers (Supervisor: Tony Hernandez)

The competition among retail companies is visible in all communication channels. Most companies are now focusing on social media marketing to reach a vast consumer base, and in parallel with aggressive communication, retailers also want to measure their own and their competitors' performance.
This project aims to measure the performance of retail companies in Canada by analyzing user sentiment from tweets, building a machine learning model that can predict sentiment with high accuracy, and conducting exploratory analysis to find user engagement and other hidden patterns.
286,668 tweets were collected for the top five retailers. After processing and cleaning the dataset, an exploratory analysis was conducted to find hidden patterns, and a sentiment classifier was developed using five algorithms, each tested with two vectorizers.
Among the five retailers, Sobeys has the highest positive sentiment score. Initially, Linear SVM with a count vectorizer produced the highest accuracy; then random oversampling with a TF-IDF vectorizer produced high and balanced precision and recall values. This solution will help retailers compare their performance with competitors.

Jain, Sachin – Binary Classification Prediction on Time Aggregated Data (Supervisor: Ayse Bener)

The main objective of this Major Research Project (MRP) is to find the effect of time resolution on the prediction of channel occupancy for Land Mobile Radio channels, to facilitate dynamic spectrum allocation that increases overall spectrum efficiency. This project is a collaboration between the Canadian Research Corporation and the Data Science Lab at TMU.
Layer 1 data measures the occupancy percentage of more than 7000 channels approximately every three milliseconds. This MRP specifically looks at generating aggregated datasets from the Layer 1 data to predict channel occupancy. Predictive classification is then conducted on these datasets using Naïve Bayes and Logistic Regression algorithms. The ultimate goal of this project is to build the spectrum occupancy prediction model that works best under the given conditions.


Jandu, Arshnoor – Neural Style Transfer With Image Super Resolution (Supervisor: Naimul Khan)

In fine art, humans have mastered the skill of creating unique visual representations by combining the content and style of an image. However, rendering the semantic content of an image in different styles is a difficult image processing task. The recent success of deep learning in computer vision has demonstrated the power of creating imagery by separating and recombining image content and style, a technique called Neural Style Transfer (NST). Several online and offline optimization methods have been proposed that produce new images of high perceptual quality. However, these existing methods do not offer the flexibility to create high-resolution upscaled images. In this project, I have implemented deep neural networks for Neural Style Transfer and Single Image Super Resolution, with which users can transform photos into desired paintings and further upscale them to high resolution. This project also demonstrates experimentation with several NST parameters to create striking photo effects.


Kashyap, Askhat – Stock Price Movement Prediction Using Social Media (Reddit) Analysis (Supervisor: Kaamran Raahemifar)

In this paper, we applied different machine learning techniques to predict stock price movement based on metrics derived from posts in the "economy" subreddit. As part of the exploratory data analysis, I tried to identify patterns in stock price movement, performed data cleanup on the reddit posts, and identified important topics discussed in them. We categorized the stock market data into 3 classes: positive, negative, and steady. We marked data as positive or negative if the market direction was upward or downward by more than a certain threshold (above +/- 1%); otherwise we marked it as 'steady'. We considered volume changes while calculating this percentage.
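
The labelling rule described above can be expressed in a few lines of pandas, as in this sketch; the closing-price series is a synthetic placeholder for the actual market data, and the volume adjustment mentioned above is omitted.

```python
# Hypothetical sketch: labelling daily market moves as positive / negative / steady
# using a +/- 1% threshold on the close-to-close return. Prices are placeholder data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
close = pd.Series(100 * np.cumprod(1 + rng.normal(0, 0.012, 250)))   # synthetic closing prices

ret = close.pct_change()
labels = pd.cut(ret, bins=[-np.inf, -0.01, 0.01, np.inf],
                labels=["negative", "steady", "positive"])
print(labels.value_counts())
```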

Khan, Ghazala – 6-Month Infant Brain MRI Segmentation With Convolutional Neural Network (Supervisor: Naimul Khan)

Brain MRI segmentation and analysis is one of the most important initial steps in measuring the brain's anatomical structure and visualizing changes and developments in the brain. The early stage of brain development is "critical in many neurodevelopmental and neuropsychiatric disorders, such as Schizophrenia and autism." These abnormalities and disorders are detectable in early infancy, and early interventions can protect a life at risk.
To investigate this problem, this paper proposes two models, 2D Conv and 3D FCNN, for segmenting the brain MRI tissue of 6-month-old infants into GM, WM, and CSF using multi-modality T1- and T2-weighted images from the MICCAI iSeg2017 grand challenge. The architecture of 2D Conv was inspired by the VGG model, with modifications, while the architecture of the 3D Fully Convolutional Neural Network was inspired by recent work on infant brain MRI segmentation.
The quantitative evaluation of the 3D FCNN showed substantial advantages of the proposed method in terms of tissue segmentation accuracy with an efficient use of parameters. The 3D FCNN showed performance comparable to the 21 state-of-the-art international teams of the iSeg2017 challenge and achieved a DSC score of 93%.

Luo, Jiefan – Twitter Bots Detection Utilizing Multiple Machine Learning Algorithms (Supervisor: Anatoliy Gruzd)

The purpose of this paper is to apply multiple machine learning algorithms to develop bot-detection models for Twitter. Using exploratory analysis, I examined the Twitter metadata and found useful behavioural features for distinguishing between normal users and bots. For the training models, I found optimal hyperparameters to tune the different models. I applied five algorithms, Naive Bayes, Decision Tree, Random Forest, Linear Support Vector Machine (SVM), and Radial Basis Function SVM, to classify bots and humans. The output of the classification is the account identity, and I measured classification performance using accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC). The results show that the Random Forest algorithm was the most effective at detecting bots and identifying normal users.

Najlis, Bernardo – Applications of Deep Learning and Parallel Processing Frameworks in Data Matching (Supervisor: Kaamran Raahemifar)

Most data science research assumes a clean, deduplicated dataset as a precondition. In reality, 80% of the time spent in data science work is dedicated to data deduplication, cleanup, and wrangling. Few research papers focus on data preparation and quality, even though it is one of the major obstacles to applying data science. The subject of this paper is improving data matching techniques on multiple datasets containing duplicate data using parallel programming and deep neural networks. Parallel programming frameworks (such as MapReduce, Apache Spark, and Apache Beam) can dramatically increase the performance of computing pairwise comparisons to find potential duplicate record matches, given the O(n²) complexity of the problem. Deep neural networks have shown great results in improving accuracy in many traditional machine learning applications. The problem and solution researched are of general applicability to multiple data domains (healthcare, business).

Ong, Liza Robee – Predicting Depression Using Personality and Social Network Data (Supervisor: Morteza Zihayat)

Over 300 million people worldwide suffer from depression. With the advent of social networks, our goal is to apply a novel approach to identifying depression by investigating the relationships between an individual's social network information and speech features, their personality, and their depression level. The study was conducted using a publicly available dataset called myPersonality, which contains more than 6,000,000 test results and over 4,000,000 individual Facebook profiles. From the dataset, we used depression risk and personality assessment scores, Facebook network measures, and linguistic measures. We created a classifier to extract a feature that indicates the speech act of a status update. We applied several machine learning methods and feature sets to predict depression risk based on personality type, speech acts, and network influence. Our results show that the best predictors included the personality dimension scores for neuroticism, conscientiousness, and extraversion, and the usage scores for the assertive and expressive speech acts.

Rafayal, Syed – Tucker 2 Tensor Decomposition Model Implementation on Visual Dataset Using Tensor Factorization Toolbox (Supervisor: Ayse Bener)

The main goal of this paper is to recognize and classify images using the Tucker2 decomposition technique. In the first part of the experiment, an exploratory analysis is conducted. The second part involves building training models and automatically labelling the test images correctly. In the training and validation phase, different folds and values of the indices (i, j) are used to obtain the best accuracy. In addition, two approaches are adopted for testing. The first approach randomly selects training samples from the core tensors. In the second approach, a similarity score table is created and sorted in ascending order; a larger score means a noisier image, and core indices are collected from every noise level at a certain interval. Experimental results show that the training models with indices (i=8, j=8) were the most successful and that the second testing approach is more consistent. All experiments were conducted on a publicly available visual dataset, Fashion MNIST, using the MATLAB tensor factorization package known as the Tensor Toolbox.

Rodrigue, Sami – Experiments With External Data and Non-Linearity for Channel Usage Prediction (Supervisor: Ayse Bener)

Neural networks are one of the most popular models for predicting channel usage in the telecommunication spectrum. They commonly use spectral, temporal, or spatial information from simulated or cellular data; however, these sources can fail to capture the full range of user behaviour. We use fully connected neural networks and perceptrons on LMR data collected in Ottawa to explore whether enriching the input space with external data, such as weather data, or applying non-linear transformations to the input space improves the predictive power of the models. Based on our initial analysis, we have failed to identify any improvement in prediction from using weather data; however, the benefit of non-linear transformations depends on channel behaviour. The latter point can be further explored with other models, such as recurrent neural networks, and with different groupings of the channels.

Sharma, Suansh – Spectrum Occupancy Prediction in Land Mobile Radio Using Multiple Hidden Markov Models (Supervisor: Ayse Bener)

In this paper, we seek to predict the occupancy status of Land Mobile Radio (LMR) channels from real-life spectrum measurements using machine learning techniques. Cognitive radios are essential for implementing dynamic spectrum sharing, which has been gaining attention as a promising solution to the problem of spectrum scarcity. HMMs are widely studied in the literature for spectrum prediction; by design, they learn from the sequential nature of the data, which is directly applicable to the case of temporal spectrum occupancy prediction. We implement a model made up of multiple HMMs to perform spectrum occupancy prediction: submodels capture the primary user activity, and these submodels are then used to initialize a high-level HMM, which is trained on an LMR channel's occupancy over time. We validate the performance of the multiple Hidden Markov Models on LMR bands and show that the multiple-HMM model performs better than a single HMM at predicting occupancy status for the next hour. By training multiple HMMs, which capture not only channel occupancy patterns over time but also low-level user activity patterns, sizeable gains can be made in the performance of data-driven spectrum prediction techniques.
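
As a starting point for the multiple-HMM scheme above, a single Hidden Markov Model can be fit to a channel's occupancy series with hmmlearn, as in this hedged sketch; the simulated occupancy fractions and the two-state choice are placeholders, and the submodel-based initialization used in the project is not shown.

```python
# Hypothetical sketch: fitting a single two-state HMM to an LMR channel's hourly
# occupancy fraction with hmmlearn, then decoding the hidden state sequence.
# The simulated data is a placeholder; the project's multi-HMM initialization is omitted.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
busy = rng.random(500) < 0.3                              # latent busy/idle indicator
occupancy = np.where(busy, rng.normal(0.7, 0.1, 500), rng.normal(0.1, 0.05, 500))
X = occupancy.reshape(-1, 1)

model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=100, random_state=0)
model.fit(X)
states = model.predict(X)                                  # decoded busy/idle states
print(model.transmat_)                                      # learned state-transition probabilities
```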


Sirwani, Naresh Kumar – Prediction of Query Hardness (Supervisor: Ebrahim Bagheri)

Information retrieval (IR) has become an important part of today's data-driven world, yet most IR systems suffer from high variance in their retrieval performance and result quality; even a system that performs well on average can still return poor results for some queries. Understanding such hard queries, and predicting their difficulty before the search takes place, can bring many improvements to IR systems, including direct user feedback on the expected quality of results, federation or metasearch, content enhancement, and query expansion. In this paper, we systematically study and implement various TF-IDF based pre-retrieval methods to estimate query difficulty on different TREC data collections. We then compare our results with neural embedding and SELM (Semantic Enabled Language Model) based models, for which results are already available from similar studies, to determine which methods perform better and produce more relevant and accurate results.
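
One of the simplest TF-IDF based pre-retrieval predictors, the average inverse document frequency of the query terms, can be computed as in the sketch below; the tiny document collection and queries are placeholders rather than the TREC collections used in the study.

```python
# Hypothetical sketch: a simple pre-retrieval query-difficulty signal, the average
# IDF of the query terms. The mini collection and queries are placeholders for TREC data.
import math
from collections import Counter

docs = [
    "machine learning for information retrieval",
    "neural ranking models and query expansion",
    "evaluating retrieval effectiveness on trec collections",
]
N = len(docs)
df = Counter(term for doc in docs for term in set(doc.split()))

def avg_idf(query):
    terms = query.lower().split()
    idfs = [math.log((N + 1) / (df.get(t, 0) + 1)) for t in terms]   # smoothed IDF
    return sum(idfs) / len(idfs)

# Queries made of rarer (higher-IDF) terms tend to be more discriminative;
# low average IDF can flag queries that are likely to be hard.
print(avg_idf("retrieval"), avg_idf("query expansion"))
```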


Taylor, Kisha – Automated Stock Trading Based on Predicted Direction of Next-Day Closing Prices – S&P 500 Index (Supervisor: Kaamran Raahemifar)

This paper develops a model that tries to mimic a trader based on the predicted direction of the next-day closing price of the S&P 500 ETF (Exchange Traded Fund); the approach can be applied to a single stock or index.

Three approaches are used:

(1) Technical analysis only

(2) Machine Learning (ML) using only closing prices as inputs (baseline models) and

(3) ML model (“hybridized inputs”) that use a combination of technical indicator(s) and raw closing price as inputs.

This classification problem is evaluated using accuracy (the main metric), precision and recall, and return metrics. The data (sourced from Yahoo Finance) covers 3½ years of trading data (open and closing prices) from 2 January 2015 to 6 June 2018.

The paper also explores the use of a buffer, examining its predictive impact. The buffer is essentially a threshold used to derive the signal generated by the technical indicator.


Tomini, Emmalie – Load Forecasting Using Recurrent Neural Networks in Ontario Energy Markets (Supervisor: Pawel Pralat)

Reliable electricity load forecasting is essential for industry to devise efficient energy management plans as well as to guide conservation efforts. Rising market demand and unpredictable behaviours have meant that traditional methods of electricity prediction are no longer robust enough to accurately forecast market demand. The aim of this project was to use machine learning approaches to create a model for effective load forecasting. Implemented in Python, a recurrent neural network was trained on a variety of input features in order to determine what information is necessary to model Ontario load patterns. Calendar variables such as day, month, year, day of the week, and time, as well as relative humidity and dew point temperature, were found to produce the most accurate results when the RNN model was trained on this input space, yielding a MAPE of 5.19% on the test set. The models implemented in this study produce reasonably accurate day-ahead electricity forecasts. However, there is room for improvement, and machine learning approaches are an excellent fit for this area of study.

Walia, Harneet – Customer Acquisition Through Direct Marketing Campaign Analysis (Supervisor: Morteza Zihayat)

In this research, we analyze a direct marketing campaign dataset obtained from a Portuguese financial institution to predict whether a customer will subscribe to a fixed deposit (upsell), along with predicting the best month (time-aware) to reach out to the customer. To solve the time-aware upselling problem, we have implemented a Time Aware Upsell Prediction Framework (TAUPF) using two different approaches, with the aim of finding the best approach and technique to build the prediction model. TAUPF is implemented using the Upsell Prediction Approach (UPA) and the Clustered Upsell Prediction Approach (CUPA). We have also addressed the data imbalance problem by examining and comparing different methods of sampling (up-sampling and down-sampling). For Decision Tree, K-Nearest Neighbours, and Random Forest, it was observed that CUPA has a higher F-score than UPA. It was also observed that, by predicting the month, the number of calls made to a customer before the customer subscribes to a fixed deposit can be reduced significantly.

Wan, Alexander – Learning About Tensor Decomposition to Determine Length of Stay (Supervisor: Ayse Bener)

Tensor decomposition is an area of data science that can be used to build prediction models. By using tensor decomposition on St. Michael's Hospital data, a model can be developed to predict a patient's length of stay. An application of tensor decomposition known as the generalized tensor product is applied to the St. Michael's Hospital dataset. The dataset is assembled by measuring each variable's performance, using correlations for pre-processing, and keeping variables the hospital deems important. Model performance is measured by comparing the results against other machine learning algorithms. The average error from tensor decomposition was around 80 hours; however, in comparison to the other machine learning algorithms, tensor decomposition was more accurate. A major limitation was that the computer used for this project was not powerful enough to test higher dimensions, which could mean that the data needs to be revisited to build a better dataset for analysis.

Wu, Xinjie – Validation and Sensitivity Study of a LSTM Model for Stock Price Prediction (Supervisor: Bilal Farooq)

Time series data is everywhere in everyday life as well as in many business sectors. The ability to predict the future performance of a process helps reduce uncertainty and risk and improve profit and performance in many industries. Stock price sequences are an easily accessed, ongoing time-series dataset, and their 'unpredictable' nature makes them a good benchmark for emerging algorithms.
LSTM (Long Short-Term Memory) is an architecture well suited to time-series forecasting. In this project, a recently proposed LSTM model for predicting future stock prices and movements was studied and compared to other available models. The model was then applied to a few selected stocks for validation, and a sensitivity study of the model parameters was also presented.
The study showed that the presented model has an advantage over other models but is still not universal; perfect prediction is not guaranteed.

Abu-Ata, Muad Mustafa Husein - Optimization of Decision Model Microsimulation in Health Care (Supervisor: Andriy Miranskyy)

Microsimulation is used in health care to evaluate the cost-effectiveness of different diagnosis and treatment procedures. Producing valid and statistically significant simulation results requires a large input size. The aim of this project is to speed up an existing decision-model simulation for an Obstructive Sleep Apnea (OSA) study and to generalize the simulation to other diagnostic and treatment methods. Additionally, as mortality prediction is an important feature in such models, we aim to accurately incorporate mortality prediction into the simulation model. Parallelization and code refactoring are used to scale up the microsimulation model. We were able to scale the simulation up to four million patients in 21.8 minutes (reducing computational time by a factor of 14). Moreover, we applied the Lee-Carter model to predict future mortality rates, and the fitted model resulted in small residual errors.
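
A bare-bones sketch of parallelizing a patient-level microsimulation with Python's multiprocessing module is shown below; the per-patient simulation function, cost model, and cohort size are placeholders, not the OSA decision model used in the project.

```python
# Hypothetical sketch: parallelizing a patient-level microsimulation across CPU cores.
# simulate_patient is a placeholder for the OSA decision-model logic.
import multiprocessing as mp
import random

def simulate_patient(seed):
    """Placeholder per-patient simulation: returns a fake lifetime cost."""
    rng = random.Random(seed)
    cost, alive, age = 0.0, True, 50
    while alive and age < 100:
        cost += rng.uniform(500, 2000)                     # annual treatment/diagnosis cost
        alive = rng.random() > 0.02 + 0.001 * (age - 50)   # toy mortality hazard
        age += 1
    return cost

if __name__ == "__main__":
    n_patients = 100_000
    with mp.Pool() as pool:                      # one worker per CPU core by default
        costs = pool.map(simulate_patient, range(n_patients), chunksize=1_000)
    print(sum(costs) / n_patients)
```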


Chen, Yilin – New York City Green Taxi Trip Optimization (Supervisor: Konstantinos Georgiou)

Most people think driving a taxi is all about driving skill, that there are no special rules or tricks that let a taxi driver earn an outstanding income, and that how much a taxi driver earns depends entirely on luck and long working hours. However, what if there are some hidden tricks that could help a taxi driver increase their daily revenue? This research paper aims to reveal those tricks and rules by digging into big data. I use the 2016 New York Green Taxi trip data from the NYC open data source to build an algorithm that takes the expected starting location, the time of day, and the date of the year as inputs, and outputs recommendations to taxi drivers on whether the chosen starting location is likely to earn the maximum revenue or whether adjacent locations could earn more. A random forest is used to identify the factors that affect total revenue. The final simulation results indicate that by following the recommendations provided by the algorithm, a taxi driver's revenue is very likely to increase.

Durrani, Afsah – Filtering of Tweets to Identify and Remove Un-Informative Concepts (Supervisor: Ebrahim Bagheri)

Due to recent technological advancements, there has been a large increase in the number of online users and in the social media content they generate. This abundance of online social media data is used by multiple stakeholders to identify public opinion, trending topics, and user segments. The large amount of data requires high computational power, which is traditionally dealt with by removing uninformative words using preprocessing techniques, such as stopword removal, before analysis. We present approaches using two correlation algorithms to identify uninformative concepts. The effectiveness of the approach is evaluated by measuring the performance of LDA models applied to the new datasets derived from the experiments. Correlation with the sum of all concepts performs better than correlation with the noise signal. Varying correlation threshold values are experimented with, of which higher thresholds provide better LDA performance.

Fatima, Hira - Analysis of Reddit Groups (Subreddits) Using Classification of Subreddit Posts (Supervisor: Anatoliy Gruzd)

In this paper, I apply a novel approach that uses machine learning techniques to automatically label posts from a subreddit called “askhistorians”. Using descriptive analytics, I first conducted an exploratory analysis to look for patterns, correlations or relationships that could be used to generalize the posting patterns and behaviour of Reddit users. The second part of my analysis comes from training and evaluating eight classifiers that categorize Reddit posts with a positive or negative label for each of the eight category codes listed in Appendix A. I used three different algorithms and compared their performance using accuracy, precision and recall. This research is a continuation of an existing study that started in the TMU Social Media Lab (RSML) [1]. The dataset used to train and evaluate the classifiers was coded manually by the TMU Social Media Lab (RSML). The predicted classifications were used to provide further insights about the subreddit group.

Ghaderi, Amir - Credit Card Fraud Detection Using Parallelized Bayesian Network Inferencing (Supervisor: Youcef Derbal)

The number of credit card transactions is growing, taking an ever-larger share of the world's payment system, and improved credit card fraud detection techniques are required to maintain that system's viability. The aims of this Major Research Project are (1) to develop a Bayesian network model able to predict fraudulent credit card transactions with minimal false positive predictions and (2) to reduce processing time by parallelizing the inferencing process. The Bayesian network was trained on credit card transaction data from European cardholders for the month of September 2013. The results show that Bayesian networks can be trained to predict fraudulent credit card transactions with zero false positive predictions, and that Bayesian network inferencing can be efficiently parallelized to reduce the overall processing time.
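For readers unfamiliar with the setup, the sketch below builds a toy discrete Bayesian network with the pgmpy package and runs one inference query; the two-variable structure, probability tables and variable names are invented for illustration and are not the project's model. The parallelization itself amounts to distributing independent per-transaction queries across worker processes.

    # Toy Bayesian network for fraud scoring (pgmpy); structure and CPDs are
    # illustrative assumptions. Per-transaction queries are independent, so in
    # practice they can be distributed across worker processes.
    from pgmpy.models import BayesianNetwork
    from pgmpy.factors.discrete import TabularCPD
    from pgmpy.inference import VariableElimination

    model = BayesianNetwork([("HighAmount", "Fraud"), ("NightHour", "Fraud")])
    model.add_cpds(
        TabularCPD("HighAmount", 2, [[0.9], [0.1]]),
        TabularCPD("NightHour", 2, [[0.7], [0.3]]),
        TabularCPD("Fraud", 2,
                   [[0.99, 0.90, 0.80, 0.50],   # P(Fraud=0 | evidence combinations)
                    [0.01, 0.10, 0.20, 0.50]],
                   evidence=["HighAmount", "NightHour"], evidence_card=[2, 2]),
    )

    infer = VariableElimination(model)
    print(infer.query(["Fraud"], evidence={"HighAmount": 1, "NightHour": 1}))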

Ghaly, John - A Defect Prediction Model Using Delta Static Metrics (Supervisor: Ayse Bener)

Dependence on software to automate, optimize and manage our daily tasks grows every day. As demand for more software functionality increases, software size and complexity also increase, and maintaining software and finding defects becomes a hard and time-consuming job. We propose a machine learning model to identify and predict defect-prone modules. We use an industrial dataset to build eight classifiers from five different categories based on static, churn, and delta metrics. We found that the addition of delta metrics significantly reduced the probability of false alarm while improving the probability of detection, which validates our hypothesis about the added value of delta metrics. Most algorithms achieved reasonable performance given a suitable technique.

Hon, Marcia - Alzheimer's Diagnosis with Convnet (Supervisor: Naimul Khan)

Alzheimer’s is a serious disease characterized by progressive degeneration of the brain and accounts for 60 to 80 percent of dementia cases. The ability to automate its diagnosis is very important for accelerating treatment.
In this project, convolutional neural networks (convnets) are used to automate the classification of Alzheimer’s disease from MRI images. 6,400 MRI images were taken from http://www.oasis-brains.org and evaluated with 5-fold cross-validation using an 80%/20% training/validation split. VGG16, which achieved top results in the ImageNet competition, is used in this project: its classification layer is retrained, adapting code from https://keras.io/applications/ and https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html, using Keras, TensorFlow, and Python. A high accuracy of 92% was achieved. This success suggests that machine learning can readily be applied to medicine. Future projects could classify skin cancer and other diseases with a visual component using different convnets. Additionally, given sufficient longitudinal data, MRI could be used to predict Alzheimer’s rather than merely classify it.
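The transfer-learning setup described above follows a standard Keras pattern, sketched below: a frozen VGG16 base with a retrained classification head. The image size, head architecture and training call are illustrative assumptions, not the project's exact configuration.

    # Frozen VGG16 base + retrained head in Keras; sizes and head layers are
    # illustrative assumptions.
    from tensorflow.keras.applications import VGG16
    from tensorflow.keras import layers, models

    base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = False                       # keep the pretrained convolutional features

    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),   # Alzheimer's vs. normal
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    # model.fit(train_images, train_labels, validation_data=(val_images, val_labels), epochs=10)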

Islam, Md Shariful - Opinion Mining Classification of Twitter Data for Major Telecom Operators in Canada (Supervisor: Naimul Khan)

Fierce competition is visible among telecom operators to acquire more subscribers through advertisements and campaigns, especially on social media. The question then arises of how to measure operators' performance based on customer response. The goal of this project is to measure the competitive performance of mobile phone operators by analyzing customer sentiment in Twitter data and to build classifier models using different machine learning algorithms. 9,000 tweets were collected for the top three mobile operators in Canada. After data cleaning and text processing, sentiment analysis was completed, and sentiments were classified and compared using three different algorithms. Among the three operators, Telus had the highest positive sentiment. Among the algorithms, SVM and Random Forest had better accuracy than the decision tree. This will help wireless operators learn about negative experiences and turn them into positive ones by improving the particular service.
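One of the classifiers mentioned (SVM) can be sketched as a simple text pipeline, as below; the toy tweets and labels are illustrative assumptions, not the collected dataset.

    # Tweet sentiment classification with TF-IDF features and a linear SVM;
    # the example tweets and labels are toy data.
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    tweets = [
        "great coverage and fast LTE, very happy",
        "dropped calls again, worst service ever",
        "billing issue resolved quickly, thanks",
        "still no signal downtown, so frustrating",
    ]
    labels = ["positive", "negative", "positive", "negative"]

    clf = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
    clf.fit(tweets, labels)
    print(clf.predict(["my data plan keeps dropping, terrible"]))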

Kundu, Somnath - Graph Theory Perspective of Stock Market Behaviour (Supervisor: Pawel Pralat)

It is often noticed in practice that the prices of different stocks and other financial investment instruments move together. This is not too surprising, since many companies engage in similar types of business and one business depends on others, so those companies can be thought of as tied together by an invisible thread of relationship. Although it is difficult to determine the actual relationships between companies, we can measure the strength of a relationship through the similarity of movement of their attributes. We assume that if the correlation coefficient of one or more attributes of two stocks exceeds some chosen threshold, then those two stocks are connected by an edge in the relationship graph. In this project our objective is to explore these relationships between stocks from a graph theory perspective and to investigate various properties of this stock relationship graph, including clustering.
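The thresholded-correlation construction can be sketched directly with networkx, as below; the random return matrix and the 0.6 threshold are illustrative assumptions, not the project's data or settings.

    # Build a stock relationship graph from a return correlation matrix;
    # the return data and threshold are toy values.
    import numpy as np
    import pandas as pd
    import networkx as nx

    rng = np.random.default_rng(0)
    tickers = ["AAA", "BBB", "CCC", "DDD", "EEE"]
    returns = pd.DataFrame(rng.normal(size=(250, len(tickers))), columns=tickers)

    corr = returns.corr()
    G = nx.Graph()
    G.add_nodes_from(tickers)
    for i, a in enumerate(tickers):
        for b in tickers[i + 1:]:
            if abs(corr.loc[a, b]) >= 0.6:        # threshold defines an edge
                G.add_edge(a, b, weight=corr.loc[a, b])

    print(G.number_of_edges(), nx.number_connected_components(G))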

Patel, Jitesh - Predicting Breast Cancer Survival Based on Gene Expression and Clinical Variables (Supervisor: Youcef Derbal)

Survival of breast cancer patients varies widely. Several gene sets are directly or indirectly involved in breast cancer. I explored whether combining the mRNA expression of such sets can improve survival prediction for triple-negative breast cancer. I used TCGA gene expression data for this study and classified 19 genes into two sets based on the relationship between death risk and the expression of each gene using a Cox model. Up-regulation of the first gene set combined with down-regulation of the second gene set is correlated with high risk in triple-negative breast cancer, which is defined by the expression of the estrogen, progesterone and HER2 receptors. The combined effect of gene sets 1 and 2 on survival was estimated on the overall data for the triple-negative and luminal classes of breast cancer using the Kaplan-Meier model. Combining the effect of multiple gene signatures improves prediction of triple-negative breast cancer survival, and this methodology can be relevant for other cancer types and targeted therapies.
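The two survival tools named above (Cox regression and Kaplan-Meier estimation) are available in the lifelines package; the sketch below shows the general pattern on a synthetic data frame whose column names and values are illustrative assumptions, not TCGA data.

    # Cox model to relate expression to death risk, Kaplan-Meier to compare groups;
    # the synthetic data frame is an illustrative assumption.
    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter, KaplanMeierFitter

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "time": rng.exponential(60, size=300),        # months of follow-up
        "event": rng.integers(0, 2, size=300),        # 1 = death observed
        "gene_expr": rng.normal(size=300),            # standardized expression of one gene
    })

    cph = CoxPHFitter()
    cph.fit(df, duration_col="time", event_col="event")   # sign of the coefficient splits genes into sets
    print(cph.summary[["coef", "p"]])

    high = df["gene_expr"] > 0
    kmf = KaplanMeierFitter()
    kmf.fit(df.loc[high, "time"], df.loc[high, "event"], label="high expression")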

Rizvi, Syed Ali Mutahir - Prediction of the Directional Change or Strength of Forex Rates (Supervisor: Ozgur Turetken)

Predicting the direction or strength of FOREX pairs is extremely hard, and predicting trends has been an area of interest for researchers for many years due to the market's complex and dynamic nature. There are hundreds of trend indicators for FOREX prediction, but their accuracy is not reliable. In this project, a combination of indicators (Day Close Strategy, Moving Average Crossover, Fractal Strategy, Renko charts, ATR and Breakout) and machine learning algorithms (Naïve Bayes, Support Vector Machine, Deep Neural Networks) is used to improve prediction accuracy, and the results suggest that this approach is helpful for decision support.
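As a simplified example of one indicator/classifier pairing of the kind the project combines, the sketch below turns a moving-average crossover into a feature for an SVM that predicts next-day direction; the random rate series, window lengths and features are illustrative assumptions, not the project's strategies.

    # Moving-average crossover feature feeding an SVM direction classifier;
    # the rate series and parameters are toy values.
    import numpy as np
    import pandas as pd
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    rates = pd.Series(1.30 + np.cumsum(rng.normal(0, 0.002, size=1000)))   # placeholder FX series

    fast = rates.rolling(5).mean()
    slow = rates.rolling(20).mean()
    features = pd.DataFrame({
        "crossover": fast - slow,                 # positive when the fast MA is above the slow MA
        "return_1d": rates.pct_change(),
    }).dropna()
    target = (rates.shift(-1) > rates).loc[features.index].astype(int)      # next-day up = 1

    X, y = features.iloc[:-1], target.iloc[:-1]   # drop the last row (no next-day value)
    clf = SVC(kernel="rbf").fit(X, y)
    print(clf.score(X, y))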

Shi, Pengshuai - Population Counting with Convolutional Neural Networks (Supervisor: Kosta Derpanis)

In this project, we explore the challenge of automatic population counting from single images. Most recent work applies neural networks to extract visual features and regress the population count either explicitly or implicitly; such models have been shown to outperform traditional counting methods that require localizing each object and hand-crafted image feature representations. In this work, we compare two types of CNN-based counting models. The first is a fully convolutional neural network (CNN) that predicts a pixel-wise density map, with the entity count obtained by post-hoc summing over the density map. The second is a network consisting of an initial set of convolutional layers followed by fully connected layers that directly regress the entity count. Our empirical evaluation considers three diverse datasets: (i) cells captured under a microscope, (ii) aerial views of sea lions, and (iii) aerial views of crowds of people. We find that the direct count regression approach generally performs better than the indirect one. In addition, we explore a saliency map approach to visualize the locations of the counted entities.
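The density-map idea can be sketched with a small fully convolutional network in Keras, as below: the network outputs a per-pixel density and the count is the sum over the map. The layer sizes and input resolution are illustrative assumptions, not the models compared in the project.

    # Tiny fully convolutional density-map network; count = sum of the map.
    # Layer sizes and resolution are illustrative assumptions.
    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_density_net(input_shape=(256, 256, 3)):
        return models.Sequential([
            layers.Conv2D(32, 3, padding="same", activation="relu", input_shape=input_shape),
            layers.Conv2D(64, 3, padding="same", activation="relu"),
            layers.Conv2D(64, 3, padding="same", activation="relu"),
            layers.Conv2D(1, 1, activation="relu"),   # non-negative density per pixel
        ])

    model = build_density_net()
    model.compile(optimizer="adam", loss="mse")       # trained against ground-truth density maps

    image = tf.random.uniform((1, 256, 256, 3))
    density = model(image)
    count = tf.reduce_sum(density)                    # post-hoc count = sum of the density map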

Trikha, Anil Kumar - Enhancing User Interest Representation in Social Media (Supervisor: Ebrahim Bagheri)

User interest detection in social media is valuable for providing recommendations of goods and services, modeling users, and supporting online advertising. Only recently have models for inferring implicit interests and predicting future interests been proposed. We extend these models by specifying a technique that yields an improved representation of user interests for these purposes. We evaluate the solution on publicly available Twitter data. The research question we address is whether user interests derived from micro-blogging posts can be more accurately represented using a data mining approach that utilizes association rules.
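Association-rule mining of the kind referred to above can be sketched with the mlxtend package, as below; the toy per-user topic lists and the support/confidence thresholds are illustrative assumptions, not the technique evaluated in the project.

    # Mining association rules over per-user interest topics with mlxtend;
    # the topic lists and thresholds are toy values.
    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori, association_rules

    user_topics = [
        ["politics", "economy", "climate"],
        ["politics", "economy"],
        ["sports", "music"],
        ["politics", "climate"],
        ["economy", "climate"],
    ]

    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(user_topics).transform(user_topics), columns=te.columns_)
    frequent = apriori(onehot, min_support=0.4, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
    print(rules[["antecedents", "consequents", "support", "confidence"]])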

Yadav, Shailendra Kadhka - Risk Prediction of Collisions in Toronto (Supervisor: Ayse Bener)

Collision prediction models are used for a variety of purposes, most frequently to estimate expected accident frequencies from roadway and behavioural factors such as aggressive driving, traffic control, road class and speeding, and to identify factors associated with the occurrence of accidents. In this study, Decision Tree, Random Forest and ARIMA time series models are applied to the Killed or Seriously Injured (KSI) traffic data to predict the severity of injury and the number of collisions in Toronto over the next 12 months. The ARIMA model gives an accuracy of 85% for predicting the number of collisions. The Decision Tree (using CART) and Random Forest models give accuracies of 57% and 67% respectively for classifying injury types.
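A 12-month ARIMA forecast of this kind can be sketched with statsmodels, as below; the synthetic monthly series and the (1, 1, 1) order are illustrative assumptions, not the model fitted in the study.

    # Monthly-collisions forecast with statsmodels ARIMA; the series and order
    # are illustrative assumptions.
    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(0)
    months = pd.date_range("2010-01-01", periods=96, freq="MS")
    collisions = pd.Series(200 + 5 * np.sin(np.arange(96) / 12 * 2 * np.pi) + rng.normal(0, 10, 96),
                           index=months)

    model = ARIMA(collisions, order=(1, 1, 1)).fit()
    forecast = model.forecast(steps=12)               # next 12 months of expected collisions
    print(forecast.round(1))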

Yan, Bingsen - Automatic Sentiment Analysis Process: Amazon Online Shopping (Supervisor: Ozgur Turetken)

Background: Sentiment analysis offers significant time savings and efficiency gains, especially given that customers nowadays rely more and more on product reviews when shopping online. Aim: Develop a sentiment analysis tool as a decision aid for improving the online shopping experience. Methodology: We use web scraping to collect online real-time data, sentiment analysis to obtain a sentiment score for each review, and a machine learning model to predict star rating. Results: We developed an automatic process that generates a product report including price, star rating, reviews and sentiment scores; we analyzed the relationship between star rating and the three sentiment scores; and we built a prediction model that predicts star rating from sentiment scores. Conclusion: The tool enables customers to evaluate a product more efficiently and is a useful tool for star rating prediction.
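The score-then-predict step can be sketched with NLTK's VADER sentiment analyzer feeding a simple regression, as below; the toy reviews and the choice of plain linear regression are illustrative assumptions, not the project's pipeline.

    # VADER sentiment scores per review, then a regression that predicts star rating;
    # the reviews, ratings and model choice are illustrative assumptions.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    from sklearn.linear_model import LinearRegression

    nltk.download("vader_lexicon", quiet=True)
    sia = SentimentIntensityAnalyzer()

    reviews = ["Absolutely love it, works perfectly", "Broke after two days, waste of money",
               "Decent for the price", "Fantastic quality, highly recommend"]
    stars = [5, 1, 3, 5]

    # Each review gets positive, neutral and negative proportions from VADER.
    X = [[s["pos"], s["neu"], s["neg"]] for s in (sia.polarity_scores(r) for r in reviews)]
    model = LinearRegression().fit(X, stars)
    print(model.predict([[0.6, 0.4, 0.0]]))           # predicted rating for a mostly positive review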

Yueh, Ming-hui - Determining Factors Influencing Prediction of Length of Stay (Supervisor: Ayse Bener)

Having a predictive model helps doctors identify short-stay patients more objectively. To identify important features for predicting whether a patient's length of stay will be 72 hours or less, three stages of data processing were performed, from obtaining the initial variables to applying feature selection methods to determine a subset of features, which were then fed into several learning algorithms. AUC and precision-recall curves were used to measure model performance. Regardless of the selection method and imputation approach, the ALB (albumin) value, age and HGB (hemoglobin) value were found to influence model performance the most.
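The feature-selection-then-classify pattern described above can be sketched as an sklearn pipeline evaluated by AUC, as below; the synthetic data and the choice of k are illustrative assumptions, not the study's variables or methods.

    # Univariate feature selection feeding a logistic regression, scored by AUC;
    # the synthetic data and k are illustrative assumptions.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    X, y = make_classification(n_samples=1000, n_features=30, n_informative=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))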