Data Science Research Centre

"Data-Information-Knowledge-Application"

DSRC Big Data Day

Time 09:00 to 16:00

Location: EV 2.260

Big Data Day - Talk Session

Time 09:00 to 12:00

Location: EV 2.260

Talks

09:00 Welcome

09:15 Data-driven Healthcare

Abstract How can Big Data from hospitals, insurers, technology providers, universities, safety boards, and businesses best be utilized? The Working Group on Predictive Health of Canada's Big Data Consortium considers the impact of Big Data on healthcare, and how best to seize these opportunities for data-driven healthcare in the Canadian context.

Speaker Gregory Butler, Professor of Computer Science and Software Engineering at Concordia University, Montreal, Canada. Member of the Working Group.

10:15 Markov Decision Problems in finance and high dimensionality

Abstract Markov Decision Problems (MDPs) have many applications in finance, such as portfolio management optimization, dynamic hedging and optimal liquidation. To obtain a realistic representation of the dynamics of market assets, it is necessary to include multiple features in the financial optimization schemes, such as multivariate volatilities, liquidity risk, basis risk, stochastic term structure, dynamic volatility surfaces, transaction costs, etc. Embedding each of these features into MDPs produces high-dimensional optimization problems that cannot be solved efficiently through traditional dynamic programming approaches. We will briefly discuss potential future research in this area and potential solutions for handling these high-dimensional problems.

Speaker Frédéric Godin, Assistant Professor, Mathematics & Statistics, Concordia University, Montreal, Canada.

11:15 The Open Itinerum Smartphone Travel Survey Platform

Abstract The Itinerum smartphone application records location information on users' trips through their smartphones.

Speaker Zachary Patterson, Canada Research Chair, Geography, Planning & Environment, Concordia University, Montreal, Canada.

Big Data Day - Poster Session

Time 14:00 to 16:00

Poster Set Up 13:00 to 14:00

Location: EV 2.260

Posters

Suggested poster size is 36 inches by 48 inches.

An italicized author name indicates the presenter.

P01 TranCEP: Predicting transmembrane transport proteins using information on amino acid composition, evolution, and specificity-determining positions, Munira Alballa, Faizah Aplop, Gregory Butler

Abstract Transporters mediate the movement of compounds across the membranes that separate the cell from its environment, and across inner membranes surrounding cellular compartments. It is estimated that one third of a proteome consists of transmembrane proteins, and many of these are transport proteins. Given the increase in the number of genomes being sequenced, there is a need for computational tools that predict the substrates transported by transmembrane transport proteins. TranCEP is a predictor of the type of substrate transported by a transmembrane transport protein. TranCEP combines the traditional use of the amino acid composition of the protein with evolutionary information captured in a multiple sequence alignment, and with a restriction to the important positions of the alignment that play a role in determining the specificity of the protein. Our experimental results show that TranCEP significantly outperforms the state of the art. The results quantify the contribution made by each kind of information used.
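As a rough illustration of the amino acid composition feature that TranCEP builds on, the sketch below computes the standard 20-dimensional composition vector of a protein sequence. This is a minimal, generic illustration, not TranCEP's actual feature pipeline.

```python
# Amino-acid composition (AAC): the fraction of each of the 20
# standard residues in a protein sequence.
AMINO = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq):
    """Return the 20-dimensional composition vector of `seq`."""
    seq = seq.upper()
    return [seq.count(a) / len(seq) for a in AMINO]

vector = aac("MKTA")  # a toy 4-residue sequence
```

Methods like TranCEP combine such composition vectors with alignment-derived information before classification.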

P02 Improving GraphChi for Large Graph Processing: Fast Radix Sort in Pre-Processing, Stuart Thiel, Gregory Butler, Larry Thiel

Abstract We consider GraphChi, a system for processing large graphs on a single modern PC, as a case study. GraphChi uses a "parallel sliding window" method, which requires that vertices and edges be assigned to similarly sized shards whose contents are sorted to allow for sliding, maintaining locality of access to nodes. GraphChi uses an expensive one-time pre-processing step. For example, the twitter-2010 graph, with 42M nodes and 1.5B edges, takes ten minutes to pre-process and just under three minutes for a single pass of the PageRank algorithm. We introduce Fast Radix, an improved Least Significant Digit (LSD) radix sort, to replace the existing sorting algorithm in the pre-processing. This yields 20-40% speed improvements over GraphChi's sort implementation, and upwards of 10% improvement in the total pre-processing time of large graphs.
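For readers unfamiliar with the underlying technique, the sketch below is a generic counting-sort-based LSD radix sort on non-negative integer keys; it illustrates the idea only, and is not GraphChi's sort or the authors' Fast Radix implementation.

```python
def lsd_radix_sort(keys, key_bits=32, digit_bits=8):
    """Sort non-negative integer keys with a Least Significant Digit
    (LSD) radix sort: repeated stable bucket passes over fixed-width
    digits, from the lowest digit to the highest."""
    mask = (1 << digit_bits) - 1
    for shift in range(0, key_bits, digit_bits):
        # Stable counting-sort pass on the current digit.
        buckets = [[] for _ in range(1 << digit_bits)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        keys = [k for b in buckets for k in b]
    return keys
```

In a sharding pre-processing step, such a sort would be applied to edge endpoint IDs so that each shard's contents are ordered for the sliding-window traversal.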

P03 An Automated Approach from GPS Traces to Complete Trip Information, Ali Yazdizadeh, Bilal Farooq and Zachary Patterson

Abstract Recent advances in communication technologies have enabled researchers to collect travel data with ubiquitous, location-aware smartphones. These advances hold out the promise of automatically detecting the critical aspects (mode of transport, purpose, etc.) of people's trips. Until now, efforts have concentrated on one aspect of trips (e.g. mode) at a time. Such methods have typically been developed on small data sets, often with data collected by researchers themselves rather than in large-scale real-world data collection initiatives. This research develops a machine learning-based framework to identify complete trip information from smartphone location data as well as online data from GTFS (General Transit Feed Specification) and Foursquare. The framework has the potential to be integrated with smartphone travel surveys to produce all trip characteristics traditionally collected through household travel surveys. We use data from a recent, large-scale smartphone travel survey in Montréal, Canada. The collected smartphone data, augmented with GTFS and Foursquare data, are used to train and validate three random forest models to predict mode of transport, transit itinerary, and trip purpose (activity). According to cross-validation analysis, the random forest models show prediction accuracies of 87%, 81% and 71% for mode, transit itinerary and activity, respectively. The results compare favorably with previous studies, especially when taking the large, real-world nature of the dataset into account. Furthermore, the cross-validation results show that the machine learning-based framework is an effective, automated tool to support trip information extraction for large-scale smartphone travel surveys, which have the potential to be a reliable and efficient (in terms of cost and human resources) data collection technique.

P04 Travel Mode Detection from Smartphone Data: Semi-supervised vs. Supervised Learning, Mohsen Rezaie, Zachary Patterson, Jia Yuan Yu and Ali Yazdizadeh

Abstract With the advent of GPS receivers and then GPS-enabled smartphones in transportation data collection, many studies have looked at how to infer meaningful information from this data. Research in this field has concentrated on the use of heuristics and supervised machine learning methods to detect trip ends, trip itineraries, travel mode and trip purpose. All methods used until now have relied uniquely on fully-validated data. However, the respondent burden associated with validation lowers participation rates and yields less usable data. In this paper we propose the application of semi-supervised methods, which let researchers and planners use both validated and un-validated data. We compare the accuracy of three popular supervised methods (decision tree, random forest and logistic regression) with a simple semi-supervised method (label propagation with a KNN kernel). Simple features, such as the speed, duration and length of the trip, and the closeness of the start and end points to the transit network, are used for model estimation. The results show that the semi-supervised method outperforms the supervised methods in the presence of a high proportion of un-validated data.
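The semi-supervised setup described above can be sketched with scikit-learn's LabelPropagation using a KNN kernel. The synthetic features and the validated/un-validated split below are illustrative assumptions, not the paper's data.

```python
# Minimal label-propagation sketch: validated trips carry a mode
# label, un-validated trips are marked -1, and labels propagate
# through a KNN similarity graph over simple trip features.
import numpy as np
from sklearn.semi_supervised import LabelPropagation

rng = np.random.default_rng(0)
X = rng.random((300, 3))               # e.g. [speed, duration, length]
y_true = (X[:, 0] > 0.5).astype(int)   # synthetic binary mode labels

y = y_true.copy()
y[90:] = -1                            # -1 marks un-validated trips

model = LabelPropagation(kernel="knn", n_neighbors=7)
model.fit(X, y)                        # uses labeled + unlabeled points
pred = model.predict(X[90:])           # inferred labels for un-validated trips
```

The appeal of this family of methods is exactly what the abstract notes: the un-validated rows still shape the similarity graph, so they contribute to the model rather than being discarded.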

P05 Transit Network Complexity in the Context of Transit Itinerary Inference with Smartphone Travel Survey Data, Marshall Davey and Zachary Patterson

Abstract Measuring the efficiency and service level of public transit networks has long relied on user surveys as the primary data source for gaining insight into the operations of a network. Thanks to recent advances in smartphone and GPS technologies, however, new data collection methods are becoming increasingly viable alternatives. Travel survey applications for smartphones and GPS loggers are both becoming commonplace among network analysts' tools, largely due to the precision of the spatial and temporal data collected and the ability to capture much larger sample sizes than previous methods. Along with the increased usage of these tools has come the opportunity to expand the repertoire of indices used by researchers to measure these networks. The research proposed in this paper revisits certain applicable metrics originating in graph theory and network analysis, while also suggesting a new metric of our own design. Our newly proposed metric, the "active routes on links" count (AROL), aims to provide a more nuanced description of a network than the currently available indices. The AROL index is especially useful for describing complex networks at a fine scale, a scale at which other metrics simply do not provide information. While a myriad of network analysis metrics currently exist, we focused our research on those which help researchers develop better itinerary-inference algorithms, such as those employed in mode-detection and trip-type-detection protocols. Most importantly, the AROL metric will provide network analysts with a measure of how reliably the GPS data points collected from their travel surveys can be attributed to one specific transit route. The problem of correct route inference typically occurs in areas of high transit overlap; while the concept of transit overlap has been explored by Vuchic and Musso (2005), their "line overlapping index" only provides one measure for the entire network. Our metric expands upon this idea by locating, at a fine scale, the individual road links which contain the increased overlap; furthermore, a time series can be generated for the amount of overlap occurring at any given time of day. By performing a comparative study of the transit networks of Montréal, Toronto, Calgary, and Vancouver, contrasted with validated survey data for Montréal, we create a ranking of these networks based on how reliably a route detection algorithm will function. By using GTFS datasets, provided free of charge by transit agencies, as well as GIS road shapefiles, the metrics described in this paper are intended to be readily available to researchers and planners from a variety of backgrounds and industries.
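The core of the AROL idea, counting how many distinct routes are active on each road link, can be sketched in a few lines. The data layout below (a mapping from route IDs to the link IDs they traverse) is an assumption for illustration, not the authors' implementation.

```python
# Illustrative "active routes on links" (AROL) count: for each road
# link, count the distinct transit routes whose itineraries use it.
from collections import defaultdict

def arol(route_links):
    """route_links: dict mapping route_id -> iterable of link ids.
    Returns a dict mapping link id -> number of distinct routes on it."""
    counts = defaultdict(int)
    for route, links in route_links.items():
        for link in set(links):       # count each route once per link
            counts[link] += 1
    return dict(counts)

arol({"R1": ["l1", "l2"], "R2": ["l2", "l3"]})  # l2 carries two routes
```

Links with high counts are exactly the high-overlap areas where route inference from GPS points is least reliable; repeating the count per time window yields the time series the abstract mentions.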

P06 Link Analysis in E-Commerce, Sariga Santosh, Rahul Panchal, Kartik Narayanan

Abstract E-commerce websites such as Amazon not only capture their users' reviews of products and services during and after the shopping experience, but also use these information-rich transactional data sources to create customized, personalized marketing offerings and product recommendations for each customer. However, such large retailers have millions of customers and millions of different products, which raises the need to provide high-quality recommendations in real time while the user is browsing. The Link Analysis node in SAS Miner visualizes a network of items by detecting the linkages among items in transactional data, or among levels of different variables in training or raw data; it develops item-cluster-induced segmentations of customers and provides next-best-offer recommendations.

P07 A Neural Network Conversational Model for Farsi, Farhood Farahnak

Abstract The field of Conversational Modeling (or chatbots) has experienced significant interest in recent years, both in the research community and in industry. However, most work in this area still uses hand-crafted rules and is designed to handle conversations in a specific language and domain, such as travel booking. Unfortunately, these rules are not easily portable to other languages or domains, so significant effort is required to build a new set of rules for each new application. On the other hand, recent breakthroughs in Deep Learning for Natural Language Processing, together with the availability of datasets, have led to the development of end-to-end systems that can overcome these issues of language- and domain-specificity. The goal of our project was to develop an open-domain chatbot for Farsi. To do this, we trained a deep neural network with a sequence-to-sequence architecture. As training data, we used Persian subtitles of movies from www.opensubtitle.com. This dataset, after preprocessing and removal of non-conversational texts, contains almost six million sentences. To train our chatbot, we created pairs of questions and responses by using every sentence in a subtitle as a response to the previous one. Using these question-response pairs, we trained the sequence-to-sequence model to predict an appropriate response for a given question. The model consists of two main modules: an encoder and a decoder. The question is fed to the encoder, and the encoder's output is used as the initial hidden state of the decoder. The decoder then generates the response word by word. Measured by perplexity, our model for Farsi underperforms a similar approach for English. Our model performs much better (in terms of perplexity) for short responses but degrades with longer ones. We believe that a larger dataset and an objective function better adapted to the task may improve our results.
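The encoder-decoder flow described above can be sketched schematically with a plain NumPy RNN: the encoder's final hidden state seeds the decoder, which then emits one token at a time. The dimensions, random weights, and greedy decoding below are illustrative assumptions; this is a schematic of the architecture, not the trained Farsi model.

```python
# Toy encoder-decoder sketch with a simple (untrained) RNN cell.
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 10, 8
W = rng.normal(size=(hidden, vocab))   # input-to-hidden weights
U = rng.normal(size=(hidden, hidden))  # hidden-to-hidden weights
V = rng.normal(size=(vocab, hidden))   # hidden-to-output weights

def one_hot(i):
    v = np.zeros(vocab)
    v[i] = 1.0
    return v

def encode(token_ids):
    h = np.zeros(hidden)
    for i in token_ids:
        h = np.tanh(W @ one_hot(i) + U @ h)  # one RNN step per token
    return h                                  # final state seeds the decoder

def decode(h, start_id=0, steps=4):
    out, x = [], one_hot(start_id)
    for _ in range(steps):
        h = np.tanh(W @ x + U @ h)
        nxt = int(np.argmax(V @ h))           # greedy word-by-word choice
        out.append(nxt)
        x = one_hot(nxt)
    return out

response = decode(encode([1, 2, 3]))          # token ids of the reply
```

In a real system the weights are learned from the question-response pairs, embeddings replace one-hot vectors, and decoding stops at an end-of-sentence token rather than after a fixed number of steps.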

P08 Deep Rectifier linguistic model for opinion spam detection, Zeinab Sedighi

Abstract Today, online comments and reviews have become invaluable resources in consumer decision making. For this reason, deceptive spam reviews are often used to deliberately promote or demote popular opinion, and can seriously affect customers and organizations. Thus, in the past few years, the automatic detection of deceptive reviews has attracted much attention from the research community. Most of the methods proposed in previous work have addressed this task as a standard classification problem, where reviews are classified as spam or non-spam. Therefore, much work has focused on feature learning to enhance classification performance. However, the use of hand-crafted features may fail to identify novel relevant ones. To this end, we propose the use of deep structures and unsupervised feature learning to address this challenge: novel features are derived while the model complexity is reduced. Our proposed model achieved an F-score of 87.6% on the Yelp data set and 75.23% on the Three-domain data set.

P09 Using Data Mining to Predict Employee Turnover with Emphasis on the Support Vector Machine, Sunny Truong, Saeedeh Mohtasham, Anthony Santaguida, Grégoire Fabre

Abstract The purpose of this project is to explore how data mining models can be used within human resources. More specifically, it tackles the problem of employee attrition at company XYZ by predicting which employees are likely to leave next. The effectiveness of the support vector machine was tested against other data mining models, such as random forests, decision trees, logistic regression, gradient boosting, neural networks, ensembles, and K-NN, and their results were compared. The analyses were conducted within SAS EM. The results show that using SVM, and other similarly performing models, can improve the identification of which employees are leaving by approximately four times over the naive baseline, which can lead to cost savings in hiring and retaining employees.

P10 Big Data in Small Time: Efficient Analysis of Large Data Sets using SAS Indexes, Views and Hash Objects, Omar Awwad, Romina Arrieta, Zheqin Zhang, Kayla Eisenstat, Sabrina Bant

Abstract SAS programs can require a substantial amount of memory, as well as ample CPU run time. To ensure programs run smoothly and as quickly as possible, efficiency techniques must be introduced into the SAS program. Three techniques for increasing efficiency are SAS views, SAS indexes and hash objects. By implementing these techniques, we reduced the memory and CPU time used in our analysis.

P11 Using Convolutional Neural Networks for Code Smell Detection based on Change History Information, Antoine Barbez and Yann-Gaël Guéhéneuc

Abstract Code and design smells are poor solutions to recurring implementation and design problems in a software system. Without necessarily having an impact during execution, they can seriously hinder comprehension and maintainability. Several approaches have been proposed to detect these smells using static code analysis. In this study, we detect and extract information about design smells by applying machine learning techniques to change history information mined from versioning systems. More precisely, our approach uses convolutional neural networks to extract deep features from historical data, and analyzes the resulting convolution filters.

P12 Debugging Numerical Error Propagation in the HCP Structural Pre-processing Pipelines, Ali Salari, Lalet Scaria, Gregory Kiar, Tristan Glatard

Abstract Operating systems are known to have an effect on the results produced by neuroimaging pipelines, presumably due to the creation, propagation and amplification of small numerical errors across the pipelines. Such errors highlight numerical instability, which is also likely to appear as a result of other types of small perturbations, such as acquisition and parametric noise. In previous studies, we showed that the pre-processing pipelines of the Human Connectome Project were sensitive to operating system variation. However, the precise causes of such instabilities, and the paths along which they propagate in the pipelines, remain unclear. We present a technique to identify the processes in a pipeline that create numerical errors during execution, and we apply this technique to the HCP structural pre-processing pipelines.

P13 Distribution Estimation in MDPs via a Transformation, Shuai Ma and Jia Yuan Yu

Abstract Most, if not all, risk-sensitive objectives in reinforcement learning concern the reward distribution rather than its expectation alone. Although the general deterministic reward function in MDPs takes three arguments (current state, action, and next state), it is often simplified to a function of two arguments (current state and action). The former is called a transition-based reward function, whereas the latter is called a state-based reward function. When the objective is a function of the expected cumulative reward only, this simplification works perfectly. However, when the objective is risk-sensitive, i.e., depends on the reward distribution, the simplification leads to incorrect values of the objective. Since most techniques work for MDPs with deterministic state-based reward functions, we propose a transformation for other types of reward functions, and estimate the (discounted) reward distribution in finite- and infinite-horizon MDPs with finite state and action spaces, with the aid of the generalized transformation and a normal distribution assumption.
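One standard way to handle a transition-based reward r(s, a, s') is to augment the state with the incoming transition, so that the reward becomes a function of the (augmented) state alone. The sketch below illustrates that general idea; it is not necessarily the paper's exact transformation.

```python
# Illustrative state-augmentation: convert a transition-based reward
# r(s, a, s') into a reward on the augmented state ((s, a), s'),
# which records how the current state was reached.
def augment_reward(r):
    """r: dict mapping (s, a, s_next) -> reward.
    Returns R: dict mapping augmented state ((s, a), s_next) -> reward,
    so the reward depends only on the augmented state."""
    R = {}
    for (s, a, s_next), reward in r.items():
        R[((s, a), s_next)] = reward
    return R

r = {("s0", "a0", "s1"): 1.0, ("s0", "a1", "s0"): -0.5}
R = augment_reward(r)
```

The transition kernel is lifted to the augmented state space in the same way, so distribution-estimation techniques built for state-based rewards apply unchanged, at the cost of a larger state space.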

P14 Accelerating reproducibility estimations with collaborative filtering, Soudabeh Barghi and Tristan Glatard

Abstract We focus on the computational reproducibility of Big Data analyses, which can be disrupted by different compilation and execution environments, as well as by variations in hardware architectures and software versions. We aim at reducing the number and duration of the executions required to estimate the computational reproducibility of analysis pipelines on large databases. We model the problem as a Collaborative Filtering problem, where the underlying utility matrix represents the reproducibility of the files generated during pipeline execution over data acquired in different subjects. We focus in particular on the data and pipelines of the Human Connectome Project (HCP), a project that maps human brain circuits and their relationship to behavior in a large population of healthy adults.
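A generic way to complete such a sparse utility matrix is low-rank matrix factorization by gradient descent: observed reproducibility scores constrain two small factor matrices, whose product predicts the unobserved entries. The toy matrix and hyperparameters below are purely illustrative, not the authors' method or data.

```python
# Matrix-factorization sketch for completing a sparse utility matrix
# (rows: subjects, columns: pipeline output files, entries: observed
# reproducibility scores; zeros are treated as unobserved here).
import numpy as np

def factorize(U, mask, rank=2, lr=0.05, epochs=500, seed=0):
    rng = np.random.default_rng(seed)
    n, m = U.shape
    P = rng.normal(scale=0.1, size=(n, rank))   # subject factors
    Q = rng.normal(scale=0.1, size=(m, rank))   # file factors
    for _ in range(epochs):
        E = mask * (U - P @ Q.T)    # error on observed entries only
        P += lr * (E @ Q)           # gradient step on subject factors
        Q += lr * (E.T @ P)         # gradient step on file factors
    return P @ Q.T                  # predicted full matrix

U = np.array([[1.0, 0.8, 0.0],
              [0.9, 0.0, 0.2]])
mask = (U != 0).astype(float)       # zero entries are unobserved
pred = factorize(U, mask)
```

With such a model, only a subset of pipeline executions needs to be run and scored; the remaining reproducibility values are predicted from the completed matrix.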