Data Science Research Centre

"Data-Information-Knowledge-Application"

DSRC Poster Session

Time: 14:00 to 17:00

Poster Set-Up: 13:00 to 14:00

Location: EV 11.119 and its foyer area

Guest

Rossella Blatt Vittal, PhD, Vice-President, Data Science Team Lead, and Senior Data Mining Architect, Societe Generale Corporate and Investment Banking.

Posters

Suggested poster size is 36 inches by 48 inches.

An italicized author name indicates the presenter.

P01 Non-Explicit Discourse Relation Recognition with Convolutional Neural Networks, Majid Laali, Andre Cianflone, Leila Kosseim

Abstract This poster describes our submission (CLaC) to the CoNLL-2016 shared task on shallow discourse parsing. We used two complementary approaches for the task: a standard machine learning approach for the parsing of explicit relations, and a deep learning approach for non-explicit relations. Overall, our parser achieves an F1-score of 0.2106 on the identification of discourse relations (0.3110 for explicit relations and 0.1219 for non-explicit relations) on the blind CoNLL-2016 test set.

P02 Computational Assessment of Text Complexity, from Lexical to Discourse level, Elnaz Davoodi and Leila Kosseim

Abstract Our work investigates the influence of discourse features on text complexity assessment. To do so, we created two data sets based on the Penn Discourse Treebank and the Simple English Wikipedia corpora and compared the influence of coherence, cohesion, surface, lexical, and syntactic features in assessing text complexity. Results show that with both data sets, coherence features are more correlated with text complexity than the other types of features. In addition, feature selection revealed that with both data sets, the most discriminating feature is a coherence feature.

P03 Automatic Labeling of French Discourse Connectives, Majid Laali and Leila Kosseim

Abstract Discourse connectives (e.g. however, because) are terms that can explicitly convey a discourse relation within a text. While discourse connectives have been shown to be an effective clue to automatically identify discourse relations, they are not always used to convey such relations, thus they should first be disambiguated between discourse-usage and non-discourse-usage. In this poster, we investigate the applicability of features proposed for the disambiguation of English discourse connectives for French. Our results with the French Discourse Treebank (FDTB) show that syntactic and lexical features developed for English texts are as effective for French and allow the disambiguation of French discourse connectives with an accuracy of 94.2%.

P04 Carpooling strategy for emergency evacuation, Jia Yuan Yu

Abstract PoolSych is a computer system for improving urban evacuation processes. It optimizes traffic by globally synchronizing cars, materialized as a personalized set of itinerary instructions from each car's perspective. Through a carpooling strategy, the PoolSych system minimizes traffic congestion by providing drivers with an alternative itinerary that includes pedestrian pickup. PoolSych guarantees the prompt evacuation of the driver, taking into account the extra time required for pickup. As a result, more people can be evacuated without exceeding the planned evacuation time.

P05 Human and Vehicle Detection from Unorganized Video Footage, Jia Yuan Yu

Abstract The overarching goal of almost all research in computer science is to make computers smart enough to act like humans. Recent advances in MEMS and data processing algorithms have opened up a huge space for researchers in computer vision and machine learning, bringing us closer to computers that match or even exceed human capabilities. Here we propose a system that is smart enough to detect humans and vehicles and to take decisions based on those detections. The proposed system efficiently detects humans and classifies them by age and gender. It is also capable of detecting vehicles such as cars and trucks and providing an approximate estimate of their speed. These techniques can be applied to the design of advanced homes that adjust the ambient environment to the people present in a room, to smart traffic lights that issue tickets automatically, and to medical applications that support decisions about a person's illness.

P06 Data-driven Risk-aware Constrained Markov Decision Processes, Jia Yuan Yu

Abstract The Markov decision process (MDP) is a powerful mathematical framework for formulating the environment of reinforcement learning. For different settings (time horizons, information integrity, criteria, etc.), different techniques can be used to solve the problem. In this study we focus on the impact of the reward function on MDPs with the value-at-risk (VaR) criterion. For short-term MDPs with a given target level, we can use backward induction to solve the problem by converting an orthodox MDP into an augmented-state 0-1 MDP. For short-term MDPs with a given percentile, enumeration yields the optimal policy. For long-term MDPs, a sub-optimal policy is studied by estimating the CDF of the total reward. For infinite-horizon MDPs, the optimality of stationary policies is studied.
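The short-term, target-level case can be illustrated with a minimal sketch (a toy MDP of our own, not the study's actual model): the VaR objective becomes a 0-1 objective over augmented (state, remaining target) pairs, solved by backward induction.

```python
from functools import lru_cache

# Toy MDP: states 0/1, actions 0/1 (illustrative numbers only).
# P[s][a] = list of (next_state, prob); R[s][a] = immediate integer reward.
P = {0: {0: [(0, 0.9), (1, 0.1)], 1: [(0, 0.2), (1, 0.8)]},
     1: {0: [(1, 1.0)],           1: [(0, 0.5), (1, 0.5)]}}
R = {0: {0: 1, 1: 0}, 1: {0: 2, 1: 1}}
HORIZON = 4

@lru_cache(maxsize=None)
def v(step, s, target):
    """Max probability of accumulating at least `target` more reward in the
    remaining steps -- the 0-1 objective of the augmented-state MDP."""
    if step == HORIZON:
        return 1.0 if target <= 0 else 0.0
    return max(sum(p * v(step + 1, s2, target - R[s][a])
                   for s2, p in P[s][a])
               for a in P[s])

# Probability of reaching total reward >= 5 from state 0:
print(v(0, 0, 5))  # ~0.82 for this toy instance
```

The percentile variant mentioned in the abstract would instead search over target levels for the largest one achievable with the given probability.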

P07 Solving Optimization Problems with Little or No Data, Jia Yuan Yu

Abstract We aim to solve the optimization problem of multi-resource allocation in data centers with minimum communication overhead. We will provide a solution in which each user is unaware of the resource demands of other users, and its utility function is private to itself. In our research we will use the additive-increase multiplicative-decrease (AIMD) algorithm and basic probability theory to solve the problem and obtain the social optimum values. Our solution will guarantee the non-trivial properties of multi-resource allocation, such as sharing incentive, strategy-proofness, Pareto efficiency, and envy-freeness.
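The AIMD mechanism referred to above can be sketched generically (made-up parameters, a single resource, and synchronous users; not the authors' data-center algorithm): each user grows its claim additively until aggregate demand hits capacity, then every user backs off multiplicatively, which drives identical users toward equal shares.

```python
# Minimal AIMD sketch for one shared resource (illustrative only).
def aimd(capacity, n_users, alpha=1.0, beta=0.5, rounds=200):
    shares = [0.0] * n_users
    for _ in range(rounds):
        if sum(shares) >= capacity:                   # congestion signal
            shares = [beta * x for x in shares]       # multiplicative decrease
        else:
            shares = [x + alpha for x in shares]      # additive increase
    return shares

shares = aimd(capacity=30.0, n_users=3)
# With identical parameters, the users converge to equal (fair) shares
# oscillating between beta*capacity/n and capacity/n.
```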

P08 Arbitrage Free Regularization for Forward Rates, Cody Hyndman and Anastasis Kratsios

Abstract We develop a novel functional regularization penalty which encodes the arbitrage-free condition in term-structure models. Using this penalty we introduce a functional regularization problem that fits an infinite-dimensional consistent HJM model to forward rate data in a parsimonious way, while respecting the no-arbitrage assumption. We explicitly develop a sequence of computationally tractable finite-dimensional realizations which are asymptotically equivalent to their infinite dimensional limiting HJM model.
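For reference, the no-arbitrage condition that term-structure penalties of this kind encode is, in the classical HJM setting under the risk-neutral measure, the drift condition (stated here generically; the poster's specific functional penalty is not reproduced):

```latex
% Classical HJM drift condition: under the risk-neutral measure, the drift
% \alpha of the forward rate f(t,T) is pinned down by its volatility \sigma,
% leaving no room for arbitrage.
\alpha(t,T) = \sigma(t,T) \int_t^T \sigma(t,s)\, ds
```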

P09 DataMobile: A Smartphone Travel Survey Experiment, Zachary Patterson

Abstract This paper describes an experiment using a pragmatic smartphone travel survey application developed to: minimize respondent burden while collecting primarily passive data between destinations, and invite a known population (of Concordia University) to participate in the study. Respondent burden was reduced by optimizing battery usage, requiring little from respondents apart from downloading and installing an app, filling out a short survey, and allowing the app to run in the background. The experiment showed that a surprisingly large number of people (892) contacted by e-mail were willing to participate in the study, resulting in a surprisingly large amount of data as well (4,154 respondent days). Moreover, the overall age distribution of the sample was found to be closer to the true population than that of a traditional OD survey capturing the same population. Differences in travel behavior results from the OD survey appear plausible given what is known about both smartphone and traditional surveys. The fact that respondents were not asked to validate their data reduced respondent burden, but it is clear that some validated data is necessary to derive meaningful information from collected data. The collection of some less accurate data when GPS is not available is an important avenue to reduce the number of missed trips. We believe that this experiment should be seen as a data point, among others, in trying to understand the trade-offs involved in the development of smartphone applications. It will hopefully contribute to their use on a larger scale in data collection initiatives.

P10 Transit Trip Itinerary Inference with GTFS and Smartphone Data, Zachary Patterson

Abstract Recently, a myriad of emerging technologies have been developed to supplement and contribute to conventional household travel surveys for transport-related data collection. While a great deal of research has concentrated on the inference of information from Global Positioning Systems (GPS) and mobile phone-collected data (e.g. trip detection, mode detection, etc.), to our knowledge, methods for inferring transit itineraries have not received much attention. This paper describes our research on transit itinerary inference, pairing data collected from the smartphone travel survey application DataMobile with GTFS data in Montreal, Canada. Transit trips from the 2013 household travel survey were recreated and recorded with DataMobile from May to July 2016. Transit itineraries (i.e. the sequences of routes) were then validated. That is, collected data was associated with transit routes for all parts of the trips. A transit itinerary inference algorithm was then applied to the collected data. Our approach is based upon the notion of transit route ambiguity. That is, since transit routes can overlap on significant portions of their length, any attempt to associate GPS data to routes, when routes overlap, will necessarily result in ‘ambiguity’ with respect to which routes were actually used. Using this notion of ambiguity, we calculate the proportion of transit trips whose associated transit routes are ambiguous (i.e. cannot be associated with only one route) under different assumptions, rules and eventually a simple algorithm. We find that using this approach, 94.2% of transit trip distance is assigned to either one transit route or walking and is thus unambiguous. This results in 87% correct prediction of transit routes.

P11 Big Data infrastructures for science automation, Tristan Glatard

Abstract We describe a research program to automate Big Data analyses from data processing to knowledge publication. We focus on the following three aspects of this immense challenge: (1) enabling interoperability among Big Data platforms, (2) ensuring reproducibility of Big Data analyses over time and space, (3) optimizing performance of Big Data computations. The potential applications of our research span the whole spectrum of scientific disciplines engaged in data science. As a worked example, we focus on neurosciences, exploiting our collaborations in this field.

P12 DFA Minimization in Map-Reduce, Iraj Hedayati, Shahab Harrafi, Ali Moallemi, Gösta Grahne

Abstract We describe Map-Reduce implementations of two of the most prominent DFA minimization methods, namely Moore's and Hopcroft's algorithms. Our analysis shows that the Moore implementation dominates the Hopcroft one by a factor of the underlying alphabet size, both in terms of running time and communication cost. This is validated by our extensive experiments on various types of DFAs, with up to 2^17 states. It also turns out that both algorithms are sensitive to skewed input, Hopcroft's algorithm being intrinsically so.
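A single-machine sketch of Moore's method (the sequential core that the poster's Map-Reduce formulation distributes; the example DFA is our own): start from the accepting/non-accepting partition and repeatedly split blocks by successor-block signatures until the partition is stable.

```python
# Moore's partition-refinement algorithm for DFA minimization (sketch).
def moore_minimize(states, alphabet, delta, accepting):
    # Initial partition: accepting vs. non-accepting states.
    block = {s: (s in accepting) for s in states}
    while True:
        # A state's new signature: its own block plus its successors' blocks.
        sig = {s: (block[s],) + tuple(block[delta[s][a]] for a in alphabet)
               for s in states}
        # If no block was split, the partition is stable: blocks are the
        # states of the minimal DFA.
        if len(set(sig.values())) == len(set(block.values())):
            return sig
        block = sig

# DFA over {a, b} accepting strings ending in 'a'; states 1 and 2 behave
# identically, so the minimal DFA has two states.
delta = {0: {'a': 1, 'b': 0}, 1: {'a': 2, 'b': 0}, 2: {'a': 2, 'b': 0}}
blocks = moore_minimize([0, 1, 2], ['a', 'b'], delta, {1, 2})
```

In the Map-Reduce version, computing the signatures is naturally a map step and regrouping states by signature a reduce step, repeated per refinement round.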

P13 Recovering Exchanged Data, Ali Moallemi, Adrian Onet, Gösta Grahne

Abstract The inversion of data exchange mappings is one of the thorniest issues in data exchange. In this paper we study inverse data exchange from a novel perspective. Previous work has dealt with the static problem of finding a target-to-source mapping that captures the “inverse” of a source-to-target data exchange mapping. As we will show, this approach has some drawbacks when it comes to actually applying the inverse mapping in order to recover a source instance from a materialized target instance. More specifically: (1) As is well known, the inverse mappings have to be expressed in a much more powerful language than the mappings they invert. (2) There are simple cases where a source instance computed by the inverse mapping misses sound information that one may easily obtain when the particular target instance is available. (3) In some cases the inverse mapping can introduce unsound information into the recovered source instance.

To overcome these drawbacks we focus on the dynamic problem of recovering the source instance using the source-to-target mapping as well as a given target instance. Similarly to the problem of finding “good” target instances in forward data exchange, we look for “good” source instances to restore, i.e. to materialize. For this we introduce a new semantics to capture instance-based recovery. We then show that given a target instance and a source-to-target mapping expressed as a set of tuple-generating dependencies, there are chase-based algorithms to compute a representative finite set of source instances that can be used to get certain answers to any union of conjunctive source queries. We also show that the instance-based source recovery problem is unfortunately coNP-complete. We therefore present a polynomial-time algorithm that computes a “small” set of source instances that can be used to get sound certain answers to any union of conjunctive source queries. This algorithm is then extended to extract more sound information for the case when only conjunctive source queries are allowed.

P14 Spare

P15 Spare

P16 Bracketology: NCAA March Madness, Scott Carr, William Chak Lim Chan, Alvaro Sanchez Guadarrama, Krzysztof Dzieciolowski

Abstract The National Collegiate Athletic Association (NCAA) is a not-for-profit association which regulates athletes from over 1,281 institutions, conferences, organizations, and individuals. One of the NCAA's most popular and most followed events is the Men's Division I Basketball Tournament, also known as March Madness. After qualification, 64 teams from Division I schools play head-to-head knockout basketball games throughout March and early April to determine a national champion. In a 64-team tournament, there are roughly 9.2 quintillion possible brackets. Numerous prizes have been given away by companies, but the most notable prize, a whopping $1 billion, was offered by Warren Buffett to anyone who could predict the outcome of the tournament. To date no one has completed the challenge. Our project focuses on developing a model to accurately predict the tournament matchups, and ultimately the winner of the tournament, using the datasets provided to us by Kaggle. Our project will illustrate the methodologies and datasets used, the limitations, results, future improvements, and conclusions.
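The quintillion figure follows from simple counting: a 64-team knockout bracket decides 63 games, each with two possible outcomes.

```python
# A 64-team single-elimination tournament has 64 - 1 = 63 games,
# each with two possible winners, so the number of distinct brackets is:
games = 64 - 1
brackets = 2 ** games
print(brackets)  # 9223372036854775808, i.e. about 9.2 quintillion
```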

P17 Producing Credit Scorecards the Hard Way, Adam Deluca, Joshua Murphy, and Nadia Lam, Krzysztof Dzieciolowski

Abstract Credit scorecards are essential tools for lenders to quickly and transparently assess the creditworthiness of applicants. Increasingly, the development of these scorecards is being automated using software tools such as the SAS Credit Scoring module or STATISTICA. Lenders who automate their scorecard development may not understand the mechanics behind their software tools, and those who develop their scorecards without the aid of such programs may not realize the full benefits of what they offer.

Since small improvements in credit scoring models can lead to significant increases in their profits, lenders should strive to understand the logic and processes that underlie credit scorecard development. This paper demonstrates the development of a credit scorecard using Excel and base modules from Enterprise Miner in the place of the SAS Credit Scoring module so as to make these processes visible. This demonstration also serves to illustrate the limitations of developing scorecards without the aid of credit scoring software tools.

The data set used in this analysis is from Kaggle’s Give Me Some Credit contest, and it contains 150,000 observations and 11 variables. The data set is available at https://www.kaggle.com/c/GiveMeSomeCredit/.
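One of the scorecard-development steps that such software automates can be sketched as follows (a generic illustration with made-up bin counts, not the paper's Excel/Enterprise Miner workflow): computing the Weight of Evidence (WoE) for the bins of a single characteristic, from counts of good and bad applicants per bin.

```python
import math

# Weight of Evidence per bin: log of the ratio of the bin's share of
# goods to its share of bads. Bins richer in goods get positive WoE,
# risky bins negative WoE.
def weight_of_evidence(goods, bads):
    """goods[i], bads[i]: counts of good/bad applicants in bin i."""
    total_good, total_bad = sum(goods), sum(bads)
    return [math.log((g / total_good) / (b / total_bad))
            for g, b in zip(goods, bads)]

# Hypothetical binning of one characteristic (e.g. age bands):
woe = weight_of_evidence(goods=[80, 60, 20], bads=[10, 20, 30])
```

WoE values are then typically combined with logistic regression coefficients and scaled into scorecard points.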

P18 Spare

P19 Clustering of Yeast Gene Expression Data Using Spectral Clustering, Stephanie Kamgnia and Greg Butler

Abstract Gene expression data make it possible to explore the expression levels of many genes of an organism at the same time under a variety of conditions. Their analysis has permitted the extraction of useful information for understanding the complex process of gene regulation and the functional annotation of unknown genes. In this work, we propose to use a spectral clustering algorithm on the yeast cell cycle expression data as a first step toward reconstructing the regulation mechanism of cell cycle expression.
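The core of spectral clustering can be sketched in a few lines (an illustrative two-cluster toy example; the poster's actual pipeline on the yeast data is not reproduced): build a Gaussian affinity matrix over the expression profiles, form the graph Laplacian, and split the data using the sign of the Fiedler vector.

```python
import numpy as np

# Minimal spectral bipartition: affinity matrix -> Laplacian -> sign of the
# eigenvector for the second-smallest eigenvalue (the Fiedler vector).
def spectral_bipartition(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    W = np.exp(-d2 / (2 * sigma ** 2))                   # Gaussian affinity
    L = np.diag(W.sum(1)) - W                            # unnormalized Laplacian
    vals, vecs = np.linalg.eigh(L)                       # ascending eigenvalues
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)                     # sign -> two clusters

# Two well-separated toy "expression profiles":
X = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
labels = spectral_bipartition(X)
# The first two points land in one cluster, the last two in the other.
```

For more than two clusters, one would instead embed the data into the first k eigenvectors and run k-means in that space.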

P20 Recent Papers, Faizah Aplop, Stuart Thiel, Larry Thiel, Greg Butler

Paper On predicting transport proteins and their substrates for the reconstruction of metabolic networks, Faizah Aplop and Gregory Butler, IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2015), Niagara Falls, Canada, 12--15 August 2015

Paper Improving GraphChi for Large Graph Processing: Fast Radix Sort in Pre-Processing, Stuart Thiel, Greg Butler, Larry Thiel, 20th International Database Engineering and Applications Symposium, IDEAS '16, Montreal, Canada, July 11-13, 2016

Paper TransATH: Transporter Prediction via Annotation Transfer by Homology, Faizah Aplop and Gregory Butler, 4th International Conference on Advances in Intelligent Systems in Bioinformatics, Chem-Informatics, Business Intelligence, Social Media and Cybernetics 2016 (InteliSys 2016), Bandung, Indonesia, 20--21 August 2016

Paper Metabolic Pathway Reconstruction of Fungal Genomes, Faizah Aplop and Gregory Butler, 4th International Conference on Advances in Intelligent Systems in Bioinformatics, Chem-Informatics, Business Intelligence, Social Media and Cybernetics 2016 (InteliSys 2016), Bandung, Indonesia, 20--21 August 2016

P21 Short term traffic flow prediction on urban motorway networks, Taiwo O. Adetiloye and Anjali Awasthi

Abstract In recent years, there has been increased research interest in modeling urban traffic congestion. This comes in various forms, from finding the right type of traffic recording equipment, techniques of data collection, and data cleaning, to accurate and reliable analytic methods and adequate means of simulating traffic scenarios before putting them to actual use.

We investigate the use of data mining for modeling short-term traffic congestion on urban motorway networks under two main categories: neural networks and random forest classifiers. The neural networks can be further classified into the back-propagation neural network, neuro-fuzzy systems, and the deep belief network. Our preliminary experimental tests showed that both can offer a reliable and effective means of predicting short-term traffic congestion towards better traffic management. While there may be some limitations, such as obtaining real-time traffic data, our practical solution can lead to better ways to improve traffic flow in municipalities.

P22 On q-Bernstein polynomials for density estimation, Yogendra Chaubey and Qi Zhang

Abstract The q-Bernstein polynomial is a generalized version of the classical Bernstein polynomial, which was proposed in 1912 by the Russian mathematician S.N. Bernstein in his proof of the Weierstrass Approximation Theorem. These polynomials require two constants, M and q in (0, 1]. The degree of approximation improves as M tends to infinity and q tends to 1. This project investigates the nature of approximation for density estimation by taking M very large and selecting q by numerical methods. The advantage of this approach is the simplicity of investigation with respect to a single parameter q. This project will demonstrate the implementation for selecting a proper q and obtaining a smooth density estimator for given data.
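The q → 1 special case, the classical Bernstein density estimator, can be sketched as follows (an illustrative sketch with made-up data; the q-generalization and the numerical selection of q are not reproduced). Data are assumed to lie in [0, 1]; the estimator smooths the empirical CDF increments with Bernstein basis polynomials.

```python
import math

# Classical Bernstein polynomial density estimator of degree M - 1,
# built from increments of the empirical CDF over a grid of M cells.
def bernstein_density(data, M):
    n = len(data)
    ecdf = lambda t: sum(x <= t for x in data) / n  # empirical CDF
    def f_hat(x):
        return M * sum((ecdf((k + 1) / M) - ecdf(k / M))
                       * math.comb(M - 1, k) * x**k * (1 - x)**(M - 1 - k)
                       for k in range(M))
    return f_hat

f = bernstein_density([0.1, 0.2, 0.25, 0.3, 0.7], M=10)
# f is a smooth polynomial density estimate that integrates to 1 on [0, 1];
# larger M sharpens the estimate, at the cost of less smoothing.
```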