Data Science Research Centre

"Data-Information-Knowledge-Application"

F.A.I.R. Guidelines for Machine Learning

How to do reproducible science for A.I.

An ARRE Event: Aid to Research Related Events, Exhibition, Publication and Dissemination Activities Program

Funded by the Vice-President, Research and Graduate Studies, and the Gina Cody School of Engineering and Computer Science, Concordia University

Organizers: Gregory Butler and Tristan Glatard

Morning Session

Monday 25 March 2019

Time 09:00 to 12:00

Loyola Campus: GE 1.110

7141 Sherbrooke Street West

Talks

09:00 Why Science needs to be Open and FAIR

Abstract Open Science is a relatively common term, but not all researchers know how to practise it or to make the best use of it. The life sciences research landscape is adapting to these new calls for "openness", but does the scientific community really know what it is getting into, and why? The increasing calls for "openness" in research methodology, ethics, data, distribution and publication are necessary hard work, and essential for our community. The premise of Open Science is that scientific knowledge should be shared freely, openly, and in its digital form as early in the discovery process as is practical; it represents an approach to research that is collaborative, transparent and accessible. In this presentation, I will discuss how and why the research community needs to make Open FAIR science its mantra, and what some of the best ways are to achieve this.

Speaker Francis Ouellette, Department of Cell and Systems Biology, University of Toronto, Toronto, Canada.

BF Francis Ouellette is a leading proponent of open science in genomics. He is Associate Professor at the University of Toronto, Department of Cell and Systems Biology. Francis was one of the co-founders of the Canadian Bioinformatics Workshops in 1998. His teams have been involved in the development of high-throughput sequence analysis methods, as well as the development of platforms to integrate data from various open databases. Francis continues to be interested in computational biology and genomics, and the integration of all data types to help our understanding of biology.

10:00 From Big Data to FAIR Big Data to Better Cancer Care

Abstract Big data, artificial intelligence, machine learning and data science are new fields that are expected to have a major impact on day-to-day oncology practice. Big-data-based services such as automated contouring and planning, radiomics, decision support systems and literature mining are already available to our community, and they are expected to rapidly change the way we practice medicine. However, issues such as lack of reproducibility and barriers to data sharing currently limit the application of these technologies in the clinic. FAIR (Findable, Accessible, Interoperable, Reusable) principles and related technologies can help spread the usage of big data in cancer care.

Gaining a basic understanding of Big Data and the technologies based on it, including their strengths and weaknesses, is the overall aim of this teaching lecture. In particular: (1) What is the rationale behind using Big Data for data-driven medicine, and what is its relation to evidence-based medicine? What is Big Data in (radiation) oncology? (2) How big is Big Data? (3) How do we learn from Big Data? (4) What is FAIR, and how do you apply Big Data results in daily practice?

Speaker Alberto Traverso, MAASTRO Clinic, Department of Radiotherapy, Maastricht, Netherlands.

Dr Traverso is a Medical Physicist at the Department of Radiotherapy, MAASTRO Clinic (NL). He holds a PhD in Physics from the Polytechnic University of Turin. His research interests are quantitative imaging (radiomics) for image-based prediction modeling, and big data in radiation oncology.

11:00 The Brave New World of Semantics and Smart Data in the Microbial Life Sciences

Abstract Managing the volume and variety of biodata enabled by recent technological advances is currently one of the major challenges in the Life Sciences. A prerequisite for direct, large-scale functional comparison of the information captured in sequence data is a consistent, semantically interoperable annotation of encoded genetic elements with evidence statements. The current standard exchange format, however, provides only limited support for evidence statements, restricting data interoperability and hampering comparative analyses at large scale.

Employing Semantic Web technologies, we have developed the Genome Biology Ontology Language (GBOL), an associated tool stack, and a Semantic Annotation Platform with Provenance (SAPP), which allowed us to take a significant step forward in obtaining FAIR genome annotations that incorporate evidence statements at different levels. FAIR genome annotation is the essential next step in large-scale comparative genomics and can serve as a basis for a wide variety of other applications, such as in silico phenotype prediction and bioprospecting.
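As a rough illustration of the kind of annotation the abstract describes (a minimal sketch, not taken from the talk, and using placeholder terms rather than the actual GBOL vocabulary), the Python snippet below attaches a provenance-tracked evidence statement to a gene annotation as RDF triples with rdflib:

    # A minimal sketch of semantically interoperable annotation with
    # provenance, using rdflib and the W3C PROV ontology. The namespace,
    # gene, tool run, and property names are hypothetical placeholders,
    # not GBOL/SAPP terms.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, RDFS

    EX = Namespace("http://example.org/genome/")    # placeholder namespace
    PROV = Namespace("http://www.w3.org/ns/prov#")  # W3C provenance ontology

    g = Graph()
    g.bind("ex", EX)
    g.bind("prov", PROV)

    gene = EX["gene_0001"]
    annotation = EX["annotation_0001"]
    tool_run = EX["hmmer_run_42"]                   # hypothetical analysis run

    # The annotation itself: a putative function assigned to a gene.
    g.add((gene, RDF.type, EX.Gene))
    g.add((annotation, RDF.type, EX.FunctionAnnotation))
    g.add((annotation, EX.annotates, gene))
    g.add((annotation, RDFS.label, Literal("putative ABC transporter")))

    # The evidence statement: which activity generated the annotation.
    g.add((tool_run, RDF.type, PROV.Activity))
    g.add((annotation, PROV.wasGeneratedBy, tool_run))
    g.add((annotation, EX.evidenceScore, Literal(0.93)))

    print(g.serialize(format="turtle"))

Because every annotation carries its generating activity, downstream comparative analyses can filter or weight annotations by the evidence behind them.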

Speaker Jasper Koehorst, Systems and Synthetic Biology, Department of Agrotechnology and Food Sciences, Wageningen University, Netherlands.

Dr Koehorst completed his PhD (with distinction) at Wageningen University in January 2019 with a thesis entitled "A FAIR approach to genomics", on the development and usage of FAIR genomic data in the field of microbial genomics. During his PhD he developed a new undergraduate course aimed at increasing the computer literacy of experimental biologists, bridging the gap between the experimental and computational life sciences and improving cross-talk and interdisciplinary collaboration between the wet and dry lab. Currently he is involved in numerous Linked Data projects in the Life Sciences, as well as standardization and tool development for FAIR-by-design computational predictions.

Afternoon Session

Monday 25 March 2019

Time 14:00 to 17:00

SGW Campus (Downtown): EV 12.163

1515 Ste-Catherine Street West

Talks

14:00 Scalable methods for genomic and epigenomic data analyses

Abstract High-throughput technologies, and in particular next-generation sequencing (NGS), have been revolutionizing biomedical research by enabling the characterization of the genetic and epigenetic components of the molecular processes of the cell at unprecedented resolution. Although these developments promise to have a significant impact on life sciences and health care, an immediate challenge is that the current computing infrastructure and techniques to store, process, analyze and share the vast volumes of data generated by these platforms frequently represent a major bottleneck. In this presentation, we will describe various components of the scalable high-performance computing environment that we have put in place to support the processing of these large datasets. We will also describe some of the software solutions that we have developed to facilitate large-scale data analysis, such as the Genetics and Genomics Analysis Platform (GenAP, www.genap.ca), which includes open-source data analysis pipelines for whole-genome sequencing, exome sequencing, transcriptome sequencing, and metagenomics. We will also present the IHEC Data Portal, which collects data for the International Human Epigenome Consortium (IHEC) and can be used to explore more than 10,000 reference epigenomic maps. Finally, we will describe the EpiShare project, which has recently been selected as a Global Alliance for Genomics and Health Driver Project. The aim of EpiShare is to facilitate international sharing of epigenomic datasets.

Speaker Guillaume Bourque, Department of Human Genetics and McGill University & Genome Quebec Innovation Center, McGill University, Montreal, Canada.

Guillaume Bourque is an Associate Professor in the Department of Human Genetics at McGill University and the Director of Bioinformatics at the McGill University & Genome Quebec Innovation Center. He is a member of the Research Advisory Board of CIHR's Institute of Genetics, of the Research Advisory Council of Compute Canada (the national platform for high-performance computing) and of CANARIE (responsible for Canada's ultra-fast network backbone), and serves on the External Consultant Panel of ENCODE. He leads the Canadian Center for Computational Genomics (C3G), a Genome Canada bioinformatics platform, and the McGill initiative in Computational Medicine (MiCM). He is also the head of the Epigenomics Mapping Center at McGill, a project that oversees data generation and processing as part of the Canadian Epigenetics, Environment and Health Research Consortium (CEEHRC), which is associated with the International Human Epigenome Consortium (IHEC).

15:00 Making tools FAIR for reproducible neuroimaging

Abstract Reproducibility has become an important concern in computational data science. In this talk, we present reproducibility challenges arising from data analysis pipelines; in particular, we show how small changes introduced in the computational infrastructure, analysis software, or data affect the reproducibility of results. We describe preliminary solutions to address these issues using system-level pipeline analysis and bootstrap aggregation of results. We finally present Boutiques, a containerization framework to make data analysis pipelines Findable, Accessible, Interoperable and Reusable. Boutiques describes analysis pipelines with globally persistent records to make them searchable and accessible, and it links them to container images to make them reusable across a variety of computational platforms.
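To give a feel for what such a pipeline record looks like, the sketch below writes out a minimal Boutiques-style descriptor: a JSON document pairing a templated command line with typed inputs, declared outputs, and a container image. This is an illustration under assumptions, not an excerpt from the Boutiques documentation; the tool ("smooth"), container image, and parameters are hypothetical, and field names follow the published descriptor schema to the best of my knowledge.

    # A minimal, hypothetical Boutiques-style descriptor written as JSON.
    # Everything tool-specific here is made up for illustration; validate
    # against the real schema before relying on the field names.
    import json

    descriptor = {
        "name": "smooth",                       # hypothetical example tool
        "description": "Smooths a volume with a Gaussian kernel.",
        "tool-version": "1.0.0",
        "schema-version": "0.5",
        "command-line": "smooth [INPUT_FILE] [FWHM] [OUTPUT_FILE]",
        "container-image": {"type": "docker", "image": "example/smooth:1.0.0"},
        "inputs": [
            {"id": "input_file", "name": "Input volume", "type": "File",
             "value-key": "[INPUT_FILE]"},
            {"id": "fwhm", "name": "Kernel FWHM (mm)", "type": "Number",
             "value-key": "[FWHM]"},
            {"id": "output_file", "name": "Output file name", "type": "String",
             "value-key": "[OUTPUT_FILE]"},
        ],
        "output-files": [
            {"id": "smoothed", "name": "Smoothed volume",
             "path-template": "[OUTPUT_FILE]"},
        ],
    }

    with open("smooth.json", "w") as f:
        json.dump(descriptor, f, indent=2)

Validation and execution then go through Boutiques' bosh command-line tool (e.g. bosh validate smooth.json), which resolves the linked container image at run time; this is what makes the record reusable across platforms.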

Speaker Tristan Glatard, Department of Computer Science & Software Engineering, Concordia University, Montreal, Canada.

Dr Glatard is Assistant Professor in the Department of Computer Science and Software Engineering, and Canada Research Chair (Tier II) in Big Data Infrastructures for Neuroinformatics. He heads the Big Data Infrastructures for Neuroinformatics lab, and is a member of the PERFORM Centre and the Data Science Research Centre at Concordia University. He is also Adjunct Professor at the School of Computer Science at McGill University. Previously, he was a researcher at the French National Center for Scientific Research (CNRS).

His research goal is to build platforms for the efficient and automated processing of Big Data. The main applications of his work are in medical image analysis, in particular neuroimaging.

16:00 The TooT Suite Project: Being FAIR when applying machine learning in bioinformatics

Abstract We have just begun a Genome Canada project, "TooT Suite: Prediction and classification of membrane transport proteins", to annotate the membrane transport proteins both in an organism, be it plant or animal, and in a microbiome, thus providing information about potential interactions between them. The project involves extensive experimentation with machine learning, and is investigating issues of reproducibility of results by being open and FAIR (Findable, Accessible, Interoperable, Reusable) with the tools, their software, and the experimental platform.
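As a hypothetical illustration of the kind of experiment involved (this is not the TooT Suite codebase, whose methods the abstract does not describe), one reproducible baseline is to classify protein sequences from their k-mer composition, with a fixed random seed so the run can be repeated exactly:

    # A toy baseline for transporter-vs-other protein classification:
    # 3-mer counts fed to a logistic regression. The sequences and labels
    # are fabricated placeholders; a real experiment would use curated
    # data (e.g. from UniProt/TCDB) and proper cross-validation.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    sequences = [
        "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",  # toy "transporter"
        "MLLAVLYCLAVFALSSQEVWSGPGAGSLLGAAN",  # toy "transporter"
        "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS",  # toy "other"
        "MAHHHHHHVGTGSNDDDDKSPDPMEEPQSDPSV",  # toy "other"
    ]
    labels = [1, 1, 0, 0]                     # 1 = transporter, 0 = other

    # Character 3-grams approximate amino-acid 3-mer composition.
    model = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(3, 3), lowercase=False),
        LogisticRegression(max_iter=1000, random_state=0),  # fixed seed
    )
    model.fit(sequences, labels)
    print(model.predict(["MKTAYIAKQRLGLIEVQAVFALSSQEVWSGPGA"]))

Pinning seeds, software versions, and data provenance in such experiments is precisely where the FAIR practices discussed in this workshop come into play.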

Speaker Gregory Butler, Department of Computer Science & Software Engineering, Concordia University, Montreal, Canada.

Dr Butler is Professor of Computer Science and Software Engineering at Concordia University, Montreal, Canada. His research focuses on the transformation of data to knowledge, particularly for knowledge-based bioinformatics. He is currently working on distributed computation with large-scale graphs for the reconstruction of networks for metabolism and regulation of microbial communities, and how to construct, mine, and manage the graph of knowledge provided by a provenance network with links into the scientific literature.

Dr Butler obtained his PhD from the University of Sydney in 1980. He was a faculty member at the University of Sydney for nine years, prior to joining Concordia University in 1992.