QUT - Information retrieval and coding methods for large scale bioinformatics

Study levels

PhD
Master of Philosophy
Honours

Faculty/School

Faculty of Science

School of Computer Science

Topic status

We're looking for students to study this topic.

Research centre

Centre for Data Science

Supervisors

Dr Timothy Chappell

Position: Lecturer (TIEA)
Division / Faculty: Faculty of Science

Adjunct Professor Shlomo Geva

Position: Adjunct Professor
Division / Faculty: Faculty of Science

Associate Professor Jim Hogan

Position: Associate Professor
Division / Faculty: Faculty of Science

Professor David Lovell

Position: Professor
Division / Faculty: Faculty of Science

Associate Professor Dimitri Perrin

Position: Associate Professor
Division / Faculty: Faculty of Science

External supervisors

Dr Andrew Trotman, University of Otago

Overview

Advances in sequencing technologies over the past two decades have led to an explosion in the availability of genomic sequence data and an increasingly urgent need for scalable clustering and search facilities. One approach is to encode sequences as binary vectors in a high-dimensional space, simplifying the comparison and allowing it to be computed very rapidly using bit-level operations.
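
As a rough illustration of the idea, the sketch below encodes a sequence as a fixed-width binary signature and compares two signatures with XOR and a bit count. It is a minimal example under stated assumptions, not the encoding used in this project: the function names (`kmers`, `signature`, `hamming`), the choice of k = 5, the 64-bit width and the use of BLAKE2b as the k-mer hash are all hypothetical choices made for the example.

```python
import hashlib


def kmers(seq, k):
    """Return every overlapping substring of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def signature(seq, k=5, bits=64):
    """Encode a sequence as a `bits`-wide binary signature (illustrative scheme).

    Each k-mer is hashed to `bits` pseudo-random bits; every bit position
    collects a +1 or -1 vote per k-mer, and the final signature keeps a 1
    wherever the vote total is positive. `bits` must be a multiple of 8
    and at most 512 (the BLAKE2b digest limit).
    """
    counts = [0] * bits
    for kmer in kmers(seq, k):
        digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    sig = 0
    for b in range(bits):
        if counts[b] > 0:
            sig |= 1 << b
    return sig


def hamming(a, b):
    """Number of differing bits: one XOR followed by a bit count."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    s1 = "ACGTACGTTGCAACGTTAGC"
    s2 = "ACGTACGTTGCAACGTTAGG"  # differs from s1 in the last base only
    print(hamming(signature(s1), signature(s2)))
```

Once the signatures are computed, each comparison is a single XOR plus a bit count, regardless of the length of the original sequences.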

Coupled with these ideas is the need to provide clustering methods and efficient indexing and lookup in response to search queries. One approach to doing this is to use ideas from text-based information retrieval, optimised to work with the distribution of k-mers - words of length k - within the genomic collection.
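
The information retrieval analogy can be sketched as an inverted index that treats k-mers as the "words" of a sequence. This is a toy example under stated assumptions rather than the project's indexing method: the function names, the choice of k = 4 and the ranking by number of shared k-mers are hypothetical, and a realistic system would weight k-mers by how they are distributed across the collection.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return the set of distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(collection, k=4):
    """Map each k-mer to the set of sequence IDs that contain it."""
    index = defaultdict(set)
    for seq_id, seq in collection.items():
        for kmer in kmers(seq, k):
            index[kmer].add(seq_id)
    return index


def search(index, query, k=4, top=3):
    """Rank indexed sequences by the number of distinct k-mers shared with the query."""
    hits = Counter()
    for kmer in kmers(query, k):
        for seq_id in index.get(kmer, ()):
            hits[seq_id] += 1
    return hits.most_common(top)


if __name__ == "__main__":
    collection = {
        "a": "ACGTACGTAC",
        "b": "TTTTGGGGCC",
        "c": "ACGTTTTT",
    }
    index = build_index(collection)
    print(search(index, "ACGTAC"))
```

Only the sequences that share at least one k-mer with the query are touched during lookup, which is what keeps this style of search cheap on large collections.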

Research activities

The work undertaken will depend on the level of the student, but the main activities will include:

indexing large scale sequence collections and experimenting with clustering and search
developing new encodings and analysing the results
developing and extending software tools to make these approaches usable for biologists
benchmarking against other tools and approaches.

Outcomes

We aim to develop new algorithms and tools that make precise search of large scale sequence collections much faster than it currently is. To that end, we are seeking to implement and publish new encodings and new approaches to clustering and search, and to demonstrate through benchmarking that they are faster than existing methods.

Skills and experience

For this project we are looking for students with good programming skills, an ability to work with complex datasets and to understand machine learning algorithms, and a willingness to learn the biology needed to understand the domain. Most of our students have studied or are studying computer science, but we welcome anyone who brings a mix of skills suited to the problem. Those with a joint degree involving molecular biology and computer science are especially welcome, so please get in touch if this sounds like you.

It isn't necessary for you to be an extraordinary software developer, but you need to be comfortable in Python, C#, Java, F# or another modern language. This isn't a project where you can learn to program. We will teach you the biology and the machine learning as the project takes shape.

If you are undertaking this project as an Honours or PhD student then you may be eligible to apply for a scholarship.

Scholarships

You may be eligible to apply for a research scholarship.

Explore our research scholarships

Contact

Contact the supervisors for more information.