Alignment-free subtyping of
HIV sequences
Mukarram Hossain
Department of Veterinary Medicine
University of Cambridge
MEPI group meeting, March 2017
Subtype classification
- Viruses are often grouped into subtypes.
- Subtypes have wide-ranging implications for the following studies of viruses:
- clinical
- epidemiological
- structural
- functional
- Existing classification techniques mostly rely on alignments followed by phylogenetic and/or statistical algorithms.
Alignment uncertainty
Alignment-free classification
- Lossless compression techniques have shown promising results for biological sequence classification:
- Protein family prediction (Begleiter et al., 2004)
- Protein structure prediction (Ferragina et al., 2007)
COMET
- COMET is an ultrafast, alignment-free subtyping tool
- Uses Prediction by Partial Matching (PPM)
- Initially designed for HIV-1
- COMET was tested on both synthetic (1,090,698) and clinical (10,625) HIV datasets
- Sensitivity and specificity were comparable to or higher than:
- REGA (de Oliveira et al., 2005) and
- SCUEAL (Pond et al., 2009)
- Detected and identified new recombinant forms
COMET algorithm
- Builds variable-order Markov models for each reference sequence
- Given a query, COMET calculates the log-likelihood of observing the base at each position
- This results in a matrix of likelihood values (see the sketch below)
- Subtype call is done using a decision tree
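A minimal sketch of this likelihood-matrix idea (not COMET's actual code): a fixed-order Markov model trained on reference sequences stands in here for the variable-order PPM model, and the per-position log-likelihoods it produces are what the decision tree would operate on. The order, smoothing, and toy sequences are illustrative assumptions.

```python
# Sketch only: a fixed-order Markov model stands in for COMET's variable-order PPM.
from collections import defaultdict
from math import log

ORDER = 4  # context length; fixed here for simplicity, adaptive in PPM

def train_model(sequences, order=ORDER):
    """Count (context -> next base) transitions over reference sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts

def log_likelihoods(model, query, order=ORDER):
    """Per-position log P(base | context) for a query, with add-one smoothing."""
    scores = []
    for i in range(order, len(query)):
        ctx, base = query[i - order:i], query[i]
        total = sum(model[ctx].values()) + 4  # +4 from add-one smoothing over A,C,G,T
        scores.append(log((model[ctx][base] + 1) / total))
    return scores

# One row of the likelihood matrix per reference model (toy subtype models here)
models = {"A": train_model(["ACGTACGTACGTACGT"]),
          "B": train_model(["AACCGGTTAACCGGTT"])}
query = "ACGTACGTAC"
matrix = {name: log_likelihoods(m, query) for name, m in models.items()}
print(matrix)
```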
The decision tree
Lossless compression using ANN
- Schmidhuber and Heil (1996) presented 'Sequential neural text compression'
- Can perform similarly to PPM (Mahoney, 2000)
- Upon seeing a stream of input characters, the predictor network:
- assigns a probability distribution over the next character
- These probabilities can be used in the same way as COMET's likelihoods
Classification using Neural Networks
- Neural networks are computational systems loosely mimicking the biological brain
- They consist of a cluster of neural units organised in layers
- Neural units in adjacent layers are fully connected, with a weight attached to each connection
- Each neuron has an activation function and a bias associated with it
ANN: design
- The input layer consists of 32 neurons receiving values from the context
- Each nucleotide is represented by a one-hot vector
- A is [1,0,0,0] while T is [0,0,0,1]
- Hidden layer consists of N neurons
- processes inputs coming from the input layer using weights and biases
- Output layer consists of 4 neurons
- uses the softmax function to generate probabilities for the nucleotide bases A, C, G, T (see the sketch below)
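A sketch of the forward pass under these design choices; the hidden-layer size, sigmoid activation, and random initialisation below are illustrative assumptions rather than the actual implementation, and the 32 inputs are taken to encode a context of 8 previous bases.

```python
# Sketch of the forward pass: 32 inputs (8 context bases x 4 one-hot values),
# N hidden neurons, 4 softmax outputs. Sizes and activations are illustrative.
import numpy as np

BASES = "ACGT"
N_HIDDEN = 30  # the 'N' of the hidden layer; an assumed value

def one_hot(context):
    """Encode 8 context bases as a 32-element vector (A=[1,0,0,0] ... T=[0,0,0,1])."""
    vec = np.zeros(4 * len(context))
    for i, base in enumerate(context):
        vec[4 * i + BASES.index(base)] = 1.0
    return vec

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

np.random.seed(0)
W1, b1 = np.random.randn(N_HIDDEN, 32), np.zeros(N_HIDDEN)  # input  -> hidden
W2, b2 = np.random.randn(4, N_HIDDEN), np.zeros(4)          # hidden -> output

def predict(context):
    """Probabilities of the next base (A, C, G, T) given the 8-base context."""
    hidden = sigmoid(W1 @ one_hot(context) + b1)
    return softmax(W2 @ hidden + b2)

print(dict(zip(BASES, predict("ACGTACGT"))))
```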
ANN: implementation
- Based on the example code from the book 'Neural Networks and Deep Learning' by Michael Nielsen
- Written in Python3
ANN: training
- We use the reference sequence set used in COMET to train the ANN
- Cross-validation is done by randomly removing one sequence from the training set for each subset
- A cross-entropy cost function is used to update the network weights and biases
- Multiple training epochs are run, and the one producing the best validation accuracy is stored (a simplified sketch follows)
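A simplified, self-contained stand-in for this loop (assumptions: a single softmax layer instead of the full network, plain gradient descent, and toy training/validation sequences), showing cross-entropy updates and best-epoch selection:

```python
# Simplified training-loop sketch: cross-entropy gradient updates, then keep the
# parameters from the epoch with the best validation accuracy.
# A single softmax layer stands in for the full network here.
import numpy as np

BASES = "ACGT"
CONTEXT = 8

def one_hot(seq):
    v = np.zeros(4 * len(seq))
    for i, b in enumerate(seq):
        v[4 * i + BASES.index(b)] = 1.0
    return v

def make_examples(seq):
    """(context one-hot, next-base index) pairs from a sequence."""
    return [(one_hot(seq[i - CONTEXT:i]), BASES.index(seq[i]))
            for i in range(CONTEXT, len(seq))]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

train = make_examples("ACGTACGTACGTACGTACGTACGT")  # toy data
valid = make_examples("ACGTACGTACGTACGT")

np.random.seed(0)
W, b = 0.1 * np.random.randn(4, 4 * CONTEXT), np.zeros(4)
best_acc, best_params, lr = -1.0, None, 0.5

for epoch in range(20):
    for x, y in train:
        p = softmax(W @ x + b)
        grad = p.copy(); grad[y] -= 1.0            # d(cross-entropy)/d(logits)
        W -= lr * np.outer(grad, x); b -= lr * grad
    acc = np.mean([np.argmax(softmax(W @ x + b)) == y for x, y in valid])
    if acc > best_acc:                             # store the best epoch
        best_acc, best_params = acc, (W.copy(), b.copy())

print("best validation accuracy:", best_acc)
```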
ANN: subtyping
- For each nucleotide position in the query sequence:
- the ANN for each subtype generates the probability of seeing the nucleotide given the previous context
- COMET's decision tree is then used to predict the subtype of the query sequence (a simplified stand-in is sketched below)
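As a simplified stand-in for that final step (COMET's actual decision tree also handles recombinants and unassigned calls, which this omits), the per-position probabilities from each subtype's network can be combined into summed log-likelihoods and the best-scoring subtype reported:

```python
# Simplified final call: sum per-position log-probabilities per subtype and pick
# the maximum. The real decision tree also handles recombinant/unassigned calls.
from math import log

def call_subtype(prob_matrix):
    """prob_matrix: {subtype: [P(observed base | context) at each query position]}."""
    scores = {s: sum(log(p) for p in probs) for s, probs in prob_matrix.items()}
    return max(scores, key=scores.get), scores

# Toy per-position probabilities, e.g. produced by the per-subtype ANNs
prob_matrix = {
    "B":     [0.70, 0.65, 0.80, 0.75],
    "C":     [0.40, 0.30, 0.55, 0.35],
    "01_AE": [0.50, 0.45, 0.60, 0.50],
}
best, scores = call_subtype(prob_matrix)
print(best, scores)
```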
Reference dataset
Cross-validation
Test datasets
- Reference (215 sequences)
- Vanderbilt (2779 sequences)
- PR-RT (727 sequences)
Accuracy comparison
Future direction
- Optimise neural network parameters
- Use the tanh activation function?
- Larger context size?
- Recurrent Neural Networks (RNN)?
- Report breakpoints for potential novel recombinants
- Implement using TensorFlow
Acknowledgements
- Simon Frost
- Richard Dybowski