Alignment-free subtyping of
HIV sequences
Mukarram Hossain
Department of Veterinary Medicine
University of Cambridge
MEPI group meeting, March 2017
Subtype classification
- Viruses are often grouped into subtypes.
- Subtypes have wide-ranging implications for the following studies of viruses:
- clinical
- epidemiological
- structural
- functional
- Existing classification techniques mostly rely on alignments followed by phylogenetic and/or statistical algorithms.
Alignment uncertainty
Alignment-free classification
- Lossless compression techniques have shown promising results for biological sequence classification:
- Protein family prediction (Begleiter et al., 2004)
- Protein structure prediction (Ferragina et al., 2007)
COMET
- COMET is an ultrafast, alignment-free subtyping tool
- Uses Prediction by Partial Matching (PPM)
- Initially designed for HIV-1
- COMET was tested on both synthetic (1,090,698) and clinical (10,625) HIV datasets
- Sensitivity and specificity were comparable to or higher than:
- REGA (de Oliveira et al., 2005) and
- SCUEAL (Pond et al., 2009)
- Detected and identified new recombinant forms
COMET algorithm
- Builds variable-order Markov models for each reference sequence
- Given a query, COMET calculates the log-likelihood of observing the base at each position
- This results in a matrix of likelihood values (see the sketch below)
- Subtype call is done using a decision tree
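A minimal sketch of this likelihood-matrix idea (not COMET's actual code): a fixed-order Markov model trained on reference sequences stands in here for the variable-order PPM model, and the per-position log-likelihoods it produces are what the decision tree would operate on. The order, smoothing, and toy sequences are illustrative assumptions.

```python
# Sketch only: a fixed-order Markov model stands in for COMET's variable-order PPM.
from collections import defaultdict
from math import log

ORDER = 4  # context length; fixed here for simplicity, adaptive in PPM

def train_model(sequences, order=ORDER):
    """Count (context -> next base) transitions over reference sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts

def log_likelihoods(model, query, order=ORDER):
    """Per-position log P(base | context) for a query, with add-one smoothing."""
    scores = []
    for i in range(order, len(query)):
        ctx, base = query[i - order:i], query[i]
        total = sum(model[ctx].values()) + 4  # +4 from add-one smoothing over A,C,G,T
        scores.append(log((model[ctx][base] + 1) / total))
    return scores

# One row of the likelihood matrix per reference model (toy subtype models here)
models = {"A": train_model(["ACGTACGTACGTACGT"]),
          "B": train_model(["AACCGGTTAACCGGTT"])}
query = "ACGTACGTAC"
matrix = {name: log_likelihoods(m, query) for name, m in models.items()}
print(matrix)
```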
The decision tree
Lossless compression using ANN
- Schmidhuber and Heil (1996) presented 'Sequential neural text compression'
- Can perform similarly to PPM (Mahoney, 2000)
- Upon seeing a stream of input characters, the predictor network:
- assigns a probability distribution over the next character
- These probabilities can be used in the same way as COMET's likelihoods
Classification using Neural Networks
- Neural networks are computational systems loosely mimicking the biological brain
- They consist of a cluster of neural units organised in layers
- Neural units in adjacent layers are fully connected, with a weight attached to each connection
- Each neuron has an activation function and a bias associated with it
ANN: design
- The input layer consists of 32 neurons receiving values from the context
- Each nucleotide is represented by a one-hot vector
- A is [1,0,0,0] while T is [0,0,0,1]
- Hidden layer consists of N neurons
- processes inputs coming from the input layer using weights and biases
- Output layer consists of 4 neurons
- uses the softmax function to generate probabilities for the nucleotide bases A, C, G, T (see the sketch below)
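A sketch of the forward pass under these design choices; the hidden-layer size, sigmoid activation, and random initialisation below are illustrative assumptions rather than the actual implementation, and the 32 inputs are taken to encode a context of 8 previous bases.

```python
# Sketch of the forward pass: 32 inputs (8 context bases x 4 one-hot values),
# N hidden neurons, 4 softmax outputs. Sizes and activations are illustrative.
import numpy as np

BASES = "ACGT"
N_HIDDEN = 30  # the 'N' of the hidden layer; an assumed value

def one_hot(context):
    """Encode 8 context bases as a 32-element vector (A=[1,0,0,0] ... T=[0,0,0,1])."""
    vec = np.zeros(4 * len(context))
    for i, base in enumerate(context):
        vec[4 * i + BASES.index(base)] = 1.0
    return vec

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

np.random.seed(0)
W1, b1 = np.random.randn(N_HIDDEN, 32), np.zeros(N_HIDDEN)  # input  -> hidden
W2, b2 = np.random.randn(4, N_HIDDEN), np.zeros(4)          # hidden -> output

def predict(context):
    """Probabilities of the next base (A, C, G, T) given the 8-base context."""
    hidden = sigmoid(W1 @ one_hot(context) + b1)
    return softmax(W2 @ hidden + b2)

print(dict(zip(BASES, predict("ACGTACGT"))))
```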
ANN: implementation
- Based on the example code from the book 'Neural Networks and Deep Learning' by Michael Nielsen
- Written in Python3
ANN: training
- We use the reference sequence set used in COMET to train the ANN
- Cross-validation is done by randomly removing one sequence from the training set for each subset
- A cross-entropy cost function is used to update the network weights and biases
- Multiple training epochs are run, and the one producing the best validation accuracy is stored (a simplified sketch follows)
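A simplified, self-contained stand-in for this loop (assumptions: a single softmax layer instead of the full network, plain gradient descent, and toy training/validation sequences), showing cross-entropy updates and best-epoch selection:

```python
# Simplified training-loop sketch: cross-entropy gradient updates, then keep the
# parameters from the epoch with the best validation accuracy.
# A single softmax layer stands in for the full network here.
import numpy as np

BASES = "ACGT"
CONTEXT = 8

def one_hot(seq):
    v = np.zeros(4 * len(seq))
    for i, b in enumerate(seq):
        v[4 * i + BASES.index(b)] = 1.0
    return v

def make_examples(seq):
    """(context one-hot, next-base index) pairs from a sequence."""
    return [(one_hot(seq[i - CONTEXT:i]), BASES.index(seq[i]))
            for i in range(CONTEXT, len(seq))]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

train = make_examples("ACGTACGTACGTACGTACGTACGT")  # toy data
valid = make_examples("ACGTACGTACGTACGT")

np.random.seed(0)
W, b = 0.1 * np.random.randn(4, 4 * CONTEXT), np.zeros(4)
best_acc, best_params, lr = -1.0, None, 0.5

for epoch in range(20):
    for x, y in train:
        p = softmax(W @ x + b)
        grad = p.copy(); grad[y] -= 1.0            # d(cross-entropy)/d(logits)
        W -= lr * np.outer(grad, x); b -= lr * grad
    acc = np.mean([np.argmax(softmax(W @ x + b)) == y for x, y in valid])
    if acc > best_acc:                             # store the best epoch
        best_acc, best_params = acc, (W.copy(), b.copy())

print("best validation accuracy:", best_acc)
```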
ANN: subtyping
- For each nucleotide position in the query sequence:
- the ANN for each subtype generates the probability of seeing the nucleotide given the previous context
- COMET's decision tree is then used to predict the subtype of the query sequence (a simplified stand-in is sketched below)
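As a simplified stand-in for that final step (COMET's actual decision tree also handles recombinants and unassigned calls, which this omits), the per-position probabilities from each subtype's network can be combined into summed log-likelihoods and the best-scoring subtype reported:

```python
# Simplified final call: sum per-position log-probabilities per subtype and pick
# the maximum. The real decision tree also handles recombinant/unassigned calls.
from math import log

def call_subtype(prob_matrix):
    """prob_matrix: {subtype: [P(observed base | context) at each query position]}."""
    scores = {s: sum(log(p) for p in probs) for s, probs in prob_matrix.items()}
    return max(scores, key=scores.get), scores

# Toy per-position probabilities, e.g. produced by the per-subtype ANNs
prob_matrix = {
    "B":     [0.70, 0.65, 0.80, 0.75],
    "C":     [0.40, 0.30, 0.55, 0.35],
    "01_AE": [0.50, 0.45, 0.60, 0.50],
}
best, scores = call_subtype(prob_matrix)
print(best, scores)
```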
Reference dataset
Cross-validation
Test datasets
- Reference (215 sequences)
- Vanderbilt (2779 sequences)
- PR-RT (727 sequences)
Accuracy comparison
Future direction
- Optimise neural network parameters
- Use the tanh activation function?
- Larger context size?
- Recurrent Neural Networks (RNN)?
- Report breakpoints for potential novel recombinants
- Implement using TensorFlow
Acknowledgements
- Simon Frost
- Richard Dybowski