Machine Learning Platform for the Classification of Nucleotide Sequences
Université du Québec à Montréal

CASTOR programs manipulate three types of data: Sequence, Label and Classifier. Each program takes one, two or three of these types as input and generates one type as output. For example, the prediction program requires a set of sequences and one classifier and outputs a set of labels, whereas a training program requires a set of sequences and labels and generates a set of classifiers.
Sequence type and format
CASTOR tools use DNA (deoxyribonucleic acid) sequences to build classifiers and predict unknown classes. DNA sequences can be complete genomes (short to medium size), genes or fragments from the same genomic region. The sequences should be in FASTA format. Each sequence must have a unique ID and may have a description.
The sequences should not be aligned. Gap characters (dashes) should be removed.
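The requirements above (FASTA format, unique IDs, no gap dashes) can be checked before submission. The following is a minimal sketch, not the platform's own parser: the function name read_fasta is hypothetical, and it assumes the sequence ID is the first whitespace-delimited token of the header line.

```python
def read_fasta(text):
    """Parse FASTA text into {sequence ID: sequence}, removing gap dashes."""
    records = {}
    seq_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            if not header:
                raise ValueError("FASTA header without an ID")
            # Assumption: the ID is the first token; the rest is the description.
            seq_id = header.split(None, 1)[0]
            if seq_id in records:
                raise ValueError(f"duplicate sequence ID: {seq_id}")
            records[seq_id] = ""
        elif seq_id is None:
            raise ValueError("sequence data found before the first header")
        else:
            # Remove alignment gaps, since sequences should not be aligned.
            records[seq_id] += line.replace("-", "")
    return records


example = ">seq1 first example\nACGT--AC\nGT-A\n>seq2\nTTGA\n"
print(read_fasta(example))  # {'seq1': 'ACGTACGTA', 'seq2': 'TTGA'}
```

A duplicate ID raises a ValueError, so the problem surfaces before the data is uploaded.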
Procedure to import sequences
FASTA-formatted sequences can be entered directly into the text area. For large datasets, it is advisable to upload a file containing the sequences in FASTA format using the upload button. Uploads are limited to 200 MB of data.
Labels (Annotations/Classes)
Label type and format
Labels are the target features that the classifiers or models attempt to predict. In the training step, each reference sequence must be annotated with exactly one label. Labels can be taxonomic categories (e.g., genus level) or extrinsic traits (e.g., geographic location).
Submitted label data must be formatted as follows: each line contains one sequence ID and label pair, separated by tabs \t+, spaces \s+, a comma , or a semicolon ;. Neither sequence IDs nor labels may contain these separators. Lines beginning with # are ignored.
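The label format can be checked with a short script. This is a minimal sketch under stated assumptions: the function name parse_labels is hypothetical, exactly one separator is expected between ID and label, and a repeated sequence ID is treated as an error because each sequence carries one label.

```python
import re

# Separators named in the format description: tabs or spaces, comma, semicolon.
SEPARATOR = re.compile(r"\s+|[,;]")


def parse_labels(text):
    """Parse label lines into {sequence ID: label}, skipping '#' comment lines."""
    labels = {}
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = SEPARATOR.split(line)
        # Exactly two non-empty fields: IDs and labels cannot contain separators.
        if len(fields) != 2 or not all(fields):
            raise ValueError(f"line {number}: expected one ID and one label")
        seq_id, label = fields
        if seq_id in labels:
            raise ValueError(f"line {number}: duplicate sequence ID {seq_id}")
        labels[seq_id] = label
    return labels


sample = "# comment\nseq1\tA\nseq2 B\nseq3;C\nseq4,D\n"
print(parse_labels(sample))
```

A line such as "seq1, A" (comma followed by a space) is rejected by this sketch, since it would contain two separators.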
Example figures: label data with tab, space, semicolon and comma separators.
Procedure to import labels
Formatted label data can be entered directly into the text area. For large datasets, it is advisable to upload a file in the same format using the upload button. Uploads are limited to 200 MB of data.
Classifier ID
A classifier is a model trained and built with one of the training tools. Each classifier has a unique identifier (not to be confused with the job ID). The classifier ID can be found in the Classifier viewer.
Personal classifier IDs have a prefix (MD, RD, EX or BM) and end with the name of the machine learning algorithm used, e.g., MD00EXAMPLE1_SVM.
Shared classifier IDs in CASTOR-database are seven characters long and begin with the PM prefix, e.g., PM01PW5.
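The two ID conventions can be told apart with a pair of patterns. This is only a sketch inferred from the two examples above: the character set (uppercase letters and digits) and the underscore before the algorithm name are assumptions, and the function name classifier_id_kind is hypothetical.

```python
import re

# Assumption: personal IDs look like <prefix><alphanumerics>_<algorithm name>.
PERSONAL_ID = re.compile(r"(MD|RD|EX|BM)[A-Z0-9]+_[A-Za-z0-9]+")
# Assumption: shared IDs are PM followed by five uppercase letters or digits.
SHARED_ID = re.compile(r"PM[A-Z0-9]{5}")


def classifier_id_kind(classifier_id):
    """Return 'personal', 'shared' or 'unknown' for a classifier ID string."""
    if SHARED_ID.fullmatch(classifier_id):
        return "shared"
    if PERSONAL_ID.fullmatch(classifier_id):
        return "personal"
    return "unknown"


print(classifier_id_kind("MD00EXAMPLE1_SVM"))  # personal
print(classifier_id_kind("PM01PW5"))           # shared
```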
Classifier file
Classifier files are a way to persist classification models. Once a classifier has been built by one of the training tools, the user can download the classifier file using the download button in the Classifier viewer. It is a compressed archive (.tar.gz) containing several files, among them the trained model file and a metadata JSON file.
Users can also upload a previously built classifier file (.tar.gz) from their local machine in the Classifier viewer.
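A downloaded classifier file can be inspected locally with the Python standard library. The sketch below assumes only what is stated above (a .tar.gz archive that contains a JSON metadata file); the member file names and the metadata fields are not documented here, so the function simply returns the first member whose name ends with .json. The function name read_classifier_metadata is hypothetical.

```python
import json
import tarfile


def read_classifier_metadata(path):
    """Return (member name, parsed JSON) for the first .json file in the archive."""
    with tarfile.open(path, "r:gz") as archive:
        for member in archive.getmembers():
            # Assumption: the metadata file is the first regular file ending in .json.
            if member.isfile() and member.name.endswith(".json"):
                with archive.extractfile(member) as handle:
                    return member.name, json.load(handle)
    raise ValueError("no JSON metadata file found in the archive")


# Usage (the path is a placeholder for a file downloaded from the Classifier viewer):
# name, metadata = read_classifier_metadata("my_classifier.tar.gz")
# print(name, sorted(metadata))
```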
Procedure to load classifiers with Classifier viewer
The Classifier viewer can load a classifier:
  • From a personal job folder: enter the classifier ID in the input area and press the adjacent button or the Enter key
  • From CASTOR-database: press the corresponding button to select a classifier from the database
  • From a local file: upload a classifier file (created with the CASTOR platform) and press the corresponding button
The CASTOR platform offers different applications to predict new sequence labels and train new classifiers.
This is the principal application, which allows a user to annotate viral sequences according to a chosen classifier. It also serves as an evaluation module for classifiers, using labeled test sets. The results are presented with rich graphics and performance measures.
Procedure to classify sequences
Input: Classifier and Sequences
Output: Labels
Select and load a suitable classifier for the classification task (see the procedure to load classifiers), then import sequences in FASTA format (see the procedure to import sequences). The DNA sequences should not be labeled. Finally, press the button that launches the classification.
Select Evaluation mode to test a classifier with a set of labeled sequences. Labels should be embedded into the description of the sequences (>IDSeq label).
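For Evaluation mode, the labels embedded in the FASTA headers can be extracted and compared with predictions. This is a minimal sketch: the function names are hypothetical, the label is assumed to be everything after the ID on the header line, and plain accuracy is used as a stand-in metric, since the platform's own performance measures are not detailed here.

```python
def embedded_labels(fasta_text):
    """Return {sequence ID: label} from headers of the form '>IDSeq label'."""
    labels = {}
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            if len(parts) != 2:
                raise ValueError(f"header without a label: {line!r}")
            seq_id, label = parts
            labels[seq_id] = label.strip()
    return labels


def accuracy(true_labels, predicted_labels):
    """Fraction of sequences whose predicted label equals the embedded label."""
    if not true_labels:
        raise ValueError("no labeled sequences")
    correct = sum(
        1 for seq_id, label in true_labels.items()
        if predicted_labels.get(seq_id) == label
    )
    return correct / len(true_labels)


test_set = ">s1 A\nACGT\n>s2 B\nTTGA\n"
truth = embedded_labels(test_set)
print(truth)                                    # {'s1': 'A', 's2': 'B'}
print(accuracy(truth, {"s1": "A", "s2": "A"}))  # 0.5
```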
This program allows a user to create and train new classifiers from a set of labeled DNA sequences. It provides default parameters as well as advanced options that let the user customize the classifier parameters. It can also be used to update the parameters or input sequences of an already built classifier. The constructed classifiers can be saved locally as an exportable file or published to the community via CASTOR-database.
Procedure to build a classifier from new data
Input: Sequences and Labels
Output: Classifiers
Procedure to build a classifier from an existing one
Input: Classifier
Output: Classifiers
This program constructs improved classifiers. Unlike CASTOR-build, which lets the user define the metrics, algorithms and feature selection models, it assesses all combinations of the classification parameters and returns the best-fitting classifier for the input data.
Procedure to build an improved classifier from new data
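Assessing all combinations of parameters amounts to an exhaustive search over the Cartesian product of the options. The sketch below illustrates that general idea only; it is not the platform's implementation. The option names (algorithm, feature_count) and the toy scoring function are hypothetical placeholders, and in practice the score would come from training and evaluating a classifier.

```python
from itertools import product


def best_combination(options, score):
    """Try every combination of option values and return (best params, best score)."""
    names = list(options)
    best_params, best_score = None, float("-inf")
    for values in product(*(options[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Hypothetical option grid and toy score, for illustration only.
options = {"algorithm": ["SVM", "RandomForest"], "feature_count": [10, 50, 100]}


def toy_score(params):
    base = {"SVM": 0.9, "RandomForest": 0.8}[params["algorithm"]]
    return base - abs(params["feature_count"] - 50) * 0.001


print(best_combination(options, toy_score))
```

The number of combinations grows multiplicatively with each added option, which is the main cost of an exhaustive search of this kind.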
Input: Sequences and Labels
Output: Classifiers
Procedure to improve an already built classifier
Input: Classifier
Output: Classifiers
CASTOR-database is a public database of classifiers that allows the community to share expertise and models. It facilitates experiment reproducibility and model refinement. A search engine and a classifier properties viewer are also implemented. From the CASTOR-database interface, users can download, reuse, update and comment on published classifiers.