A deep learning method for miRNA/isomiR target detection

Identification of miRNA–mRNA and isomiR–mRNA interactions in CLASH

To train and test DMISO, we obtained miRNA–mRNA and isomiR–mRNA interactions from the CLASH experiments18. We downloaded the CLASH data for the HEK293 cell line (GSE50452), which contained raw reads from six human samples. Each sample consisted of both single and chimeric reads. Only the chimeric reads contained both the miRNA/isomiR sequences and the sequences of their interacting target sites in mRNAs. Here and henceforth, “miRNAs” refers to the traditional miRNAs in miRBase25, and “miRNAs/isomiRs” refers to the miRNAs together with their isomiRs.

We identified miRNA–mRNA interactions from the chimeric reads in a manner similar to the original study18 (Fig. 1A and Supplementary Table S1). In brief, we downloaded the raw reads, removed adapters, discarded duplicate reads, and mapped the remaining reads against two databases separately with BLAST version 2.10.1+26. One database contained the protein-coding transcript sequences from GENCODE version 3827. The other contained the human mature miRNA sequences from miRBase version 22.125. We required a BLAST hit with an e-value ≤ 0.1 to consider a read mapped. Reads that mapped to the antisense strand of a transcript or that had alignment loops were discarded. The miRNA portion of a chimeric read is much shorter than the mRNA portion and is therefore harder to distinguish from sequencing errors. To control its mapping quality, we required that the same miRNA portion occurred in at least 10 chimeric reads. We chose 10 because the chance of observing the same isomiR at least 10 times as a result of sequencing errors was about 8.87E−08, given an average miRNA length of 22 nt, an Illumina sequencer error rate of 0.001, and fewer than 1000 reads mapped to a miRNA for most (> 95%) miRNAs. Moreover, this choice left enough data for training the deep learning models. We allowed a maximum gap or distance of 4 nt between the mapped miRNA portion and the mapped mRNA portion in a chimeric read, as in the previous study18. The miRNA and mRNA portions of a chimeric read may map to multiple miRNAs and mRNA transcripts, respectively. In that case, to retain the most significant miRNA–mRNA pair, we applied the following criteria in order: (1) the pair with the smallest BLAST e-values; and (2) if the e-values were the same, the pair with the largest BLAST bit scores.
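The two-step tie-breaking rule above can be illustrated with a minimal sketch. The `CandidatePair` record and its field names are hypothetical, and the text does not say how the miRNA and mRNA e-values are combined, so the lexicographic ordering used here is an assumption rather than the authors' procedure.

```python
from typing import List, NamedTuple, Optional


class CandidatePair(NamedTuple):
    """Hypothetical record for one miRNA/mRNA mapping of a chimeric read."""
    mirna_id: str
    transcript_id: str
    mirna_evalue: float
    mrna_evalue: float
    mirna_bitscore: float
    mrna_bitscore: float


def best_pair(candidates: List[CandidatePair],
              max_evalue: float = 0.1) -> Optional[CandidatePair]:
    """Return the most significant pair, or None if no hit passes the cutoff."""
    # Keep only pairs whose two BLAST hits both satisfy e-value <= 0.1.
    kept = [c for c in candidates
            if c.mirna_evalue <= max_evalue and c.mrna_evalue <= max_evalue]
    if not kept:
        return None
    # Assumed ordering: smaller e-values first, then larger bit scores.
    return min(kept, key=lambda c: (c.mirna_evalue, c.mrna_evalue,
                                    -c.mirna_bitscore, -c.mrna_bitscore))
```

With equal e-values, the candidate with the higher bit score is returned; a read whose hits all exceed the e-value cutoff yields `None`.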

Figure 1. (A) The pipeline to obtain miRNA/isomiR–mRNA interactions. (B) The DMISO model structure.

With the identified miRNA–mRNA candidate pairs, we compared the aligned miRNA portion of each chimeric read with the corresponding miRNA to define miRNA–mRNA and isomiR–mRNA interactions. If the read perfectly matched the miRNA, we called the candidate pair a miRNA–mRNA pair. Otherwise, if the sequencing quality scores of the nucleotides at the variant positions (relative to the miRNA sequence) were larger than 30, we called the candidate pair an isomiR–mRNA pair. To select isomiRs with confidence, we also required that each isomiR sequence was seen in at least 10 chimeric reads. We further classified the isomiRs in the isomiR–mRNA pairs into the following eight types: 5′ isomiR (addition, deletion and replacement), 3′ isomiR (addition, deletion and replacement), single nucleotide polymorphic (SNP) isomiR, and multiple nucleotide polymorphic (MNP) isomiR. An isomiR may belong to multiple types.
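The decision rule for labeling a candidate pair can be sketched as follows. This is only an illustration: the tuple layout is hypothetical, the variant positions are assumed to have been found already by the alignment step, and the eight-type classification is not attempted.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

# Hypothetical layout: (miRNA portion of the read, reference miRNA sequence,
# Phred quality scores at the positions that differ from the reference).
Candidate = Tuple[str, str, Sequence[int]]


def label_candidates(cands: List[Candidate],
                     min_quality: int = 30,
                     min_reads: int = 10) -> List[Optional[str]]:
    """Label each candidate as 'miRNA', 'isomiR', or None (discarded)."""
    # Count how many chimeric reads carry each miRNA-portion sequence.
    support = Counter(read_seq for read_seq, _, _ in cands)
    labels: List[Optional[str]] = []
    for read_seq, ref_seq, variant_quals in cands:
        if support[read_seq] < min_reads:
            labels.append(None)          # fewer than 10 supporting reads
        elif read_seq == ref_seq:
            labels.append("miRNA")       # perfect match to the miRNA
        elif all(q > min_quality for q in variant_quals):
            labels.append("isomiR")      # confident variant positions
        else:
            labels.append(None)          # low-quality variant positions
    return labels
```

Note that a pure deletion has no variant positions in the read, so in this sketch it passes the quality check by default; how the original pipeline handles that case is not stated.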

Training data and cross-validation

For the obtained miRNA/isomiR–mRNA pairs, we extended the 3′ end of the mRNA section of the chimeric sequences by 25 nt to obtain more complete target sites. Extended mRNA target sites shorter than 30 nt were filtered out, as in the previous study18. The extended pairs were considered positive interaction pairs. For every positive pair, a negative pair was generated from the same miRNA or isomiR and a negative site in the 3′ untranslated region of the corresponding positive mRNA transcript, as described previously28,29. The negative site was required to be at least 10 nt away from the positive sites and to have a folding free energy < 10 kcal/mol, as measured by the RNAcofold tool30. We created our training dataset by randomly choosing 80% of the positive and negative interactions. We tested DMISO on the training data using tenfold cross-validation. We also tested DMISO on the remaining 20% of the interactions, which were not used for training.
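A minimal sketch of the 80/20 split and the tenfold partition of the training portion is given below. The function name, the fixed seed, and the round-robin fold assignment are assumptions for illustration; the text does not describe how the folds were drawn.

```python
import random
from typing import List, Tuple


def split_and_folds(n_pairs: int,
                    train_fraction: float = 0.8,
                    n_folds: int = 10,
                    seed: int = 0) -> Tuple[List[int], List[int], List[List[int]]]:
    """Return (train indices, held-out test indices, cross-validation folds)."""
    indices = list(range(n_pairs))
    random.Random(seed).shuffle(indices)      # random choice of interactions
    n_train = int(round(train_fraction * n_pairs))
    train, test = indices[:n_train], indices[n_train:]
    # Deal the training indices into n_folds disjoint validation folds.
    folds = [train[i::n_folds] for i in range(n_folds)]
    return train, test, folds
```

In each cross-validation round, one fold would serve as the validation set and the other nine as the training set, while the held-out 20% is never seen during training.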

Independent data

In addition to the remaining 20% CLASH test data, we extracted miRNA/isomiR–mRNA pairs from CLEAR-CLIP data as independent test data. Following the same procedure as for the CLASH data, we analyzed the CLEAR-CLIP chimeric reads in 12 human samples from the hepatocyte-derived carcinoma cell line HuH-7.5 (GSE73059)19. We defined 14,684 positive miRNA/isomiR–mRNA pairs, all of which involved isomiRs rather than the conventional miRNAs.

We also obtained another independent dataset from the recently updated miRTarBase database release 8.031. This database contains experimentally validated functional and non-functional miRNA target sites, all of which were considered positives in this dataset. We extended the 3′ end of the mRNAs in the interactions and discarded the interactions whose mRNA or miRNA IDs could not be matched in the respective databases used here, as well as those whose mRNA sequences were shorter than 30 nt. After this filtering, we obtained 14,144 miRNA–mRNA sequence pairs, 13,926 of which were functional and 226 of which were non-functional interactions based on the original study31. This dataset did not have any negatives.

Deep learning model

We designed a deep learning method called DMISO to identify miRNA/isomiR target sites and target mRNAs. DMISO takes miRNAs/isomiRs and their corresponding mRNA target site sequences as input and outputs a binary number indicating whether a miRNA/isomiR interacts with its corresponding mRNA site. The architecture of DMISO is composed of two separate branches containing convolutional neural network (CNN) layers, a long short-term memory (LSTM) layer, and a fully connected neural network layer (Fig. 1B). The two convolutional layers process the miRNA/isomiR and target site sequences, respectively. The LSTM layer combines the features detected by the two convolutional layers. The output of the LSTM layer is fed into a fully connected neural network to predict the label of the interaction.

The convolutional layer in each branch is 1-dimensional and consists of an array of 10 kernels, each of size 4 × 8. The kernels act as sliding windows that scan the input sequences to capture spatial features. The convolutional layer uses no padding around the input (padding = “valid”), and the kernels move across the input one position at a time (stride = 1). After convolution with the 10 kernels, the outputs of the two convolutional layers are matrices of size 10 × 23 and 10 × 53, respectively. The next layer in each branch is a 1-dimensional max-pooling layer with a pooling size of 4, which takes the maximum values within each 10 × 4 window, sliding by one position (stride = 1) across the output of the respective convolutional layer. The outputs of the max-pooling layers in the miRNA/isomiR and target site branches are 10 × 20 and 10 × 50 matrices, respectively. The rectified linear unit (ReLU) is the activation function for the neurons in the convolutional layers of the two branches and in the dense layer. After the max-pooling step, the outputs of the two branches are merged into a 10 × 70 matrix and fed into a bidirectional LSTM (BLSTM) layer. The BLSTM layer processes the spatially connected features both from left to right and from right to left, generating a 20 × 70 matrix, which is then flattened to a vector of length 1400 and fed into a dense layer. The dense layer is a fully connected neural network with 100 neurons, which outputs a vector of size 100. This vector is the input to a logistic regression unit, which uses the sigmoid function to generate the final prediction.
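The layer sizes quoted above follow from the standard output-length formula for unpadded ("valid") sliding windows, floor((n − w) / s) + 1. The sketch below reproduces that arithmetic for the input lengths of 30 and 60 stated in the text. The function names are illustrative, and the figure of 10 LSTM units per direction is inferred from the 20 × 70 BLSTM output rather than stated by the authors.

```python
def valid_output_length(n_in: int, window: int, stride: int = 1) -> int:
    """Output length of an unpadded sliding window over n_in positions."""
    return (n_in - window) // stride + 1


def dmiso_shapes(mirna_len: int = 30, site_len: int = 60, n_kernels: int = 10,
                 kernel_width: int = 8, pool: int = 4, lstm_units: int = 10) -> dict:
    """Trace matrix sizes (channels x positions) through the described layers."""
    conv = [valid_output_length(n, kernel_width) for n in (mirna_len, site_len)]
    pooled = [valid_output_length(n, pool) for n in conv]
    merged_len = sum(pooled)            # the two branches are joined end to end
    blstm_features = 2 * lstm_units     # forward plus backward direction
    return {
        "conv": [(n_kernels, n) for n in conv],
        "pool": [(n_kernels, n) for n in pooled],
        "merged": (n_kernels, merged_len),
        "blstm": (blstm_features, merged_len),
        "flat": blstm_features * merged_len,
    }
```

Calling `dmiso_shapes()` with the defaults gives 23 and 53 positions after convolution, 20 and 50 after pooling, a merged length of 70, and a flattened vector of 1400 values, matching the sizes in the text.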

Before training DMISO, the miRNA/isomiR and target site sequences are converted into 4 × 30 and 4 × 60 matrices, respectively, by applying one-hot encoding to every nucleotide in the sequences. That is, ‘A’, ‘T’, ‘C’, ‘G’ and ‘N’ are encoded as [1, 0, 0, 0]ᵀ, [0, 1, 0, 0]ᵀ, [0, 0, 1, 0]ᵀ, [0, 0, 0, 1]ᵀ and [0.25, 0.25, 0.25, 0.25]ᵀ, respectively. The fixed lengths 30 and 60 are the average lengths of the processed miRNAs/isomiRs and target sites in chimeric reads, respectively. To enforce the fixed lengths, we removed the extra nt from the ends of longer sequences and appended “N”s to the ends of shorter sequences.
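The encoding step can be sketched in plain Python as follows. The function names are illustrative. Trimming and padding at the right-hand end, converting ‘U’ to ‘T’, and treating any unrecognized character as ‘N’ are assumptions that the text does not spell out.

```python
from typing import List

# Column vectors for each nucleotide, in the order given in the text.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "T": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "G": [0, 0, 0, 1],
    "N": [0.25, 0.25, 0.25, 0.25],
}


def fix_length(seq: str, length: int) -> str:
    """Trim or pad a sequence with 'N' so it has exactly `length` characters."""
    seq = seq.upper().replace("U", "T")   # assumption: RNA input is mapped to T
    return seq[:length].ljust(length, "N")


def encode(seq: str, length: int) -> List[List[float]]:
    """Return a 4 x length one-hot matrix (rows ordered A, T, C, G)."""
    columns = [ONE_HOT.get(nt, ONE_HOT["N"]) for nt in fix_length(seq, length)]
    # Transpose the list of column vectors into four rows.
    return [[col[row] for col in columns] for row in range(4)]
```

For example, `encode(mirna, 30)` and `encode(site, 60)` would produce the 4 × 30 and 4 × 60 inputs described above.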

Batch normalization was used to train DMISO with mini-batches of 100 samples at a time. We calculated the loss of each prediction using the binary cross-entropy loss function, which was minimized by the Adam optimizer with a learning rate of 0.00132. To avoid overfitting, we placed a dropout layer with a 25% dropout rate after merging the two branches and two dropout layers with a 50% dropout rate after the BLSTM layer and the dense layer. L1 regularization with a parameter value of 0.01 was applied to the two convolutional layers and the dense layer to further reduce overfitting. The deep learning model was implemented with Keras version 2.3.1 (https://github.com/keras-team/keras/releases/tag/2.3.1). The DMISO model takes two inputs: a miRNA/isomiR sequence and an mRNA sequence. The model outputs a probability score between 0 and 1 and a binary prediction of 0 or 1.
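For reference, the binary cross-entropy loss and the conversion of a probability score into a binary call can be written out in a few lines. This is a plain-Python illustration, not the Keras implementation; the clipping constant and the 0.5 decision threshold are assumptions, since the text does not state the cutoff used for the binary prediction.

```python
import math
from typing import List, Sequence


def binary_cross_entropy(y_true: Sequence[int], y_prob: Sequence[float],
                         eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over a mini-batch of predictions."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def to_binary(y_prob: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Convert probability scores to 0/1 calls (threshold is an assumption)."""
    return [1 if p >= threshold else 0 for p in y_prob]
```

A confident correct prediction contributes a loss near zero, whereas a confident wrong prediction is penalized heavily, which is what the optimizer minimizes during training.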

Feature identification

Many machine learning methods have been developed to select features33,34,35,36,37,38,39,40,41,42,43. Deep learning models are often regarded as black boxes when it comes to understanding the underlying features. However, recent studies have focused on strategies that can reveal the features or patterns learned by different types of machine learning models39,40,41,42,43. Here, two of the most popular feature identification methods, convolutional kernel analysis and input perturbation, were applied to discover important features for miRNA/isomiR–mRNA interactions24.

The convolutional kernel analysis method is suitable for a deep learning model that contains a convolutional layer24,39,40,41. This method interprets the kernel weights of the convolutional layer after the model has been trained. In this study, the miRNA/isomiR and mRNA sequences were scanned separately by the kernels of length k in the two convolutional layers of DMISO, which captured the k-mer compositions that were important for the interaction between the miRNA/isomiR and mRNA sequences. Since the convolutional layers are the first layers in DMISO, the captured k-mer patterns should represent important features specific to the miRNA/isomiR and target sequences.

The input perturbation technique is another popular feature interpretation method24,39,41, in which a part of the input is perturbed with random noise and the changes in the model prediction are recorded. The change in the model prediction after a part of the input is modified represents the sensitivity of the model to that part of the input. This method can therefore help reveal how sensitive the model is to different regions of the input sequences. Here, we masked every contiguous region of length 4 in the input sequences with “N” and recorded the corresponding changes in the prediction probabilities of the output layer. The changes should indicate the regions that are important for target binding.
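The masking procedure can be sketched as a sliding window over one of the two input sequences. The `predict` callable stands in for a trained model and is hypothetical, as are the function names; `toy_predict` is a stand-in used only to make the sketch runnable and has no biological meaning.

```python
from typing import Callable, List


def window_sensitivity(mirna: str, site: str,
                       predict: Callable[[str, str], float],
                       window: int = 4, target: str = "site") -> List[float]:
    """Drop in predicted probability when each length-4 window is masked."""
    baseline = predict(mirna, site)
    seq = site if target == "site" else mirna
    deltas = []
    for start in range(len(seq) - window + 1):
        # Replace one contiguous window with 'N' and re-score the pair.
        masked = seq[:start] + "N" * window + seq[start + window:]
        if target == "site":
            score = predict(mirna, masked)
        else:
            score = predict(masked, site)
        deltas.append(baseline - score)
    return deltas


def toy_predict(mirna: str, site: str) -> float:
    """Toy stand-in for a model: scores 1.0 only if the site contains ACGT."""
    return 1.0 if "ACGT" in site else 0.0
```

Positions where masking causes a large drop mark regions the model relies on; with a real model, `predict` would wrap the encoding step and the network's forward pass.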

Comparison with existing tools

DMISO was compared with three popular tools, TargetScan version 7.244, miRanda 3.3a45 and RNA22 version 246, and two recently published deep learning-based tools, miRAW47 and miTAR48, on the 20% CLASH test data, the CLEAR-CLIP data, and the miRTarBase data. TargetScan and miRanda take two separate files for the miRNA and mRNA sequences as inputs, while miRAW and miTAR take the interactions (miRNA–mRNA sequence pairs) as inputs. For isomiR–mRNA pairs in the testing data, we used the isomiR sequences in place of the miRNA sequences in the input. To run RNA22, the input miRNA/isomiR and mRNA sequences must be uploaded to the RNA22 server, which imposes daily traffic restrictions. Obtaining results from the RNA22 server on a large dataset like ours is therefore time-consuming. For this reason, while the other four tools were executed on the test datasets, RNA22 was evaluated by overlapping the test datasets with the pre-computed RNA22 predictions for human (https://cm.jefferson.edu/rna22-full-sets-of-predictions/). A test interaction was considered a predicted positive for a tool if its miRNA ID and mRNA gene ID matched those of any predicted interaction and its mRNA target sequence location overlapped the corresponding predicted target site.
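The matching criterion for a predicted positive can be sketched as an ID comparison followed by an interval-overlap check. The `Site` record, the inclusive coordinates, and the minimum overlap of a single nucleotide are assumptions; the text does not specify how much overlap was required.

```python
from typing import Iterable, NamedTuple


class Site(NamedTuple):
    """Hypothetical record for a target site on an mRNA (inclusive coordinates)."""
    mirna_id: str
    gene_id: str
    start: int
    end: int


def overlaps(a: Site, b: Site) -> bool:
    """True if the two intervals share at least one position."""
    return a.start <= b.end and b.start <= a.end


def is_predicted_positive(test: Site, predictions: Iterable[Site]) -> bool:
    """A test interaction counts as predicted if IDs match and sites overlap."""
    return any(p.mirna_id == test.mirna_id
               and p.gene_id == test.gene_id
               and overlaps(test, p)
               for p in predictions)
```

Applying the same criterion to every tool keeps the comparison consistent, including for RNA22, whose predictions were taken from the pre-computed sets rather than generated on the test data.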
