


\section{Introduction: Sahil Thapa}
I am Sahil Thapa, a PhD student in Computer Science. My advisor, Dr. Oluwatosin Oluwadare, guides my research, and I also serve as a graduate research assistant in his bioinformatics lab. My research focuses on using deep learning and artificial intelligence to develop bioinformatics tools, with a current emphasis on splice site detection methods.\\

From a young age, I have enjoyed walking and traveling, a passion instilled in me by my maternal uncle. He often took me to rivers and hills, where we would sit and talk, fostering my love for trekking. I have explored many beautiful places in Nepal, including the Annapurna Mountain Range and the Langtang Circuit. My favorite films are ``The Legend of 1900'' and ``Forrest Gump.''

\begin{figure}[h!]
    \centering
    \includegraphics[width=0.5\linewidth]{images/Sahil_Thapa.jpg}
    \caption{Sahil Thapa}
    \label{fig:enter-label}
\end{figure}

\section{Output of CNNSplice: Robust models for splice site prediction using convolutional neural networks}
I explored the following Git repository related to my research: https://github.com/OluwadareLab/CNNSplice.git\\

CNNSplice is a set of deep convolutional neural network models designed for splice site prediction. Developed at the University of Colorado Colorado Springs, CNNSplice aims to efficiently predict true and false splice sites using robust machine learning techniques.

\subsection{test\_logfile\_metrics}
These are the Output\_logfile results of CNNModel, one entry per logged evaluation:\\
\begin{verbatim}
{'precision': 0.9027834069851564, 'recall': 0.9206222222222222,
 'f1': 0.9111579557303049, 'class_accuracy': 0.9318666666666666,
 'accuracy': 0.9318666458129883}

{'precision': 0.9273229649052265, 'recall': 0.9525333333333332,
 'f1': 0.9389124679859917, 'class_accuracy': 0.9528,
 'accuracy': 0.9527999758720398}

{'precision': 0.9229379328722263, 'recall': 0.9288888888888889,
 'f1': 0.925857755161124, 'class_accuracy': 0.944,
 'accuracy': 0.9440000057220459}

{'precision': 0.9280761354666826, 'recall': 0.9274666666666667,
 'f1': 0.927770831864634, 'class_accuracy': 0.9458666666666666,
 'accuracy': 0.9458666443824768}

{'precision': 0.9037206096479873, 'recall': 0.9141333333333334,
 'f1': 0.908744230763471, 'class_accuracy': 0.9306666666666666,
 'accuracy': 0.9306666851043701}
\end{verbatim}

% \begin{figure}[h!]
%     \centering
%     \includegraphics[width=0.5\linewidth]{LogFile.png}
%     \caption{LogFile}
%     \label{fig:enter-label}
% \end{figure}


\section{Questions for me}

\subsection{Questions from Aaron McKay}
1. Looking at your CNNSplice test results, the precision and recall metrics are consistently high (above 90\%). In your research with Dr. Oluwadare, how do you balance the trade-off between false positives and false negatives when detecting splice sites, considering the potential biological implications?

Balancing the trade-off between false positives and false negatives in splice site detection is crucial, especially given the biological implications of misclassifications. Here are some strategies we considered in our research with Dr. Oluwadare:
\begin{enumerate}
    \item \textbf{Threshold Adjustment:} By adjusting the decision threshold of our model, we can trade recall (sensitivity) against precision. A higher threshold tends to reduce false positives but increase false negatives, and vice versa. We often analyze the precision-recall curve to find an optimal balance for our specific application.
    \item \textbf{Biological Context:} Incorporating biological knowledge can help refine predictions. For example, prioritizing splice sites that are more likely to be functionally relevant based on known splice variant patterns can help mitigate the impact of false positives.
    \item \textbf{Cross-Validation:} Using $k$-fold cross-validation lets us estimate how well our model generalizes across different data splits, helping us identify potential biases that could lead to either type of error.
    \item \textbf{Ensemble Methods:} Combining multiple models can enhance performance by capturing different aspects of the data, which can reduce both false positives and false negatives.
    \item \textbf{Post-Processing:} Implementing rules or filters based on biological annotations can help eliminate unlikely splice site predictions after the initial model output.
    \item \textbf{User Feedback:} Collaborating with biologists to review and validate predicted splice sites can provide insights into which errors are more acceptable in practical applications.
\end{enumerate}
By carefully considering these factors, we can optimize our approach to detect splice sites with a balance that aligns with the biological significance of our findings.
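
As a brief sketch of the threshold trade-off described above (the symbols $t$, $\mathrm{TP}$, $\mathrm{FP}$, and $\mathrm{FN}$ are notation introduced here for illustration, not taken from the CNNSplice code): suppose a candidate site is called a true splice site when the model's predicted probability is at least a threshold $t$. Writing $\mathrm{TP}(t)$, $\mathrm{FP}(t)$, and $\mathrm{FN}(t)$ for the resulting counts of true positives, false positives, and false negatives, the standard definitions are
\[
\mathrm{Precision}(t) = \frac{\mathrm{TP}(t)}{\mathrm{TP}(t) + \mathrm{FP}(t)}, \qquad
\mathrm{Recall}(t) = \frac{\mathrm{TP}(t)}{\mathrm{TP}(t) + \mathrm{FN}(t)},
\]
\[
F_1(t) = \frac{2\,\mathrm{Precision}(t)\,\mathrm{Recall}(t)}{\mathrm{Precision}(t) + \mathrm{Recall}(t)}.
\]
Raising $t$ can only remove positive calls, so recall never increases, while precision typically (though not always) improves. Sweeping $t$ over its range traces out the precision-recall curve, from which an operating point can be chosen according to which type of error is more costly for the application.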

2. Given your work on CNNSplice and your focus on deep learning in bioinformatics, what motivated your choice of convolutional neural networks over other deep learning architectures (like transformers or RNNs) for splice site detection?

Our choice of convolutional neural networks for splice site detection was motivated by several factors:
\begin{enumerate}
    \item \textbf{Spatial Hierarchies:} CNNs excel at capturing local patterns and hierarchical features, which is particularly useful for sequence data like DNA. The convolutional layers can effectively identify motifs and features relevant to splice sites, such as specific nucleotide combinations.
    \item \textbf{Reduced Complexity:} Compared to recurrent neural networks (RNNs) and transformers, CNNs typically require fewer computational resources for training and inference. This efficiency is important when working with large genomic datasets.
    \item \textbf{Translation Invariance:} Relevant sequence features can occur at different positions within the input. Because convolutional filters share their weights across positions, and pooling adds a degree of translation invariance, CNNs can detect a motif wherever it appears, which makes them well-suited for this task.
    \item \textbf{Robustness to Noise:} CNNs tend to tolerate noise and small variations in input data, which is essential in bioinformatics, where sequencing errors and biological variability can introduce inconsistencies.
    \item \textbf{Empirical Success:} Prior studies and benchmarks showed that CNNs performed well in similar tasks, such as motif discovery and sequence classification, which encouraged us to adopt this architecture for splice site detection.
\end{enumerate}
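
To make the weight-sharing argument above concrete, here is a minimal sketch of a single one-dimensional convolutional filter. It assumes a one-hot encoding of the input and a single filter; the symbols $x$, $W$, $b$, $k$, $L$, and $\sigma$ are illustrative notation, not taken from the CNNSplice implementation. Let $x$ be an $L \times 4$ binary matrix encoding a sequence of length $L$ over the nucleotides A, C, G, and T, and let $W$ be a real-valued $k \times 4$ filter of width $k$ (rows indexed from $0$) with bias $b$ and activation function $\sigma$. The filter response at position $i$ is
\[
h_i = \sigma\Bigl( b + \sum_{j=0}^{k-1} \sum_{c=1}^{4} W_{j,c}\, x_{i+j,\,c} \Bigr), \qquad i = 1, \dots, L-k+1.
\]
The same weights $W$ are applied at every position $i$, so a motif lying fully inside the input window produces the same response wherever it occurs, only at a shifted index. Taking $\max_i h_i$ (global max pooling) then gives a score that does not depend on where in the window the motif sits.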

While we remain open to exploring other architectures like transformers or RNNs in future work, CNNs provided a strong foundation for our initial investigations into splice site prediction.

\subsection{Questions from Raja Kantheti}
1. How do you manage to incorporate biological knowledge into your deep learning models for splice site detection?

2. What are some challenges so far in your research on splice site detection, and how do you plan to address them?