Since this blog was last updated, the semester has ended and my peers and I are now on Winter Break. After the Thanksgiving Break we continued to work on the research during our last week of classes, and then took a break for Finals Week and the beginning of Winter Break. In the new year I will pick the project up from where I left off, and next semester work will continue in earnest. During this past semester this work was completed as part of Dr. Papamichail's research class. During the last week of class we were required to write a final report on what we had achieved during the semester. The report I wrote can be found after the break.
Over the past week I have been working on making random synonymous changes within the sequence to see whether they have any impact on the folding energy of the RNA. I have not yet reached the stage of testing this at least a hundred times, as I am still trying to get the synonymous changes to work correctly. However, I am close and should hopefully get there in the next week. My classmate Joie has some of Dr. Papamichail's code that might help me, so we'll have a look at what can be used. Also, JR is looking into a similar problem, so we will be working together to find a solution.
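To make the idea of a synonymous change concrete, here is a minimal sketch in Python (the project itself is written in Perl, and this is not the project's code). It assumes a DNA coding sequence whose length is a multiple of three and uses the standard genetic code. The function names `translate` and `random_synonymous_changes` are hypothetical, chosen only for illustration.

```python
import random

# Standard genetic code, with codons ordered by the bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Group codons by the amino acid (or stop signal) they encode.
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(seq):
    """Translate a coding sequence (length divisible by 3) into amino acids."""
    return "".join(CODON_TO_AA[seq[i:i + 3]] for i in range(0, len(seq), 3))


def random_synonymous_changes(seq, n_changes, rng=random):
    """Return a copy of seq with up to n_changes codons swapped for synonyms."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Codons for Met (ATG) and Trp (TGG) have no synonyms, so skip them.
    candidates = [
        i for i, c in enumerate(codons) if len(SYNONYMS[CODON_TO_AA[c]]) > 1
    ]
    for i in rng.sample(candidates, min(n_changes, len(candidates))):
        alternatives = [c for c in SYNONYMS[CODON_TO_AA[codons[i]]] if c != codons[i]]
        codons[i] = rng.choice(alternatives)
    return "".join(codons)
```

Because each swapped codon encodes the same amino acid as the one it replaces, `translate` gives the same protein before and after the changes. Only the nucleotide sequence, and therefore possibly the folding energy, is altered.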
This week I worked on the first stages of estimating folding energy for the RNA sequence. What I did was find the frequencies of the different k-mers in both hashes and use them to calculate a final score for the sequence as a whole. This involved looping through the hashes and using the Hamming distance function I wrote to determine whether the k-mers being compared were the same.
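The scoring loop described above might look roughly like the following Python sketch. Two things here are assumptions rather than the project's actual method: that the two hashes hold the k-mer counts of the sequence and of its reverse complement, and that matching k-mers contribute the product of their counts to the score. The name `frequency_score` is also hypothetical.

```python
from collections import Counter


def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def reverse_complement(seq):
    """Reverse complement of a DNA sequence over the alphabet A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_kmers(seq, k):
    """Map each substring of length k to the number of times it occurs."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def frequency_score(seq, k):
    """Assumed scoring rule: sum count products over identical k-mers."""
    forward = count_kmers(seq, k)
    reverse = count_kmers(reverse_complement(seq), k)
    score = 0
    for kmer_f, count_f in forward.items():
        for kmer_r, count_r in reverse.items():
            # A distance of zero means the two k-mers are the same.
            if hamming_distance(kmer_f, kmer_r) == 0:
                score += count_f * count_r
    return score
```

Comparing every pair of k-mers is quadratic in the number of distinct k-mers. A direct hash lookup would be faster for exact matches, but the pairwise loop makes it easy to relax the test later, for example to a distance of at most one.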
Over the next week I will begin making random synonymous changes with the aim of increasing the frequency score I have already calculated. I will then use RFold to compare the folding energy of the sequence before and after the mutations. I will be tracking whether there is a correlation between changing the frequency score and changing the folding energy in a given direction.
Since I last updated this blog I have successfully written the Hamming distance function for my code, although I continue to experiment with it to see if there is more than one way to do it. I looked through BioPerl to see if there was anything useful there, but I wasn't able to find anything as far as Hamming distance is concerned. However, I'm sure BioPerl will be useful at some point in the future. Additionally, I have downloaded and experimented with RFold, as I need it for the next section of my research.
On Thursday I began working on estimating the folding energy of an RNA molecule. As with the Hamming distance, I am using substrings of length k, or kmers, of the sequence to do this. Dr. Papamichail says he is unsure of where this line of inquiry will lead, as he was unable to find any other papers about experiments using a similar approach. It will be interesting to see where this leads. Over the past few days I have been laying the groundwork for this part, which for now will focus on decreasing the folding energy of the RNA molecule. Once I have successfully found a way to estimate the folding energy, I will make some synonymous changes to the part of the RNA molecule I'm working with and use RFold to see whether they decrease the folding energy. I will run this experiment a large number of times (100 at minimum) and record my results.
This week I spent a lot of time exploring BioPerl, hoping to find something that might help me calculate Hamming distance and anything else I will need in the future. To begin with I installed CPAN and BioPerl. Then I read through more of the BioPerl documentation. So far I have not been able to find anything obviously useful, but I am still searching. There was a mention of Hamming distance in Bio::PhyloNetwork, and my goal for this coming week is to see whether or not it is applicable. In addition to experimenting with BioPerl, I continued to work on my own code to calculate Hamming distance and got some interesting results, although they were not accurate 100% of the time.
This week I continued working on creating a function to find the distance between two kmers, specifically the Hamming distance. I experimented with a few different methods, and although I have not found an optimal solution, I feel that I am getting closer. I also worked on installing BioPerl and exploring its features. Hopefully I will find some functions that will assist me in reaching my objective.
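For reference, the Hamming distance between two kmers of the same length is the number of positions at which they differ. One possible way to write it is sketched below in Python (the project's own function is in Perl and may be structured differently). The name `hamming_distance` is just an illustrative choice.

```python
def hamming_distance(kmer_a, kmer_b):
    """Count the positions at which two equal-length kmers differ."""
    if len(kmer_a) != len(kmer_b):
        raise ValueError("Hamming distance needs kmers of equal length")
    # zip pairs up the characters position by position.
    return sum(a != b for a, b in zip(kmer_a, kmer_b))
```

For example, ACGTA and TCGTC differ only in their first and last positions, so their distance is 2, while two identical kmers have a distance of 0.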
For this next week I will continue developing a way to calculate the distance between two kmers and continue exploring BioPerl.
This week I started working on this project's second objective. While still in the early stages, I did begin work on a program that would address the goals of this objective. As I mentioned last week, I had to construct a simple Perl program that read a FASTA file and found substrings of length five located in the genetic sequence. This week I began adapting that program so that it could begin to meet the needs outlined in the second objective. First I parameterized the k, meaning that instead of finding substrings of length five the program now finds substrings of length k, also known as kmers. Additionally, I modified the program so that it finds the genetic sequence's reverse complement. The kmers of both sequences are stored in hashes. I also began to explore ways to compare the kmers found in the genetic sequence and a way to score them based on distance. Right now I'm working on just a general distance function that will be changed as the project progresses.
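The two changes described above, a parameterized k and the reverse complement, can be sketched in Python as follows (the actual program is in Perl, where the dictionaries below would be hashes). The function names and the short example sequence are made up for illustration.

```python
from collections import Counter

# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_kmers(seq, k):
    """Map each kmer (substring of length k) to its number of occurrences."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


sequence = "ATGCGTAC"  # made-up example sequence
k = 3
forward_kmers = count_kmers(sequence, k)
reverse_kmers = count_kmers(reverse_complement(sequence), k)
```

Keeping k as a function argument means the same code covers the original length-five case and any other kmer length without changes.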
My objective for next week is to finalize kmer comparison if possible, or at least find a solution that will work for the early stages of the project. I also intend to explore BioPerl to see what parts of it will be useful.
Over the past couple of weeks I have been learning about the topics that will be covered in this research project. One of the main things I learned is Perl. I had to construct a simple program that read a FASTA file and found substrings of length five located in the genetic sequence. The program had to print the substrings and how many times each of them occurred. Additionally, it had to count and print how many unique substrings there were in the genetic sequence. To test the program I downloaded several viral genetic sequences from the National Center for Biotechnology Information (NCBI) and had the program read them. This project was a good introduction to Perl's syntax and semantics because my peers and I had to learn Perl while completing the project.
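A rough Python equivalent of that first exercise is sketched below (the original was written in Perl). It assumes a FASTA file containing a single record, and it reads from an in-memory string holding a made-up toy record instead of a downloaded NCBI file. The function names are hypothetical.

```python
import io
from collections import Counter


def read_fasta_sequence(handle):
    """Join the sequence lines of a single-record FASTA file, skipping the header."""
    return "".join(
        line.strip() for line in handle if not line.startswith(">")
    ).upper()


def count_substrings(seq, length=5):
    """Map each substring of the given length to its number of occurrences."""
    return Counter(seq[i:i + length] for i in range(len(seq) - length + 1))


# Made-up toy record standing in for a real viral genome file.
fasta_text = ">example made-up record\nACGTACGTAC\nGTACG\n"
sequence = read_fasta_sequence(io.StringIO(fasta_text))
counts = count_substrings(sequence)

# Print each substring with its count, then the number of unique substrings.
for substring, count in sorted(counts.items()):
    print(substring, count)
print("unique substrings:", len(counts))
```

To run this on a real file, the `io.StringIO(fasta_text)` argument would be replaced with an opened file handle.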
I also had to read some papers and abstracts to learn more about RNA folding. My peers and I had to read “How do RNA folding algorithms work?” by Sean Eddy, which was published in Nature Biotechnology in 2004. It gave an overview of how RNA folds and the challenges presented when trying to predict how it will fold. Predicting RNA folding evidently takes a lot of computational power. I also had to read a paper written by a previous student who had conducted some research similar to ours. When the group meets again tomorrow we will discuss what we have read and have any questions answered.