Naya26 - Bioinformatics (Level 1) Pathway

Module 1

Technical Area:

  • RStudio: Understanding the fundamentals of R code and getting familiar with the RStudio interface
  • Learning how to read research papers efficiently
  • Understanding more about the bioinformatics field, such as methods and tools used

Tools:

  • RStudio and the R installer
    • ggplot2
    • Bioconductor (specifically EnhancedVolcano)
  • Resources:
    • Official ggplot2 website
    • YouTube
    • Stack Overflow
    • RStudio Community

Soft Skills:

  • Time management: Pacing myself between reading the given material and practicing the tasks in RStudio
  • Utilizing resources: Researching and resolving my own problems with the resources at hand:
    • Official ggplot2 website
    • YouTube
    • Stack Overflow
    • RStudio Community
  • Navigation: Understanding how to use the STEM-Away website
  • Communication: Communicating with the leads/founder to clarify any doubts
  • Determination: At first I thought I wouldn’t be able to understand R, but after going through the material provided, I understood it well enough to solve problems with it, and I enjoyed writing code very much

Three Achievement Highlights:

  • Read through the materials given and practiced in RStudio itself, which allowed me to grasp the fundamentals of R and understand its interface
  • Errors: Experimented with the provided list of errors and tasks, and fixed the errors I personally encountered when executing code
  • Learned the basics of reading scientific papers and gained a better understanding of the bioinformatics field

Tasks Completed:

I was confused about the platform at first, but I learned to navigate the STEM-Away website, made use of the resources given to us, and completed all the tasks for this module. I read the given material and learned the fundamentals of R: an intro to R (downloading RStudio and the R installer and understanding the interface), syntax and data structures, functions and arguments, data wrangling, and visualization (creating basic scatterplots, bar plots, histograms, box plots, and volcano plots, and changing their appearance, such as labels, fonts, colors, and themes). Whenever I ran into problems while executing code or wanted more clarification on certain commands, I used resources available on the internet, such as Stack Overflow, RStudio Community, the official ggplot2 website, and YouTube. I also familiarized myself with the different types of errors (syntax, semantic, and logical errors), and I was able to fix the errors I personally encountered when executing code based on what I had learned. Outside of R, I learned efficient ways to read a research paper and to condense its information into a summary that covers all the important points. I also learned more about the bioinformatics field and the resources and methods it relies on (e.g., microarrays, RNA sequencing, FastQC, GEO2R, and R).
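As a small illustration of the visualization work described above, here is a minimal sketch of a scatterplot with custom labels, colors, and a theme. It uses the mtcars data set that ships with R rather than any data from the module, and the object name p is only for illustration.

```r
library(ggplot2)

# mtcars is built into R, so nothing has to be downloaded.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 2) +
  labs(
    title = "Fuel efficiency vs. weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

# Typing `p` at the console (or calling print(p)) draws the plot.
```

Swapping geom_point() for geom_boxplot(), geom_histogram(), or geom_bar() gives the other basic plot types with the same labs() and theme() calls.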

Module 2

Technical Area:

  • Understood the GEO Database platform and how it works; downloaded the necessary data
  • Developed a better understanding of what Bioconductor is and the packages it provides
  • Learned how to import data into R and create metadata

Tools:

  • GEO Database
    • GSE19804 samples
  • Express Zip
  • RStudio
  • Google Slides (presentation for my team)
  • Other resources:
    • Presentations and videos on the STEM-Away platform

Soft Skills:

  • Determination and Utilizing Resources: This module definitely felt more engaging than Module 1, which was learning the basics and implementing them. Since I am still new to coding, I thought I wouldn’t be able to continue further. But in the end, thanks to the resources on the STEM-Away platform, I was able to push myself through and finish the tasks
  • Communication: Contacted leads whenever I had a problem
  • Teamwork + Presentation Skills: I worked with a group member on a presentation regarding transcriptomics and the two methods that are used (Microarray Analysis and RNA Sequencing)

Three Achievement Highlights:

  • Was able to download the .CEL files from the GSE19804 dataset
  • Imported the .CEL files using the ReadAffy() command
  • Obtained the metadata by downloading the series matrix text file, then created a data frame containing the relevant information from it (sample names and type of tissue)

Tasks Completed:

Using the resources given, I learned what GEO is and how its records are organized (original submitter-supplied records vs. curated records). I got an error when installing simpleaffy and affyQCReport, so I contacted Anya, who explained that it was caused by having R version 4.0+. I therefore installed R version 3.5.0 (since that was the minimum R version RStudio could run on that still supported Bioconductor and the packages I hadn’t been able to install) and was then able to install all the packages: affy, affyPLM, simpleaffy, arrayQualityMetrics, affyQCReport, sva, GEOquery, and pheatmap. I was originally confused about how to load the data into RStudio and obtain the metadata, but I checked a previous meeting video in which Anya walked through the process step by step, and then I understood it. I downloaded the .CEL files for the GSE19804 dataset and imported them into RStudio with the ReadAffy() command. Afterwards, I downloaded the series matrix text file and imported it into RStudio with the getGEO() command to obtain the metadata. Since the metadata should be simple, I created a data frame limited to the columns with the important information, sample name and kind of tissue, since the rest of the information was the same for every sample. I also looked over the attached presentation, which gave a more detailed description of Bioconductor and the tools it offers and gave me a stronger understanding of it. Separately from the module, this week I worked with my partner to create a presentation on transcriptomics and the methods of microarray analysis and RNA sequencing.
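The import and metadata steps described above can be sketched roughly as follows. This is an outline under assumptions, not the exact code from the module: the function name load_gse and its arguments cel_dir, matrix_file, and tissue_column are hypothetical, and the default column name "source_name_ch1" is only a guess at where the tissue type is stored in the series matrix.

```r
# Sketch only: `cel_dir`, `matrix_file`, and `tissue_column` are hypothetical
# arguments; point them at wherever the downloaded files were saved.
load_gse <- function(cel_dir, matrix_file, tissue_column = "source_name_ch1") {
  # Read every .CEL file in the folder into a single AffyBatch object.
  raw <- affy::ReadAffy(celfile.path = cel_dir)

  # With `filename` set, getGEO() parses the local series matrix file.
  gse <- GEOquery::getGEO(filename = matrix_file)
  pheno <- Biobase::pData(gse)

  # Keep only the columns that differ between samples.
  # ASSUMPTION: the tissue type lives in `tissue_column`; check
  # colnames(pheno) to find the right column for a given data set.
  metadata <- data.frame(
    sample = rownames(pheno),
    tissue = pheno[[tissue_column]],
    stringsAsFactors = FALSE
  )

  list(raw = raw, metadata = metadata)
}
```

Calling the function requires the affy, GEOquery, and Biobase packages from Bioconductor plus the downloaded files, so nothing is executed here beyond defining it.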

Module 3

Technical Area:

  • Ran various types of quality control methods on the GSE19804 data; gained a better understanding of the purpose of these analyses and how outliers are detected
  • Background corrected and normalized the GSE19804 data using gcrma()
  • Created several visualizations of the GSE19804 data, including: a boxplot, PCA plot, and a heatmap
  • Submitted deliverables through GitHub

Tools:

  • RStudio
  • GitHub
  • Stack Overflow
  • Other resources:
    • Guided PDF instructions and videos on the STEM-Away platform

Soft Skills:

  • Time management: Efficiently pacing myself so I can understand the tasks for each module while finishing it on time
  • Organization: I started to organize my code by using comments and organize scripts by having one for each module
  • Communication: Approached the leads whenever I encountered a problem

Three Achievement Highlights:

  • Successfully generated reports on the GSE19804 data using all four QC packages: simpleaffy, arrayQualityMetrics, affyQCReport, and affyPLM
  • Ran background correction and normalized the GSE19804 data; assigned the result to an object and exported it as a .csv file
  • Created a boxplot and heatmap of the normalized GSE19804 data, and created PCA plots of both the raw and normalized GSE19804 data

Tasks Completed:

Through this module I learned about quality control, data normalization, and data visualization. I already had the affy, affyPLM, simpleaffy, arrayQualityMetrics, affyQCReport, and pheatmap packages from module 2 and the data from the GSE19804 dataset, so I simply loaded the libraries and the object into the current session. Using simpleaffy, arrayQualityMetrics, affyQCReport, and affyPLM, I ran quality control checks on the data and viewed the results through various visuals. These gave me a faint idea of what the possible outliers were, and the picture became clearer once I normalized the data using the gcrma() method and created data visualizations: box plots, PCA plots (one each for the raw and normalized data), and heatmaps. I originally had an error when making a PCA plot, so I contacted Anya, who helped me understand what it meant, and I was able to resolve it afterwards. I was also curious about the difference between using “@” and “$” when accessing sub-levels, and about the purpose of accessing the “rotation” part of the PCA result for the GSE19804 data. Anya clarified that “@” accesses slots of S4 objects, while “$” accesses named elements of lists and data frames, and that the “rotation” part is important because it is a matrix of eigenvectors. Later on, I faced a problem when creating the heatmap, so I referred back to Anya’s video where she went through module 3. I saved my data as a .rds file and used the exprs() function when saving my normalized data as a .csv, which got rid of the error. Once I finished all the tasks, I submitted my code and outputs (the necessary deliverables) to GitHub.
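To make the PCA and accessor discussion above concrete, here is a minimal base R sketch. It runs on a random toy matrix instead of the GSE19804 data, and the class ToyBatch is a made-up S4 class used only to show the "@" operator. In R, "@" reads slots of S4 objects and "$" reads named elements of lists and data frames.

```r
set.seed(42)

# Toy expression matrix: 100 probes (rows) x 6 samples (columns).
expr <- matrix(
  rnorm(600), nrow = 100,
  dimnames = list(paste0("probe", 1:100), paste0("sample", 1:6))
)

# prcomp() treats rows as observations, so transpose to put samples in rows.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# prcomp() returns a list-like S3 object, so its parts are reached with `$`.
loadings <- pca$rotation  # eigenvectors: one row per probe, one column per PC
scores <- pca$x           # coordinates of each sample on each PC

# Share of the total variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

plot(scores[, 1], scores[, 2], xlab = "PC1", ylab = "PC2")

# S4 objects store their parts in slots, which are reached with `@`.
# ToyBatch is a hypothetical class defined here only for the demonstration.
setClass("ToyBatch", representation(values = "numeric"))
toy <- new("ToyBatch", values = c(1, 2, 3))
toy@values
```

If prcomp() is instead run on the untransposed probes-by-samples matrix, the per-sample values end up in pca$rotation rather than pca$x, which may be why the rotation element comes up when plotting samples.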

Module 4

Technical Area:

  • Downloaded and/or loaded libraries of the following packages: ggplot2, pheatmap, EnhancedVolcano, hgu133plus2.db, limma, WGCNA, magrittr, dplyr
  • Removed sample outliers through the IAC method
  • Annotated the probe IDs with their respective gene SYMBOLS
  • Removed duplicate probe IDs, duplicate symbols, and NA values
  • Filtered genes so that only those above the 2nd percentile remain
  • Used limma to perform a DGE analysis on the data
  • Data Visualizations: Created a volcano plot and a heat map for the top 50 DEG
  • Committed deliverables to GitHub
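As a rough illustration of IAC (inter-array correlation) outlier removal, here is a base R sketch on simulated data. The cutoff of 2 standard deviations below the mean is an assumption, since the threshold actually used is not stated here, and the sample names and the planted outlier are invented for the demonstration.

```r
set.seed(1)

# Simulated data: 10 samples sharing one signal, plus small noise.
signal <- rnorm(200)
expr <- sapply(1:10, function(i) signal + rnorm(200, sd = 0.1))
colnames(expr) <- paste0("sample", 1:10)

# Plant one outlier: a sample that does not follow the shared signal.
expr[, "sample10"] <- rnorm(200)

# Inter-array correlation: correlate every sample with every other sample.
iac <- cor(expr, use = "pairwise.complete.obs")

# Average correlation of each sample with all samples.
mean_iac <- apply(iac, 2, mean)

# Express each sample's mean IAC as standard deviations from the overall mean.
z <- (mean_iac - mean(mean_iac)) / sd(mean_iac)

# ASSUMPTION: flag samples more than 2 SD below the mean as outliers.
outliers <- names(z)[z < -2]
expr_clean <- expr[, !colnames(expr) %in% outliers, drop = FALSE]
```

In practice the step is often repeated after removing the flagged samples, since one extreme sample can mask milder ones.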

Tools:

  • RStudio
  • GitHub
  • Google
    • Stack Overflow
    • YouTube
    • RDocumentation
    • rdrr.io
    • Data Novia
    • ProgrammingR
    • Listen Data
    • Horvath Lab UCLA
  • Other resources:
    • Guided PDF instructions and videos on the STEM-Away platform

Soft Skills:

  • Perseverance/GRIT: I felt like by far this was the hardest module, since it seemed like more self-learning rather than step-by-step instructions. This made it hard for me as this is my first time coding, but I was able to push through and finish in the end.
  • Self-research/answering: Whenever I was confused on how to execute code, what it meant, and/or received errors, I researched and resolved the problems myself. In doing so, I discovered various wonderful resources I might refer back to in the future.
  • Time management: Trying to efficiently execute tasks while learning at the same time

Three Achievement Highlights:

  • Successfully identified and removed 7 outliers using the IAC method
  • Created a data frame of probe IDs and their associated gene SYMBOLS and removed duplicate probe IDs, duplicate symbols, and NA values; afterwards merged it by row names with the normalized, outlier-free data
  • Filtered genes such that only ones above the 2nd percentile remain, then analyzed it using limma and created a volcano plot and heatmap from the top 50 DEG.
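The percentile filter can be sketched in base R as below. The sketch assumes the filter is applied to each gene's mean expression across samples, which is one common choice and not necessarily the one used in the module; the matrix is random toy data.

```r
set.seed(7)

# Toy expression matrix: 1000 genes (rows) x 4 samples (columns).
expr <- matrix(
  rexp(1000 * 4), nrow = 1000,
  dimnames = list(paste0("GENE", 1:1000), paste0("sample", 1:4))
)

# ASSUMPTION: rank genes by their mean expression across all samples.
gene_means <- rowMeans(expr)

# Expression level below which the lowest 2% of genes fall.
cutoff <- quantile(gene_means, probs = 0.02)

# Keep only genes whose mean expression is above that cutoff.
expr_filtered <- expr[gene_means > cutoff, , drop = FALSE]
```

Dropping the lowest-expressed genes removes probes that are mostly background noise before the statistical test is run.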

Tasks Completed:

This module felt like the hardest one by far, but I was able to push through. Using the IAC method, I identified 7 outlier samples and removed them (GSM494571.CEL.gz, GSM494572.CEL.gz, GSM494582.CEL.gz, GSM494591.CEL.gz, GSM494596.CEL.gz, GSM494654.CEL.gz, and GSM494657.CEL.gz). Afterwards, I created a data frame that annotates each probe ID with its gene SYMBOL using the “hgu133plus2.db” database, and I removed duplicate probe IDs, duplicate symbols, and NA values. I set the row names of the data frame to the probe IDs, which allowed me to merge it by row names with my normalized, outlier-free data. I then set the row names of the newly merged data to the gene SYMBOLS. Next I filtered the genes so that only those above the 2nd percentile remained, and I performed limma analysis on the result to fit a linear model. Finally, I extracted the top DEG with an adjusted p-value below 0.05, from which I created a volcano plot, and retrieved the top 50 DEG to create a heat map.
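The limma step described above can be sketched as follows on simulated data. The group labels, the toy gene names, and the artificial shift added to ten genes are all made up for the example; only the 0.05 adjusted p-value cutoff and the top 50 selection come from the text.

```r
library(limma)

set.seed(3)

# Toy design: 4 normal and 4 cancer samples (labels are illustrative).
group <- factor(rep(c("normal", "cancer"), each = 4),
                levels = c("normal", "cancer"))

# Toy expression matrix: 500 genes x 8 samples.
expr <- matrix(
  rnorm(500 * 8), nrow = 500,
  dimnames = list(paste0("GENE", 1:500), paste0("sample", 1:8))
)

# Shift the first 10 toy genes upward in the cancer group.
expr[1:10, group == "cancer"] <- expr[1:10, group == "cancer"] + 3

# One column per group, then a cancer-vs-normal contrast.
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
contrast <- makeContrasts(cancer - normal, levels = design)

# Fit the linear model and apply empirical Bayes moderation.
fit <- lmFit(expr, design)
fit <- eBayes(contrasts.fit(fit, contrast))

# Full results table with Benjamini-Hochberg adjusted p-values.
results <- topTable(fit, number = Inf, adjust.method = "BH")

# Differentially expressed genes, then the first 50 rows of that table.
deg <- results[results$adj.P.Val < 0.05, ]
top50 <- head(deg, 50)
```

The results table (with its logFC and adj.P.Val columns) is the kind of input a volcano plot needs, and the rows of expr matching rownames(top50) can be passed to pheatmap() for the heat map.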

Module 5

Technical Area:

  • Downloaded and loaded libraries of the following packages: org.Hs.eg.db, clusterProfiler, enrichplot, msigdb, magrittr, tidyr, and ggnewscale
  • Created DEG vector with log fold change of upregulated genes and assigned genes with a gene ID
  • Detected enriched pathways through gene ontology and KEGG analysis
  • Identified enriched genes and pathways using KEGG and created a gene concept network for them
  • Used GSEA to see overrepresented gene sets
  • Looked into other regulatory factors that can affect these processes
  • Submitted deliverables to GitHub

Tools:

Soft Skills:

  • Time management: Knowing my tight upcoming deadline, I tried to execute my tasks and solve problems as fast as I could and I paced myself
  • Self Problem-Solving: I faced several errors once again, and for some there wasn’t much advice on the internet, so before I contacted the leads I solved all my errors by myself
  • Organization: Organized all my deliverables into proper folders and committed them to GitHub

Three Achievement Highlights:

  • Created gene ontology plots for CC, BP, and MF for up-regulated DEG
  • Made gene concept networks of KEGG pathways and transcription factors, plus 1 dot plot for KEGG pathways
  • Successfully created the GSEA plot after resolving an error by reformatting the data frame

Tasks Completed:

I started off by installing and loading the following necessary packages: org.Hs.eg.db, clusterProfiler, enrichplot, msigdb, magrittr, tidyr, and ggnewscale. I created a DEG vector by extracting the log fold change from my top table and kept only the upregulated genes by setting my threshold to 1.5. After I sorted the vector in descending order, I named the values by their gene SYMBOL and kept their associated gene ID in another column. To find the enriched pathways associated with the up-regulated genes, I used gene ontology, producing plots for cellular components, biological processes, and molecular functions, and also KEGG analysis, in which further pathways were identified through a dot plot. In addition, I created a gene concept network for the KEGG analysis, which draws connections between genes and the pathways they belong to. Afterwards, I performed GSEA (using a vector of all genes, not just the upregulated ones as before) to identify gene sets that are overrepresented toward the top or bottom of the ranked list, and produced a graph of the result. Toward the end, I performed transcription factor analysis and created another gene concept network, which allowed transcription factors to be assessed so that the regulation of genes could be understood. Finally, I retrieved a list of the gene SYMBOLS and saved it in a .txt file for the next module.
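Building the ranked gene vectors described above can be sketched in base R like this. The toy top table, its gene names, and its values are invented for illustration, and the sketch assumes that the 1.5 threshold applies to the log fold change.

```r
# Toy stand-in for a limma top table; symbols and values are made up.
top_table <- data.frame(
  SYMBOL = c("GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"),
  logFC = c(2.4, -1.9, 1.6, 0.3, 3.1),
  stringsAsFactors = FALSE
)

# Named numeric vector of all genes, sorted from highest to lowest logFC.
# GSEA-style functions generally expect this kind of ranked input.
ranked_all <- setNames(top_table$logFC, top_table$SYMBOL)
ranked_all <- sort(ranked_all, decreasing = TRUE)

# ASSUMPTION: "upregulated" means a log fold change above 1.5.
up <- ranked_all[ranked_all > 1.5]

# The names are the gene symbols; the values are the fold changes.
names(up)
```

The names of the upregulated vector are what over-representation tools such as gene ontology or KEGG enrichment take as input, while the full ranked vector is what GSEA uses.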

Module 6

Technical Area:

  • Used RStudio to retrieve gene SYMBOLS
  • Explored web-based functional analysis tools
    • Enrichr
    • STRING
  • Submitted my code in GitHub

Tools:

Soft Skills:

  • Self-solving: Noticed an issue with the data I wrote from the vector to the text file, so I fixed it myself
  • Organization: I created a folder for my outputs and properly uploaded it (in addition to the other folders) to GitHub

Three Achievement Highlights:

  • Successfully exported the text file so it included gene SYMBOLs only (originally it had log fold change numbers instead)
  • Retrieved sets of enriched biological identifiers through Enrichr
  • Created a map of protein-protein relationships for the genes entered

Tasks Completed:

Of all the modules, this was the easiest and most straightforward. I had already performed a bit of functional analysis in module 5 by using GSEA. In this module, I was introduced to web-based functional analysis tools. As required, I chose to explore one tool from each group: Enrichr from Group A and STRING from Group B. To do that, I needed a text file with all the gene SYMBOLs. I accidentally wrote out the log fold change numbers at first, so I corrected my mistake and obtained the gene SYMBOLs. I then pasted them into the text boxes on both sites and ran the searches. On Enrichr, the results were many sets of enriched biological annotations, while on STRING, a network map was provided showing connections and interactions between the proteins associated with the genes provided from the GSE19804 data set.
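The mix-up between fold changes and gene symbols described above comes down to writing the values of a named vector instead of its names. A minimal sketch, using a made-up vector and a temporary file in place of the real output path:

```r
# Toy named vector: names are gene symbols, values are log fold changes.
gene_list <- c(GENE_E = 3.1, GENE_A = 2.4, GENE_C = 1.6)

# Temporary file standing in for the real .txt output path.
out_file <- tempfile(fileext = ".txt")

# write(gene_list, out_file) would save the numbers, not the symbols.
# Writing names(gene_list) saves one gene symbol per line instead.
writeLines(names(gene_list), out_file)

# Read the file back to confirm what was saved.
saved <- readLines(out_file)
saved
```

A file with one symbol per line can be pasted directly into the input boxes of web tools such as Enrichr and STRING.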

Module 7

Technical Area:

  • Downloaded the GSE66272 .CEL files and series matrix text file
  • Performed quality control through arrayQualityMetrics
    • PCA graphs for raw and normalized data
  • Performed DGE analysis
    • Identified and removed outliers
    • Annotated probe ID to gene SYMBOL
    • Removed duplicate SYMBOLs, probe IDs, and NAs
  • Functional Analysis
    • GO (gene ontology)
    • GSEA

Tools:

Soft Skills:

  • Self-solving: Fixed any errors myself through prior knowledge/experience I have developed from this internship
  • Organization: I created a folder for my outputs, and properly uploaded it (in addition to the other folders) on GitHub
    • Organized my code by commenting

Three Achievement Highlights:

  • Performed quality control on GSE66272 (arrayQualityMetrics); created two PCA plots
  • DGE analysis: Successfully identified and removed outliers, and annotated probe IDs to their designated gene SYMBOLS
  • Functional Analysis: Performed gene ontology (GO) analysis and GSEA and created plots/graphs

Tasks Completed:

In this module I did nothing new, but rather applied my skills and knowledge from previous modules to a new data set: GSE66272, which covers renal cancer, specifically CCRCC, a very aggressive form of it. Just like the lung cancer data set I analysed previously (GSE19804), the data is profiled by array. I downloaded the .CEL files and the series matrix text file, from which I was able to import the data into RStudio and create the metadata. From there I performed a quality control test (specifically arrayQualityMetrics) and then normalized the data using gcrma. Afterwards, I visualized both the raw and normalized data using PCA plots. Then I moved on to DGE analysis, where I identified and removed the outliers, annotated the probe IDs with their respective gene SYMBOLS, and removed duplicate symbols, probe IDs, and NAs. Afterwards I merged the annotated data by row names with the normalized, outlier-free data and filtered the result. On this data I performed limma analysis, from which I made a top table and then a heat map. Finally I moved on to functional analysis, where I performed gene ontology analysis and GSEA and generated plots/graphs.