Sanisetti - Bioinformatics (Level 1) Pathway

sanisetti · June 14, 2021, 11:06pm

Self-Assessment 1, Same as April

Technical Area

An introduction to bioinformatics methods, tools, and different types of data used in bioinformatics analyses
Introduction to navigating, using, and debugging in RStudio with R
Introduction to reading and interpreting scientific papers

Tools

RStudio/R
RStudio packages
- ggplot2
- Bioconductor
Additional Resources:
- YouTube tutorials (bioinformatics)
- Frontiers in Genetics (paper)
- LinkedIn (downloading latest version of R)
- ggplot2 website
- StackOverflow

Soft Skills

Conflict resolution: I ran into a few errors while working through the tutorials. I learned that googling the error message with specific search phrases could help resolve errors in my code.
Flexibility and Adaptability: I was not very familiar with RStudio before this tutorial and was not accustomed to its layout or to R syntax. Working through it taught me how to adapt to an unfamiliar environment and pick up new skills.
Self-management: Since the tutorials are self-guided, I had to be sure to learn in a way that works best for me. This often meant re-reading the tutorials multiple times and taking my time to make sure that I retained the information.
Attention to Detail: R is a case-sensitive, detail-oriented language. The most common error I ran into was mixing up uppercase and lowercase letters. I learned to be a more active reader and got better at spotting problems after this module.
Commitment: This module took me a while, but I stuck with it and learned the importance of finishing a task, no matter how hard it is. In the end, I learned a lot and grew as an individual.
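
To illustrate the uppercase/lowercase problem mentioned under Attention to Detail, here is a small sketch. The variable names are made up for the example and are not from the tutorial.

```r
# R treats names that differ only in case as different objects.
sample_count <- 12

exists("sample_count")   # TRUE: this exact name was defined above
exists("Sample_Count")   # FALSE: different capitalisation, so a different name

# Using the wrong case raises an "object not found" error; tryCatch() captures
# the message here so the script keeps running.
msg <- tryCatch(Sample_Count + 1, error = function(e) conditionMessage(e))
print(msg)
```

When an "object not found" error appears, comparing the spelling and capitalisation of the name against where it was defined is usually a good first check.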

Three achievement highlights

Learned how to use ggplot2 to construct volcano plots and boxplots from a given dataset. This helped me get comfortable with R syntax and showed me, visually, what each component of the code does.
I also gained technical and practical knowledge of how to use R and RStudio, and of how to read a research paper to retain as much information as possible.
Additionally, I learned how to troubleshoot code and be an effective problem solver when the code from the tutorial didn’t work on my end.

A detailed statement of tasks completed. State each task, hurdles faced if any, and how you solved the hurdle.

I had previously downloaded RStudio but had not downloaded the latest version of R, so I followed a LinkedIn article on how to update R.
I followed the PDF tutorials for getting familiar with R and creating a new project
I tried copying the code from the PDF that calls library(Biobase), but I got the following error message: Error in library(Biobase) : there is no package called ‘Biobase’. I resolved this by googling the error and installing the package according to the instructions on the Bioconductor web page. This ended up working.
The tutorial ran smoothly after that. I ran into a minor case error with my variable names, but I was able to sort it out.
Learned about different data types, vectors, variables, factors, matrices, data frames, and lists
Learned about different functions and how to seek help with functions
Installed and successfully loaded ggplot2
Learned how to read .csv files using R. I downloaded the sample data and it worked smoothly
Data visualization in R: through the tutorial I made a histogram, plots, bar plots, and a boxplot (ran smoothly, but it took me a while to adjust to the syntax and figure out what each component of the code meant)
Created an enhanced volcano plot and boxplot using ggplot2
Looked over textbooks
Watched the 2020 student STEM-Away presentation on YouTube for a background in Bioinformatics
I started reading the Guo paper, but have only read the abstract and part of the introduction and taken notes
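
The "there is no package called ‘Biobase’" hurdle above can be handled with a pattern like the following. This is a minimal sketch and not the tutorial's code: ensure_bioc_package() is a hypothetical helper name, and it assumes an internet connection is available whenever an install is actually needed.

```r
# Hypothetical helper (not from the tutorial): load a package, installing it
# from Bioconductor first if it is not already available.
ensure_bioc_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # BiocManager is the installer described on the Bioconductor website.
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
    }
    BiocManager::install(pkg)
  }
  # Attach the package; logical.return = TRUE makes library() report success.
  library(pkg, character.only = TRUE, logical.return = TRUE)
}
```

Calling ensure_bioc_package("Biobase") once at the top of a script would then replace the bare library(Biobase) call that failed.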

sanisetti · June 28, 2021, 9:23pm

Repost (I posted in the incorrect place):

Self-Assessment 2

Technical Area

Gain experience with using NCBI GEO to download datasets and use these datasets in R
Explore more functions in R to get started with NCBI GEO MetaData
Get familiar with utilizing more R packages such as affy, affyPLM, simpleaffy, arrayQualityMetrics, affyQCreport, and sva

Tools

GEO dataset GSE32323
Excel
- To obtain MetaData
RDocumentation to explore merge()
RStudio and various packages (including affy, affyPLM, simpleaffy, arrayQualityMetrics, affyQCreport, and sva)
GitHub - created account

Soft Skills

Consulting Resources: I needed a refresher on some of the functions used in this module. I used Google and the RDocumentation to learn more about the merge() function. This skill will be useful in the future, as I come across more and more unknown functions that are necessary for my project.
Self-management: Once again, I am relatively new to R, so self-management and advocating for my own remote learning is important. I made sure to take my time reading through the module and exploring GEO and GitHub.
Research and Analysis: Looking through GEO taught me the importance of reading through descriptions and looking at different types of experiments to find the data most useful for my project.

Three achievement highlights

I learned how to navigate the NCBI GEO website to find datasets of interest by filtering the search options, which is useful as most bioinformatics projects require this skill
I also learned how to download various R packages. This skill is important for getting into more advanced analyses and specifications in R
I successfully created a GitHub account, which will be useful for sharing code with other programmers and exploring existing repositories to build upon other code in future projects

A detailed statement of tasks completed. State each task, hurdles faced if any, and how you solved the hurdle.

I explored the GEO database. I have previous experience with this database, and I looked up ‘Brain Cancers’ and ‘Glioblastoma’ to find different datasets
Downloaded the GSE32323 dataset using instructions from the module
Unzipped .tar files
Used Excel to get the MetaData, since this is how I have obtained MetaData in the past, then exported the file to R
Installed and loaded packages: affy, affyPLM, simpleaffy, arrayQualityMetrics, affyQCreport, and sva
Looked at RDocumentation for a refresher on how to use the merge() function
Created GitHub account and commented details for 3rd module
Read Intro to DGE Analysis
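
As a refresher on the merge() function mentioned above, here is a minimal sketch using toy data frames. The sample IDs, column names, and values are invented for illustration and do not come from GSE32323.

```r
# Toy stand-ins for a metadata sheet and a per-sample summary table.
metadata <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  tissue    = c("normal", "tumor", "normal", "tumor")
)
qc_summary <- data.frame(
  sample_id   = c("S1", "S2", "S3", "S5"),
  mean_signal = c(7.1, 7.4, 6.9, 7.0)
)

# Default merge(): keeps only rows whose key appears in both tables (inner join).
inner <- merge(metadata, qc_summary, by = "sample_id")

# all.x = TRUE: keeps every metadata row and fills missing values with NA (left join).
left <- merge(metadata, qc_summary, by = "sample_id", all.x = TRUE)

print(inner)
print(left)
```

Comparing the row counts of the two results is a quick way to spot samples that are present in one table but missing from the other.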

sanisetti · June 28, 2021, 9:50pm

Self-Assessment 3

Technical Area

Visualize microarray data in R, conduct quality control, find outliers
Pre-process data (normalization/batch correction)

Tools

GEO dataset
R Studio
RDocumentation to explore functions
RStudio and various packages (including affy, affyPLM, simpleaffy, arrayQualityMetrics, affyQCreport, and sva)
GitHub - to submit deliverables

Soft Skills

Consulting Resources: This module was definitely more complex than previous modules. I had to use the guided instructions for more detail, and I also had to look up how certain functions were used in the R documentation, especially PCA().
Self-management: Since this module was more intensive, I had to set aside a good amount of time to work through it. I spent a little bit of time every day working through it, and this aided me greatly in obtaining a final product.
Research and Analysis: While getting the plots or visualizations was an important aspect of this module, I also had to understand what Quality Control and PCA were. To do this I watched some StatQuest videos and meeting recordings explaining the plots and analyses.
Troubleshooting: I initially ran into an error in RStudio with the simpleaffy package. To solve it I googled my error and followed instructions from an online forum where someone else had a similar issue.

Communication: I had to work with a team to complete my deliverable for this project (the PCA plots). In order to do this we had to communicate with each other and give updates on our progress.

Three achievement highlights

I learned how to create different visualizations in RStudio with microarray data from GEO (boxplot, QC report, heatmap, PCA plot).
I learned how to troubleshoot errors in RStudio by looking at online forums and exploring online resources.
I learned more about common analyses used in bioinformatics, such as principal component analysis.

A detailed statement of tasks completed. State each task, hurdles faced if any, and how you solved the hurdle.

I conducted quality control on the given dataset and obtained the deliverables. However, some of the functions in this step (especially those used to generate the six-page document) were time-intensive
I normalized the data using rma() and read online about what this function does
I conducted PCA and obtained two plots. I watched a StatQuest video on YouTube to better understand what PCA is
I created boxplots to identify outliers
I created a heatmap
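
The PCA and boxplot steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the expression matrix is simulated rather than taken from the GEO dataset, and base R's prcomp() is swapped in for the PCA() function used in the module so that the example runs without extra packages.

```r
set.seed(1)

# Simulated log-scale expression matrix: 100 probes (rows) x 6 samples (columns).
expr <- matrix(rnorm(600, mean = 8, sd = 1), nrow = 100,
               dimnames = list(paste0("probe", 1:100), paste0("S", 1:6)))
group <- factor(rep(c("normal", "tumor"), each = 3))

# Shift the first 20 probes upward in the tumor samples so the groups differ.
expr[1:20, group == "tumor"] <- expr[1:20, group == "tumor"] + 3

# prcomp() expects samples in rows, so the matrix is transposed first.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Proportion of total variance captured by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# PCA plot: one point per sample, coloured by group.
plot(pca$x[, 1], pca$x[, 2], col = as.integer(group), pch = 19,
     xlab = sprintf("PC1 (%.0f%%)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.0f%%)", 100 * var_explained[2]))

# Per-sample boxplots: a sample whose box sits far from the others is a
# candidate outlier.
boxplot(expr, main = "Per-sample intensity distribution", ylab = "log2 intensity")
```

Samples that cluster away from their own group on the PCA plot, or whose boxplot is clearly shifted, are the ones worth a closer look before deciding whether to remove them.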

sanisetti · July 8, 2021, 12:56am

Self-Assessment 4

Technical Area

Filter and pre-process genes for DGE analysis
Conduct differential expression analysis of microarray data
Create different plots, such as a volcano plot to visualize the DEGs

Tools

GEO dataset
R Studio
RDocumentation to explore functions
RStudio and various packages
GitHub - to submit deliverables
Stack overflow to troubleshoot common issues
Google Slides
Slack
YouTube

Soft Skills

Consulting Resources: Once again, this module was much more rigorous than previous modules. I had to follow the detailed guide, consult the R documentation, and use the help feature to learn more about certain commands and functions. Another resource I consulted was my group member, Arian. We troubleshot common issues via Slack to arrive at a suitable volcano plot. We also consulted with other groups to ensure that our plot was accurate.
Comprehension: To fully understand what a volcano plot was doing, I consulted a few videos on YouTube. This was helpful because I got a deeper understanding of the code and the data.
Self-management: Due to the rigor of this module, I had to manage my time effectively and set aside enough time so that I could completely understand what I was doing. I had to work through the module on my own, consult with my teammate, and present the module to my team.
Troubleshooting: Initially, our volcano plot had too many values and the wrong labels (not gene symbols). My teammate and I realized this was because we were analyzing the wrong dataset. We went back through the example on STEM-Away and also found an example on the internet to arrive at the correct solution.

Communication: A large part of completing this module was consulting with my teammate. We had different results initially, and also wanted to find ways to include gene symbol labels. We explored different examples on the internet and tried different methods and spoke about how to make them better suited to our project. Eventually we arrived at a solution by consulting with one another.

Three achievement highlights

I learned how to interpret and create a volcano plot to analyze DEGs in R. I also learned how to change certain aspects of a plot such as labels and titles.
I learned how to effectively communicate with my teammates and collaborate to fix an issue.
I not only produced a plot but also watched videos to understand what a volcano plot shows.

A detailed statement of tasks completed. State each task, hurdles faced if any, and how you solved the hurdle.

Loaded data into R, via module 2
Removed outliers using tips from group members. This was a hurdle in the previous module, which I solved by consulting with my group and finding the best solution.
Normalized data using the rma() function. I didn’t understand what RMA was, so I read the documentation to improve my understanding
Annotated with ProbeIDs. I didn’t know how to do this initially and had to consult an online example on a forum page
Created a matrix using model.matrix()
Used lmFit() and eBayes() (limma) to do DGE analysis
Used topTable() for plot data
Created volcano plot
Found top 10 DEGs
Created a heat map visualization
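
The model.matrix(), lmFit(), eBayes(), and topTable() steps listed above fit together roughly as in the sketch below. It is an illustration and not the module's code: the expression matrix is simulated, classify_degs() is a hypothetical helper, and the cutoffs (absolute log fold change of 1, adjusted p-value of 0.05) are assumed values. The limma part only runs if the limma package is installed.

```r
# Simulated data standing in for the normalized matrix: 200 probes x 6 samples.
set.seed(42)
expr <- matrix(rnorm(200 * 6, mean = 8), nrow = 200,
               dimnames = list(paste0("probe", 1:200), paste0("S", 1:6)))
expr[1:10, 4:6] <- expr[1:10, 4:6] + 2   # make 10 probes higher in tumor samples

# Design matrix: an intercept (normal baseline) plus a tumor-vs-normal column.
group  <- factor(rep(c("normal", "tumor"), each = 3))
design <- model.matrix(~ group)
print(colnames(design))

# Hypothetical helper: label each row of a topTable()-style data frame using
# its logFC and adj.P.Val columns.
classify_degs <- function(tt, lfc_cutoff = 1, p_cutoff = 0.05) {
  status <- rep("not significant", nrow(tt))
  status[tt$adj.P.Val < p_cutoff & tt$logFC >=  lfc_cutoff] <- "up"
  status[tt$adj.P.Val < p_cutoff & tt$logFC <= -lfc_cutoff] <- "down"
  status
}

if (requireNamespace("limma", quietly = TRUE)) {
  # Fit a linear model per probe, then moderate the statistics with eBayes().
  fit     <- limma::eBayes(limma::lmFit(expr, design))
  results <- limma::topTable(fit, coef = "grouptumor", number = Inf)
  results$status <- classify_degs(results)

  # Volcano plot: effect size on the x axis, significance on the y axis.
  plot(results$logFC, -log10(results$P.Value), pch = 19,
       col = ifelse(results$status == "not significant", "grey", "red"),
       xlab = "log2 fold change", ylab = "-log10(p-value)")

  # First 10 rows in topTable()'s default ranking.
  print(head(results, 10))
}
```

A volcano plot with far too many highlighted points, like the one described under Troubleshooting, is often a sign that the input matrix or the group labels are not the ones intended, so checking colnames(design) against the sample order is a cheap sanity check.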

sanisetti · July 16, 2021, 2:08am

Self-Assessment 5

Technical Area

Conduct enrichment, over-representation and network analysis
Export and create plots with proper formatting

Tools

GEO dataset
R Studio
RDocumentation to explore functions
RStudio and various packages
GitHub - to submit deliverables
Google Slides

Soft Skills

Debugging code (ran into a small issue with formatting code, syntax error)
Problem-solving (deciding what threshold level to use)
Research and Analysis (researching and considering different threshold levels)
Presentation (presenting KEGG pathways to team)

Three achievement highlights

Created a KEGG pathway plot, along with a gene concept network, and GSEA plot
Learned about commonly used bioinformatics tools such as Gene Ontology and KEGG, which will be useful in the future for other projects
Presented KEGG pathways and developed a deep understanding of the dotplot

A detailed statement of tasks completed. State each task, hurdles faced if any, and how you solved the hurdle.

Annotated gene symbols to Entrez IDs using select()
Identified KEGG pathway using enrichKEGG
Created a KEGG dotplot
Decided upon threshold levels: we chose to use three and discuss them in our presentation
Created Gene Ontology figure, gene concept network, and GSEA plot
Link to slides: https://docs.google.com/presentation/d/15xxQvhs3BiPK7s1FGUe2u-9lRSpscxqdzH9__bVznNg/edit#slide=id.ge3be2b0727_0_50
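
To make the over-representation step and the threshold decision above more concrete, here is a sketch of the hypergeometric test that this kind of enrichment analysis is based on, written in base R. All counts and the three cutoffs are hypothetical and are not the values from our analysis. Tools such as enrichKEGG also adjust p-values for multiple testing, which this single-pathway example leaves out.

```r
# Hypothetical counts for one pathway.
universe_size <- 20000  # genes in the background set
pathway_size  <- 150    # background genes annotated to the pathway
deg_count     <- 400    # differentially expressed genes submitted
overlap       <- 12     # DEGs that fall in the pathway

# Overlap expected by chance alone.
expected_overlap <- deg_count * pathway_size / universe_size

# P(overlap or more) under the hypergeometric distribution.
# phyper() gives P(X > q) with lower.tail = FALSE, hence overlap - 1.
p_value <- phyper(overlap - 1, pathway_size, universe_size - pathway_size,
                  deg_count, lower.tail = FALSE)

# Compare the same p-value against several candidate cutoffs (assumed values).
cutoffs <- c(0.05, 0.01, 0.001)
passes  <- p_value < cutoffs
names(passes) <- cutoffs

print(expected_overlap)
print(p_value)
print(passes)
```

Looking at how many pathways survive each cutoff is one way to compare threshold levels before settling on what to present.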

sanisetti · July 16, 2021, 2:08am

Self-Assessment 6

Technical Area

Analyses using web-based tools such as STRING database and GEPIA
Enrichment analysis using EnrichR
Used DAVID and Metascape as well

Tools

EnrichR
DAVID
Metascape
STRING DB
GEPIA
Google Slides

Soft Skills

Comprehension (understanding what each of the tools do by reading their documentation)
Research and Analysis (analysing what each component of data means)
Problem-solving skills (understanding why my teammate and I got different results)
Communication (communicating with my teammate to create a final presentation)

Three achievement highlights

Used STRING-db to create a protein-protein interaction network
Used Metascape to create a PPI network and identified top Gene Ontology terms
Presented results and discussed them with the team

A detailed statement of tasks completed. State each task, hurdles faced if any, and how you solved the hurdle.

Obtained data from previous module, exported to a csv
Used Gene Ontology to find top terms
Used EnrichR for enrichment analysis
Used DAVID to find meaning behind genes
Found enriched KEGG pathways
Used Metascape to find different pathways and PPI network
Used STRING-db to find PPI network
Used GEPIA for survival analysis
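
The export step at the top of this list can be sketched as below. The gene symbols, values, and file names are placeholders, and it is an assumption that the web tool in question accepts a plain list of symbols with one per line, so check each tool's input instructions.

```r
# Placeholder results table; the symbols and values are invented.
top_genes <- data.frame(
  symbol = c("GENE_A", "GENE_B", "GENE_C"),
  logFC  = c(2.4, -1.8, 1.5)
)

# Full table as a CSV (written to a temporary folder for this example).
csv_file <- file.path(tempdir(), "deg_table.csv")
write.csv(top_genes, csv_file, row.names = FALSE)

# Plain list of symbols, one per line, convenient for pasting into a web form.
txt_file <- file.path(tempdir(), "deg_symbols.txt")
writeLines(as.character(top_genes$symbol), txt_file)

# Read the CSV back to confirm that it round-trips.
check <- read.csv(csv_file)
print(check)
```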

sanisetti · August 15, 2021, 4:07am

Self-Assessment 7-8

Technical Area

Creating an R-Shiny application
Designing the UX (user experience) layout

Tools

R Studio
RDocumentation to explore functions
RStudio and various packages
Figma
Google Slides

Soft Skills

Communication (meeting with teammates and the STEM-Away UX team)
Presentation (presenting layouts to the team)
Problem solving (learning how to efficiently and effectively fit all elements of QC necessary in layout)

Three achievement highlights

Created a Figma design and presented it to my team and later to the entire pathway during the pathway meeting
I was able to successfully make my first R Shiny application as a test and consult with teammates on future designs
Presented layouts to different audiences: groupmates, the pathway, and UX interns

A detailed statement of tasks completed. State each task, hurdles faced if any, and how you solved the hurdle.

Met with partner to design a QC layout, got to learn more about QC
Finalized Figma design: https://www.figma.com/file/yobLl2egOYMB4oVpc3f3q1/GroupB2_QC-and-Normalization?node-id=0%3A1
Presented design in front of team and pathway
Received feedback from team on design
Designed an R Shiny app to test the platform
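
For reference, a first test app like the one mentioned above can be very small. The sketch below is not our actual app. It assumes the shiny package is installed, uses simulated data, and the input and output IDs (n_samples, qc_boxplot) are made up for the example.

```r
library(shiny)

# UI: a slider in the sidebar and a plot in the main panel.
ui <- fluidPage(
  titlePanel("QC preview (toy data)"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("n_samples", "Number of samples to show",
                  min = 2, max = 6, value = 6, step = 1)
    ),
    mainPanel(plotOutput("qc_boxplot"))
  )
)

# Server: simulate an expression matrix and redraw the boxplot whenever the
# slider value changes.
server <- function(input, output) {
  set.seed(1)
  expr <- matrix(rnorm(600, mean = 8), nrow = 100,
                 dimnames = list(NULL, paste0("S", 1:6)))
  output$qc_boxplot <- renderPlot({
    boxplot(expr[, seq_len(input$n_samples), drop = FALSE],
            ylab = "log2 intensity")
  })
}

# Build the app object; call runApp(app) in an interactive session to launch it.
app <- shinyApp(ui = ui, server = server)
```

Starting from a skeleton like this made it easier to see how a Figma layout maps onto sidebar and main-panel elements.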