Sanisetti - Bioinformatics (Level 1) Pathway

Self-Assessment 1, Same as April

Technical Area

  • An introduction to bioinformatics methods, tools, and different types of data used in bioinformatics analyses
  • Introduction to navigating, using, and debugging in RStudio with R
  • Introduction to reading and interpreting scientific papers

Tools

  • RStudio/R
  • RStudio packages
    • ggplot2
    • Bioconductor
  • Additional Resources:
    • YouTube tutorials (bioinformatics)
    • Frontiers in Genetics (paper)
    • LinkedIn (downloading latest version of R)
    • ggplot2 website
    • Stack Overflow

Soft Skills

  • Conflict resolution: I ran into a few errors while working through the tutorials. I learned that googling the error message and using specific search phrases could help me resolve errors in my code.
  • Flexibility and Adaptability: I was not very familiar with RStudio before this tutorial and was not accustomed to its layout or to R syntax. From this tutorial, I learned how to adapt to new situations and pick up new skills.
  • Self-management: Since the tutorials are self-guided, I had to be sure to learn in a way that works best for me. This often meant re-reading the tutorials multiple times and taking my time to make sure that I retained the information.
  • Attention to Detail: R is a very detail-oriented language. The most common error I ran into was confusing uppercase and lowercase letters in names. I learned to be a more active reader and got better at spotting problems after this module.
  • Commitment: This module took me a while, but I stuck with it and learned the importance of finishing a task, no matter how hard it is. In the end, I learned a lot and grew as an individual.

Three achievement highlights

  1. Learned how to use ggplot2 to construct volcano plots and boxplots from a given dataset. This helped me get comfortable with R syntax and showed me, in a visual way, what each component of the code meant and how it worked.
  2. I also gained technical and practical knowledge of how to use R and RStudio, and of how to read a research paper so that I retain as much information as possible.
  3. Additionally, I learned how to troubleshoot code and how to be an effective problem solver when the code from the tutorial didn’t work on my end.

A detailed statement of tasks completed. State each task, hurdles faced if any, and how you solved the hurdle.

  1. I had previously downloaded RStudio, but had not downloaded the latest version of R. Therefore, I followed a LinkedIn article on how to update R to help me with this task.
  2. I followed the PDF tutorials for getting familiar with R and creating a new project.
  3. I tried copying the code from the PDF that calls library(Biobase), but I got the following error message: Error in library(Biobase): there is no package called ‘Biobase’. I resolved this by googling the error and installing the package according to the instructions on the Bioconductor web page. This ended up working.
  4. The tutorial ran smoothly after that. I ran into a minor uppercase/lowercase error with my variable names, but I was able to sort it out.
  5. Learned about different data types, vectors, variables, factors, matrices, data frames, and lists
  6. Learned about different functions and how to seek help with functions
  7. Installed and successfully loaded ggplot2
  8. Learned how to read .csv files using R. I downloaded the sample data and it worked smoothly.
  9. Data visualization in R: through the tutorial I made a histogram, plots, bar plots, and a boxplot (this ran smoothly, but it took me a while to adjust to the syntax and figure out what each component of the code meant)
  10. Created an enhanced volcano plot and boxplot using ggplot2
  11. Looked over textbooks
  12. Watched the 2020 student STEM-Away presentation on YouTube for background in bioinformatics
  13. I started reading the Guo paper, but have only read the abstract and part of the introduction and taken notes

Repost (I posted in the incorrect place):

Self-Assessment 2

Technical Area

  • Gain experience with using NCBI GEO to download datasets and use these datasets in R
  • Explore more functions in R to get started with NCBI GEO MetaData
  • Get familiar with more R packages such as affy, affyPLM, simpleaffy, arrayQualityMetrics, affyQCreport, and sva

Tools

  • GEO dataset GSE32323
  • Excel
    • To obtain MetaData
  • RDocumentation to explore merge()
  • RStudio and various packages (including affy, affyPLM, simpleaffy, arrayQualityMetrics, affyQCreport, and sva)
  • GitHub - created account

Soft Skills

  • Consulting Resources: I needed a refresher on some of the functions used in this module. I used Google and the RDocumentation to learn more about the merge() function. This skill will be useful in the future, as I come across more and more unknown functions that are necessary for my project.
  • Self-management: Once again, I am relatively new to R, so self-management and advocating for my own remote learning is important. I made sure to take my time reading through the module and exploring GEO and GitHub.
  • Research and Analysis: Looking through GEO taught me the importance of reading through descriptions and looking at different types of experiments to find the data most useful for my project.

Three achievement highlights

  1. I learned how to navigate the NCBI GEO website to find datasets of interest by filtering the search options, which is useful because most bioinformatics projects require this skill
  2. I also learned how to download various R packages. This skill is important for moving on to more advanced analyses in R
  3. I successfully created a GitHub account, which will be useful for sharing code with other programmers and exploring existing repositories to build upon other code in future projects

A detailed statement of tasks completed. State each task, hurdles faced if any, and how you solved the hurdle.

  • I explored the GEO database. I have previous experience with this database, and I looked up ‘Brain Cancers’ and ‘Glioblastoma’ to find different datasets
  • Downloaded the GSE32323 dataset using instructions from the module
  • Unzipped .tar files
  • Used Excel to obtain the MetaData, since this is how I have obtained MetaData in the past, and then exported the file to R
  • Installed and loaded packages: affy, affyPLM, simpleaffy, arrayQualityMetrics, affyQCreport, and sva
  • Looked at RDocumentation for a refresher on how to use the merge() function
  • Created GitHub account and commented details for 3rd module
  • Read Intro to DGE Analysis
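
As an illustration of what a merge on a shared key does, here is a minimal sketch in Python rather than R. It mimics the default behaviour of R's merge(), which keeps only rows whose key appears in both tables. The sample IDs, column names, and the helper merge_by_key are hypothetical and are not taken from the module or from GSE32323.

```python
# Hypothetical metadata and QC tables keyed by sample ID (illustrative values only).
metadata = [
    {"sample": "S1", "group": "normal"},
    {"sample": "S2", "group": "tumor"},
    {"sample": "S3", "group": "tumor"},
]
qc = [
    {"sample": "S1", "passed_qc": True},
    {"sample": "S2", "passed_qc": False},
]


def merge_by_key(left, right, key):
    """Inner join of two lists of dicts on `key`.

    Assumes each key value appears at most once in `right`.
    """
    right_index = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_index.get(row[key])
        if match is not None:
            # Combine the columns of both rows into one record.
            merged.append({**row, **match})
    return merged


merged = merge_by_key(metadata, qc, "sample")
for row in merged:
    print(row)
```

Sample S3 is dropped because it has no matching row in the second table, which is the same thing that happens in an inner join.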

Self-Assessment 3

Technical Area

  • Visualize microarray data in R, conduct quality control, find outliers

  • Pre-process data (normalization/batch correction)

Tools

  • GEO dataset

  • RStudio

  • RDocumentation to explore functions

  • RStudio and various packages (including affy, affyPLM, simpleaffy, arrayQualityMetrics, affyQCreport, and sva)

  • GitHub - to submit deliverables

Soft Skills

  • Consulting Resources: This module was definitely more complex than previous modules. I had to use the guided instructions for more detail and I also had to look up how certain functions were used in the R documentation, especially for the PCA().

  • Self-management: Since this module was more intensive, I had to set aside a good amount of time to work through it. I spent a little bit of time everyday working through it and this aided me greatly in obtaining a final product.

  • Research and Analysis: While getting the plots and visualizations was an important aspect of this module, I also had to understand what quality control and PCA were. To do this, I watched some StatQuest videos and meeting recordings explaining the plots and analyses.

  • Troubleshooting: I initially ran into an error in RStudio with the simpleaffy package. To solve it, I googled the error and followed instructions from an online forum where someone else had a similar issue.

  • Communication: I had to work with a team to complete my deliverable for this project (the PCA plots). To do this, we had to communicate with each other and give updates on our progress.

Three achievement highlights

  1. I learned how to create different visualizations in RStudio with microarray data from GEO (boxplot, QC report, heatmap, PCA plot).

  2. I learned how to troubleshoot errors in RStudio by looking at online forums and exploring online resources.

  3. I learned more about common analyses used in bioinformatics, such as principal component analysis.

A detailed statement of tasks completed. State each task, hurdles faced if any, and how you solved the hurdle.

  • I conducted quality control on the given dataset and obtained the deliverables. However, some of the functions in this step (especially those that produce the six-page document) were time intensive

  • I normalized the data using rma() and read online about what this function does

  • I conducted PCA and obtained two plots. I watched a StatQuest video on YouTube to better understand what PCA is

  • I created boxplots to identify outliers

  • I created a heatmap
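
To make the outlier step more concrete, here is a minimal sketch in Python of the 1.5 × IQR rule that boxplot whiskers are conventionally based on. It is an assumption that this rule applies to a single summary value per sample. In the module itself the outliers were judged from the plots, and quartiles are computed slightly differently from tool to tool. The values and the helper iqr_outliers are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical per-sample median log2 intensities (not from the real dataset).
sample_medians = [7.1, 7.3, 7.2, 7.0, 7.4, 9.8]
print(iqr_outliers(sample_medians))
```

A sample flagged this way is only a candidate outlier; the PCA plot and the QC report give further evidence before removing it.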

Self-Assessment 4

Technical Area

  • Filter and pre-process genes for DGE analysis

  • Conduct differential expression analysis of microarray data

  • Create different plots, such as a volcano plot to visualize the DEGs

Tools

  • GEO dataset

  • RStudio

  • RDocumentation to explore functions

  • RStudio and various packages

  • GitHub - to submit deliverables

  • Stack Overflow to troubleshoot common issues

  • Google Slides

  • Slack

  • YouTube

Soft Skills

  • Consulting Resources: Once again, this module was much more rigorous than previous modules. I had to follow the detailed guide, consult the R documentation, and use the help feature to learn more about certain commands and functions. Another resource I consulted was my group member, Arian. We troubleshot common issues via Slack to arrive at a suitable volcano plot. We also consulted with other groups to make sure that our plot was accurate.

  • Comprehension: To fully understand what a volcano plot was doing, I consulted a few videos on YouTube. This was helpful because I got a deeper understanding of the code and the data.

  • Self-management: Due to the rigor of this module, I had to manage my time effectively and set aside enough time so that I could completely understand what I was doing. I had to work through the module on my own, consult with my teammate, and present the module to my team.

  • Troubleshooting: Initially, our volcano plot had too many values and the wrong labels (not gene symbols). My teammate and I realized this was because we were analyzing the wrong dataset. We went back through the example on STEM-Away and also found an example on the internet to arrive at the correct solution.

  • Communication: A large part of completing this module was consulting with my teammate. We had different results initially, and also wanted to find ways to include gene symbol labels. We explored different examples on the internet and tried different methods and spoke about how to make them better suited to our project. Eventually we arrived at a solution by consulting with one another.

Three achievement highlights

  1. I learned how to interpret and create a volcano plot to analyze DEGs in R. I also learned how to change certain aspects of a plot such as labels and titles.

  2. I learned how to effectively communicate with my teammates and collaborate to fix an issue.

  3. I learned not only how to produce a volcano plot, but also, by watching videos, what the plot shows.

A detailed statement of tasks completed. State each task, hurdles faced if any, and how you solved the hurdle.

  • Loaded data into R, via module 2

  • Removed outliers using tips from group members. This was initially a hurdle in the previous module, which I solved by consulting with my group and finding the best solution.

  • Normalized the data using the rma() function. I didn’t understand what rma was, so I read the documentation to improve my understanding

  • Annotated the data using ProbeIDs. I didn’t know how to do this initially, so I consulted an online example on a forum page

  • Created a matrix using model.matrix()

  • Used lmFit() and eBayes() (limma) to do DGE analysis

  • Used topTable() for plot data

  • Created volcano plot

  • Found top 10 DEGs

  • Created a heat map visualization
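
As a rough illustration of what a volcano plot encodes, the sketch below (in Python rather than R) takes a log2 fold change and an adjusted p-value for each gene, computes the plot coordinates, and labels each gene as up, down, or not significant. The cutoffs (|log2FC| ≥ 1 and adjusted p < 0.05) are common conventions assumed here, not the thresholds used in the module, and the gene names and numbers are made up.

```python
import math


def classify(log2_fc, adj_p, fc_cutoff=1.0, p_cutoff=0.05):
    """Label a gene by fold change and adjusted p-value (assumed cutoffs)."""
    if adj_p < p_cutoff and log2_fc >= fc_cutoff:
        return "up"
    if adj_p < p_cutoff and log2_fc <= -fc_cutoff:
        return "down"
    return "not significant"


def volcano_point(log2_fc, adj_p):
    """x = log2 fold change, y = -log10(adjusted p-value)."""
    return (log2_fc, -math.log10(adj_p))


# Hypothetical rows: (gene, log2 fold change, adjusted p-value).
results = [
    ("GENE_A", 2.5, 0.001),
    ("GENE_B", -1.8, 0.01),
    ("GENE_C", 0.3, 0.0001),  # significant p-value but small effect
    ("GENE_D", 3.0, 0.2),     # large effect but not significant
]

labels = {gene: classify(fc, p) for gene, fc, p in results}
for gene, fc, p in results:
    print(gene, volcano_point(fc, p), labels[gene])
```

Genes such as GENE_C and GENE_D show why both axes matter: a point must be far from the centre horizontally and high vertically to count as differentially expressed.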

Self-Assessment 5

Technical Area

  • Conduct enrichment, over-representation and network analysis

  • Export and create plots with proper formatting

Tools

  • GEO dataset

  • RStudio

  • RDocumentation to explore functions

  • RStudio and various packages

  • GitHub - to submit deliverables

  • Google Slides

Soft Skills

  • Debugging code (ran into a small issue with formatting code, syntax error)

  • Problem-solving (deciding what threshold level to use)

  • Research and Analysis (researching and considering different threshold levels)

  • Presentation (presenting KEGG pathways to team)

Three achievement highlights

  1. Created a KEGG pathway plot, along with a gene concept network, and GSEA plot

  2. Learned about commonly used bioinformatics tools such as Gene Ontology and KEGG, which will be useful in the future for other projects

  3. Presented KEGG pathways and developed a deep understanding of the dotplot
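
For context on over-representation analysis, here is a minimal sketch in Python of the hypergeometric test that this kind of analysis is typically based on: given a list of DEGs, how surprising is it that at least k of them fall in one pathway? The function name and the numbers are hypothetical, and real tools also correct the p-values for testing many pathways at once, which is not shown.

```python
from math import comb


def hypergeom_pvalue(k, pathway_size, n_degs, n_background):
    """P(X >= k): chance of at least k pathway genes among n_degs genes
    drawn without replacement from n_background genes, of which
    pathway_size belong to the pathway."""
    upper = min(pathway_size, n_degs)
    total = comb(n_background, n_degs)
    favourable = sum(
        comb(pathway_size, i) * comb(n_background - pathway_size, n_degs - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Hypothetical example: 20 background genes, 5 in the pathway,
# 4 DEGs, and all 4 DEGs are pathway members.
p = hypergeom_pvalue(k=4, pathway_size=5, n_degs=4, n_background=20)
print(p)
```

A small p-value suggests the pathway is over-represented among the DEGs compared with what random selection would give.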

A detailed statement of tasks completed. State each task, hurdles faced if any, and how you solved the hurdle.

Self-Assessment 6

Technical Area

  • Analyses using web-based tools such as STRING database and GEPIA

  • Enrichment analysis using EnrichR

  • Used DAVID and Metascape as well

Tools

  • EnrichR

  • DAVID

  • Metascape

  • STRING DB

  • GEPIA

  • Google Slides

Soft Skills

  • Comprehension (understanding what each of the tools do by reading their documentation)

  • Research and Analysis (analysing what each component of data means)

  • Problem-solving skills (understanding why my teammate and I got different results)

  • Communication (communicating with my teammate to create a final presentation)

Three achievement highlights

  1. Used STRING-db to create a protein-protein interaction network

  2. Used Metascape to create a PPI network and identified top Gene Ontology pathways

  3. Presented results and discussed them with the team

A detailed statement of tasks completed. State each task, hurdles faced if any, and how you solved the hurdle.

  • Obtained data from previous module, exported to a csv

  • Used Gene Ontology to find top terms

  • Used EnrichR for enrichment analysis

  • Used DAVID to find meaning behind genes

  • Found enriched KEGG pathways

  • Used Metascape to find different pathways and PPI network

  • Used STRING-db to find PPI network

  • Used GEPIA for survival analysis

Self-Assessment 7-8

Technical Area

  • Creating an R-Shiny application

  • Designing the UX (user experience) layout

Tools

  • RStudio

  • RDocumentation to explore functions

  • RStudio and various packages

  • Figma

  • Google Slides

Soft Skills

  • Communication (meeting with teammates and the STEM-Away UX team)

  • Presentation (presenting layouts to the team)

  • Problem solving (learning how to efficiently and effectively fit all elements of QC necessary in layout)

Three achievement highlights

  1. Created a Figma design and presented it to my team and later to the entire pathway during the pathway meeting

  2. I was able to successfully make my first R Shiny application as a test and consult with teammates on future designs

  3. Presented layouts to different audiences: groupmates, the pathway, and UX interns

A detailed statement of tasks completed. State each task, hurdles faced if any, and how you solved the hurdle.