Unix essentials for NGS bioinformatics

A course to develop Unix skills for handling and processing next-generation sequencing data

Ratings: 4.14 / 5.00

Description

This course is designed to introduce Unix to students as a highly convenient tool for working with big data in the biological sciences, such as next-generation sequencing (NGS) data. NGS technologies produce a massive amount of data in each run, which is difficult to handle with GUI-based tools; even opening the raw files can be difficult. That is why sequencing data are produced and stored in text format for easy handling and processing.

Unix skills are an asset in bioinformatics. They are easy to learn, convenient, and save a lot of time. Many people skilled in bioinformatics know how to analyze data with programming languages such as Perl or Python, but not all of them realize that it is not necessary to write a program every time. With Unix utilities, data handling and processing, input formatting for software, and text processing of results for interpretation can be performed without advanced programming skills or special software. You will still need software and programming skills for advanced bioinformatics analyses. Unix is a great skill for bio-sciences researchers, scientists, and NGS beginners. It will also help you build pipelines that combine different software to meet your own objectives, such as:

Counting and formatting of FASTA and FASTQ sequences
Converting multi-line FASTA sequences to single-line FASTA sequences
Extraction of desired FASTA and FASTQ sequences from a whole dataset
Splitting and subsetting of large sequence files
Formatting of BLAST, Pfam, and InterPro output for analysis
Extraction of subsequences from genome files
Sequence file cleaning: trimming and filtering of sequences
Random data set generation
Bulk data processing for common tasks
... and many more common tasks
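
As a rough illustration of the first two tasks in the list, the sketch below counts FASTA and FASTQ records and joins a multi-line FASTA file into one sequence line per record. The file names and the tiny example data are made up for this illustration, and the FASTQ count assumes the common four-lines-per-record layout; it is one possible approach, not necessarily the one taught in the course.

```shell
# Hypothetical example data, created in a temporary directory for illustration only
workdir=$(mktemp -d)
printf '>seq1\nACGT\nACGT\n>seq2\nGGCC\n' > "$workdir/example.fa"
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n' > "$workdir/example.fq"

# Count FASTA records: every record starts with a ">" header line
fasta_count=$(grep -c '^>' "$workdir/example.fa")

# Count FASTQ records: assumes exactly four lines per record
fastq_lines=$(wc -l < "$workdir/example.fq")
fastq_count=$((fastq_lines / 4))

# Join multi-line FASTA sequences into a single sequence line per record:
# print each header as-is, accumulate sequence lines, flush at the next header or at the end
awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
     { seq = seq $0 }
     END { if (seq != "") print seq }' "$workdir/example.fa" > "$workdir/single_line.fa"

echo "FASTA records: $fasta_count"
echo "FASTQ records: $fastq_count"
cat "$workdir/single_line.fa"
```

Counting `>` lines is safe for FASTA, but counting `@` lines is not a reliable way to count FASTQ records, because quality strings may also begin with `@`; dividing the line count by four avoids that problem when the file follows the four-line layout.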

Here, I intend to cover only the specific aspects of Unix required for NGS data processing and project management. The course is divided into 4 modules, from basic commands to scripts, with plenty of opportunities to practice. Over 4 days, you will learn through tutorials, video lectures, and practice assignments. There are several ways to teach and learn this material, but I have used the easiest and simplest approach and focused on developing a way of thinking about data processing rather than on advanced, compact use of commands. In the practice guide, I give multiple approaches to each task, so you will also have the opportunity to use the compact and advanced options of the commands.

Day 1 - Introduction to NGS and UNIX

Course introduction
Brief description of NGS and UNIX (video).
Unix: how to start, basic commands (directories and files: creation, removal, navigation, listing, writing/retrieval, and unpacking of NGS data files)
System information commands and their usage
Quick revision
Practice assignments
Challenge of the day
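
To give a feel for the kind of basic commands listed for Day 1, here is a minimal sketch that creates a project directory, writes a small FASTQ file, compresses it, and inspects it without unpacking. The directory layout, file names, and read data are hypothetical and only serve the illustration.

```shell
# Create a hypothetical project layout inside a temporary directory
project=$(mktemp -d)/ngs_project
mkdir -p "$project/raw_data" "$project/results"

# Write a one-record FASTQ file (made-up data)
printf '@read1\nACGT\n+\nIIII\n' > "$project/raw_data/sample.fastq"

# Compress it, as sequencing files are commonly distributed gzip-compressed
gzip "$project/raw_data/sample.fastq"

# List the directory contents with human-readable sizes
ls -lh "$project/raw_data"

# Peek at the first line without unpacking the file on disk
first_header=$(gzip -dc "$project/raw_data/sample.fastq.gz" | head -n 1)
echo "First header: $first_header"

# Unpack the file and count its lines
gunzip "$project/raw_data/sample.fastq.gz"
line_count=$(wc -l < "$project/raw_data/sample.fastq")
echo "Lines: $((line_count))"
```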

Day 2 - NGS bioinformatics data exploration

NGS: data sources, files, and file formats
Unix commands for data exploration
Smart tricks for solving complex problems
Quick revision
Practice assignments (with common NGS data processing related tasks)
Challenge of the day

Day 3 - Flying with commands

File streaming and redirection, stream editor, pipes, and filters
Permissions, symbolic links, and construction of pipelines on the terminal
Practice assignments (with common NGS data processing related tasks)
Challenge of the day
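
The Day 3 topics of pipes, filters, and redirection can be sketched with a small pipeline that ranks the sequences in a FASTQ file by how often they occur. The input file and its reads are invented for this example, and the four-lines-per-record FASTQ layout is assumed.

```shell
# Hypothetical input: three reads, two of which share the same sequence
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nACGT\n+\nIIII\n' > "$workdir/reads.fq"

# Pipeline: keep the sequence line of each record (line 2 of every 4),
# sort so identical sequences are adjacent, count them, rank by count,
# and redirect the result into a file
awk 'NR % 4 == 2' "$workdir/reads.fq" | sort | uniq -c | sort -rn > "$workdir/seq_counts.txt"

# Read back the most frequent sequence and its count from the first line
top_count=$(head -n 1 "$workdir/seq_counts.txt" | awk '{ print $1 }')
top_seq=$(head -n 1 "$workdir/seq_counts.txt" | awk '{ print $2 }')
echo "Most frequent sequence: $top_seq ($top_count reads)"
```

Each command in the pipeline does one small job, and `|` passes its output to the next command without writing intermediate files; only the final result is written to disk with `>`.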

Day 4 - Bulk data processing

Brief introduction to shell scripting
Pattern matching, variables, subshells and loops
Practice assignments (with common NGS data processing related tasks)
Challenge of the day
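
As an illustration of bulk processing with the Day 4 building blocks, the sketch below loops over every FASTA file matching a pattern, stores intermediate values in variables, and writes a per-file summary table. The sample files, their names, and the `summary.tsv` output name are assumptions made for this example.

```shell
# Hypothetical sample files in a temporary directory
workdir=$(mktemp -d)
printf '>a\nACGT\n>b\nGGCC\n' > "$workdir/sample1.fa"
printf '>c\nTTAA\n' > "$workdir/sample2.fa"

# Loop over all files matching the *.fa pattern and count records in each;
# the output of the whole loop is redirected into one summary file
for fasta in "$workdir"/*.fa; do
    name=$(basename "$fasta" .fa)        # strip the directory and the .fa suffix
    count=$(grep -c '^>' "$fasta")       # number of FASTA header lines
    printf '%s\t%s\n' "$name" "$count"
done > "$workdir/summary.tsv"

# A subshell: the cd inside the parentheses does not change the
# working directory of the surrounding shell
(cd "$workdir" && cat summary.tsv)
```

The same loop works unchanged for two files or two thousand, which is the main reason to script repetitive tasks rather than run them by hand.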

What You Will Learn!

Knowledge and understanding of Unix commands for NGS data processing
Skill in using non-GUI tools for large-scale data processing
Smart tricks for using commands in NGS data handling
Command-line solutions to complex problems
Reproducibility and automation of repetitive tasks

Who Should Attend!

NGS data analysis beginners
Bio-sciences researchers and scientists who want to work more efficiently in bioinformatics