Florian Halbritter, 27-Nov-2012
In this chapter, we try to explain some of the fundamental concepts underlying GeneProf.
Datasets
GeneProf's workflows (cp. Concept 'Workflows') are centered around different types of datasets. There are six principal types of data: files, sequences, genomic regions, features, reference data and 'special' data. Each of these types is explained in more detail below.
The type of data determines in which way it may be processed in a workflow. For example, you won't be able to use a file directly as an input for a sequence alignment program or a sequence dataset for a tool that expects a set of genomic intervals to operate on. In this way, GeneProf helps you to formulate sensible analysis pipelines.
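The idea of type-checked connections can be sketched in a few lines of Python. This is only an illustration: the tool names and the table of accepted inputs are made up for this example and do not reflect GeneProf's actual implementation.

```python
from enum import Enum


class DataType(Enum):
    """The six principal dataset types described in this chapter."""
    FILE = "file"
    SEQUENCE = "sequence"
    GENOMIC_REGION = "genomic region"
    FEATURE = "feature"
    REFERENCE = "reference"
    SPECIAL = "special"


# Hypothetical table: which dataset types each (made-up) tool accepts as input.
ACCEPTED_INPUTS = {
    "parser": {DataType.FILE},
    "aligner": {DataType.SEQUENCE},
    "region_tool": {DataType.GENOMIC_REGION},
}


def can_connect(dataset_type: DataType, tool: str) -> bool:
    """Return True if a dataset of this type may be fed into the named tool."""
    return dataset_type in ACCEPTED_INPUTS.get(tool, set())
```

With this table, can_connect(DataType.FILE, "aligner") is False, mirroring the rule that a raw file cannot be aligned before it has been parsed.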
To make the distinction between different kinds of data easier, each data type is represented by a certain color and / or symbol throughout the interface. Please read on to find out more about which color represents which type of data.
File Data
Raw data files uploaded to the system are stored in file datasets. These have to be interpreted before they can be used. The process of interpreting a file in a way that makes sense to GeneProf is called 'parsing' the file (cp. SubConcept 'Parsing'). GeneProf provides a range of parser modules (cp. ModuleCategory 'Parsers') to translate raw files into a compliant data type.
Within workflows, file data is represented in purple.
Sequence Data
Sequence data arguably constitutes the core of GeneProf's scope. This data type comprises any type of nucleotide sequence library, foremost short read data produced by next-generation sequencing platforms. Sequence data may be single- or paired-end (or mate-paired) and may come with associated per-base quality values (cp. SubConcept 'Quality Scores and Formats') and possibly with a name per sequence (for small datasets only, e.g. chromosome or transcript sequences).
Sequence data can be subjected to a range of processing steps, such as quality control measures or alignment (cp. SubConcept 'Alignment'). Within workflows, sequence data is represented in green.
Genomic Region Data
Sets of genomic regions or intervals are stored as genomic region data. A genomic region is a specific position within the genome, characterized by a chromosome or scaffold, strand information, and a start and an end coordinate. Certain genomic regions have more than one start and end coordinate, namely when they span multiple genomic intervals.
Examples of genomic regions are genes and other transcriptionally active units that originate from a set of exons in the genome (an example of a genomic region spanning multiple intervals). Short read sequences that have been aligned to the genome, as well as transcription factor binding sites, are also genomic regions.
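To make this description concrete, a genomic region could be modelled roughly as follows. This is a sketch for illustration only: the class and field names are invented here, and the coordinate convention (1-based, inclusive) is an assumption rather than a statement about GeneProf's internal representation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GenomicRegion:
    chromosome: str                          # chromosome or scaffold name
    strand: str                              # "+" or "-"
    intervals: Tuple[Tuple[int, int], ...]   # one (start, end) pair per interval

    @property
    def start(self) -> int:
        """Left-most coordinate over all intervals."""
        return min(s for s, _ in self.intervals)

    @property
    def end(self) -> int:
        """Right-most coordinate over all intervals."""
        return max(e for _, e in self.intervals)

    @property
    def covered_length(self) -> int:
        """Bases covered, assuming 1-based inclusive, non-overlapping intervals."""
        return sum(e - s + 1 for s, e in self.intervals)


# An aligned short read occupies a single interval ...
read = GenomicRegion("chr1", "+", ((1000, 1035),))
# ... while a gene made up of three exons spans several intervals.
gene = GenomicRegion("chr1", "-", ((5000, 5200), (6000, 6150), (7000, 7400)))
```

Here the gene stretches from 5000 to 7400 on the chromosome, but only the three exon intervals count towards its covered length.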
Within workflows, genomic region data is represented in red.
Feature Data
Gene-centric data is stored in feature datasets. We prefer to talk about 'features' rather than 'genes', because feature data might also include information about other transcriptionally active units, such as short RNA or pseudo-gene loci. Feature data typically comes with a rich set of annotations sourced from the associated reference (cp. SubConcept 'Reference Data').
Feature data is very flexible and might comprise expression values, information about genes regulated by certain transcription factors and so on. Accordingly, the range of operations that can be applied to feature data is vast.
Within workflows, feature data is represented in blue.
Reference Data
Reference datasets comprise lists of known genes and features for a certain organism. Additionally, they contain the organism's genomic and transcriptomic sequence. They serve as a scaffold for many of GeneProf's operations, such as alignment (cp. SubConcept 'Alignment'), where sequences are placed on the reference genome, or as a basis for all feature datasets (cp. SubConcept 'Feature Data').
Within workflows, reference data is represented in gold.
Special Data
All data that does not fall into any of the categories outlined above is stored as 'special' data. Special datasets are typically used to store the results of summary statistics modules (see ModuleCategory 'Summary Statistics') and usually comprise a number of plots and / or files. Special data cannot be processed any further.
Within workflows, special data is represented in orange.
Experiments
GeneProf organizes data in so-called virtual experiments. Virtual experiments are very much like real, wet-lab experiments: they take some input material (in this case data, typically the result of one or more runs of a next-generation sequencing machine), do something with it (here, a data analysis pipeline, cp. Concept 'Workflows') and produce certain results.
Importantly, GeneProf experiments do not always have to deal with the results of just one wet-lab experiment. One might also create a meta-analysis experiment, in which one imports data from a variety of other experiments and then compares those in a novel manner.
In the lab, you might keep a lab book to keep track of the steps you've carried out in an experiment. Similarly, GeneProf records a history of what you do: all changes to the experiment's workflow are tracked, so you (and others) can come back later and see how the results were created.
You'll find more details about the individual components of an experiment below.
Input Data
Central to each experiment is a collection of input data. Typically, these will be the output file(s) produced by one or more runs on a next-generation sequencing platform. Input data can be provided to an experiment by uploading files from your local computer, by retrieving them from an external database or by importing data from another experiment (see Tutorial 'Uploading Data to GeneProf').
Uploaded files will be in file data format and need to be parsed prior to further analysis (cp. SubConcept 'File Data' and SubConcept 'Parsing'). If you import data from the Sequence Read Archive [1] or from another experiment, the sequences will be available directly in a useful format.
Sample Annotation
It is advisable to annotate the data uploaded to an experiment in as much detail as possible. Presumably, you will be aware of what the individual files you've uploaded are all about, but GeneProf will have a hard time guessing. Also, if other people are ever to look at your experiment, they need to know where your data came from in order to make sense of it.
Experimental annotation is grouped by samples. Samples are meant to be units conceptually distinct from input data, each representing one biological entity. For example, you might sequence several pools of RNA for a gene expression experiment, all taken from the same cell population. In this case, there should be one sample in GeneProf's annotation linked to several input files. Each sample may then be annotated in a number of ways, e.g. by cell type, developmental stage or treatment.
If you provide sufficient annotation, GeneProf's analysis wizards (cp. SubConcept 'Analysis Wizards') can make a more educated guess about how the data should be analyzed and you'll need to provide less information later on.
Main Results
A typical GeneProf experimental workflow will produce many results along the way. Most of those will be intermediate results, which are not of any major concern to most people. We have therefore introduced the concept of the main results of an experiment. The experiment creator can choose up to 30 datasets in an experiment (by linking them to the main outputs module in the workflow, cp. module 'Main Experimental Results') which constitute the most important findings.
The main results datasets will be included in the experiment's main overview page.
Finalization and Publication
GeneProf experiments are dynamic and can be altered and extended at any time. When other experiments refer to data from an experiment, however, it is necessary to know that these data remain what one thinks they are and do not change anymore. It is therefore advisable to finalize an experiment once you've completed an analysis and are satisfied with the results. After finalization, you cannot change the workflow in any way. You may, however, still modify the experiment's general properties, such as the name, citations and so on, as well as the sample annotation.
In order to finalize an experiment, it is necessary that you have (i) included at least one input dataset (cp. SubConcept 'Input Data'), (ii) provided at least rudimentary sample annotation (cp. SubConcept 'Sample Annotation'), (iii) marked the main outputs of the experiment (cp. SubConcept 'Main Results') and (iv) completed the analysis workflow. A workflow counts as complete if all modules have been processed, i.e. there must not be any pending or unconnected components remaining.
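The four conditions can be read as a simple checklist. The following Python sketch shows one way to express it. The Experiment class, its fields and the state names are hypothetical and only serve this illustration; they are not part of GeneProf.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Experiment:
    # All fields are made up for this sketch.
    input_datasets: List[str] = field(default_factory=list)
    annotated_samples: List[str] = field(default_factory=list)
    main_results: List[str] = field(default_factory=list)
    # One state per workflow module, e.g. "processed", "pending", "unconnected".
    module_states: List[str] = field(default_factory=list)


def finalization_problems(exp: Experiment) -> List[str]:
    """Return the unmet conditions; an empty list means 'ready to finalize'."""
    problems = []
    if not exp.input_datasets:
        problems.append("no input dataset")
    if not exp.annotated_samples:
        problems.append("no sample annotation")
    if not exp.main_results:
        problems.append("no main results marked")
    if any(state != "processed" for state in exp.module_states):
        problems.append("workflow incomplete")
    return problems
```

An experiment with inputs, annotation and main results but one pending module would be reported as "workflow incomplete" and could not be finalized yet.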
After finalization, a user may furthermore decide to make an experiment public. Public experiments and all associated data can be accessed and viewed by other users, and all users can use the data from these experiments in their own experiments. Moreover, assorted datasets from public experiments will be included in GeneProf's public data pages, i.e. the gene summary pages, the visual data explorer and the genome browser.
Workflows
Workflows are what GeneProf's all about really! A workflow, in our terms, is a collection of analysis steps that lead from a set of inputs to a set of outputs. GeneProf allows users to create workflows in two ways: one can either use an analysis wizard (cp. SubConcept 'Analysis Wizards'), which constructs an entire workflow for you, or one can build a workflow directly using the graphical workflow designer.
GeneProf workflows consist of instances of a range of workflow modules (cp. SubConcept 'Workflow Modules'). Each module takes inputs and produces outputs, and the outputs of one module may in turn be the inputs for the next. In this fashion, a complex analysis pipeline can be easily defined by combining simple(r) elements.
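The principle of chaining modules can be illustrated with a tiny Python sketch. It assumes a purely linear chain in which every module is a plain function; the three example "modules" are toy stand-ins invented here, not real GeneProf modules.

```python
from typing import Callable, List

# A module, in this sketch, is any function that turns one dataset into another.
Module = Callable[[object], object]


def run_workflow(modules: List[Module], data: object) -> object:
    """Feed the output of each module into the next one, in order."""
    for module in modules:
        data = module(data)
    return data


def parse(text):
    """Toy 'parser': split raw text into a list of reads."""
    return text.split()


def filter_short(reads, min_length=4):
    """Toy 'quality control': drop reads shorter than min_length."""
    return [r for r in reads if len(r) >= min_length]


def count(reads):
    """Toy 'summary statistic': number of reads left."""
    return len(reads)


result = run_workflow([parse, filter_short, count], "ACGTAC AC GGTTA")
```

In this example, three raw reads go in, the two-base read is filtered out, and the final module reports that two reads remain.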
Workflow Modules
The elements making up a workflow (cp. Concept 'Workflows') are instances of a range of so-called workflow modules. Workflow modules are small programs or algorithms that take a set of well-defined inputs, do something with them and present the outputs to the workflow (and user). Some (in fact, most) modules can be further customized by altering parameters.
GeneProf comes with a large array of modules (cp. Chapter 'Modules') for all the elementary steps in the analysis of next-gen data. Some of these modules have been developed by the GeneProf team themselves, others wrap externally developed software (i.e. they execute the external software and parse the output into GeneProf-compatible formats).
New modules are constantly being added and users are encouraged to request functionality they might be missing. Bioinformaticians and software developers might even choose to develop their own modules, which is fairly straightforward (please refer to Section 'Module Development: Adding new..' for more details).
Analysis Wizards
Wizards are handy tools for making the compilation of analysis workflows a little bit easier and quicker. A wizard will ask you just a few simple questions about the data at hand and the analysis you want to perform and then construct a workflow for you. Most parameters will be set to defaults that we have found useful in our own experience. You can always come back to the workflow later on and alter parameters or even replace entire parts of the workflow.
Alignment
As far as GeneProf is concerned, the term '(sequence) alignment' describes the process of placing short nucleotide sequences (such as short reads from high-throughput sequencing, cp. SubConcept 'Sequence Data') on a longer reference sequence (such as the chromosomes of a reference assembly).
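To give a feel for what 'placing' a read means, here is a deliberately naive Python sketch that reports every position at which a read matches the reference exactly. It is a toy under strong assumptions (exact matches only, forward strand only, 0-based positions). Real alignment programs tolerate mismatches and gaps and use indexed references, and this is not how GeneProf's alignment modules work.

```python
from typing import List


def find_exact_matches(read: str, reference: str) -> List[int]:
    """Return all 0-based start positions where read occurs in reference."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Continue searching one base further so overlapping hits are found too.
        start = reference.find(read, start + 1)
    return positions
```

For the reference "TTACGACGT", the read "ACG" is placed at positions 2 and 5, while a read that does not occur in the reference yields an empty list.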
Parsing
The term 'parsing' describes the process of reading a file (which has to be structured in a certain, defined way, i.e. it has to adhere to a known file format) and interpreting its contents in a way that is useful for the later processing of the file.
Accordingly, if you upload files to GeneProf (cp. SubConcept 'File Data'), they will need to be parsed into a GeneProf-compatible dataset (cp. Concept 'Datasets') first, in order to make GeneProf understand what the data is all about. GeneProf provides a range of parser modules (cp. ModuleCategory 'Parsers') to translate raw files into a compliant data type.
Paired-End and Mate-Pair Reads
The terms 'paired-end sequencing' and 'mate-pair sequencing' refer to protocols developed to sequence both ends of a single DNA/RNA molecule. Both terms are often used interchangeably, although they actually refer to slightly different methods. The distinction does not matter much as far as GeneProf is concerned. For more information please refer to this forum post on SEQanswers.
Quality Scores and Formats
Modern, high-throughput sequencing platforms typically produce, in addition to nucleotide sequences, a per-base quality score, which estimates how certain the sequencer is that each base in the sequence is actually what it says it is. The quality scores range from (approx.) 0 to 40, where higher numbers report better (i.e. more confident) base calls (cp. the Wikipedia article on Phred-like quality scores). For example, only 1 in 10,000 base calls with a quality score of 40 is wrong (accuracy = 99.99%), while 1 in 10 base calls with a score of 10 will be wrong (accuracy = 90%).
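These two examples follow from the usual Phred definition, in which a score Q corresponds to an error probability of 10^(-Q/10). A minimal Python sketch of that relationship (the function names are chosen for this example only):

```python
def error_probability(quality: float) -> float:
    """Probability that a base call with the given Phred score is wrong."""
    return 10 ** (-quality / 10)


def accuracy(quality: float) -> float:
    """Probability that a base call with the given Phred score is correct."""
    return 1 - error_probability(quality)
```

A score of 40 thus corresponds to an error probability of about 0.0001 (1 in 10,000) and a score of 10 to an accuracy of about 0.9, matching the figures above.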
Now the really confusing thing: short read sequencing data is most commonly distributed in the so-called FASTQ format. FASTQ files are essentially text files with four lines per sequence: (1) a header / sequence name, (2) the nucleotide sequence, (3) the header repeated or blank and (4) the per-base quality scores of the sequence encoded as characters. That's right: numbers are characters here! The symbols are to be read as ASCII characters, each of which has a well-defined integer number it can be translated into (cp. the Wikipedia article about ASCII).
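As a rough illustration, the four-line layout and the character-to-number translation can be sketched in Python as follows. The function names are invented for this example. Two details are assumptions taken from the standard FASTQ conventions rather than from this chapter: header lines start with '@' and the third line starts with '+', and an offset of 33 (the Sanger-style encoding) is used by default for decoding.

```python
from typing import Iterator, List, Tuple


def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality_string) for each four-line record."""
    stripped = [line.rstrip("\n") for line in lines]
    # Step through the lines four at a time; incomplete trailing records are ignored.
    for i in range(0, len(stripped) - 3, 4):
        header, sequence, separator, qualities = stripped[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        yield header[1:], sequence, qualities


def decode_qualities(quality_string: str, offset: int = 33) -> List[int]:
    """Translate each quality character into its numeric score."""
    return [ord(ch) - offset for ch in quality_string]
```

With the default offset, the quality string "II5!" decodes to the scores 40, 40, 20 and 0.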
But that's not yet the entire story! In fact, there are (at least) three different FASTQ formats, which all differ slightly in the way they translate numeric quality scores into ASCII symbols. Old Illumina/Solexa sequencers used a different format from later models (before and after software version 1.3) and there's also another encoding proposed by the Sanger Institute, which appears to be the preferred standard nowadays. GeneProf uses Sanger-style encoding and attempts to convert any other format into this scale during the parsing process (cp. SubConcept 'Parsing'). If you don't know the format of your FASTQ files, GeneProf will try to guess the encoding on the basis of the range of values it observes in the first few thousand sequences in your files. This usually works quite well, but if you do know the actual format of the files, give GeneProf a hand by specifying the format to be used in the parser modules (cp. ModuleCategory 'Parsers') or wizards (cp. PageCategory 'Wizard Pages'), will you?
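One way such a guess could work is to look at the lowest character code seen in the quality strings. The following Python sketch assumes the commonly documented ranges (Sanger-style scores start at ASCII 33, old Solexa scores at 59 and Illumina 1.3+ scores at 64). It is not GeneProf's actual detection code, and the function name and the returned labels are made up for this example.

```python
from typing import Iterable


def guess_fastq_encoding(quality_strings: Iterable[str]) -> str:
    """Guess the quality encoding from the lowest character code observed."""
    lowest = min((ord(ch) for q in quality_strings for ch in q), default=None)
    if lowest is None:
        raise ValueError("no quality values to inspect")
    if lowest < 59:
        return "sanger"          # characters below ';' only occur with offset 33
    if lowest < 64:
        return "solexa"          # old Solexa scores can dip below zero (offset 64)
    return "illumina-1.3+"       # nothing below '@' seen: assume offset 64
```

A heuristic like this can be fooled, for instance by a Sanger-encoded file that happens to contain only very high scores, which is one more reason to specify the format yourself whenever you know it.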
References
[1] Leinonen, R., Sugawara, H. and Shumway, M. The sequence read archive. Nucleic Acids Res. (2011). PMID: 21062823