GeneProf Manual


Florian Halbritter, 27-Nov-2012

Table of Contents

Chapter: Introduction

Chapter: Concepts Explained

Chapter: Tutorials

Chapter: Modules

Chapter: Pages

Chapter: Advanced Topics

Chapter: Frequently Asked Questions (FAQ)

Modules


GeneProf's analysis workflows are composed of different instances of so-called 'modules' (cp. SubConcept 'Workflow Modules'). The following pages describe the modules that currently exist, explain what each is good for and show how it can be used.

General Tools

Further tools of interest are listed in this category.

Main Experimental Results

Developer(s): Halbritter F
Version: 2011-03-15
Input(s): Experimental Results: Up to 50 datasets that you consider the main results of this experiment, e.g. a feature dataset containing differentially expressed genes or a list of potential targets for a transcription factor. Try to keep the number of selected datasets way below fifty, though, to avoid cluttering the experiment main page.
Output(s): None.
Description:
This module marks an assortment of datasets (up to 50) from an experiment as its main outputs (cp. SubConcept 'Main Results'). Ready-processed main result datasets will then be included and displayed in the experiment main page (cp. this page).
Configuration:
This module does not take any parameters and cannot be customized any further. You may, however, change the names of the outputs produced in the Input & Outputs tab.

Sequence Data Operations

These modules have been designed for filtering and manipulating sequence data.

Basic Sequence Annotations Filter

Developer(s): Halbritter F
Version: 2011-12-20
Related Citation(s): [17]
Input(s): Unfiltered Sequences: A set of sequences to apply a filter to.
Output(s): Compliant Sequences: Only those sequences that passed the filter criteria you specified.
Removed Sequences: All sequences that failed to meet the filter criteria.
Parameter(s): Filter Criteria: An arbitrary combination of filter criteria.
Disjunctive?: Do ALL of the conditions have to be satisfied (conjunctive) or only at least one (disjunctive) for a record to pass the criteria?
Name for Removed Sequences: A user-defined name that will be assigned to the output dataset from port Removed Sequences.
Name for Compliant Sequences: A user-defined name that will be assigned to the output dataset from port Compliant Sequences.
Description:
This module filters the rows of a dataset by testing each row against one or more filter criteria. The module produces two outputs: one dataset with all rows that passed ALL filter criteria and one dataset with all rows that failed at least one criterion.
Configuration:
To add a filter criterion, simply tick the checkbox next to the description of a column, select a filter operator and enter a value to test against. The filter operators available depend on the data type of the column to filter on: Numeric values can be compared in terms of equality (=), inequality (!=), greater (>) or less (<). For textual values, you can check for equality / inequality (= / !=), but also for sub-string matches (~), that is, whether the text in the column to be searched contains the string to be compared against. For example, 'nan' does not equal 'nanog', but the former is contained in the latter.
N.B. The filter interprets all filter criteria conjunctively, i.e. ALL filters have to be true in order for a row to pass the filter. If you want to use disjunctive filter criteria (condition A OR condition B) or more complex filters, you need to use the 'Complex Sequence Annotations Filter' module.
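The conjunctive behaviour described above can be illustrated with a minimal Python sketch. It is not GeneProf's implementation; the row layout (dictionaries), the column names and the helper names passes_all and split_rows are assumptions made for illustration only.

```python
import operator

# Operators named in the text: =, !=, >, < for numbers; =, !=, ~ for text.
OPERATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    "~": lambda value, needle: needle in value,  # sub-string match
}


def passes_all(row, criteria):
    """Return True only if the row satisfies every (column, op, value) criterion."""
    return all(OPERATORS[op](row[column], value) for column, op, value in criteria)


def split_rows(rows, criteria):
    """Split rows into (compliant, removed), mirroring the module's two outputs."""
    compliant = [row for row in rows if passes_all(row, criteria)]
    removed = [row for row in rows if not passes_all(row, criteria)]
    return compliant, removed


# Hypothetical example data: the column names are made up for this sketch.
rows = [
    {"name": "nanog", "length": 36},
    {"name": "nanog", "length": 20},
    {"name": "sox2", "length": 36},
]
criteria = [("name", "~", "nan"), ("length", ">", 30)]
compliant, removed = split_rows(rows, criteria)
```

Only the first row ends up in the compliant list, because it is the only one that meets both criteria at once.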

Complex Sequence Annotations Filter

Developer(s): Halbritter F
Version: 2011-12-20
Related Citation(s): [17]
Input(s): Unfiltered Sequences: A set of sequences to apply a filter to.
Output(s): Compliant Sequences: Only those sequences that passed the filter criteria you specified.
Removed Sequences: All sequences that failed to meet the filter criteria.
Parameter(s): Filter Criteria: An arbitrary combination of filter criteria.
Disjunctive?: Do ALL of the conditions have to be satisfied (conjunctive) or only at least one (disjunctive) for a record to pass the criteria?
Name for Removed Sequences: A user-defined name that will be assigned to the output dataset from port Removed Sequences.
Name for Compliant Sequences: A user-defined name that will be assigned to the output dataset from port Compliant Sequences.
Description:
This module filters the rows of a dataset by testing each row against one or more filter criteria. The module produces two outputs: one dataset with all rows that passed ALL filter criteria and one dataset with all rows that failed at least one criterion.
Configuration:
Complex filter criteria are defined by adding calculations that evaluate to a boolean value, i.e. that return true (the record passes the criterion) or false (the record fails the criterion).
Please refer to the documentation of module 'Calculate Additional Columns' to find out how to define calculations.

FASTX Toolkit: Artifacts Filter

Developer(s): Halbritter F (Module Developer); Gordon A, Hannon G (Software Developer)
Version: 2011-03-15 (FASTX Toolkit v0.0.12)
Related Web Page(s): FASTX-Toolkit Homepage
Input(s): Input Sequences: The input sequences.
Output(s): Complementary Sequences: Artifact-filtered sequences.
Parameter(s): Name for Complementary Sequences: A user-defined name that will be assigned to the output dataset from port Complementary Sequences.
Description:
This module wraps the 'fastx_artifacts_filter' program included in the FASTX-Toolkit (Gordon and Hannon). The module will attempt to remove sequencing artifacts from the input data.
Configuration:
Please refer to the FASTX Toolkit manual for a detailed explanation of all parameters.

FASTX Toolkit: Clip Adapter Sequences

Developer(s): Halbritter F (Module Developer); Gordon A, Hannon G (Software Developer)
Version: 2011-03-15 (FASTX Toolkit v0.0.12)
Related Web Page(s): FASTX-Toolkit Homepage
Input(s): Input Sequences: The input sequences (potentially contaminated with adapter sequences).
Output(s): Trimmed Sequences: Sequences cleaned from adapters.
Parameter(s): Adapter Sequence(s): From FASTX Toolkit documentation: ADAPTER string. default is CCTTAAGG (dummy adapter). In GeneProf, you can sequentially remove more than one adapter sequence in one module. Just separate multiple sequences by commas (e.g. AAA,TTT,CCC,..)
Min. Length: From FASTX Toolkit documentation: Discard sequences shorter than N nucleotides.
Keep Adapter plus..: From FASTX Toolkit documentation: Keep the adapter and N bases after it.
Discard non-clipped: From FASTX Toolkit documentation: Discard non-clipped sequences (i.e. - keep only sequences which contained the adapter).
Discard clipped: From FASTX Toolkit documentation: Discard clipped sequences (i.e. - keep only sequences which did not contained the adapter).
Report adapter-only: From FASTX Toolkit documentation: Report Adapter-Only sequences.
Keep N-sequences: From FASTX Toolkit documentation: keep sequences with unknown (N) nucleotides. default is to discard such sequences.
Name for Trimmed Sequences: A user-defined name that will be assigned to the output dataset from port Trimmed Sequences.
Description:
This module wraps the 'fastx_clipper' program included in the FASTX-Toolkit (Gordon and Hannon). The module will attempt to clean the sequences in the input dataset from adapter contamination: sometimes a part of the sequencing adapters will accidentally (or as part of the protocol) be sequenced alongside the probed nucleotide sequence itself. This 'adapter contamination' can hinder the subsequent analysis process. This program will therefore match each sequence to a known adapter sequence and -- if an overlap has been found -- remove this part of the sequence.
Configuration:
Please refer to the FASTX Toolkit manual for a detailed explanation of all parameters.

FASTX Toolkit: Reverse Complement

Developer(s): Halbritter F (Module Developer); Gordon A, Hannon G (Software Developer)
Version: 2011-03-15 (FASTX Toolkit v0.0.12)
Related Web Page(s): FASTX-Toolkit Homepage
Input(s): Input Sequences: The input sequences.
Output(s): Complementary Sequences: Reverse-complemented sequences.
Parameter(s): Name for Complementary Sequences: A user-defined name that will be assigned to the output dataset from port Complementary Sequences.
Description:
This module wraps the 'fastx_reverse_complement' program included in the FASTX-Toolkit (Gordon and Hannon). The module will iterate over the entire input dataset and, for each sequence, return its reverse complement, i.e. the order of the nucleotides is reversed and each nucleotide is transformed to its respective counterpart: A to T, T to A, C to G and G to C.
Configuration:
Please refer to the FASTX Toolkit manual for a detailed explanation of all parameters.
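As an illustration of what a reverse complement is, here is a minimal Python sketch. It does not call the FASTX-Toolkit; the function name reverse_complement is chosen for this example, and leaving characters other than A, C, G and T (such as N) unchanged is an assumption of the sketch.

```python
# Translation table mapping each nucleotide to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Complement every base, then reverse the order of the bases."""
    return sequence.translate(COMPLEMENT)[::-1]


print(reverse_complement("AACG"))  # prints CGTT
```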

Merge Sequence Data

Developer(s): Halbritter F
Version: 2012-06-15
Related Citation(s): [17]
Input(s): Original Sequences: Two or more sequence data sets.
Output(s): Merged Sequences: A set of sequences containing all entries from the input data sets.
Parameter(s): Missing Q-Scores?: If some input sequences have quality scores, but others don't, one can either add mock quality scores where missing or remove quality scores altogether.
Name for Merged Sequences: A user-defined name that will be assigned to the output dataset from port Merged Sequences.
Description:
This module merges several sequence datasets into one. If at least one dataset has quality scores (cp. SubConcept 'Quality Scores and Formats'), the reads of all datasets that are missing quality scores will be complemented with mock values indicating a perfect quality per read.
Configuration:
This module does not take any parameters and cannot be customized any further. You may, however, change the names of the outputs produced in the Input & Outputs tab.

Modify and Filter Sequences

Developer(s): Halbritter F
Version: 2011-09-14
Related Citation(s): [17]
Input(s): Original Sequences: A set of sequences which you would like to apply a filter to.
Output(s): Modified Sequences: Only those sequences that passed your filter criteria, modified in the way you specified.
Failed Sequences: All those sequences that failed to meet the filter criteria.
Parameter(s): Trim Right: Trim a certain number of nucleotides from the end of every sequence. For paired-end / mate-pair sequences, trim this number of nucleotides off the end of each sequence in the pair separately.
Trim Left: Trim a certain number of nucleotides from the front of every sequence. For paired-end / mate-pair sequences, trim this number of nucleotides off the beginning of each sequence in the pair separately.
Fix Length: Trim sequences to a fixed length. Shorter sequences will remain unchanged. For paired-end / mate-pair sequences, each sequence in the pair will be trimmed to the fixed length separately.
Cut after N: Trim every sequence from the first occurrence of an uncertain nucleotide (N). For paired-end / mate-pair sequences, each sequence in the pair will be trimmed separately, thus the result might be sequences with different lengths for each member of the pair.
Cut after low quality: Trim every sequence from the first occurrence of a certain (low) quality value. For paired-end / mate-pair sequences, each sequence in the pair will be trimmed separately, thus the result might be sequences with different lengths for each member of the pair.
Append: Append a fixed sequence to the end of each input sequence. For paired-end / mate-pair sequences, the suffix will be appended to both sequences.
Prepend: Attach a fixed sequence to the front of each sequence. For paired-end / mate-pair sequences, the prefix will be prepended to both sequences.
Cumulative Quality Score >: The sum of all quality scores per read has to be at least as large as this value in order to pass the filter. For paired-end / mate-pair sequences, the score is calculated for each read separately and the pair is only accepted if both members pass the threshold.
Minimum Quality Score >: The minimum per-base quality score in each read has to be at least as large as this value in order to pass the filter. For paired-end / mate-pair sequences, the score is calculated for each read separately and the pair is only accepted if both members pass the threshold.
Mean Quality Score >: The average per-base quality score in each read has to be at least as large as this value in order to pass the filter. For paired-end / mate-pair sequences, the score is calculated for each read separately and the pair is only accepted if both members pass the threshold.
Sequence Complexity >: Highly homogeneous sequences (e.g. only T's) are likely to result from errors and may be discarded. For paired-end / mate-pair sequences, the score is calculated for each read separately and the pair is only accepted if both members pass the threshold.
Discard N-Reads: Discard all reads containing uncertain nucleotides (N). For paired-end / mate-pair sequences, the entire pair will be discarded if either sequence contains an N.
Sequence Length >: Discard all reads that are shorter than a given length. For paired-end / mate-pair sequences, the entire pair will be discarded if either sequence is too short.
Trim leading..: Trim all leading (i.e. those at the left of the reads) occurrences of a certain nucleotide.
Trim trailing: Trim all trailing (i.e. those at the right of the reads) occurrences of a certain nucleotide.
Name for Failed Sequences: A user-defined name that will be assigned to the output dataset from port Failed Sequences.
Name for Modified Sequences: A user-defined name that will be assigned to the output dataset from port Modified Sequences.
Description:
Although error levels for high-throughput sequencing technologies from all manufacturers have been reported to be fairly low, it is important to realize that not everything that comes out of the other end is magic! We believe it is in general a good idea to attempt to get rid of sequences that are likely to arise from errors in the sequencing process and previous sample preparation. Sometimes it might also make sense to apply a transformation globally to all sequences in your dataset. For example, you might like to get rid of the low-quality portion of a read, while keeping the good part of it. The General Sequence Statistics module can help you to get a feel for your sequencing data and might highlight concerns that you can then address using this module (cp. module 'General Sequence Statistics'). Another example would be data obtained using modifications of the SAGE method (DeepSAGE, Tag-Seq, Digital Transcriptomics), which produce sequences of a fixed length.
Whatever the reason, this module will filter and / or modify all sequences in a given dataset. The module iterates the sequence data one sequence at a time, applies the chosen modification to the sequence (if any) and subsequently tests the modified sequence against the chosen filters (if any). The output of the module is split into two datasets, one dataset with all modified sequences that passed the filter criteria and one dataset with the (modified) sequences that failed the criteria.
Configuration:
Customization of this module is really fairly straightforward: Each modification and each filter can be activated / deactivated by ticking the checkbox next to it. Deactivated parameters will be greyed out.
The parameters are split into two categories (each in its own green box): Modifications and filters. Once activated (by ticking the checkbox for any one parameter), you can change its value using the slider control (just drag and drop it over to get the right value) or enter the desired value in the respective text box. If you enter an invalid value, the text box will be highlighted red and you will not be allowed to save your changes, so you can easily go back and correct your input.
Take care when setting the parameters for filters: The filters will be applied AFTER the sequences have been modified, so the average quality scores and other measures might be different from before modification.
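The order of operations (modify first, filter second) can be made concrete with a small Python sketch. It is an illustration only, not GeneProf's code: reads are represented as (sequence, quality list) tuples and the function names are made up for this example. Only trimming and the mean quality filter are shown.

```python
def trim(read, left=0, right=0):
    """Trim bases and their quality scores from both ends of a read."""
    sequence, qualities = read
    end = max(left, len(sequence) - right)
    return sequence[left:end], qualities[left:end]


def mean_quality(read):
    """Average per-base quality; an empty read scores 0."""
    _, qualities = read
    return sum(qualities) / len(qualities) if qualities else 0.0


def modify_and_filter(reads, left, right, min_mean_quality):
    """Apply the modification, then test the MODIFIED read against the filter."""
    passed, failed = [], []
    for read in reads:
        modified = trim(read, left, right)
        if mean_quality(modified) >= min_mean_quality:
            passed.append(modified)
        else:
            failed.append(modified)
    return passed, failed


# A read whose last two bases are of low quality (hypothetical values).
reads = [("ACGTAC", [30, 30, 30, 30, 5, 5])]
print(modify_and_filter(reads, 0, 2, 25))  # trimmed read passes
print(modify_and_filter(reads, 0, 0, 25))  # untrimmed read fails
```

The same read passes or fails depending on the trimming, because the mean quality is computed after the low-quality tail has been removed.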

Random Sample of Sequences

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): Sequences: A dataset of sequences from which you would like to take random samples.
Output(s): Subset of Sequences: A random subset of the input data.
Parameter(s): Sample Size: How many random samples should be included in the output?
Name for Subset of Sequences: A user-defined name that will be assigned to the output dataset from port Subset of Sequences.
Description:
This module selects a random subset of the records in the input dataset. Warning: If you re-run the same workflow process, the results will be different each time, i.e. using this module might confound the reproducibility of subsequent workflow steps!
Configuration:
Simply use the slider control to select the size of the random sample you'd like to create. Use drag & drop to move the slider handle left or right.
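To see why re-running a random sampling step changes the result, consider the following Python sketch. It is not part of GeneProf, and the seed argument is a hypothetical addition for illustration: the module as documented offers no such option.

```python
import random


def random_subset(records, sample_size, seed=None):
    """Draw sample_size records without replacement.

    With seed=None every call yields a different subset; a fixed seed
    (a hypothetical option, not a GeneProf parameter) makes it repeatable.
    """
    rng = random.Random(seed)
    size = min(sample_size, len(records))
    return rng.sample(records, size)


records = list(range(100))
first = random_subset(records, 10, seed=1)
second = random_subset(records, 10, seed=1)
print(first == second)  # prints True
```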

Select Sequences for Features

Developer(s): Halbritter F
Version: 2013-06-27
Related Citation(s): [17]
Input(s): Sequences: A dataset of sequences whose internal identifiers are referred to by a column in the second input dataset.
Features: A dataset with a column referring to identifiers from the first input dataset.
Output(s): Subset of Sequences: A subset of the first input defined by the identifiers given as a parameter.
Parameter(s): ID Column: The column in the second input dataset which refers to internal identifiers in the first dataset.
Inverse Selection: Select all entries that do NOT match the given identifiers instead?
Name for Subset of Sequences: A user-defined name that will be assigned to the output dataset from port Subset of Sequences.
Description:
This module selects a subset of the given sequences dataset: those sequences whose internal identifiers are listed in a column of the features dataset given as the second input.
Configuration:
To configure this module, you need to tell GeneProf which column in the second input (the features dataset) corresponds to the identifiers in the first input (the sequences dataset) -- that should be a column containing integer numbers.
You may invert the selection, i.e. select all entries NOT having a corresponding ID, by ticking the checkbox labelled Inverse Selection.
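The selection logic, including the inverse option, can be sketched in a few lines of Python. This is an illustration under assumptions: sequences are dictionaries with an "id" key, and the ID column is called "seq_id"; neither name comes from GeneProf.

```python
def select_by_ids(sequences, feature_rows, id_column, inverse=False):
    """Keep sequences whose ID appears in id_column (or does not, if inverse)."""
    wanted = {row[id_column] for row in feature_rows}
    return [seq for seq in sequences if (seq["id"] in wanted) != inverse]


# Hypothetical example data.
sequences = [
    {"id": 1, "seq": "ACGT"},
    {"id": 2, "seq": "TTTT"},
    {"id": 3, "seq": "GGCC"},
]
features = [{"seq_id": 1}, {"seq_id": 3}]

selected = select_by_ids(sequences, features, "seq_id")
others = select_by_ids(sequences, features, "seq_id", inverse=True)
```

Here selected holds the sequences with IDs 1 and 3, while others holds the sequence with ID 2.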

Select Sequences for Regions

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): Sequences: A dataset of sequences whose internal identifiers are referred to by a column in the second input dataset.
Regions: A dataset with a column referring to identifiers from the first input dataset.
Output(s): Subset of Sequences: A subset of the first input defined by the identifiers given as a parameter.
Parameter(s): ID Column: The column in the second input dataset which refers to internal identifiers in the first dataset.
Inverse Selection: Select all entries that do NOT match the given identifiers instead?
Name for Subset of Sequences: A user-defined name that will be assigned to the output dataset from port Subset of Sequences.
Description:
This module selects a subset of the given sequences dataset: those sequences whose internal identifiers are listed in a column of the genomic regions dataset given as the second input.
Configuration:
To configure this module, you need to tell GeneProf which column in the second input (the genomic regions dataset) corresponds to the identifiers in the first input (the sequences dataset) -- that should be a column containing integer numbers.
You may invert the selection, i.e. select all entries NOT having a corresponding ID, by ticking the checkbox labelled Inverse Selection.

Select Sequences for Sequences

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): Sequences: A dataset of sequences whose internal identifiers are referred to by a column in the second input dataset.
Sequences with IDs: A dataset with a column referring to identifiers from the first input dataset.
Output(s): Subset of Sequences: A subset of the first input defined by the identifiers given as a parameter.
Parameter(s): ID Column: The column in the second input dataset which refers to internal identifiers in the first dataset.
Inverse Selection: Select all entries that do NOT match the given identifiers instead?
Name for Subset of Sequences: A user-defined name that will be assigned to the output dataset from port Subset of Sequences.
Description:
This module selects a subset of the given sequences dataset: those sequences whose internal identifiers are listed in a column of the second sequences dataset given as the second input.
Configuration:
To configure this module, you need to tell GeneProf which column in the second input (the second sequences dataset) corresponds to the identifiers in the first input (the first sequences dataset) -- that should be a column containing integer numbers.
You may invert the selection, i.e. select all entries NOT having a corresponding ID, by ticking the checkbox labelled Inverse Selection.

Separate Mate Sequences

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): Original Sequences: A dataset of mate-pair / paired-end sequences.
Output(s): First Mates: The first mate / paired-end sequences from the input data.
Second Mates: The second mate / paired-end sequences from the input data.
Parameter(s): Name for First Mates: A user-defined name that will be assigned to the output dataset from port First Mates.
Name for Second Mates: A user-defined name that will be assigned to the output dataset from port Second Mates.
Description:
This module separates a mate-paired / paired-end short read sequence dataset into two datasets, one for each subsequence. If the input dataset is not mate-paired, the second output will be empty and the first identical to the input data.
Configuration:
This module does not take any parameters and cannot be customized any further. You may, however, change the names of the outputs produced in the Input & Outputs tab.

Split Sequences into Mate Pairs

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): Original Sequences: A set of concatenated mate-pair sequences which you would like to split into separate mate-pair sequences.
Output(s): Split Sequences: A sequence dataset in which the individual sequences have been split into mate-pairs / paired-end sequences.
Parameter(s): Split 50-50?: Split each sequence in half (first half = first mate, second half = second mate).
Length 1st Mate: The desired length (N) of the first mate-pair / paired-end sequence in base pairs. The first N nucleotides of each sequence will be used.
Length 2nd Mate: The desired length (N) of the second mate-pair / paired-end sequence in base pairs. The last N nucleotides of each sequence will be used.
Name for Split Sequences: A user-defined name that will be assigned to the output dataset from port Split Sequences.
Description:
This module takes a short read sequence dataset and splits each individual sequence into two. Sometimes the output of a mate-pair / paired-end sequencing run may be delivered to you in one file, so you can use this module to recover the mate pair separation.
Configuration:
By default, sequences will be split in the middle (50:50), but you can change this behaviour by defining the length of the first and second member of each pair yourself. To do so, change the radio button selection to 'Split sequences by fixed length' and use the slider controls to define the length of the subsequences (use drag & drop to move the slider handle).
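The two splitting modes can be sketched in Python as follows. This is an illustration, not GeneProf's implementation; the function name split_mates is made up, and giving the extra base of an odd-length sequence to the second mate is an assumption of this sketch.

```python
def split_mates(sequence, first_length=None, second_length=None):
    """Split one concatenated read into (first mate, second mate).

    Without explicit lengths the read is split in the middle (50:50).
    With lengths, the first N and the last M nucleotides are used.
    """
    if first_length is None or second_length is None:
        half = len(sequence) // 2
        return sequence[:half], sequence[half:]
    return sequence[:first_length], sequence[len(sequence) - second_length:]


print(split_mates("AAAACCCC"))          # prints ('AAAA', 'CCCC')
print(split_mates("AAAATTCCCC", 4, 4))  # prints ('AAAA', 'CCCC')
```

In the second call the two middle bases (TT) belong to neither mate, since only the first four and the last four nucleotides are used.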

Genomic Region Operations

These modules have been designed for filtering and manipulating genomic region data.

BEDTools: intersectBed

Developer(s): Halbritter F (Module Developer); Quinlan A, Hall I, Barnett D, Assaf G, Doktor T, Clark R, Alkan C, Arner E (Software Developer)
Version: 2012-11-14 (BEDTools v2.10.1)
Related Citation(s): [13]
Related Web Page(s): BEDTools Homepage
Input(s): Genomic Regions 1: The first set of genomic regions to be intersected.
Genomic Regions 2: The second set of genomic regions to be intersected.
Output(s): Intersected Genomic Regions: The intersection of the regions in the input datasets.
Parameter(s): Min. Overlap: From BEDTools documentation: Minimum overlap required as a fraction of A. Default is 1E-9 (i.e. 1bp).
Overlaps Reciprocal: From BEDTools documentation: Require that the fraction of overlap be reciprocal for A and B. In other words, if -f is 0.90 and -r is used, this requires that B overlap at least 90% of A and that A also overlaps at least 90% of B.
Stranded: From BEDTools documentation: Force “strandedness”. That is, only report hits in B that overlap A on the same strand. By default overlaps are reported without respect to strand.
Name for Intersected Genomic Regions: A user-defined name that will be assigned to the output dataset from port Intersected Genomic Regions.
Description:
From the BEDTools manual: By far, the most common question asked of two sets of genomic features is whether or not any of the features in the two sets 'overlap' with one another. This is known as feature intersection. intersectBed allows one to screen for overlaps between two sets of genomic features. Moreover, it allows one to have fine control as to how the intersections are reported. intersectBed works with both BED/GFF/VCF and BAM files as input.
Configuration:
Please refer to the BEDTools manual for a detailed explanation of all parameters.
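The effect of the Min. Overlap and Overlaps Reciprocal parameters can be illustrated with a small Python sketch. It does not call BEDTools and is not its implementation; intervals are assumed to be half-open (start, end) pairs on the same chromosome, and the function names are chosen for this example.

```python
def overlap_length(a, b):
    """Number of bases shared by two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def overlaps(a, b, min_fraction=1e-9, reciprocal=False):
    """Test whether b overlaps at least min_fraction of a.

    With reciprocal=True, a must also overlap at least min_fraction of b.
    """
    shared = overlap_length(a, b)
    if shared == 0:
        return False
    covers_a = shared / (a[1] - a[0]) >= min_fraction
    if not reciprocal:
        return covers_a
    return covers_a and shared / (b[1] - b[0]) >= min_fraction


a, b = (0, 100), (50, 1000)
print(overlaps(a, b, min_fraction=0.5))                   # prints True
print(overlaps(a, b, min_fraction=0.5, reciprocal=True))  # prints False
```

The 50 shared bases cover half of A but only about 5% of B, so the pair is reported only when the reciprocal requirement is switched off.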

Basic Genomic Region Annotations Filter

Developer(s): Halbritter F
Version: 2011-12-20
Related Citation(s): [17]
Input(s): Unfiltered Regions: A set of genomic regions to apply a filter to.
Output(s): Compliant Regions: Only those regions that passed the filter criteria you specified.
Removed Regions: All regions that failed to meet the filter criteria.
Parameter(s): Filter Criteria: An arbitrary combination of filter criteria.
Disjunctive?: Do ALL of the conditions have to be satisfied (conjunctive) or only at least one (disjunctive) for a record to pass the criteria?
Name for Compliant Regions: A user-defined name that will be assigned to the output dataset from port Compliant Regions.
Name for Removed Regions: A user-defined name that will be assigned to the output dataset from port Removed Regions.
Description:
This module filters the rows of a dataset by testing each row against one or more filter criteria. The module produces two outputs: one dataset with all rows that passed ALL filter criteria and one dataset with all rows that failed at least one criterion.
Configuration:
To add a filter criterion, simply tick the checkbox next to the description of a column, select a filter operator and enter a value to test against. The filter operators available depend on the data type of the column to filter on: Numeric values can be compared in terms of equality (=), inequality (!=), greater (>) or less (<). For textual values, you can check for equality / inequality (= / !=), but also for sub-string matches (~), that is, whether the text in the column to be searched contains the string to be compared against. For example, 'nan' does not equal 'nanog', but the former is contained in the latter.
N.B. The filter interprets all filter criteria conjunctively, i.e. ALL filters have to be true in order for a row to pass the filter. If you want to use disjunctive filter criteria (condition A OR condition B) or more complex filters, you need to use the 'Complex Genomic Region Annotations Filter' module.

Calculate Additional Columns (Region Data)

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): Original Regions: The original dataset with annotations.
Output(s): Extended Regions: The original dataset annotations extended by all additional columns you decided to calculate.
Parameter(s): Calculated Columns: Definitions for new columns to be calculated.
Column Labels: The labels of the new columns.
Name for Extended Regions: A user-defined name that will be assigned to the output dataset from port Extended Regions.
Description:
This module extends the annotations available for a dataset by adding one or more additional columns, calculated on the basis of the columns already existing. You can define arbitrary calculations using a collection of common arithmetic and statistical operators, values from all columns in the dataset and constants.
Configuration:
To add a new column you will have to define the calculation to be used to fill this column with values. The config dialog lists all operators available to use for this purpose. Operators are categorised into row and column operators. Row operators work on a per-row basis (i.e. per record in the dataset), while column operators consider all values in an entire column instead. You can refer to the value in a specific column using a wildcard. Wildcards are introduced with a dollar sign ($) followed by the identifier of an annotation column. All available wildcards are listed in the config dialog. After adding a new column calculation, you can rename the new column by entering a label into the respective textbox. Let's look at some examples:
Example: Transform all numbers in a column to logarithmic scale. Say our dataset contains a numeric annotation column A. We can refer to the values of this column using the wildcard $A. The relevant operator for calculating the logarithm is LOG, which takes two arguments, a value to operate on and the base of the logarithm to be used. So, to take the logarithm to the base of 2 of all values in column A, we use the operation: LOG($A,2)
Example: Transform all numbers in a logarithmic column to normal scale. Say our dataset contains a numeric annotation column B, which contains numbers on a logarithmic scale. We can refer to the values of this column using the wildcard $B. To convert from a logarithmic (base 2) scale to the normal scale, we calculate 2 to the power of each value. The relevant GeneProf operator is POW, which takes two arguments, the base and the exponent. Thus, we use the operation: POW(2,$B)
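For readers who want to check the arithmetic outside of GeneProf, the two example calculations correspond to the following Python sketch. The function names log_column and pow_column are made up for illustration; they mimic LOG($A,2) and POW(2,$B) applied to every row of a column, but they are not GeneProf operators.

```python
import math


def log_column(values, base=2):
    """Equivalent of LOG($A, base) applied row by row."""
    return [math.log(value, base) for value in values]


def pow_column(values, base=2):
    """Equivalent of POW(base, $B) applied row by row."""
    return [base ** value for value in values]


column_a = [1, 2, 8, 1024]          # hypothetical values on the normal scale
column_b = log_column(column_a)     # now on a log2 scale
restored = pow_column(column_b)     # back on the normal scale
```

Applying the second calculation to the result of the first recovers the original values, up to floating-point rounding.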

Complex Genomic Region Annotations Filter

Developer(s): Halbritter F
Version: 2011-12-20
Related Citation(s): [17]
Input(s): Unfiltered Regions: A set of genomic regions to apply a filter to.
Output(s): Compliant Regions: Only those regions that passed the filter criteria you specified.
Removed Regions: All regions that failed to meet the filter criteria.
Parameter(s): Filter Criteria: An arbitrary combination of filter criteria.
Disjunctive?: Do ALL of the conditions have to be satisfied (conjunctive) or only at least one (disjunctive) for a record to pass the criteria?
Name for Compliant Regions: A user-defined name that will be assigned to the output dataset from port Compliant Regions.
Name for Removed Regions: A user-defined name that will be assigned to the output dataset from port Removed Regions.
Description:
This module filters the rows of a dataset by testing each row against one or more filter criteria. The module produces two outputs: one dataset with all rows that passed ALL filter criteria and one dataset with all rows that failed at least one criterion.
Configuration:
Complex filter criteria are defined by adding calculations that evaluate to a boolean value, i.e. that return true (the record passes the criterion) or false (the record fails the criterion).
Please refer to the documentation of module 'Calculate Additional Columns (..' to find out how to define calculations.

Drop Region Annotation Columns

Developer(s): Halbritter F
Version: 2012-02-21
Input(s): Original Regions: The original dataset with annotations.
Output(s): Truncated Regions: The original dataset annotations retaining only the annotations you want.
Parameter(s): Kept Columns: A selection of all columns that should be kept.
Name for Truncated Regions: A user-defined name that will be assigned to the output dataset from port Truncated Regions.
Description:
This module drops annotation columns that are no longer required from a dataset.
Configuration:
To configure this module, simply select all columns that you would like to keep in the dataset by ticking the checkboxes next to them.

Extract Sequences from Regions

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): Genomic RegionsThe regions you would like to get sequences for.
Output(s): DNA SequencesThe DNA sequences of the stretches covered by the genomic intervals in the input data.
Parameter(s): Sequence NameSet the name of extracted sequences to..
Name for DNA SequencesA user-defined name that will be assigned to the output dataset from port DNA Sequences.
Description:
Extracts the DNA sequences from a set of genomic regions. If the genomic regions are complex, i.e. they consist of more than one interval (e.g. exons), the sequences for each of these intervals will be extracted separately and then concatenated. If that's not what you want, you will have to simplify the regions first using the 'Modify Genomic Regions' module.
Configuration:
Coming soon.
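The concatenation behaviour described above can be illustrated with a minimal Python sketch. It assumes 0-based, half-open interval coordinates and forward-strand regions only; the function name and the toy chromosome are purely illustrative and are not part of GeneProf.

```python
def extract_region_sequence(chromosome_seq, intervals):
    """Concatenate the sequences of all intervals belonging to one region.

    intervals: list of (start, end) pairs, assumed 0-based and half-open.
    """
    pieces = []
    for start, end in sorted(intervals):
        pieces.append(chromosome_seq[start:end])
    return "".join(pieces)


chromosome = "AAACCCGGGTTTAAACCC"
exons = [(9, 12), (0, 3)]  # one complex region made of two intervals

# Each interval is extracted separately, then the pieces are joined.
spliced = extract_region_sequence(chromosome, exons)

# A simplified region (one block from first start to last end) keeps the gap.
simplified = extract_region_sequence(chromosome, [(0, 12)])
```

The second call shows why simplifying the regions first changes the result: the stretch between the two intervals is then part of the extracted sequence.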

Merge Genomic Region Data

Developer(s): Halbritter F
Version: 2013-06-12
Related Citation(s): [17]
Input(s): Original RegionsTwo or more region data sets.
Output(s): Merged RegionsA set of genomic regions containing all entries from the input data sets.
Parameter(s): Merge Regions?Should neighbouring and overlapping regions be merged?
Merge DistanceMerge all regions within this number of basepairs from each other
Simplify RegionsIgnore the internal interval structure of the regions and merge them all in one (i.e. exons or similar structures will be discarded and each region simply reaches from the (merged) start to end).
Distinct Reads?For short read alignment datasets, keep only distinct regions (e.g. to avoid amplification errors)?
Name for Merged RegionsA user-defined name that will be assigned to the output dataset from port Merged Regions.
Description:
This module combines several genomic region datasets into one. In doing so, GeneProf will look at the physical locations (genomic coordinates) of the regions in the individual datasets and combine the annotations for identical or overlapping regions into one record in the dataset. Overlaps may be 'windowed', that is, rather than looking for a strict overlap, regions within a certain distance of each other may be merged together. Output regions consisting of multiple, merged regions will usually be reported as a region of several intervals, but you may also choose to 'simplify' these regions, merging them into one long block.
Configuration:
Firstly, you may choose to simplify output regions, that is, if a region consists of multiple intervals, simplification will discard the interval structure and only remember the region's start and end.
Next, you can decide to merge overlapping regions. If you deactivate this option, only exactly identical regions from all input datasets will be combined (they must have exactly the same chromosome, start and end coordinate and strand), otherwise all regions that overlap in at least one base pair will be merged. You can be more permissive in calculating these overlaps by allowing a certain window in which to look for overlaps. We've found this quite useful, for example, to compare ChIP-binding peaks that colocalise within a certain distance of each other (cp. MTL's in [5]). You can configure the window size using the slider control, just drag & drop the slider to the desired window size (or to 0 in order to allow only actual, direct overlaps of regions).

Modify Genomic Regions

Developer(s): Halbritter F
Version: 2011-06-08
Related Citation(s): [17]
Input(s): Original RegionsA set of genomic regions which you would like to modify.
Output(s): Modified RegionsThe same genomic regions as in the input with the desired modifications applied to them.
Parameter(s): Expand UpstreamExpand each region by a fixed number of base pairs upstream of its start coordinate (i.e. to the left, if on the forward strand, to the right if on the reverse strand).
Expand DownstreamExpand each region by a fixed number of base pairs downstream of its end coordinate (i.e. to the right, if on the forward strand, to the left if on the reverse strand).
Fix LengthFix the length of each region to a certain number of base pairs around the center of the region.
Min. LengthExpand the region to this length in both directions if it is shorter than N bp in length.
Merge Regions?Should neighbouring and overlapping regions be merged?
Merge DistanceMerge all regions within this number of basepairs from each other
Simplify RegionsIgnore the internal interval structure of the regions and merge them all in one (i.e. exons or similar structures will be discarded and each region simply reaches from the (merged) start to end).
Name for Modified RegionsA user-defined name that will be assigned to the output dataset from port Modified Regions.
Description:
This module modifies all regions in a genomic region dataset by extending, merging and / or simplifying the regions.
Configuration:
You may decide to expand each region upstream or downstream (or both) or to fix the length of each region to an arbitrary length. Fixing the length will alternately extend / trim the region by 1bp downstream / upstream until the desired length is reached. Activate either option by ticking the checkbox in front of it; afterwards you can change the width by which the regions are to be modified using the slider control (drag and drop the handle to increase / decrease the value).
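As a rough illustration of strand-aware expansion and of the alternating length fix, here is a minimal Python sketch. It assumes 0-based, half-open coordinates; the function names are hypothetical, the edge that moves first is an arbitrary choice (the manual does not specify it), and the length fix ignores the strand for simplicity.

```python
def expand(start, end, strand, upstream=0, downstream=0):
    """Expand a region; upstream means left on '+', right on '-'."""
    if strand == "+":
        return start - upstream, end + downstream
    return start - downstream, end + upstream


def fix_length(start, end, target):
    """Move the right and left edge alternately by 1 bp until the
    region is exactly `target` bp long (target must not be negative)."""
    move_right_edge = True  # assumption: which edge goes first is not documented
    while end - start != target:
        step = 1 if end - start < target else -1  # extend or trim
        if move_right_edge:
            end += step
        else:
            start -= step
        move_right_edge = not move_right_edge
    return start, end
```

For example, expand(100, 200, "-", upstream=10, downstream=5) moves the right edge by 10 bp and the left edge by 5 bp, because upstream of a reverse-strand region lies to the right. The sketch does not clamp coordinates at chromosome ends.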
Furthermore, you can choose to simplify the regions, that is, to discard the internal interval structure of regions (each genomic region may consist of more than one stretch along a chromosome, cp. SubConcept 'Genomic Region Data'), and you may merge overlapping and neighbouring regions. To merge regions, tick the checkbox in front and decide on a window size for merging: 0 means 'only merge regions that directly overlap each other in at least one basepair' and, for example, 100 means 'merge all regions that, if they were extended by 50bp in each direction, would overlap by at least 1bp' (N.B. unless you also select to simplify regions, this might create regions consisting of several intervals).
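The window semantics can be made concrete with a short Python sketch. It assumes 0-based, half-open coordinates and always produces simplified output (one block per merged region); the function name is illustrative and this is not GeneProf's actual implementation.

```python
def merge_regions(regions, window=0):
    """Merge (chromosome, start, end) regions on the same chromosome.

    Two regions are merged if they would overlap by at least 1 bp after
    each has been extended by window/2 in both directions, which is the
    same as requiring that the gap between them is smaller than `window`.
    """
    merged = []
    for chrom, start, end in sorted(regions):
        if merged:
            last_chrom, last_start, last_end = merged[-1]
            if chrom == last_chrom and start - last_end < window:
                merged[-1] = (chrom, last_start, max(last_end, end))
                continue
        merged.append((chrom, start, end))
    return merged


peaks = [("chr1", 0, 100), ("chr1", 150, 200), ("chr1", 90, 120), ("chr2", 10, 20)]
strict = merge_regions(peaks, window=0)
windowed = merge_regions(peaks, window=100)
```

With a window of 0 only the two directly overlapping chr1 regions are combined; with a window of 100 the 30 bp gap to the third chr1 region is bridged as well.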

Random Sample of Genomic Regions

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): RegionsA dataset of genomic regions from which you would like to take random samples.
Output(s): Subset of RegionsA random subset of the input data.
Parameter(s): Sample SizeHow many random samples should be included in the output?
Name for Subset of RegionsA user-defined name that will be assigned to the output dataset from port Subset of Regions.
Description:
This module selects a random subset of the records in the input dataset. Warning: If you re-run the same workflow process, the results will be different each time, i.e. using this module might compromise the reproducibility of subsequent workflow steps!
Configuration:
Simply use the slider control to select the size of the random sample you'd like to create. Use drag & drop to move the slider handle left or right.
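The sampling step, and the reproducibility caveat from the description, can be sketched in Python as follows. The seed argument is a hypothetical addition for illustration only; the manual does not mention any seed option in GeneProf.

```python
import random


def sample_records(records, sample_size, seed=None):
    """Draw a random subset without replacement.

    With seed=None every call may return a different subset; passing a
    fixed seed (hypothetical option) makes the draw repeatable.
    """
    rng = random.Random(seed)
    size = min(sample_size, len(records))
    return rng.sample(records, size)


regions = [("chr1", i * 100, i * 100 + 50) for i in range(20)]
subset = sample_records(regions, 5, seed=42)
```

If downstream results need to be comparable between runs, one option outside of this module is to sample once and reuse the resulting dataset rather than re-running the sampling.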

RegEx Motif Scan

Developer(s): Halbritter F
Version: 2012-06-14
Input(s): Genomic IntervalsThe genomic intervals to scan for RegEx occurrences.
Output(s): Annotated IntervalsThe input regions annotated with RegEx matches.
Motif MatchesThe actual coordinates of the hits from the RegEx scan.
Parameter(s): Motif LabelsThe labels to be used for the scanned regular expressions.
Motif RegExRegular expressions to search for.
Name for Annotated IntervalsA user-defined name that will be assigned to the output dataset from port Annotated Intervals.
Name for Motif MatchesA user-defined name that will be assigned to the output dataset from port Motif Matches.
Description:
TODO.
Configuration:
Coming soon.
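Until the configuration notes are available, the general idea of a regular-expression scan over an interval's sequence can be sketched in Python. This is only an illustration under stated assumptions: it scans the forward strand, reports non-overlapping hits, uses 0-based half-open coordinates, and the function name, the motif label and the example pattern are hypothetical rather than taken from GeneProf.

```python
import re


def scan_motifs(sequence, motifs, offset=0):
    """Return (label, start, end, matched_text) for every regex hit.

    motifs: dict mapping a motif label to a regular expression.
    offset: genomic start coordinate of `sequence`, added to hit positions.
    """
    hits = []
    for label, pattern in motifs.items():
        for match in re.finditer(pattern, sequence):
            hits.append((label, offset + match.start(), offset + match.end(), match.group()))
    return hits


# Example: a CANNTG pattern scanned over an interval starting at position 1000.
hits = scan_motifs("TTCACGTGAACATGTG", {"E-box": "CA[ACGT]{2}TG"}, offset=1000)
```

The list of hits corresponds loosely to the Motif Matches output, while counting hits per interval would give an annotation of the kind carried by the Annotated Intervals output.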

Select Regions for Features

Developer(s): Halbritter F
Version: 2013-06-27
Related Citation(s): [17]
Input(s): RegionsA dataset of genomic regions whose internal identifiers are referred to by a column in the second input dataset.
FeaturesA dataset with a column referring identifiers from the first input dataset.
Output(s): Subset of RegionsA subset of the first input defined by the identifiers given as a parameter.
Parameter(s): ID ColumnThe column in the second input dataset which refers to internal identifiers in the first dataset.
Inverse SelectionSelect all entries that do NOT match the given identifiers instead?
Name for Subset of RegionsA user-defined name that will be assigned to the output dataset from port Subset of Regions.
Description:
This module selects a subset of the given genomic regions dataset, the internal identifiers of which are given in the column of the features dataset given as a second input.
Configuration:
To configure this module, you need to tell GeneProf which column in the second input (the features dataset) corresponds to the identifiers in the first input (the genomic regions dataset) -- that should be a column containing integer numbers.
You may invert the selection, i.e. select all entries NOT having a corresponding ID, by ticking the checkbox labelled Inverse Selection.
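The selection logic shared by the 'Select Regions for ...' modules can be sketched in a few lines of Python. The data layout (a dict keyed by internal identifier and a list of row dicts) and all names are assumptions made for the example.

```python
def select_regions(regions, rows, id_column, inverse=False):
    """Select regions whose internal ID appears in `id_column` of `rows`.

    regions: dict mapping internal (integer) ID -> region record.
    rows: list of dicts from the second input dataset.
    inverse: if True, keep the regions that are NOT referred to.
    """
    wanted = {row[id_column] for row in rows}
    return {rid: region for rid, region in regions.items()
            if (rid in wanted) != inverse}


regions = {1: ("chr1", 0, 10), 2: ("chr1", 50, 80), 3: ("chr2", 5, 25)}
features = [{"gene": "Nanog", "region_id": 1}, {"gene": "Sox2", "region_id": 3}]

selected = select_regions(regions, features, "region_id")
rest = select_regions(regions, features, "region_id", inverse=True)
```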

Select Regions for Regions

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): Selected RegionsA dataset of genomic regions whose internal identifiers are referred to by a column in the second input dataset.
RegionsA dataset with a column referring identifiers from the first input dataset.
Output(s): Subset of Selected RegionsA subset of the first input defined by the identifiers given as a parameter.
Parameter(s): ID ColumnThe column in the second input dataset which refers to internal identifiers in the first dataset.
Inverse SelectionSelect all entries that do NOT match the given identifiers instead?
Name for Subset of Selected RegionsA user-defined name that will be assigned to the output dataset from port Subset of Selected Regions.
Description:
This module selects a subset of the given genomic regions dataset, the internal identifiers of which are given in the column of the genomic regions dataset given as a second input.
Configuration:
To configure this module, you need to tell GeneProf which column in the second input (the genomic regions dataset) corresponds to the identifiers in the first input (the genomic regions dataset) -- that should be a column containing integer numbers.
You may invert the selection, i.e. select all entries NOT having a corresponding ID, by ticking the checkbox labelled Inverse Selection.

Select Regions for Sequences

Developer(s): Halbritter F
Version: 2012-07-23
Related Citation(s): [17]
Input(s): Selected RegionsA dataset of genomic regions whose internal identifiers are referred to by a column in the second input dataset.
SequencesA dataset with a column referring identifiers from the first input dataset.
Output(s): Subset of Selected RegionsA subset of the first input defined by the identifiers given as a parameter.
Parameter(s): ID ColumnThe column in the second input dataset which refers to internal identifiers in the first dataset.
Inverse SelectionSelect all entries that do NOT match the given identifiers instead?
Name for Subset of Selected RegionsA user-defined name that will be assigned to the output dataset from port Subset of Selected Regions.
Description:
This module selects a subset of the given genomic regions dataset, the internal identifiers of which are given in the column of the sequences dataset given as a second input.
Configuration:
To configure this module, you need to tell GeneProf which column in the second input (the sequences dataset) corresponds to the identifiers in the first input (the genomic regions dataset) -- that should be a column containing integer numbers.
You may invert the selection, i.e. select all entries NOT having a corresponding ID, by ticking the checkbox labelled Inverse Selection.

Feature Data Operations

These modules have been designed for filtering and manipulating feature data.

Basic Feature Annotations Filter

Developer(s): Halbritter F
Version: 2011-12-20
Related Citation(s): [17]
Input(s): Unfiltered FeaturesA set of feature annotations to apply a filter to.
Output(s): Compliant FeaturesOnly those features that passed the filter criteria you postulated.
Removed FeaturesAll features that failed to meet the filter criteria.
Parameter(s): Filter CriteriaAn arbitrary combination of filter criteria.
Disjunctive?Do ALL of the conditions have to be satisfied (conjunctive) or only at least one (disjunctive) for a record to pass the criteria?
Name for Removed FeaturesA user-defined name that will be assigned to the output dataset from port Removed Features.
Name for Compliant FeaturesA user-defined name that will be assigned to the output dataset from port Compliant Features.
Description:
This module filters the values in a dataset by testing each row in the dataset against one or more filter criteria. The module produces two outputs: one dataset with all rows that passed ALL filter criteria and one dataset with all rows that failed at least one criterion.
Configuration:
To add a filter criterion, simply tick the checkbox next to the description of one column, select a filter operator and enter a value to test against. The filter operators available depend on the data type of the column to filter on: Numeric values can be compared in terms of equality (=), inequality (!=), greater (>) or less (<). For textual values, you can check for equality / inequality (= / !=), but also for sub-string matches (~), that is, whether the text in the column to be searched contains the string to be compared against. For example, 'nan' does not equal 'nanog', but the former is contained in the latter.
N.B. The filter interprets all filter criteria conjunctively, i.e. ALL filters have to be true in order for a row to pass the filter. If you want to use disjunctive filter criteria (condition A OR condition B) or more complex filters you need to use the 'Complex Feature Annotations Filter' module.
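The operators and the conjunctive interpretation described above can be mimicked in Python as follows. This is a sketch for illustration: the function name, the criteria layout as (column, operator, value) triples and the sample rows are assumptions, not GeneProf internals.

```python
import operator

OPERATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    "~": lambda text, needle: needle in text,  # sub-string match
}


def apply_filter(rows, criteria):
    """Split rows into (compliant, removed).

    A row is compliant only if ALL criteria hold (conjunctive filter).
    """
    compliant, removed = [], []
    for row in rows:
        passed = all(OPERATORS[op](row[column], value)
                     for column, op, value in criteria)
        (compliant if passed else removed).append(row)
    return compliant, removed


genes = [
    {"name": "nanog", "expression": 12.0},
    {"name": "nanos1", "expression": 1.0},
    {"name": "sox2", "expression": 30.0},
]
kept, dropped = apply_filter(genes, [("name", "~", "nan"), ("expression", ">", 5)])
```

Here only 'nanog' passes: 'nanos1' matches the sub-string test but fails the numeric one, and 'sox2' fails the sub-string test.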

Calculate Additional Columns

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): Original FeaturesThe original dataset with annotations.
Output(s): Extended FeaturesThe original dataset annotations extended by all additional columns you decided to calculate.
Parameter(s): Calculated ColumnsDefinitions for new columns to be calculated.
Column LabelsThe labels of the new columns.
Name for Extended FeaturesA user-defined name that will be assigned to the output dataset from port Extended Features.
Description:
This module extends the annotations available for a dataset by adding one or more additional columns calculated on the basis of the columns already existing. You can define arbitrary calculations using a collection of common arithmetic and statistical operators, values from all columns in the dataset, and constants.
Configuration:
To add a new column you will have to define the calculation to be used to fill this column with values. The config dialog lists all operators available to use for this purpose. Operators are categorised into row and column operators. Row operators work on a per-row basis (i.e. per record in the dataset), while column operators consider all values in an entire column instead. You can refer to the value in a specific column using a wildcard. Wildcards are introduced with a dollar sign ($) followed by the identifier of an annotation column. All available wildcards are listed in the config dialog. After adding a new column calculation, you can rename the new column by entering a label into the respective textbox. Let's look at some examples:
Example: Transform all numbers in a column to logarithmic scale. Say our dataset contains a numeric annotation column A. We can refer to the values of this column using the wildcard $A. The relevant operator for calculating the logarithm is LOG, which takes two arguments, a value to operate on and the base of the logarithm to be used. So, to take the logarithm to the base of 2 of all values in column A, we use the operation: LOG($A,2)
Example: Transform all numbers in a logarithmic column to normal scale. Say our dataset contains a numeric annotation column B, which contains numbers on a logarithmic scale. We can refer to the values of this column using the wildcard $B. To convert from a logarithmic (base 2) scale to the normal scale, we calculate 2 to the power of each value. The relevant GeneProf operator is POW, which takes two arguments, the base and the exponent. Thus, we use the operation: POW(2,$B)
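For readers who want to check the arithmetic outside of GeneProf, the two examples correspond to the following Python sketch. The row layout and the new column names are made up for illustration; the mean is included only to show the difference between a per-row and a per-column calculation and does not refer to a specific GeneProf operator.

```python
import math
import statistics

rows = [{"A": 8.0, "B": 3.0}, {"A": 2.0, "B": 0.0}]

# A column-style calculation looks at all values of a column at once.
mean_a = statistics.mean(row["A"] for row in rows)

for row in rows:
    # Row-style calculations are evaluated once per record.
    row["log2_A"] = math.log(row["A"], 2)    # counterpart of LOG($A,2)
    row["linear_B"] = math.pow(2, row["B"])  # counterpart of POW(2,$B)
    row["A_minus_mean"] = row["A"] - mean_a
```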

Compare Feature Data

Developer(s): Halbritter F
Version: 2011-03-15
Input(s): Original Features SetsTwo or more feature data sets.
Output(s): Merged Feature SetA combination of all feature annotations. Features will be merged on the basis of the GeneProf internal IDs. For this to work, all features have to refer to the same reference set.
Parameter(s): Merge OperationInclude all features present in any of the datasets (UNION) or only the ones present in all datasets?
Conflict ResolutionResolve conflicts between features by..
Name for Merged Feature SetA user-defined name that will be assigned to the output dataset from port Merged Feature Set.
Description:
Merging Sets. You can join feature sets by intersection (only features in all datasets), union (features in at least one dataset) or difference (features in exactly one dataset).
This module combines several feature datasets into one. You can merge feature datasets either as the union of all datasets, as their intersection or as their difference (cp. this figure). In the simplest case and, in fact, for most GeneProf datasets, merging feature data will easily be achieved by combining the annotation columns from all datasets on a per-feature basis. If, however, one of the datasets is not based on GeneProf's internal gene records, but on another identifier (such might be the case if you parsed the dataset from a file), you need to decide how to resolve the records in this dataset to the features in GeneProf's reference. For example, if your dataset contains microarray intensities which are based on probe set identifiers, more than one such probe set might be referring to the same gene. In that case, you might want to use the mean of all observed intensities or the maximum.
Configuration:
First, select your merge operation from the three available options: You can join feature sets by intersection (only features in all datasets), union (features in at least one dataset) or difference (features in exactly one dataset).
If all features are based on the same, internal identifiers (as is probably the case), you're done here. Otherwise, you'll need to decide how to collapse the features, i.e. how to resolve conflicts if there is more than one record in a dataset referring to the same reference feature. You can either simply take the first or a random record per feature, or you can decide to resolve the conflict by taking whatever row has the maximum / minimum value for a chosen 'master column'. You can choose which column is to be the master column from a dropdown list.
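The three merge operations and the 'master column' conflict resolution can be sketched in Python. All names and the data layout are assumptions for the example; 'difference' follows the definition given above (features present in exactly one dataset).

```python
def merge_ids(datasets, operation):
    """datasets: list of sets of feature IDs.

    operation: 'union', 'intersection' or 'difference'.
    """
    union = set().union(*datasets)
    if operation == "union":
        return union
    if operation == "intersection":
        return set.intersection(*datasets)
    # difference: features that occur in exactly one of the datasets
    return {fid for fid in union if sum(fid in d for d in datasets) == 1}


def collapse_by_max(rows, id_key, master_column):
    """Keep, per feature, the row with the largest master-column value."""
    best = {}
    for row in rows:
        fid = row[id_key]
        if fid not in best or row[master_column] > best[fid][master_column]:
            best[fid] = row
    return best


a, b, c = {1, 2, 3}, {2, 3, 4}, {3, 5}
probes = [
    {"gene": "G1", "probe": "p1", "intensity": 5.0},
    {"gene": "G1", "probe": "p2", "intensity": 9.0},
    {"gene": "G2", "probe": "p3", "intensity": 2.0},
]
per_gene = collapse_by_max(probes, "gene", "intensity")
```

In the probe set example, gene G1 is represented by two probes and the collapsed dataset keeps the one with the higher intensity.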

Complex Feature Annotations Filter

Developer(s): Halbritter F
Version: 2011-12-20
Related Citation(s): [17]
Input(s): Unfiltered FeaturesA set of feature annotations to apply a filter to.
Output(s): Compliant FeaturesOnly those features that passed the filter criteria you postulated.
Removed FeaturesAll features that failed to meet the filter criteria.
Parameter(s): Filter CriteriaAn arbitrary combination of filter criteria.
Disjunctive?Do ALL of the conditions have to be satisfied (conjunctive) or only at least one (disjunctive) for a record to pass the criteria?
Name for Compliant FeaturesA user-defined name that will be assigned to the output dataset from port Compliant Features.
Name for Removed FeaturesA user-defined name that will be assigned to the output dataset from port Removed Features.
Description:
This module filters the values in a dataset by testing each row in the dataset against one or more filter criteria. The module produces two outputs: one dataset with all rows that passed ALL filter criteria and one dataset with all rows that failed at least one criterion.
Configuration:
Complex filter criteria are defined by adding calculations that evaluate to a boolean value, i.e. that return true (the record passes the criterion) or false (the record fails the criterion).
Please refer to the documentation of module 'Calculate Additional Columns' to find out how to define calculations.

Drop Feature Annotation Columns

Developer(s): Halbritter F
Version: 2012-02-21
Input(s): Original FeaturesThe original dataset with annotations.
Output(s): Truncated FeaturesThe original dataset annotations retaining only the annotations you want.
Parameter(s): Kept ColumnsA selection of all columns that should be kept.
Name for Truncated FeaturesA user-defined name that will be assigned to the output dataset from port Truncated Features.
Description:
This module drops annotation columns that are no longer required from a dataset.
Configuration:
To configure this module, simply select all columns that you would like to keep in the dataset by ticking the checkboxes next to them.

Random Sample of Features

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): FeaturesA dataset of features from which you would like to take random samples.
Output(s): Subset of FeaturesA random subset of the input data.
Parameter(s): Sample SizeHow many random samples should be included in the output?
Name for Subset of FeaturesA user-defined name that will be assigned to the output dataset from port Subset of Features.
Description:
This module selects a random subset of the records in the input dataset. Warning: If you re-run the same workflow process, the results will be different each time, i.e. using this module might compromise the reproducibility of subsequent workflow steps!
Configuration:
Simply use the slider control to select the size of the random sample you'd like to create. Use drag & drop to move the slider handle left or right.

Parsers

These modules have been designed for parsing GeneProf-compatible data from various file formats.

FASTA Parser

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): Input FileUp to 10 files to be parsed into one dataset. The structure of all files ought to be identical.
Output(s): SequencesSequences parsed from the input file.
Parameter(s): Name for SequencesA user-defined name that will be assigned to the output dataset from port Sequences.
Description:
This module parses sequence data from a FASTA-formatted file (usually: .fa, .fasta, .mfa). The module can take more than one file as input. In this case, all files are parsed and the sequences added together into one large sequence dataset. N.B. one nucleotide sequence may stretch across multiple lines.
Configuration:
This module does not take any parameters and cannot be customized any further. You may, however, change the names of the outputs produced in the Input & Outputs tab.
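To illustrate the FASTA layout this parser expects, in particular that one sequence may stretch across several lines, here is a minimal Python sketch. It is not GeneProf's parser; the function name is illustrative and the input is any iterable of text lines.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs; a sequence may span several lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


records = list(parse_fasta([">seq1", "ACGT", "TTAA", "", ">seq2", "GG"]))
```

Several files can be combined into one dataset by chaining their lines, for example with itertools.chain, before passing them to the function.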

FASTQ Parser

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): Input FileUp to 10 files to be parsed into one dataset. The structure of all files ought to be identical.
Output(s): SequencesSequences parsed from the input file.
Parameter(s): Quality Score FormatPer-base quality scores in FASTQ files are encoded using ASCII characters. Unfortunately, that's pretty much where the format specification ends and FASTQ files vary a bit with respect to what the symbols actually represent. If you know the specific format your data files come in you can select it from the drop-down list (your sequencing provider should be able to tell you), otherwise we'll try to guess the format.
Name for SequencesA user-defined name that will be assigned to the output dataset from port Sequences.
Description:
This module parses sequence data from a FASTQ-formatted file (usually: .fq, .fastq, .mfq). The module can take more than one file as input. In this case, all files are parsed and the sequences added together into one large sequence dataset. Each sequence record has to consist of exactly four lines, i.e. long sequences may NOT stretch across multiple lines.
Configuration:
Unfortunately, there are a number of different versions of the FASTQ format that differ in the way they calculate the quality scores and it is not easy to tell which version one's dealing with (cp. SubConcept 'Quality Scores and Formats'). By default, GeneProf will try to guess the quality score format by sampling a range of reads in the file and looking at the range of values their quality scores take. This usually works quite well, but sometimes GeneProf might make a mistake. Thus, if you do know the exact format of the input data, please specify it by selecting it from the available options.
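A common way to guess the encoding from the range of quality characters is sketched below in Python. This is a widely used heuristic and only an assumption about how such a guess could work; it is not GeneProf's actual procedure, and the returned format labels are made up for the example. The thresholds rely on the lowest ASCII codes the encodings can produce: 33 for Phred+33, 59 for Solexa+64 and 64 for Phred+64.

```python
def guess_quality_format(quality_strings):
    """Guess the FASTQ quality encoding from sampled quality strings.

    Raises ValueError if no quality characters are supplied.
    """
    lowest = min(ord(ch) for quality in quality_strings for ch in quality)
    if lowest < 59:
        return "phred+33"   # characters below ';' only occur with offset 33
    if lowest < 64:
        return "solexa+64"  # Solexa scores can go down to -5, i.e. ';'
    return "phred+64"
```

The sketch also shows why a guess can go wrong: a Phred+33 file in which every sampled base has a very high quality contains no low characters and would be mistaken for one of the offset-64 formats.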

SRA File Parser

Developer(s): Halbritter F
Version: 2012-10-24 (SRA Toolkit 2.1.16)
Related Citation(s): [17]
Related Web Page(s): SRA Homepage
Input(s): Input FileUp to 10 files to be parsed into one dataset. The structure of all files ought to be identical.
Output(s): SequencesSequences parsed from the input file.
Parameter(s): Name for SequencesA user-defined name that will be assigned to the output dataset from port Sequences.
Description:
This module parses a sequence dataset from a sequence file in SRA or SRA-lite format (usually: .sra or .lite.sra). The module will automatically distinguish single-end and paired-end data and can also handle colour space reads, which will be exported into nucleotide space and then parsed as normal reads (this might introduce some inaccuracies).
Configuration:
This module does not take any parameters and cannot be customized any further. You may, however, change the names of the outputs produced in the Input & Outputs tab.

Raw Sequence Parser

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): Input FileUp to 10 files to be parsed into one dataset. The structure of all files ought to be identical.
Output(s): SequencesSequences parsed from the input file.
Parameter(s): Name for SequencesA user-defined name that will be assigned to the output dataset from port Sequences.
Description:
This module parses sequence data from a plain text file containing one nucleotide sequence per line. There may also be an additional column specifying the number of occurrences for each sequence (X), if so the output dataset will contain the parsed sequence X times. Thus, the format of each line in the input file should be:
{nucleotide sequence: ATCGN} {separator TAB or COMMA} {number (optional)}.
Configuration:
This module does not take any parameters and cannot be customized any further. You may, however, change the names of the outputs produced in the Input & Outputs tab.
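The line format described above (sequence, TAB or COMMA, optional count) can be parsed as in the following Python sketch. The function name is illustrative and no validation of the nucleotide alphabet is attempted.

```python
import re


def parse_raw_line(line):
    """Return the sequence repeated as often as the optional count says."""
    fields = re.split(r"[\t,]", line.strip())
    sequence = fields[0]
    has_count = len(fields) > 1 and fields[1].strip()
    count = int(fields[1]) if has_count else 1
    return [sequence] * count


sequences = []
for raw in ["ACGTN\t3\n", "TTGA\n", "GGCC,2\n"]:
    sequences.extend(parse_raw_line(raw))
```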

FASTQ Paired-End Parser

Developer(s): Halbritter F
Version: 2012-03-08
Related Citation(s): [17]
Input(s): Input FileUp to 10 files to be parsed into one dataset. The structure of all files ought to be identical.
Output(s): SequencesSequences parsed from the input file.
Parameter(s): Quality Score FormatPer-base quality scores in FASTQ files are encoded using ASCII characters. Unfortunately, that's pretty much where the format specification ends and FASTQ files vary a bit with respect to what the symbols actually represent. If you know the specific format your data files come in you can select it from the drop-down list (your sequencing provider should be able to tell you), otherwise we'll try to guess the format.
Name for SequencesA user-defined name that will be assigned to the output dataset from port Sequences.
Description:
This module parses paired-end sequence data from a FASTQ-formatted file (usually: .fq, .fastq, .mfq; cp. SubConcept 'Paired-End and Mate-Pair Reads'). GeneProf assumes that the mate-pair sequences have been concatenated in the input file, so it will try to split each sequence in half to recover the two separate mate pairs. The module can take more than one file as input. In this case, all files are parsed and the sequences added together into one large sequence dataset. Each sequence record has to consist of exactly four lines, i.e. long sequences may NOT stretch across multiple lines.
Configuration:
Unfortunately, there are a number of different versions of the FASTQ format that differ in the way they calculate the quality scores and it is not easy to tell which version one's dealing with (cp. SubConcept 'Quality Scores and Formats'). By default, GeneProf will try to guess the quality score format by sampling a range of reads in the file and looking at the range of values their quality scores take. This usually works quite well, but sometimes GeneProf might make a mistake. Thus, if you do know the exact format of the input data, please specify it by selecting it from the available options.
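The splitting of concatenated mates mentioned in the description can be pictured with this Python sketch. It assumes both mates have the same length; how an odd-length record is treated is not documented, so giving the extra base to the second mate is an arbitrary choice made here. The function name is illustrative.

```python
def split_mates(sequence, quality):
    """Split a concatenated read and its quality string into two mates."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality string differ in length")
    half = len(sequence) // 2
    first = (sequence[:half], quality[:half])
    second = (sequence[half:], quality[half:])
    return first, second


mate1, mate2 = split_mates("ACGTTTGA", "IIIIHHHH")
```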

Generic Sequence Parser

Developer(s): Halbritter F
Version: 2011-03-15
Input(s): Input FileUp to 10 files to be parsed into one dataset. The structure of all files ought to be identical.
Output(s): SequencesSequences parsed from the input file.
Parameter(s): Name for SequencesA user-defined name that will be assigned to the output dataset from port Sequences.
Description:
This module takes an input file and tries to determine the format it comes in (assuming, all along, that the file is a plain text document containing nucleotide sequences). The module will fall back on all sequence parser modules registered in GeneProf and check whether any of them is able to handle the file. In general, this works quite well for FASTA and FASTQ files!
Configuration:
This module does not take any parameters and cannot be customized any further. You may, however, change the names of the outputs produced in the Input & Outputs tab.

Bowtie Output Parser

Developer(s): Halbritter F
Version: 2012-05-07
Related Citation(s): [17]
Input(s): Input FileThe input file to be parsed.
Reference SetReference set used to identify the chromosomal locations or feature / gene identifiers in the input file with GeneProf-compatible entities.
Output(s): Genomic RegionsGenomic regions (intervals) parsed from the input file.
Parameter(s): Aligned sequences from..The identifier of the dataset that holds the sequences which have been aligned to the regions in this input file.
Name for Genomic RegionsA user-defined name that will be assigned to the output dataset from port Genomic Regions.
Description:
This module parses an output file of the Bowtie alignment program [9] into a set of GeneProf-compatible genomic regions. Important: The chromosome names in this file have to match the chromosome names in the reference EXACTLY (case-sensitive). Do not use this parser for alignments of paired-end / mate-pair sequence datasets, please use this module instead: module 'Bowtie Output Parser (Mate-Pai..'.
Configuration:
This module does not take any parameters and cannot be customized any further. You may, however, change the names of the outputs produced in the Input & Outputs tab.

Genomic Region Parser (BED, TXT, TSV)

Developer(s): Halbritter F
Version: 2013-07-02
Related Citation(s): [17]
Input(s): Input FileThe input file to be parsed.
Reference SetReference set used to identify the chromosomal locations or feature / gene identifiers in the input file with GeneProf-compatible entities.
Output(s): Genomic RegionsGenomic regions (intervals) parsed from the input file.
Parameter(s): Column TypesIn order for GeneProf to make sense of the provided annotation you need to specify what sort of information is contained in each column.
Column LabelsUser-defined labels for each column.
Column DescriptionsUser-defined descriptions for each column.
Column URLsUser-defined URLs for each column.
Ignore header?Do not parse the content of the first line (say, if it contains column labels).
Add membership flag?Add an additional (boolean) column indicating that the parsed region is part of this dataset? This is useful if you want to merge different regions later on.
1-based?Genome coordinates can either be 0- (first base = 0, that's the standard for BED files and most databases) or 1-based (first base is 1). GeneProf usually deals with 0-based coordinates and others will have to be converted.
Name for Genomic RegionsA user-defined name that will be assigned to the output dataset from port Genomic Regions.
Description:
The module parses a plain, tab-delimited text file into a GeneProf-compatible genomic region dataset. A genomic region is defined by its exact genomic location, that is, by its chromosome, its start and end coordinates and the strand it is located on. Hence, your text file needs to contain at least four columns, one for each of these values. For parsing to succeed, the chromosome names in the text file need to be exactly the same as the chromosome names in the reference dataset.
Configuration:
Genomic Region Parser Configuration. The genomic region parser can parse arbitrary tab-delimited text files into valid GeneProf datasets.
To configure the module, you need to tell GeneProf about the layout of your input file and about how the information in it is to be interpreted (cp. this figure). Crucially, you need to define four columns: chromosome, start, end and strand. The dialog will display a sample of the first few lines from your input file to help you to define the layout of the file. GeneProf can guess the format of the layout -- try this by clicking the Guess Types? button. If this doesn't work, just quickly select the column types yourself: Simply pick the type from the drop-down list.
Quite often your text files will contain a header row giving a label for each column. If that's the case, tick the Use first row as header? checkbox. GeneProf will then not try to parse a genomic region from the first line (which would fail, evidently).
If you intend to merge different datasets later on, it can come in handy to have a flag indicating which dataset each region originally came from. Just tick the checkbox Yes, add a membership flag! to do so.
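The options described above (header row, 1-based coordinates, membership flag) can be illustrated with a short Python sketch. It is not GeneProf's implementation: the function and key names are hypothetical, the four required columns are assumed to come first in the order chromosome, start, end, strand, and 1-based input is assumed to use closed intervals that are converted to 0-based, half-open (BED-style) coordinates.

```python
import csv
import io


def parse_regions(text, has_header=False, one_based=False, add_flag=False):
    """Parse tab-delimited chromosome/start/end/strand rows into dictionaries."""
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    regions = []
    for index, row in enumerate(rows):
        if index == 0 and has_header:
            continue  # the header row holds labels, not a region
        if len(row) < 4:
            continue  # a region needs all four columns
        chrom, start, end, strand = row[0], int(row[1]), int(row[2]), row[3]
        if one_based:
            start -= 1  # assumption: 1-based closed -> 0-based half-open
        region = {"chromosome": chrom, "start": start, "end": end, "strand": strand}
        if add_flag:
            region["member"] = True  # membership flag, useful when merging later
        regions.append(region)
    return regions


sample = "chr\tstart\tend\tstrand\nchr1\t101\t200\t+\nchr2\t5\t50\t-\n"
for region in parse_regions(sample, has_header=True, one_based=True):
    print(region)
```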

SAM/BAM Region Parser

Developer(s): Halbritter F (Module Developer); Wysoker A, Tibbetts K, Fennell T, Weisburd B (Software Developer)
Version: 2011-04-25 (Picard 1.29)
Related Citation(s): [17]
Related Web Page(s): Picard Project
Input(s): Input FileThe input file to be parsed.
Reference SetReference set used to identify the chromosomal locations or feature / gene identifiers in the input file with GeneProf-compatible entities.
Output(s): RegionsRegions parsed from the input file.
Parameter(s): Name for RegionsA user-defined name that will be assigned to the output dataset from port Regions.
Description:
This module parses alignments in SAM or BAM format into a set of GeneProf-compatible genomic regions. In doing so, GeneProf makes use of the Picard package. Important: The chromosome names in this file have to match the chromosome names in the reference EXACTLY (case-sensitive).
Configuration:
This module does not take any parameters and cannot be customized any further. You may, however, change the names of the outputs produced in the Input & Outputs tab.

MACS Peaks.xls Parser

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): Input FileThe input file to be parsed.
Reference SetReference set used to identify the chromosomal locations or feature / gene identifiers in the input file with GeneProf-compatible entities.
Output(s): Genomic RegionsGenomic regions (intervals) parsed from the input file.
Parameter(s): Name for Genomic RegionsA user-defined name that will be assigned to the output dataset from port Genomic Regions.
Description:
This module parses an output file of the MACS peak detection program [7] into a set of GeneProf-compatible genomic regions. Important: The chromosome names in this file have to match the chromosome names in the reference EXACTLY (case-sensitive).
Configuration:
This module does not take any parameters and cannot be customized any further. You may, however, change the names of the outputs produced in the Input & Outputs tab.

CisGenome 2-sample .cod Parser

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): Input FileThe input file to be parsed.
Reference SetReference set used to identify the chromosomal locations or feature / gene identifiers in the input file with GeneProf-compatible entities.
Output(s): Genomic RegionsGenomic regions (intervals) parsed from the input file.
Parameter(s): Name for Genomic RegionsA user-defined name that will be assigned to the output dataset from port Genomic Regions.
Description:
This module parses an output file of the CisGenome peak detection program [8] into a set of GeneProf-compatible genomic regions. N.B. Only use this module for CisGenome output of Two-Sample runs, i.e. WITH a background signal! Important: The chromosome names in this file have to match the chromosome names in the reference EXACTLY (case-sensitive).
Configuration:
This module does not take any parameters and cannot be customized any further. You may, however, change the names of the outputs produced in the Input & Outputs tab.

CisGenome 1-sample .cod Parser

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): Input FileThe input file to be parsed.
Reference SetReference set used to identify the chromosomal locations or feature / gene identifiers in the input file with GeneProf-compatible entities.
Output(s): Genomic RegionsGenomic regions (intervals) parsed from the input file.
Parameter(s): Name for Genomic RegionsA user-defined name that will be assigned to the output dataset from port Genomic Regions.
Description:
This module parses an output file of the CisGenome peak detection program [8] into a set of GeneProf-compatible genomic regions. N.B. Only use this module for CisGenome output of One-Sample runs, i.e. WITHOUT a background signal! Important: The chromosome names in this file have to match the chromosome names in the reference EXACTLY (case-sensitive).
Configuration:
This module does not take any parameters and cannot be customized any further. You may, however, change the names of the outputs produced in the Input & Outputs tab.

ChIPSeqPeakFinder Output Parser

Developer(s): Halbritter F
Version: 2011-04-15
Related Citation(s): [17]
Input(s): Input FileThe input file to be parsed.
Reference SetReference set used to identify the chromosomal locations or feature / gene identifiers in the input file with GeneProf-compatible entities.
Output(s): Genomic RegionsGenomic regions (intervals) parsed from the input file.
Parameter(s): FC >Set a threshold on the fold change enrichment over background for each accepted peak.
Name for Genomic RegionsA user-defined name that will be assigned to the output dataset from port Genomic Regions.
Description:
This module parses an output file of the ChIPseqPeakFinder peak detection program [5] into a set of GeneProf-compatible genomic regions. Important: The chromosome names in this file have to match the chromosome names in the reference EXACTLY (case-sensitive).
Configuration:
This module does not take any parameters and cannot be customized any further. You may, however, change the names of the outputs produced in the Input & Outputs tab.

CCAT Peaks Parser

Developer(s): Halbritter F
Version: 2011-04-15
Related Citation(s): [17]
Input(s): Input FileThe input file to be parsed.
Reference SetReference set used to identify the chromosomal locations or feature / gene identifiers in the input file with GeneProf-compatible entities.
Output(s): Genomic RegionsGenomic regions (intervals) parsed from the input file.
Parameter(s): Name for Genomic RegionsA user-defined name that will be assigned to the output dataset from port Genomic Regions.
Description:
This module parses an output file of the CCAT peak detection program [15] into a set of GeneProf-compatible genomic regions. Important: The chromosome names in this file have to match the chromosome names in the reference EXACTLY (case-sensitive).
Configuration:
This module does not take any parameters and cannot be customized any further. You may, however, change the names of the outputs produced in the Input & Outputs tab.

Bowtie Output Parser (Mate-Paired)

Developer(s): Halbritter F
Version: 2011-04-14
Related Citation(s): [17]
Input(s): Input FileThe input file to be parsed.
Reference SetReference set used to identify the chromosomal locations or feature / gene identifiers in the input file with GeneProf-compatible entities.
Output(s): Genomic RegionsGenomic regions (intervals) parsed from the input file.
Parameter(s): Aligned sequences from..The identifier of the dataset that holds the sequences that have been aligned to the regions in this input file.
Name for Genomic RegionsA user-defined name that will be assigned to the output dataset from port Genomic Regions.
Description:
This module parses an output file of the Bowtie alignment program [9] into a set of GeneProf-compatible genomic regions. Important: The chromosome names in this file have to match the chromosome names in the reference EXACTLY (case-sensitive). Use this parser ONLY for alignments of paired-end / mate-pair sequence datasets! It is assumed that the file contains two lines per aligned sequence pair: The first line reports the alignment of the first mate, the second line the alignment of the second mate (this is the usual output format of Bowtie for paired-end datasets). If this format is violated, the parsed output will make little sense.
Configuration:
This module does not take any parameters and cannot be customized any further. You may, however, change the names of the outputs produced in the Input & Outputs tab.
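The two-lines-per-pair layout can be illustrated with a small Python sketch. It assumes Bowtie's default tab-separated output (read name, strand, reference name, 0-based offset, sequence, qualities, and further columns). How GeneProf actually combines the two mates is not documented here, so reporting the span from the leftmost to the rightmost aligned base, with the strand of the first mate, is an assumption, and the function name is hypothetical.

```python
def parse_mate_pairs(lines):
    """Combine consecutive Bowtie output lines (mate 1, mate 2) into fragments."""
    records = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    if len(records) % 2 != 0:
        raise ValueError("expected two lines per aligned pair")
    fragments = []
    for first, second in zip(records[0::2], records[1::2]):
        if first[2] != second[2]:
            raise ValueError("mates align to different reference sequences")
        starts = [int(first[3]), int(second[3])]
        ends = [int(first[3]) + len(first[4]), int(second[3]) + len(second[4])]
        # assumption: the fragment covers both mates and takes mate 1's strand
        fragments.append((first[2], min(starts), max(ends), first[1]))
    return fragments


pair = [
    "frag1/1\t+\tchr1\t100\tACGTACGT\tIIIIIIII\t0\t",
    "frag1/2\t-\tchr1\t250\tTTTTCCCC\tIIIIIIII\t0\t",
]
print(parse_mate_pairs(pair))
```

A file with an odd number of lines, or with mates on different chromosomes, violates the expected layout, which is why the sketch raises an error rather than guessing.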

FASTQ Paired-End Parser (2 Files)

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): Input File 1The first input file to be parsed.
Input File 2The second input file to be parsed.
Output(s): SequencesSequences parsed from the input file.
Parameter(s): Quality Score FormatPer-base quality scores in FASTQ files are encoded using ASCII characters. Unfortunately, that's pretty much where the format specification ends and FASTQ files vary a bit with respect to what the symbols actually represent. If you know the specific format your data files come in you can select it from the drop-down list (your sequencing provider should be able to tell you), otherwise we'll try to guess the format.
Name for SequencesA user-defined name that will be assigned to the output dataset from port Sequences.
Description:
This module parses paired-end sequence data from (exactly) two FASTQ-formatted files (usually: .fq, .fastq, .mfq; cp. SubConcept 'Paired-End and Mate-Pair Reads'). Important: GeneProf assumes that the sequences in both files correspond to the two sequences of one mate-pair and that they occur in exactly the same order.
Configuration:
Unfortunately, there are a number of different versions of the FASTQ format that differ in the way they calculate the quality scores, and it is not easy to tell which version one is dealing with (cp. SubConcept 'Quality Scores and Formats'). By default, GeneProf will try to guess the quality score format by sampling a number of reads from the file and looking at the range of values their quality scores take. This usually works quite well, but sometimes GeneProf might make a mistake. Thus, if you do know the exact format of the input data, please specify it by selecting it from the available options.
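To give an idea of how a format can be guessed from the range of quality characters, here is a minimal Python sketch. It uses a commonly applied rule of thumb based on the lowest ASCII code observed and is not necessarily the heuristic GeneProf uses. The function names and the returned labels are hypothetical, and each FASTQ record is assumed to span exactly four lines.

```python
from itertools import islice


def sample_qualities(fastq_lines, max_reads=1000):
    """Return the quality line (4th line) of up to max_reads FASTQ records."""
    quality_lines = islice(fastq_lines, 3, None, 4)
    return [q.rstrip("\n") for q in islice(quality_lines, max_reads)]


def guess_quality_format(quality_strings):
    """Guess the quality encoding from the lowest ASCII code seen."""
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        raise ValueError("no quality characters to inspect")
    lowest = min(codes)
    if lowest < 59:
        return "phred33"   # Sanger-style encoding, offset 33
    if lowest < 64:
        return "solexa64"  # Solexa scores, offset 64, may be negative
    return "phred64"       # Phred scores with offset 64


records = ["@r1\n", "ACGT\n", "+\n", "!!II\n", "@r2\n", "ACGT\n", "+\n", "IIII\n"]
print(guess_quality_format(sample_qualities(records)))
```

A sample that happens to contain only high-quality reads can look the same under several encodings, which is one reason why selecting the format explicitly is safer when it is known.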

Feature Annotations Parser

Developer(s): Halbritter F
Version: 2011-07-22
Related Citation(s): [17]
Input(s): Input FileThe input file to be parsed.
Reference SetReference set used to identify the chromosomal locations or feature / gene identifiers in the input file with GeneProf-compatible entities.
Output(s): FeaturesFeature data parsed from the input file.
Parameter(s): Column TypesIn order for GeneProf to make sense of the provided annotation you need to specify what sort of information is contained in each column.
Column LabelsUser-defined labels for each column.
Column DescriptionsUser-defined descriptions for each column.
Column URLsUser-defined URLs for each column.
Ignore header?Do not parse the content of the first line (say, if it contains column labels).
Match in referenceThe column in the original reference set to use for matching entries.
Add membership flag?Add an additional (boolean) column indicating that the parsed feature is part of this dataset? This is useful if you want to merge different regions later on.
Name for FeaturesA user-defined name that will be assigned to the output dataset from port Features.
Description:
The module parses information from a plain, tab-delimited text file into a GeneProf-compatible feature dataset. For parsing to succeed, the text file needs to contain some sort of identifier which allows GeneProf to establish a connection between the data in the file and the features in the reference dataset.
Configuration:
Feature Parser Configuration. The feature parser can parse arbitrary tab-delimited text files into valid GeneProf datasets.
To configure the module, you need to tell GeneProf about the layout of your input file and about how the information in it is to be interpreted (cp. this figure). Crucially, you need to define a column on which a mapping to the reference dataset is to be established. The dialog will display a sample of the first few lines from your input file to help you to define the layout of the file. Find the column which contains the feature identifiers (typically these would be Ensembl or RefSeq gene IDs or probe identifiers from a microarray platform) and select Accession as its Type. Then select the column in the reference that holds the corresponding identifiers (Match to this reference column).
Now you can tell GeneProf about the other columns in the data file. Quite often your text files will contain a header row giving a label for each column. If that's the case, tick the Use first row as header? checkbox. We'll automatically replace the labels for all columns with the text in the first row of your input file to save you some work (however, you can change the labels, if you like). Next, select a Type for each column that contains information you'd like to include in the output. You can either select one of a few commonly used types (descriptions, names, fold changes, p-values, etc.) or define a custom type (either a number, boolean or text). Please also adjust the labels and descriptions to something sensible.
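The matching step can be pictured as a lookup of each accession in the chosen reference column. The Python sketch below illustrates the idea only: the function names, the gene identifiers and the decision to collect unmatched rows separately are all hypothetical and not taken from GeneProf.

```python
import csv
import io


def read_table(text, has_header=True):
    """Split tab-delimited text into column labels and data rows."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    if has_header and rows:
        return rows[0], rows[1:]
    labels = ["Column %d" % (i + 1) for i in range(len(rows[0]))] if rows else []
    return labels, rows


def match_features(rows, accession_column, reference_ids):
    """Separate rows whose accession occurs in the reference from the rest."""
    known = set(reference_ids)
    matched, unmatched = {}, []
    for row in rows:
        accession = row[accession_column].strip()
        if accession in known:
            matched[accession] = row
        else:
            unmatched.append(row)
    return matched, unmatched


table = "gene_id\tfold_change\ngeneA\t2.5\nnot_in_reference\t0.1\n"
labels, rows = read_table(table)
matched, unmatched = match_features(rows, 0, ["geneA", "geneB"])
print(labels, matched, unmatched)
```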

Alignment

Use these modules to align sequences to a reference genome (creating a 'genomic regions' dataset).

Align against DNA with Bowtie (v0.12.3)

Developer(s): Halbritter F (Module Developer); Langmead B, Trapnell C (Software Developer)
Version: 2013-01-03 (Bowtie v0.12.3)
Related Citation(s): [9][17]
Related Web Page(s): Bowtie Homepage
Input(s): SequencesThe sequences that you would like to align to the genome.
Reference GenomeThe reference genome / transcript sequences that you would like to align to.
Output(s): AlignmentsThe genomic coordinates of the regions where the sequences aligned to.
Unaligned SequencesAll sequences that could not be aligned to the genome at all.
Ambiguous SequencesAll sequences that had too many possible alignments (i.e. highly ambiguous / repetitive sequences).
Parameter(s): Use quality values?Shall the alignment tool make use of read quality scores?
Max. mismatches in seedThe maximum number of mismatches allowed in the seed portion of the read. See also: parameter -l.
Seed lengthThe seed of a read is the left portion of the read which is more likely to be of a good quality.
Max. sum of mismatch qualsMaximum sum of quality values at all mismatched read positions.
Disable quality rounding?By default, bowtie rounds quality values to the nearest 10 with a maximum of 30.
Max. mismatchesThe maximum number of mismatches across the entire alignment.
No forward strand?Do not attempt to align reads to the forward strand of the reference.
No reverse strand?Do not attempt to align reads to the reverse strand of the reference.
Max. reported alignmentsReport up to K valid alignments per read.
Max. ambiguitySuppress all alignments if more than M valid alignments exist.
Best possible hitsMake Bowtie guarantee that reported singleton alignments are 'best' in terms of stratum (i.e. number of mismatches, or mismatches in the seed in the case of -n mode) and in terms of the quality values at the mismatched position(s).
Use strataAn alignment's stratum is defined by its number of mismatches. Activating this option has the consequence that only alignments with fewer mismatches will be reported, even if further alignments with more mismatches exist.
Share memoryMultiple bowtie processes on the same computer share their memory (does not affect your results).
Use multiple coresAllows bowtie to use more than one CPU on the same computer (does not affect your results).
Min. Insert SizeThe minimum insert size for valid paired-end alignments.
Max. Insert SizeThe maximum insert size for valid paired-end alignments.
Mate OrientationThe upstream/downstream mate orientations for a valid paired-end alignment against the forward reference strand.
No strataUse this parameter to switch OFF bowtie's --strata option: An alignment's stratum is defined by its number of mismatches. Activating this option (as by GeneProf default) has the consequence that only alignments with fewer mismatches will be reported, even if further alignments with more mismatches exist.
Name for Unaligned SequencesA user-defined name that will be assigned to the output dataset from port Unaligned Sequences.
Name for AlignmentsA user-defined name that will be assigned to the output dataset from port Alignments.
Name for Ambiguous SequencesA user-defined name that will be assigned to the output dataset from port Ambiguous Sequences.
Description:
This module uses the Bowtie alignment software (version 0.12.3, [9]) to place a collection of nucleotide sequences on a reference genome sequence. Bowtie (and this module) can be customized in many different ways to modify the behaviour of the alignment process. The module will report its output in three different datasets: Successfully aligned reads will be stored in a genomic region dataset in which all reads aligning to the same genomic position are clustered into one entry (keeping track of the identifiers of all reads that aligned here). Additionally, GeneProf will report all reads that failed to align to the reference either because no successful match could be found at all or because there were too many possible locations (the alignment was too 'ambiguous'). If you want to use Bowtie to align to known transcript sequences or to arbitrary other sequences instead, have a look at these modules: 'Align against cDNA with Bowtie (v0.12.3)' and 'Align against Sequences with Bowtie (v0.12.3)'.
Configuration:
Bowtie offers a wealth of customization parameters and thus the configuration dialog might look a bit overwhelming at first, but it's really quite simple! When you try to align sequences to a reference sequence, it will often not be possible to find an exact match for every single sequence. There are several reasons for this: On the one hand, the sequences at hand might be (partially) erroneous and on the other hand, the reference sequence might be incomplete, polymorphic (SNPs, etc.) or plainly wrong. To account for this, one generally allows a number of mismatches between the sequences to be aligned and the reference sequence. Bowtie can either do this by explicitly allowing 0-3 mismatches or, taking sequence quality scores into account (cp. SubConcept 'Quality Scores and Formats'), by imposing a maximum total sum of mismatched quality scores. You might furthermore want to change the maximum number of reported alignments and the maximum ambiguity per alignment. By default, GeneProf's Bowtie module will only report uniquely aligned reads, but you can choose to report reads with at most 25 possible alignments. For further details of all available parameters, please refer to the Bowtie manual.
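For readers who know the Bowtie command line, the following Python sketch shows how the two mismatch modes and the reporting limits might translate into Bowtie 1 options (-n, -l, -e, -v, -k, -m, --best, --strata, -p). It is an illustration only: the function name, its parameter names and the file names are hypothetical, and the exact command GeneProf runs internally is not documented here. The choice of -m 1 mirrors the 'uniquely aligned reads' default described above and is an assumption.

```python
def bowtie_args(index, reads, hits, use_qualities=True, seed_mismatches=2,
                seed_length=28, max_quality_sum=70, max_mismatches=2,
                max_reported=1, max_ambiguity=1, best=True, strata=True,
                threads=1):
    """Build a Bowtie 1 argument list for single-end reads."""
    for value in (seed_mismatches, max_mismatches):
        if not 0 <= value <= 3:
            raise ValueError("Bowtie allows 0-3 mismatches")
    args = ["bowtie"]
    if use_qualities:
        # quality-aware mode: mismatches in the seed plus a quality-sum cap
        args += ["-n", str(seed_mismatches), "-l", str(seed_length),
                 "-e", str(max_quality_sum)]
    else:
        # plain mode: a fixed number of mismatches over the whole read
        args += ["-v", str(max_mismatches)]
    args += ["-k", str(max_reported), "-m", str(max_ambiguity)]
    if best:
        args.append("--best")
        if strata:
            args.append("--strata")  # --strata is only valid with --best
    args += ["-p", str(threads), index, reads, hits]
    return args


print(" ".join(bowtie_args("genome_index", "reads.fq", "hits.txt")))
```

Raising max_ambiguity (and max_reported along with it) would correspond to accepting reads with several possible alignments instead of unique ones only.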

Align against Sequences with Bowtie (v0.12.3)

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [9][17]
Related Web Page(s): Bowtie Homepage
Input(s): Input SequencesThe sequences that you would like to align.
Reference SequencesThe sequences to be used as a reference for alignment (<=1 megabase in total length).
Output(s): AlignmentsThe genomic coordinates of the regions where the sequences aligned to.
Unaligned SequencesAll sequences that could not be aligned to the genome at all.
Ambiguous SequencesAll sequences that had too many possible alignments (i.e. highly ambiguous / repetitive sequences).
Parameter(s): Use quality values?Shall the alignment tool make use of read quality scores?
Max. mismatches in seedThe maximum number of mismatches allowed in the seed portion of the read. See also: parameter -l.
Seed lengthThe seed of a read is the left portion of the read which is more likely to be of a good quality.
Max. sum of mismatch qualsMaximum sum of quality values at all mismatched read positions.
Disable quality rounding?By default, bowtie rounds quality values to the nearest 10 with a maximum of 30.
Max. mismatchesThe maximum number of mismatches across the entire alignment.
No forward strand?Do not attempt to align reads to the forward strand of the reference.
No reverse strand?Do not attempt to align reads to the reverse strand of the reference.
Max. reported alignmentsReport up to K valid alignments per read.
Max. ambiguitySuppress all alignments if more than M valid alignments exist.
Best possible hitsMake Bowtie guarantee that reported singleton alignments are 'best' in terms of stratum (i.e. number of mismatches, or mismatches in the seed in the case of -n mode) and in terms of the quality values at the mismatched position(s).
Use strataAn alignment's stratum is defined by its number of mismatches. Activating this option has the consequence that only alignments with fewer mismatches will be reported, even if further alignments with more mismatches exist.
Share memoryMultiple bowtie processes on the same computer share their memory (does not affect your results).
Use multiple coresAllows bowtie to use more than one CPU on the same computer (does not affect your results).
Min. Insert SizeThe minimum insert size for valid paired-end alignments.
Max. Insert SizeThe maximum insert size for valid paired-end alignments.
Mate OrientationThe upstream/downstream mate orientations for a valid paired-end alignment against the forward reference strand.
No strataUse this parameter to switch OFF bowtie's --strata option: An alignment's stratum is defined by its number of mismatches. Activating this option (as by GeneProf default) has the consequence that only alignments with fewer mismatches will be reported, even if further alignments with more mismatches exist.
Name for Unaligned SequencesA user-defined name that will be assigned to the output dataset from port Unaligned Sequences.
Name for Ambiguous SequencesA user-defined name that will be assigned to the output dataset from port Ambiguous Sequences.
Name for AlignmentsA user-defined name that will be assigned to the output dataset from port Alignments.
Description:
This module uses the Bowtie alignment software (version 0.12.3, [9]) to place a collection of nucleotide sequences on arbitrary other sequences given to the module. Bowtie (and this module) can be customized in many different ways to modify the behaviour of the alignment process. The module will report its output in three different datasets: Successfully aligned reads will be stored in a genomic region dataset in which all reads aligning to the same genomic position are clustered into one entry (keeping track of the identifiers of all reads that aligned here). Additionally, GeneProf will report all reads that failed to align to the reference either because no successful match could be found at all or because there were too many possible locations (the alignment was too 'ambiguous'). If you want to use Bowtie to align to a well-defined genome reference or known transcript sequences instead, have a look at these modules: 'Align against DNA with Bowtie (v0.12.3)' and 'Align against cDNA with Bowtie (v0.12.3)'.
Configuration:
Bowtie offers a wealth of customization parameters and thus the configuration dialog might look a bit overwhelming at first, but it's really quite simple! When you try to align sequences to a reference sequence, it will often not be possible to find an exact match for every single sequence. There are several reasons for this: On the one hand, the sequences at hand might be (partially) erroneous and on the other hand, the reference sequence might be incomplete, polymorphic (SNPs, etc.) or plainly wrong. To account for this, one generally allows a number of mismatches between the sequences to be aligned and the reference sequence. Bowtie can either do this by explicitly allowing 0-3 mismatches or, taking sequence quality scores into account (cp. SubConcept 'Quality Scores and Formats'), by imposing a maximum total sum of mismatched quality scores. You might furthermore want to change the maximum number of reported alignments and the maximum ambiguity per alignment. By default, GeneProf's Bowtie module will only report uniquely aligned reads, but you can choose to report reads with at most 25 possible alignments. For further details of all available parameters, please refer to the Bowtie manual.
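The clustering of reads that align to the same position, as mentioned in the description of this module, can be sketched in a few lines of Python. This is an illustration under assumptions rather than GeneProf's code: alignments are represented as hypothetical tuples of read identifier, reference name, start, end and strand, and two alignments count as 'the same position' only if all four location values are identical.

```python
from collections import defaultdict


def cluster_alignments(alignments):
    """Group read identifiers by the exact location they aligned to."""
    clusters = defaultdict(list)
    for read_id, name, start, end, strand in alignments:
        clusters[(name, start, end, strand)].append(read_id)
    return dict(clusters)


aligned = [
    ("read1", "seqA", 100, 136, "+"),
    ("read2", "seqA", 100, 136, "+"),
    ("read3", "seqA", 100, 136, "-"),
]
for location, read_ids in cluster_alignments(aligned).items():
    print(location, len(read_ids), read_ids)
```

Each entry keeps the list of read identifiers, so the number of reads per location can be recovered from the clustered dataset.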

Align against cDNA with Bowtie (v0.12.3)

Developer(s): Halbritter F
Version: 2012-04-10
Related Citation(s): [9][17]
Related Web Page(s): Bowtie Homepage
Input(s): SequencesThe sequences that you would like to align to the genome.
Reference GenomeThe reference transcript sequences that you would like to align to.
Output(s): Expressed FeaturesAll input sequences aligning to transcripts of the same feature will be summed up and be reported in this dataset.
Unaligned SequencesAll sequences that could not be aligned to the genome at all.
Ambiguous SequencesAll sequences that had too many possible alignments (i.e. highly ambiguous / repetitive sequences).
Parameter(s): Use quality values?Shall the alignment tool make use of read quality scores?
Max. mismatches in seedThe maximum number of mismatches allowed in the seed portion of the read. See also: parameter -l.
Seed lengthThe seed of a read is the left portion of the read which is more likely to be of a good quality.
Max. sum of mismatch qualsMaximum sum of quality values at all mismatched read positions.
Disable quality rounding?By default, bowtie rounds quality values to the nearest 10 with a maximum of 30.
Max. mismatchesThe maximum number of mismatches across the entire alignment.
No forward strand?Do not attempt to align reads to the forward strand of the reference.
No reverse strand?Do not attempt to align reads to the reverse strand of the reference.
Max. reported alignmentsReport up to K valid alignments per read.
Max. ambiguitySuppress all alignments if more than M valid alignments exist.
Best possible hitsMake Bowtie guarantee that reported singleton alignments are 'best' in terms of stratum (i.e. number of mismatches, or mismatches in the seed in the case of -n mode) and in terms of the quality values at the mismatched position(s).
Use strataAn alignment's stratum is defined by its number of mismatches. Activating this option has the consequence that only alignments with fewer mismatches will be reported, even if further alignments with more mismatches exist.
Share memoryMultiple bowtie processes on the same computer share their memory (does not affect your results).
Use multiple coresAllows bowtie to use more than one CPU on the same computer (does not affect your results).
Min. Insert SizeThe minimum insert size for valid paired-end alignments.
Max. Insert SizeThe maximum insert size for valid paired-end alignments.
Mate OrientationThe upstream/downstream mate orientations for a valid paired-end alignment against the forward reference strand.
No strataUse this parameter to switch OFF bowtie's --strata option: An alignment's stratum is defined by its number of mismatches. Activating this option (as by GeneProf default) has the consequence that only alignments with fewer mismatches will be reported, even if further alignments with more mismatches exist.
Name for Ambiguous SequencesA user-defined name that will be assigned to the output dataset from port Ambiguous Sequences.
Name for Unaligned SequencesA user-defined name that will be assigned to the output dataset from port Unaligned Sequences.
Name for AlignmentsA user-defined name that will be assigned to the output dataset from port Alignments.
Split mates?Split up mate-pair / paired-end reads and map them individually.
Name for Expressed FeaturesA user-defined name that will be assigned to the output dataset from port Expressed Features.
Description:
This module uses the Bowtie alignment software (version 0.12.3, [9]) to place a collection of nucleotide sequences on a reference transcriptome sequence. Bowtie (and this module) can be customized in many different ways to modify the behaviour of the alignment process. The module will report its output in three different datasets: Successfully aligned reads will be stored in a feature dataset recording the number of reads aligned to transcripts of each individual gene. Ambiguously aligned reads will be assigned proportionally to their most likely transcript of origin in the same fashion as described for module 'Quantitate Gene Expression'. Read counts will be normalised to RPM (reads per million) to account for differences in library sizes, but the 'raw' counts will also be kept in case you need them. Additionally, GeneProf will report all reads that failed to align to the reference either because no successful match could be found at all or because there were too many possible locations (the alignment was too 'ambiguous'). If you want to use Bowtie to align to a genome sequence or to arbitrary other sequences instead, have a look at these modules: 'Align against DNA with Bowtie (v0.12.3)' and 'Align against Sequences with Bowtie (v0.12.3)'.
Configuration:
Bowtie offers a wealth of customization parameters and thus the configuration dialog might look a bit overwhelming at first, but it's really quite simple! When you try to align sequences to a reference sequence, it will often not be possible to find an exact match for every single sequence. There are several reasons for this: On the one hand, the sequences at hand might be (partially) erroneous and on the other hand, the reference sequence might be incomplete, polymorphic (SNPs, etc.) or plainly wrong. To account for this, one generally allows a number of mismatches between the sequences to be aligned and the reference sequence. Bowtie can either do this by explicitly allowing 0-3 mismatches or, taking sequence quality scores into account (cp. SubConcept 'Quality Scores and Formats'), by imposing a maximum total sum of mismatched quality scores. You might furthermore want to change the maximum number of reported alignments and the maximum ambiguity per alignment. By default, GeneProf's Bowtie module will only report uniquely aligned reads, but you can choose to report reads with at most 25 possible alignments. For further details of all available parameters, please refer to the Bowtie manual.
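The RPM normalisation mentioned in the description of this module scales each raw count by the library size. The Python sketch below shows the arithmetic only. It assumes that the library size is the sum of all counted reads, which may differ from the total GeneProf uses, and the function name and gene labels are hypothetical.

```python
def reads_per_million(raw_counts):
    """Scale raw read counts per gene to reads per million (RPM)."""
    total = sum(raw_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in raw_counts}
    # assumption: the library size is the sum of all counted reads
    return {gene: count * 1_000_000 / total for gene, count in raw_counts.items()}


counts = {"geneA": 150, "geneB": 50, "geneC": 0}
print(reads_per_million(counts))
```

Because every count is divided by the same total, RPM values from libraries of different depth become comparable, while the raw counts remain available for tools that require them.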

TopHat 1.2 Alignment

Developer(s): Halbritter F (Module Developer); Trapnell C (Software Developer)
Version: 2012-11-14 (Tophat 1.2.0)
Related Citation(s): [10][17]
Related Web Page(s): TopHat Homepage, Picard Tools (for SAM parsing)
Input(s): SequencesThe sequences that you would like to align to the genome.
Reference GenomeThe reference genome / transcript sequences that you would like to align to.
Output(s): AlignmentsThe genomic coordinates of the regions where the sequences aligned to.
JunctionsThe putative novel splice junctions discovered by TopHat.
Parameter(s): Min. Anchor LengthTopHat will report junctions spanned by reads with at least this many bases on each side of the junction. Note that individual spliced alignments may span a junction with fewer than this many bases on one side. However, every junction involved in spliced alignments is supported by at least one read with this many bases on each side. This must be at least 3 and the default is 8.
Splice MismatchesThe maximum number of mismatches that may appear in the anchor region of a spliced alignment. The default is 0.
Min. Intron LengthTopHat will ignore donor/acceptor pairs closer than this many bases apart. The default is 70.
Max. Intron LengthWhen searching for junctions ab initio, TopHat will ignore donor/acceptor pairs farther than this many bases apart, except when such a pair is supported by a split segment alignment of a long read. The default is 500000.
Max. MultihitsInstructs TopHat to allow up to this many alignments to the reference for a given read, and suppresses all alignments for reads with more than this many alignments.
Min. Isoform FractionTopHat filters out junctions supported by too few alignments. Suppose a junction spanning two exons, A and B, is supported by S reads. Let the average depth of coverage of exon A be D, and assume that it is higher than that of exon B. If S / D is less than the minimum isoform fraction, the junction is not reported. A value of zero disables the filter. The default is 0.15.
Mate Inner DistanceThis is the expected (mean) inner distance between mate pairs. For example, for paired end runs with fragments selected at 300bp, where each end is 50bp, you should set this parameter (TopHat's -r option) to 200. There is no default, and this parameter is required for paired end runs.
Mate Inner Distance STDThe standard deviation for the distribution on inner distances between mate pairs. The default is 20bp.
No Novel JunctionsDo not attempt to find any new splice junctions.
No Known JunctionsDo not use known splice junctions.
Microexon SearchWith this option, the pipeline will attempt to find alignments incident to microexons. Works only for reads 50bp or longer.
Library TypeThe type of library sequenced.
Butterfly SearchTopHat will use a slower but potentially more sensitive algorithm to find junctions in addition to its standard search. Consider using this if you expect that your experiment produced a lot of reads from pre-mRNA, that fall within the introns of your transcripts.
Coverage SearchUse coverage search? Will be disabled by default, if reads are longer than 75bp, but can be forced ON.
Closure SearchUse mate pair closure-based search for junctions? Closure-based search should only be used when the expected inner distance between mates is small (<= 50bp).
Segment MismatchesRead segments are mapped independently, allowing up to this many mismatches in each segment alignment. The default is 2.
Segment LengthEach read is cut up into segments, each at least this long. These segments are mapped independently. The default is 25.
Min. Closure ExonDuring closure search for paired end reads, exonic hops in the potential splice graph must be at least this long. The default is 50.
Min. Closure IntronThe minimum intron length that may be found during closure search. The default is 50.
Max. Closure IntronThe maximum intron length that may be found during closure search. The default is 5000.
Min. Coverage IntronThe minimum intron length that may be found during coverage search. The default is 50.
Max. Coverage IntronThe maximum intron length that may be found during coverage search. The default is 20000.
Min. Segment IntronThe minimum intron length that may be found during split-segment search. The default is 50.
Max. Segment IntronThe maximum intron length that may be found during split-segment search. The default is 500000.
Name for AlignmentsA user-defined name that will be assigned to the output dataset from port Alignments.
Name for JunctionsA user-defined name that will be assigned to the output dataset from port Junctions.
Description:
This module uses the TopHat alignment tool [10] to place short read sequences on a reference genome (and Picard to parse the output). TopHat can deal with (transcript) reads extending across splice junctions by using known gene models and (optionally) by trying to discover novel junctions using an in-built algorithm. GeneProf will automatically supply TopHat with the known splice junctions from the reference dataset used. TopHat (and this module) can be customized in many different ways to modify the behaviour of the alignment process.
Configuration:
The module takes many parameters -- please refer to the TopHat manual for a detailed explanation of each of them. In GeneProf, you can customize each of these parameters using a slider control: Just drag & drop the slider handle (moving it left / right to decrease / increase the value) to select the value you want from a range of valid numbers. Alternatively, you can enter a value into the respective text box. Some of the most important parameters (in terms of influencing the quality of the results) are arguably the number of mismatches ('Segment Mismatches') and maximum ambiguity ('Max. Multihits') as well as the inner distance and standard deviation for mate-paired reads ('Mate Inner Distance' and 'Mate Inner Distance STD'). Please make sure to read up (at least) about these parameters.
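Two of the parameters listed above boil down to simple arithmetic, which the following Python sketch spells out: the expected mate inner distance and the minimum isoform fraction rule (S / D). The function names and the handling of a zero coverage depth are assumptions made for this illustration; TopHat's own implementation may differ in detail.

```python
def mate_inner_distance(fragment_length, read_length):
    """Expected inner distance: the fragment minus the two sequenced ends."""
    return fragment_length - 2 * read_length


def junction_passes_isoform_filter(supporting_reads, coverage_a, coverage_b,
                                   min_isoform_fraction=0.15):
    """Apply the S / D rule, where D is the higher of the two exon coverages."""
    if min_isoform_fraction == 0:
        return True  # a value of zero disables the filter
    depth = max(coverage_a, coverage_b)
    if depth == 0:
        return True  # assumption: with no coverage there is nothing to compare against
    return supporting_reads / depth >= min_isoform_fraction


# 300bp fragments with 50bp sequenced from each end.
inner = mate_inner_distance(300, 50)

# 10 junction reads against an exon depth of 100 gives 0.10, below the default 0.15.
weak_junction = junction_passes_isoform_filter(10, 100, 40)
strong_junction = junction_passes_isoform_filter(20, 100, 40)
```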

Reference Set Manipulation

Use these modules to build a custom reference for alignment and feature-based modules.

Add Annotations to Reference

Developer(s): Halbritter F
Version: 2012-01-06
Related Citation(s): [17]
Input(s): Original ReferenceThe original reference set.
Additional AnnotationsTab-delimited text file containing the annotations you'd like to add.
Output(s): Extended ReferenceThe original feature set extended by the new annotation values.
Parameter(s): Column TypesIn order for GeneProf to make sense of the provided annotation you need to specify what sort of information is contained in each column.
Column LabelsUser-defined labels for each column.
Column DescriptionsUser-defined descriptions for each column.
Column URLsUser-defined URLs for each column.
Ignore header?Do not parse the content of the first line (say, if it contains column labels).
Match in referenceThe column in the original reference set to use for matching entries.
Name for Extended ReferenceA user-defined name that will be assigned to the output dataset from port Extended Reference.
Description:
This module extends a reference dataset by adding one or more additional annotation columns to it. The annotation columns are parsed from a tab-delimited text file in a manner defined by the user.
Configuration:
In order to successfully add new annotation to the reference, we need to find some way to establish a connection between the data in your text file and the existing reference. In GeneProf, this is done by mapping each row in the file to an entry (a feature) in the reference on the basis of some sort of common identifier in both -- similar to the way feature datasets are parsed (cp. module 'Feature Annotations Parser'). So, first identify a column in your file that contains an identifier that is already in the reference. For example, GeneProf's recommended reference sets are based on Ensembl gene sets, so each of the features in them has a unique Ensembl gene ID associated with it. Thus, sensibly, the file should have a column containing an Ensembl ID per row. Whatever the ID type is, select Feature ID as its Type in the configuration dialog, then pick the corresponding annotation in the reference from the drop-down list above.
Once this is done, you can think about the new data you want to add. For each column with new annotation select a Type -- either by picking a pre-defined data type or by defining a new custom ID, number, text or boolean type. For custom types, please enter a sensible, short label and description.
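The matching step described above can be sketched in a few lines of Python. This is only an illustration of the idea (join rows of a tab-delimited file onto reference entries via a shared identifier); the function name, the dictionary layout, the made-up identifiers and the decision to skip rows without a match are all assumptions, not GeneProf's actual behaviour.

```python
import csv
import io


def add_annotations(reference, annotation_text, id_column, ignore_header=True):
    """Return a copy of `reference` extended by columns from tab-delimited text.

    reference: dict mapping a feature ID to a dict of existing annotations.
    id_column: index of the column in the text that holds the feature ID.
    Rows whose ID is not found in the reference are skipped (an assumption).
    """
    rows = list(csv.reader(io.StringIO(annotation_text), delimiter="\t"))
    extended = {fid: dict(ann) for fid, ann in reference.items()}
    if not rows:
        return extended
    if ignore_header:
        # Use the first line for column labels only, not as data.
        labels, rows = rows[0], rows[1:]
    else:
        labels = [f"column_{i}" for i in range(len(rows[0]))]
    for row in rows:
        feature_id = row[id_column]
        if feature_id not in extended:
            continue
        for index, value in enumerate(row):
            if index != id_column:
                extended[feature_id][labels[index]] = value
    return extended


# Hypothetical reference and annotation file.
reference = {"gene_1": {"name": "Alpha"}, "gene_2": {"name": "Beta"}}
annotation_text = "feature_id\tpathway\ngene_1\tapoptosis\ngene_9\tunknown\n"
extended = add_annotations(reference, annotation_text, id_column=0)
```

The original reference is left untouched, mirroring the module's behaviour of producing a new 'Extended Reference' output rather than modifying its input.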

Create Transcriptome-only Reference

Developer(s): Halbritter F
Version: 2012-03-27
Input(s): Transcriptome AssemblyA set of sequences, each corresponding to a contig of a transcriptome assembly.
Output(s): Transcriptome ReferenceThe new reference set defined by this module from the input data provided.
Parameter(s): Taxonomy CodeThe NCBI taxonomy code for the organism from this reference.
Name for Transcriptome ReferenceA user-defined name that will be assigned to the output dataset from port Transcriptome Reference.
Description:
This module is used to create a new reference dataset (cp. SubConcept 'Reference Data'). TODO
Configuration:
TODO

Define a new Reference Set

Developer(s): Halbritter F
Version: 2012-01-05
Related Citation(s): [17]
Input(s): Feature Description FileA tab-delimited text file containing information about the features in the new reference set.
Chromosome SequencesThe sequences of all chromosomes in the reference genome assembly that this reference is to be based on (optional, but required if you want to define any sort of feature / transcript coordinates). Paired-end / mate-pair sequences will be collapsed (you shouldn't really use those here anyway).
Exon Coordinates FileA tab-delimited text file containing the coordinates of all exons (one per line) along with the ID of the corresponding transcript and gene / feature. The chromosome name has to correspond exactly to the name used in the definition of the chromosome sequences.
Transcript SequencesThe sequences of all transcript variants. The name of each sequence has to correspond to the ID you used in the definition of the transcript coordinates. Paired-end / mate-pair sequences will be collapsed (you shouldn't really use those here anyway).
Output(s): Reference SetThe new reference set defined by this module from the input data provided.
Parameter(s): Column TypesIn order for GeneProf to make sense of the provided annotation you need to specify what sort of information is contained in each column.
Column LabelsUser-defined labels for each column.
Column DescriptionsUser-defined descriptions for each column.
Column URLsUser-defined URLs for each column.
Ignore header?Do not parse the content of the first line (say, if it contains column labels).
Column Types (Exons)In order for GeneProf to make sense of the provided annotation you need to specify what sort of information is contained in each column.
Column Labels (Exons)User-defined labels for each column.
Column Descriptions (Exons)User-defined descriptions for each column.
Column URLs (Exons)User-defined URLs for each column.
Ignore header? (Exons)Do not parse the content of the first line (say, if it contains column labels).
Taxonomy CodeThe NCBI taxonomy code for the organism from this reference.
Name for Reference SetA user-defined name that will be assigned to the output dataset from port Reference Set.
Description:
This module is used to create a new reference dataset (cp. SubConcept 'Reference Data'). For this to work, you need four inputs: (1) A tab-delimited text file containing information about the genes and other transcriptional features in the new reference set. The file MUST contain one column giving a unique, stable identifier per record, as well as a name, chromosome, strand, start and end coordinate. The file may also contain further annotations. (2) A sequence dataset with all chromosome sequences. The names of these sequences are going to be used as chromosome names, thus they have to correspond EXACTLY to the names used in the features and exons file for chromosome coordinates. (3) An exon coordinates file. This, again, is a tab-delimited text file with one record for each exon. Each exon requires two stable identifiers, the first being the feature it belongs to (exactly the same used for the features file), the second another stable identifier for the transcript variant grouping several exons into one unit. Additionally, each exon entry requires genomic coordinates (chromosome, start, end, strand). No other annotations are allowed. (4) A sequence dataset containing transcript sequences. The names of these sequences have to EXACTLY correspond to the transcript identifiers used in the exons file.
Configuration:
Start by defining the organism this reference is for. Start entering the name of the organism in the left textbox. GeneProf will search NCBI's taxonomy and suggest matching terms for you. Look for the correct organism and select it from the list of suggestions. GeneProf will then store the taxon ID of the right organism as part of the reference.
Next, have a look at the sample of the features text file displayed below. Select the columns that contain the primary stable identifier and name, chromosome, strand, start and end coordinate. You can then add additional columns, e.g. a feature type or description, in the same way you define columns in module 'Feature Annotations Parser'.
You need to define the layout of the exon coordinates file in the same way. Again, select the chromosome, strand, start and end coordinate columns. Then select the column which holds gene identifiers matching the stable identifiers selected in the features file above as Feature ID and another stable identifier column for transcript IDs.
After defining the basic reference set, you may want to add additional annotations using module 'Add Annotations to Reference'.
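Since the four inputs have to agree EXACTLY on chromosome names, feature IDs and transcript IDs, it can save time to check your files before uploading them. The Python sketch below shows one way to do such a check; the function name and the dictionary keys are hypothetical and simply stand for the columns described above.

```python
def check_reference_inputs(feature_rows, exon_rows, chromosome_names, transcript_names):
    """Collect naming mismatches between the four inputs.

    feature_rows: list of dicts with keys 'id' and 'chromosome'.
    exon_rows: list of dicts with keys 'feature_id', 'transcript_id', 'chromosome'.
    chromosome_names / transcript_names: sequence names in the two sequence datasets.
    """
    problems = []
    chromosomes = set(chromosome_names)
    transcripts = set(transcript_names)
    feature_ids = set()
    for row in feature_rows:
        if row["id"] in feature_ids:
            problems.append(f"duplicate feature ID: {row['id']}")
        feature_ids.add(row["id"])
        if row["chromosome"] not in chromosomes:
            problems.append(f"feature {row['id']}: unknown chromosome {row['chromosome']}")
    for row in exon_rows:
        if row["feature_id"] not in feature_ids:
            problems.append(f"exon refers to unknown feature {row['feature_id']}")
        if row["transcript_id"] not in transcripts:
            problems.append(f"exon refers to unknown transcript {row['transcript_id']}")
        if row["chromosome"] not in chromosomes:
            problems.append(f"exon of {row['transcript_id']}: unknown chromosome {row['chromosome']}")
    return problems


# Hypothetical inputs: the exon file says '1' where the sequences say 'chr1'.
problems = check_reference_inputs(
    feature_rows=[{"id": "g1", "chromosome": "chr1"}],
    exon_rows=[{"feature_id": "g1", "transcript_id": "t1", "chromosome": "1"}],
    chromosome_names=["chr1"],
    transcript_names=["t1"],
)
```

A mismatch such as 'chr1' versus '1' is probably the most common cause of this kind of problem, which is why the check compares the names as plain strings without any normalisation.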

Extract Regions from Reference

Developer(s): Halbritter F
Version: 2012-03-27
Input(s): ReferenceAn arbitrary reference dataset.
Output(s): Genomic RegionsThe coordinates extracted.
Parameter(s): Coordinate TypeThe type of coordinates to extract: promoters, features, transcripts, exons, ..
Window (US)Window size upstream of the core coordinates to include.
Window (DS)Window size downstream of the core coordinates to include.
Use exon structureExtract the exon structure of features instead of just the interval spanning start to end.
Name for Genomic RegionsA user-defined name that will be assigned to the output dataset from port Genomic Regions.
Description:
This module is used to create a new genomic regions dataset (cp. SubConcept 'Genomic Region Data') from the feature annotations provided in a reference dataset (cp. SubConcept 'Reference Data'). TODO
Configuration:
TODO
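Until this section is completed, here is a minimal Python sketch of what the two window parameters listed above amount to. It assumes that 'upstream' and 'downstream' are interpreted relative to the strand of the feature and that coordinates are 1-based; neither assumption is confirmed by this manual, and the function name is made up for the example.

```python
def extend_region(start, end, strand, upstream=0, downstream=0):
    """Extend the interval [start, end] by an upstream and a downstream window.

    Assumption: on the reverse strand ('-'), upstream lies beyond the end
    coordinate, so the two windows swap sides. Start positions are clamped at 1.
    """
    if strand == "-":
        new_start, new_end = start - downstream, end + upstream
    else:
        new_start, new_end = start - upstream, end + downstream
    return max(1, new_start), new_end


forward = extend_region(1000, 2000, "+", upstream=500, downstream=100)
reverse = extend_region(1000, 2000, "-", upstream=500, downstream=100)
clamped = extend_region(100, 200, "+", upstream=500)
```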

Map Features to Another Reference

Developer(s): Halbritter F
Version: 2012-03-27
Input(s): FeaturesFeatures that shall be re-mapped.
New ReferenceThe reference set you would like to map to.
Mappings FileThe reference set you would like to map to.
Output(s): Remapped FeaturesRemapped features. Features that could not be mapped will be retained in the second output.
Unnmapped FeaturesAll features for which no match in the new reference could be found.
Parameter(s): Match in referenceThe column in the original reference set to use for matching entries.
Match in annotationsThe column in the new annotation set to use for matching entries.
Name for Remapped FeaturesA user-defined name that will be assigned to the output dataset from port Remapped Features.
Name for Unnmapped FeaturesA user-defined name that will be assigned to the output dataset from port Unnmapped Features.
Description:
Maps the features in a dataset onto another reference on the basis of the values of one column in the annotations of the features matching one column in the new reference.
Configuration:
Coming soon.
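In the meantime, the following Python sketch illustrates the matching idea from the description: features are looked up in the new reference via one annotation column, and anything without a match ends up in a second output. The function and argument names are hypothetical, and the rule that the first matching reference entry wins is an assumption.

```python
def remap_features(features, new_reference, match_in_annotations, match_in_reference):
    """Split `features` into remapped and unmapped entries.

    features / new_reference: dicts mapping an ID to a dict of annotations.
    match_in_annotations: annotation key of the features used for matching.
    match_in_reference: annotation key of the new reference used for matching.
    """
    index = {}
    for ref_id, annotations in new_reference.items():
        key = annotations.get(match_in_reference)
        if key is not None:
            index.setdefault(key, ref_id)  # assumption: the first match wins
    remapped, unmapped = {}, {}
    for feature_id, annotations in features.items():
        target = index.get(annotations.get(match_in_annotations))
        if target is None:
            unmapped[feature_id] = annotations
        else:
            remapped[target] = annotations
    return remapped, unmapped


# Hypothetical data: match on a shared gene symbol.
features = {"old_1": {"symbol": "Alpha", "rpm": 12.5}, "old_2": {"symbol": "Gamma", "rpm": 3.0}}
new_reference = {"new_7": {"symbol": "Alpha"}, "new_8": {"symbol": "Beta"}}
remapped, unmapped = remap_features(features, new_reference, "symbol", "symbol")
```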

Parse Reference Set from GenBank

Developer(s): Halbritter F
Version: 2012-02-21
Input(s): GenBank FilesA tab-delimited text file containing information about the features in the new reference set.
Output(s): Reference SetThe new reference set defined by this module from the input data provided.
Parameter(s): Taxonomy CodeThe NCBI taxonomy code for the organism from this reference.
Name for Reference SetA user-defined name that will be assigned to the output dataset from port Reference Set.
Description:
This module is used to create a new reference dataset (cp. SubConcept 'Reference Data'). TODO
Configuration:
TODO

Transcriptomics

Modules specifically designed for the analysis of RNA-seq, DeepSAGE/Tag-Seq and other transcriptomics data.

DESeq

Developer(s): Halbritter F (Module Developer); Anders S (Software Developer)
Version: 2011-03-15
Related Citation(s): [16][17]
Related Web Page(s): DESeq @ BioConductor
Input(s): Merged DatasetA dataset containing expression values for multiple samples.
Output(s): Advanced ComparisonThe original dataset augmented by calculated fold changes and p-values between all specified sample groups.
Parameter(s): Sample GroupingsCompare these groups of samples for differential expression.
Use total library size?Use the total number of mapped reads as library size instead of the DESeq estimate.
Name for Advanced ComparisonA user-defined name that will be assigned to the output dataset from port Advanced Comparison.
Description:
This module uses the DESeq Bioconductor package [16] to calculate the statistical significance of differences between groups of numbers. Please refer to the original research publication for details about the statistical methods and algorithms used. DESeq has been designed for the detection of genes that are differentially expressed between different conditions / treatments / groups based on gene expression estimates derived from RNA-seq experiments, but its use might reach beyond this scope and applications to ChIP-seq data or other numerical values are conceivable. GeneProf does not limit the kinds of numbers that go into the calculation in order to leave this scope open, but this does mean that you need to take a bit of care when using this module, i.e. rubbish in means rubbish out! Important: This module will NOT filter the output dataset, it only calculates the P-Values! You may subsequently use a filter module to look for differentially expressed genes by imposing a threshold on these P-Values.
Configuration:
To configure this module, you need to tell GeneProf which groups of numbers (presumably, expression values) to compare to each other. The module calculates statistical significance for differences in values pair-wise. Several such comparisons can be made by one run of this module. For each comparison, select all columns for one group (treatment, abnormal, condition B, cell type B, ..) in the select box on the left and then all columns for the second group (wild type, background, condition A, cell type A, ..) in the select box on the right. Enter a name for the comparison (e.g. Treatment vs Control) and click Add Comparison. You can remove a previously added comparison by clicking the red X icon.
DESeq, by default, uses a robust (median-of-ratios) estimate of the library sizes for normalization, which is thought to be better than the total library size. If you do NOT wish to use DESeq's size estimate, tick the checkbox below to scale the libraries by their total size instead.
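To give a feeling for the difference between the two normalisation options, the Python sketch below contrasts scaling by total library size with a median-of-ratios estimate (the approach described in the DESeq publication, as far as we are aware). It is a simplified re-implementation for illustration only; the function names are made up and the real package should be used for actual analyses.

```python
import math
from statistics import median


def total_size_factors(counts):
    """Library size per sample, scaled so that the factors average to 1."""
    totals = [sum(sample) for sample in counts]
    mean_total = sum(totals) / len(totals)
    return [total / mean_total for total in totals]


def median_ratio_size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of samples, each a list of counts in the same gene order.
    Genes with a zero count in any sample are skipped.
    """
    n_genes = len(counts[0])
    usable = [g for g in range(n_genes) if all(sample[g] > 0 for sample in counts)]
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")
    # Geometric mean per gene, computed on the log scale.
    log_geo_means = {
        g: sum(math.log(sample[g]) for sample in counts) / len(counts)
        for g in usable
    }
    return [
        median(sample[g] / math.exp(log_geo_means[g]) for g in usable)
        for sample in counts
    ]


# Hypothetical counts: sample 2 is sequenced twice as deeply and has one
# extremely highly expressed gene that dominates its total library size.
counts = [[10, 20, 30, 0], [20, 40, 60, 1000]]
by_total = total_size_factors(counts)
by_median_ratio = median_ratio_size_factors(counts)
```

In this made-up example the median-of-ratios factors differ by the expected factor of two, whereas the total library sizes are distorted by the single dominant gene.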

DESeq (for Region Data)

Developer(s): Halbritter F (Module Developer); Anders S (Software Developer)
Version: 2011-03-15
Related Citation(s): [16][17]
Related Web Page(s): DESeq @ BioConductor
Input(s): Merged DatasetA dataset containing expression values for multiple samples.
Output(s): Advanced ComparisonThe original dataset augmented by calculated fold changes and p-values between all specified sample groups.
Parameter(s): Sample GroupingsCompare these groups of samples for differential expression.
Use total library size?Use the total number of mapped reads as library size instead of the DESeq estimate.
Name for Advanced ComparisonA user-defined name that will be assigned to the output dataset from port Advanced Comparison.
Description:
This module uses the DESeq Bioconductor package [16] to calculate the statistical significance of differences between groups of numbers. Please refer to the original research publication for details about the statistical methods and algorithms used. DESeq has been designed for the detection of genes that are differentially expressed between different conditions / treatments / groups based on gene expression estimates derived from RNA-seq experiments, but its use might reach beyond this scope and applications to ChIP-seq data or other numerical values are conceivable. GeneProf does not limit the kinds of numbers that go into the calculation in order to leave this scope open, but this does mean that you need to take a bit of care when using this module, i.e. rubbish in means rubbish out! Important: This module will NOT filter the output dataset, it only calculates the P-Values! You may subsequently use a filter module to look for differentially expressed genes by imposing a threshold on these P-Values.
Configuration:
To configure this module, you need to tell GeneProf which groups of numbers (presumably, expression values) to compare to each other. The module calculates statistical significance for differences in values pair-wise. Several such comparisons can be made by one run of this module. For each comparison, select all columns for one group (treatment, abnormal, condition B, cell type B, ..) in the select box on the left and then all columns for the second group (wild type, background, condition A, cell type A, ..) in the select box on the right. Enter a name for the comparison (e.g. Treatment vs Control) and click Add Comparison. You can remove a previously added comparison by clicking the red X icon.
DESeq, by default, uses a robust (median-of-ratios) estimate of the library sizes for normalization, which is thought to be better than the total library size. If you do NOT wish to use DESeq's size estimate, tick the checkbox below to scale the libraries by their total size instead.

Differential Expression by Fold Change

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): Annotated DataA data set containing expression values (or other numerical values) for multiple samples.
Output(s): Advanced ComparisonThe original dataset augmented by calculated fold changes between all specified sample groups.
Parameter(s): Sample GroupingsCalculate the fold changes between these groups of values.
ScaleThe scale to be used for displaying the fold change. On the NORMAL scale, a value > 1 refers to up-regulation while a value < 1 means down-regulation. On the LOGARITHMIC scales, on the other hand, any positive number denotes an up- and any negative number a down-regulation. E.g., on the Log2- or Log10-scale, a value of 1 means a 2-fold or 10-fold up-regulation, respectively.
Add ConstantComparisons to 0 (i.e. if one group does not show any evidence of transcription) can pose a problem for the calculation of the fold change. Adding a small constant to each value can help to calculate more meaningful fold changes in this case, otherwise the fold change will be arbitrarily defined by the value of the non-zero group.
Name for Advanced ComparisonA user-defined name that will be assigned to the output dataset from port Advanced Comparison.
Description:
This module calculates a simple fold-change comparison between two (or more) groups of numeric values. We've designed this module with the detection of differentially expressed genes between different conditions / treatments / groups (based on gene expression estimates derived from RNA-seq experiments) in mind, but its use might reach beyond this scope and applications to ChIP-seq data or other numerical values are conceivable. GeneProf does not limit the kinds of numbers that go into the calculation in order to leave this scope open, but this does mean that you need to take a bit of care when using this module, i.e. rubbish in means rubbish out! Important: This module will NOT filter the output dataset, it only calculates the fold changes! You may subsequently use a filter module to look for differentially expressed genes by imposing a threshold on these fold changes.
Configuration:
To configure this module, you need to tell GeneProf which groups of numbers (presumably, expression values) to compare to each other. The module calculates fold changes between groups pair-wise. Several such comparisons can be made by one run of this module. For each comparison, select all columns for one group (treatment, abnormal, condition B, cell type B, ..) in the select box on the left and then all columns for the second group (wild type, background, condition A, cell type A, ..) in the select box on the right. Enter a name for the comparison (e.g. Treatment vs Control) and click Add Comparison. You can remove a previously added comparison by clicking the red X icon.
By default, GeneProf will calculate the fold changes on a logarithmic scale to the base of 2. You can change this to a base of 10 or to the 'normal' scale by picking a different option from the drop-down list. To avoid divisions by zero (or taking the logarithm of 0), GeneProf usually adds a small, stabilising constant of 1 to all values prior to further calculation. You can change this by altering the value in the text box labelled Add Constant.
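The effect of the Scale and Add Constant parameters can be illustrated with a short Python sketch. It assumes that each group is summarised by the mean of its (constant-adjusted) values before the ratio is taken; the manual does not state how GeneProf summarises a group, so treat this and the function name as assumptions.

```python
import math


def fold_change(group_a, group_b, scale="log2", add_constant=1.0):
    """Fold change of group A over group B on the requested scale.

    Assumption: the constant is added to every value and each group is
    then summarised by its mean.
    """
    mean_a = sum(value + add_constant for value in group_a) / len(group_a)
    mean_b = sum(value + add_constant for value in group_b) / len(group_b)
    ratio = mean_a / mean_b
    if scale == "normal":
        return ratio
    if scale == "log2":
        return math.log2(ratio)
    if scale == "log10":
        return math.log10(ratio)
    raise ValueError(f"unknown scale: {scale}")


up_log2 = fold_change([7, 7], [1, 1])            # (7 + 1) / (1 + 1) = 4
up_normal = fold_change([3], [1], scale="normal")
both_zero = fold_change([0, 0], [0, 0])
```

With the constant set to 0, a group without any signal would lead to a division by zero or a logarithm of zero, which is exactly the situation the stabilising constant is meant to avoid.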

Differential Expression by Fold Change (for Region Data)

Developer(s): Halbritter F
Version: 2011-03-15
Related Citation(s): [17]
Input(s): Annotated DataA data set containing expression values (or other numerical values) for multiple samples.
Output(s): Advanced ComparisonThe original dataset augmented by calculated fold changes between all specified sample groups.
Parameter(s): Sample GroupingsCalculate the fold changes between these groups of values.
ScaleThe scale to be used for displaying the fold change. On the NORMAL scale, a value > 1 refers to up-regulation while a value < 1 means down-regulation. On the LOGARITHMIC scales, on the other hand, any positive number denotes an up- and any negative number a down-regulation. E.g., on the Log2- or Log10-scale, a value of 1 means a 2-fold or 10-fold up-regulation, respectively.
Add ConstantComparisons to 0 (i.e. if one group does not show any evidence of transcription) can pose a problem for the calculation of the fold change. Adding a small constant to each value can help to calculate more meaningful fold changes in this case, otherwise the fold change will be arbitrarily defined by the value of the non-zero group.
Name for Advanced ComparisonA user-defined name that will be assigned to the output dataset from port Advanced Comparison.
Description:
This module calculates a simple fold-change comparison between two (or more) groups of numeric values. We've designed this module with the detection of differentially expressed genes between different conditions / treatments / groups (based on gene expression estimates derived from RNA-seq experiments) in mind, but its use might reach beyond this scope and applications to ChIP-seq data or other numerical values are conceivable. GeneProf does not limit the kinds of numbers that go into the calculation in order to leave this scope open, but this does mean that you need to take a bit of care when using this module, i.e. rubbish in means rubbish out! Important: This module will NOT filter the output dataset, it only calculates the fold changes! You may subsequently use a filter module to look for differentially expressed genes by imposing a threshold on these fold changes.
Configuration:
To configure this module, you need to tell GeneProf which groups of numbers (presumably, expression values) to compare to each other. The module calculates fold changes between groups pair-wise. Several such comparisons can be made by one run of this module. For each comparison, select all columns for one group (treatment, abnormal, condition B, cell type B, ..) in the select box on the left and then all columns for the second group (wild type, background, condition A, cell type A, ..) in the select box on the right. Enter a name for the comparison (e.g. Treatment vs Control) and click Add Comparison. You can remove a previously added comparison by clicking the red X icon.
By default, GeneProf will calculate the fold changes on a logarithmic scale to the base of 2. You can change this to a base of 10 or to the 'normal' scale by picking a different option from the drop-down list. To avoid divisions by zero (or taking the logarithm of 0), GeneProf usually adds a small, stabilising constant of 1 to all values prior to further calculation. You can change this by altering the value in the text box labelled Add Constant.

EdgeR

Developer(s): Halbritter F (Module Developer); Robinson M, McCarthy D, Smyth G (Software Developer)
Version: 2011-03-15
Related Citation(s): [11][17]
Related Web Page(s): edgeR @ BioConductor
Input(s): Merged Feature SetA dataset containing expression values for multiple samples.
Output(s): Advanced ComparisonThe original dataset augmented by calculated fold changes and p-values between all specified sample groups.
Parameter(s): Sample GroupingsCompare these groups of samples for differential expression.
Common or Tag-Wise Dispersion?EdgeR can either use a common dispersion estimate or a tag-wise moderated dispersion parameter for its negative binomial model.
Bayes Dispersion Estimate?Estimate the dispersion parameter empirically using Bayes' rule (only works for a large number of replicates!).
A Priori WeightThe higher this number, the more the tag-wise dispersion parameter will be squeezed towards the common dispersion. You should choose a rather high number (say 100) if only a few samples are available (fewer than 6) and progressively smaller numbers with a larger number of samples (around 10 for 8 samples, etc.).
Multiple-Testing AdjustmentThe method that should be used for multiple-testing adjustment of p-values.
Name for Advanced ComparisonA user-defined name that will be assigned to the output dataset from port Advanced Comparison.
Description:
This module uses the EdgeR Bioconductor package [11] to calculate the statistical significance of differences between groups of numbers. Please refer to the original research publication for details about the statistical methods and algorithms used. EdgeR has been designed for the detection of genes that are differentially expressed between different conditions / treatments / groups based on gene expression estimates derived from RNA-seq experiments, but its use might reach beyond this scope and applications to ChIP-seq data or other numerical values are conceivable. GeneProf does not limit the kinds of numbers that go into the calculation in order to leave this scope open, but this does mean that you need to take a bit of care when using this module, i.e. rubbish in means rubbish out! Important: This module will NOT filter the output dataset, it only calculates the P-Values! You may subsequently use a filter module to look for differentially expressed genes by imposing a threshold on these P-Values.
Configuration:
To configure this module, you need to tell GeneProf which groups of numbers (presumably, expression values) to compare to each other. The module calculates statistical significance for differences in values pair-wise. Several such comparisons can be made by one run of this module. For each comparison, select all columns for one group (treatment, abnormal, condition B, cell type B, ..) in the select box on the left and then all columns for the second group (wild type, background, condition A, cell type A, ..) in the select box on the right. Enter a name for the comparison (e.g. Treatment vs Control) and click Add Comparison. You can remove a previously added comparison by clicking the red X icon.
You can further customize the behaviour of EdgeR by defining the way it gauges the dispersion within (and between) groups. By default, EdgeR uses the common dispersion, but you can choose to calculate a tag-wise dispersion instead. Please refer to the EdgeR manual for details about the maths behind that.
P-values calculated by this module are adjusted by applying a method for multiple testing correction (cp. this Wikipedia article about multiple testing); by default, FDR correction is used. You can pick a different method according to your preference from the list.
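To give an idea of what a multiple-testing adjustment does to a list of p-values, here is a minimal sketch of the Benjamini-Hochberg procedure. It assumes that the 'FDR correction' mentioned above corresponds to Benjamini-Hochberg (as is the case for the "fdr" method of R's p.adjust); the function name benjamini_hochberg is made up for this illustration and is not part of GeneProf or EdgeR.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Illustrative sketch only; GeneProf delegates the adjustment to R.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Applying a threshold to the adjusted values (in a subsequent filter module) then controls the expected fraction of false positives among the genes you keep.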

EdgeR (for Region Data)

Developer(s): Halbritter F (Module Developer); Robinson M, McCarthy D, Smyth G (Software Developer)
Version: 2011-03-15
Related Citation(s): [11][17]
Related Web Page(s): edgeR @ BioConductor
Input(s): Merged Feature SetA dataset containing expression values for multiple samples.
Output(s): Advanced ComparisonThe original dataset augmented by calculated fold changes and p-values between all specified sample groups.
Parameter(s): Sample GroupingsCompare these groups of samples for differential expression.
Common or Tag-Wise Dispersion?EdgeR can either use a common dispersion estimate or a tag-wise moderated dispersion parameter for its negative binomial model.
Bayes Dispersion Estimate?Estimate the dispersion parameter empirically using Bayes rule (only works for a large number of replicates!).
A Priori WeightThe higher this number, the more the tag-wise dispersion parameter will be squeezed towards the common dispersion. You should choose a rather high number (say 100) if only a few samples are available (fewer than 6) and progressively smaller numbers as the number of samples grows (around 10 for 8 samples, etc.).
Multiple-Testing AdjustmentThe method that should be used for multiple-testing adjustment of p-values.
Name for Advanced ComparisonA user-defined name that will be assigned to the output dataset from port Advanced Comparison.
Description:
This module uses the EdgeR Bioconductor package [11] to calculate the statistical significance of differences between groups of numbers. Please refer to the original research publication for details about the statistical methods and algorithms used. EdgeR has been designed for the detection of differentially expressed genes between different conditions / treatments / groups based on gene expression estimates derived from RNA-seq experiments, but its use might reach beyond this scope, and applications to ChIP-seq data or other numerical values are conceivable. GeneProf does not limit the kinds of numbers that go into the calculation in order to leave this scope open, but this does mean that you need to take a bit of care when using this module, i.e. rubbish in means rubbish out! Important: This module will NOT filter the output dataset, it only calculates the P-Values! You may subsequently use a filter module to look for differentially expressed genes by imposing a threshold on these P-Values.
Configuration:
To configure this module, you need to tell GeneProf which groups of numbers (presumably, expression values) to compare to each other. The module calculates statistical significance for differences in values pair-wise. Several such comparisons can be made by one run of this module. For each comparison, select all columns for one group (treatment, abnormal, condition B, cell type B, ..) in the select box on the left and then all columns for the second group (wild type, background, condition A, cell type A, ..) in the select box on the right. Enter a name for the comparison (e.g. Treatment vs Control) and click Add Comparison. You can remove a previously added comparison by clicking the red X icon.
You can further customize the behaviour of EdgeR by defining the way it gauges the dispersion within (and between) groups. By default, EdgeR uses the common dispersion, but you can choose to calculate a tag-wise dispersion instead. Please refer to the EdgeR manual for details about the maths behind that.
P-values calculated by this module are adjusted by applying a method for multiple testing correction (cp. this Wikipedia article about multiple testing); by default, FDR correction is used. You can pick a different method according to your preference from the list.

GOSeq Enrichment Analysis

Developer(s): Young M
Version: 2011-08-17
Related Citation(s): [14]
Related Web Page(s): goseq Bioconductor Page
Input(s): RNA-seq DataA set of features containing one or more columns indicating whether a certain gene is differentially expressed or not.
Output(s): Functional EnrichmentGene ontology enrichments calculated by GOseq.
Parameter(s): Flag ColumnsA selection of boolean-value columns that indicate whether the feature belongs to a group that should be tested for enrichments (e.g. differentially expressed genes or genes bound by a transcription factor).
Length-Bias NormalizationGOseq proposes to adjust RNA-seq enrichments for the length bias in sequencing data. You can use either the total gene length, the transcribed length, or no correction.
MethodMethod used to calculate enrichment scores.
RepetitionsNumber of repetitions for random sampling. This parameter is ignored if the method has not been set to Sampling.
FDR <Upper threshold on the FDR-corrected p-values to be accepted as significant.
Name for Functional EnrichmentA user-defined name that will be assigned to the output dataset from port Functional Enrichment.
Description:
TODO
Configuration:
TODO

Quantile Normalization

Developer(s): Halbritter F (Module Developer); Bolstad B, Irizarry R, Astrand M, Speed T (Software Developer)
Version: 2011-07-28
Related Citation(s): [2]
Input(s): Feature DataA dataset containing numeric values to be normalized.
Output(s): Quantile NormalizedThe original dataset augmented by quantile-normalized values.
Parameter(s): ColumnsA selection of columns to be normalized.
Name for Quantile NormalizedA user-defined name that will be assigned to the output dataset from port Quantile Normalized.
Description:
This module applies a quantile normalization to an arbitrary selection of numeric columns. You need to take care that applying this kind of normalization actually makes sense! This module makes use of the preprocessCore package (Bolstad et al., Bioinformatics (2003)).
Configuration:
To configure this module, you only need to tell GeneProf which columns you would like to be normalized. The select box in the config dialog will list all numeric columns in the dataset. Select any you would like (you can select multiple columns by keeping the CTRL-key pressed while clicking with the mouse!).
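If you are unsure what quantile normalization does to the selected columns, the following minimal sketch may help. It is a plain re-implementation of the general idea (sort each column, average the values rank by rank, then put the averages back in each column's original order) and not the preprocessCore code GeneProf actually runs. The function name quantile_normalize is hypothetical, and ties are simply broken by row position here, which is a simplification.

```python
def quantile_normalize(columns):
    """Quantile-normalize a dict of equally long numeric columns.

    columns: {column name: list of numbers}. Illustrative sketch only.
    """
    names = list(columns)
    n = len(columns[names[0]])
    # Sort every column and average the values found at each rank.
    sorted_cols = [sorted(columns[name]) for name in names]
    rank_means = [sum(col[r] for col in sorted_cols) / len(names) for r in range(n)]
    normalized = {}
    for name in names:
        values = columns[name]
        # Row indices ordered by value; ties are broken by row position.
        order = sorted(range(n), key=lambda i: values[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized[name] = out
    return normalized


if __name__ == "__main__":
    print(quantile_normalize({"a": [5, 2, 3], "b": [4, 1, 6]}))
```

After normalization, all selected columns share exactly the same distribution of values, which is why you should only select columns for which that assumption is sensible.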

Quantitate Coverage in Regions

Developer(s): Halbritter F
Version: 2012-06-26
Input(s): AlignmentUp to 80 genomic region datasets corresponding to aligned short read sequences.
Reference RegionsA set of genomic regions, e.g. novel genes, promoters, enhancers or ChIP-seq peaks.
Output(s): Coverage CountsThe genomic regions provided as a reference supplemented with coverage counts for all input alignments.
Parameter(s): Strand-specificAre the sequences to be interpreted as strand-specific (N.B. standard RNA-seq reads are not!)?
Scale by..Scale raw read counts by total number of aligned reads or distinct number of regions?
Quant. MethodCalculate the number of aligned reads, the maximum height or both?
Name for Coverage CountsA user-defined name that will be assigned to the output dataset from port Coverage Counts.
Description:
TODO
Configuration:
TODO

Quantitate Gene Expression

Developer(s): Halbritter F
Version: 2012-04-17
Related Citation(s): [17]
Input(s): AlignmentUp to 80 genomic region datasets corresponding to alignments of transcript sequences.
Output(s): Expression CountsExpression estimates (RPM and RPKM) for each gene in this organism's gene reference set.
Parameter(s): Strand-specificAre the sequences to be interpreted as strand-specific (N.B. standard RNA-seq reads are not!)?
Exonic Hits OnlyCount only sequences assigned to the exons of genes or also intronic hits?
Window SizeAlso count alignments in the immediate neighbourhood of the reference features? (window size 0 means no)
Window Size USAlso count alignments immediately upstream of the TSS of a feature? (window size 0 means no)
Window Size DSAlso count alignments immediately downstream of the TSS of a feature? (window size 0 means no)
Name for Expression CountsA user-defined name that will be assigned to the output dataset from port Expression Counts.
Description:
Gene Expression Counts. GeneProf calculates read counts per gene as gene expression estimates.
This module calculates gene expression read counts (i.e. estimates of gene expression levels) from a genomic region dataset, which is assumed to report alignments of short read sequences from an RNA-seq (or Tag-seq) experiment. Read counts are calculated by adding up the number of read alignments within the transcribed region (exons) of all known gene models. Read counts are subsequently re-scaled to account for differences in library size by transforming them to RPM (reads per million, i.e. divide by total library size and multiply by 1,000,000; GeneProf uses the total number of reads aligned to gene regions as the total library size). GeneProf also calculates a read count normalised as RPKM (reads per kilobase per million; additionally scaled to account for transcript length bias, cp. [3]).
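The two scalings described above can be written down in a few lines. This is a minimal sketch under the assumptions stated in the text (library size = total number of reads aligned to gene regions); the function name rpm_and_rpkm and the input layout are made up for this example and do not reflect GeneProf's internal code.

```python
def rpm_and_rpkm(read_counts, transcript_lengths_bp):
    """Return {gene: (RPM, RPKM)} from raw per-gene read counts.

    read_counts: {gene: reads aligned to the gene's exons}
    transcript_lengths_bp: {gene: transcribed length in base pairs}
    """
    # Library size as described above: all reads aligned to gene regions.
    library_size = sum(read_counts.values())
    result = {}
    for gene, count in read_counts.items():
        rpm = count * 1_000_000 / library_size
        # RPKM additionally divides by the transcript length in kilobases.
        rpkm = rpm * 1000 / transcript_lengths_bp[gene]
        result[gene] = (rpm, rpkm)
    return result


if __name__ == "__main__":
    counts = {"geneA": 300, "geneB": 700}
    lengths = {"geneA": 2000, "geneB": 500}
    print(rpm_and_rpkm(counts, lengths))
```

Note how the shorter gene ends up with a much higher RPKM than its RPM alone would suggest; this is the transcript length correction at work.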
The module deals with ambiguously aligned reads by assigning them proportionally to their most likely region of origin (cp. this figure), that is: We assume that an ambiguously aligned read is more likely to come from somewhere in the genome where many other reads aligned to (uniquely), than from a region where no other reads were found. A fraction of the read (something between 0.0 and 1.0) will be assigned to each feature it aligned to, but a higher weight will be given to features with many other reported reads. This is, in idea, similar to [3].
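The proportional assignment of ambiguous reads can be sketched as follows. This is only one plausible reading of the description above: each ambiguous read is split between its candidate features in proportion to the number of uniquely aligned reads each candidate already has. The function name assign_ambiguous_reads is hypothetical, and the even split used when no candidate has any unique reads is an assumption of this sketch, not documented GeneProf behaviour.

```python
def assign_ambiguous_reads(unique_counts, ambiguous_reads):
    """Distribute ambiguous reads proportionally to unique read support.

    unique_counts: {feature: number of uniquely aligned reads}
    ambiguous_reads: list of candidate feature lists, one list per read
    """
    totals = {feature: float(count) for feature, count in unique_counts.items()}
    for candidates in ambiguous_reads:
        support = [unique_counts.get(feature, 0) for feature in candidates]
        total_support = sum(support)
        for feature, s in zip(candidates, support):
            if total_support > 0:
                fraction = s / total_support
            else:
                # Assumption: split evenly if no candidate has unique reads.
                fraction = 1.0 / len(candidates)
            totals[feature] = totals.get(feature, 0.0) + fraction
    return totals


if __name__ == "__main__":
    unique = {"A": 30, "B": 10}
    print(assign_ambiguous_reads(unique, [["A", "B"], ["B", "C"]]))
```

In the example, the read shared by A and B contributes 0.75 to A and 0.25 to B, because A has three times as many uniquely aligned reads.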
Configuration:
Configuring this module is pretty straight-forward: After connecting all input datasets (i.e. alignments of different RNA-seq datasets), you can, in most cases, just run the module with default parameters. Usually, GeneProf will count only reads that fall within the exons of known genes. You can decide to also consider intronic reads (by un-ticking the checkbox Exonic Hits only) or even reads close to the gene's boundaries (upstream and / or downstream -- just set the Window Size to the maximum distance you'd like to include in the counts (in base pairs)).
Also, GeneProf will -- by default -- ignore strand information when calculating counts, because standard RNA-seq experiments are not strand-sensitive. However, there are now more specialised protocols that do maintain strand information, so if you've carried out such an advanced profiling experiment, please do tick the checkbox (Strand-specific) to create more accurate expression counts (in this case, you'll additionally get anti-sense read counts per feature!).

Quantitate Promoter Activity

Developer(s): Halbritter F
Version: 2013-07-16
Input(s): AlignmentUp to 50 genomic region datasets corresponding to alignments of short read sequences.
Output(s): Promoter ActivityExpression estimates (RPM) for each gene in this organism's gene reference set.
Parameter(s): Strand-specificAre the sequences to be interpreted as strand-specific?
Any TSS?Should we consider all alternative promoter regions? The reported count will be the maximum for any considered promoter per gene.
Window Size USThe window upstream of a TSS (in bp) to consider the promoter region.
Window Size DSThe window downstream of a TSS (in bp) to consider the promoter region.
Name for Promoter ActivityA user-defined name that will be assigned to the output dataset from port Promoter Activity.
Description:
This module quantitates the amount of aligned reads from a genomic region dataset that fall within the promoter region of a gene. The module is functionally very similar to module 'Quantitate Gene Expression' in that it looks at the number of alignments within a region and adds those up.
The module deals with ambiguously aligned reads by assigning them proportionally to their most likely region of origin, that is: We assume that an ambiguously aligned read is more likely to come from somewhere in the genome where many other reads aligned to (uniquely), than from a region where no other reads were found. A fraction of the read (something between 0.0 and 1.0) will be assigned to each feature it aligned to, but a higher weight will be given to promoters with many other reported reads. This is, in idea, similar to [3].
Configuration:
Configuring this module is pretty straight-forward: After connecting all input datasets (e.g. alignments of different Pol2-seq datasets), you can, in most cases, just run the module with default parameters.
If you wish, you may alter the (arbitrary) decision of what's considered a 'promoter'. We define a promoter as a certain, defined region surrounding the transcription start site (TSS) of a feature. You may change the size of the promoter region (upstream and / or downstream) -- just set the Window Size to the maximum distance you'd like to include in the counts (in base pairs). Many genes might have alternative TSSs, so you can either consider each of them or just the 'outer-most' TSS. The reported count will be the maximum for any considered promoter per gene, i.e. the activity of the most active promoter.
Also, GeneProf will -- by default -- ignore strand information when calculating counts, because standard NGS experiments are not strand-sensitive. However, there are now more specialised protocols that do maintain strand information, so if you've carried out such an advanced profiling experiment, please do tick the checkbox (Strand-specific) to create more accurate read counts.
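To illustrate how the window sizes and the Any TSS? option interact, here is a rough sketch. It makes several simplifying assumptions that are not taken from GeneProf itself: reads are reduced to a single position, counts are raw (not scaled to RPM), strand information of the reads is ignored, and the 'outer-most' TSS is taken to be the most upstream one. All function names are hypothetical.

```python
def promoter_window(tss, strand, upstream_bp, downstream_bp):
    """Return (start, end) of the promoter region around a TSS."""
    if strand == "+":
        return (tss - upstream_bp, tss + downstream_bp)
    # On the reverse strand, 'upstream' means higher coordinates.
    return (tss - downstream_bp, tss + upstream_bp)


def promoter_activity(tss_list, strand, read_positions,
                      upstream_bp, downstream_bp, any_tss=True):
    """Count reads per promoter and report the most active promoter."""
    if not any_tss:
        # Assumption: the 'outer-most' TSS is the most upstream one.
        tss_list = [min(tss_list) if strand == "+" else max(tss_list)]
    best = 0
    for tss in tss_list:
        start, end = promoter_window(tss, strand, upstream_bp, downstream_bp)
        count = sum(1 for pos in read_positions if start <= pos <= end)
        best = max(best, count)
    return best


if __name__ == "__main__":
    reads = [900, 950, 1100, 4990, 5000, 5010, 5200, 5400]
    print(promoter_activity([1000, 5000], "+", reads, 200, 500))
    print(promoter_activity([1000, 5000], "+", reads, 200, 500, any_tss=False))
```

With all alternative TSSs considered, the gene in the example is represented by its second, more active promoter; with only the outer-most TSS, the count of the first promoter is reported instead.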

Quantitate Short RNA Expression

Developer(s): Halbritter F
Version: 2011-04-17
Input(s): AlignmentUp to 80 genomic region datasets corresponding to alignments of transcript sequences.
Output(s): Short RNA ExpressionExpression estimates (RPM and RPKM) for each short RNA in this organism's gene reference set.
Parameter(s): Name for Short RNA ExpressionA user-defined name that will be assigned to the output dataset from port Short RNA Expression.
Description:
This module calculates gene expression read counts (i.e. estimates of gene expression levels) from a genomic region dataset, which is assumed to report alignments of short read sequences from an miRNA-seq (or other short RNA profiling) experiment. Read counts are calculated by adding up the number of read alignments within the transcribed region of all short RNA transcript features in the reference. That is, unlike the gene expression count module (module 'Quantitate Gene Expression'), this module only considers a subset of all features in the reference, namely all those with a Feature Type that is one of: lincRNA, miRNA, miRNA_pseudogene, misc_RNA, misc_RNA_pseudogene, ncRNA, processed_transcript, scRNA_pseudogene, snRNA, snRNA_pseudogene, snlRNA, snoRNA, snoRNA_pseudogene. Read counts are subsequently re-scaled to account for differences in library size by transforming them to RPM (reads per million, i.e. divide by total library size and multiply by 1,000,000; GeneProf uses the total number of reads aligned to gene regions as the total library size). GeneProf also calculates a read count normalised as RPKM (reads per kilobase per million; additionally scaled to account for transcript length bias, cp. [3]).
The module deals with ambiguously aligned reads by assigning them proportionally to their most likely region of origin (cp. this figure), that is: We assume that an ambiguously aligned read is more likely to come from somewhere in the genome where many other reads aligned to (uniquely), than from a region where no other reads were found. A fraction of the read (something between 0.0 and 1.0) will be assigned to each feature it aligned to, but a higher weight will be given to features with many other reported reads. This is, in idea, similar to [3].
Configuration:
This module does not take any parameters and cannot be customized any further. You may, however, change the names of the outputs produced in the Input & Outputs tab.

ChIP-seq

Modules specifically designed for the analysis of ChIP-seq data.

Assign TFBS to Genes

Developer(s): Halbritter F
Version: 2011-05-19
Related Citation(s): [17]
Input(s): ChIP PeaksA genomic region dataset corresponding to ChIP-seq peaks identified with a peak-finding module (e.g. MACS or CisGenome).
Output(s): Regulated FeaturesA set of features with a boolean value for each gene that indicates whether the feature has a nearby ChIP-seq peak or not.
Annotated PeaksAll input peaks with additional annotations to mark all genes they have been assigned to.
Parameter(s): Window Size UpstreamA binding site has to be within this distance upstream of the TSS of a feature to be assigned to that feature.
Window Size DownstreamA binding site has to be within this distance downstream of the TSS of a feature to be assigned to that feature.
Assign closest only?It will often happen that a binding peak is close to more than one feature. If several features are found within the specified window size, shall the peak be assigned to all of them or just to the closest one?
TFBS ColumnsA selection of boolean-value columns that indicate whether the region is a binding site for a certain transcription factor. If no columns are selected, all boolean columns will be used.
Name for Regulated FeaturesA user-defined name that will be assigned to the output dataset from port Regulated Features.
Name for Annotated PeaksA user-defined name that will be assigned to the output dataset from port Annotated Peaks.
Description:
This module assigns ChIP-seq peaks (or, in fact, any genomic regions) to genes within a user-defined radius of their transcription start site (TSS). This is a binary assignment (i.e. gene is regulated: yes / no).
Configuration:
Associating Binding Sites with Genes. Binding peaks can be assigned to genes within a defined neighbourhood of their transcription start site (TSS).
First, customize the desired behaviour of the module by defining the upstream and downstream window size within which regions (binding sites) are to be assigned to a gene. The window extends a certain range upstream and downstream of the transcription start site (TSS) of each gene. If a binding site is within this neighbourhood of a gene, both will be associated to each other (cp. this figure). In some cases, a binding site might be located in the proximity of two different genes. GeneProf, by default, will link those binding sites to all genes, but you may choose to assign them only to the closest gene instead. Just tick the checkbox Assign closest only? if you wish to do so.
GeneProf's genomic region datasets may contain binding sites for several factors at once. The peak detection modules (e.g. module 'Find Peaks with MACS') will usually add a boolean annotation column to their output indicating for each factor whether a region had a peak for this factor. It's important to tell the module at hand which columns these are, so we can say in the output things like 'Gene G has a putative binding site for Factor F'. In the select box, simply select all the annotations that specify binding sites (they are usually called 'Peak: MYFACTORA' or something like that).
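The window-based assignment described in this section can be sketched roughly as follows. The sketch assumes that each peak is represented by a single position (e.g. its summit or midpoint) and that 'upstream' and 'downstream' are interpreted relative to the strand of the gene; both are assumptions of this example rather than documented details of the module. The function name assign_peaks_to_genes and the tuple layouts are hypothetical.

```python
def assign_peaks_to_genes(peaks, genes, upstream_bp, downstream_bp,
                          closest_only=False):
    """Assign each peak to the genes whose TSS window contains it.

    peaks: list of (peak_id, chromosome, position)
    genes: list of (gene_id, chromosome, tss, strand)
    Returns {peak_id: [gene_id, ...]}.
    """
    assignments = {}
    for peak_id, peak_chrom, pos in peaks:
        hits = []
        for gene_id, gene_chrom, tss, strand in genes:
            if gene_chrom != peak_chrom:
                continue
            # Signed offset: negative = upstream of the TSS, positive = downstream.
            offset = (pos - tss) if strand == "+" else (tss - pos)
            if -upstream_bp <= offset <= downstream_bp:
                hits.append((abs(offset), gene_id))
        if closest_only and hits:
            hits = [min(hits)]
        assignments[peak_id] = [gene_id for _, gene_id in hits]
    return assignments


if __name__ == "__main__":
    genes = [("G1", "chr1", 1000, "+"), ("G2", "chr1", 3000, "-")]
    peaks = [("P1", "chr1", 1800), ("P2", "chr1", 9000)]
    print(assign_peaks_to_genes(peaks, genes, 1500, 1500))
    print(assign_peaks_to_genes(peaks, genes, 1500, 1500, closest_only=True))
```

A gene would then be flagged as 'regulated' for a factor if at least one peak of that factor has been assigned to it.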

Calculate TFAS

Developer(s): Halbritter F (Module Developer); Ouyang Z, Zhou Q, Wong W (Software Developer)
Version: 2012-03-14
Related Citation(s): [12][17]
Input(s): ChIP PeaksA genomic region dataset corresponding to ChIP-seq peaks identified with a peak-finding module (e.g. MACS or CisGenome).
Output(s): Regulated FeaturesThe TFAS (transcription factor association strength) for each gene in the reference will be calculated based on the ChIP peaks provided.
Parameter(s): D_0The window size determines in which distance around the binding peak to look for regulated genes.
Max. DistanceTo reduce computation time, only peaks within a certain distance around the TSS of a gene will be considered. If this distance is chosen sufficiently large (as it is by default), the contribution to the TFAS of peaks beyond this distance threshold will be negligible anyway.
Intensity Value(s)Use these columns of the input dataset as intensity values, i.e. as the binding strength at each locus. By selecting more than one column you can calculate the TFAS for several transcription factors at once!
Alternative TSSHow to deal with alternative transcription start sites? Ignore alternatives and only use the outer-most TSS or calculate the TFAS with respect to each start site and take the mean or max of all calculations.
Name for Regulated FeaturesA user-defined name that will be assigned to the output dataset from port Regulated Features.
Description:
The module calculates a numeric score specifying the strength of association between a DNA-binding protein and each gene in the reference set, in this case, a measure called TFAS (transcription factor association strength), calculated using the formula proposed in [12]. The module can calculate TFAS scores for several factors at once.
Configuration:
To configure the module, you need to define the annotation column (or columns, if there are multiple factors to be considered) in the input dataset that quantitate binding intensity by region. The TFAS is usually calculated using previously identified binding peaks (from a peak calling algorithm, cp. e.g. module 'Find Peaks with MACS') and using the binding intensity (read count or fold enrichment over background) in each of these regions. You can select all columns for which you would like to calculate a TFAS; these might come from different transcription factors or might, alternatively, be different measurements of peak intensity. The TFAS calculation gives higher weight to regions that are close to a gene's TSS and progressively less weight to more distant regions. You can customize the rate at which the contribution of distant regions decreases by altering the d_0 parameter: Lower values will lead to less weight being given to distant sites and higher values in turn make distant sites contribute more to the final score.
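The distance weighting described above can be sketched as a sum of peak intensities, each multiplied by an exponentially decaying weight exp(-d / d_0), where d is the distance between the peak and the TSS. To our understanding this is the form of the formula in [12], but please check the publication for the authoritative definition. The function name tfas and the default values used here are placeholders for illustration, not GeneProf's actual defaults.

```python
import math


def tfas(tss, peaks, d0=5000.0, max_distance=1_000_000):
    """Sketch of a TFAS-like score for one gene and one factor.

    tss: position of the gene's transcription start site
    peaks: list of (position, intensity) on the same chromosome
    d0, max_distance: placeholder values, not GeneProf defaults
    """
    score = 0.0
    for position, intensity in peaks:
        distance = abs(position - tss)
        # Peaks beyond max_distance are skipped to save computation time.
        if distance <= max_distance:
            score += intensity * math.exp(-distance / d0)
    return score


if __name__ == "__main__":
    peaks = [(1000, 2.0), (6000, 1.0)]
    print(tfas(1000, peaks, d0=5000.0))
    print(tfas(1000, peaks, d0=500.0))
```

Running the example with a smaller d_0 shows the effect described above: the distant peak contributes almost nothing, and the score is dominated by the peak at the TSS.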

Center Peaks

Developer(s): Halbritter F
Version: 2011-11-15
Input(s): Input PeaksA dataset containing ChIP binding peaks (or other regions-of-interest).
AlignmentsAlignment profiles for the same data characterized by the peaks (at most 50).
Output(s): Centered PeaksPeaks refined to a well-defined region around the peak.
Parameter(s): Peak SizeSet the size of refined (centered) peaks to X base-pairs around the highest point in the peak region.
Fragment LengthThe average (or maybe better median) length of the sequenced DNA fragments.
Use strandedness?If active, determine the center of the peak region by looking at the highest point on the forward and the highest point on the reverse strand and taking the middle, otherwise ignore strand-information and look for the highest point throughout the peak region.
Peak-Alignment AssociationsAssociation of peak annotations to the respective alignment profile.
Name for Centered PeaksA user-defined name that will be assigned to the output dataset from port Centered Peaks.
Description:
Refine binding peaks by restricting them to a fixed width window around their highest point.
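As a rough illustration of the idea (for the case without strand information), the sketch below finds the highest point of a per-base coverage profile within a peak and returns a fixed-size window around it. The function name center_peak, the coverage representation and the exact window convention (half the peak size on either side of the summit) are assumptions of this example, not a description of the module's implementation.

```python
def center_peak(peak_start, coverage, peak_size):
    """Return (start, end) of a fixed-width window around the peak summit.

    peak_start: genomic coordinate of the first base of the peak
    coverage: per-base read depth within the peak, starting at peak_start
    peak_size: desired width of the refined peak in base pairs
    """
    # Offset of the highest point; the first one wins if there are ties.
    summit_offset = max(range(len(coverage)), key=lambda i: coverage[i])
    summit = peak_start + summit_offset
    half = peak_size // 2
    return (summit - half, summit + half)


if __name__ == "__main__":
    print(center_peak(100, [1, 2, 5, 9, 4, 1], 4))
```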
Configuration:
Coming soon.

Find Peaks with CCAT

Developer(s): Halbritter F (Module Developer); Xu H, Handoko L, Wei X, Ye C, Sheng J, Wei C, Lin F, Sung W (Software Developer)
Version: 2012-11-14 (CCAT 2.0)
Related Citation(s): [15]
Related Web Page(s): CCAT Software Homepage @ GIS
Input(s): Aligned ChIP ReadsUp to 30 genomic region datasets corresponding to aligned sequences from several ChIP-seq experiments.
ChIP ControlA genomic region dataset corresponding to the aligned sequences from a control sample in a ChIP-seq experiment. These define the background signal.
Output(s): ChIP PeaksClustered regions of enriched DNA-protein binding.
Parameter(s): Fragment SizeThe (average) length of fragments sequenced in base-pairs / number of nucleotides.
Window SizeThe size of the sliding window.
Moving StepThe step size between sliding windows. Smaller values give higher precision but require (much) more computation.
Strand-sensitive?If set to be TRUE, the peaks will be determined to be the transition from sense strand to anti-sense strand; if set to be FALSE, the peaks will be determined to be the local maximum of read-enrichment profile.
Min. CountMinimum read count in peak region. This is used to filter regions prior to peak finding.
Output NumberThe number of peaks reported (N.B. these are not necessarily significantly enriched!).
Random SeedCCAT uses some random numbers in its calculations -- fixing the random seed will result in these numbers always being the same, and all results will consequently be fully reproducible.
Min. DifferenceThe minimum score of normalized difference.
Bootstrap PassThe number of passes for the bootstrap procedure.
Name for ChIP PeaksA user-defined name that will be assigned to the output dataset from port ChIP Peaks.
Description:
This module attempts to find peaks in regions of aligned ChIP sequences by using the CCAT algorithm [15]. For details of the algorithm, please refer to the original authors' research publication. This module is merely a wrapper that integrates the original software into the GeneProf pipelines.
Configuration:
Please refer to the CCAT manual for details about the parameters.

Find Peaks with ChIPSeqPeakFinder

Developer(s): Halbritter F (Module Developer); Chen X, Xu H, Yuan P, Fang F, Huss M, Vega V, Wong E, Orlov Y, Zhang W, Jiang J, Loh Y, Yeo H, Yeo Z, Narang V, Govindarajan K, Leong B, Shahab A, Ruan Y, Bourque G, Sung W, Clarke N, Wei C, Ng H (Software Developer)
Version: 2012-11-14
Related Citation(s): [5][17]
Related Web Page(s): ChIPSeqPeakFinder Homepage @ GIS
Input(s): Aligned ChIP ReadsUp to 30 genomic region datasets corresponding to aligned sequences from several ChIP-seq experiments.
ChIP ControlA genomic region dataset corresponding to the aligned sequences from a control sample in a ChIP-seq experiment. These define the background signal.
Output(s): ChIP PeaksClustered regions of enriched DNA-protein binding.
Parameter(s): Fragment SizeThe (average) length of fragments sequenced in base-pairs / number of nucleotides.
Min. Peak DistanceThe minimum distance between peaks.
Min. Peak ThresholdThe minimum number of aligned reads in a peak.
FDR <False discovery rate applied to peaks.
FC >Set a threshold on the fold change enrichment over background for each accepted peak.
Name for ChIP PeaksA user-defined name that will be assigned to the output dataset from port ChIP Peaks.
Description:
This module wraps the ChIPSeqPeakFinder algorithm [5], which will attempt to discover regions of enriched DNA-protein binding from aligned ChIP-seq read data and a background signal. For details of the algorithm, please refer to the original research publication (supplementary methods). You can connect more than one input dataset to this module. GeneProf will then repeatedly run the peak finder for each individual input dataset in comparison to the same background signal and join the outputs into one dataset.
Configuration:
Please refer to the ChIPSeqPeakFinder manual for details about the parameters. In short, the Fragment Size is the average length of the sequenced DNA fragments, the Min. Peak Distance is the smallest gap allowed between two neighbouring peaks (if two peaks are any closer, only the left-most will be reported), Min. Peak Threshold is the smallest number of short reads piling up in one place to be considered a peak, and FDR < defines a maximum threshold for the experiment FDR (false discovery rate) calculated by the algorithm.

Find Peaks with MACS

Developer(s): Halbritter F
Version: 2013-02-19 (MACS v1.4)
Related Citation(s): [7]
Related Web Page(s): MACS Homepage
Input(s): Aligned ChIP ReadsUp to 30 genomic region datasets corresponding to aligned sequences from several ChIP-seq experiments.
ChIP ControlA genomic region dataset corresponding to the aligned sequences from a control sample in a ChIP-seq experiment. These define the background signal.
Output(s): ChIP PeaksClustered regions of enriched DNA-protein binding.
False PeaksChIP peaks that failed the FDR (false discovery rate) threshold.
Parameter(s): BandwidthBand width. This value is used while building the shifting model. If --nomodel is set, twice this value will be used as the scan window width. DEFAULT: 300
P-Value <Pvalue cutoff for peak detection. DEFAULT: 1e-5.
Fold Enrichment to BG >Lower Bound: Select the regions within MFOLD range of high-confidence enrichment ratio against background to build model. The regions must be lower than upper limit, and higher than the lower limit. DEFAULT:10,30
Fold Enrichment to BG <Upper Bound: Select the regions within MFOLD range of high-confidence enrichment ratio against background to build model. The regions must be lower than upper limit, and higher than the lower limit. DEFAULT:10,30
No Lambda?If True, MACS will use fixed background lambda as local lambda for every peak region. Normally, MACS calculates a dynamic local lambda to reflect the local bias due to potential chromatin structure.
Small Local Lambda RegionThe small nearby region in basepairs to calculate dynamic lambda. This is used to capture the bias at the peak summit region. Invalid if there is no control data.
Large Local Lambda RegionThe large nearby region in basepairs to calculate dynamic lambda. This is used to capture the surrounding bias. DEFAULT: 1000 when control is available, and 5000 if there's no control.
No Model?Whether or not to build the shifting model. If True, MACS will not build a model; by default, this means a shift size of 100. Set Shift Size to change it.
Shift SizeThe arbitrary shift size in bp. When nomodel is true, MACS will regard this value as 'modeled' d.
Effective Genome SizeEffective genome size. It can be 1.0e+9 or 1000000000, or shortcuts:'hs' for human (2.7e9), 'mm' for mouse (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for fruitfly (1.2e8)
PET Insert SizeBest distance between Pair-End Tags. Default is 200bps. When MACS reads mapped positions for 5' tag and 3' tag, it will decide the best pairing for them using this best distance parameter
FDR <False discovery rate (FDR) threshold. Peaks not passing this threshold will be considered false. If there is no background sample, this threshold is going to be ignored.
Distinct InputsOnly use distinct alignments as input (might avoid amplification bias).
Name for False PeaksA user-defined name that will be assigned to the output dataset from port False Peaks.
Name for ChIP PeaksA user-defined name that will be assigned to the output dataset from port ChIP Peaks.
Description:
This module wraps the MACS program [7], which will attempt to discover regions of enriched DNA-protein binding from aligned ChIP-seq read data and a background signal (optional). For details of the algorithm, please refer to the original research publication. You can connect more than one input dataset to this module. GeneProf will then repeatedly run the peak finder for each individual input dataset in comparison to the same background signal and join the outputs into one dataset.
Configuration:
Please refer to the MACS manual for details about the parameters. MACS calculates an empirical FDR (false discovery rate) if a background signal is available, but it does not usually apply a filter to the detected peaks. GeneProf adds functionality to directly filter out presumably wrong peaks using a cutoff on the FDR values. Only peaks passing the cutoff will be reported to the main output, while the others will be retained in a second output dataset. N.B. if no background signal is supplied, MACS will not calculate an FDR value and the filter will be ignored. Also, if the input and background datasets are vastly different in size, the FDR scores might be inaccurate (according to MACS' own documentation).
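The FDR filter described above can be pictured with a minimal sketch. This is not GeneProf's actual implementation; the function name split_peaks_by_fdr and the dictionary layout of a peak are assumptions made for illustration only. The cutoff must be given in the same unit as the FDR values of the peaks.

```python
def split_peaks_by_fdr(peaks, fdr_cutoff):
    """Split peaks into (passing, failing) lists using an FDR cutoff.

    Each peak is a dict with an 'fdr' entry (hypothetical layout).
    'fdr' is None if no FDR could be calculated (no background signal).
    """
    passing, failing = [], []
    for peak in peaks:
        fdr = peak.get("fdr")
        # Without an FDR value the filter is ignored and the peak is kept.
        if fdr is None or fdr < fdr_cutoff:
            passing.append(peak)
        else:
            failing.append(peak)
    return passing, failing


peaks = [
    {"name": "p1", "fdr": 0.5},
    {"name": "p2", "fdr": 12.0},
    {"name": "p3", "fdr": None},
]
chip_peaks, false_peaks = split_peaks_by_fdr(peaks, fdr_cutoff=5.0)
```

In this sketch the first list corresponds to the ChIP Peaks output and the second to the False Peaks output.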

Find Peaks with SISSRs v1.4

Developer(s): Halbritter F (Module Developer); Jothi R (Software Developer)
Version: 2011-04-15 (SISSRs v1.4 - modified)
Related Citation(s): [6][17]
Related Web Page(s): SISSRs Homepage
Input(s): Aligned ChIP ReadsUp to 50 genomic region datasets corresponding to aligned sequences from several ChIP-seq experiments.
ChIP ControlA genomic region dataset corresponding to the aligned sequences from a control sample in a ChIP-seq experiment. These define the background signal.
Output(s): ChIP PeaksClustered regions of enriched DNA-protein binding.
Parameter(s): Distinct InputsOnly use distinct alignments as input (might avoid amplification bias).
Fragment SizeThe (average) length of fragments sequenced in base-pairs / number of nucleotides. If set to 0, the fragment size will be estimated automatically.
FDRTODO
e-valueTODO
Window SizeTODO
p-value <=TODO
Mappable FractionTODO
Min. Dir. ReadsTODO
Max. Fragment LengthTODO
Library SizesUse these, manually defined library sizes for normalization / scaling.
Name for ChIP PeaksA user-defined name that will be assigned to the output dataset from port ChIP Peaks.
Description:
This module wraps a slightly modified version of the SISSRs algorithm [6], which will attempt to discover regions of enriched DNA-protein binding from aligned ChIP-seq read data and a background signal. For details of the algorithm, please refer to the original research publication. You can connect more than one input dataset to this module. GeneProf will then repeatedly run the peak finder for each individual input dataset in comparison to the same background signal and join the outputs into one dataset.
Configuration:
Please refer to the SISSRs manual for details about the parameters. Note that GeneProf adds one small additional mode of configuring the algorithm: Usually, SISSRs (like many other algorithms) will account for differences in library size between treatment and control by rescaling according to the number of distinct reads in the input. If you'd like to use any other library size for scaling, you can enter an arbitrary number in the textboxes of the config dialog.
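The idea of rescaling by library size, with an optional manual override, might be sketched as follows. This only illustrates the general principle; the exact formula and direction of scaling used inside SISSRs is not described here, and the function name scaling_factor and the read representation are assumptions.

```python
def scaling_factor(chip_reads, control_reads, chip_size=None, control_size=None):
    """Factor by which control counts would be multiplied to match the ChIP library.

    Reads are hashable alignment records, e.g. (chromosome, start, strand).
    If no manual library size is given, the number of distinct reads is used.
    """
    if chip_size is None:
        chip_size = len(set(chip_reads))
    if control_size is None:
        control_size = len(set(control_reads))
    if control_size <= 0:
        raise ValueError("control library size must be positive")
    return chip_size / control_size


chip = [("chr1", 10, "+"), ("chr1", 10, "+"), ("chr1", 50, "-")]
control = [("chr1", 5, "+")]
default_factor = scaling_factor(chip, control)                               # distinct reads
manual_factor = scaling_factor(chip, control, chip_size=4, control_size=8)   # manual sizes
```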

MEME Motif Discovery

Developer(s): Halbritter F (Module Developer); Bailey T, Noble W (Software Developer)
Version: 2012-04-18 (MEME 4.8.1_2)
Related Citation(s): [1]
Related Web Page(s): MEME Homepage
Input(s): SequencesUp to 30 sequence datasets (e.g. transcription factor binding sites). Each set will be examined independently for motifs. Paired-end / mate-pair sequences are collapsed.
Output(s): MotifsMotifs discovered by MEME.
Parameter(s): Motif DistributionDistribution of motifs (only one or more than one per sequence?)
Min. WidthMinimum motif width.
Max. WidthMaximum motif width.
Max. MotifsMaximum number of motifs to find.
Min. SitesMinimum number of sites for each motif.
e-valueStop the motif search when the E-value is greater than this threshold.
Name for MotifsA user-defined name that will be assigned to the output dataset from port Motifs.
Description:
MEME is a tool for de novo motif discovery, i.e. it looks through a set of DNA (or protein, although this is not currently supported in GeneProf) sequences and tries to find enriched sequence patterns that are shared across all or some of these sequences.
Configuration:
For documentation of MEME and its parameters, please refer to the MEME Suite Project Page. GeneProf limits the range of parameters quite drastically to avoid exorbitant runtimes. If these do not meet your requirements, please use the official MEME instance on their website.

Map Regions to Genes

Developer(s): Halbritter F
Version: 2012-08-29
Related Citation(s): [17]
Input(s): Genomic RegionsA genomic region dataset containing a selection of regions of particular interest (e.g. ChIP-binding peaks, methylated regions, transcribed hotspots, ..). The dataset may contain regions belonging to different classes where class-membership is indicated by a boolean-value column (a flag).
Output(s): Associated FeaturesA set of features with a boolean value for each gene that indicates whether the feature has a nearby region of interest assigned to it or not.
Annotated RegionsAll input regions with additional annotations to mark all features they have been assigned to.
Parameter(s): Window Size Upstream TSSA region has to be within this distance upstream of the TSS (transcription start site) of a feature to be assigned to that feature.
Window Size Downstream TSSA region has to be within this distance downstream of the TSS (transcription start site) of a feature to be assigned to that feature
Window Size Downstream TTSA region has to be within this distance downstream of the TTS (transcription termination site) of a feature to be assigned to that feature
Accept Introns?Assign regions to a feature if they are located within an intron of this feature?
Accept Exons?Assign regions to a feature if they are located within an exon of this feature?
All TSS?Some features have multiple, alternative TSS (transcription start sites). Should TSS-related criteria take all known TSS into account?
TSS-DS stop at TTS?If you have chosen to use a window downstream of the TSS, do you want to use this fixed window size even if it extends past the TTS or cut it off at the TTS? Activate this option to stop at the TTS. Has no effect if no downstream-of-TSS window is used.
Flag ColumnsA selection of boolean-value columns that indicate whether the region belongs to a certain class. For example, your input dataset might contain putative binding sites for multiple transcription factors. In this case, you ought to have a column for each factor that is TRUE if the binding site belongs to factor X or FALSE if not and the same for factor Y, Z, etc. If no columns are selected, all boolean columns will be used.
Transfer AnnotationsA selection of annotation to transfer from the region to the mapped features (in addition to those used as flags).
Only closest feature?Instead of associating all features within the allowed window to a peak, associate only the closest feature?
Name for Associated FeaturesA user-defined name that will be assigned to the output dataset from port Associated Features.
Name for Annotated RegionsA user-defined name that will be assigned to the output dataset from port Annotated Regions.
Description:
This module assigns genomic regions (e.g. ChIP binding peaks or methylation hotspots) to nearby genes. This is a binary assignment (i.e. gene is associated with region: yes / no).
Configuration:
Associating Regions with Genes. Genomic regions can be assigned to genes within a defined neighbourhood of their transcription start site (TSS), or if they overlap the exons and/or introns of a gene.
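As a rough illustration of the TSS window criterion, the following sketch tests whether a region falls into a strand-aware window around a TSS. It assumes a simple overlap test, which may differ from the rule GeneProf applies internally, and it does not cover the exon / intron criteria. The function name in_tss_window is hypothetical.

```python
def in_tss_window(region_start, region_end, tss, strand, upstream, downstream):
    """Return True if the region overlaps the window around the TSS.

    'upstream' and 'downstream' are window sizes in bp relative to the
    direction of transcription, so they swap sides on the reverse strand.
    """
    if strand == "+":
        window_start, window_end = tss - upstream, tss + downstream
    else:
        # On the reverse strand, upstream lies at higher coordinates.
        window_start, window_end = tss - downstream, tss + upstream
    return region_start <= window_end and region_end >= window_start


# A region 50-100bp before a forward-strand TSS at position 1000:
print(in_tss_window(900, 950, 1000, "+", upstream=200, downstream=50))
```

Note that the same coordinates give a different answer for a gene on the reverse strand, because the upstream window then extends towards higher coordinates.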

Post-Processing

Use these modules for the post-processing of your data, e.g. for comparing different samples, etc.

Put Aligned Reads into Bins

Developer(s): Halbritter F
Version: 2012-11-14
Related Citation(s): [17]
Input(s): AlignmentsA number of genomic region datasets (sensibly aligned short reads) which should be assigned to bins.
Output(s): BinsThe coordinates and counts of all bins that had at least one region in at least one input dataset assigned.
Parameter(s): Fragment LengthThe average (or maybe better median) length of the sequenced DNA fragments. If you set this value to -1, the actual length of each individual region will be used.
Bin SizeDivide each chromosome into bins of this size.
Name for BinsA user-defined name that will be assigned to the output dataset from port Bins.
Description:
This module estimates the genome-wide coverage with short reads by dividing each chromosome into bins of a fixed size and calculating the number of reads aligned to each bin. The module can, theoretically, be run with any sort of genomic region dataset (i.e. not just alignments), but whether the output will make sense is questionable.
Configuration:
Customize this module by defining the size of the genomic bins to be used. Smaller bins mean higher resolution, but at the cost of speed. Also, bins that are too small might make it difficult to discover similarity patterns between datasets. When GeneProf calculates the coverage, it will look at how many aligned reads overlap the boundaries of each individual bin and sum up those numbers per bin. For these overlaps, it might make sense to extend each aligned read region a little to get a better estimate of the actual coverage. For example, it's common practice for ChIP-seq alignments to extend reads to half the (or the full) average length of the fragmented DNA that was sequenced. If you wish to use such an extension, enter the desired length in the fragment length text box, otherwise set the value to -1 in order to use the actual size of the aligned reads.
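The binning procedure can be sketched in a few lines. This is an illustration under stated assumptions, not the module's actual code: reads are (chromosome, start, end, strand) tuples with inclusive coordinates, and a positive fragment length replaces the read length, anchored at the 5' end of the read. The function name bin_counts is hypothetical.

```python
from collections import Counter


def bin_counts(reads, bin_size, fragment_length=-1):
    """Count, per (chromosome, bin index), how many reads overlap each bin."""
    counts = Counter()
    for chrom, start, end, strand in reads:
        if fragment_length > 0:
            # Resize the read to the fragment length, starting at its 5' end.
            if strand == "+":
                end = start + fragment_length - 1
            else:
                start = max(0, end - fragment_length + 1)
        # A read contributes once to every bin it overlaps.
        for bin_index in range(start // bin_size, end // bin_size + 1):
            counts[(chrom, bin_index)] += 1
    return counts


reads = [("chr1", 90, 110, "+"), ("chr1", 250, 270, "-")]
actual_size = bin_counts(reads, bin_size=100)                     # fragment length -1
extended = bin_counts(reads, bin_size=100, fragment_length=100)   # reads resized to 100bp
```

With the extension, both reads reach into the second bin, which illustrates how the fragment length changes the coverage estimate.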

Summary Statistics

Use these modules to create useful, comparative summary statistics of several datasets.

ChIP-seq Peak Summary

Developer(s): Halbritter F
Version: 2012-10-12
Related Citation(s): [17]
Input(s): TFBS / ChIP-seq PeaksA genomic region dataset that contains one or more boolean value columns that indicate whether a region is a binding site for a TF or not.
Output(s): TFBS StatisticsStatistics such as the location of regions with respect to genes, overlaps of binding sites, ..
Parameter(s): Peak FlagsA selection of boolean-value columns that indicate whether the region is a binding site for a certain transcription factor. If no columns are selected, all boolean columns will be used.
Name for TFBS StatisticsA user-defined name that will be assigned to the output dataset from port TFBS Statistics.
Description:
This module calculates a number of informative summary statistics for DNA-protein binding sites in a genomic regions dataset. The module will look at the position of binding sites with respect to known gene models and, if there are sites for more than one factor in the dataset at hand, compare their similarity.
The results will be summarized in a series of plots and graphs:
  • Number of Binding Sites: This simple barchart summarizes the number of binding sites for each of the different factors considered.
  • Position of Binding Sites relative to Genes: This plot shows the genomic distribution of binding sites for each factor with respect to known gene models. The binding sites are subdivided over (arbitrary) categorical bins by their location with respect to the next known gene. The categories are defined as follows: 'Narrow Promoter' = TSS (transcription start site) +/- 500bp, 'Wide Promoter' = TSS +/- 2000bp, 'exonic' = anywhere in an exon of a known gene (but not in the 'narrow' or 'wide promoter'), 'intronic' = in the intron of a known gene, '5/10/20/50kb upstream' = up to 5/10/20/50kb upstream of the TSS, 'NONE' = nowhere near a known gene (none of the aforementioned categories).
  • Position of Binding Sites relative to TSS: Distance to the TSS of the next known gene rounded to the next-closest 1kb. Maximum distance considered: 50kb up- or downstream of a TSS.
  • Co-Localisation of Binding Sites / in MTLs (distance <= 100 / 250 / 500bp): The last four plots of the summary page tell us a little about how similar the binding sites of the different transcription factors are: We count the co-localization frequency for each transcription factor (i.e. how often do the binding sites of one factor overlap with the binding sites of the other). Each of the plots is progressively more permissive in the way these overlaps are calculated: The first plot looks for peaks which are EXACTLY the same, i.e. with exactly the same start and end coordinates. The next plot looks for overlapping binding sites or all binding sites within 100bp distance from each other. The other two plots allow for a distance of at most 250 or 500bp, respectively.
Configuration:
To configure this module, select all columns in the dataset that should be considered. The module assumes that the regions in the input datasets correspond to putative DNA-protein binding sites and that it might contain binding sites for more than one factor. To indicate which factor (or which combination of factors) a region belongs to, there ought to be a boolean-value column, which is to be true if a region is a binding site for factor A and false otherwise (all of GeneProf's peak detection modules produce such output). The module allows you to select any sort of boolean-value column, but the output will make little sense if they report anything else but belonging to a binding factor, so please take care when selecting the correct columns.
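The progressively more permissive co-localisation counts described for the last four plots can be sketched as follows. This is a simplified, quadratic-time illustration, not the module's implementation; binding sites are assumed to be (chromosome, start, end) tuples and the function names are hypothetical.

```python
def gap(a, b):
    """Distance in bp between two (start, end) intervals; 0 if they overlap."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def colocalised(sites_a, sites_b, max_distance=None):
    """Count sites of factor A that have a partner site of factor B.

    max_distance=None demands exactly identical coordinates; otherwise
    overlapping sites or sites within max_distance bp count as partners.
    """
    count = 0
    for chrom_a, start_a, end_a in sites_a:
        for chrom_b, start_b, end_b in sites_b:
            if chrom_a != chrom_b:
                continue
            if max_distance is None:
                hit = (start_a, end_a) == (start_b, end_b)
            else:
                hit = gap((start_a, end_a), (start_b, end_b)) <= max_distance
            if hit:
                count += 1
                break  # one partner is enough for this site
    return count


factor_a = [("chr1", 100, 200), ("chr1", 1000, 1100), ("chr2", 100, 200)]
factor_b = [("chr1", 100, 200), ("chr1", 1300, 1400), ("chr2", 650, 700)]
for distance in (None, 100, 250, 500):
    print(distance, colocalised(factor_a, factor_b, distance))
```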

Gene Expression Summary

Developer(s): Halbritter F
Version: 2012-08-20
Related Citation(s): [17]
Input(s): Gene Expression DataA feature dataset containing one or more columns with gene expression data.
Output(s): Gene Expression StatisticsStatistics such as the distribution of gene expression intensities, similarity in gene expression profiles, ..
Parameter(s): GEX ColumnsA selection of numerical-value columns that contain some measurement of gene expression intensity (normalised or raw read counts, microarray measurements, ..). The module makes the assumption that the values of these columns are comparable.
Name for Gene Expression StatisticsA user-defined name that will be assigned to the output dataset from port Gene Expression Statistics.
Description:
This module calculates a number of informative summary statistics for one or more gene expression values (sensibly belonging to several experimental conditions or samples) in a feature dataset. The module will determine and juxtapose the most strongly expressed genes in each dataset, the overall distribution of expression values, the similarity between groups and the types of features expressed.
The results will be summarized in a series of plots and graphs:
  • Distribution of Expression Values: This is a panel of histograms showing the overall distribution of expression values in all selected columns, once on a normal (left) and once on a logarithmic scale (right). It's a general assumption, in particular for differential gene expression tests, that the expression between different conditions, cell types, etc. is largely the same and that only a relatively small percentage of genes should be different. Thus, if the distribution of values in different columns looks vastly dissimilar, the differential expression statistics might not work properly and further normalisation might be required. Read up about histograms on Wikipedia.
  • Gene Expression Heatmap: A heatmap representation of up to 1000 genes (if there are more than 1000 features in the dataset, a random sample of all features will be taken). The heatmap is hierarchically clustered on both axes, i.e. 'similar' features and samples are clustered together. Colours are normalised by row (= by feature), so red means 'expressed higher than average' and blue means 'lower than average'. Read up about heatmaps on Wikipedia.
  • Correlation of Expression Patterns: A visualisation of the Pearson correlation matrix of all selected columns. GeneProf calculates the pair-wise Pearson correlation coefficients between all selected expression values and visualises the results in a symmetric heatmap. Rows are clustered by their similarity. The matrix gives a quick impression about the similarity of individual samples. Read up about Pearson correlation on Wikipedia.
  • Principal Components: This is a visualisation of the first three principal components from the principal component analysis (PCA) of the selected columns (3 or more columns only). The figure is divided into a panel of four sub-plots: The first plot gives you an idea about how well the PCA worked for the data at hand. Importantly, the red line summarises which percentage of the variance of the data could be explained by the first 1, 2, 3, .. (given on the x-axis) principal components. The three remaining graphs plot the contribution of each dataset to the three first PCs (PC1 vs. PC2 = top-right, PC2 vs. PC3 = bottom-left and PC1 vs. PC3 = bottom-right). Read up about PCA on Wikipedia.
  • Expressed Feature Types - Number of Features: A barplot showing which types of features were found expressed in the selected columns. For this purpose, any feature with a positive (greater than 0) expression value is considered expressed. Feature categories include such terms as protein coding, pseudogene, miRNA, etc. These annotations have been added by the creator of the reference set -- for all of GeneProf's recommended reference sets, they've been gathered from Ensembl.
  • Expressed Feature Types - Total Expression: The same plot as above, however, this time taking the actual expression values into account, i.e. while the earlier plot simply counted the number of features of a given type in the dataset, this plot looks at how strongly the features in each category are expressed.
Configuration:
To configure this module, select all columns in the dataset that should be considered. The module assumes that these are some sort of expression values (read counts, RPM, RPKM, ..) on a comparable scale, i.e. it will not normalise the values. You can input any sort of numerical values, but it might make little sense to use anything but expression values.
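The pair-wise Pearson correlation matrix behind the 'Correlation of Expression Patterns' plot can be sketched in plain Python. This is a minimal illustration (no clustering, no plotting) and not GeneProf's own code; the column names and the dictionary layout are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(variance_x * variance_y)


def correlation_matrix(columns):
    """Symmetric matrix (dict of dicts) of all pair-wise correlations."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical expression columns, one value per gene:
expression = {
    "sample_1": [1.0, 2.0, 3.0, 4.0],
    "sample_2": [2.0, 4.0, 6.0, 8.0],
    "sample_3": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(expression)
```

Since the module does not normalise the values, columns on very different scales would need to be made comparable before a comparison like this is meaningful for the other plots; the Pearson coefficient itself is insensitive to a linear rescaling of a column.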

General Genomic Region Statistics

Developer(s): Halbritter F
Version: 2012-10-03
Related Citation(s): [17]
Input(s): Genomic Region DataUp to 100 genomic region datasets that you would like to compare.
Output(s): Genomic Region StatisticsStatistics such as region sizes, chromosomal distribution, etc.
Parameter(s): Name for Genomic Region StatisticsA user-defined name that will be assigned to the output dataset from port Genomic Region Statistics.
Description:
This module calculates a number of informative summary statistics about one or more genomic region datasets, which are assumed to report alignments of short reads, i.e. they are many (millions) of relatively short regions clustered by genomic coordinates and might have the identifiers of aligned sequences associated to them. The module may be used with other sorts of genomic data, but please do take care when interpreting the outputs.
The results will be summarized in a series of tables, plots and graphs:
  • Dataset Sizes: A simple bar chart quickly summarizing the size of the input datasets. Important: The size of the input dataset, here, refers to the number of distinct genomic regions in each individual dataset and NOT to the number of aligned reads! For example, if 20 short reads aligned to exactly the same genomic position (chromosome 1, 1000-1050, forward strand), this region will only count once for the purposes of this plot! You can find out the total number of aligned reads from the Alignment Ambiguity plot below.
  • Tables - Average Total Region Lengths / Interval Lengths / Number of Intervals: A per-dataset summary of three measures of region length and structure: The Total Region Length is the length of the entire interval described by a region's start and end coordinate (= end - start), while the Total Interval Length is the sum of the lengths of the individual intervals making up a region. Remember, a genomic region might be made up of several smaller intervals (cp. SubConcept 'Genomic Region Data'), e.g. if the region stems from the alignment of paired-end / mate-pair reads or if a gapped alignment program accounting for reads spanning exon-exon junctions has been used. If all reads consist of exactly one interval, the Total Region Length and the Total Interval Length should be exactly identical.
  • Distribution of Regions over Chromosomes: A barchart providing an overview of how the input data distributes across all chromosomes of the reference. We're talking about distinct genomic regions in the sense defined above and the counts are shown as percentages of all regions in a dataset.
  • Alignment Ambiguity: For read alignment datasets, this barchart illustrates which percentage of the input sequence data could be successfully aligned to the genome. If a certain degree of ambiguity had been permitted, the plot furthermore distinguishes between the amount of ambiguity observed -- that is: The x-axis gives the number of possible matches in the genome (number of regions in the input region dataset) for each individual read and the y-axis gives the count (as a percentage) of how many input reads had this many matches. 1 possible match = 'unique alignment', 0 possible matches = 'unaligned or too ambiguously aligned'.
Configuration:
This module does not take any parameters and cannot be customized any further. You may, however, change the names of the outputs produced in the Input & Outputs tab.
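The difference between the Total Region Length and the Total Interval Length reported in the tables can be made concrete with a small sketch. It assumes that a region is given as a list of (start, end) intervals and uses the 'end - start' convention quoted above; the function names are hypothetical.

```python
def total_region_length(intervals):
    """Length of the whole span covered by a region (= end - start)."""
    region_start = min(start for start, _ in intervals)
    region_end = max(end for _, end in intervals)
    return region_end - region_start


def total_interval_length(intervals):
    """Sum of the lengths of the individual intervals making up a region."""
    return sum(end - start for start, end in intervals)


single = [(1000, 1050)]                # ordinary single-interval alignment
paired = [(1000, 1050), (1200, 1250)]  # e.g. a paired-end or spliced alignment

print(total_region_length(single), total_interval_length(single))
print(total_region_length(paired), total_interval_length(paired))
```

For the single-interval region both measures agree, whereas for the two-interval region the Total Region Length also includes the gap between the intervals.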

General Sequence Statistics

Developer(s): Halbritter F
Version: 2012-08-20
Related Citation(s): [17]
Input(s): Sequence DataUp to fifty sequence data sets that you would like to compare.
Output(s): Sequence Data StatisticsStatistics such as read lengths, quality, nucleotide composition, etc.
Parameter(s): Name for Sequence Data StatisticsA user-defined name that will be assigned to the output dataset from port Sequence Data Statistics.
Description:
This module calculates a number of informative summary statistics about one or more sequence datasets, which may be short read sequencing libraries.
The results will be summarized in a series of tables, plots and graphs:
  • Library Sizes: A simple bar chart quickly summarizing the size of the input datasets. The total height of each bar reports the number of sequences in each dataset, while the dark greenish component counts only distinct sequences. Distinct sequences are those that differ from each other in at least one nucleotide character, e.g. ATTC is distinct from GTTC, but not from another copy of ATTC.
  • Nucleotide Composition per Dataset / Library: This barchart gives an overview about the distribution of nucleotide characters within each library. In general, one would usually expect a roughly equal amount of each nucleotide in each library, although the distribution might be a bit skewed towards certain nucleotides in genomic DNA or certain other library types, which are biased towards one nucleotide or another. The counts are not normalised, so larger libraries will have overall higher bars.
  • Nucleotide Composition per Cycle: Another way to look at nucleotide distribution is per base position. In short read sequencing libraries, each base position corresponds to one sequencing cycle. Again, we would usually expect the nucleotide distribution to be roughly uniform across all cycles. Deviations from uniformity might be a reason for concern and could indicate that something might have gone wrong in the sequencing process.
  • Average Quality Score per Cycle: This plot will only be shown for short read sequence libraries with associated quality scores (cp. SubConcept 'Quality Scores and Formats'). The plot shows the average quality score (arithmetic mean) at each stage of the sequencing process. The quality scores are generally observed to drop a bit towards the end of longer reads, but should hopefully remain within an acceptable range. A sudden drop in quality might indicate errors in the sequencing process.
  • Cumulative Quality Scores / Average Quality Scores: While the above plot took the average of the quality scores per base position, the next two plots take the sum (Cumulative Quality Scores) or average (Average Quality Scores) of all quality scores per sequence and count the frequency with which each summary value is observed. These plots can be quite useful to decide on a sensible threshold for quality control filtering.
  • Abundance of Identical Sequences: The plot shows how abundant individual read species are: GeneProf looks at all distinct reads (distinct reads differ in at least one nucleotide character, see first plot) and counts how often each of these occurs. GeneProf then plots the abundance of a sequence against the frequency of this degree of abundance. Generally, most sequences will occur only a few times and only very few sequences will be highly abundant (either due to amplification bias or thanks to real biology, e.g. for highly abundant transcript sequences).
Configuration:
This module does not take any parameters and cannot be customized any further. You may, however, change the names of the outputs produced in the Input & Outputs tab.
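The counts behind the 'Library Sizes' and 'Abundance of Identical Sequences' plots can be sketched as follows. This is merely an illustration using a plain list of read strings; the function names are hypothetical and the module's actual implementation may differ.

```python
from collections import Counter


def library_size(sequences):
    """Return (total number of sequences, number of distinct sequences)."""
    return len(sequences), len(set(sequences))


def abundance_histogram(sequences):
    """Map 'abundance' (copies of one distinct sequence) to the number of
    distinct sequences observed with exactly that many copies."""
    copies_per_sequence = Counter(sequences)
    return Counter(copies_per_sequence.values())


reads = ["ATTC", "ATTC", "GTTC", "ATTC", "CCGA"]
total, distinct = library_size(reads)
histogram = abundance_histogram(reads)
```

In this toy library, two distinct sequences occur once and one distinct sequence occurs three times.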

Meta

Meta-modules are combinations of other GeneProf modules, designed to make workflows more concise. Use meta-modules if you do not require all intermediate outputs.

MACS + Gene Association + Statistics

Developer(s): Halbritter F (Module Developer); Halbritter F (Meta-Module Developer)
Version: 2012-10-23
Related Citation(s): [7][17]
Related Web Page(s): MACS Homepage
Input(s): Aligned ChIP ReadsUp to 30 genomic region datasets corresponding to aligned sequences from several ChIP-seq experiments.
ChIP ControlA genomic region dataset corresponding to the aligned sequences from a control sample in a ChIP-seq experiment. These define the background signal.
Output(s): ChIP PeaksClustered regions of enriched DNA-protein binding.
Regulated FeaturesA set of features with a boolean value for each gene that indicates, whether the feature has a nearby ChIP-seq peak or not.
Annotated PeaksAll input peaks with additional annotations to mark all genes they have been assigned to.
TFBS StatisticsStatistics such as the location of regions with respect to genes, overlaps of binding sites, ..
Parameter(s): BandwidthBand width. This value is used while building the shifting model. If --nomodel is set, 2 times this value will be used as the scan window width. DEFAULT: 300
P-Value <Pvalue cutoff for peak detection. DEFAULT: 1e-5.
Fold Enrichment to BG >Lower Bound: Select the regions within MFOLD range of high-confidence enrichment ratio against background to build model. The regions must be lower than upper limit, and higher than the lower limit. DEFAULT:10,30
Fold Enrichment to BG <Upper Bound: Select the regions within MFOLD range of high-confidence enrichment ratio against background to build model. The regions must be lower than upper limit, and higher than the lower limit. DEFAULT:10,30
No Lambda?If True, MACS will use fixed background lambda as local lambda for every peak region. Normally, MACS calculates a dynamic local lambda to reflect the local bias due to potential chromatin structure.
Small Local Lambda RegionThe small nearby region in basepairs to calculate dynamic lambda. This is used to capture the bias at the peak summit region. Invalid if there is no control data.
Large Local Lambda RegionThe large nearby region in basepairs to calculate dynamic lambda. This is used to capture the surround bias. DEFAULT: 1000 when control is available, and 5000 if there's no control.
No Model?Whether or not to build the shifting model. If True, MACS will not build a model. By default, this means a shift size of 100; set Shift Size to change it.
Shift SizeThe arbitrary shift size in bp. When nomodel is true, MACS will regard this value as 'modeled' d.
Effective Genome SizeEffective genome size. It can be 1.0e+9 or 1000000000, or shortcuts:'hs' for human (2.7e9), 'mm' for mouse (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for fruitfly (1.2e8)
PET Insert SizeBest distance between Pair-End Tags. Default is 200bps. When MACS reads mapped positions for 5' tag and 3' tag, it will decide the best pairing for them using this best distance parameter
FDR <False discovery rate (FDR) threshold. Peaks not passing this threshold will be considered false. If there is no background sample, this threshold is going to be ignored.
Distinct InputsOnly use distinct alignments as input (might avoid amplification bias).
Name for ChIP PeaksA user-defined name that will be assigned to the output dataset from port ChIP Peaks.
Name for False PeaksA user-defined name that will be assigned to the output dataset from port False Peaks.
Window Size UpstreamA binding site has to be within this distance upstream of the TSS of a feature to be assigned to that feature.
Window Size DownstreamA binding site has to be within this distance downstream of the TSS of a feature to be assigned to that feature
Assign closest only?It will often happen that a binding peak is close to more than one feature. If several features are found within the specified window size, shall the peak be assigned to all of them or just to the closest one?
Peak FlagsA selection of boolean-value columns that indicate whether the region is a binding site for a certain transcription factor. If no columns are selected, all boolean columns will be used.
Name for Annotated PeaksA user-defined name that will be assigned to the output dataset from port Annotated Peaks.
Name for Regulated FeaturesA user-defined name that will be assigned to the output dataset from port Regulated Features.
Name for TFBS StatisticsA user-defined name that will be assigned to the output dataset from port TFBS Statistics.
Description:
This meta-module combines the MACS-based peak detection module (see module 'Find Peaks with MACS') with gene association (see module 'Assign TFBS to Genes') and binding sites statistics (see module 'ChIP-seq Peak Summary'). Running this meta-module is completely equivalent to using the MACS module followed by gene association and statistics, but may help to keep workflows with many datasets more manageable and concise. Some intermediate datasets will be discarded (i.e. peaks failing to pass the FDR filter).
Configuration:
To configure this meta-module, you may customize the parameters of the individual modules used. Please have a look at the respective parts of the manual: module 'Find Peaks with MACS', module 'Assign TFBS to Genes' and module 'ChIP-seq Peak Summary'.

Quality Control + Bowtie Alignment

Developer(s): Halbritter F (Module Developer); Trapnell C, Langmead B (Software Developer); Halbritter F (Meta-Module Developer)
Version: 2011-03-15
Related Citation(s): [17][9]
Related Web Page(s): Bowtie Homepage
Input(s): Original SequencesA set of sequences which you would like to apply a filter to.
Reference GenomeThe reference genome / transcript sequences that you would like to align to.
Output(s): Modified SequencesOnly those sequences that passed the filter criteria you postulated and modified in the way you specified.
AlignmentsThe genomic coordinates of the regions to which the sequences aligned.
Parameter(s): Trim RightTrim a certain number of nucleotides from the end of every sequence. For paired-end / mate-pair sequences, trim this number of nucleotides off the end of each sequence in the pair separately.
Trim LeftTrim a certain number of nucleotides from the front of every sequence. For paired-end / mate-pair sequences, trim this number of nucleotides off the beginning of each sequence in the pair separately.
Fix LengthTrim sequences to a fixed length. Shorter sequences will remain unchanged. For paired-end / mate-pair sequences, each sequence in the pair will be trimmed to the fixed length separately.
Cut after NTrim every sequence from the first occurrence of an uncertain nucleotide (N). For paired-end / mate-pair sequences, each sequence in the pair will be trimmed separately, thus the result might be sequences with different lengths for each member of the pair.
Cut after low qualityTrim every sequence from the first occurrence of a certain (low) quality value. For paired-end / mate-pair sequences, each sequence in the pair will be trimmed separately, thus the result might be sequences with different lengths for each member of the pair.
AppendAppend a fixed sequence to the end of each input sequence. For paired-end / mate-pair sequences, the suffix will be appended to both sequences.
PrependAttach a fixed sequence to the front of each sequence. For paired-end / mate-pair sequences, the prefix will be prepended to both sequences.
Cumulative Quality Score >The sum of all quality scores per read has to be at least as large as this value in order to pass the filter. For paired-end / mate-pair sequences, the score is calculated for each read separately and the pair is only accepted if both members pass the threshold.
Minimum Quality Score >The minimum per-base quality score in each read has to be at least as large as this value in order to pass the filter. For paired-end / mate-pair sequences, the score is calculated for each read separately and the pair is only accepted if both members pass the threshold.
Mean Quality Score >The average per-base quality score in each read has to be at least as large as this value in order to pass the filter. For paired-end / mate-pair sequences, the score is calculated for each read separately and the pair is only accepted if both members pass the threshold.
Sequence Complexity >Highly homogeneous sequences (e.g. only T's) are likely to result from error and may be discarded. For paired-end / mate-pair sequences, the score is calculated for each read separately and the pair is only accepted if both members pass the threshold.
Discard N-ReadsDiscard all reads containing uncertain nucleotides (N). For paired-end / mate-pair sequences, the entire pair will be discarded if either sequence contains an N.
Sequence Length >Discard all reads that are shorter than a given length. For paired-end / mate-pair sequences, the entire pair will be discarded if either sequence is too short.
Trim leading..Trim all leading (i.e. those at the left of the reads) occurrences of a certain nucleotide.
Trim trailingTrim all trailing (i.e. those at the right of the reads) occurrences of a certain nucleotide.
Name for Failed SequencesA user-defined name that will be assigned to the output dataset from port Failed Sequences.
Name for Modified SequencesA user-defined name that will be assigned to the output dataset from port Modified Sequences.
Use quality values?Shall the alignment tool make use of read quality scores?
Max. mismatches in seedThe maximum number of mismatches allowed in the seed portion of the read. See also: parameter -l.
Seed lengthThe seed of a read is the left portion of the read which is more likely to be of a good quality.
Max. sum of mismatch qualsMaximum sum of quality values at all mismatched read positions.
Disable quality rounding?By default, bowtie rounds quality values to the nearest 10, with a maximum of 30. Enable this option to switch the rounding off.
Max. mismatchesThe maximum number of mismatches across the entire alignment.
No forward strand?Do not attempt to align reads to the forward strand of the reference.
No reverse strand?Do not attempt to align reads to the reverse strand of the reference.
Max. reported alignmentsReport up to K valid alignments per read.
Max. ambiguitySuppress all alignments if more than M valid alignments exist.
Best possible hitsMake Bowtie guarantee that reported singleton alignments are 'best' in terms of stratum (i.e. number of mismatches, or mismatches in the seed in the case of -n mode) and in terms of the quality values at the mismatched position(s).
Use strataAn alignment's stratum is defined by its number of mismatches. Activating this option has the consequence that only alignments in the best stratum (i.e. those with the fewest mismatches) will be reported, even if further alignments with more mismatches exist.
Share memoryMultiple bowtie processes on the same computer share their memory (does not affect your results).
Use multiple coresAllows bowtie to use more than one CPU on the same computer (does not affect your results).
Min. Insert SizeThe minimum insert size for valid paired-end alignments.
Max. Insert SizeThe maximum insert size for valid paired-end alignments.
Mate OrientationThe upstream/downstream mate orientations for a valid paired-end alignment against the forward reference strand.
No strataUse this parameter to switch OFF bowtie's --strata option. An alignment's stratum is defined by its number of mismatches. With --strata active (the GeneProf default), only alignments in the best stratum (i.e. those with the fewest mismatches) will be reported, even if further alignments with more mismatches exist.
Name for Ambiguous SequencesA user-defined name that will be assigned to the output dataset from port Ambiguous Sequences.
Name for Unaligned SequencesA user-defined name that will be assigned to the output dataset from port Unaligned Sequences.
Name for AlignmentsA user-defined name that will be assigned to the output dataset from port Alignments.
Description:
This meta-module combines the sequence pre-processing and quality control module (see module 'Modify and Filter Sequences') with sequence alignment to a genome using Bowtie (see module 'Align against DNA with Bowtie ..'). Running this meta-module is completely equivalent to using the quality control module followed by sequence alignment, but may help to keep workflows with many datasets more manageable and concise. Some intermediate datasets will be discarded (i.e. reads filtered out during quality control as well as highly ambiguously aligned and unaligned reads).
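The trimming and quality filters listed above can be illustrated with a minimal Python sketch. It is not GeneProf's implementation: the function names (`trim`, `passes`, `filter_pair`) and keyword names are hypothetical, and the thresholds are read as "at least as large as", following the parameter descriptions. Reads are represented as a sequence string plus a list of per-base quality scores.

```python
from statistics import mean

def trim(seq, quals, trim_left=0, trim_right=0):
    """Remove trim_left bases from the front and trim_right bases from the end."""
    end = max(len(seq) - trim_right, trim_left)  # never slice backwards
    return seq[trim_left:end], quals[trim_left:end]

def passes(quals, min_sum=0, min_base=0, min_mean=0.0, min_len=1):
    """Apply the length and quality thresholds to a single read."""
    if not quals or len(quals) < min_len:
        return False
    return (sum(quals) >= min_sum          # cumulative quality score
            and min(quals) >= min_base     # minimum per-base quality score
            and mean(quals) >= min_mean)   # mean per-base quality score

def filter_pair(pair, **thresholds):
    """Accept a paired-end record only if both mates pass on their own."""
    return all(passes(quals, **thresholds) for _seq, quals in pair)
```

For example, `trim("ACGTACGT", [30, 30, 30, 30, 20, 20, 10, 5], trim_left=1, trim_right=2)` keeps the five bases `CGTAC`, and a pair is rejected as a whole as soon as one mate falls below any threshold.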
Configuration:
To configure this meta-module, you may customize the parameters of the individual modules used. Please have a look at the respective parts of the manual: module 'Modify and Filter Sequences' and module 'Align against DNA with Bowtie ..'.
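For readers familiar with the Bowtie command line, several of the alignment parameters above correspond roughly to Bowtie 1 options. The sketch below assembles such an option list in Python. The function name `bowtie_args`, its keyword names, and the pairing of GeneProf labels with flags are assumptions made for illustration; how GeneProf actually invokes Bowtie is not documented in this section. The flags themselves (-n, -l, -e, -k, -m, --best, --strata, -p) are standard Bowtie 1 options.

```python
def bowtie_args(seed_mismatches=2, seed_length=28, max_qual_sum=70,
                max_reported=1, max_ambiguity=None, best=True,
                strata=True, threads=1):
    """Build a list of Bowtie 1 options from GeneProf-style settings (hypothetical mapping)."""
    args = [
        "-n", str(seed_mismatches),  # Max. mismatches in seed
        "-l", str(seed_length),      # Seed length
        "-e", str(max_qual_sum),     # Max. sum of mismatch quals
        "-k", str(max_reported),     # Max. reported alignments
    ]
    if max_ambiguity is not None:
        args += ["-m", str(max_ambiguity)]  # Max. ambiguity
    if best:
        args.append("--best")               # Best possible hits
        if strata:
            args.append("--strata")         # Bowtie only accepts --strata together with --best
    if threads > 1:
        args += ["-p", str(threads)]        # Use multiple cores
    return args
```

Calling `bowtie_args(max_ambiguity=10)` yields an option list that suppresses reads with more than ten valid alignments while reporting only best-stratum hits.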

Quality Control + Bowtie Iterative Trimming Alignment

Developer(s): Halbritter F (Module Developer); Trapnell C, Langmead B (Software Developer); Halbritter F (Meta-Module Developer)
Version: 2011-03-15
Related Citation(s): [17][9]
Related Web Page(s): Bowtie Homepage
Input(s): Original SequencesA set of sequences which you would like to apply a filter to.
Reference GenomeThe reference genome / transcript sequences that you would like to align to.
Output(s): Modified SequencesOnly those sequences that passed the filter criteria you postulated and modified in the way you specified.
AlignmentsThe genomic coordinates of the regions to which the sequences aligned.
Parameter(s): Trim RightTrim a certain number of nucleotides from the end of every sequence. For paired-end / mate-pair sequences, trim this number of nucleotides off the end of each sequence in the pair separately.
Trim LeftTrim a certain number of nucleotides from the front of every sequence. For paired-end / mate-pair sequences, trim this number of nucleotides off the beginning of each sequence in the pair separately.
Fix LengthTrim sequences to a fixed length. Shorter sequences will remain unchanged. For paired-end / mate-pair sequences, each sequence in the pair will be trimmed to the fixed length separately.
Cut after NTrim every sequence from the first occurrence of an uncertain nucleotide (N). For paired-end / mate-pair sequences, each sequence in the pair will be trimmed separately, thus the result might be sequences with different lengths for each member of the pair.
Cut after low qualityTrim every sequence from the first occurrence of a certain (low) quality value. For paired-end / mate-pair sequences, each sequence in the pair will be trimmed separately, thus the result might be sequences with different lengths for each member of the pair.
AppendAppend a fixed sequence to the end of each input sequence. For paired-end / mate-pair sequences, the suffix will be appended to both sequences.
PrependAttach a fixed sequence to the front of each sequence. For paired-end / mate-pair sequences, the prefix will be prepended to both sequences.
Cumulative Quality Score >The sum of all quality scores per read has to be at least as large as this value in order to pass the filter. For paired-end / mate-pair sequences, the score is calculated for each read separately and the pair is only accepted if both members pass the threshold.
Minimum Quality Score >The minimum per-base quality score in each read has to be at least as large as this value in order to pass the filter. For paired-end / mate-pair sequences, the score is calculated for each read separately and the pair is only accepted if both members pass the threshold.
Mean Quality Score >The average per-base quality score in each read has to be at least as large as this value in order to pass the filter. For paired-end / mate-pair sequences, the score is calculated for each read separately and the pair is only accepted if both members pass the threshold.
Sequence Complexity >Highly homogeneous sequences (e.g. only T's) are likely to result from error and may be discarded. For paired-end / mate-pair sequences, the score is calculated for each read separately and the pair is only accepted if both members pass the threshold.
Discard N-ReadsDiscard all reads containing uncertain nucleotides (N). For paired-end / mate-pair sequences, the entire pair will be discarded if either sequence contains an N.
Sequence Length >Discard all reads that are shorter than a given length. For paired-end / mate-pair sequences, the entire pair will be discarded if either sequence is too short.
Trim leading..Trim all leading (i.e. those at the left of the reads) occurrences of a certain nucleotide.
Trim trailingTrim all trailing (i.e. those at the right of the reads) occurrences of a certain nucleotide.
Name for Failed SequencesA user-defined name that will be assigned to the output dataset from port Failed Sequences.
Name for Modified SequencesA user-defined name that will be assigned to the output dataset from port Modified Sequences.
Use quality values?Shall the alignment tool make use of read quality scores?
Max. mismatches in seedThe maximum number of mismatches allowed in the seed portion of the read. See also: parameter -l.
Seed lengthThe seed of a read is the left portion of the read which is more likely to be of a good quality.
Max. sum of mismatch qualsMaximum sum of quality values at all mismatched read positions.
Disable quality rounding?By default, bowtie rounds quality values to the nearest 10, with a maximum of 30. Enable this option to switch the rounding off.
Max. mismatchesThe maximum number of mismatches across the entire alignment.
No forward strand?Do not attempt to align reads to the forward strand of the reference.
No reverse strand?Do not attempt to align reads to the reverse strand of the reference.
Max. reported alignmentsReport up to K valid alignments per read.
Max. ambiguitySuppress all alignments if more than M valid alignments exist.
Best possible hitsMake Bowtie guarantee that reported singleton alignments are 'best' in terms of stratum (i.e. number of mismatches, or mismatches in the seed in the case of -n mode) and in terms of the quality values at the mismatched position(s).
Use strataAn alignment's stratum is defined by its number of mismatches. Activating this option has the consequence that only alignments in the best stratum (i.e. those with the fewest mismatches) will be reported, even if further alignments with more mismatches exist.
Share memoryMultiple bowtie processes on the same computer share their memory (does not affect your results).
Use multiple coresAllows bowtie to use more than one CPU on the same computer (does not affect your results).
Min. Insert SizeThe minimum insert size for valid paired-end alignments.
Max. Insert SizeThe maximum insert size for valid paired-end alignments.
Mate OrientationThe upstream/downstream mate orientations for a valid paired-end alignment against the forward reference strand.
No strataUse this parameter to switch OFF bowtie's --strata option. An alignment's stratum is defined by its number of mismatches. With --strata active (the GeneProf default), only alignments in the best stratum (i.e. those with the fewest mismatches) will be reported, even if further alignments with more mismatches exist.
Name for Ambiguous SequencesA user-defined name that will be assigned to the output dataset from port Ambiguous Sequences.
Name for Unaligned SequencesA user-defined name that will be assigned to the output dataset from port Unaligned Sequences.
Name for AlignmentsA user-defined name that will be assigned to the output dataset from port Alignments.
Min. Len. (ITA)Discard reads becoming too short during the iterative trimming alignment procedure.
Max. Steps (ITA)Stop after at most this many cycles of trimming and re-alignment.
Trim Start (ITA)Trim this many bases off the start (left) of each unaligned read in each step.
Trim End (ITA)Trim this many bases off the end (right) of each unaligned read in each step.
Description:
This meta-module combines the sequence pre-processing and quality control module (see module 'Modify and Filter Sequences') with sequence alignment to a genome using Bowtie (see module 'Align against DNA with Bowtie ..'). After an initial round of quality control, GeneProf will try to align the short read sequence data to the given reference genome. Reads that fail to align are then trimmed slightly and aligned again. The process is repeated until all sequences have been aligned or a maximum number of iterations has been reached. This methodology has been reported to drastically improve the alignment success rate (e.g. in [4]), in particular in datasets with long reads or where sequencing errors accumulate with read length. However, it is important to acknowledge that short reads tend to align more ambiguously (GeneProf will discard reads that align too ambiguously using a user-defined threshold) and also that the process might create 'wrong' alignments, i.e. shortened reads might seem to originate from a certain region of the genome although they do not actually come from there. Thus, this module should be used with care!
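The ambiguity caveat can be made concrete with a small Python sketch of how an ambiguity threshold and the strata option might interact for a single read. This is an illustration under stated assumptions, not Bowtie's or GeneProf's code: the function name `report_alignments` and the representation of hits as `(position, mismatches)` tuples are hypothetical.

```python
def report_alignments(hits, max_ambiguity, use_strata=True):
    """Return the alignments to report for one read, or [] if it is too ambiguous.

    hits: list of (position, mismatches) tuples, one per valid alignment.
    """
    if not hits:
        return []
    if use_strata:
        # Keep only the best stratum, i.e. the hits with the fewest mismatches.
        best = min(mismatches for _pos, mismatches in hits)
        hits = [hit for hit in hits if hit[1] == best]
    if len(hits) > max_ambiguity:
        return []  # more candidate locations than allowed: discard the read
    return hits
```

With hits `[(100, 0), (500, 1), (900, 1)]` and a threshold of 1, the read is kept when strata are used (one perfect hit) but discarded otherwise (three candidate locations). Shorter reads typically collect more hits, so they reach the threshold sooner.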
Configuration:
To configure this meta-module, you may customize the parameters of the individual modules used. Please have a look at the respective parts of the manual: module 'Modify and Filter Sequences' and module 'Align against DNA with Bowtie ..'.
The pre-processing and quality control module is run once before the iterative alignment procedure starts, so the parameters will affect the data that goes into the later steps. The Bowtie configuration will be used in each step of the iterative alignment process.
This meta-module takes a few additional parameters concerned with the iterative alignment procedure: At each step, unaligned sequences will be trimmed a bit prior to re-aligning them. You can define how much of each sequence is trimmed off, either from the beginning of each read (Trim Start (ITA)) or from its end (Trim End (ITA)). The iterative alignment process will continue until no unaligned sequences are left, until the sequences become too short, or until the maximum number of iterations has been reached. You can set the minimum sequence length accepted (Min. Len. (ITA)) and the maximum number of steps allowed (Max. Steps (ITA)) using this dialog.
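The iterative procedure can be sketched in Python as follows. This is a minimal illustration, not GeneProf's implementation: `toy_align` is a stand-in that only finds exact matches (it is not Bowtie), and the function and keyword names are hypothetical counterparts of Trim Start (ITA), Trim End (ITA), Min. Len. (ITA) and Max. Steps (ITA). The sketch assumes that the initial, untrimmed alignment round does not count towards the maximum number of steps.

```python
def toy_align(read, reference):
    """Stand-in aligner: position of the first exact match, or None."""
    pos = reference.find(read)
    return pos if pos >= 0 else None

def iterative_trimming_alignment(reads, reference, trim_start=0, trim_end=2,
                                 min_len=4, max_steps=5, align=toy_align):
    """Align reads, then repeatedly trim and re-align those that failed.

    reads: dict mapping read name -> sequence.
    Returns a dict mapping read name -> (position, aligned sequence).
    """
    aligned = {}
    pending = dict(reads)
    for step in range(max_steps + 1):  # step 0 is the untrimmed round
        unaligned = {}
        for name, seq in pending.items():
            pos = align(seq, reference)
            if pos is None:
                unaligned[name] = seq
            else:
                aligned[name] = (pos, seq)
        if not unaligned or step == max_steps:
            break
        # Trim every unaligned read; drop those that have become too short.
        pending = {}
        for name, seq in unaligned.items():
            shorter = seq[trim_start:max(len(seq) - trim_end, 0)]
            if len(shorter) >= min_len:
                pending[name] = shorter
    return aligned
```

A read whose last two bases are sequencing errors fails in the first round but aligns after one trimming step, while a read that never matches is dropped once it falls below the minimum length.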

Quality Control + Tophat 1.20 Alignment

Developer(s): Halbritter F (Module Developer); Trapnell C (Software Developer); Halbritter F (Meta-Module Developer)
Version: 2011-03-15
Related Citation(s): [17][10]
Related Web Page(s): TopHat Homepage, Picard Tools (for SAM parsing)
Input(s): Original SequencesA set of sequences which you would like to apply a filter to.
Reference GenomeThe reference genome / transcript sequences that you would like to align to.
Output(s): Modified SequencesOnly those sequences that passed the filter criteria you postulated and modified in the way you specified.
AlignmentsThe genomic coordinates of the regions to which the sequences aligned.
Parameter(s): Trim RightTrim a certain number of nucleotides from the end of every sequence. For paired-end / mate-pair sequences, trim this number of nucleotides off the end of each sequence in the pair separately.
Trim LeftTrim a certain number of nucleotides from the front of every sequence. For paired-end / mate-pair sequences, trim this number of nucleotides off the beginning of each sequence in the pair separately.
Fix LengthTrim sequences to a fixed length. Shorter sequences will remain unchanged. For paired-end / mate-pair sequences, each sequence in the pair will be trimmed to the fixed length separately.
Cut after NTrim every sequence from the first occurrence of an uncertain nucleotide (N). For paired-end / mate-pair sequences, each sequence in the pair will be trimmed separately, thus the result might be sequences with different lengths for each member of the pair.
Cut after low qualityTrim every sequence from the first occurrence of a certain (low) quality value. For paired-end / mate-pair sequences, each sequence in the pair will be trimmed separately, thus the result might be sequences with different lengths for each member of the pair.
AppendAppend a fixed sequence to the end of each input sequence. For paired-end / mate-pair sequences, the suffix will be appended to both sequences.
PrependAttach a fixed sequence to the front of each sequence. For paired-end / mate-pair sequences, the prefix will be prepended to both sequences.
Cumulative Quality Score >The sum of all quality scores per read has to be at least as large as this value in order to pass the filter. For paired-end / mate-pair sequences, the score is calculated for each read separately and the pair is only accepted if both members pass the threshold.
Minimum Quality Score >The minimum per-base quality score in each read has to be at least as large as this value in order to pass the filter. For paired-end / mate-pair sequences, the score is calculated for each read separately and the pair is only accepted if both members pass the threshold.
Mean Quality Score >The average per-base quality score in each read has to be at least as large as this value in order to pass the filter. For paired-end / mate-pair sequences, the score is calculated for each read separately and the pair is only accepted if both members pass the threshold.
Sequence Complexity >Highly homogeneous sequences (e.g. only T's) are likely to result from error and may be discarded. For paired-end / mate-pair sequences, the score is calculated for each read separately and the pair is only accepted if both members pass the threshold.
Discard N-ReadsDiscard all reads containing uncertain nucleotides (N). For paired-end / mate-pair sequences, the entire pair will be discarded if either sequence contains an N.
Sequence Length >Discard all reads that are shorter than a given length. For paired-end / mate-pair sequences, the entire pair will be discarded if either sequence is too short.
Trim leading..Trim all leading (i.e. those at the left of the reads) occurrences of a certain nucleotide.
Trim trailingTrim all trailing (i.e. those at the right of the reads) occurrences of a certain nucleotide.
Name for Failed SequencesA user-defined name that will be assigned to the output dataset from port Failed Sequences.
Name for Modified SequencesA user-defined name that will be assigned to the output dataset from port Modified Sequences.
Min. Anchor LengthTopHat will report junctions spanned by reads with at least this many bases on each side of the junction. Note that individual spliced alignments may span a junction with fewer than this many bases on one side. However, every junction involved in spliced alignments is supported by at least one read with this many bases on each side. This must be at least 3 and the default is 8.
Splice MismatchesThe maximum number of mismatches that may appear in the anchor region of a spliced alignment. The default is 0.
Min. Intron LengthTopHat will ignore donor/acceptor pairs closer than this many bases apart. The default is 70.
Max. Intron LengthWhen searching for junctions ab initio, TopHat will ignore donor/acceptor pairs farther than this many bases apart, except when such a pair is supported by a split segment alignment of a long read. The default is 500000.
Max. MultihitsInstructs TopHat to allow up to this many alignments to the reference for a given read, and suppresses all alignments for reads with more than this many alignments.
Min. Isoform FractionTopHat filters out junctions supported by too few alignments. Suppose a junction spanning two exons A and B is supported by S reads. Let the average depth of coverage of exon A be D, and assume that it is higher than that of exon B. If S / D is less than the minimum isoform fraction, the junction is not reported. A value of zero disables the filter. The default is 0.15.
Mate Inner DistanceThis is the expected (mean) inner distance between mate pairs. For example, for paired-end runs with fragments selected at 300bp, where each end is 50bp, you should set -r to be 200. There is no default, and this parameter is required for paired-end runs.
Mate Inner Distance STDThe standard deviation for the distribution on inner distances between mate pairs. The default is 20bp.
No Novel JunctionsDo not attempt to find any new splice junctions.
No Known JunctionsDo not use known splice junctions.
Microexon SearchWith this option, the pipeline will attempt to find alignments incident to microexons. Works only for reads 50bp or longer.
Library TypeThe type of library sequenced.
Butterfly SearchTopHat will use a slower but potentially more sensitive algorithm to find junctions in addition to its standard search. Consider using this if you expect that your experiment produced a lot of reads from pre-mRNA, that fall within the introns of your transcripts.
Coverage SearchUse coverage search? This is disabled by default if reads are longer than 75bp, but can be forced ON.
Closure SearchUse mate pair closure-based search for junctions? Closure-based search should only be used when the expected inner distance between mates is small (<= 50bp).
Segment MismatchesRead segments are mapped independently, allowing up to this many mismatches in each segment alignment. The default is 2.
Segment LengthEach read is cut up into segments, each at least this long. These segments are mapped independently. The default is 25.
Min. Closure ExonDuring closure search for paired end reads, exonic hops in the potential splice graph must be at least this long. The default is 50.
Min. Closure IntronThe minimum intron length that may be found during closure search. The default is 50.
Max. Closure IntronThe maximum intron length that may be found during closure search. The default is 5000.
Min. Coverage IntronThe minimum intron length that may be found during coverage search. The default is 50.
Max. Coverage IntronThe maximum intron length that may be found during coverage search. The default is 20000.
Min. Segment IntronThe minimum intron length that may be found during split-segment search. The default is 50.
Max. Segment IntronThe maximum intron length that may be found during split-segment search. The default is 500000.
Name for AlignmentsA user-defined name that will be assigned to the output dataset from port Alignments.
Name for JunctionsA user-defined name that will be assigned to the output dataset from port Junctions.
Description:
This meta-module combines the sequence pre-processing and quality control module (see module 'Modify and Filter Sequences') with sequence alignment to a genome using TopHat (see module 'TopHat 1.2 Alignment'). Running this meta-module is completely equivalent to using the quality control module followed by sequence alignment, but may help to keep workflows with many datasets more manageable and concise. Some intermediate datasets will be discarded (i.e. reads filtered out during quality control as well as any novel splice junctions discovered).
Configuration:
To configure this meta-module, you may customize the parameters of the individual modules used. Please have a look at the respective parts of the manual: module 'Modify and Filter Sequences' and module 'TopHat 1.2 Alignment'.
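Two of the TopHat parameters above involve simple arithmetic that is easy to get wrong: Mate Inner Distance and Min. Isoform Fraction. The following Python sketch works through both, using the figures from the parameter descriptions. The function names are hypothetical and the code only restates the rules as described in this manual; it is not taken from TopHat.

```python
def mate_inner_distance(fragment_length, read_length):
    """Inner distance = fragment length minus the two sequenced ends."""
    return fragment_length - 2 * read_length

def junction_reported(supporting_reads, deeper_exon_coverage,
                      min_isoform_fraction=0.15):
    """Apply the minimum isoform fraction rule to one junction.

    supporting_reads: S, the number of reads spanning the junction.
    deeper_exon_coverage: D, the average coverage of the more deeply
        covered of the two exons (must be positive).
    """
    if min_isoform_fraction == 0:
        return True  # a value of zero disables the filter
    if deeper_exon_coverage <= 0:
        raise ValueError("coverage must be positive")
    # The junction is dropped if S / D is less than the minimum fraction.
    return supporting_reads / deeper_exon_coverage >= min_isoform_fraction
```

For 300bp fragments sequenced with 50bp reads from each end, `mate_inner_distance(300, 50)` gives 200. A junction supported by 10 reads next to an exon with an average coverage of 100 has a ratio of 0.1 and would be filtered out at the default fraction of 0.15.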

Bibliography


[1] Bailey, TL and Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol (1994). PMID: 7584402.
[2] Bolstad, BM and Irizarry, RA and Astrand, M and Speed, TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics (2003). PMID: 12538238.
[3] Mortazavi, A and Williams, BA and McCue, K and Schaeffer, L and Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods (2008). PMID: 18516045.
[4] Cloonan, N and Forrest, AR and Kolle, G and Gardiner, BB and Faulkner, GJ and Brown, MK and Taylor, DF and Steptoe, AL and Wani, S and Bethel, G and Robertson, AJ and Perkins, AC and Bruce, SJ and Lee, CC and Ranade, SS and Peckham, HE and Manning, JM and McKernan, KJ and Grimmond, SM. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat. Methods (2008). PMID: 18516046.
[5] Chen, X and Xu, H and Yuan, P and Fang, F and Huss, M and Vega, VB and Wong, E and Orlov, YL and Zhang, W and Jiang, J and Loh, YH and Yeo, HC and Yeo, ZX and Narang, V and Govindarajan, KR and Leong, B and Shahab, A and Ruan, Y and Bourque, G and Sung, WK and Clarke, ND and Wei, CL and Ng, HH. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell (2008). PMID: 18555785.
[6] Jothi, R and Cuddapah, S and Barski, A and Cui, K and Zhao, K. Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res. (2008). PMID: 18684996.
[7] Zhang, Y and Liu, T and Meyer, CA and Eeckhoute, J and Johnson, DS and Bernstein, BE and Nusbaum, C and Myers, RM and Brown, M and Li, W and Liu, XS. Model-based analysis of ChIP-Seq (MACS). Genome Biol. (2008). PMID: 18798982.
[8] Ji, H and Jiang, H and Ma, W and Johnson, DS and Myers, RM and Wong, WH. An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat. Biotechnol. (2008). PMID: 18978777.
[9] Langmead, B and Trapnell, C and Pop, M and Salzberg, SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. (2009). PMID: 19261174.
[10] Trapnell, C and Pachter, L and Salzberg, SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (2009). PMID: 19289445.
[11] Robinson, MD and McCarthy, DJ and Smyth, GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (2010). PMID: 19910308.
[12] Ouyang, Z and Zhou, Q and Wong, WH. ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells. Proc. Natl. Acad. Sci. U.S.A. (2009). PMID: 19995984.
[13] Quinlan, AR and Hall, IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics (2010). PMID: 20110278.
[14] Young, MD and Wakefield, MJ and Smyth, GK and Oshlack, A. Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol. (2010). PMID: 20132535.
[15] Xu, H and Handoko, L and Wei, X and Ye, C and Sheng, J and Wei, CL and Lin, F and Sung, WK. A signal-noise model for significance analysis of ChIP-seq with negative control. Bioinformatics (2010). PMID: 20371496.
[16] Anders, S and Huber, W. Differential expression analysis for sequence count data. Genome Biol. (2010). PMID: 20979621.
[17] Halbritter, F and Vaidya, HJ and Tomlinson, SR. GeneProf: analysis of high-throughput sequencing experiments. Nat. Methods (2011). PMID: 22205509.