GeneProf Manual


Florian Halbritter, 27-Nov-2012

Table of Contents

Chapter: Introduction

Chapter: Concepts Explained

Chapter: Tutorials

Chapter: Modules

Chapter: Pages

Chapter: Advanced Topics

Chapter: Frequently Asked Questions (FAQ)

Pages


Following below are descriptions and explanations for all the different types of web pages you'll encounter on your journeys through the GeneProf web interface.

GeneProf Homepage

The GeneProf Homepage. From here you can navigate to any part of the application and access all public data.
This page is your entry point to GeneProf's web application. From here, you'll be able to access all the different components of the system. Let's have a look at how the page is organized (cp. this figure).
Notice, first of all, at the very top of the page, a dark greenish navigation bar. The navigation bar provides quick access to most of GeneProf's functions, organized into different categories. The navigation bar is part of every page and you can use it to jump almost everywhere from anywhere within just a few clicks!
Right below the navigation bar, at the top of the actual page content you'll find a quick introduction to what GeneProf is and what it can do for you. Since you've already come this far, we'll assume you're well aware of GeneProf's purpose and scope by now (you may have a look at Chapter 'Introduction', though).
On the right side of the page there'll be a number of greenish boxes. If you haven't signed in to your personal GeneProf account yet, there'll be one providing you with links to the login (see this page) and registration (see this page) pages. Furthermore, regardless of whether you've signed in or not, two more boxes provide you with some information about recent changes to GeneProf: The most recent public experiment and a section for news and announcements. It might be worth keeping an eye on the news section, just to make you aware of any changes that might have an effect on the way you use GeneProf.
The last items on the page are what's probably most interesting: We've surveyed a number of common tasks people have in mind when coming to GeneProf and you can access those parts of the application directly from this section (adequately labeled Things you might want to do). So what's on offer?

User Pages

The pages below are in some way or another connected to GeneProf user accounts.

Registration

User Registration Page. Sign up for a free GeneProf user account by providing just a little information about yourself.
Interested in obtaining a GeneProf user account? Great! You'll need an account to gain access to the advanced functionality of the application, such as running your own experiments and analyses. Registration with GeneProf is free of charge.
Registration with GeneProf is simple. You'll just need to complete the form on the registration page (cp. this figure), filling in some minimum information about yourself. Please provide your first name(s) and family name, as well as a valid email address. We ask you to repeat your email address just to avoid typos! You'll notice that both email text fields turn green if you've entered a valid email address and both email addresses match. If there's a typo (or if the address you've entered does not seem to comply with general guidelines for the format of email addresses) the text fields will be highlighted in red and you'll have a chance to correct your details.
Lastly, you will need to confirm that you're human. Sounds silly, but unfortunately this is necessary to prevent a large number of registrations by anonymous, automated programs. Just enter the two-word phrase displayed in the image into the corresponding text field. If you have problems reading the phrase, you can click the refresh button (the top-most of the three blue buttons next to the phrase) to request a different phrase.
Once you've completed both steps, click the big Register! button to send off your registration application, which will then be processed by the system. Within a few minutes, you should receive an email to the address that you've provided. The email will contain your initial user name and a randomly generated password. You can use this information to sign in to your new account via the login page (cp. this page).
After your first login, you can change your user name and password to something more memorable via the user profile (cp. this page).

User Login

The Login Page. Sign in to your user account to access your private data and run analysis pipelines.
This page (cp. this figure) allows you to sign in to your personal GeneProf user account. You'll need to sign in in order to gain access to the advanced functionality of the application, such as running your own experiments and analyses.
Please provide your account details (user name and password) in the appropriate text fields. If you wish, you can tick the box saying Remember me? to store your login details on your local computer and sign in automatically the next time you visit the page. This requires 'Cookies' to be enabled in your web browser (Google provides a good tutorial on how to enable cookies in all major browsers).
Click the Start GeneProf button to complete the login.
If you've forgotten your user name or password or if you don't have a user account yet, you can follow the links at the bottom of the page to either request a new password or to register for an account (cp. this page and this page, respectively).

Forgotten Password

Reset Forgotten Passwords. If you've lost your login details, a new, randomly generated password will be emailed to you.
Have you forgotten your password? Don't worry, GeneProf can email you a new randomly generated password within minutes.
Just enter your email address into the appropriate text field on this page (cp. this figure). You will also need to confirm that you're human. Sounds silly, but unfortunately this is necessary to prevent automated programs from requesting new passwords. Just enter the two-word phrase displayed in the image into the corresponding text field. If you have problems reading the phrase, you can click the refresh button (the top-most of the three blue buttons next to the phrase) to request a different phrase.
When you've completed both steps, click the big Reset Password button and check your email inbox. A new random password should have been emailed to you. Sometimes this can take a few minutes to appear in your inbox.

User Profile

This page (cp. this figure) provides some essential information about a registered GeneProf user and a contact form, which you may use to get in touch with the respective user. Mind that some of this information might be hidden if the user chose not to make it public.
User Profile. This page provides you with some information about a GeneProf user and provides you with a means of contacting them (if allowed by the user; left figure). If it's your own profile, you can also modify it and adjust certain preferences (right figure).
Furthermore, the page will summarize all experiments owned by this user which have been made public (if any). The list of experiments will be subdivided into pages (if necessary) and may be sorted, searched and filtered similar to the experiment browser page (see this page).
If you've registered a user account and visit your own profile, the page additionally contains forms that allow you to change your account details, including your email address and password.
If you are concerned about your privacy, you can choose not to display your name in your profile when other people visit it. Simply un-tick the box labelled Allow other users to view information about you on your profile? and click the Update My Personal Details, Preferences & Privacy Settings button. Similarly, you can prevent other users from contacting you via GeneProf (GeneProf won't ever share your email address directly) by un-ticking the box labelled Allow other users to contact you via GeneProf?.

Experiment Pages

The pages below are all in some way or another connected to experiments (cp. Concept 'Experiments').

Experiment Main Page

Experiment Main Page. The page provides a summary of an experiment in GeneProf. The page is organized in several sections, providing general information about the experiment and its creator(s), the input datasets and experimental factors / conditions, the data analysis and its outputs.
This page is the nexus of everything GeneProf is about! All data in GeneProf is centered around virtual experiments (cp. Concept 'Experiments') and each experiment, all its associated data, the data analysis workflow used and the main results are summarised here.
As a consequence of the sheer amount of what's in it, we understand the page might look a bit daunting at first, but don't worry: It will all become clear soon!
Firstly, please note that the same page will be used to display your own, in-progress experiments and finished, public experiments (cp. SubConcept 'Finalization and Publication') and therefore the exact contents of the page depend on (other than what the experiment is actually about) the stage of the analysis process the experiment is in. For example, actions and links that are in some way concerned with modifying the experiment (like uploading data or modifying the workflow, running wizards, etc.) will, of course, not be available for public experiments and other experiments which you are not (or no longer) allowed to modify.
So, let's have a look at what's there: Every experiment page is split into five main sections: General information, popular tasks, input data & sample annotation, main outputs and data analysis workflow. More details about each of these sections follow below:

General Information

At the top of the page, general information about the experiment will be displayed. This is, basically, the information you (or the experiment's creator) have entered in the experiment creation page (cp. this page) augmented with some additional information.
In addition to a verbose description of the experiment, you will find links to relevant publications and external web resources. All reference datasets used in the experiment will be listed as well as the technology platforms (e.g. sequencing machines) used. You'll also find information about the creator of the experiment (with a link to the user's profile, cp. this page), the date the experiment was first created and the date when it was last modified. Click the last modified date to open a dialog with the complete analysis history of this experiment (cp. this page).
Note that there's a link titled Permalink. You can use the URL given here to link to this specific experiment (it includes the experiment's accession number, gpXP_...), e.g. for use in publications.
If this is your own experiment and not yet completed, this section will moreover include a progress indicator, highlighting which steps of the experiment creation process you've already dealt with and what still needs to be done, i.e. to complete a full experiment you ought to provide input data and sample annotation, define and execute an analysis workflow, mark the main outputs of the experiment and then finalise it (cp. Concept 'Experiments').

Popular Tasks

In order to make your life a little bit easier, GeneProf summarises some of the most common tasks that might be of relevance to this dataset in this section. For in-progress experiments, for example, you'll find quick links to analysis wizards and experiment modification.
Some other popular tasks include: Browsing genomic data from the experiment in the genome browser (cp. this page), exporting a printable report for this experiment in PDF format (this report will contain all the information covered in the experiment main page and we propose to include such a report in any publication using GeneProf as an analysis tool), exporting browser tracks (cp. this page) and so on.

Input Data & Sample Annotation

This section of the page sums up the input data used in this experiment and the way in which these data have been annotated. You might want to read up about our interpretation of the concept of 'sample annotation' if you're confused about what you see here, see SubConcept 'Sample Annotation'.
If you are allowed to modify the experiment, you can upload additional data to your experiment by clicking the Upload button (which will open the upload dialog: this page) and change the sample annotation by clicking the Update Annotation button (which will forward to the sample annotation page: this page). Note that you have to upload some input data first, before you can edit the sample annotation.

Main Results

Usually, most people will only be interested in a few of the datasets produced in the course of a GeneProf data analysis workflow (cp. SubConcept 'Main Results'). Those will be the main results, e.g. lists of genes regulated by a transcription factor, a table of gene expression values or one of the many summary statistics (cp. ModuleCategory 'Summary Statistics').
The datasets are split into several tabs, and you can scroll through the content of each tab. However, especially on smaller computer screens, the display might at times get quite cramped. You can open each dataset in a new window / browser tab by clicking the button (Too small? Open in new page?) at the top.
For experiment creators: You can choose which datasets are to be displayed in this place by connecting them to a Main Experimental Results module in the workflow (cp. module 'Main Experimental Results').

Analysis Workflow

The last item on the experiment summary page is a simplified illustration of the workflow used. In this illustration, all nodes of the same type in the graph (i.e. the workflow steps using the same module) have been collapsed into one and summary statistics modules have been left out for the sake of clarity. You can view the full details of the workflow in the workflow designer by clicking the accordingly labelled button (i.e. Workflow Designer; cp. this page).
If you are in the process of creating this experiment and have not yet created any workflow, you'll get a selection of workflow wizards (cp. SubConcept 'Analysis Wizards') in this place instead. N.B. don't miss the drop-down list with further wizards below the big images with the most popular wizards!

Experiment Creation / Modification

Use this page to create a new experiment or modify the general information of an existing experiment. The only mandatory information to be provided is a short name for the experiment and a (slightly more verbose) description. As long as you don't intend to make your experiment public, this information is really just for yourself, but try to fill it in with something sensible anyway, just to help yourself keep track of your own data. You can always return to this page from the experiment main page (cp. this page) later on to modify the information.
Please try to keep the experiment name short (ideally no more than 30-40 characters). The description should contain information about the experimental setup, purpose and goal of the study. You can lay it out like an abstract.
Experiment Creation / Modification. You can create a basic, new experiment by entering just a little information in this form. The same form is also used to update existing experiments.
Let's have a look at the other input fields. Data Ownership refers to the relation the analysis in this experiment has to the data that is being used: If you're analysing your own primary research data, that would be primary, while, if you are re-analysing data produced elsewhere or within the scope of another study, the type should be secondary. If you are using both your own and other data, choose mixed.
If the data in this experiment has already been uploaded to the SRA [15], you can also provide the SRA project accession to link the GeneProf experiment to the submission in this database. Just enter the accession number (the letters SRA, SRP, ERP or ERA followed by six digits) into the respective text box. You may also leave the field blank.
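As a rough illustration of the accession format described above (one of the prefixes SRA, SRP, ERP or ERA followed by six digits), such a check could be sketched in Python as follows. The function name and the choice to accept a blank field are assumptions made for this example; this is not GeneProf's actual validation code.

```python
import re

# Pattern taken from the description above: one of four prefixes plus six digits.
SRA_ACCESSION = re.compile(r"(SRA|SRP|ERP|ERA)\d{6}")


def is_valid_sra_accession(text):
    """Return True for a blank field or a well-formed accession (hypothetical helper)."""
    text = text.strip()
    if not text:
        return True  # the field may be left blank
    return SRA_ACCESSION.fullmatch(text) is not None
```

With this sketch, an input such as SRP000123 passes, while a value with a different prefix or fewer than six digits is rejected.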
Published studies may furthermore include citations in their description. Also, if you're using external data, please cite the sources of those. To do so, fill in the Pubmed identifiers of the abstracts for the corresponding articles. You can include up to five citations per experiment. Please also select the citation type, that is, data if you re-use the research data from another publication, analysis if the cited publication presents the analysis given in this experiment or both if the citation details the analysis in this experiment and also presents (for the first time) the data used in the experiment.
Lastly, you may include up to three references to relevant external websites, e.g. your group's homepage or another website or database with further details about this experiment. Please enter a (short) title for the external website and provide the full URL (including http:// prefix).
To confirm your submission, click Save Changes. You'll be forwarded to the experiment main page (see this page) of this experiment.

Sample Annotation

Providing sample annotation for the data in your experiment is an essential step in the creation of any experiment (cp. SubConcept 'Sample Annotation'). Sample annotation eases interpretation of your analysis, helps you to keep track of what's going on and, in some cases, allows GeneProf to do more of the work for you (i.e. appropriately annotated samples make it possible for GeneProf to guess a lot of the values required to configure analysis wizards).
Sample annotation, in GeneProf, comes in the form of an annotation table. Each row in this table is meant to correspond to one biological sample and each column reports annotations of a different type. Each sample has some input data associated with it, i.e. the experimental measurements taken for this sample. One or more files might be linked to the same sample.
If you imported data from the SRA, some annotation has already been gathered from the original data source and you will only have to complete or correct the annotation where necessary. Otherwise, GeneProf will create an empty annotation table for you and you need to provide all information yourself.
You can dynamically add and remove rows from the table using the Add Row button (in the action bar, bottom left of the page) and the remove buttons (red crosses in the row that you want to remove), respectively. You can also add and remove columns (i.e. sample annotation types) from a range of terms by clicking the Add Column button (in the action bar, bottom left of the page) or the remove column button (red crosses next to the column header of the column you want to remove), respectively.
Start by selecting the inputs associated with each sample: For each row, highlight all input datasets that belong to the same biological sample. You can select multiple datasets by keeping the CTRL-key on your keyboard (Mac-users: CMD-key) pressed while you click with the mouse.
For most annotation types, you can just enter a textual annotation value directly into the text boxes in each row. A few annotation types take a controlled vocabulary of annotation terms only; for these, you have to define the input value in a popup dialog that will appear as soon as you start typing in the column or when you click the little pencil icon next to the text field.
When you start typing, GeneProf will try to find similar annotation terms that other users have used (by looking through all public experiments) and suggest those to you. This can help to keep terminologies more consistent by encouraging re-use of the exact same terms. Please try to use suggested terms wherever appropriate. For your convenience, we have also added a way to fill an entire column with the same value: Just fill in the desired value into one cell of a column and double-click this cell to copy the value across to all other cells in this column.
In general, please try to provide as much information as possible (at least, the one marked important in the listing below!), even annotations that you consider irrelevant to your experiment, if you have the information available. You never know, somebody else might benefit from it! Here's some additional information about the individual annotation types and some guidelines on how to fill them in:

Batch Browser Track Export

This tool allows you to quickly create a set of BED or WIG formatted files (see this FAQ from the UCSC genome browser) which can be displayed as tracks in many genome browser software suites.
Start by choosing the track type: BED for tracks showing little boxes for each region in the exported dataset (e.g. for analysed ChIP-binding peaks) or WIG for coverage-like plots (good for ChIP-seq binding patterns or RNA-seq expression data). Depending on the track type, you will have a few further options concerning the track layout.
For BED files, you can choose whether you want to export all or only distinct regions in the dataset. In datasets of aligned reads, many regions will occur several times (once per aligned read), thus exporting 'distinct' regions only will keep the output at a more manageable size.
More options are available for WIG files: Outputs can be smoothed to produce better-looking plots. The plotting window size determines the degree of smoothing (higher number = smoother plot). The intensity at each base position can either be calculated by distinct regions only or taking all regions into account. People usually look at distinct regions only for ChIP-seq data, in which multiple alignments to the same locus are believed to mostly arise from amplification biases, while RNA-seq data may sensibly take all regions into account (since the sequenced space, i.e. the transcriptome, is smaller, repeated alignments to the same region are more likely to stem from real signals). To create coverage plots, GeneProf will sum up the number of overlapping regions per bin (default size 5bp) across all chromosomes. You can extend regions to a fixed length, if desired. It's common practice to extend ChIP-seq alignments to the average fragment length (or half the average fragment length) to create more informative plots.
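To make the WIG options above more concrete, here is a minimal Python sketch of binned coverage for a single chromosome, with optional de-duplication ('distinct' regions), optional extension to a fixed length, and a simple moving-average smoothing step. The function names, the 0-based end-exclusive coordinates, the strand-agnostic extension and the moving average itself are all assumptions made for illustration; the manual does not specify how GeneProf implements these steps internally.

```python
from collections import defaultdict


def binned_coverage(regions, bin_size=5, distinct=False, extend_to=None):
    """Count the regions overlapping each bin on one chromosome.

    regions: list of (start, end) tuples, 0-based, end exclusive (assumed convention).
    """
    if distinct:
        regions = sorted(set(regions))  # count identical regions only once
    counts = defaultdict(int)
    for start, end in regions:
        if extend_to is not None:
            # lengthen short regions to a fixed size (strand is ignored in this sketch)
            end = max(end, start + extend_to)
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            counts[b] += 1
    return dict(counts)


def smooth(counts, window=3):
    """Moving average over `window` neighbouring bins; a larger window gives a smoother plot."""
    if not counts:
        return {}
    half = window // 2
    lo, hi = min(counts), max(counts)
    return {
        b: sum(counts.get(b + d, 0) for d in range(-half, half + 1)) / window
        for b in range(lo, hi + 1)
    }
```

In this sketch, two identical aligned reads contribute twice to each bin they cover unless `distinct=True` is set, which mirrors the ChIP-seq versus RNA-seq distinction discussed above.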
Once you've configured the general options, select the datasets you'd like to export by ticking the checkbox next to each dataset. You can change the name and description (in the UCSC browser, for example, the name is displayed to the left and the description on top of each track) and choose a colour for each track.
Finally, click the Start Download link to begin creating and downloading the tracks. It may take up to several minutes before your browser prompts you to select a destination for the downloaded file and the entire download process may take a lot longer, so please be patient.

Batch Rename Datasets

We've added this tool for your convenience to quickly change the names of several datasets in your experiment without having to re-run analysis steps in your workflow.
The page will list all datasets in your experiment. Just alter the dataset names in the text boxes of those datasets you'd like to change and click Submit Changes.

Experiment Browser

This page allows you to browse through all experiments in GeneProf's public repository. If you are logged in to your GeneProf user account, the listed experiments will also include your own experiments (even if they are not yet public; however, other people will, of course, not be able to see those!).
Experiment Browser. You can flick through, filter and search public (and your own) experiments using this browser feature.
The experiments are organized in a big table which is divided into several pages. You can skip to the next / previous page using the links (Next Page and Previous Page, respectively) at the bottom left of the page.
If you're looking for a specific experiment, you can dynamically filter the table by several criteria, e.g. by its internal GeneProf accession number (gpXP_...), name or owner (i.e. the experiment creator). To apply a filter, simply enter the text you are looking for into the textbox under the respective header and confirm with your ENTER-key. For Platform(s) and Status you can select from a list of all possible values in a dropdown list. By default, text searches will look for an occurrence of your search text anywhere in the corresponding column of the experiments, but you can change this behaviour by changing the search operator: "~" finds a match anywhere in the string (default), "=" requires the entire string to match your search exactly, "!=" requires that the entire string does not match your search string, and "!~" requires that your search string does not appear anywhere in the corresponding column of matched experiments.
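The semantics of the four search operators can be summarised in a few lines of Python. This is only a sketch of the behaviour described above: the function name is hypothetical, and the case-insensitive comparison is an assumption, since the manual does not say how letter case is handled.

```python
def matches(value, operator, query):
    """Apply one of the four filter operators to a column value (hypothetical helper)."""
    value, query = value.lower(), query.lower()  # case-insensitivity is an assumption
    if operator == "~":
        return query in value       # match anywhere in the string (default)
    if operator == "=":
        return value == query       # entire string must match exactly
    if operator == "!=":
        return value != query       # entire string must not match
    if operator == "!~":
        return query not in value   # search text must not appear anywhere
    raise ValueError(f"unknown operator: {operator}")
```

For example, filtering the name column with the operator "~" and the text "rna" would keep an experiment called "RNA-seq of ES cells", whereas the operator "=" would not.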
You can furthermore sort the table by clicking the header of a specific column. By default, the table will be sorted in ascending order by the values in this column. Click again to reverse the sort order (i.e. descending). You can also combine sorting with filters.
Click either the accession number or the Magnifying Glass button to open the experiment's main page (cp. this page).

The Workflow Designer

Welcome to GeneProf's graphical workflow designer! Now, this is probably the most complex and feature-rich page in the entire application. It might look a bit scary at first, but don't worry, it's really quite straight-forward to use!
The workflow designer page (cp. this figure) will be used for viewing existing data analysis workflows (cp. Concept 'Workflows') and also to create new ones from scratch or to modify those created with an analysis wizard (cp. SubConcept 'Analysis Wizards'). If you're viewing an existing workflow from somebody else's experiment, you won't, of course, be able to modify it: the modification components of the workflow will be disabled (or hidden), and you can ignore those parts of the documentation of this page that refer to modifying workflows.
Workflow Design. In GeneProf, entire analysis pipelines are constructed by connecting several simple steps visually with each other.
Visually, GeneProf's workflows consist of boxes and arrows. Boxes represent steps in the analysis process and arrows show how these steps are linked together, i.e. the output of one process might become the input of another (cp. this figure). You can track what happened to any one dataset by following its development through the entire workflow along the arrows.
The Workflow Designer. With GeneProf's visual, dynamic workflow designer it is easy to compile entire analysis pipelines using drag & drop.
Procedurally, one first constructs a workflow by adding workflow components to the workflow canvas and connecting steps with arrows. Afterwards, workflow steps can be customized in more detail by altering the parameters of a workflow component. At this stage, all workflow components will be pending, i.e. the underlying analysis has not been executed yet. Pending workflow components have dashed borders (cp. this figure). When you're done constructing the workflow, you can execute it. The experiment will be entered into a processing queue on the GeneProf compute cluster and each component will soon be executed one step at a time. Workflow components that have been successfully executed are then marked with solid borders. If a process fails to complete correctly, the box will have red borders. You can then check the workflow history (cp. this page) to find out what went wrong. After a workflow has been executed, you can add additional steps without having to re-run those components that have already been executed. If you change parameters of existing components, though, or if you change their inputs, the components (and all those depending on them) will be marked pending again and you have to re-run the analysis.
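The rule that a changed component and everything depending on it becomes pending again amounts to a traversal of the workflow graph. The following Python sketch illustrates the idea; the data structures, status labels and function name are hypothetical and chosen for this example only, not taken from GeneProf's code.

```python
from collections import deque


def mark_pending(changed, downstream, status):
    """Mark `changed` and every component that depends on it as pending.

    downstream: dict mapping a component to the components consuming its outputs.
    status: dict mapping a component to "pending", "done" or "failed"; updated in place.
    """
    queue = deque([changed])
    seen = {changed}
    while queue:
        node = queue.popleft()
        status[node] = "pending"
        for nxt in downstream.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return status
```

Components upstream of the change, or on unrelated branches, keep their status, which is why already executed steps do not need to be re-run.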
Workflow Components. Workflow steps in the workflow designer are represented as boxes.
Going into a bit more detail, each workflow component (= each box) may take several inputs and outputs -- these are represented as little boxes on the left (inputs) and right (outputs) of the workflow components. We call them 'ports' and it's these ports which are connected by arrows. Inputs and outputs are datasets of a certain type (cp. Concept 'Datasets'). The appearance of the ports (i.e. their colour and symbol) indicates which kind of data is expected at each point and you can only connect output ports of one type to input ports of the same type. Usually, a workflow component needs to be fully connected (that is, all input ports have to have an input connected to them), but some inputs might be optional. Optional input ports are marked with a question mark. Some input ports can take more than one input connection. These ports are marked with a plus sign. You can connect different components easily by drag & drop: Just click the output port of one component with your mouse, then keep your mouse button pressed while dragging the arrow over to the input port (of the same colour) of another component and release the mouse button. These are the different port types and respective icons used:
Data Type            See also                               Exactly One  Multiple  Optional
Sequence Data        SubConcept 'Sequence Data'             S            S+        S?
Genomic Region Data  SubConcept 'Genomic Region Data'       R            R+        R?
Feature Data         SubConcept 'Feature Data'              F            F+        F?
File Data            SubConcept 'File Data'                 X            X+        X?
Reference Data       SubConcept 'Reference Data'            *            *+        *?
Special Data         SubConcept ''Special' Data'            $            N/A       N/A
Any Kind             -                                      rainbow      rainbow   rainbow
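The connection rules for ports can be expressed as a small check. The Python sketch below is an illustration only: the function names and type codes are hypothetical, and two details are assumptions rather than documented behaviour, namely that an 'Any Kind' input accepts every data type and that a port not marked with a plus sign accepts at most one connection.

```python
def can_connect(output_type, input_type, input_kind, existing_connections):
    """Check whether an output port may be linked to an input port (hypothetical helper).

    output_type / input_type: e.g. "S", "R", "F", "X", "*", "$", or "any".
    input_kind: "one" (exactly one), "multiple" (+) or "optional" (?).
    existing_connections: number of arrows already attached to the input port.
    """
    types_match = input_type == "any" or output_type == input_type  # "any" is an assumption
    has_room = input_kind == "multiple" or existing_connections == 0
    return types_match and has_room


def is_fully_connected(input_ports):
    """input_ports: list of (kind, connection_count); optional ports may stay empty."""
    return all(kind == "optional" or count >= 1 for kind, count in input_ports)
```

A component would then be ready to run once `is_fully_connected` holds for its input ports.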
You can add additional workflow components from the workflow palette. The workflow palette is the grey box at the right of the workflow designer (you can drag & drop it around, if it's in the way). In the palette, you'll find all supported modules (cp. SubConcept 'Workflow Modules' and Chapter 'Modules'). They are divided into tabs and categories by purpose and scope: The tabs give a rough grouping into basic / general purpose modules (General), modules for RNA-seq and ChIP-seq data (RNA/ChIP) and meta-modules (that is, modules that combine several others into concise, higher-order modules: Meta). Each module is represented as a box and the inputs it expects and outputs it will produce are clearly marked. You can hover over a module to get a short description. If you want to add an instance of a module, just drag & drop it over to the workflow canvas.
Another way to add a component is to duplicate an existing component. To do so, simply select the component you want to duplicate and click the duplicate icon (Duplicate). You can remove components in a similar manner: Click the delete icon (Delete) next to a selected component.
Right, so having explained the basics, let's look at the workflow components in more detail. Apart from the name of the module used, each workflow component contains some additional information and links (cp. this figure): You can configure the behaviour of each module (or examine the parameters that have been used in a public experiment) by clicking the Module Parameters link. A new, module-specific dialog will open that allows you to customize all parameters this module takes (please refer to the documentation of each module for more details, cp. Chapter 'Modules'). You need to save the workflow after adding new components or changing connections before you can configure the modules. Furthermore, there are labels for all inputs and outputs of the module. These labels appear in the same order as the ports, so if you're confused about which port is which, check the order of these labels. The output labels are links to the details of the output dataset. So if you want to examine intermediate outputs of a workflow, which are not included in an experiment's main outputs, click one of these links.
Further actions are available in the action bar: You can Save the workflow (Save) or Execute it (Execute). If you want to get rid of an entire workflow and start from scratch, click the Purge icon (Purge). This will delete all workflow components except the inputs. If you'd like to quickly examine the parameters of multiple processes, you can switch to a more detailed view of each workflow component by clicking the Toggle View icon (Toggle View): Each workflow component will then include a list of all parameters inside its box. Click the button again to switch back to the simple view. You can also directly upload additional data to the workflow (cp. this page) and examine the experiment history (cp. this page).

Wizard Pages

Please see the following sections for detailed explanations of the individual analysis wizard (cp. SubConcept 'Analysis Wizards') pages along with use-case examples.

All-in-one RNA-seq Analysis Wizard

What the wizard will do..

This wizard presents a one-stop solution to the analysis of a typical RNA-seq experiment. The wizard either takes raw data files (cp. SubConcept 'File Data') as inputs and parses (cp. SubConcept 'Parsing') them into sequences or starts right away with ready-processed sequence datasets (cp. SubConcept 'Sequence Data').
We assume that each dataset reports short read sequences representing transcripts (or fragments thereof), possibly from different biological conditions and (ideally) with a number of replicates for each condition. Each biological / experimental condition is considered one group of samples and the individual datasets belonging to each group will (initially) be processed separately, i.e. each sequenced library will independently be subjected to a quality control step (where the user specifies how strict this quality control should be) and then aligned to the reference genome or transcriptome (cp. SubConcept 'Reference Data' and SubConcept 'Alignment').
The workflow will then calculate gene expression read counts for each dataset on the basis of aligned reads overlapping known transcriptional features (cp. SubConcept 'Feature Data'), assigning ambiguously aligned reads proportionally to their most likely regions of origin. The read counts are reported normalised as RPM (reads per million) and RPKM (reads per kilobase per million).
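To make the two normalisations concrete, here is a minimal Python sketch of how RPM and RPKM are conventionally computed. It assumes that the "million" refers to the total number of aligned reads in the library and that the "kilobase" refers to the length of the feature; the exact denominators GeneProf uses are not stated here, and the function names and numbers are purely illustrative.

```python
def rpm(read_count, total_aligned_reads):
    """Reads per million: read count scaled by library size in millions."""
    return read_count / (total_aligned_reads / 1e6)


def rpkm(read_count, feature_length_bp, total_aligned_reads):
    """Reads per kilobase per million: RPM scaled by feature length in kilobases."""
    return rpm(read_count, total_aligned_reads) / (feature_length_bp / 1e3)


# Hypothetical example: 500 reads on a 2,000 bp gene,
# in a library with 10 million aligned reads (assumed denominator).
gene_rpm = rpm(500, 10_000_000)
gene_rpkm = rpkm(500, 2_000, 10_000_000)
print(gene_rpm, gene_rpkm)
```

RPM makes libraries of different sizes comparable, while RPKM additionally corrects for the fact that longer features collect more reads.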
If there is more than one sample group, the wizard will also calculate the statistical significance of observed differences between groups (for every possible comparison) and filter the results by a given significance threshold (again, for every possible comparison between two groups).
Moreover, the wizard will supplement the analysis with useful statistics about the input sequence before and after quality control, the read alignment and the gene expression patterns in all groups detected.

How to configure the wizard

The wizard page enables you to customize the analysis process in a minimum of hands-on time by simply answering a few basic questions about the nature of your input data, the quality control and alignment steps and the calculation of gene expression scores and their comparison.
Step 1: Experiment Design
Start by choosing the reference dataset appropriate for your data (cp. SubConcept 'Reference Data'). GeneProf provides a good selection of pre-defined references, but, in the unlikely case that a reference for your organism of choice is not available, you can define your own reference set (see Tutorial 'Creating a Custom Reference Set').
If you're starting off with raw files (i.e. data that has not yet been parsed into a compatible dataset, cp. SubConcept 'Parsing'), you now need to specify an appropriate parser for the sequence data files you have uploaded (usually those should be in either FASTA or FASTQ format). Actually, GeneProf will try to guess the format of the files for you -- that usually works quite well, but if you actually do know the format of your files, you might not want to take any risks and specify the format yourself. In particular the quality score format (cp. SubConcept 'Quality Scores and Formats') can be tricky to guess. If in doubt, ask your sequencing provider about the exact format of the files they've provided you with. If you already have sequences as input data (e.g. after importing data from the SRA or from another experiment (cp. Tutorial 'Uploading Data to GeneProf')), you don't need to worry about this step.
Next you need to tell GeneProf what the input data is really all about, that is, you need to specify which input datasets belong together. The idea here is, that you create one 'group' for each biological / experimental condition. For example, you might have expression data from different cell types / tissues, in this case you should create one group per cell type / tissue. Or you might have a treatment and control-type experiment (e.g. before and after administration of a drug, knockdown of a gene, etc.): Again, each of the two conditions should be one group in terms of the wizard. You can add additional groups (if required) by using the Add Group button. In each group, select all input datasets that belong to it. You can select more than one sample in each select box by keeping your CTRL-key (Mac users: CMD-key) pressed while clicking with the mouse. You can remove all selections from a group by clicking on Clear. If there are no selections in a group, the entire group will be ignored. Please also enter a descriptive, but short label for each group.
If you have provided adequate sample annotation previously (cp. this page), the wizard will try to guess the group layout for you. However, please make sure that the selection is sensible! If the samples have been annotated by Sample Group, the wizard creates one group for each distinct value of this annotation column and puts all associated input datasets in the respective groups.
If there are fewer than 6 groups in total, you can choose to compare each group against each other to check for differentially expressed genes. This option is activated by default, but if you have too many groups, it will be deactivated (the reason for this is that the number of possible comparisons rises very quickly to insane numbers after this: 2 groups = 1 comparison, 3 groups = 3 comparisons, 4 groups = 6 comparisons, 5 groups = 10 comparisons, 6 groups = 15 comparisons, 7 groups = 21 comparisons, 8 groups = 28 comparisons, etc.). Of course, you can still use GeneProf's module for differential expression analysis later on via the workflow designer (cp. this page), defining exactly which comparisons you want to make yourself.
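The comparison counts listed above follow from the fact that n groups yield n(n-1)/2 distinct pairs. The short Python sketch below enumerates the pairs for illustration; the group labels and the function name are made up and are not part of GeneProf.

```python
from itertools import combinations


def pairwise_comparisons(groups):
    """Return every unordered pair of groups, i.e. every possible comparison."""
    return list(combinations(groups, 2))


for n in range(2, 9):
    labels = [f"group{i}" for i in range(1, n + 1)]  # hypothetical group labels
    print(n, "groups =", len(pairwise_comparisons(labels)), "comparisons")
```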
Step 2: Quality Control & Alignment
Proceed by specifying how GeneProf should deal with your data in terms of quality control. The wizard allows you to choose between four levels of strictness here: By default, only a very lenient filter will be applied to the data, removing only very low quality reads. If in doubt, this option will do for a start. Note that you can always change the parameters of the quality control step later on via the workflow designer (cp. this page). Please refer to the documentation of the quality control and pre-processing module (cp. module 'Modify and Filter Sequences') for details about the different options specified by each strictness level. You might also want to read up about quality scores and how they help to detect erroneous reads (cp. SubConcept 'Quality Scores and Formats').
Next, tell GeneProf a little about the sample preparation / sequencing strategy you've followed. There are, essentially, two different ways of studying transcription using next-generation sequencing technologies: One can either sequence randomly primed parts of transcript sequences (this is usually referred to as RNA-seq) or otherwise transcripts may be restriction-enzymatically digested and the digested fragments are sequenced afterwards (this is known as digital tag profiling, Tag-seq, DeepSAGE and many other names). There are further variations of these protocols, but, in terms of data analysis, they are basically the same.
So, if you have performed an RNA-seq experiment, just keep the default, otherwise select either one of the given Tag-seq options (if you've used either NlaIII or DpnII as a restriction enzyme) or enter custom values in the text boxes to match the behaviour of your restriction enzyme. The idea here is that the restriction enzyme recognizes a specific DNA subsequence, so one can attach this known subsequence to the front of each read and then trim the read to a fixed length, which is again given by the behaviour of the restriction enzyme.
Afterwards, choose how to do the sequence alignment (cp. SubConcept 'Alignment'): You can use one of two popular tools for alignment: Bowtie (default; [4]) or TopHat [5]. Both of these programs are great tools for aligning sequences to a reference sequence (genome / transcriptome) and, in fact, TopHat internally makes use of Bowtie. TopHat has the additional benefit that it can map short reads that cross exon boundaries by using a compendium of known splice junctions. If you have rather long reads in your input libraries, or if you're dealing with paired-end / mate-pair reads, the probability that reads span multiple exons is actually quite high and you might be well advised to use TopHat. If you have short reads (e.g. from a Tag-seq procedure) or if you'd like to align directly to known transcript sequences, Bowtie will do a great job and it's a lot quicker than TopHat.
You can further customize the behaviour of the alignment software (no matter whether you use TopHat or Bowtie) by adjusting the number of mismatches allowed and the permitted alignment ambiguity. When one aligns sequences to a genome, one typically allows for a number of mismatches between each sequence to be aligned and the reference (chromosome) sequence. This accommodates polymorphisms and sequencing errors. You can use the slider to decrease or increase the number of mismatches from 0 (none) to up to 3. We have found the default of 2 to work quite well in general. You might, however, want to increase the number for very long sequences (say longer than 50bp) or decrease it for short sequences (less than 30bp).
Not all sequences can be aligned uniquely to one position in the genome. You can either discard ambiguous alignments (by dragging the slider in such a way that A = 1) or allow for up to 10 possible alignments. Reads that have more than 10 possible alignments will be discarded and those with a low degree of ambiguity will be assigned to all possible loci, giving a higher weight to loci to which many other reads aligned (which are assumed to be more likely the actual origin of the ambiguous reads; cp. module 'Quantitate Gene Expression' for further details). Please note that increasing the number of mismatches might increase the overall ambiguity of alignments, so it's important to find some sort of trade-off between ambiguity and alignment coverage. If in doubt, you can keep the default for now and experiment with other parameters later on via the workflow designer (cp. this page).
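The prefix-and-trim idea can be illustrated with a few lines of Python. This is only a sketch of the principle, not GeneProf's implementation: the function name, the example read and the final tag length of 21 are assumptions chosen for illustration. The recognition sequence CATG is that of NlaIII.

```python
def preprocess_tag(read, prefix, final_length):
    """Attach the known recognition site to the front of a read,
    then trim the result to a fixed length."""
    return (prefix + read)[:final_length]


# NlaIII recognizes CATG; the read and the length of 21 are illustrative values.
tag = preprocess_tag("TCCTATTAAGAAACAAATTTGCA", prefix="CATG", final_length=21)
print(tag, len(tag))
```

Restoring the recognition site before trimming means every tag starts at the same, known position relative to the restriction site, which is what makes the fixed-length trim meaningful.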
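As a rough illustration of the weighting idea, the Python sketch below splits one ambiguous read across its candidate loci in proportion to the number of uniquely aligned reads already found at each locus. This particular weighting scheme, the even split when no locus has unique reads, and all names are assumptions made for the example; please refer to module 'Quantitate Gene Expression' for what GeneProf actually does.

```python
def distribute_ambiguous_read(candidate_loci, unique_counts, max_alignments=10):
    """Return {locus: fraction of the read}, or {} if the read is discarded.

    candidate_loci: loci the read aligned to (hypothetical identifiers).
    unique_counts:  number of uniquely aligned reads per locus (assumed weights).
    """
    # Reads with too many (or no) possible alignments are dropped.
    if not candidate_loci or len(candidate_loci) > max_alignments:
        return {}
    weights = [unique_counts.get(locus, 0) for locus in candidate_loci]
    total = sum(weights)
    if total == 0:
        # No evidence either way: split the read evenly (an assumption).
        share = 1.0 / len(candidate_loci)
        return {locus: share for locus in candidate_loci}
    return {locus: w / total for locus, w in zip(candidate_loci, weights)}


unique_counts = {"geneA": 30, "geneB": 10}  # hypothetical unique read counts
shares = distribute_ambiguous_read(["geneA", "geneB"], unique_counts)
print(shares)
```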
Step 3: Gene Association, Quantitation & Differential Expression
In the last step of the analysis, gene expression intensities will be calculated by summing up read counts that aligned to known genes. By default, only reads aligning to exons will contribute towards the expression counts, but you may wish to include intronic alignments as well. Furthermore, you need to tell us if your sequencing libraries are strand-specific. Normal RNA-seq protocols do not retain strand-information and reads on either strand will be counted towards the expression of a gene. However, a number of modified protocols have been published that maintain strand-specificity, so if you're using one of these protocols, make sure to tell the wizard about it, so we can calculate more accurate read counts (you'll also get additional anti-sense read counts for each gene).
Note that you only need to make the last two choices if you decided to align to the genome. If you're using Bowtie instead to align directly to known transcript sequences, the association to known genes is trivial and no parameters need to be set.
Lastly, decide on a significance threshold for differential expression between experimental conditions. By default, the p-value (FDR-adjusted as calculated using [13]) is required to be less than or equal to 0.05, which is moderately stringent, but you might want to relax this criterion a bit and possibly filter the data further later on.
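To show what filtering by an FDR-adjusted p-value looks like in practice, here is a small Python sketch. It uses the Benjamini-Hochberg procedure purely as a familiar example of an FDR adjustment; whether this matches the method cited as [13] is an assumption, and the gene names and p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = {"gene1": 0.01, "gene2": 0.04, "gene3": 0.03, "gene4": 0.20}
adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))

threshold = 0.05  # the wizard's default significance threshold
significant = [gene for gene, p in adjusted.items() if p <= threshold]
print(significant)
```

Note how gene2 and gene3 have raw p-values below 0.05 but no longer pass after adjustment, which is why relaxing the threshold slightly can be reasonable.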
Once you're satisfied with all parameter settings, click the big Accept Settings & Create Workflow button. GeneProf will now construct an analysis workflow according to your wishes and present an overview of it in the next page. If you are happy, you can execute the analysis right then and there, otherwise you can choose to look at the workflow in detail in the workflow designer (cp. this page), where you can further customize the analysis, or you can completely discard the workflow and return to the experiment main page.

Example

For an example use case of this wizard, please refer to the RNA-seq tutorial: Tutorial 'Differential Expression Analys..'.

All-in-one ChIP-seq Analysis Wizard

What the wizard will do..

This wizard presents a one-stop solution to the analysis of a typical ChIP-seq experiment. The wizard either takes raw data files (cp. SubConcept 'File Data') as inputs and parses (cp. SubConcept 'Parsing') them into sequences or starts right away with ready-processed sequence datasets (cp. SubConcept 'Sequence Data').
We assume that each dataset reports short read sequences representing fragmented DNA enriched for interaction with a DNA-binding protein. You may study more than one such protein at a time and, at the same time, have more than one dataset for each factor. Different datasets for the same protein will be merged into one and subsequently subjected to a quality control step. The wizard allows the user to specify how strict this quality control step should be.
All sequences passing quality control will then be aligned (using Bowtie [4]) to the genome of a chosen reference (cp. SubConcept 'Reference Data') and a peak finding algorithm (MACS [2]) will be used for each protein individually to detect regions of enriched binding (so called 'peaks'). This step, ideally, makes use of a background signal (e.g. randomly fragmented input DNA or DNA precipitated with a non-specific antibody) to achieve a more reliable detection of enrichment, although the wizard allows users to run the analysis without a background signal.
The peaks for all proteins are merged into one genomic region dataset (cp. SubConcept 'Genomic Region Data'). GeneProf will then find genes neighboring these putative binding sites and report the genes close to binding sites of each protein (i.e. genes potentially regulated by the protein). Optionally, the wizard will also calculate the transcription factor association strength (TFAS [9]) between each feature in the reference and each protein in the input data.
Moreover, the wizard will supplement the analysis with useful statistics about the input sequence before and after quality control, the read alignment and the ChIP-peaks detected.

How to configure the wizard

The wizard page enables you to customize the analysis process in a minimum of hands-on time by simply answering a few basic questions about the nature of your input data, the quality control and alignment steps and the peak detection and gene association.
Step 1: Experiment Design
Start by choosing the reference dataset appropriate for your data (cp. SubConcept 'Reference Data'). GeneProf provides a good selection of pre-defined references, but, in the unlikely case that a reference for your organism of choice is not available, you can define your own reference set (see Tutorial 'Creating a Custom Reference Set').
If you're starting off with raw files (i.e. data that has not yet been parsed into a compatible dataset, cp. SubConcept 'Parsing'), you now need to specify an appropriate parser for the sequence data files you have uploaded (usually those should be in either FASTA or FASTQ format). Actually, GeneProf will try to guess the format of the files for you -- that usually works quite well, but if you actually do know the format of your files, you might not want to take any risks and specify the format yourself. In particular the quality score format (cp. SubConcept 'Quality Scores and Formats') can be tricky to guess. If in doubt, ask your sequencing provider about the exact format of the files they've provided you with. If you already have sequences as input data (e.g. after importing data from the SRA or from another experiment (cp. Tutorial 'Uploading Data to GeneProf')), you don't need to worry about this step.
Next you need to tell GeneProf what the input data is really all about, that is, you need to specify which input datasets belong together. The idea here is that you create one 'group' for each DNA-binding protein that you ChIP'ed. If you did several ChIPs for the same protein, but under different experimental conditions, each should go into a separate group. You can add additional groups (if required) by using the Add Group button. In each group, select all input datasets that belong to it. You can select more than one sample in each select box by keeping your CTRL-key (Mac users: CMD-key) pressed while clicking with the mouse. You can remove all selections from a group by clicking on Clear. If there are no selections in a group, the entire group will be ignored. Please also enter a descriptive, but short label for each group.
Control data (e.g. randomly fragmented input DNA or DNA precipitated with a non-specific antibody) should not be added as a regular group, but instead be added to the background signal group (that exists by default, below the other groups in the wizard page). If you do not have any background signal, you can leave the group empty.
If you have provided adequate sample annotation previously (cp. this page), the wizard will try to guess the group layout for you. However, please make sure that the selection is sensible! If the samples have been annotated by Sample Group, the wizard creates one group for each distinct value of this annotation column and puts all associated input datasets in the respective groups. Watch out! If one of your sample groups was something like 'Control', there will be a group for that now, although it should actually be the background signal. You need to clear the selection in the group the wizard has created for you and instead select the respective dataset(s) in the background signal group. If no Sample Group annotation is available, the wizard will fall back to the Gene annotation, assuming that this details the protein that has been ChIP'ed. So the wizard will create one group for each different gene annotated and put all samples without a Gene annotation into the background signal group.
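The guessing rules described above can be summarised in a short Python sketch. The dictionary keys, the 'background' label and the sample names are hypothetical, and the handling of samples that lack a Sample Group while others have one (here: left unassigned) is an assumption, since the text does not cover that case.

```python
from collections import defaultdict

BACKGROUND = "background"  # hypothetical label for the background signal group


def guess_groups(samples):
    """Guess the group layout from sample annotations.

    samples: list of dicts with a 'name' and optional 'Sample Group' / 'Gene' keys.
    """
    groups = defaultdict(list)
    if any(s.get("Sample Group") for s in samples):
        # One group per distinct Sample Group value, including e.g. 'Control'.
        for s in samples:
            if s.get("Sample Group"):
                groups[s["Sample Group"]].append(s["name"])
        return dict(groups)
    # Fallback: one group per annotated gene; unannotated samples go to background.
    for s in samples:
        groups[s.get("Gene") or BACKGROUND].append(s["name"])
    return dict(groups)


by_gene = guess_groups([
    {"name": "chip_oct4", "Gene": "Oct4"},
    {"name": "chip_nanog", "Gene": "Nanog"},
    {"name": "input_dna"},
])
by_group = guess_groups([
    {"name": "input_dna", "Sample Group": "Control"},
    {"name": "chip_oct4", "Sample Group": "Oct4"},
])
print(by_gene)
print(by_group)
```

The second call reproduces the pitfall mentioned above: the 'Control' sample ends up in a regular group and has to be moved to the background signal group by hand.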
Step 2: Quality Control & Alignment
Proceed by specifying how GeneProf should deal with your data in terms of quality control. The wizard allows you to choose between four levels of strictness here: By default, only a very lenient filter will be applied to the data, removing only very low quality reads. If in doubt, this option will do for a start. Note that you can always change the parameters of the quality control step later on via the workflow designer (cp. this page). Please refer to the documentation of the quality control and pre-processing module (cp. module 'Modify and Filter Sequences') for details about the different options specified by each strictness level. You might also want to read up about quality scores and how they help to detect erroneous reads (cp. SubConcept 'Quality Scores and Formats').
Optionally, you can also decide to trim reads to a fixed length prior to quality control. This might make sense if you know that the sequenced reads are longer than the actual biological sequences that went into the process or if you think that for another reason sequence information beyond a certain point is unreliable or nonsensical to use. If you do not want to trim the reads, just leave the field blank or set it to zero.
Next, choose how to do the sequence alignment (cp. SubConcept 'Alignment'). At the moment, the only parameter you can change here is the number of mismatches allowed. When one aligns sequences to a genome, one typically allows for a number of mismatches between each sequence to be aligned and the reference (chromosome) sequence. This accommodates polymorphisms and sequencing errors. You can use the slider to decrease or increase the number of mismatches from 0 (none) to up to 3. We have found the default of 2 to work quite well in general. You might, however, want to increase the number for very long sequences (say longer than 50bp) or decrease it for short sequences (less than 30bp).
Not all sequences can be aligned uniquely to one position in the genome. This wizard will discard all sequences for which the alignment is ambiguous. Please note that increasing the number of mismatches might increase the overall ambiguity of alignments, so it's important to find some sort of trade-off between ambiguity and alignment coverage. If in doubt, you can keep the default for now and experiment with other parameters later on via the workflow designer (cp. this page).
Step 3: Peak Calling & Feature Association
The last step is all about how you want to go about detecting enriched binding peaks in the aligned data (enriched binding peaks correspond to putative binding sites, or so the theory goes). GeneProf makes use of a peak detection algorithm called MACS [2]. If you give a background signal, this algorithm will estimate an empirical false discovery rate (FDR) which can be used to filter out peaks which have a high likelihood of being false. By default, the threshold is set so that all accepted peaks have to have an FDR of less than or equal to 0.1 (that is, theoretically, we expect no more than 10% of all accepted peaks to be wrong), but you can adjust this threshold in whichever way you like. Also, if you do not have a background signal, the FDR threshold will be ignored.
You'll also need to fill in the average length of the sequenced DNA fragments in your ChIP libraries. This is a parameter that influences the peak calling procedure of MACS and, unfortunately, we cannot guess this from your data. Usually, that's something between 200 and 500bp, but it really depends on your sample preparation procedure. By default, the fragment length is set to 400bp, which is a value we've frequently observed in published studies.
Lastly, you need to decide how the detected peaks should be associated to features. Traditionally, the idea was that transcription factors bind in the promoter regions of the genes they regulate, but it is becoming increasingly evident that this does not always appear to be the case. GeneProf, by default, associates a binding peak with a (potentially, but not necessarily regulated) feature, if the binding peak is up to 20kb upstream or up to 1kb downstream of that feature's transcription start site (TSS). If several features are within this range of the peak, all features will be linked to the binding site. You can change the window sizes (upstream and downstream) or decide only to associate the closest feature, if you like.
In addition to the binary assignment of binding sites to genes described above, we recommend calculating a continuous score that suggests how strongly a binding factor associates with each gene. GeneProf can calculate a transcription factor association score (TFAS) on the basis of the enrichment of a TF binding signal over the background signal in all binding peaks near the TSS of each gene, according to a formula proposed by [9]. It has been suggested that these continuous scores correlate well with the regulation of gene expression by transcription factors. By default, GeneProf will calculate the TFAS; if, however, you do not wish to do so (e.g. this might make little sense for non-TF ChIP data), you can disable this feature by un-ticking the checkbox.
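The window-based association can be sketched in Python as follows. The sketch assumes that each peak is represented by a single position (e.g. its centre) and that 'upstream' and 'downstream' are interpreted relative to the strand of the feature; both are assumptions made for the example, and all names and coordinates are invented.

```python
def associate_peak(peak_pos, features, upstream=20_000, downstream=1_000,
                   closest_only=False):
    """Return the names of all features whose TSS window contains the peak.

    features: list of (name, tss, strand) tuples; strand is '+' or '-'.
    """
    hits = []
    for name, tss, strand in features:
        # Signed offset in the direction of transcription:
        # negative = upstream of the TSS, positive = downstream.
        offset = peak_pos - tss if strand == "+" else tss - peak_pos
        if -upstream <= offset <= downstream:
            hits.append((abs(offset), name))
    hits.sort()  # closest feature first
    if closest_only:
        hits = hits[:1]
    return [name for _, name in hits]


# Hypothetical features on one chromosome.
features = [
    ("geneA", 100_000, "+"),
    ("geneB", 110_000, "-"),
    ("geneD", 105_000, "+"),
]
print(associate_peak(100_500, features))                     # all features in range
print(associate_peak(100_500, features, closest_only=True))  # closest feature only
```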
Once you're satisfied with all parameter settings, click the big Accept Settings & Create Workflow button. GeneProf will now construct an analysis workflow according to your wishes and present an overview of it in the next page. If you are happy, you can execute the analysis right then and there, otherwise you can choose to look at the workflow in detail in the workflow designer (cp. this page), where you can further customize the analysis, or you can completely discard the workflow and return to the experiment main page.

Example

For an example use case of this wizard, please refer to the ChIP-seq tutorial: Tutorial 'Discovering Transcription Fact..'.

Quality Control Wizard

What the wizard will do..

This wizard has been designed to make it easier to add some quality control steps for your sequencing data to an existing workflow. Unlike, say, the RNA-seq or ChIP-seq wizard (cp. this page and this page), this wizard does NOT create an entire analysis workflow, but rather creates the components necessary for one step of the entire procedure!
The wizard will add a pre-processing and quality control module for each selected dataset, which will individually perform read trimming and quality filtering according to the settings chosen by the user.

How to configure the wizard

The wizard page lists all ready-processed sequence datasets (cp. SubConcept 'Sequence Data') in your experiment. Click the checkboxes next to all datasets you'd like to apply the wizard to.
If desired, you can change the name of the modified output datasets (i.e. after QC), by altering the name given in the individual textboxes.
You can now configure the behaviour of the wizard in two aspects: Pre-processing and quality control. Pre-processing refers to modifications applied to the reads in a sequencing library prior to any other quality control.
This wizard currently only supports two methods of pre-processing: You can trim reads to a fixed length (to do so, tick the checkbox and drag the slider to the desired length) and you can attach a prefix (i.e. a fixed DNA subsequence) to the beginning of each read (to do so, tick the checkbox and enter the prefix into the textbox). These options have been motivated by DeepSAGE / Tag-seq libraries, for which these pre-processing steps are essential; however, the trimming step, in particular, might be useful for other scenarios as well.
Proceed by specifying how GeneProf should deal with your data in terms of quality control. The wizard allows you to choose between four levels of strictness here: By default, only a very lenient filter will be applied to the data, removing only very low quality reads. If in doubt, this option will do for a start. Note that you can always change the parameters of the quality control step later on via the workflow designer (cp. this page). Please refer to the documentation of the quality control and pre-processing module (cp. module 'Modify and Filter Sequences') for details about the different options specified by each strictness level. You might also want to read up about quality scores and how they help to detect erroneous reads (cp. SubConcept 'Quality Scores and Formats').
That's it! Once you're satisfied with all parameter settings, click the big Accept Settings & Create Workflow button. GeneProf will now construct an analysis workflow according to your wishes and present an overview of it in the next page. If you are happy, you can execute the analysis right then and there, otherwise you can choose to look at the workflow in detail in the workflow designer (cp. this page), where you can further customize the analysis, or you can completely discard the workflow and return to the experiment main page.

Alignment Wizard

What the wizard will do..

This wizard has been designed to make it easier to add sequence alignment (cp. SubConcept 'Alignment') for several sequence datasets to an existing workflow. Unlike, say, the RNA-seq or ChIP-seq wizard (cp. this page and this page), this wizard does NOT create an entire analysis workflow, but rather creates the components necessary for one step of the entire procedure!

How to configure the wizard

The wizard page lists all ready-processed sequence datasets (cp. SubConcept 'Sequence Data') in your experiment. Click the checkboxes next to all datasets you'd like to align.
If desired, you can change the name of the output datasets (i.e. after alignment), by altering the name given in the individual textboxes.
Next, choose the reference dataset (cp. SubConcept 'Sequence Data') giving the genome sequence that you would like to align to by picking one from GeneProf's list of recommended reference sets.
The wizard will use Bowtie [4] to align the sequences to the chosen genome. You can configure the behaviour of the aligner by changing two parameters: The number of mismatches and the level of ambiguity allowed.
When one aligns sequences to a genome, one typically allows for a number of mismatches between each sequence to be aligned and the reference (chromosome) sequence. This accommodates polymorphisms and sequencing errors. You can use the slider to decrease or increase the number of mismatches from 0 (none) to up to 3. We have found the default of 2 to work quite well in general. You might, however, want to increase the number for very long sequences (say longer than 50bp) or decrease it for short sequences (less than 30bp).
Not all sequences can be aligned uniquely to one position in the genome. You can either discard ambiguous alignments (by dragging the slider in such a way that A = 1) or allow for up to 10 possible alignments. It is possible to use ambiguously aligned reads for gene expression data (cp. module 'Quantitate Gene Expression' for further details), but for ChIP-seq data most people decide to discard ambiguous alignments, although this is by no means imperative. If in doubt, you can keep the defaults for now and experiment with other parameters later on via the workflow designer (cp. this page).
That's it! Once you're satisfied with all parameter settings, click the big Accept Settings & Create Workflow button. GeneProf will now construct an analysis workflow according to your wishes and present an overview of it in the next page. If you are happy, you can execute the analysis right then and there, otherwise you can choose to look at the workflow in detail in the workflow designer (cp. this page), where you can further customize the analysis, or you can completely discard the workflow and return to the experiment main page.

Gene Expression Wizard

What the wizard will do..

This wizard has been designed to make it easier to add the necessary components for deriving gene expression counts from aligned short read data to an existing workflow (using this module: module 'Quantitate Gene Expression'). Unlike, say, the RNA-seq or ChIP-seq wizard (cp. this page and this page), this wizard does NOT create an entire analysis workflow, but rather creates the components necessary for one step of the entire procedure!

How to configure the wizard

The wizard page lists all ready-processed genomic region datasets (cp. SubConcept 'Genomic Region Data') in your experiment. It is assumed here that these stem from an alignment of a next-gen sequencing library that in some way reports reads from transcript sequencing. If that is not the case, results of the gene expression quantitation might make little sense. Click the checkboxes next to all datasets you'd like to use.
Gene expression intensities will be calculated by summing up read counts that aligned to known genes. By default, only reads aligning to exons will contribute towards the expression counts, but you may wish to include intronic alignments as well. Furthermore, you need to tell us if your sequencing libraries are strand-specific. Normal RNA-seq protocols do not retain strand-information and reads on either strand will be counted towards the expression of a gene. However, a number of modified protocols have been published that maintain strand-specificity, so if you're using one of these protocols, make sure to tell the wizard about it, so we can calculate more accurate read counts (you'll also get additional anti-sense read counts for each gene).
That's it! Once you're satisfied with all parameter settings, click the big Accept Settings & Create Workflow button. GeneProf will now construct an analysis workflow according to your wishes and present an overview of it in the next page. If you are happy, you can execute the analysis right then and there, otherwise you can choose to look at the workflow in detail in the workflow designer (cp. this page), where you can further customize the analysis, or you can completely discard the workflow and return to the experiment main page.

Dataset Pages

The pages below are all in some way or another connected to datasets (cp. Concept 'Datasets').

Dataset Overview Page

This is the main overview page of a GeneProf dataset (cp. Concept 'Datasets'). The page usually gives a brief overview of the dataset and might include a sample of the full data, but what information exactly is displayed on the page depends on the type of dataset. Any dataset overview page (regardless of its type) can be linked to using the permalink given on this page. Note, however, that access to datasets in experiments which have not been made public (cp. SubConcept 'Finalization and Publication') is subject to access restrictions and you might have to log into your user account (cp. this page) before you can display the page.
Dataset Overview for raw data files : In addition to the name of the dataset, the page will include the name of the file originally uploaded and a sample of the file's contents (the first 10 lines). You can download the unprocessed file by clicking the Download File button at the bottom of the page.
Dataset Overview for sequence datasets : In addition to the name of the dataset, the page includes the total number of sequences in the dataset (Size) and a sample of the first 10 sequences in the dataset (in the order read into the dataset, which can probably be considered random for all practical purposes). You can click the ID (GeneProf's internal identifier) to find out more about any one entry. You may export the data in this dataset in a variety of formats (e.g. FASTA or FASTQ) by clicking the Export Data button (see this page).
Dataset Overview for genomic region datasets : In addition to the name of the dataset, the page includes the total number of distinct regions in the dataset (Size). Watch out! Don't confuse the number of regions with the number of aligned reads (for genomic region datasets reporting alignments of short-read libraries): Each genomic region might have more than one read aligned to it and also one read might align to more than one region (alignment ambiguity!). The page includes a sample of the first 10 regions in the dataset (in an order specified by the creator of the dataset or by internal ID, by default). You can click the ID (GeneProf's internal identifier) to find out more about any one entry. You may export the data in this dataset in a variety of formats (e.g. BED or WIG) by clicking the Export Data button (see this page), produce customized high-quality plots based on the data in this dataset by clicking the Plot Data button (see this page) or open a dynamic browser table to look at all entries in this dataset by clicking the Browse Complete Data Table button (see this page).
Dataset Overview for feature datasets : In addition to the name of the dataset, the page includes the total number of features for which the dataset contains information (Size). These might be all features found expressed in a sample, a list of genes bound by a transcription factor, etc. -- what it is depends entirely on the process that produced the dataset. The page includes a sample of the first 10 features in the dataset (in an order specified by the creator of the dataset or by internal ID, by default). You can click the ID (GeneProf's internal identifier) to find out more about any one entry. You may export the data in this dataset in a variety of formats (e.g. Excel Spreadsheets or tab-delimited text files) by clicking the Export Data button (see this page), produce customized high-quality plots based on the data in this dataset by clicking the Plot Data button (see this page) or open a dynamic browser table to look at all entries in this dataset by clicking the Browse Complete Data Table button (see this page).
Dataset Overview for reference datasets : The main body of this page gives essentially the same information as the one for feature datasets. The information in the reference dataset will typically contain a rich set of annotations for each feature in the reference, in particular, a lot of identifiers from other databases that can be used to draw connections to other data. Like other datasets, the features in the reference can be browsed and exported (see this page and this page). A reference dataset is furthermore supplemented with two additional sequence datasets, one for chromosome sequences and one for transcript sequences. Both of those are linked on this page and you can investigate them in more detail by clicking the accession number.
Dataset Overview for 'special' datasets : There is no fixed layout for the display of a 'special' dataset. Typically, the page will include links to downloadable files, tables with summary statistics and / or plots. The information shown on this page is governed by the module producing the dataset, e.g. the summary statistics modules: ModuleCategory 'Summary Statistics'.

Dataset Details Browser

Table Browser. Results tables can be filtered and sorted dynamically. Also, additional annotation columns may be added and the column order altered.
GeneProf's dynamic table browser (cp. this figure) is a convenient tool for looking through large table-like datasets, be it genomic regions, features or whatever other type of tabular data (cp. Concept 'Datasets'). The behaviour of this tool is very much the same as that of the experiment browser (cp. this page).
Large tables are divided into pages and you can easily advance to the next or go back to the previous page using the page control in the bottom left of the screen. The tables can be sorted by the values in one column by clicking the header of this column. Click the same header again to toggle the sort order (ascending / descending).
You can also filter the table dynamically (and without modifying the underlying dataset) using the filter input fields underneath the table headers. To apply a filter, simply enter the text you are looking for into the textbox under the respective header and confirm with your ENTER-key. For some columns (those with a limited range of values), you can select from a list of all possible values in a dropdown list. By default, text searches will look for an occurrence of your search text anywhere in the corresponding column of the table, but you can change this behaviour by changing the search operator: '~' = find a match anywhere in the string (default), '=' = the entire string has to match your search exactly, '!=' = the entire string MUST NOT match your search string and '!~' = your search string MAY NOT appear ANYWHERE in the corresponding column of matched records.
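To make the four search operators more tangible, here is a minimal sketch of how they could be evaluated. The function names matches and filter_rows are hypothetical, and the sketch assumes that comparisons ignore upper and lower case, which the table browser may or may not do.

```python
def matches(cell, operator, query):
    """Evaluate one filter operator against one table cell."""
    cell, query = str(cell).lower(), str(query).lower()  # assumption: case-insensitive
    if operator == "~":    # match anywhere in the string (default)
        return query in cell
    if operator == "=":    # the entire string has to match exactly
        return cell == query
    if operator == "!=":   # the entire string must not match
        return cell != query
    if operator == "!~":   # the query may not appear anywhere
        return query not in cell
    raise ValueError("unknown operator: " + operator)


def filter_rows(rows, column, operator, query):
    """Return matching rows only; the input rows are left untouched."""
    return [row for row in rows if matches(row[column], operator, query)]


rows = [{"name": "Nanog"}, {"name": "Nanogp1"}, {"name": "Sox2"}]
anywhere = filter_rows(rows, "name", "~", "nanog")
exact = filter_rows(rows, "name", "=", "nanog")
not_exact = filter_rows(rows, "name", "!=", "nanog")
nowhere = filter_rows(rows, "name", "!~", "nanog")
```

Here the default operator keeps Nanog and Nanogp1, the exact match keeps only Nanog, '!=' keeps Nanogp1 and Sox2, and '!~' keeps only Sox2.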
For most datasets, you can add additional information (i.e. more columns) to the table. For example, for feature datasets you can benefit from the rich set of annotations that comes with the reference dataset its features refer to, such as the feature type (protein coding, short RNA, etc.) or identifiers from many biological databases. To change which columns are to be displayed, open the Display Options (in the 'additional actions' part of the page, cp. this figure and this page).
Importantly, if you change the columns displayed in one of your own, unfinalised datasets, the change will be stored as part of the dataset and the next time you open this page GeneProf will remember your settings. If, on the other hand, you change the display options of a public or finalised dataset (cp. SubConcept 'Finalization and Publication'), the changes will be lost when you leave the page.
Further actions might be available depending on the type of dataset you are looking at and on whether it's your own, editable data or a public or finalised dataset. Apart from the display options, you can always decide to export the table data in a variety of formats (cp. this page) and you may create plots based on the data in the table (cp. this page). For editable datasets, you may also have the option to apply a permanent filter to the data (cp. this page) or to calculate additional columns (cp. this page). N.B. the latter two options will NEVER change the dataset itself, but rather create an additional step that will be added to the workflow and produce a new dataset with the chosen modifications.

Genome Browser

Genome Browser. GeneProf's simple in-built genome browser (based on GenomeGraphs [3]) provides quick visualization of aligned short read data to the genomic reference sequence. Data from different experiments can be easily combined in the same plot and compared.
GeneProf features a simple, integrated genome browser built on GenomeGraphs [3]. The genome browser is meant to provide you with a tool to quickly examine and compare genomic alignments from your own and public sequencing experiments and does by no means constitute a full-blown replacement for such great tools as the Ensembl [14,7], UCSC [12,8], IGV [16] or other genome browsers [6,11,10]. Note that GeneProf supports batch exports for browser tracks which should be compatible with the aforementioned tools (cp. this page).
So, that being said, let's have a look at the browser page. It consists of four parts:

Visual Data Explorer

The Visual Data Explorer (VDE; cp. also this tutorial: Tutorial 'Examining Public Next-Gen Data..') has been designed to make it easy (or at least easier) to discover patterns across many of these datasets in an intuitive, visual way.
The idea here is that you choose a collection of datasets of the same type and from the same organism, group those by an annotation category, i.e. a common property that they share, and then visualise the values in the datasets with one of the available types of plots.
So, to get started, select a reference dataset (cp. SubConcept 'Reference Data'). The VDE lists all public references, which have public datasets associated to them.
Next select a data type. There are five types to choose from:
Depending on which organism you have chosen, not all of these types might be available -- that is because GeneProf does not (yet) have enough public data for all supported organisms. Mind that this might change in the future as we add more data to the system, so it might be worth checking again later on if you are missing relevant data for your organism of choice.
Proceed by selecting all datasets you'd like to take into account. Click the Select Dataset(s).. button, which opens a popup dialog listing all datasets of the chosen reference and type (see this page for help). Pick as many as you like, but bear in mind that plots might become overcrowded and patterns might be disturbed by comparing completely unrelated samples.
Now select a criterion by which to group the data (Group data by..). Note that not all samples will have sensible annotation values for each given category. We strive to complete the annotation as well as possible, yet not always is all information available and / or applicable for each given sample. If no annotation is available, the given samples will appear in the plots as Unknown or N/A. You don't have to group the data, but it remains questionable whether the VDE can tell you much without any grouping.
Lastly, decide which type of plot you'd like to create. GeneProf's VDE currently supports three types:
Clicking the Update Plot button will refresh the VDE plot (this may sometimes take up to a few minutes, but is usually rather quick).
You can save the plot in a variety of high-quality formats using the Export High Quality Images form at the bottom right of the page.

Gene Search

Gene Search Page. You can easily search for your favorite genes using the quick search feature. Choose the correct gene from a list of possible matches (from different organisms and reference datasets).
Looking for information about a particular gene or other feature? Just enter the name of the gene you are looking for into the text box (cp. this figure) and hit your ENTER-key or click the Search button. The page will display a list of all features matching your query, grouped by reference dataset. You can continue on to the details page (cp. this page) for each individual gene by clicking its accession number.
You can search for gene names or stable identifiers in a number of public databases (Ensembl, RefSeq, etc. -- exactly which identifiers are supported depends on the reference set, cp. SubConcept 'Reference Data'). By default, GeneProf will search through the textual information available for each gene and try to match it to your query, i.e. searching for 'nanog' will match (amongst others) both the NANOG and the NANOGP1 genes in human. You can refine your search results by ticking the Exact matches only-box, which will cause searches to match the entire string. In the previous example, only the NANOG gene will be returned. Note that searches are case-insensitive, so 'nanog' and 'nANoG' will both return the same result. You can further refine the search by selecting the organism (or rather the reference dataset) from the drop-down box next to the text field.
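The search behaviour described above (partial versus exact matches, case-insensitivity and results grouped by reference dataset) can be sketched as follows. The function search_genes and the reference labels 'human' and 'mouse' are made up for this illustration and say nothing about how GeneProf stores or queries its data.

```python
def search_genes(records, query, exact=False, reference=None):
    """Return matching gene names grouped by reference dataset.

    records: iterable of (reference_label, gene_name) pairs (hypothetical layout).
    """
    query = query.lower()  # searches are case-insensitive
    hits = {}
    for ref, name in records:
        if reference is not None and ref != reference:
            continue  # restrict the search to one reference dataset
        lowered = name.lower()
        found = (lowered == query) if exact else (query in lowered)
        if found:
            hits.setdefault(ref, []).append(name)
    return hits


records = [
    ("human", "NANOG"),
    ("human", "NANOGP1"),
    ("human", "SOX2"),
    ("mouse", "Nanog"),
]
partial = search_genes(records, "nanog")
exact_only = search_genes(records, "nANoG", exact=True)
human_exact = search_genes(records, "nanog", exact=True, reference="human")
```

The partial search finds NANOG and NANOGP1 in the human reference as well as Nanog in the mouse reference, whereas the exact search drops NANOGP1.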

Record Details

Gene Summary Page: General Information. This part of the page summarizes general information about a feature, such as its name, identifiers in a variety of databases and its genomic location.
Gene Summary Page: Public Data. LEFT: This part of the page summarizes information about the feature's expression across public datasets in GeneProf. RIGHT: This part of the page summarizes information about DNA binding activity either of this feature itself (top) or other features (transcription factors) in its neighborhood across public datasets in GeneProf.
This page provides information about a particular record (a gene, sequence or other data point of any sort) in GeneProf. Depending on the type of data displayed, the contents of the page will vary slightly. Compare this figure and this figure for examples of what a gene summary page looks like. We'll first summarize what the page looks like for other kinds of data and then come back to the gene-centric summary example (which is, by far, the most extensive summary page).
For sequences (cp. SubConcept 'Sequence Data', e.g. chromosome or transcript sequences or next-gen sequencing short reads), the page will show the sequence's total length (number of nucleotides), the general nucleotide distribution across the sequence (as a pie chart) as well as the first bases of the sequence itself (up to 480) and the corresponding quality values (if any; cp. SubConcept 'Quality Scores and Formats'). You'll also have the option to download the sequence in FASTA or FASTQ format.
For genomic regions (cp. SubConcept 'Genomic Region Data', e.g. aligned reads, transcription factor binding sites, genomic bins, etc.) the page simply displays the genomic location (chromosome, start, end, ..), a plot of this region in its genomic neighborhood (including other regions from the same dataset, if any) as well as a table of all annotations available for this region (e.g. identifiers of aligned reads, enrichment values from peak finding algorithms, etc.).
Finally, we return to look at gene-centric summary pages (i.e. those for features, cp. SubConcept 'Feature Data'). These pages might contain a lot more information than those for other types of data. For a start, the page contains all information available for this gene in the reference dataset it belongs to (cp. SubConcept 'Reference Data'), such as its name, identifiers in a variety of databases and its genomic location. By the way, most of this information has been gathered from Ensembl [14,7]. If there's any additional data about this gene in the dataset you're viewing it from (e.g. RNA-seq expression data), it will be displayed alongside the generic information.
Interestingly, below the general information, you'll find an additional section labeled Public GeneProf Data for this Feature. Here, we summarize the public data that is available in GeneProf for this particular feature. Firstly, if any of the public datasets in GeneProf report expression values (e.g. from an RNA-seq experiment) for this feature, there will be a plot detailing the feature's expression across all these datasets. It's a bar chart including confidence intervals, of a kind you might be familiar with. The plot summarizes expression values by a group of sample annotation values, by default by the cell line the expression samples have been derived from. You can easily change the grouping criterion by clicking any one of the links below the image (it might take a moment to reload, so please be patient).
Secondly, you will find a subsection concerned with DNA-binding activity involving this feature -- either actively or passively. Let me explain: If the feature is a transcription factor (DNA-binding protein), GeneProf might contain ChIP-seq data for this feature. If so, there will be a summary of what's known about this feature's DNA binding activity in the form of two tables: The first table gives an overview of all public datasets reporting genes potentially regulated by this factor and the total number of features reported as bound in each dataset. Importantly, this is the number of features (genes) with a binding site for the transcription factor somewhere in the proximity of their transcription start site (TSS; typically at most 20kb upstream and 1kb downstream of the TSS). This is different from the total number of binding sites! One binding site might be in the proximity of more than one feature (i.e. count twice, thrice, ..) or more than one binding site might count only once if they are all near the same feature. You can click the ID of any dataset to continue on to the details page for this dataset and investigate the binding in more detail. Additionally, there will be a list of the 25 features most strongly regulated by this factor. The estimated strength of regulation is calculated as a TFAS score on the basis of binding site enrichments (cp. TFAS module: module 'Calculate TFAS' and [9]). If more than one dataset with TFAS scores is available, we'll take an average of all scores to decide the order.
Lastly, you might be interested in transcription factors potentially regulating this feature's expression. If there's any data available, there will be a list of factors which have one or more binding sites in the feature's proximity along with the tissues, cell types, etc. in which this binding has been observed and the datasets that report these binding sites. Again, you can click the dataset identifiers to continue on to more details.
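The difference between the number of binding sites and the number of bound features is easy to get wrong, so here is a small sketch of the idea. It is an illustration under stated assumptions rather than GeneProf's actual procedure: all features are assumed to lie on the same chromosome, the TSS is taken to be the start of a feature on the '+' strand and its end on the '-' strand, and the helper names are hypothetical. The window uses the typical values mentioned above (20kb upstream, 1kb downstream).

```python
def tss(feature):
    """Transcription start site: start on '+', end on '-' (assumption of this sketch)."""
    return feature["start"] if feature["strand"] == "+" else feature["end"]


def in_window(site_pos, feature, upstream=20000, downstream=1000):
    """True if a binding site lies within the window around the feature's TSS."""
    offset = site_pos - tss(feature)
    if feature["strand"] == "-":
        offset = -offset  # on the '-' strand, upstream means higher coordinates
    return -upstream <= offset <= downstream


def bound_features(sites, features):
    """Names of all features with at least one binding site near their TSS."""
    return {f["name"] for f in features for s in sites if in_window(s, f)}


features = [
    {"name": "A", "strand": "+", "start": 50000, "end": 60000},
    {"name": "B", "strand": "-", "start": 30000, "end": 45000},
    {"name": "C", "strand": "+", "start": 200000, "end": 210000},
]
sites = [47000, 48000, 49000]  # three binding sites between features A and B
bound = bound_features(sites, features)
```

In this example there are three binding sites but only two bound features: every site is near both A and B (so each site counts for two features), while several sites near the same feature count only once.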

Processing Queue

Processing Queue. This page gives you an overview of outstanding jobs on GeneProf's processing servers. Data belonging to other users may be partially obscured.
The processing queue (cp. this figure) provides information about the workload currently handled by our servers. Due to their computational demands (next-gen data is big and therefore costly to process) all analysis pipelines are executed on a network of number-crunching machines. Any one of these machines is waiting for new experiments to be submitted to the queue and then these will be picked up sequentially by one of the processors.
This page displays a list of all experiments that are currently in the queue along with their current status (pending = the experiment has not yet been picked up and is awaiting its turn in the queue, executing = the experiment is currently being processed by one of the servers).
For your own experiments, you'll also get an estimate of how much longer the execution of the experiment is expected to take. Beware, this is merely a rough estimate based on previous execution times of similar processes! The estimate might in some cases be way off and should be treated only as a guideline.
Please note that information about experiments owned by other users will be partially obscured so as to protect their (and their data's) privacy. For your own experiments, you can jump directly to the experiment's main page, where you can check the current progress via the workflow designer (see this page) and experiment history (see this page).

Issue Tracking Pages

All pages in this category are connected to GeneProf's issue tracking tool, which is used to report bugs and submit feature requests (cp. Section 'Reporting Bugs and Feature Req..').

Issue Tracker

The page provides an overview of all outstanding issues reported in GeneProf's database. The displayed table can be filtered and sorted dynamically; please refer to the documentation of the experiment browser for instructions on how to do so (cp. this page).
Click the ID of any issue to view its report in detail (cp. this page).
Interestingly enough, the same page also allows you to submit new reports yourself. So, if you've noticed something going wrong or if you've come up with a great idea, this is the place to be!
To submit a new report, please first check through the list of open issues to make sure nobody else already pointed out the same thing (if so, you may wish to comment on the existing issue report, cp. this page). If there's no report in the system yet, first select whether the submitted issue is a bug report (a report about a fault you've found in the software), a feature request (a suggestion for a new component or functionality that would improve GeneProf) or other (for any other sort of correspondence).
If you're concerned about your privacy, you can make submitted reports private, so other users (apart from the administrators, of course) won't be able to see them. We'd like to encourage you to keep reports public wherever possible, though, to avoid duplicate reports of the same issue.
Enter a short, descriptive title (ideally no longer than 30 symbols) and a verbose, detailed description of the issue. Please, also read the submission guidelines (Section 'Reporting Bugs and Feature Req..') for more information about what should be included in the report.
By default, you'll be notified of all updates to this issue by email. If you do not wish to be notified, un-tick the corresponding checkbox.
If you are reporting a bug, it might be advantageous if administrators can access your log files and experiment histories. We realize that these might contain confidential details about your research and, although we assure you that we will treat any such information as confidential, if you would not like us to look into them, just un-tick the corresponding checkbox. If the report is not a bug report, this checkbox will be ignored.
Lastly, you need to confirm that you have read the submission guidelines (Section 'Reporting Bugs and Feature Req..'). Please do indeed read those to make sure you are aware of all information you are meant to provide in order to make it possible for us to deal with your report in a quick and efficient manner.
Click the Submit Report button to add your new report to the issue database.

Issue Report

This page shows the details of an issue report. An issue may either be a bug (some fault in the software), a feature request (a suggestion about how to improve GeneProf) or any other sort of correspondence.
The report page contains details about who submitted the report, when it was submitted and the current status of the report: New = a newly submitted report, no administrator has yet responded to it. Feedback = an administrator has responded to this issue, however, it's not yet being dealt with. Assigned = we are currently working on this issue. Closed = the issue has been closed un-resolved, either because we were unable to reproduce it or because it has become irrelevant for some other reason. Resolved = the issue has successfully been resolved.
Most importantly, the page contains all correspondence about this issue: Starting with the initial report, any responses from administrators will be listed. Also, any registered GeneProf user may comment on a report (for example, to add their own experiences about a bug or to support a feature request) and such comments will also be listed in the report page.
If you'd like to comment on a report yourself, you can do so by entering your comment into the text box at the bottom of the page. Please read the submission guidelines (see Section 'Reporting Bugs and Feature Req..') and confirm that you have read them, then submit your comment by clicking on the Submit Comment button.

Pop-Up Windows and Dialogs

Dataset Display Options

Use this dialog to choose which columns you'd like to show in the dataset browser (cp. this page). It's easy: There are two boxes listing all columns which are available for this dataset. The box on the right contains currently selected columns and the one on the left contains those that are currently not selected. You can change your selection by double-clicking a column label in either box (which will switch that column over to the other box, i.e. either include it in the list of displayed columns or remove it) or you can select multiple column labels at once and use the Add and Remove buttons.
You can change the order in which columns will be displayed by moving the selected columns (in the box on the right) up or down. The column at the top of the list will be displayed as the first column in the browser (left-most). Select one or more columns in the list and use the Up and Down buttons to change the order.
Finally, you can change the way in which the table is to be sorted (by default, when the page is loaded). Just pick the column to sort by from the dropdown list and choose a sort order.
If the dataset you change the display options for is owned by you (i.e. it's in one of your own experiments), the changes will be saved in the dataset and the next time you open the dataset, the previously selected columns will be shown. Otherwise, the display options will only apply to the current page.

Experiment History

This dialog provides an overview of changes to the experiment at hand and its data analysis workflow. It also contains a record of all programs that have been executed in the course of the analysis of this experiment and of their outputs.
The overview table is sorted chronologically. Every time the experiment is executed, a snapshot of the workflow will be taken and stored in the history. You can check how the workflow at this time point differs from the latest version (i.e. the one currently in use, which you will see in the experiment's workflow designer, cp. this page) by clicking the small workflow icon (workflow icon) next to one of the records.
Outputs of analysis modules (cp. SubConcept 'Workflow Modules') that have been executed in the past can be examined by clicking the history icon (history icon) next to one record. This might be particularly helpful, if you'd like to find out what went wrong, if the execution of one component failed.
Important: The histories of some earlier, public experiments (experiments with an accession number < 100) might be incomplete. That is because the histories, recorded when GeneProf was still under development, contained some information which presented a security risk and we thus had to remove it. We apologize for any inconvenience.

Genome Browser Track Selector and Display Options

Use this dialog to configure general display options and select additional genomic data tracks for the genome browser (cp. this page).
Firstly, you can change the image width and height per track to suit your screen resolution. Further options:
Besides these general display options, you can choose which tracks to display in the browser (genomic region datasets, cp. SubConcept 'Genomic Region Data'). Available tracks include an assortment of all public data available in GeneProf. If you open the genome browser from an experiment, this selection will also include all genomic region datasets of that experiment. This is a great way to quickly compare your own data to those of others!
To select a track, browse through the table at the bottom of the page (you can search, sort and filter it in the same way as other GeneProf tables, cp. for example, this page) and click Add next to the track (add the tracks in the desired order). You can afterwards change the way the track is to be displayed in the browser by changing its label or colour to anything you like or by selecting a different track type. For public tracks, a sensible track type should be selected by default, but feel free to change it anyway! For tracks from your own experiments, GeneProf will try to guess a sensible type. The available track types are:
There's one additional option for WIG-like plots: You can choose to extend each region to a fixed length prior to the calculation of coverage counts. This is often done for ChIP-seq alignments, where it makes sense to extend regions to the average length of sequenced DNA fragments (or half the fragment length). If you don't want to extend regions, enter '-1' into the respective text box or leave it blank; GeneProf will then use the regions' actual length.
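The effect of extending regions before counting coverage can be sketched like this. The sketch is only meant to illustrate the idea and makes several assumptions that are not taken from GeneProf itself: coordinates are 1-based and inclusive, regions are extended in their direction of alignment (towards higher coordinates on the '+' strand, towards lower coordinates on the '-' strand), regions already longer than the fixed length are left unchanged, and a blank text box is represented by None.

```python
from collections import Counter


def extend_region(start, end, strand, length=None):
    """Extend a region to a fixed length; -1 or None keeps the actual length."""
    if length is None or length == -1:
        return start, end
    if end - start + 1 >= length:
        return start, end  # assumption: never shorten a region
    if strand == "-":
        return max(1, end - length + 1), end
    return start, start + length - 1


def coverage(regions, length=None):
    """Count how many (extended) regions cover each genomic position."""
    depth = Counter()
    for start, end, strand in regions:
        ext_start, ext_end = extend_region(start, end, strand, length)
        for position in range(ext_start, ext_end + 1):
            depth[position] += 1
    return depth


# Two 36bp reads flanking a hypothetical binding site, one on each strand.
regions = [(100, 135, "+"), (300, 335, "-")]
plain = coverage(regions)                # no extension
extended = coverage(regions, length=200) # extend to a 200bp fragment length
```

Without extension, the two reads never overlap and the position between them has no coverage at all. After extending both reads to 200bp, they overlap in the middle, which is where the sequenced fragment (and thus the binding site) is most likely to be.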

Data Export

You can export all GeneProf-created datasets in a variety of popular formats. Which formats exactly are available, might depend on the type of data at hand -- for example, it wouldn't make sense to export gene expression data in a format specific for nucleotide sequences, would it? Check the list below for details about the different export formats.
A note of warning here: Exporting large datasets can be quite time-consuming. When you click any of the download links in this dialog, it might seem as if nothing was happening at all: Not so! Please be patient and refrain from repeatedly clicking the same link. Just wait a bit and a file dialog (asking you where to save the file) should appear in due time.
Supported export file formats:

Dataset Details

This dialog displays the same information as the dataset overview page. Please refer to the documentation of that page for more information: this page.
The dialog might sometimes be a little small to display the dataset appropriately. In this case we recommend opening the dataset in a new window using the New Window button at the bottom right of the dialog.

Plotting

GeneProf features a range of tools for dynamically creating publication-quality plots for your data.
Use this dialog to select an appropriate plot type.

Plotting: Boxplots

Example Boxplot. We used this boxplot to compare the distribution of raw gene expression read counts (raw) with those scaled to account for differences in library size (RPM). The data shown is taken from gpXP_000023. We log2-transformed the data and changed labels and colour to achieve the plotting result shown here.
Use this dialog to create boxplots (also known as box-and-whisker plots) for your data (cp. this figure). Boxplots summarize five important statistics about the distribution of values in a group of numbers. You can use them with several sets of values to quickly (yet meaningfully) compare several data columns, e.g. the distribution of expression values (read counts & co.) in various samples.
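The text above does not spell out which five statistics are meant. Conventionally, a boxplot is built from the minimum, the lower quartile, the median, the upper quartile and the maximum of a group of numbers, so the sketch below assumes exactly those. The function name five_number_summary is hypothetical, and the actual plots may well compute quartiles or draw whiskers slightly differently.

```python
import statistics


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile and maximum of a list."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


summary = five_number_summary([1, 2, 3, 4, 5])
```

The box of the plot spans the range from the lower to the upper quartile, the line inside the box marks the median, and the whiskers reach out towards the minimum and maximum.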
Select all data columns you'd like to include in the plot, by selecting them in the box on the right and clicking the Include selected columns in plot! button. By default, GeneProf will select all numeric columns which are displayed in the dataset's detail table. You can remove selections by clicking the Remove button and, if you wish, you may change the labels used in the plot by altering the values in the respective text boxes.
Below, you'll find several more options for customizing the output plot: You can change its appearance by changing the plot title, colour and format (choose from a list of popular formats) and you can define its resolution and size.
You might also like to apply a transformation to your data before you create the plot: Often converting raw numbers to a logarithmic scale will improve the visual appeal as well as expressiveness of a plot. You can also ignore data rows which have values of 0 for all selected columns or you may try to scale the data in such a way that all columns add up to the same value (1,000,000), although that's generally not a good idea and you should only do this in special cases (e.g. for raw gene expression read counts between different sequencing libraries).
When you click the Plot! button, your customized plot will be rendered in a new browser window or tab. You can then go back to this dialog and change the parameters to achieve the look you prefer.

Plotting: Co-Occurrence Heatmap

Example Co-Occurrence Heatmap. We used this heatmap to look at the co-occurrence of potential transcription factor binding sites in the upstream region of genes from gpXP_000042. We changed labels and colour to achieve the plotting result shown here.
Use this dialog to create what we call a 'Co-Occurrence Heatmap' for your data (cp. this figure). The plot takes a number of columns as input and interprets their contents as boolean values (yes / no, true / false; all numbers greater than 0 are interpreted as true, all numbers smaller than or equal to 0 as false) and calculates the frequencies of TRUE occurring together for each pair-wise comparison of columns. The frequencies are then depicted in a square heatmap with more intense colours representing a higher frequency of co-occurrence. In a way, these plots can serve as a replacement for Venn diagrams if you have more than three datasets (cp. this page).
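The calculation behind such a heatmap can be sketched in a few lines of Python. This is only an illustration: the column names are hypothetical, and dividing by the number of rows is an assumption, since the manual does not say how the frequencies are normalised.

```python
def to_bool(value):
    """Interpret a cell as boolean: booleans as-is, numbers > 0 as True."""
    if isinstance(value, bool):
        return value
    return value > 0

def co_occurrence(columns):
    """Return {(a, b): fraction of rows in which columns a and b are both True}."""
    names = list(columns)
    n_rows = len(columns[names[0]])
    flags = {name: [to_bool(v) for v in columns[name]] for name in names}
    result = {}
    for a in names:
        for b in names:
            both = sum(1 for x, y in zip(flags[a], flags[b]) if x and y)
            result[(a, b)] = both / n_rows  # normalisation is an assumption
    return result

# Hypothetical columns: number of binding sites per gene for three factors.
columns = {
    "Fli1":  [2, 0, 1, 0],
    "Sfpi1": [1, 0, 0, 3],
    "Gata1": [0, 0, 4, 1],
}
matrix = co_occurrence(columns)
print(matrix[("Fli1", "Sfpi1")])
```

The resulting matrix is symmetric, and its diagonal simply holds the fraction of rows that are TRUE in each single column.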
Select all data columns you'd like to include in the plot, by selecting them in the box on the right and clicking the Include selected columns in plot! button. By default, GeneProf will select all boolean columns which are displayed in the dataset's detail table. You can remove selections by clicking the Remove button and, if you wish, you may change the labels used in the plot by altering the values in the respective text boxes.
Below, you'll find several more options for customizing the output plot: You can change its appearance by changing the plot title, colour and format (choose from a list of popular formats), and define the resolution and size.
When you click the Plot! button, your customized plot will be rendered in a new browser window or tab. You can then go back to this dialog and change the parameters to achieve the look you prefer.

Plotting: Correlation Heatmap

Example Correlation Matrix. We used this heatmap to visualize the correlation matrix of gene expression patterns between different RNA-seq libraries from gpXP_000065. We changed labels and colour to achieve the plotting result shown here.
Use this dialog to create what we call a 'Correlation Heatmap' for your data (cp. this figure). That's a visualization of a correlation matrix between different numeric columns (read up about correlation on Wikipedia; GeneProf uses Spearman's correlation coefficients).
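If you are curious what a Spearman correlation matrix involves, the following Python sketch may help: each column is replaced by its ranks (ties share the mean rank) and the ordinary Pearson correlation is then computed on those ranks. The function names are made up for this example and this is not GeneProf's actual implementation.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def correlation_matrix(columns):
    """Return {(a, b): Spearman correlation} for every pair of columns."""
    names = list(columns)
    return {(a, b): spearman(columns[a], columns[b]) for a in names for b in names}
```

Because only the ranks matter, a monotonic transformation such as log2 does not change the result.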
Select all data columns you'd like to include in the plot, by selecting them in the box on the right and clicking the Include selected columns in plot! button. By default, GeneProf will select all numeric columns which are displayed in the dataset's detail table. You can remove selections by clicking the Remove button and, if you wish, you may change the labels used in the plot by altering the values in the respective text boxes.
Below, you'll find several more options for customizing the output plot: You can change its appearance by changing the plot title, colour and format (choose from a list of popular formats), and define the resolution and size.
When you click the Plot! button, your customized plot will be rendered in a new browser window or tab. You can then go back to this dialog and change the parameters to achieve the look you prefer.

Plotting: Heatmaps

Example Heatmap. We used this heatmap to compare the expression patterns (in RPM) between several RNA-seq libraries from gpXP_000023. We chose to log2-transform the data and changed labels and colour to achieve the plotting result shown here.
Use this dialog to create heatmaps for your data (cp. this figure). Heatmaps have become a popular tool in biology for the visualization of numeric data, in particular gene expression data. Heatmaps, when combined with hierarchical clustering, can help researchers to quickly spot patterns in the data at hand.
Select all data columns you'd like to include in the plot, by selecting them in the box on the right and clicking the Include selected columns in plot! button. By default, GeneProf will select all numeric columns which are displayed in the dataset's detail table. You can remove selections by clicking the Remove button and, if you wish, you may change the labels used in the plot by altering the values in the respective text boxes.
Below, you'll find several more options for customizing the output plot: You can change its appearance by changing the plot title, colour and format (choose from a list of popular formats), and define the resolution and size.
You might also like to apply a transformation to your data before you create the plot: Often converting raw numbers to a logarithmic scale will improve the visual appeal as well as expressiveness of a plot. You may try to scale the data in such a way that all columns add up to the same value (1,000,000), although that's generally not a good idea and you should only do this in special cases (e.g. for raw gene expression read counts between different sequencing libraries).
If you wish, you can hide the colour legend displayed in the upper left of the heatmap and you can disable the dendrograms (and hence the hierarchical clustering of the samples = columns and genes = rows).
When you click the Plot! button, your customized plot will be rendered in a new browser window or tab. You can then go back to this dialog and change the parameters to achieve the look you prefer.

Plotting: Histograms

Example Histogram. We used this histogram to look at the distribution of fold changes (log2) between the expression of neural progenitors (NPC) and embryonic stem cells (ESC) in gpXP_000106. We changed labels and colour to achieve the plotting result shown here. It appears that a similar number of genes is up- and down-regulated, with the majority of all genes being largely the same between both conditions.
Use this dialog to create histograms for your data (cp. this figure). Histograms summarize the distribution of values in a single dataset, i.e. they divide the entire range of values into 'bins' and count the number of times values falling into each of these bins are observed. GeneProf can create a plot combining several histograms into one large panel plot.
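The binning idea can be illustrated with a short Python sketch. Equal-width bins and the function name are assumptions made for this example; the manual does not describe how GeneProf chooses its bins.

```python
def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical: a single bin holds everything.
        return [lo, hi], [len(values)]
    counts = [0] * n_bins
    for v in values:
        # The maximum would fall just outside the last bin, so clamp it in.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

# Made-up log2 fold changes.
fold_changes = [-2.0, -0.4, -0.1, 0.0, 0.1, 0.2, 0.3, 2.0]
edges, counts = histogram(fold_changes, n_bins=4)
print(edges, counts)
```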
Select all data columns you'd like to include in the plot, by selecting them in the box on the right and clicking the Include selected columns in plot! button. By default, GeneProf will select all numeric columns which are displayed in the dataset's detail table. You can remove selections by clicking the Remove button and, if you wish, you may change the labels used in the plot by altering the values in the respective text boxes.
Below, you'll find several more options for customizing the output plot: You can change its appearance by changing the plot title, colour and format (choose from a list of popular formats), and define the resolution and size.
You might also like to apply a transformation to your data before you create the plot: Often converting raw numbers to a logarithmic scale will improve the visual appeal as well as expressiveness of a plot. You can also ignore data rows which have values of 0 for all selected columns or you may try to scale the data in such a way that all columns add up to the same value (1,000,000), although that's generally not a good idea and you should only do this in special cases (e.g. for raw gene expression read counts between different sequencing libraries).
When you click the Plot! button, your customized plot will be rendered in a new browser window or tab. You can then go back to this dialog and change the parameters to achieve the look you prefer.

Plotting: PCA Plots

Example PCA Plot. We used this PCA plot to compare gene expression patterns (RPKM) from 7 different RNA-seq libraries from gpXP_000106. We only changed labels and colour to achieve the plotting result shown here. The first two principal components neatly separate the data into the libraries belonging to three different cell types.
Use this dialog to create a 'PCA plot' for your data (cp. this figure). PCA stands for 'principal component analysis' and, put simply, it's a way to mathematically decompose large datasets into a set of representative numeric values, called 'principal components' (abbreviated: PCs; read up about PCA on Wikipedia). Practically, PCA can be a neat way to discover non-trivial relationships between datasets: If the data can be accurately represented by its PCs, similar datasets should cluster together in the plots produced. In some cases, one may even attempt to assign a biological meaning to the PCs serving as a separator between two classes of datasets -- but you should be careful about this as it's not generally advisable to put too much meaning into the PCs.
GeneProf's PCA plots are actually a panel of four sub-plots: The first plot on the top-right tells you how well the PCA worked for the data at hand. Each PC can be said to explain a certain percentage of the variation in the data. For the subsequent plots we use the first 3 PCs, so those should sensibly explain a good amount of this variation. The plot shows two lines: The black one gives the percentage of the variation explained by each individual PC and the red one the cumulative percentage explained by the first to n-th PC, i.e. if we look at the red line at position x=3, the value on the y-axis tells us which percentage of the variation is explained by the first three PCs taken together. The other three plots show the contribution of the individual datasets to the first three PCs. The first plot compares PC1 with PC2, the second PC2 with PC3 and the last PC1 with PC3.
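The relationship between the black and the red line can be sketched in Python as follows. The per-PC variances are made-up numbers and the function name is hypothetical; the sketch only shows how the individual and cumulative percentages relate, not how GeneProf performs the PCA itself.

```python
from itertools import accumulate

def explained_variance(variances):
    """Return (individual, cumulative) percentages of variation per PC."""
    total = sum(variances)
    individual = [100.0 * v / total for v in variances]  # the 'black line'
    cumulative = list(accumulate(individual))            # the 'red line'
    return individual, cumulative

# Made-up variances for PC1 to PC5.
pc_variances = [6.0, 2.5, 1.0, 0.3, 0.2]
individual, cumulative = explained_variance(pc_variances)

# cumulative[2] is the value of the red line at x=3: the percentage of the
# variation explained by the first three PCs taken together.
print(cumulative[2])
```

If that value is low, the scatter plots of PC1 to PC3 show only a small part of the picture and should be read with caution.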
To get started, select all data columns you'd like to include in the plot, by selecting them in the box on the right and clicking the Include selected columns in plot! button. By default, GeneProf will select all numeric columns which are displayed in the dataset's detail table. You can remove selections by clicking the Remove button and, if you wish, you may change the labels used in the plot by altering the values in the respective text boxes.
Below, you'll find several more options for customizing the output plot: You can change its appearance by changing the format (choose from a list of popular formats), and define the resolution and size.
You might also like to apply a transformation to your data before you create the plot: Often converting raw numbers to a logarithmic scale will improve the visual appeal as well as expressiveness of a plot. For the PCA, data is, by default, centered and scaled (to make it better comparable); you can deactivate either of these transformations by ticking the respective checkbox.
When you click the Plot! button, your customized plot will be rendered in a new browser window or tab. You can then go back to this dialog and change the parameters to achieve the look you prefer.

Plotting: Pie Charts

Example Pie Chart. We used this pie chart to check which types of features might be regulated by Klf1. The plot is based on putative binding sites from gpXP_000084.
Use this dialog to create pie charts for your data (cp. this figure). Pie charts are popular for summarizing categorical data and how often each category is observed as a percentage of the whole data.
In GeneProf, a pie chart will be created by looking at all values of a column, finding all distinct values (this only makes sense if there's a rather limited range of values) and counting how many rows have each distinct value.
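The counting step just described boils down to very little code. Here is a minimal Python sketch, with a made-up column of feature types and a hypothetical function name:

```python
from collections import Counter

def pie_slices(values):
    """Return {distinct value: percentage of rows}, largest slice first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.most_common()}

# Made-up 'feature type' column of a small feature dataset.
feature_types = [
    "protein_coding", "miRNA", "protein_coding", "lincRNA",
    "protein_coding", "miRNA", "protein_coding", "protein_coding",
]
slices = pie_slices(feature_types)
print(slices)
```

A column with thousands of distinct values would produce thousands of slices, which is why pie charts only make sense for a limited range of values.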
A sensible application might be to create a pie chart of the feature type (protein coding, miRNA, etc.) of all genes in a feature dataset, where the feature dataset is a selection of features of interest, e.g. differentially expressed genes or targets of a transcription factor.
To create a pie chart, first select the column with the values that should be plotted (for feature datasets, we'll select 'feature type' by default), then customize the appearance of the pie chart, by choosing its colour scheme, resolution, size and image format.
When you click the Plot! button, your customized plot will be rendered in a new browser window or tab. You can then go back to this dialog and change the parameters to achieve the look you prefer.

Plotting: Scatter Plots

Example Scatter Plot. We used this scatter plot to compare gene expression profiles (as RPKM = reads per kilobase per million mapped reads) between several replicates from gpXP_000106. We log2-transformed the data and changed labels and colour to achieve the plotting result shown here. Only a random subset of 10,000 genes is plotted here.
Use this dialog to create scatter plots for your data (cp. this figure). Scatter plots are fantastic tools for comparing different sets of numeric values. You'll immediately get a feel for how different datasets compare. GeneProf can plot a panel of multiple scatter plots to compare several datasets in one plot. Rather than plotting the same scatter plot twice (for the scatter plot of A vs. B is the mirror image of B vs. A), the lower left triangle of the panel plot will give the Pearson correlation coefficients for the respective datasets.
Select all data columns you'd like to include in the plot, by selecting them in the box on the right and clicking the Include selected columns in plot! button. By default, GeneProf will select all numeric columns which are displayed in the dataset's detail table. You can remove selections by clicking the Remove button and, if you wish, you may change the labels used in the plot by altering the values in the respective text boxes.
Below, you'll find several more options for customizing the output plot: You can change its appearance by changing the plot title, colour and format (choose from a list of popular formats), and define the resolution and size.
You might also like to apply a transformation to your data before you create the plot: Often converting raw numbers to a logarithmic scale will improve the visual appeal as well as expressiveness of a plot. You can also ignore data rows which have values of 0 for all selected columns or you may try to scale the data in such a way that all columns add up to the same value (1,000,000), although that's generally not a good idea and you should only do this in special cases (e.g. for raw gene expression read counts between different sequencing libraries).
GeneProf plots up to 250,000 points, but you can reduce the number of points (a random subset will be taken) for visual clarity. Another way that might help to improve the plot is the use of a smooth scatter plot: Rather than plotting each individual point, smooth scatter plots show something similar to a cloud in the plot where areas with a higher density of points are darker (or have a more intense colour) than those areas with a low density of points. Which plot is suitable for you depends entirely on the nature of your data.
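Taking a random subset of points is simple, but there is one detail worth getting right: rows have to be sampled as a whole, so that each x value stays paired with its y value. A minimal Python sketch (function and parameter names are assumptions for illustration, not GeneProf's code):

```python
import random

def subsample_points(xs, ys, max_points, seed=None):
    """Return at most max_points (x, y) pairs, chosen at random by row."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    if len(xs) <= max_points:
        return list(xs), list(ys)
    rng = random.Random(seed)
    # Sample row indices rather than values to keep the pairs intact.
    keep = sorted(rng.sample(range(len(xs)), max_points))
    return [xs[i] for i in keep], [ys[i] for i in keep]

# Made-up data: 1,000 genes, reduced to 100 points for plotting.
xs = list(range(1000))
ys = [2 * x for x in xs]
sub_x, sub_y = subsample_points(xs, ys, max_points=100, seed=1)
print(len(sub_x))
```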
When you click the Plot! button, your customized plot will be rendered in a new browser window or tab. You can then go back to this dialog and change the parameters to achieve the look you prefer.

Plotting: Venn Diagrams

Example Venn Diagram. We used this Venn diagram to compare two lists of genes with potential binding sites for Fli1 and Sfpi1 from gpXP_000042.
Use this dialog to create a Venn/Euler diagram for your data (cp. this figure). Venn diagrams visualize overlaps between lists as overlapping circles. GeneProf can plot up to three lists at once and their sizes will be proportional to the size of the datasets.
This tool uses the Vennerable package by Jonathan Swinton (Vennerable homepage). Other online tools for plotting Venn diagrams, which we find very useful, are Venny and BioVenn.
Right, so to plot a Venn diagram, we need to define the groups that are to be compared. In GeneProf, we define up to three groups based on the columns of the dataset at hand. For each group, select a combination of columns. A row is interpreted as belonging to this particular group if any of the values in the selected columns evaluates to TRUE, that is, if its value is TRUE or a number greater than 0. Now, one sensible application of this would be to define one group each for boolean-value columns you've defined earlier, e.g. saying 'this is differentially expressed in condition A' and 'this is differentially expressed in condition B'; then you can easily visualize the overlap of these two lists of differentially expressed genes by defining two groups in the Venn diagram dialog based on these two columns. If you do not select any columns in any one group, this group will be ignored. Thus, you can plot Venn diagrams for only two groups by simply leaving out one.
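The group membership rule can be written down in a few lines of Python. The data layout, the column names and the function names below are hypothetical; the sketch merely mirrors the rule as described above.

```python
def is_true(value):
    """A value counts as TRUE if it is True or a number greater than 0."""
    if isinstance(value, bool):
        return value
    return value > 0

def group_members(rows, group_columns):
    """Return the ids of rows where any selected column is TRUE.

    Returns None if no columns were selected, i.e. the group is ignored.
    """
    if not group_columns:
        return None
    return {
        row_id
        for row_id, row in rows.items()
        if any(is_true(row[column]) for column in group_columns)
    }

# Made-up dataset with two boolean columns.
rows = {
    "geneA": {"DE_in_A": True,  "DE_in_B": False},
    "geneB": {"DE_in_A": True,  "DE_in_B": True},
    "geneC": {"DE_in_A": False, "DE_in_B": True},
    "geneD": {"DE_in_A": False, "DE_in_B": False},
}
group1 = group_members(rows, ["DE_in_A"])
group2 = group_members(rows, ["DE_in_B"])
group3 = group_members(rows, [])   # no columns selected: group is ignored
overlap = group1 & group2          # the intersection shown in the diagram
print(sorted(overlap))
```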
You can further customize the output plot by changing its resolution, size and image format.
When you click the Plot! button, your customized plot will be rendered in a new browser window or tab. You can then go back to this dialog and change the parameters to achieve the look you prefer.

Calculator

This dialog presents an alternative interface to various methods for manipulating the dataset open in the data table browser page (this page). You can choose from a range of common tasks often applied to large data tables.
Each of these tasks is based on an underlying workflow module (cp. SubConcept 'Workflow Modules') and you can customize them using the modules' usual configuration pages (see Chapter 'Modules').
When you finish configuring a task, GeneProf will create a new workflow component for it and add this component to the experiment's workflow. You then need to execute the workflow (i.e. enqueue it in the processing queue on the GeneProf servers) to perform the newly added calculations.
Be careful here: If you are currently editing the experiment's workflow in another browser window or tab, unsaved changes to the workflow will be lost! So, before you proceed, save any open workflows and close the browser windows and tabs no longer required.

Filters

This dialog presents an alternative interface to various methods for applying a permanent filter to the dataset open in the data table browser page (this page). Mind, the filters will NEVER change the dataset at hand itself, but rather derive a new, filtered dataset from the data in this dataset.
Each of the filters is based on an underlying workflow module (cp. SubConcept 'Workflow Modules') and you can customize them using the modules' usual configuration pages (see Chapter 'Modules').
When you finish configuring the filter, GeneProf will create a new workflow component for it and add this component to the experiment's workflow. You then need to execute the workflow (i.e. enqueue it in the processing queue on the GeneProf servers) to run the newly added filter.
Be careful here: If you are currently editing the experiment's workflow in another browser window or tab, unsaved changes to the workflow will be lost! So, before you proceed, save any open workflows and close the browser windows and tabs no longer required.

Upload Main

This is where any data upload to one of your GeneProf experiments starts. Choose your preferred method from the available choices by clicking on one of the options.
For details about the upload methods, please refer to the upload tutorial (Tutorial 'Uploading Data to GeneProf').

Java File Uploader Tool

The uploader tool allows you to upload multiple files at a time to the GeneProf system. Files may be compressed in a variety of popular formats (ZIP, GZIP, BZ2) -- we prefer GZIP.
To start uploading files, click the Add Files button and browse your computer to locate the files. You can add multiple files at once or one at a time -- whatever you prefer.
Keep repeating the procedure until you've selected all files you want to upload, then click the Upload button to start the uploads. GeneProf will launch a program that manages the uploads in the background (there should be a little test tube icon in your task bar). You can monitor the progress of your uploads via this tool or via the GeneProf website.
Once all uploads have been processed, your experiment will become editable again and you can proceed with your data analysis.
We've worked hard to make the uploader tool as accessible as possible, but unfortunately it's impossible to foresee all problems and there might be some operating system / browser combinations on which the uploader tool does not work correctly. If you experience trouble, the page provides suggestions on how to resolve your issues and links to alternative upload mechanisms. Please don't hesitate to get in touch if you can't upload your files!

Simple File Uploader Tool

If the Java-based uploader tool (above) for some reason does not work for you, this simple version of the uploader tool might be the solution. Files may be compressed in a variety of popular formats (ZIP, GZIP, BZ2) -- we prefer GZIP. If your file, despite compression, is still larger than the limit, you'll need to upload it to an external web server and use the URL uploader (see this page) instead. We apologize for any inconvenience and assure you that we're working on a better solution!
So, using the uploader is really easy: Just click the upload button and select the file(s) you'd like to import into GeneProf. You can click the button repeatedly and select one file at a time or you can select more than one file at once using your CTRL-key (Mac-users: CMD-key). The upload does not start immediately, but files are rather put into a queue. When you've completed your file selection, click the Start Upload button. The files will then be uploaded one at a time until all files have been completed. The upload may take a very long time depending on the size of the files, but you can leave the process running in the background and continue working in another browser window. Just don't close the uploader dialog or the browser window containing it, otherwise your upload will be interrupted!
When all uploads are complete, you can close the dialog. The experiment's main page will then be refreshed and you should be able to see the uploaded files in the Input Data section.
N.B. The uploader tool depends on Flash technology. All modern browsers support Flash via specific plugins -- which you will probably have installed already. If not, your browser software should ask you whether you want to install it automatically. Alternatively, you can download the latest version of the plugin from Adobe's download page. If, for any reason, this fails or you do not wish to install Flash, you can use a basic file uploader by clicking the simple uploader link. Note, though, that you can only upload one file at a time with the basic uploader.

URL Uploader Tool

If the data you'd like to upload has been deposited on some sort of web server (HTTP or FTP -- that is, something you can access in a web browser by entering a URL), you can upload it easily to GeneProf by providing the respective URLs. In this case, no restrictions apply to the size of the files uploaded.
To start the upload, simply copy & paste all URLs into the big text box, one URL per line, and click the Upload Files button. The upload will start immediately, but the process may take a very long time depending on the size of the files. You can leave the process running in the background and continue working in another browser window. Just don't close the uploader dialog or the browser window containing it, otherwise your upload will be interrupted!
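Before pasting a long list of URLs, it can be handy to check it on your own computer first. The following Python sketch splits a block of text into one URL per line and separates out lines that don't look like HTTP or FTP addresses. It is a purely local pre-check with a made-up function name and placeholder URLs; accepting https is an assumption, and the sketch says nothing about how GeneProf itself validates the list.

```python
from urllib.parse import urlparse

def parse_url_list(text):
    """Split pasted text into (accepted, rejected) URLs, one URL per line."""
    accepted, rejected = [], []
    for line in text.splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        parts = urlparse(url)
        # 'https' is included as an assumption; the manual names HTTP and FTP.
        if parts.scheme in ("http", "https", "ftp") and parts.netloc:
            accepted.append(url)
        else:
            rejected.append(url)
    return accepted, rejected

# Placeholder URLs for illustration only.
pasted = """
http://example.org/data/reads_1.fastq.gz
ftp://example.org/data/reads_2.fastq.gz

reads_3.fastq.gz
"""
accepted, rejected = parse_url_list(pasted)
print(accepted, rejected)
```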

SRA Importer

If you'd like to re-analyse some published data, the SRA importer tool is probably the best way to quickly get hold of the data. Most published next-gen sequencing data nowadays will be deposited in the SRA [15] and you can import this data directly into GeneProf. Papers usually state the SRA accession number(s) for the raw data, or some include an accession number from the Gene Expression Omnibus (GEO), whose entry in turn refers to the project in the SRA.
If you know the SRA accession number (ideally the SRA project ID, something starting with SRP and followed by 6 digits), just enter it in the search text box. Alternatively, you may also try to search for a word and see what the SRA holds in store. Mind, though, that the number of retrieved results is limited, so your search needs to be rather specific.
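As a small illustration of the identifier format mentioned above (SRP followed by 6 digits), here is a Python sketch that tells a project ID apart from a free-text search term. The function name and the example inputs are made up, and the pattern only encodes the format as described in this manual.

```python
import re

# 'SRP' followed by exactly six digits, as described above.
SRP_PATTERN = re.compile(r"SRP\d{6}")

def looks_like_sra_project_id(term):
    """Return True if the search term has the form of an SRA project ID."""
    return SRP_PATTERN.fullmatch(term.strip()) is not None

# Example inputs for illustration only.
for term in ["SRP000001", "SRP123", "embryonic stem cell"]:
    print(term, looks_like_sra_project_id(term))
```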
After you've entered your search term, all matching entries (along with their identifiers, name and some annotation) will be listed in this dialog. The results are organised by SRA experiments, which usually correspond to one sequenced sample library. One project will consist of more than one of these experiments and each experiment will in turn contain one or more sequencing runs, i.e. the actual short read datasets.
When the search results have been displayed, GeneProf will initially show a question mark icon (?) next to each sequencing run. GeneProf will go off and check whether the raw data for this run is actually available in a downloadable format (unfortunately, for some experiments it isn't) and dynamically update this image and complete the information about the file size. If the data files could be located successfully, the icon will be changed to a green tick mark, otherwise to a red X.
Give GeneProf a moment to update all the file information, then tick the checkboxes next to all datasets you'd like to import and click the Import the Selected Data! button. The upload will happen on one of the compute servers, which means, at this point, your experiment will be entered into the processing queue. You cannot make any further modifications to the experiment until the process is complete, however, you can now close the browser window or navigate away from this page. You will be notified by email as soon as the upload is complete.

Import Data from Another Experiment (1)

GeneProf hosts an ever-increasing load of ready-analysed next-gen data and a lot of this data is public. You can import any dataset from a public experiment or from your own private, but finalised experiments (cp. SubConcept 'Finalization and Publication'). This can be useful if you'd like to try a different analysis method on some public data, if you'd like to perform a meta-analysis of data from different studies or if you'd simply like to compare different datasets in the scope of one experiment.
It should also be noted, that data imported from other experiments will not be replicated in the database, which means -- in practical terms and in your favour -- it will not count towards your disk space allowance!
To import data, first find the experiment you want to import data from by browsing through the list of available experiments. The list can be filtered dynamically and is divided into pages (cp. this page).
Once you've located the experiment of interest, click the Import button. You'll be forwarded to another dialog.

Import Data from Another Experiment (2)

This dialog displays all datasets contained in the experiment you've chosen previously. The datasets are split up by their data type (cp. Concept 'Datasets').
Simply tick the checkboxes next to the datasets you'd like to import and click the Import Selected Dataset(s) button. The import will happen immediately and you'll be brought back to the experiment main page (cp. this page).

Remove Input Data

We need your confirmation before an input dataset is to be removed from an experiment. The removal step is irreversible, so the only way to recover a removed dataset is to upload it to GeneProf again.
Also, consider which implications removing a dataset might have on the experiment's workflow. If you've previously run an analysis and other components of the workflow depend on the input data to be removed, those parts of the workflow (and all intermediate results produced) will be reset and lost.
If you're not sure, check the workflow (via the workflow designer: this page) before you proceed. You can also remove input data directly in the workflow designer, in which case the consequences of the removal might be more evident.

Workflow Execution

When you are done designing your data analysis workflow, all modules are connected and all parameters set to your satisfaction, you can submit your experiment for execution on one of GeneProf's number crunching servers.
Mind, the job will be entered into a queue and it might take a while before the experiment is actually picked up and processed. Also, the execution of entire workflows can take hours (sometimes days).
If you'd like, we can notify you by email when processing is complete. In fact, we'll do this by default, so if you do NOT want to be emailed, un-tick the checkbox prior to submitting the experiment to the queue.

Workflow Interruption

You can interrupt any of your experiments which are currently being executed on the server or pending execution. However, it can, at times, take a few minutes (occasionally even up to an hour) before the experiment becomes editable again, so please be patient.

VDE Data Selector

Use this dialog to select all data to be used in the plots created by the visual data explorer (VDE, cp. this page).
The dialog presents a table of all datasets of the previously selected type and you can tick the checkboxes next to the datasets you'd like to use.
The table includes the sample annotation associated to each of the datasets to help you make a decision. There are a number of ways to quickly select all datasets of interest:
You can toggle the selection of all datasets that belong to a certain annotation category (more information about sample annotations here: SubConcept 'Sample Annotation' and this page), by selecting the annotation value of interest from one of the drop-down menus below the column labels. For example, if you were to select embryonic stem cell from the menu below the header Cell Type, all samples with this cell type would be toggled on or off (depending on whether they were selected before). You can use a combination of these selection methods to quickly select all datasets you're interested in. Just always bear in mind that the selection will be toggled, i.e. if you first select all samples with cell type X and then all samples from a technology platform Y, all those samples that have cell type X and platform Y will be de-selected, because they were first toggled on and then off again.
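The toggling behaviour is easiest to understand as a set operation: every sample matching the chosen annotation value flips its state. A minimal Python sketch, with made-up sample names and annotations and a hypothetical function name:

```python
def toggle(selected, samples, category, value):
    """Flip the selection state of all samples annotated with the given value."""
    matching = {
        name for name, annotation in samples.items()
        if annotation.get(category) == value
    }
    # Symmetric difference: selected samples are switched off, others on.
    return selected ^ matching

# Made-up sample annotations.
samples = {
    "sample1": {"Cell Type": "X", "Platform": "Z"},
    "sample2": {"Cell Type": "X", "Platform": "Y"},
    "sample3": {"Cell Type": "W", "Platform": "Y"},
}
selected = set()
selected = toggle(selected, samples, "Cell Type", "X")  # sample1, sample2 on
selected = toggle(selected, samples, "Platform", "Y")   # sample2 off, sample3 on
print(sorted(selected))
```

Here sample2 ends up de-selected because it matched both choices, which is exactly the situation described above. Toggling the same value twice brings you back to where you started.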
To reset all selections, click the Clear All link at the top left of the table.

Bibliography


[1] Mortazavi, A and Williams, BA and McCue, K and Schaeffer, L and Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods (2008). PMID: 18516045.
[2] Zhang, Y and Liu, T and Meyer, CA and Eeckhoute, J and Johnson, DS and Bernstein, BE and Nusbaum, C and Myers, RM and Brown, M and Li, W and Liu, XS. Model-based analysis of ChIP-Seq (MACS). Genome Biol. (2008). PMID: 18798982.
[3] Durinck, S and Bullard, J and Spellman, PT and Dudoit, S. GenomeGraphs: integrated genomic data visualization with R. BMC Bioinformatics (2009). PMID: 19123956.
[4] Langmead, B and Trapnell, C and Pop, M and Salzberg, SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. (2009). PMID: 19261174.
[5] Trapnell, C and Pachter, L and Salzberg, SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (2009). PMID: 19289445.
[6] Podicheti, R and Gollapudi, R and Dong, Q. WebGBrowse--a web server for GBrowse. Bioinformatics (2009). PMID: 19357095.
[7] Flicek, P and Aken, BL and Ballester, B and Beal, K and Bragin, E and Brent, S and Chen, Y and Clapham, P and Coates, G and Fairley, S and Fitzgerald, S and Fernandez-Banet, J and Gordon, L and Gräf, S and Haider, S and Hammond, M and Howe, K and Jenkinson, A and Johnson, N and Kähäri, A and Keefe, D and Keenan, S and Kinsella, R and Kokocinski, F and Koscielny, G and Kulesha, E and Lawson, D and Longden, I and Massingham, T and McLaren, W and Megy, K and Overduin, B and Pritchard, B and Rios, D and Ruffier, M and Schuster, M and Slater, G and Smedley, D and Spudich, G and Tang, YA and Trevanion, S and Vilella, A and Vogel, J and White, S and Wilder, SP and Zadissa, A and Birney, E and Cunningham, F and Dunham, I and Durbin, R and Fernández-Suarez, XM and Herrero, J and Hubbard, TJ and Parker, A and Proctor, G and Smith, J and Searle, SM. Ensembl's 10th year. Nucleic Acids Res. (2010). PMID: 19906699.
[8] Rhead, B and Karolchik, D and Kuhn, RM and Hinrichs, AS and Zweig, AS and Fujita, PA and Diekhans, M and Smith, KE and Rosenbloom, KR and Raney, BJ and Pohl, A and Pheasant, M and Meyer, LR and Learned, K and Hsu, F and Hillman-Jackson, J and Harte, RA and Giardine, B and Dreszer, TR and Clawson, H and Barber, GP and Haussler, D and Kent, WJ. The UCSC Genome Browser database: update 2010. Nucleic Acids Res. (2010). PMID: 19906737.
[9] Ouyang, Z and Zhou, Q and Wong, WH. ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells. Proc. Natl. Acad. Sci. U.S.A. (2009). PMID: 19995984.
[10] Nielsen, CB and Cantor, M and Dubchak, I and Gordon, D and Wang, T. Visualizing genomes: techniques and challenges. Nat. Methods (2010). PMID: 20195257.
[11] Fiume, M and Williams, V and Brook, A and Brudno, M. Savant: genome browser for high-throughput sequencing data. Bioinformatics (2010). PMID: 20562449.
[12] Fujita, PA and Rhead, B and Zweig, AS and Hinrichs, AS and Karolchik, D and Cline, MS and Goldman, M and Barber, GP and Clawson, H and Coelho, A and Diekhans, M and Dreszer, TR and Giardine, BM and Harte, RA and Hillman-Jackson, J and Hsu, F and Kirkup, V and Kuhn, RM and Learned, K and Li, CH and Meyer, LR and Pohl, A and Raney, BJ and Rosenbloom, KR and Smith, KE and Haussler, D and Kent, WJ. The UCSC Genome Browser database: update 2011. Nucleic Acids Res. (2011). PMID: 20959295.
[13] Anders, S and Huber, W. Differential expression analysis for sequence count data. Genome Biol. (2010). PMID: 20979621.
[14] Flicek, P and Amode, MR and Barrell, D and Beal, K and Brent, S and Chen, Y and Clapham, P and Coates, G and Fairley, S and Fitzgerald, S and Gordon, L and Hendrix, M and Hourlier, T and Johnson, N and Kähäri, A and Keefe, D and Keenan, S and Kinsella, R and Kokocinski, F and Kulesha, E and Larsson, P and Longden, I and McLaren, W and Overduin, B and Pritchard, B and Riat, HS and Rios, D and Ritchie, GR and Ruffier, M and Schuster, M and Sobral, D and Spudich, G and Tang, YA and Trevanion, S and Vandrovcova, J and Vilella, AJ and White, S and Wilder, SP and Zadissa, A and Zamora, J and Aken, BL and Birney, E and Cunningham, F and Dunham, I and Durbin, R and Fernández-Suarez, XM and Herrero, J and Hubbard, TJ and Parker, A and Proctor, G and Vogel, J and Searle, SM. Ensembl 2011. Nucleic Acids Res. (2011). PMID: 21045057.
[15] Leinonen, R and Sugawara, H and Shumway, M. The sequence read archive. Nucleic Acids Res. (2011). PMID: 21062823.
[16] Robinson, JT and Thorvaldsdóttir, H and Winckler, W and Guttman, M and Lander, ES and Getz, G and Mesirov, JP. Integrative genomics viewer. Nat. Biotechnol. (2011). PMID: 21221095.