Florian Halbritter, 27-Nov-2012
Pages
Following below are descriptions and explanations for all the different types of web pages you'll encounter on your journeys through the GeneProf web interface.
GeneProf Homepage
This page is your entry point to GeneProf's web application. From here, you'll be able to access all the different components of the system. Let's have a look at how the page is organized (cp.
this figure).
Notice, first of all, at the very top of the page, a dark greenish navigation bar. The navigation bar provides quick access to most of GeneProf's functions, organized into different categories. The navigation bar is part of every page, and you can use it to jump almost everywhere from anywhere within just a few clicks!
Right below the navigation bar, at the top of the actual page content you'll find a quick introduction to what GeneProf is and what it can do for you. Since you've already come this far, we'll assume you're well aware of GeneProf's purpose and scope by now (you may have a look at
Chapter 'Introduction', though).
On the right side of the page there'll be a number of greenish boxes. If you haven't signed in to your personal GeneProf account yet, there'll be one providing you with links to the login (see
this page) and registration (see
this page) pages. Furthermore, regardless of whether you've signed in or not, two more boxes provide you with some information about recent changes to GeneProf: The most recent public experiment and a section for news and announcements. It might be worth keeping an eye on the news section, just to make you aware of any changes that might have an effect on the way you use GeneProf.
The last items on the page are what's probably most interesting: We've surveyed a number of common tasks people have in mind when coming to GeneProf and you can access those parts of the application directly from this section (adequately labeled Things you might want to do). So what's on offer?
- Search for a Gene / Feature:
Are you interested in what GeneProf can tell you about a certain gene (or other transcriptional unit, such as miRNAs, pseudogenes, etc.)? Just enter your search term in the text box here and press ENTER. You can enter gene names (e.g. 'myc' if you're interested in the Myc gene), external identifiers (from Ensembl, RefSeq and a number of other databases) or probe identifiers (from Affymetrix or other microarray platforms). Exactly which identifiers are supported depends on the organism's gene reference set.
For more information about quick searches, please refer to the documentation of the gene search page (see
this page).
- Browse Public Data:
GeneProf provides access to a wealth of re-analyzed public next-gen data, which can be easily accessed in a number of ways:
- Browse public experiments now! This link will forward you to the experiment browser page (see this page). The experiment browser allows you to flick through, filter and search public experiments and to access each of these experiments in more detail. More information can be found in the documentation of the specific page.
- Browse public data tracks in the Genome Browser. This link will forward you to the genome browser page (see this page). GeneProf's simple in-built genome browser (based on [3]) provides quick visualization of short read data aligned to the genomic reference sequence. Data from different experiments can be easily combined in the same plot and compared. More information can be found in the documentation of the specific page.
- Visualize public data using a range of plots. This link will forward you to the visual data explorer page (see this page). The visual data explorer lets you visualize public data with a range of customizable, selected plots. More information can be found in the documentation of the specific page.
- Analyze your own Next-Gen Data:
Start a new (virtual) experiment by clicking this link, which will forward you to the experiment creation page (see
this page). Please refer to the documentation of the specific page for further information.
Important: You need to be signed in to your personal GeneProf user account to use this feature.
- Learn how to make better use of GeneProf:
User Pages
The pages below are in some way or another connected to GeneProf user accounts.
Registration
Interested in obtaining a GeneProf user account? Great! You'll need an account to gain access to the advanced functionality of the application, such as running your own experiments and analyses. Registration with GeneProf is free of charge.
Registration with GeneProf is simple. You'll just need to complete the form on the registration page (cp.
this figure), filling in some minimum information about yourself. Please provide your first name(s) and family name, as well as a valid email address. We ask you to repeat your email address just to avoid typos! You'll notice that both email text fields turn green if you've entered a valid email address and both email addresses match. If there's a typo (or if the address you've entered does not seem to comply with general guidelines for the format of email addresses) the text fields will be highlighted in red and you'll have a chance to correct your details.
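The colour feedback on the two email fields can be pictured with a small sketch. This is only an illustration of the behaviour described above, not GeneProf's actual validation code: the function name `email_field_state` and the deliberately loose pattern are assumptions.

```python
import re

# Deliberately loose pattern (something@something.tld). This is an assumption
# made for illustration, not the rule GeneProf itself applies.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def email_field_state(email: str, repeated: str) -> str:
    """Return 'green' if both fields hold the same plausible address, else 'red'."""
    looks_valid = EMAIL_PATTERN.match(email) is not None
    return "green" if looks_valid and email == repeated else "red"
```

Both conditions have to hold at once: a well-formed address that is mistyped in the second field still yields 'red'.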
Lastly, you will need to confirm that you're human. Sounds silly, but unfortunately this is necessary to prevent a large number of registrations by anonymous, automated programs. Just enter the two-word phrase displayed in the image into the corresponding text field. If you have problems reading the phrase, you can click the refresh button (the top-most of the three blue buttons next to the phrase) to request a different phrase.
Once you've completed both steps, click the big
Register! button to send off your registration application, which will then be processed by the system. Within a few minutes, you should receive an email at the address that you've provided. The email will contain your initial user name and a randomly generated password. You can use this information to sign in to your new account via the login page (cp.
this page).
After your first login, you can change your user name and password to something more memorable via the user profile (cp.
this page).
User Login
This page (cp.
this figure) allows you to sign in to your personal GeneProf user account. You'll need to sign in in order to gain access to the advanced functionality of the application, such as running your own experiments and analyses.
Please provide your account details (user name and password) in the appropriate text fields. If you wish, you can tick the box saying
Remember me? to store your login details on your local computer and sign in automatically the next time you visit the page. This requires 'Cookies' to be enabled in your web browser (Google provides
a good tutorial on how to enable cookies in all major browsers).
Click the Start GeneProf button to complete the login.
If you've forgotten your user name or password or if you don't have a user account yet, you can follow the links at the bottom of the page to either request a new password or to register for an account (cp.
this page and
this page, respectively).
Forgotten Password
Have you forgotten your password? Don't worry, GeneProf can email you a new randomly generated password within minutes.
Just enter your email address into the appropriate text field on this page (cp.
this figure). You will also need to confirm that you're human. Sounds silly, but unfortunately this is necessary to prevent automated programs from requesting new passwords. Just enter the two-word phrase displayed in the image into the corresponding text field. If you have problems reading the phrase, you can click the refresh button (the top-most of the three blue buttons next to the phrase) to request a different phrase.
When you've completed both steps, click the big Reset Password button and check your email inbox. A new random password should have been emailed to you. Sometimes this can take a few minutes to appear in your inbox.
User Profile
This page (cp.
this figure) provides some essential information about a registered GeneProf user and provides a contact form, which you may use to get in touch with the respective user. Note that some of this information might be hidden if the user chose not to make it public.
Furthermore, the page will summarize all experiments owned by this user which have been made public (if any). The list of experiments will be subdivided into pages (if necessary) and may be sorted, searched and filtered in the same way as on the experiment browser page (see
this page).
If you've registered a user account and visit your own profile, the page additionally contains forms that allow you to change your account details, including your email address and password.
If you are concerned about your privacy, you can choose not to display your name in your profile when other people visit it. Simply un-tick the box labelled Allow other users to view information about you on your profile? and click the Update My Personal Details, Preferences & Privacy Settings button. Similarly, you can prevent other users from contacting you via GeneProf (GeneProf won't ever share your email address directly) by un-ticking the box labelled Allow other users to contact you via GeneProf?.
Experiment Pages
Experiment Main Page
This page is the nexus of everything GeneProf is about! All data in GeneProf is centered around virtual experiments (cp.
Concept 'Experiments') and each experiment, all its associated data, the data analysis workflow used and the main results are summarised here.
As a consequence of the sheer amount of what's in it, we understand the page might look a bit daunting at first, but don't worry: It will all become clear soon!
Firstly, please note that the same page will be used to display your own, in-progress experiments and finished, public experiments (cp.
SubConcept 'Finalization and Publication') and therefore the exact contents of the page depend not only on what the experiment is actually about, but also on the stage of the analysis process the experiment is in. For example, actions and links that are in some way concerned with modifying the experiment (like uploading data, modifying the workflow, running wizards, etc.) will, of course, not be available for public experiments and other experiments which you are not (or no longer) allowed to modify.
So, let's have a look at what's there: Every experiment page is split into five main sections: General information, popular tasks, input data & sample annotation, main outputs and data analysis workflow. More details about each of these sections follow below:
General Information
At the top of the page, general information about the experiment will be displayed. This is, basically, the information you (or the experiment's creator) entered on the experiment creation page (cp.
this page) augmented with some additional information.
In addition to a verbose description of the experiment, you will find links to relevant publications and external web resources. All
reference datasets
used in the experiment will be listed as well as the technology platforms (e.g. sequencing machines) used. You'll also find information about the creator of the experiment (with a link to the user's profile, cp.
this page), the date the experiment was first created and the date when it was last modified. Click the last modified date to open a dialog with the complete analysis history of this experiment (cp.
this page).
Note that there's a link titled Permalink. You can use the URL given here to link to this specific experiment (it includes the experiment's accession number, gpXP_...), e.g. for use in publications.
If this is your own experiment and not yet completed, this section will moreover include a progress indicator, highlighting which steps of the experiment creation process you've already dealt with and what still needs to be done, i.e. to complete a full experiment you ought to provide input data and sample annotation, define and execute an analysis workflow, mark the main outputs of the experiment and then finalise it (cp.
Concept 'Experiments').
Popular Tasks
In order to make your life a little bit easier, GeneProf summarises some of the most common tasks that might be of relevance to this dataset in this section. For in-progress experiments, for example, you'll find quick links to analysis wizards and experiment modification.
Some other popular tasks include: Browsing genomic data from the experiment in the genome browser (cp.
this page), exporting a printable report for this experiment in PDF format (this report will contain all the information covered in the experiment main page and we propose to include such a report in any publication using GeneProf as an analysis tool), exporting browser tracks (cp.
this page) and so on.
Input Data & Sample Annotation
This section of the page sums up the input data used in this experiment and the way in which these data have been annotated. You might want to read up about our interpretation of the concept of 'sample annotation' if you're confused about what you see here, see
SubConcept 'Sample Annotation'.
If you are allowed to modify the experiment, you can upload additional data to your experiment by clicking the
Upload button (which will open the upload dialog:
this page) and change the sample annotation by clicking the
Update Annotation button (which will forward to the sample annotation page:
this page). Note that you have to upload some input data before you can edit the sample annotation.
Main Results
Usually, most people will only be interested in a few of the datasets produced in the course of a GeneProf data analysis workflow (cp.
SubConcept 'Main Results'). Those will be the main results, e.g. lists of genes regulated by a transcription factor, a table of gene expression values or one of the many summary statistics (cp.
ModuleCategory 'Summary Statistics').
The datasets are split into several tabs, and you can scroll through the content of each tab. However, especially on smaller computer screens, the display might at times get quite cramped. You can open each dataset in a new window / browser tab by clicking the button (Too small? Open in new page?) at the top.
For experiment creators: You can choose which datasets are to be displayed in this place by connecting them to a
Main Experimental Results module in the workflow (cp.
module 'Main Experimental Results').
Analysis Workflow
The last item on the experiment summary page is a simplified illustration of the workflow used. In this illustration, all nodes of the same type in the graph (i.e. the workflow steps using the same module) have been collapsed into one and summary statistics modules have been left out for the sake of clarity. You can view the full details of the workflow in the workflow designer by clicking the accordingly labelled button (i.e.
Workflow Designer; cp.
this page).
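To make the simplification concrete, here is a minimal sketch of how such a collapsed view could be derived from a full workflow graph. It is not GeneProf's implementation: the function name, the data layout and the module names in the usage example are all assumptions for illustration.

```python
def simplify_workflow(steps, edges, hidden_categories=("Summary Statistics",)):
    """Collapse steps that use the same module and hide some module categories.

    steps: dict mapping step id -> (module name, module category)
    edges: iterable of (from step id, to step id) pairs
    Returns (set of module names, set of (module, module) edges).
    """
    # Keep only steps whose category is not hidden, remembering their module.
    visible = {
        step_id: module
        for step_id, (module, category) in steps.items()
        if category not in hidden_categories
    }
    nodes = set(visible.values())
    # Using a set merges parallel edges between collapsed nodes. Edges between
    # two steps of the same module are dropped here, which is a simplification.
    collapsed = {
        (visible[a], visible[b])
        for a, b in edges
        if a in visible and b in visible and visible[a] != visible[b]
    }
    return nodes, collapsed


# Hypothetical workflow: two alignment steps feed one peak-calling step.
example_steps = {
    1: ("Upload", "Input"),
    2: ("Align", "Alignment"),
    3: ("Align", "Alignment"),
    4: ("Read Stats", "Summary Statistics"),
    5: ("Peaks", "ChIP-seq"),
}
example_edges = [(1, 2), (1, 3), (2, 4), (2, 5), (3, 5)]
```

Calling `simplify_workflow(example_steps, example_edges)` leaves three nodes (Upload, Align, Peaks) joined by two arrows, which is much easier to read than the five-step original.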
If you are in the process of creating this experiment and have not yet created any workflow, you'll get a selection of workflow wizards (cp.
SubConcept 'Analysis Wizards') in this place instead. N.B. don't miss the drop-down list with further wizards below the big images with the most popular wizards!
Experiment Creation / Modification
Use this page to create a new experiment or modify the general information of an existing experiment. The only mandatory information to be provided is a short name for the experiment and a (slightly more verbose) description. As long as you don't intend to make your experiment public, this information is really just for yourself, but try to fill it in with something sensible anyway, just to help yourself keep track of your own data. You can always return to this page from the experiment main page (cp.
this page) later on to modify the information.
Please try to keep the experiment name short (ideally no more than 30-40 symbols). The description should contain information about the experimental setup, purpose and goal of the study. You can lay it out like an abstract.
Let's have a look at the other input fields. Data Ownership refers to the relation the analysis in this experiment has to the data that is being used: If you're analysing your own primary research data, that would be primary, while, if you are re-analysing data produced elsewhere or within the scope of another study, the type should be secondary. If you are using both, your own and other data, choose mixed.
If the data in this experiment has already been uploaded to the SRA [
15], you can also provide the SRA project accession to link the GeneProf experiment to the submission in this database. Just enter the accession number (the letters SRA, SRP, ERP or ERA followed by six digits) into the respective text box. You may also leave the field blank.
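The accession format described above is easy to check before submitting the form. The following is a minimal sketch under that stated format only; the function name is made up, and the check GeneProf itself performs may differ.

```python
import re

# Prefix SRA, SRP, ERP or ERA followed by exactly six digits, as described above.
SRA_PROJECT_ACCESSION = re.compile(r"(SRA|SRP|ERP|ERA)\d{6}")


def is_acceptable_project_accession(text: str) -> bool:
    """Accept a blank field or an accession in the documented format."""
    value = text.strip()
    return value == "" or SRA_PROJECT_ACCESSION.fullmatch(value) is not None
```

For example, `is_acceptable_project_accession("SRP000712")` is true, while a sample-level accession such as `"SRX000001"` is rejected because it does not carry one of the four project prefixes.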
Published studies may furthermore include citations in their description. Also, if you're using external data, please cite the sources of those. To do so, fill in the Pubmed identifiers of the abstracts for the corresponding articles. You can include up to five citations per experiment. Please also select the citation type, that is, data if you re-use the research data from another publication, analysis if the cited publication presents the analysis given in this experiment or both if the citation details the analysis in this experiment and also presents (for the first time) the data used in the experiment.
Lastly, you may include up to three references to relevant external websites, e.g. your group's homepage or another website or database with further details about this experiment. Please enter a (short) title for the external website and provide the full URL (including http:// prefix).
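The limits on citations and external links can be summarised in a short sketch. This is an illustration of the rules stated above, not GeneProf's form-handling code; the function name and the decision to accept `https://` as well as `http://` are assumptions.

```python
from urllib.parse import urlparse

MAX_CITATIONS = 5       # "up to five citations per experiment"
MAX_LINKS = 3           # "up to three references to relevant external websites"
CITATION_TYPES = {"data", "analysis", "both"}


def check_references(citations, links):
    """Return a list of problems found; an empty list means everything is fine.

    citations: list of (pubmed_id, citation_type) pairs
    links: list of (title, url) pairs
    """
    problems = []
    if len(citations) > MAX_CITATIONS:
        problems.append("too many citations")
    for pubmed_id, citation_type in citations:
        if not str(pubmed_id).isdigit():
            problems.append(f"PubMed identifier is not numeric: {pubmed_id}")
        if citation_type not in CITATION_TYPES:
            problems.append(f"unknown citation type: {citation_type}")
    if len(links) > MAX_LINKS:
        problems.append("too many external links")
    for title, url in links:
        parsed = urlparse(url)
        # Accepting https in addition to http is an assumption of this sketch.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"URL needs a full http:// prefix: {url}")
    return problems
```

A link entered as `example.org` without the prefix is reported as a problem, whereas `http://example.org` passes.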
To confirm your submission, click
Save Changes. You'll be forwarded to the experiment main page (see
this page) of this experiment.
Sample Annotation
Providing sample annotation for the data in your experiment is an essential step in the creation of any experiment (cp.
SubConcept 'Sample Annotation'). Sample annotation eases interpretation of your analysis, helps you to keep track of what's going on and, in some cases, allows GeneProf to do more of the work for you (i.e. appropriately annotated samples make it possible for GeneProf to guess a lot of the values required to configure analysis wizards).
Sample annotation, in GeneProf, comes in the form of an annotation table. Each row in this table is meant to correspond to one biological sample and each column reports annotations of a different type. Each sample has some input data associated with it, i.e. the experimental measurements taken for this sample. One or more files might be linked to the same sample.
If you imported data from the SRA, some annotation has already been gathered from the original data source and you will only have to complete or correct the annotation where necessary. Otherwise, GeneProf will create an empty annotation table for you and you need to provide all information yourself.
You can dynamically add and remove rows from the table using the Add Row button (in the action bar, bottom left of the page) and the remove buttons (red crosses in the row that you want to remove), respectively. You can also add and remove columns (i.e. sample annotation types) from a range of terms by clicking the Add Column button (in the action bar, bottom left of the page) or the remove column button (red crosses next to the column header of the column you want to remove), respectively.
Start by selecting the inputs associated with each sample: For each row, highlight all input datasets that belong to the same biological sample. You can select multiple inputs by keeping the CTRL-key on your keyboard (Mac-users: CMD-key) pressed while you click with the mouse.
For most annotation types, you can just enter a textual annotation value directly into the text boxes in each row. A few annotation types take a controlled vocabulary of annotation terms only. For these, you have to define the input value in a popup dialog that will appear as soon as you start typing in the column or when you click the little pencil icon next to the text field.
When you start typing, GeneProf will try to find similar annotation terms that other users have used (by looking through all public experiments) and suggest those to you. This can help to keep terminologies more consistent by encouraging re-use of the exact same terms. Please try to use suggested terms wherever appropriate. For your convenience, we have also added a way to fill an entire column with the same value: Just fill in the desired value into one cell of a column and double-click this cell to copy the value across to all other cells in this column.
In general, please try to provide as much information as possible (at least those annotations marked important in the listing below!), even annotations that you consider irrelevant to your experiment, if you have the information available. You never know, somebody else might benefit from it! Here's some additional information about the individual annotation types and some guidelines on how to fill them in:
- Age: The age of the individual, animal, cells or pool of any of those that input material was collected from. This is a free-text value.
- Antibody: This is for ChIP-seq samples. The annotation is meant to describe the antibody (or at least the antibody's target) that was used for immunoprecipitation. This is a free-text value.
- Cell Line: The cell line which input material was gathered from. This is a free-text value.
- Cell Type: (Important / Recommended). The cell type which input material was gathered from. This is a free-text value. Please use lower-case only and singular rather than plural. E.g. 'embryonic stem cell' or 'neuron'.
- Description: A (very) short description of the sample. Please use no more than a few words about aspects of the sample not covered in other annotation columns. This is a free-text value.
- Developmental Stage: The developmental stage of the source of the input material, e.g. e15, embryonic, adult. This is a free-text value.
- Gender: The gender of the source of the input material. This is a free-text value, but should usually be only one of: 'male', 'female' or 'mixed'.
- Gene: (Important / Recommended). This annotation type is a way to link your sample with a feature in the reference datasets. We encourage the use of this annotation to denote transcription factors targeted in ChIP experiments or genes experimentally modified (e.g. knocked out) in differential expression studies. Values for this field are internal GeneProf identifiers of features. You can set the values in a dialog, in which you first select the appropriate reference dataset
and then enter a gene name. Select the correct match from the list of suggestions. If you cannot find a gene, try an alternative name or enter a static identifier, e.g. from Ensembl.
- Label: (Important / Recommended). Please enter a label for each sample. The label will probably be used in many plots later on, so please keep it short! This is a free-text value.
- Organism: (Important / Recommended). The NCBI taxon identifier of the organism the input material was collected from. You can select the correct organism in the popup dialog by entering the name of an organism and choosing the correct match from a list of suggestions.
- Platform: (Important / Recommended). The technology platform used to gather experimental data for this sample. In GeneProf, we're mostly dealing with next-gen sequencing data, so this will usually be a high-throughput sequencing machine. Please select a value from the dropdown list in the popup dialog.
- SRA Accession: The accession number of this sample (that's usually an 'experiment' in the terms of the SRA, hence an accession number of the form SRX...), if applicable. The value will be used to construct a link to the SRA, so it should sensibly be a proper accession number!
- Sample Group: (Important / Recommended). Different biological samples might be grouped together if they belong to the same experimental condition. Such is often the case for biological replicates in a differential expression experiment. This is a free-text value. All samples of the same group should have EXACTLY the same value. If you fill in this column appropriately, the RNA-seq wizard will be able to fill in your input data groups for you. This value will probably be used in many plots later on, so keep it as short as possible.
- Strain: The strain (of an organism) that the input material was collected from, if appropriate. This is a free-text value.
- Time: For time-course experiments. The time at which this sample was collected, e.g. '30h' or '14d'. This is a free-text value.
- Tissue: (Important / Recommended). The tissue which input material was gathered from. This is a free-text value. Please use lower-case only and singular rather than plural. E.g. 'liver'.
- Treatment: Experimental manipulation or treatment performed on the sample. E.g. treatment with a drug or knockout of a gene. This is a free-text value, but please try to keep it as concise as possible.
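It may help to picture the annotation table as a list of rows, one per biological sample. The sketch below is only a mental model, not GeneProf's internal data format; the file names, labels and the `group_labels` helper are invented for illustration. It also shows why Sample Group values need to match exactly.

```python
from collections import defaultdict

# Hypothetical annotation table: one dict per biological sample (row),
# one key per annotation type (column), plus the linked input files.
samples = [
    {"Label": "ES rep1", "Sample Group": "ES", "Cell Type": "embryonic stem cell",
     "inputs": ["es_rep1.fastq"]},
    {"Label": "ES rep2", "Sample Group": "ES", "Cell Type": "embryonic stem cell",
     "inputs": ["es_rep2_lane1.fastq", "es_rep2_lane2.fastq"]},
    {"Label": "NPC rep1", "Sample Group": "NPC", "Cell Type": "neural progenitor cell",
     "inputs": ["npc_rep1.fastq"]},
]


def group_labels(rows):
    """Collect sample labels per Sample Group, comparing group names exactly."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["Sample Group"]].append(row["Label"])
    return dict(groups)
```

With the table above, `group_labels(samples)` yields two groups, 'ES' with two replicates and 'NPC' with one. Had the second row been annotated as 'es' instead of 'ES', an exact comparison like this one would treat it as a third, separate group.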
Batch Browser Track Export
This tool allows you to quickly create a set of BED or WIG formatted files (see
this FAQ from the UCSC genome browser) which can be displayed as tracks in many genome browser software suites.
Start by choosing the track type: BED for tracks showing little boxes for each region in the exported dataset (e.g. for analysed ChIP-binding peaks) or WIG for coverage-like plots (good for ChIP-seq binding patterns or RNA-seq expression data). Depending on the track type, you will have a few further options concerning the track layout.
For BED files, you can choose whether you want to export all or only distinct regions in the dataset. In datasets of aligned reads, many regions will occur several times (once per aligned read), thus exporting 'distinct' regions only will keep the output at a more manageable size.
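The difference between the two BED options can be sketched in a few lines. This is an illustration only, using the three mandatory BED columns (chromosome, start, end); the function name is an assumption and GeneProf's exporter may write additional columns.

```python
def to_bed_lines(regions, distinct_only=True):
    """Format regions as minimal three-column BED lines.

    regions: iterable of (chrom, start, end) tuples, with 0-based starts and
    exclusive ends, as in the BED format.
    """
    if distinct_only:
        # Identical regions (e.g. several reads aligned to the same place)
        # collapse into a single entry.
        regions = sorted(set(regions))
    return [f"{chrom}\t{start}\t{end}" for chrom, start, end in regions]


# Two reads aligned to exactly the same locus plus one read elsewhere.
aligned_reads = [("chr1", 100, 136), ("chr1", 100, 136), ("chr2", 5, 41)]
```

Exporting `aligned_reads` with `distinct_only=False` produces three lines, while the default produces two, one per distinct region.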
More options are available for WIG files: Outputs can be smoothed to produce better-looking plots. The plotting window size determines the degree of smoothing (higher number = smoother plot). The intensity at each base position can either be calculated from distinct regions only or by taking all regions into account. People usually look at distinct regions only for ChIP-seq data, in which multiple alignments to the same locus are believed to mostly arise from amplification biases, while RNA-seq data may sensibly take all regions into account (since the sequenced space, i.e. the transcriptome, is smaller, repeated alignments to the same region are more likely to stem from real signals). To create coverage plots, GeneProf will sum up the number of overlapping regions per bin (default size 5bp) across all chromosomes. You can extend regions to a fixed length, if desired. It's common practice to extend ChIP-seq alignments to the average fragment length (or half the average fragment length) to create more informative plots.
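The WIG options described above can be illustrated with a simplified sketch. It is not GeneProf's algorithm: the function names are invented, strand is ignored when extending regions, and a plain centred moving average stands in for whatever smoothing method GeneProf actually uses.

```python
from collections import Counter


def binned_coverage(regions, bin_size=5, extend_to=None, distinct_only=False):
    """Count how many regions overlap each bin on each chromosome.

    regions: iterable of (chrom, start, end), 0-based start, exclusive end.
    Returns a Counter keyed by (chrom, bin start position).
    """
    if distinct_only:
        regions = set(regions)  # count identical alignments only once
    counts = Counter()
    for chrom, start, end in regions:
        if extend_to is not None:
            # Extend short regions to a fixed length (strand is ignored here).
            end = max(end, start + extend_to)
        first_bin = start // bin_size
        last_bin = (end - 1) // bin_size
        for bin_index in range(first_bin, last_bin + 1):
            counts[(chrom, bin_index * bin_size)] += 1
    return counts


def smooth(values, window=3):
    """Centred moving average; a larger window gives a smoother profile."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

With regions `("chr1", 0, 10)` twice and `("chr1", 8, 12)` once, counting all regions gives 2, 3 and 1 for the bins starting at 0, 5 and 10, whereas counting distinct regions only gives 1, 2 and 1.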
Once you've configured the general options, select the datasets you'd like to export by ticking the checkbox next to each dataset. You can change the name and description (in the UCSC browser, for example, the name is displayed to the left and the description on top of each track) and choose a colour for each track.
Finally, click the Start Download link to begin creating and downloading the tracks. It may take up to several minutes before your browser prompts you to select a destination for the downloaded file and the entire download process may take a lot longer, so please be patient.
Batch Rename Datasets
We've added this tool for your convenience to quickly change the names of several datasets in your experiment without having to re-run analysis steps in your workflow.
The page will list all datasets in your experiment. Just alter the dataset names in the text boxes of those datasets you'd like to change and click Submit Changes.
Experiment Browser
This page allows you to browse through all experiments in GeneProf's public repository. If you are logged in to your GeneProf user account, the listed experiments will also include your own experiments (even if they are not yet public; however, other people will, of course, not be able to see those!).
The experiments are organized in a big table which is divided into several pages. You can skip to the next / previous page using the links (Next Page and Previous Page, respectively) at the bottom left of the page.
If you're looking for a specific experiment, you can dynamically filter the table by several criteria, e.g. by its internal GeneProf accession number (gpXP_...), name or owner (i.e. the experiment creator). To apply a filter, simply enter the text you are looking for into the textbox under the respective header and confirm with your ENTER-key. For Platform(s) and Status you can select from a list of all possible values in a dropdown list. By default, text searches will look for an occurrence of your search text anywhere in the corresponding column of the experiments, but you can change this behaviour by changing the search operator:
- ~ : find a match anywhere in the string (default).
- = : the entire string has to match your search exactly.
- != : the entire string MUST NOT match your search string.
- !~ : your search string MAY NOT appear ANYWHERE in the corresponding column of matched experiments.
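The four search operators can be summarised in a small sketch. It mirrors the descriptions above but is not GeneProf's implementation; the function name is invented, and case-sensitive comparison is an assumption, since case handling is not specified here.

```python
def matches(cell: str, operator: str, query: str) -> bool:
    """Apply one of the four filter operators to a single table cell.

    Comparison is case-sensitive in this sketch (an assumption).
    """
    if operator == "~":    # match anywhere in the string (default)
        return query in cell
    if operator == "=":    # entire string must match exactly
        return cell == query
    if operator == "!=":   # entire string must not match
        return cell != query
    if operator == "!~":   # query must not appear anywhere
        return query not in cell
    raise ValueError(f"unknown search operator: {operator}")


# Hypothetical experiment names to filter.
names = ["ChIP-seq of Oct4", "RNA-seq time course", "Oct4"]
oct4_anywhere = [n for n in names if matches(n, "~", "Oct4")]
oct4_exact = [n for n in names if matches(n, "=", "Oct4")]
```

Here `oct4_anywhere` keeps the first and third names, while `oct4_exact` keeps only the third.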
You can furthermore sort the table by clicking the header of a specific column. By default, the table will be sorted in ascending order by the values in this column. Click again to reverse the sort order (i.e. descending). You can also combine sorting with filters.
Click either the accession number or the
Magnifying Glass button to open the experiment's main page (cp.
this page).
The Workflow Designer
Welcome to GeneProf's graphical workflow designer! Now, this is probably the most complex and feature-rich page in the entire application. It might look a bit scary at first, but don't worry, it's really quite straightforward to use!
The workflow designer page (cp.
this figure) will be used for viewing existing data analysis workflows (cp.
Concept 'Workflows') and also to create new ones from scratch or to modify those created with an analysis wizard (cp.
SubConcept 'Analysis Wizards'). If you're viewing an existing workflow from somebody else's experiment, you won't, of course, be able to modify it, and the modification components of the workflow designer will be disabled (or hidden). In this case, simply ignore those parts of this page's documentation that refer to modifying workflows.
Visually, GeneProf's workflows consist of boxes and arrows. Boxes represent steps in the analysis process and arrows show how these steps are linked together, i.e. the output of one process might become the input of another (cp.
this figure). You can track what happened to any one dataset by following its development through the entire workflow along the arrows.
Procedurally, one first constructs a workflow by adding workflow components to the workflow canvas and connecting steps with arrows. Afterwards, workflow steps can be customized in more detail by altering the parameters of a workflow component. At this stage, all workflow components will be
pending, i.e. the underlying analysis has not been executed yet. Pending workflow components have dashed borders (cp.
this figure). When you're done constructing the workflow, you can execute it. The experiment will be entered into a processing queue on the GeneProf compute cluster and each component will soon be executed one step at a time. Workflow components that have been successfully executed are then marked with solid borders. If a process fails to complete correctly, the box will have red borders. You can then check the workflow history (cp.
this page) to find out what went wrong. After a workflow has been executed, you can add additional steps without having to re-run those components that have already been executed. If you change parameters of existing components, though, or if you change their inputs, the components (and all those depending on them) will be marked pending again and you have to re-run the analysis.
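The rule that a changed component and everything depending on it become pending again amounts to a walk through the workflow graph. The sketch below illustrates that idea only; the function name, the status labels and the component names are assumptions, not GeneProf internals.

```python
from collections import deque


def mark_pending(changed, downstream, status):
    """Mark the changed component and all components depending on it as pending.

    changed: name of the component whose parameters or inputs were altered
    downstream: dict mapping a component to the components consuming its outputs
    status: dict mapping component -> 'pending' or 'executed' (updated in place)
    """
    queue = deque([changed])
    seen = {changed}
    while queue:
        component = queue.popleft()
        status[component] = "pending"
        for dependent in downstream.get(component, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return status


# Hypothetical workflow in which every step has already been executed.
downstream = {"upload": ["align"], "align": ["peaks", "stats"], "peaks": ["annotate"]}
status = {name: "executed" for name in ["upload", "align", "peaks", "stats", "annotate"]}
mark_pending("align", downstream, status)
```

After the call, only 'upload' is still marked as executed. Everything from 'align' onwards needs to be re-run, while newly added steps that hang off untouched components would not affect the existing results.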
Going into a bit more detail, each
workflow component (= each box) may take several inputs and outputs -- these are represented as little boxes on the left (inputs) and right (outputs) of the workflow components. We call them '
ports' and it's these ports that are connected by arrows. Inputs and outputs are datasets of a certain type (cp.
Concept 'Datasets'). The appearance of the ports (i.e. their colour and symbol) indicates which kind of data is expected at each point and you can only connect output ports of one type to input ports of the same type. Usually, a workflow component needs to be fully connected (that is, all input ports have to have an input connected to them), but some inputs might be optional. Optional input ports are marked with a
question mark. Some input ports can take more than one input connection. These ports are marked with a
plus sign. You can connect different components easily by
drag & drop: Just click the output port of one component with your mouse, then keep your mouse button pressed while dragging the arrow over to the input port (of the same colour) of another component and release the mouse button. These are the different port types and respective icons used:
You can add additional workflow components from the
workflow palette. The workflow palette is the grey box at the right of the workflow designer (you can drag & drop it around, if it's in the way). In the palette, you'll find all supported modules (cp.
SubConcept 'Workflow Modules' and
Chapter 'Modules'). They are divided into tabs and categories by purpose and scope: The tabs give a rough grouping into basic / general purpose modules (
General), modules for RNA-seq and ChIP-seq data (
RNA/ChIP) and meta-modules (that is, modules that combine several others into concise, higher-order modules:
Meta). Each module is represented as a box and the inputs it expects and outputs it will produce are clearly marked. You can hover over a module to get a short description. If you want to add an instance of a module, just drag & drop it over to the workflow canvas.
Another way to add a component is to duplicate an existing component. To do so, simply select the component you want to duplicate and click the
duplicate icon (

). You can remove components in a similar manner: Just click the
delete icon (

) next to a selected component.
Right, so having explained the basics, let's look at the workflow components in more detail. Apart from the name of the module used, each workflow component contains some additional information and links (cp.
this figure): You can configure the behaviour of each module (or examine the parameters that have been used in a public experiment) by clicking the
Module Parameters link. A new, module-specific dialog will open that allows you to customize all parameters this module takes (please refer to the documentation of each module for more details, cp.
Chapter 'Modules'). You need to save the workflow after adding new components or changing connections before you can configure the modules. Furthermore, there are labels for all inputs and outputs of the module. These labels appear in the same order as the ports, so if you're confused about which port is which, check the order of these labels. The
output labels are links to the details of the output dataset. So if you want to examine intermediate outputs of a workflow, which are not included in an experiment's main outputs, click one of these links.
Further actions are available in the action bar: You can
Save the workflow (

) or
Execute it (

). If you want to get rid of an entire workflow and start from scratch, click the
Purge icon (

). This will delete all workflow components except the inputs. If you'd like to quickly examine the parameters of multiple processes, you can switch to a more detailed view of each workflow component by clicking the
Toggle View icon (

): Each workflow component will then include a list of all parameters inside its box. Click the button again to switch back to the simple view. You can also directly upload additional data to the workflow (cp.
this page) and examine the experiment history (cp.
this page).
Wizard Pages
Please see the following sections for detailed explanations of the individual analysis wizard (cp.
SubConcept 'Analysis Wizards') pages along with use-case examples.
All-in-one RNA-seq Analysis Wizard
What the wizard will do..
We assume that each dataset reports short read sequences representing transcripts (or fragments thereof), possibly from different biological conditions and (ideally) with a number of replicates for each condition. Each biological / experimental condition is considered one group of samples and the individual datasets belonging to each group will (initially) be processed separately, i.e. each sequenced library will be independently subjected to a quality control step (where the user specifies how strict this quality control should be) and then aligned to the
reference
genome or transcriptome (cp.
SubConcept 'Reference Data' and
SubConcept 'Alignment').
The workflow will then calculate gene expression read counts for each dataset on the basis of aligned reads overlapping known transcriptional features (cp.
SubConcept 'Feature Data'), assigning ambiguously aligned reads proportionally to their most likely regions of origin. The read counts are reported normalised as RPM (reads per million) and RPKM (reads per kilobase per million).
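To make the two normalisations concrete, here's a small sketch using the standard definitions of RPM and RPKM. The function names are ours, and exactly which reads GeneProf counts towards the library total (e.g. all aligned reads or only those assigned to features) is an assumption here, so please check the module documentation for the details.

```python
def rpm(read_count, total_reads):
    """Reads per million: scales a raw count by the library size."""
    return read_count * 1e6 / total_reads


def rpkm(read_count, feature_length_bp, total_reads):
    """Reads per kilobase per million: additionally scales by feature length."""
    return read_count * 1e9 / (feature_length_bp * total_reads)


# Hypothetical gene: 500 reads, 2,000 bp long, in a library of 10 million reads
gene_rpm = rpm(500, 10_000_000)
gene_rpkm = rpkm(500, 2_000, 10_000_000)
```

RPM makes libraries of different sequencing depth comparable, while RPKM also corrects for the fact that longer transcripts collect more reads at the same expression level.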
If there is more than one sample group, the wizard will also calculate the statistical significance of observed differences between groups (for every possible comparison) and filter the results by a given significance threshold (again, for every possible comparison between two groups).
Moreover, the wizard will supplement the analysis with useful statistics about the input sequence before and after quality control, the read alignment and the gene expression patterns in all groups detected.
How to configure the wizard
The wizard page enables you to customize the analysis process in a minimum of hands-on time by simply answering a few basic questions about the nature of your input data, the quality control and alignment steps and the calculation of gene expression scores and their comparison.
Step 1: Experiment Design
If you're starting off with raw files (i.e. data that has not yet been parsed into a compatible dataset, cp.
SubConcept 'Parsing'), you now need to specify an appropriate parser for the sequence
data files
you have uploaded (usually those should be in either FASTA or FASTQ format). Actually, GeneProf will try to guess the format of the files for you -- that usually works quite well, but if you actually do know the format of your files, you might not want to take any risks and specify the format yourself. In particular the quality score format (cp.
SubConcept 'Quality Scores and Formats') can be tricky to guess. If in doubt, ask your sequencing provider about the exact format of the files they've provided you with. If you already have
sequences
as input data (e.g. after importing data from the SRA or from another experiment (cp.
Tutorial 'Uploading Data to GeneProf')), you don't need to worry about this step.
Next you need to tell GeneProf what the input data is really all about, that is, you need to specify which input datasets belong together. The idea here is that you create one 'group' for each biological / experimental condition. For example, you might have expression data from different cell types / tissues; in this case you should create one group per cell type / tissue. Or you might have a treatment and control-type experiment (e.g. before and after administration of a drug, knockdown of a gene, etc.): Again, each of the two conditions should be one group in terms of the wizard. You can add additional groups (if required) by using the Add Group button. In each group, select all input datasets that belong to it. You can select more than one sample in each select box by keeping your CTRL-key (Mac users: CMD-key) pressed while clicking with the mouse. You can remove all selections from a group by clicking on Clear. If there are no selections in a group, the entire group will be ignored. Please also enter a descriptive, but short label for each group.
If you have provided adequate sample annotation previously (cp.
this page), the wizard will try to guess the group layout for you. However, please make sure that the selection is sensible! If the samples have been annotated by
Sample Group, the wizard creates one group for each distinct value of this annotation column and puts all associated input datasets in the respective groups.
If there are fewer than 6 groups in total, you can choose to compare each group against each other to check for differentially expressed genes. This option is activated by default, but if you have too many groups, it will be deactivated (the reason for this is that the number of possible comparisons rises very quickly: 2 groups = 1 comparison, 3 groups = 3 comparisons, 4 groups = 6 comparisons, 5 groups = 10 comparisons, 6 groups = 15 comparisons, 7 groups = 21 comparisons, 8 groups = 28 comparisons, etc.). Of course, you can still use GeneProf's module for differential expression analysis later on via the workflow designer (cp.
this page) defining exactly which comparisons you want to make yourself.
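The numbers quoted above follow from counting all unordered pairs of groups, i.e. n * (n - 1) / 2 comparisons for n groups. A quick sketch (the function names and group labels are purely illustrative):

```python
from itertools import combinations


def n_comparisons(n_groups):
    """Number of distinct pairwise comparisons between n groups."""
    return n_groups * (n_groups - 1) // 2


def pairwise_comparisons(groups):
    """List every pair of groups that would be compared."""
    return list(combinations(groups, 2))


# Reproduces the series given in the text for 2 to 8 groups
counts = [n_comparisons(n) for n in range(2, 9)]

# Hypothetical group labels
pairs = pairwise_comparisons(["ES cells", "Fibroblasts", "Neurons"])
```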
Step 2: Quality Control & Alignment
Proceed by specifying how GeneProf should deal with your data in terms of quality control. The wizard allows you to choose between four levels of strictness here: By default, only a very lenient filter will be applied to the data removing only very low quality reads. If in doubt, this option will do for a start. Note that you can always change the parameters of the quality control step later on via the workflow designer (cp.
this page). Please refer to the documentation of the quality control and pre-processing module (cp.
module 'Modify and Filter Sequences') for details about the different options specified by each strictness level. You might also want to read up about quality scores and how they help to detect erroneous reads (cp.
SubConcept 'Quality Scores and Formats').
Next, tell GeneProf a little about the sample preparation / sequencing strategy you've followed. There are, essentially, two different ways of studying transcription using next-generation sequencing technologies: One can either sequence randomly primed parts of transcript sequences (this is usually referred to as RNA-seq) or otherwise transcripts may be restriction-enzymatically digested and the digested fragments are sequenced afterwards (this is known as digital tag profiling, Tag-seq, DeepSAGE and many other names). There are further variations of these protocols, but, in terms of data analysis, they are basically the same.
So, if you have performed an RNA-seq experiment, just keep the default, otherwise select either one of the given Tag-seq options (if you've used either NlaIII or DpnII as a restriction enzyme) or enter custom values in the text boxes to match the behaviour of your restriction enzyme. The idea here is that the restriction enzyme recognizes a specific DNA subsequence, so one can attach this known subsequence to the front of each read and then trim the read to a fixed length, which is again given by the behaviour of the restriction enzyme.
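The Tag-seq pre-processing boils down to two string operations per read: prepend the recognition sequence, then cut to a fixed length. Here's a minimal sketch; the `preprocess_tag` function, the example read and the tag length of 21 are assumptions for illustration only (NlaIII recognizes CATG and DpnII recognizes GATC, but please use the values that match your own protocol).

```python
def preprocess_tag(read, prefix, length):
    """Attach the recognition sequence to the front of the read, then trim."""
    return (prefix + read)[:length]


# Hypothetical NlaIII library: recognition site CATG, tags trimmed to 21 bases
read = "ACGTACGTACGTACGTACGTAAAA"
tag = preprocess_tag(read, prefix="CATG", length=21)
```

Since every genuine tag must start right after a recognition site in the transcript, restoring the site makes the alignment more specific, and trimming removes bases beyond the tag that do not stem from the transcript.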
Afterwards, choose how to do the sequence alignment (cp.
SubConcept 'Alignment'): You can use one of two popular tools for alignment: Bowtie (default; [
4]) or TopHat [
5]. Both of these programs are great tools for aligning sequences to a reference sequence (genome / transcriptome) and, in fact, TopHat internally makes use of Bowtie. TopHat has the additional benefit that it can map short reads that cross exon-boundaries by using a compendium of known splice junctions. If you have rather long reads in your input libraries, or if you're dealing with paired-end / mate-pair reads, the probability that reads span multiple exons is actually quite high and you might be well advised to use TopHat. If you have short reads (e.g. from a Tag-seq procedure) or if you'd like to align directly to known transcript sequences, Bowtie will do a great job and it's a lot quicker than TopHat.
You can further customize the behaviour of the alignment software (no matter whether you use TopHat or Bowtie), by adjusting the number of mismatches allowed and the permitted alignment ambiguity. When one aligns sequences to a genome, one typically allows for a number of mismatches between each sequence to be aligned and the reference (chromosome) sequence. This accommodates polymorphisms and sequencing errors. You can use the slider to decrease or increase the number of mismatches from 0 (none) to up to 3. We have found the default of 2 to work quite well in general. You might, however, want to increase the number for very long sequences (say longer than 50bp) or decrease it for short sequences (less than 30bp).
Not all sequences can be aligned uniquely to one position in the genome. You can either discard ambiguous alignments (by dragging the slider in such a way that A = 1) or allow for up to 10 possible alignments. Reads that have more than 10 possible alignments will be discarded and those with a low degree of ambiguity will be assigned to all possible loci giving a higher weight to loci where many other reads aligned to (which are assumed to be more likely the actual origin of the ambiguous reads; cp.
module 'Quantitate Gene Expression' for further details). Please note that increasing the number of mismatches might increase the overall ambiguity of alignments, so it's important to find some sort of trade-off between ambiguity and alignment coverage. If in doubt, you can keep the default for now and experiment with other parameters later on via the workflow designer (cp.
this page).
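To illustrate the idea of weighting ambiguous reads by the support at each candidate locus, here's a simplified sketch. It assumes that the weight of a locus is proportional to the number of uniquely aligned reads found there and that a read is split evenly if none of its candidate loci has any unique support. Both are our assumptions for the sake of the example; the actual procedure is described in the module documentation.

```python
def assign_ambiguous(unique_counts, ambiguous_reads, max_alignments=10):
    """Distribute ambiguous reads over their candidate loci.

    unique_counts:   dict locus -> number of uniquely aligned reads
    ambiguous_reads: list of lists, each holding the candidate loci of one read
    """
    counts = {locus: float(c) for locus, c in unique_counts.items()}
    for loci in ambiguous_reads:
        if len(loci) > max_alignments:
            continue  # too ambiguous: the read is discarded
        support = [unique_counts.get(locus, 0) for locus in loci]
        total = sum(support)
        for locus, s in zip(loci, support):
            # Proportional share, or an even split without any unique support
            weight = s / total if total > 0 else 1.0 / len(loci)
            counts[locus] = counts.get(locus, 0.0) + weight
    return counts


# Hypothetical loci: geneA has far more unique support than geneB,
# so it receives the larger share (0.9) of the ambiguous read.
result = assign_ambiguous({"geneA": 90, "geneB": 10}, [["geneA", "geneB"]])
```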
Step 3: Gene Association, Quantitation & Differential Expression
In the last step of the analysis, gene expression intensities will be calculated by summing up read counts that aligned to known genes. By default, only reads aligning to exons will contribute towards the expression counts, but you may wish to include intronic alignments as well. Furthermore, you need to tell us if your sequencing libraries are strand-specific. Normal RNA-seq protocols do not retain strand-information and reads on either strand will be counted towards the expression of a gene. However, a number of modified protocols have been published that maintain strand-specificity, so if you're using one of these protocols, make sure to tell the wizard about it, so we can calculate more accurate read counts (you'll also get additional anti-sense read counts for each gene).
Note that the last two choices are only needed if you decided to align to the genome. If you're using Bowtie instead to align directly to known transcript sequences, the association to known genes is trivial and no parameters need to be set.
Lastly, decide on a significance threshold for differential expression between experimental conditions. By default, the p-value (FDR-adjusted as calculated using [
13]) is required to be less than or equal to 0.05, which is moderately stringent, but you might want to relax this criterion a bit and possibly filter the data further later on.
Once you're satisfied with all parameter settings, click the big
Accept Settings & Create Workflow button. GeneProf will now construct an analysis workflow according to your wishes and present an overview of it in the next page. If you are happy, you can execute the analysis right then and there, otherwise you can choose to look at the workflow in detail in the workflow designer (cp.
this page), where you can further customize the analysis, or you can completely discard the workflow and return to the experiment main page.
Example
All-in-one ChIP-seq Analysis Wizard
What the wizard will do..
We assume that each dataset reports short read sequences representing fragmented DNA enriched for interaction with a DNA-binding protein. You may study more than one such protein at a time and, at the same time, have more than one dataset for each factor. Different datasets for the same protein will be merged into one and subsequently subjected to a quality control step. The wizard allows the user to specify how strict this quality control step should be.
All sequences passing quality control will then be aligned (using
Bowtie [
4]) to the genome of a chosen
reference
(cp.
SubConcept 'Reference Data') and a peak finding algorithm (
MACS [
2]) will be used for each protein individually to detect regions of enriched binding (so called 'peaks'). This step, ideally, makes use of a background signal (e.g. randomly fragmented input DNA or DNA precipitated with a non-specific antibody) to achieve a more reliable detection of enrichment, although the wizard allows users to run the analysis without a background signal.
The peaks for all proteins are merged into one
genomic region
dataset (cp.
SubConcept 'Genomic Region Data'). GeneProf will then find genes neighboring these putative binding sites and report the genes close to binding sites of each protein (i.e. genes potentially regulated by the protein). Optionally, the wizard will also calculate the transcription factor association strength (
TFAS [
9]) between each feature in the reference and each protein in the input data.
Moreover, the wizard will supplement the analysis with useful statistics about the input sequence before and after quality control, the read alignment and the ChIP-peaks detected.
How to configure the wizard
The wizard page enables you to customize the analysis process in a minimum of hands-on time by simply answering a few basic questions about the nature of your input data, the quality control and alignment steps and the peak detection and gene association.
Step 1: Experiment Design
If you're starting off with raw files (i.e. data that has not yet been parsed into a compatible dataset, cp.
SubConcept 'Parsing'), you now need to specify an appropriate parser for the sequence
data files
you have uploaded (usually those should be in either FASTA or FASTQ format). Actually, GeneProf will try to guess the format of the files for you -- that usually works quite well, but if you actually do know the format of your files, you might not want to take any risks and specify the format yourself. In particular the quality score format (cp.
SubConcept 'Quality Scores and Formats') can be tricky to guess. If in doubt, ask your sequencing provider about the exact format of the files they've provided you with. If you already have
sequences
as input data (e.g. after importing data from the SRA or from another experiment (cp.
Tutorial 'Uploading Data to GeneProf')), you don't need to worry about this step.
Next you need to tell GeneProf what the input data is really all about, that is, you need to specify which input datasets belong together. The idea here is that you create one 'group' for each DNA-binding protein that you ChIP'ed. If you did several ChIPs for the same protein, but under different experimental conditions, each should go into a separate group. You can add additional groups (if required) by using the Add Group button. In each group, select all input datasets that belong to it. You can select more than one sample in each select box by keeping your CTRL-key (Mac users: CMD-key) pressed while clicking with the mouse. You can remove all selections from a group by clicking on Clear. If there are no selections in a group, the entire group will be ignored. Please also enter a descriptive, but short label for each group.
Control data (e.g. randomly fragmented input DNA or DNA precipitated with a non-specific antibody) should not be added as a regular group, but instead be added to the background signal group (that exists by default, below the other groups in the wizard page). If you do not have any background signal, you can leave the group empty.
If you have provided adequate sample annotation previously (cp.
this page), the wizard will try to guess the group layout for you. However, please make sure that the selection is sensible! If the samples have been annotated by
Sample Group, the wizard creates one group for each distinct value of this annotation column and puts all associated input datasets in the respective groups. Watch out! If one of your sample groups was something like 'Control', there will be a group for that now, although it should actually be the background signal. You need to clear the selection in the group the wizard has created for you and instead select the respective dataset(s) in the background signal group. If no
Sample Group annotation is available, the wizard will fall back to the
Gene annotation, assuming that this details the protein that has been ChIP'ed. So the wizard will create one group for each different gene annotated and put all samples without a
Gene annotation into the background signal group.
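The guessing rules just described can be summarised in a few lines of code. This is only a sketch of the logic as explained above: the `guess_groups` function and the dictionary layout of the samples are made up for illustration, and what happens to samples lacking a Sample Group value when other samples do have one is not specified here, so the sketch simply leaves them unassigned.

```python
def guess_groups(samples):
    """Guess the ChIP-seq group layout from sample annotations.

    samples: list of dicts with a 'name' key and optional
             'Sample Group' / 'Gene' annotation keys
    Returns (groups, background): a dict label -> sample names, and the
    list of sample names placed in the background signal group.
    """
    groups, background = {}, []
    use_sample_group = any(s.get("Sample Group") for s in samples)
    key = "Sample Group" if use_sample_group else "Gene"
    for s in samples:
        label = s.get(key)
        if label:
            groups.setdefault(label, []).append(s["name"])
        elif not use_sample_group:
            # Fallback mode: no Gene annotation means background signal
            background.append(s["name"])
    return groups, background


# Hypothetical samples annotated by Gene only
by_gene = guess_groups([
    {"name": "chip1", "Gene": "Oct4"},
    {"name": "chip2", "Gene": "Nanog"},
    {"name": "input1"},
])

# Hypothetical samples annotated by Sample Group: note that 'Control' ends up
# as a regular group and has to be moved to the background group by hand.
by_group = guess_groups([
    {"name": "chip1", "Sample Group": "Oct4"},
    {"name": "input1", "Sample Group": "Control"},
])
```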
Step 2: Quality Control & Alignment
Proceed by specifying how GeneProf should deal with your data in terms of quality control. The wizard allows you to choose between four levels of strictness here: By default, only a very lenient filter will be applied to the data removing only very low quality reads. If in doubt, this option will do for a start. Note that you can always change the parameters of the quality control step later on via the workflow designer (cp.
this page). Please refer to the documentation of the quality control and pre-processing module (cp.
module 'Modify and Filter Sequences') for details about the different options specified by each strictness level. You might also want to read up about quality scores and how they help to detect erroneous reads (cp.
SubConcept 'Quality Scores and Formats').
Optionally, you can also decide to trim reads to a fixed length prior to quality control. This might make sense if you know that the sequenced reads are longer than the actual biological sequences that went into the process or if you think that for another reason sequence information beyond a certain point is unreliable or nonsensical to use. If you do not want to trim the reads, just leave the field blank or set it to zero.
Next, choose how to do the sequence alignment (cp.
SubConcept 'Alignment'). At the moment, the only parameter you can change here is the number of mismatches allowed. When one aligns sequences to a genome, one typically allows for a number of mismatches between each sequence to be aligned and the reference (chromosome) sequence. This accommodates polymorphisms and sequencing errors. You can use the slider to decrease or increase the number of mismatches from 0 (none) to up to 3. We have found the default of 2 to work quite well in general. You might, however, want to increase the number for very long sequences (say longer than 50bp) or decrease it for short sequences (less than 30bp).
Not all sequences can be aligned uniquely to one position in the genome. This wizard will discard all sequences for which the alignment is ambiguous. Please note that increasing the number of mismatches might increase the overall ambiguity of alignments, so it's important to find some sort of trade-off between ambiguity and alignment coverage. If in doubt, you can keep the default for now and experiment with other parameters later on via the workflow designer (cp.
this page).
Step 3: Peak Calling & Feature Association
The last step is all about how you want to go about detecting enriched binding peaks in the aligned data (enriched binding peaks correspond to putative binding sites, or so the theory). GeneProf makes use of a peak detection algorithm called
MACS [
2]. If you provide a background signal, this algorithm will estimate an empirical false discovery rate (FDR), which can be used to filter out peaks that have a high likelihood of being false. By default, the threshold is set so that all accepted peaks have an FDR less than or equal to 0.1 (that is, theoretically, we expect no more than 10% of all accepted peaks to be wrong), but you can adjust this threshold in whichever way you like. Also, if you do not have a background signal, the FDR threshold will be ignored.
You'll also need to fill in the average length of the sequenced DNA fragments in your ChIP libraries. This is a parameter that influences the peak calling procedure of MACS and, unfortunately, we cannot guess this from your data. Usually, that's something between 200 and 500bp, but it really depends on your sample preparation procedure. By default, the fragment length is set to 400bp, which is a value we've frequently observed in published studies.
Lastly, you need to decide how the detected peaks should be associated to features. Traditionally, the idea was that transcription factors bind in the promoter regions of the genes they regulate, but it is becoming increasingly evident that this is not always the case. GeneProf, by default, associates a binding peak with a (potentially, but not necessarily regulated) feature, if the binding peak is up to 20kb upstream or up to 1kb downstream of that feature's transcription start site (TSS). If several features are within this range of the peak, all features will be linked to the binding site. You can change the window sizes (upstream and downstream) or decide only to associate the closest feature, if you like.
In addition to the binary assignment of binding sites to genes described above, we recommend calculating a continuous score that indicates how strongly a binding factor associates with each gene. GeneProf can calculate a transcription factor association score on the basis of the enrichment of the TF binding signal over the background signal in all binding peaks near the TSS of each gene, according to a formula proposed by [
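Here's a rough sketch of how such a window-based association could work. Note that 'upstream' and 'downstream' depend on the strand of the feature, so the offset to the TSS has to be flipped for features on the reverse strand. The `associate_peaks` function, the tuple layouts and the use of a single coordinate per peak (e.g. its centre) are simplifying assumptions of this example, not a description of GeneProf's internals.

```python
def associate_peaks(peaks, features, upstream=20_000, downstream=1_000,
                    closest_only=False):
    """Link each peak to the features whose TSS window contains it.

    peaks:    list of (chrom, position)
    features: list of (name, chrom, tss, strand) with strand '+' or '-'
    """
    result = {}
    for chrom, pos in peaks:
        hits = []
        for name, f_chrom, tss, strand in features:
            if f_chrom != chrom:
                continue
            # Offset in the direction of transcription:
            # negative = upstream of the TSS, positive = downstream
            offset = (pos - tss) if strand == "+" else (tss - pos)
            if -upstream <= offset <= downstream:
                hits.append((abs(offset), name))
        hits.sort()  # nearest TSS first
        if closest_only:
            hits = hits[:1]
        result[(chrom, pos)] = [name for _, name in hits]
    return result


# Hypothetical features and peaks
features = [
    ("geneA", "chr1", 100_000, "+"),
    ("geneD", "chr1", 101_000, "+"),
    ("geneB", "chr1", 115_000, "-"),
]
all_hits = associate_peaks([("chr1", 95_000), ("chr1", 120_000)], features)
nearest = associate_peaks([("chr1", 95_000)], features, closest_only=True)
```

In this example, the peak at 95,000 lies 5kb and 6kb upstream of geneA and geneD, respectively, so it is linked to both. For geneB, which is on the reverse strand, the same peak lies 20kb downstream of the TSS and is therefore not linked, whereas the peak at 120,000 lies 5kb upstream of it.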
9]. It has been suggested that these continuous scores correlate well with the regulation of gene expression by transcription factors. By default, GeneProf will calculate the TFAS, if you, however, do not wish to do so (e.g. this might make little sense for non-TF ChIP data), you can disable this feature by un-ticking the checkbox.
Once you're satisfied with all parameter settings, click the big
Accept Settings & Create Workflow button. GeneProf will now construct an analysis workflow according to your wishes and present an overview of it in the next page. If you are happy, you can execute the analysis right then and there, otherwise you can choose to look at the workflow in detail in the workflow designer (cp.
this page), where you can further customize the analysis, or you can completely discard the workflow and return to the experiment main page.
Example
Quality Control Wizard
What the wizard will do..
This wizard has been designed to make it easier to add some quality control steps for your sequencing data to an existing workflow. Unlike, say, the RNA-seq or ChIP-seq wizard (cp.
this page and
this page), this wizard does NOT create an entire analysis workflow, but rather creates the components necessary for one step of the entire procedure!
The wizard will add a pre-processing and quality control module for each selected dataset, which will individually perform read trimming and quality filtering according to the settings chosen by the user.
How to configure the wizard
The wizard page lists all ready-processed
sequence datasets
(cp.
SubConcept 'Sequence Data') in your experiment. Click the checkboxes next to all datasets you'd like to apply the wizard to.
If desired, you can change the name of the modified output datasets (i.e. after QC), by altering the name given in the individual textboxes.
You can now configure the behaviour of the wizard in two aspects: Pre-processing and quality control. Pre-processing refers to modifications applied to the reads in a sequencing library prior to any other quality control.
This wizard currently supports only two methods of pre-processing: You can trim reads to a fixed length (to do so, tick the checkbox and drag the slider to the desired length) and you can attach a prefix (i.e. a fixed DNA subsequence) to the beginning of each read (to do so, tick the checkbox and enter the prefix into the textbox). These options have been motivated by DeepSAGE / Tag-seq libraries, for which these pre-processing steps are essential. The trimming step, in particular, might be useful for other scenarios as well.
Proceed by specifying how GeneProf should deal with your data in terms of quality control. The wizard allows you to choose between four levels of strictness here: By default, only a very lenient filter will be applied to the data removing only very low quality reads. If in doubt, this option will do for a start. Note that you can always change the parameters of the quality control step later on via the workflow designer (cp.
this page). Please refer to the documentation of the quality control and pre-processing module (cp.
module 'Modify and Filter Sequences') for details about the different options specified by each strictness level. You might also want to read up about quality scores and how they help to detect erroneous reads (cp.
SubConcept 'Quality Scores and Formats').
That's it! Once you're satisfied with all parameter settings, click the big
Accept Settings & Create Workflow button. GeneProf will now construct an analysis workflow according to your wishes and present an overview of it in the next page. If you are happy, you can execute the analysis right then and there, otherwise you can choose to look at the workflow in detail in the workflow designer (cp.
this page), where you can further customize the analysis, or you can completely discard the workflow and return to the experiment main page.
Alignment Wizard
What the wizard will do..
This wizard has been designed to make it easier to add sequence alignment (cp.
SubConcept 'Alignment') for several
sequence datasets
to an existing workflow. Unlike, say, the RNA-seq or ChIP-seq wizard (cp.
this page and
this page), this wizard does NOT create an entire analysis workflow, but rather creates the components necessary for one step of the entire procedure!
How to configure the wizard
The wizard page lists all ready-processed
sequence datasets
(cp.
SubConcept 'Sequence Data') in your experiment. Click the checkboxes next to all datasets you'd like to align.
If desired, you can change the name of the output datasets (i.e. after alignment), by altering the name given in the individual textboxes.
Next, choose the reference dataset (cp.
SubConcept 'Sequence Data') giving the genome sequence that you would like to align to by picking one from GeneProf's list of recommended reference sets.
The wizard will use
Bowtie [
4] to align the sequences to the chosen genome. You can configure the behaviour of the aligner by changing two parameters: The number of mismatches and the level of ambiguity allowed.
When one aligns sequences to a genome, one typically allows for a number of mismatches between each sequence to be aligned and the reference (chromosome) sequence. This accommodates polymorphisms and sequencing errors. You can use the slider to decrease or increase the number of mismatches from 0 (none) to up to 3. We have found the default of 2 to work quite well in general. You might, however, want to increase the number for very long sequences (say longer than 50bp) or decrease it for short sequences (less than 30bp).
Not all sequences can be aligned uniquely to one position in the genome. You can either discard ambiguous alignments (by dragging the slider in such a way that A = 1) or allow for up to 10 possible alignments. It is possible to use ambiguously aligned reads for gene expression data (cp.
module 'Quantitate Gene Expression' for further details), but for ChIP-seq data most people decide to discard ambiguous alignments, although this is by no means imperative. If in doubt, you can keep the defaults for now and experiment with other parameters later on via the workflow designer (cp.
this page).
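To make the two sliders a little more concrete, here is a minimal Python sketch of how they might translate into a Bowtie (version 1) command line. The mapping onto Bowtie's `-v`, `-m` and `-k` options, the function name `bowtie_command` and the file names are assumptions made for illustration only; this manual does not state which options GeneProf actually passes to the aligner.

```python
def bowtie_command(index, reads, out_sam, mismatches=2, max_alignments=1):
    """Build a Bowtie 1 argument list (hypothetical mapping of the wizard's sliders)."""
    if not 0 <= mismatches <= 3:
        raise ValueError("mismatches must be between 0 and 3")
    if not 1 <= max_alignments <= 10:
        raise ValueError("max_alignments must be between 1 and 10")
    return [
        "bowtie",
        "-v", str(mismatches),      # allow up to this many mismatches per read
        "-m", str(max_alignments),  # suppress reads with more alignments than this
        "-k", str(max_alignments),  # report up to this many alignments per read
        "-S",                       # write SAM output
        index, reads, out_sam,      # assumed index name and file names
    ]


# Defaults described above: 2 mismatches, unique alignments only (A = 1).
print(" ".join(bowtie_command("mm9_index", "reads.fastq", "aligned.sam")))
```

With `max_alignments=1` a read that maps to more than one position is dropped entirely, which corresponds to discarding ambiguous alignments.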
That's it! Once you're satisfied with all parameter settings, click the big
Accept Settings & Create Workflow button. GeneProf will now construct an analysis workflow according to your wishes and present an overview of it in the next page. If you are happy, you can execute the analysis right then and there, otherwise you can choose to look at the workflow in detail in the workflow designer (cp.
this page), where you can further customize the analysis, or you can completely discard the workflow and return to the experiment main page.
Gene Expression Wizard
What the wizard will do..
This wizard has been designed to make it easier to add the necessary components for deriving gene expression counts from aligned short read data to an existing workflow (using this module:
module 'Quantitate Gene Expression'). Unlike, say, the RNA-seq or ChIP-seq wizard (cp.
this page and
this page), this wizard does NOT create an entire analysis workflow, but rather creates the components necessary for one step of the entire procedure!
How to configure the wizard
The wizard page lists all ready-processed
genomic region datasets
(cp.
SubConcept 'Genomic Region Data') in your experiment. It is assumed here that these stem from an alignment of a next-gen sequencing library that in some way reports reads from transcript sequencing. If that is not the case, the results of the gene expression quantification might make little sense. Click the checkboxes next to all datasets you'd like to use.
Gene expression intensities will be calculated by summing up read counts that aligned to known genes. By default, only reads aligning to exons will contribute towards the expression counts, but you may wish to include intronic alignments as well. Furthermore, you need to tell us if your sequencing libraries are strand-specific. Normal RNA-seq protocols do not retain strand-information and reads on either strand will be counted towards the expression of a gene. However, a number of modified protocols have been published that maintain strand-specificity, so if you're using one of these protocols, make sure to tell the wizard about it, so we can calculate more accurate read counts (you'll also get additional anti-sense read counts for each gene).
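The counting idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not GeneProf's implementation: the tuple layouts, the function name `count_expression` and the rule that a read counts once per gene it overlaps are all hypothetical, and coordinates are treated as inclusive.

```python
from collections import defaultdict


def count_expression(reads, exons, strand_specific=False):
    """Sum read counts per gene.

    reads: iterable of (chrom, start, end, strand)
    exons: iterable of (gene, chrom, start, end, strand)
    Returns {gene: {"sense": n, "antisense": n}}.
    """
    counts = defaultdict(lambda: {"sense": 0, "antisense": 0})
    for r_chrom, r_start, r_end, r_strand in reads:
        # Collect every gene with at least one exon overlapping the read,
        # so that a read spanning two exons of one gene is counted once.
        hit_genes = {}
        for gene, chrom, start, end, strand in exons:
            if chrom == r_chrom and r_start <= end and r_end >= start:
                hit_genes[gene] = strand
        for gene, strand in hit_genes.items():
            if strand_specific and r_strand != strand:
                counts[gene]["antisense"] += 1
            else:
                counts[gene]["sense"] += 1
    return dict(counts)


reads = [("chr1", 100, 135, "+"), ("chr1", 120, 155, "-"), ("chr1", 500, 535, "+")]
exons = [("GeneA", "chr1", 90, 200, "+")]
print(count_expression(reads, exons))
print(count_expression(reads, exons, strand_specific=True))
```

To include intronic alignments in this sketch, one would pass whole gene spans instead of individual exons.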
That's it! Once you're satisfied with all parameter settings, click the big
Accept Settings & Create Workflow button. GeneProf will now construct an analysis workflow according to your wishes and present an overview of it in the next page. If you are happy, you can execute the analysis right then and there, otherwise you can choose to look at the workflow in detail in the workflow designer (cp.
this page), where you can further customize the analysis, or you can completely discard the workflow and return to the experiment main page.
Dataset Pages
The pages below are all in some way or another connected to datasets (cp.
Concept 'Datasets').
Dataset Overview Page
This is the main overview page of a GeneProf dataset (cp.
Concept 'Datasets'). The page usually gives a brief overview of the dataset and might include a sample of the full data, but exactly what information is displayed on the page depends on the type of dataset. Any dataset overview page (regardless of its type) can be linked to using the
permalink given on this page. Note, however, that access to datasets in experiments which have not been made public (cp.
SubConcept 'Finalization and Publication') is subject to access restrictions and you might have to log into your user account (cp.
this page) before you can display the page.
Dataset Overview for raw data files
: In addition to the name of the dataset, the page will include the name of the file originally uploaded and a sample of the file's contents (the first 10 lines). You can download the unprocessed file by clicking the Download File button at the bottom of the page.
Dataset Overview for sequence datasets
: In addition to the name of the dataset, the page includes the total number of sequences in the dataset (
Size) and a sample of the first 10 sequences in the dataset (in the order read into the dataset, which can probably be considered random for all practical purposes). You can click the ID (GeneProf's internal identifier) to find out more about any one entry. You may export the data in this dataset in a variety of formats (e.g. FASTA or FASTQ) by clicking the
Export Data button (see
this page).
Dataset Overview for genomic region datasets
: In addition to the name of the dataset, the page includes the total number of distinct regions in the dataset (
Size). Watch out! Don't confuse the number of regions with the number of aligned reads (for genomic region datasets reporting alignments of short-read libraries): Each genomic region might have more than one read aligned to it and also one read might align to more than one region (alignment ambiguity!). The page includes a sample of the first 10 regions in the dataset (in an order specified by the creator of the dataset or by internal ID, by default). You can click the ID (GeneProf's internal identifier) to find out more about any one entry. You may export the data in this dataset in a variety of formats (e.g. BED or WIG) by clicking the
Export Data button (see
this page), produce customized high-quality plots based on the data in this dataset by clicking the
Plot Data button (see
this page) or open a dynamic browser table to look at all entries in this dataset by clicking the
Browse Complete Data Table button (see
this page).
Dataset Overview for feature datasets
: In addition to the name of the dataset, the page includes the total number of features for which the dataset contains information (
Size). These might be all features found expressed in a sample, a list of genes bound by a transcription factor, etc. -- what it is depends entirely on the process that produced the dataset. The page includes a sample of the first 10 features in the dataset (in an order specified by the creator of the dataset or by internal ID, by default). You can click the ID (GeneProf's internal identifier) to find out more about any one entry. You may export the data in this dataset in a variety of formats (e.g. Excel Spreadsheets or tab-delimited text files) by clicking the
Export Data button (see
this page), produce customized high-quality plots based on the data in this dataset by clicking the
Plot Data button (see
this page) or open a dynamic browser table to look at all entries in this dataset by clicking the
Browse Complete Data Table button (see
this page).
Dataset Overview for reference datasets
: The main body of this page gives essentially the same information as the one for
feature datasets
. The information in the reference dataset will typically contain a rich set of annotation for each feature in the reference, in particular, a lot of identifiers from other databases that can be used to draw connections to other data. Like other datasets, the features in the reference can be browsed and exported (see
this page and
this page). A reference dataset is furthermore supplemented with two additional
sequence datasets
, one for chromosome sequences and one for transcript sequences. Both of those are linked on this page and you can investigate them in more detail by clicking the accession number.
Dataset Overview for 'special' datasets
: There is no fixed layout for the display of a 'special' dataset. Typically, the page will include links to downloadable files, tables with summary statistics and / or plots. The information shown on this page is governed by the module producing the dataset, e.g. the summary statistics modules:
ModuleCategory 'Summary Statistics'.
Dataset Details Browser
GeneProf's dynamic table browser (cp.
this figure) is a convenient tool for looking through large table-like datasets, be it
genomic regions
,
features
or whatever other type of tabular data (cp.
Concept 'Datasets'). The behaviour of this tool is very much the same as that of the experiment browser (cp.
this page).
Large tables are divided into pages and you can easily advance to the next or go back to the previous page using the page control in the bottom left of the screen. The tables can be sorted by the values in one column by clicking the header of this column. Click the same header again to toggle the sort order (ascending / descending).
You can also filter the table dynamically (and without modifying the underlying dataset) using the filter input fields underneath the table headers. To apply a filter, simply enter the text you are looking for into the textbox under the respective header and confirm with your ENTER-key. For some columns (those with a limited range of values), you can select from a list of all possible values in a dropdown list. By default, text searches will look for an occurrence of your search text anywhere in the corresponding column of the experiments, but you can change this behaviour by changing the search operator: ~ = find a match anywhere in the string (default), = = the entire string has to match your search exactly, != = the entire string MUST NOT match your search string and !~ = your search string MAY NOT appear ANYWHERE in the corresponding column of matched records.
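The four search operators can be summarised with a small Python sketch. The function names, the row layout and the case-insensitive comparison are assumptions made for illustration; the manual only defines what each operator is meant to match.

```python
def matches(value, operator, query):
    """Apply one of the table browser's search operators to a single cell value."""
    # Assumption: comparisons ignore case.
    value, query = str(value).lower(), str(query).lower()
    if operator == "~":    # query appears anywhere in the value (default)
        return query in value
    if operator == "=":    # the entire value equals the query
        return value == query
    if operator == "!=":   # the entire value must not equal the query
        return value != query
    if operator == "!~":   # query may not appear anywhere in the value
        return query not in value
    raise ValueError("unknown operator: " + operator)


def filter_rows(rows, column, operator, query):
    """Return the rows whose cell in `column` passes the filter; rows are not modified."""
    return [row for row in rows if matches(row[column], operator, query)]


rows = [
    {"Name": "Nanog", "Type": "protein_coding"},
    {"Name": "Nanogpd", "Type": "pseudogene"},
    {"Name": "Sox2", "Type": "protein_coding"},
]
print(filter_rows(rows, "Name", "~", "nanog"))   # both Nanog entries
print(filter_rows(rows, "Name", "=", "nanog"))   # only the exact match
```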
For most datasets, you can add additional information (i.e. more columns) to the table. For example, for
feature datasets
you can benefit from the rich set of annotations that comes with the
reference dataset
its features refer to, such as the feature type (protein coding, short RNA, etc.) or identifiers from many biological databases. To change which columns are to be displayed, open the
Display Options (in the 'additional actions' part of the page, cp.
this figure and
this page).
Importantly, if you change the columns displayed in one of your own, unfinalised datasets, the change will be stored as part of the dataset and the next time you open this page GeneProf will remember your settings. If, on the other hand, you change the display options of a public or finalised dataset (cp.
SubConcept 'Finalization and Publication'), the changes will be lost when you leave the page.
Further actions might be available depending on the type of dataset you are looking at and on whether it's your own, editable data or a public or finalised dataset. Apart from the display options, you can always decide to export the table data in a variety of formats (cp.
this page) and you may create plots based on the data in the table (cp.
this page). For editable datasets, you may also have the option to apply a permanent filter to the data (cp.
this page) or to calculate additional columns (cp.
this page). N.B. the latter two options will NEVER change the dataset itself, but rather create an additional step that will be added to the workflow and produce a new dataset with the chosen modifications.
Genome Browser
GeneProf features a simple, integrated genome browser built on
GenomeGraphs [
3]. The genome browser is meant to provide you with a tool to quickly examine and compare genomic alignments from your own and public sequencing experiments and does by no means constitute a full-blown replacement for such great tools as the
Ensembl [
14,
7],
UCSC [
12,
8],
IGV [
16] or other genome browsers [
6,
11,
10]. Note that GeneProf supports batch exports for browser tracks which should be compatible with the aforementioned tools (cp.
this page).
So, that being said, let's have a look at the browser page. It consists of four parts:
- Navigation, preferences and track selection:
To change the region displayed, select the right chromosome from the drop-down menu and enter the start- and end-coordinate. Alternatively, you can enter the name of a gene in the last text box and select the correct entry from a list of suggestions. GeneProf will then automatically fill in the chromosome, start and end coordinate for you. Click the Submit Query button to update the plot displayed below.
If you'd like to customize the data included in the browser, open the display options and track selection dialog by clicking
Choose Tracks & Other Display Options (cp.
this page). You can change the image width and height per track to suit your screen resolution. Further options:
- Hide transcript variants?: By default, GeneProf will display features located in the displayed region, but not all their transcript variants / splice forms. If you untick this box, all transcript variants will be shown separately, each on a separate line.
- Separate strands?: Tick this box to split read coverage by the strand the genomic regions in each respective dataset are located on. This applies to WIG-like tracks only. By default, strand-information will be ignored and the coverage will be calculated as if all regions were located on the same strand.
- Normalise intensities?:
Change this option to rescale counts to 'reads per million': GeneProf will divide each coverage value x by the total number of regions in the dataset and multiply the result by 1,000,000. This applies to WIG-like tracks only. This is particularly useful if you want to compare tracks reporting alignments from transcript sequencing datasets in order to avoid being confused by vastly different sizes of input libraries. The counts can either be normalised by dividing by the total number of distinct alignments, by the total number of aligned reads or by using a custom scalar that you may enter for each track individually.
- Scale y-axis?: Change this option to alter the scale for all WIG-style (see below) tracks shown. This might make for more meaningful comparisons. You can either scale all y-axes so that they use the same minimum/maximum (automatically determined across all tracks) or set a custom minimum/maximum yourself.
Besides these general display options, you can choose which tracks to display in the browser (
genomic region datasets
, cp.
SubConcept 'Genomic Region Data'). Available tracks include an assortment of all public data available in GeneProf. If you open the genome browser from an experiment, this selection will also include all genomic region datasets (cp.
SubConcept 'Genomic Region Data'). This is a great way to quickly compare your own data to those of others!
To select a track, browse through the table at the bottom of the page (you can search, sort and filter it in the same way as other GeneProf tables, cp. for example,
this page) and click the
Add next to the track (add the tracks in the desired order). You can afterwards change the way the track is to be displayed in the browser by changing its label or colour to anything you like or by selecting a different track type. For public tracks, a sensible track type should be selected by default, but feel free to change it anyway! For tracks from your own experiments, GeneProf will try to guess a sensible type. The available track types are:
- BED-like tracks:
BED-like tracks will display each individual region in the dataset as a box spanning from its start to end coordinate.
- DistinctBED: Shows only distinct genomic regions, i.e. if the same region (e.g. alignment coordinates) appears more than once in the dataset only one of them will be shown.
- AllBED: Show all genomic regions in the dataset. WARNING: For large datasets of short read alignments, there might be many, many such regions and the plots can easily appear very crowded.
- WIG-like tracks:
WIG-like plots can be used to visualize the coverage of regions across the genome (which is great for RNA-seq and ChIP-seq reads). GeneProf will sum up the number of regions (e.g. aligned reads) covering each individual base position and display one point in the browser plot at a height corresponding to the coverage count.
- DistinctWIG: Consider only distinct genomic regions, i.e. if the same region (e.g. alignment coordinates) appears more than once in the dataset, it will only contribute once to the coverage count. This is the recommended display method for ChIP-seq alignments.
- AllWIG: Consider all genomic regions. This is the recommended display method for RNA-seq and Tag-seq alignments.
- Named regions:
Use this plot type for limited-size (say less than half a million) region-of-interest datasets, such as ChIP-seq binding peaks. These datasets ought to contain one or more columns with boolean flags that indicate for each region whether it belongs to a certain group. A record will be shown in the genome browser for each such region where a flag is TRUE and the name of the column will be used as a label. Use the Choose labels.. link to choose the columns to be used for labels.
There's one additional option for WIG-like plots: You can choose to extend each region prior to calculation of coverage counts to a fixed length. This is often done for ChIP-seq alignments, where it makes sense to extend regions to the average length of sequenced DNA fragments (or half the fragment length). If you don't want to extend regions, enter '-1' into the respective text box or leave it blank, GeneProf will then use the regions' actual length.
- Browser plot:
This section displays the region of the genome currently selected. You can zoom in / out and skip to the right / left using the buttons on the top right.
Note that, by default, the plot only includes one track showing the features of the
reference dataset
(cp.
SubConcept 'Reference Data') in the region at hand. You can add many additional datasets using the track selector (see above).
- Plot export / persistence:
You can export the current browser plot in a variety of high-quality formats for inclusion in presentations, publications, grant proposals or other documents. Just select the desired format and size and click the Export! button. The exported plot will load in a new window and can subsequently be saved in whatever way you usually save images in your browser of choice (on most systems by right-clicking the image and selecting 'Save image as..' or a similar option).
It's also possible to save the current genome browser session permanently so you can quickly return to it at a later point in time. All you need to do to persist the current session is to enter a short label into the relevant text box and click
Save!. Note that this will overwrite any other saved sessions with the same name! You can later on restore any saved sessions from the genome browser start page (cp.
this page).
- Genome browser session navigation:
If there are multiple genome browser sessions active at the moment, the bottom of the page will display a small thumbnail of each session. You can click any one of them to quickly jump between sessions.
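The WIG-like track options described above (distinct versus all regions, optional extension to a fixed length and 'reads per million' scaling) can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the browser's actual code: the function names, the inclusive coordinates and the choice to extend regions in the direction of their strand (and never to shorten them) are all hypothetical.

```python
def coverage(regions, start, end, distinct=False, extend_to=-1, scale=None):
    """Per-base coverage for positions start..end (inclusive) on one chromosome.

    regions: list of (start, end, strand) tuples, e.g. aligned reads.
    distinct: count identical regions only once (DistinctWIG-like behaviour).
    extend_to: lengthen shorter regions to this many bases; -1 keeps actual lengths.
    scale: optional factor applied to every coverage value.
    """
    if distinct:
        regions = set(regions)
    cov = [0] * (end - start + 1)
    for r_start, r_end, strand in regions:
        if extend_to > r_end - r_start + 1:
            # Assumption: extend downstream with respect to the region's strand.
            if strand == "-":
                r_start = r_end - extend_to + 1
            else:
                r_end = r_start + extend_to - 1
        for pos in range(max(r_start, start), min(r_end, end) + 1):
            cov[pos - start] += 1
    if scale is not None:
        cov = [c * scale for c in cov]
    return cov


def rpm_scale(total_regions):
    """Scaling factor that turns raw coverage into 'reads per million'."""
    return 1_000_000 / total_regions


reads = [(10, 12, "+"), (10, 12, "+"), (11, 14, "+")]
print(coverage(reads, 10, 14))                  # all regions
print(coverage(reads, 10, 14, distinct=True))   # duplicates counted once
print(coverage(reads, 10, 14, scale=rpm_scale(2_000_000)))
```

The duplicate read at positions 10 to 12 contributes twice in the first call and once in the second, which is the difference between the AllWIG and DistinctWIG track types.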
Visual Data Explorer
The Visual Data Explorer (VDE; cp. also this tutorial:
Tutorial 'Examining Public Next-Gen Data..') has been designed to make it easy (or at least easier) to discover patterns across many of these datasets in an intuitive, visual way.
The idea here is that you choose a collection of datasets of the same type and from the same organism, group those by an annotation category, i.e. a common property that they share, and then visualise the values in the datasets with one of the available types of plots.
So, to get started, select a
reference dataset
(cp.
SubConcept 'Reference Data'). The VDE lists all public references, which have public datasets associated to them.
Next select a data type. There are five types to choose from:
- Has Binding Site? (Yes / No): A binary value for each gene indicating whether this gene has a nearby binding site for the DNA-binding proteins in the selected datasets (0 = no, 1 = yes). The definition of what is nearby is given by the creator of each individual dataset, but is usually given as: Up to 20kb upstream of the transcription start site (TSS) or at most 1kb downstream of it. The binary gene assignments should be taken with care: An enrichment of binding does not necessarily imply regulation (nor does a lack of such binding in this limited window say that there is no regulation happening). The assignments are furthermore subject to thresholding and differences between libraries and might, at times, vary drastically between different studies. The reasons for this are (yet) not entirely clear and deserve further investigation.
- TFAS: The TFAS (transcription factor association strength; cp. module 'Calculate TFAS' and [9]) calculated for each feature in the reference dataset with respect to the factors (DNA-binding protein) in the selected datasets on the basis of the enrichment in binding towards a background signal (fold-change) in putative binding sites on the same chromosome and given as a function of distance to the TSS. For example, a gene will have a high TFAS if there are many strong binding sites close to its TSS and a low score if there is only one distant binding site.
- Expression (raw) / Expression (RPM) / Expression (RPKM): Estimates of gene expression intensity calculated from RNA-seq and Tag-seq libraries by summing up read counts aligning to each gene's locus. The numbers are either un-normalized read counts ([raw]), normalized to account for differences in library size (reads per million; [RPM]) or normalized to account for differences in library size and for different transcript lengths (reads per kilobase million, cp. [1]; [RPKM]).
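The relationship between the three expression types can be written down as a small Python sketch. The function names and the example numbers are made up for illustration, and using the sum of all per-gene counts as the library size is an assumption (a pipeline might equally use the total number of aligned reads, which can be passed in explicitly).

```python
def rpm(raw_counts, library_size=None):
    """Reads per million: raw count scaled by library size."""
    total = library_size if library_size is not None else sum(raw_counts.values())
    return {gene: count * 1_000_000 / total for gene, count in raw_counts.items()}


def rpkm(raw_counts, lengths_bp, library_size=None):
    """Reads per kilobase of transcript per million reads."""
    total = library_size if library_size is not None else sum(raw_counts.values())
    return {
        gene: count * 1_000_000_000 / (total * lengths_bp[gene])
        for gene, count in raw_counts.items()
    }


raw = {"GeneA": 500, "GeneB": 1500}          # hypothetical raw read counts
lengths = {"GeneA": 1000, "GeneB": 3000}     # hypothetical transcript lengths in bp
print(rpm(raw))
print(rpkm(raw, lengths))
```

In this example GeneB has three times the raw count of GeneA but is also three times as long, so both genes end up with the same RPKM value.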
Depending on which organism you have chosen, not all of these types might be available -- that is, because GeneProf does not (yet) have enough public data for all supported organisms. Mind, that this might change in the future as we add more data to the system, so it might be worth checking again later on if you are missing relevant data for your organism of choice.
Proceed by selecting all datasets you'd like to take into account. Click the
Select Dataset(s).. button, which opens a popup dialog listing all datasets of the chosen reference and type (see
this page for help). Pick as many as you like, but bear in mind that plots might become overcrowded and patterns might be disturbed by comparing completely unrelated samples.
Now select a criterion by which to group the data (Group data by..). Note that not all samples will have sensible annotation values for each given category. We strive to complete the annotation as well as possible, yet not always all information is available and / or applicable for each given sample. If no annotation is available, the given samples will appear in the plots as Unknown or N/A. You don't have to group the data, but it remains questionable whether the VDE can tell you much without any grouping.
Lastly, decide which type of plot you'd like to create. GeneProf's VDE currently supports three types:
- Histograms: Histograms visualize the distribution of values per category. The x-axis gives the value divided into a number of bins and the y-axis specifies how often a value in the given bin has been observed in the datasets at hand. For more details about histograms please refer to this Wikipedia article. Example use case: Comparing how strongly different transcription factors associate with how many different genes (data type: TFAS, group by: Gene or Antibody).
- Correlation Matrix: A common way to compare one or more samples, conditions or, in fact, any list of numbers is to calculate the (Spearman) correlation between them (check this Wikipedia article for a good explanation). If one calculates this correlation for more than two samples, one might lay out the correlation coefficients calculated for each pair-wise comparison in a two-dimensional table, which is called the correlation matrix. The VDE can calculate the correlation matrix for your selected data and visualise the results in a clustered heatmap. This is a brilliant way to visualise the similarity between different datasets. Example use case: Comparing DNA-binding patterns of different proteins (data type: Has Binding Site? (Yes / No), group by: Gene or Antibody).
- Principal Component Analysis: Another way of discovering similarity patterns between samples is to use principal component analysis (PCA). Now, this is mathematically a bit complicated, but the idea is to reduce the dimension of the data by extracting the most explanatory values from it (the principal components = PCs). Each dataset contributes to a different degree to the individual PCs and we can visualise these contributions in a two-dimensional plot, i.e. plotting one PC against one other PC. GeneProf only plots the first three PCs (PC1 vs. PC2, PC2 vs. PC3 and PC1 vs. PC3), which are those that give most information about the data. Similar datasets should cluster together in these plots. However, the 3 PCs might not manage to represent the data sufficiently. The VDE therefore includes a fourth plot (top left) which tells you approximately how much of the data is summarized by the PCs. Look at the third point on the red line to see an estimate of how much (percent) of the data is summarized by the first three PCs taken together. For more information about PCA have a look at this Wikipedia article. Example use case: Comparing expression in different cell types (data type: Expression (RPM), group by: Cell Type).
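To illustrate what goes into a correlation matrix plot, the following Python sketch computes pairwise Spearman correlations (Pearson correlation of the ranks, with tied values sharing their average rank). It only shows the general technique: the function names and example values are hypothetical, the clustering and heatmap drawing are left out, and datasets with constant values (zero variance) are not handled.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_matrix(datasets):
    """datasets: {name: list of values, one per gene, in the same gene order}."""
    names = list(datasets)
    ranked = {name: ranks(datasets[name]) for name in names}
    return {a: {b: pearson(ranked[a], ranked[b]) for b in names} for a in names}


data = {"s1": [1, 2, 3, 4], "s2": [10, 20, 30, 40], "s3": [4, 3, 2, 1]}
for name, row in spearman_matrix(data).items():
    print(name, {other: round(rho, 2) for other, rho in row.items()})
```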
Clicking the Update Plot button will refresh the VDE plot (this may sometimes take up to a few minutes, but is usually rather quick).
You can save the plot in a variety of high-quality formats using the Export High Quality Images form at the bottom right of the page.
Gene Search
Looking for information about a particular gene or other feature? Just enter the name of the gene you are looking for into the text box (cp.
this figure) and hit your
ENTER-key or click the
Search button. The page will display a list of all features matching your query, grouped by reference dataset. You can continue on to the details page (cp.
this page) for each individual gene by clicking its accession number.
You can search for gene names or stable identifiers in a number of public databases (Ensembl, RefSeq, etc. -- exactly which identifiers are supported depends on the
reference set
, cp.
SubConcept 'Reference Data'). By default, GeneProf will search through the textual information available for each gene and try to match it to your query (e.g. searching for 'nanog' will match, amongst others, both the
NANOG and the
NANOGP1 genes in human). You can refine your search results by ticking the
Exact matches only-box, which will cause searches to match the entire string. In the previous example, only the
NANOG gene will be returned. Note that searches are case-insensitive, so 'nanog' and 'nANoG' will both return the same result. You can further refine the search by selecting the organism (or rather the reference dataset) from the drop-down box next to the text field.
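The matching rules can be summarised in a short Python sketch. The record layout and the function name `search_genes` are hypothetical, and the sketch only looks at gene names, whereas the real search also covers identifiers and other textual information.

```python
def search_genes(query, genes, exact=False, reference=None):
    """Return the names of genes matching `query`, ignoring case.

    genes: list of dicts with the (assumed) keys "name" and "reference".
    exact: if True, the whole name has to match; otherwise any substring matches.
    reference: optionally restrict the search to one reference dataset.
    """
    needle = query.lower()
    hits = []
    for gene in genes:
        if reference is not None and gene["reference"] != reference:
            continue
        name = gene["name"].lower()
        if (name == needle) if exact else (needle in name):
            hits.append(gene["name"])
    return hits


genes = [
    {"name": "NANOG", "reference": "human"},
    {"name": "NANOGP1", "reference": "human"},
    {"name": "Nanog", "reference": "mouse"},
]
print(search_genes("nanog", genes, reference="human"))               # both human genes
print(search_genes("nANoG", genes, exact=True, reference="human"))   # only NANOG
```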
Record Details
This page provides information about a particular record (a gene, sequence or other data point of any sort) in GeneProf. Depending on the type of data displayed, the contents of the page will vary slightly. Compare
this figure and
this figure for examples of what a gene summary page looks like. We'll first summarize what the page looks like for other kinds of data and then come back to the gene-centric summary example (which is, by far, the most extensive summary page).
For
sequences
(cp.
SubConcept 'Sequence Data', e.g. chromosome or transcript sequences or next-gen sequencing short reads), the page will show the sequence's total length (number of nucleotides), the general nucleotide distribution across the sequence (as a pie chart) as well as the sequence itself (up to the first 480 bases) and any corresponding quality values (if any;
SubConcept 'Quality Scores and Formats'). You'll also have the option to download the sequence in FASTA or FASTQ format.
For
genomic regions
(cp.
SubConcept 'Genomic Region Data', e.g. aligned reads, transcription factor binding sites, genomic bins, etc.) the page simply displays the genomic location (chromosome, start, end, ..), a plot of this region in its genomic neighborhood (including other regions from the same dataset, if any) as well as a table of all annotations available for this region (e.g. identifiers of aligned reads, enrichment values from peak finding algorithms, etc.).
Finally, we return to look at gene-centric summary pages (i.e. those for
features
, cp.
SubConcept 'Feature Data'). These pages might contain a lot more information than those for other types of data. For a start, the page contains all information available for this gene in the
reference dataset
it belongs to (cp.
SubConcept 'Reference Data'), such as its name, identifiers in a variety of databases and its genomic location. By the way, most of this information has been gathered from
Ensembl [
14,
7]. If there's any additional data about this gene in the dataset you're viewing it from (e.g. RNA-seq expression data), it will be displayed alongside the generic information.
Interestingly, below the general information, you'll find an additional section labeled Public GeneProf Data for this Feature. Here, we summarize the public data that is available in GeneProf for this particular feature. Firstly, if any of the public datasets in GeneProf report expression values (e.g. from an RNA-seq experiment) for this feature, there will be a plot detailing the feature's expression across all these datasets. It's a bar chart including confidence intervals, as you might be familiar with. The plot is summarizing expression values by a group of sample annotation values, by default by the cell line the expression samples have been derived from. You can easily change the grouping criterion by clicking any one of the links below the image (it might take a moment to reload, so please be patient).
Secondly, you will furthermore find a subsection concerned with
DNA-binding activity involving this feature -- either actively or passively. Let me explain.. If the feature is a transcription factor (DNA-binding protein), GeneProf might contain ChIP-seq data for this feature. If so, there will be a summary of what's known about this feature's DNA binding activity in the form of two tables: The first table gives an overview of all public datasets reporting genes potentially regulated by this factor and the total number of features reported as bound in each dataset (importantly, this is the number of features (genes) with a binding site for the transcription factor somewhere in the proximity of their transcription start site (TSS, typically at most 20kb upstream and 1kb downstream of the TSS). This is different from the total number of binding sites! One binding site might be in the proximity of more than one feature (i.e. count twice, thrice, ..) or more than one binding site might count only once if they are all near the same feature). You can click the ID of any dataset to continue on to the details page for this dataset and investigate the binding in more detail. Additionally, there will be a list of the 25 features most strongly regulated by this factor. The estimated strength of regulation is calculated as a TFAS score on the basis of binding site enrichments (cp. TFAS module:
module 'Calculate TFAS' and [
9]). If more than one dataset with TFAS scores is available, we'll take an average of all scores to decide the order.
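The difference between the number of bound features and the number of binding sites, and the ranking by averaged score, can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the function names, the tuple layouts, treating each binding site as a single position, the strand-aware window and averaging only over those datasets that report a score for a gene. The scores are simply taken as given; this is not the TFAS calculation itself.

```python
def genes_near_sites(sites, genes, upstream=20_000, downstream=1_000):
    """Return the set of genes with at least one binding site near their TSS.

    sites: list of (chrom, position); genes: list of (name, chrom, tss, strand).
    """
    bound = set()
    for name, chrom, tss, strand in genes:
        # 'Upstream' lies at lower coordinates for + strand genes, higher for - strand.
        if strand == "+":
            lo, hi = tss - upstream, tss + downstream
        else:
            lo, hi = tss - downstream, tss + upstream
        if any(c == chrom and lo <= pos <= hi for c, pos in sites):
            bound.add(name)
    return bound


def top_by_mean_score(scores_per_dataset, n=25):
    """Rank genes by their score averaged over all datasets that report one."""
    collected = {}
    for scores in scores_per_dataset:
        for gene, score in scores.items():
            collected.setdefault(gene, []).append(score)
    means = {gene: sum(vals) / len(vals) for gene, vals in collected.items()}
    return sorted(means, key=means.get, reverse=True)[:n]


sites = [("chr1", 50_000), ("chr1", 50_200), ("chr2", 10_000)]
genes = [("G1", "chr1", 55_000, "+"), ("G2", "chr1", 49_500, "-"), ("G3", "chr2", 100_000, "+")]
print(len(sites), "binding sites,", len(genes_near_sites(sites, genes)), "bound genes")
print(top_by_mean_score([{"G1": 4.0, "G2": 1.0}, {"G1": 2.0, "G3": 5.0}], n=2))
```

In the example, two sites on chr1 both fall near G1 and G2, while the site on chr2 is near no gene, so three binding sites yield two bound genes.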
Lastly, you might be interested in transcription factors potentially regulating this feature's expression. If there's any data available, there will be a list of factors which have one or more binding sites in the feature's proximity along with the tissues, cell types, etc. in which this binding has been observed and the datasets that report these binding sites. Again, you can click the dataset identifiers to continue on to more details.
Processing Queue
The processing queue (cp.
this figure) provides information about the workload currently handled by our servers. Due to their computational demands (next-gen data is big and therefore costly to process) all analysis pipelines are executed on a network of number-crunching machines. Any one of these machines is waiting for new experiments to be submitted to the queue and then these will be picked up sequentially by one of the processors.
This page displays a list of all experiments that are currently in the queue along with their current status (pending = the experiment has not yet been picked up and is awaiting its turn in the queue, executing = the experiment is currently being processed by one of the servers).
For your own experiments, you'll also get an estimate of how much longer the execution of the experiment is expected to take. Beware, this is merely a rough estimate based on previous execution times of similar processes! The estimate might in some cases be way off and should be treated only as a guideline.
Please note that information about experiments owned by other users will be partially obscured to protect their (and their data's) privacy. For your own experiments, you can jump directly to the experiment's main page, where you can check the current progress via the workflow designer (see
this page) and experiment history (see
this page).
Issue Tracking Pages
Issue Tracker
The page provides an overview of all outstanding issues reported in GeneProf's database. The displayed table can be filtered and sorted dynamically; please refer to the documentation of the experiment browser for instructions (cp.
this page).
Click the ID of any issue to view its report in detail (cp.
this page).
Interestingly enough, the same page also allows you to submit new reports yourself. So, if you've noticed something going wrong or if you've come up with a great idea, this is the place to be!
To submit a new report, please first check through the list of open issues to make sure nobody else has already pointed out the same thing (if so, you may wish to comment on the existing issue report, cp.
this page). If there's no report in the system yet, first select whether the submitted issue is a
bug report (a report about a fault you've found in the software), a
feature request (a suggestion for a new component or functionality that would improve GeneProf) or
other (for any other sort of correspondence).
You can make submitted reports private, so other users (apart from the administrators, of course) won't be able to see them, if you're concerned about your privacy. We'd like to encourage you to keep reports public wherever possible, though, to avoid duplicate reports of the same issue.
Enter a short, descriptive title (ideally no longer than 30 characters) and a verbose, detailed description of the issue. Please also read the submission guidelines (
Section 'Reporting Bugs and Feature Req..') for more information about what should be included in the report.
By default, you'll be notified of all updates to this issue by email. If you do not wish to be notified, un-tick the corresponding checkbox.
If you are reporting a bug, it might be advantageous if administrators may access your log files and experiment histories. We realize that these might contain confidential details about your research and, although we assure you that we will indeed treat any information confidentially, if you would not like us to look into relevant reports, just un-tick the corresponding checkbox. If the report is not a bug report, this checkbox will be ignored.
Lastly, you need to confirm that you have read the submission guidelines (
Section 'Reporting Bugs and Feature Req..'). Please do indeed read those to make sure you are aware of all information you are meant to provide in order to make it possible for us to deal with your report in a quick and efficient manner.
Click the Submit Report button to add your new report to the issue database.
Issue Report
This page shows the details of an issue report. An issue may either be a bug (some fault in the software), a feature request (a suggestion about how to improve GeneProf) or any other sort of correspondence.
The report page contains details about who submitted the report, when it was submitted and the current status of the report: New = a newly submitted report to which no administrator has yet responded. Feedback = an administrator has responded to this issue, but it's not yet being dealt with. Assigned = we are currently working on this issue. Closed = the issue has been closed unresolved, either because we were unable to reproduce it or because it has become irrelevant for some other reason. Resolved = the issue has successfully been resolved.
Most importantly, the page contains all correspondence about this issue: Starting with the initial report, any responses from administrators will be listed. Also, any registered GeneProf user may comment on a report (for example, to add their own experiences about a bug or to support a feature request) and such comments will also be listed in the report page.
If you'd like to comment on a report yourself, you can do so by entering your comment into the text box at the bottom of the page. Please read the submission guidelines (see
Section 'Reporting Bugs and Feature Req..') and confirm that you have read them, then submit your comment by clicking on the
Submit Comment button.
Pop-Up Windows and Dialogs
Dataset Display Options
Use this dialog to choose which columns you'd like to show in the dataset browser (cp.
this page). It's easy: There are two boxes listing all columns which are available for this dataset. The box on the right contains currently selected columns and the one on the left contains those that are currently not selected. You can change your selection by double-clicking a column label in either box (which will switch that column over to the other box, i.e. either include it in the list of displayed columns or remove it) or you can select multiple column labels at once and use the
Add and
Remove buttons.
You can change the order in which columns will be displayed by moving the selected columns (in the box on the right) up or down. The column at the top of the list will be displayed as the first column in the browser (left-most). Select one or more columns in the list and use the Up and Down buttons to change the order.
Finally, you can change the way in which the table is to be sorted (by default, when the page is loaded). Just pick the column to sort by from the dropdown list and choose a sort order.
If the dataset you change the display options for is owned by you (i.e. it's in one of your own experiments), the changes will be saved in the dataset and the next time you open the dataset, the previously selected columns will be shown. Otherwise, the display options will only apply to the current page.
Experiment History
This dialog provides an overview of changes to the experiment at hand and its data analysis workflow. It also contains a record of all programs that have been executed in the course of the analysis of this experiment and of their outputs.
The overview table is sorted chronologically. Every time the experiment is executed, a snapshot of the workflow will be taken and stored in the history. You can check in which way the workflow at this time point is different from the latest version (i.e. the one currently in use, which you will see in the experiment's workflow designer, cp.
this page) if you click the small workflow icon next to one of the records.
Outputs of analysis modules (cp.
SubConcept 'Workflow Modules') that have been executed in the past can be examined by clicking the history icon next to a record. This might be particularly helpful if you'd like to find out what went wrong when the execution of a component failed.
Important: The histories of some earlier, public experiments (experiments with an accession number < 100) might be incomplete. That is because the histories, recorded when GeneProf was still under development, contained some information which presented a security risk and we thus had to remove it. We apologize for any inconvenience.
Genome Browser Track Selector and Display Options
Use this dialog to configure general display options and select additional genomic data tracks for the genome browser (cp.
this page).
Firstly, you can change the image width and height per track to suit your screen resolution. Further options:
- Hide transcript variants?: By default, GeneProf will display features located in the displayed region, but not all their transcript variants / splice forms. If you untick this box, all transcript variants will be shown separately, each on a separate line.
- Use same scale on y-axis?: Tick this box to use the same scale for all WIG-style (see below) tracks shown. This might make for more meaningful comparisons.
- Separate strands?: Tick this box to split read coverage by the strand the genomic regions in each respective dataset are located on. This applies to WIG-like tracks only. By default, strand information will be ignored and the coverage will be calculated as if all regions were located on the same strand.
- Normalise heights to PER-MILLION counts?: Tick this box to normalize coverage counts to 'reads per million': GeneProf will divide each coverage value x by the total number of regions in the dataset and multiply the result by 1,000,000. This applies to WIG-like tracks only. This is particularly useful if you want to compare tracks reporting alignments from transcript sequencing datasets in order to avoid being confused by vastly different sizes of input libraries.
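The 'reads per million' normalization described in the last option can be sketched in a few lines of Python. This is only an illustration of the arithmetic given above; the function name and the example numbers are made up.

```python
# Hypothetical sketch of the 'reads per million' scaling described above.

def normalise_per_million(coverage, total_regions):
    """Scale each coverage value x to x / total_regions * 1,000,000."""
    if total_regions <= 0:
        raise ValueError("total_regions must be positive")
    # Multiplying first and dividing afterwards is mathematically identical
    # and avoids a little floating-point noise.
    return [x * 1_000_000 / total_regions for x in coverage]

# A dataset with 2,000,000 regions: a raw coverage of 5 becomes 2.5 per million.
rpm = normalise_per_million([5, 10, 0], 2_000_000)
```

Two tracks normalized in this way can be compared directly, even if one library is several times larger than the other.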
Besides these general display options, you can choose which tracks to display in the browser (genomic region datasets, cp.
SubConcept 'Genomic Region Data'). Available tracks include an assortment of all public data available in GeneProf. If you open the genome browser from an experiment, this selection will also include all genomic region datasets from that experiment (cp.
SubConcept 'Genomic Region Data'). This is a great way to quickly compare your own data to those of others!
To select a track, browse through the table at the bottom of the page (you can search, sort and filter it in the same way as other GeneProf tables, cp. for example,
this page) and click
Add next to the track (add the tracks in the desired order). You can afterwards change the way the track is to be displayed in the browser by changing its label or colour to anything you like or by selecting a different track type. For public tracks, a sensible track type should be selected by default, but feel free to change it anyway! For tracks from your own experiments, GeneProf will try to guess a sensible type. The available track types are:
There's one additional option for WIG-like plots: You can choose to extend each region to a fixed length prior to the calculation of coverage counts. This is often done for ChIP-seq alignments, where it makes sense to extend regions to the average length of sequenced DNA fragments (or half the fragment length). If you don't want to extend regions, enter '-1' into the respective text box or leave it blank; GeneProf will then use the regions' actual lengths.
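As a rough illustration of what such an extension might look like, consider the Python sketch below. It is not taken from GeneProf: the function name, the 1-based inclusive coordinates and the choice to anchor the extension at the region's 5' end (depending on the strand) are all assumptions.

```python
# Hypothetical sketch: resize a region to a fixed length before computing coverage.
# Coordinates are assumed to be 1-based and inclusive.

def extend_region(start, end, strand, fixed_length=None):
    """Return (start, end) after resizing; None or -1 keeps the actual length."""
    if fixed_length is None or fixed_length == -1:
        return start, end
    if strand == "-":
        # On the minus strand the 5' end is the larger coordinate,
        # so the region grows towards smaller coordinates.
        return max(end - fixed_length + 1, 1), end
    return start, start + fixed_length - 1

extended_plus = extend_region(100, 135, "+", 200)     # a 36 bp read stretched to 200 bp
extended_minus = extend_region(1000, 1035, "-", 200)
unchanged = extend_region(100, 135, "+", -1)
```

Note that, in this sketch, a fixed length shorter than the region would shorten it; how GeneProf treats that case is not described here.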
Data Export
You can export all GeneProf-created datasets in a variety of popular formats. Exactly which formats are available might depend on the type of data at hand -- for example, it wouldn't make sense to export gene expression data in a format specific for nucleotide sequences, would it? Check the list below for details about the different export formats.
A note of warning here: Exporting large datasets can be quite time-consuming. When you click any of the download links in this dialog, it might seem as if nothing was happening at all: Not so! Please be patient and refrain from repeatedly clicking the same link. Just wait a bit and a file dialog (asking you where to save the file) should appear in due time.
Supported export file formats:
- Plain Text Files: Any kind of data can be exported in plain text files. The data from the selected columns of the dataset will be written to a text file with each column separated by a certain symbol (by default a TAB, which gives the text file the appearance of a tabular layout). Text files are often the format of choice for computational biologists and computer programmers, since they are easily parsed by other computer programs.
- Spreadsheets: GeneProf can also write datasets (of any kind) in the form of spreadsheets (which you can open in Microsoft Excel, OpenOffice Calc or similar office applications). However, spreadsheets are limited to datasets with at most 65,535 rows!
- XML: Any kind of data can be exported as XML. XML is a format for storing data in a computer-processable manner. These files will, however, be difficult to read for humans and are only useful for software that has been specifically developed to deal with them. Please refer to this Wikipedia article for more information.
- R-Data: Any kind of data may furthermore be exported as an RData object. These files can be loaded directly into R (R project) for further processing.
- BED: For genomic region datasets only. A commonly used format for genome browser tracks. The UCSC Genome Browser details this format in its FAQ. If you want to export browser tracks for more than one dataset, you can use the batch exporter tool (cp. this page) which can be accessed from the experiment main page (this page).
- WIG: For genomic region datasets only. A commonly used format for genome browser tracks. The UCSC Genome Browser details this format in its FAQ. If you want to export browser tracks for more than one dataset, you can use the batch exporter tool (cp. this page) which can be accessed from the experiment main page (this page).
- One-Coordinate File: For genomic region datasets only. A number of software tools dealing with genomic data expect files in a format specifying only one coordinate per region (in addition to the chromosome). We've added this export format, so you can use GeneProf in conjunction with such tools for your data analysis.
- FASTA: For sequence datasets only. FASTA is a commonly used, plain-text file format for nucleotide (or amino acid) sequence information. Each sequence consists of a header line (introduced by the symbol >) with information such as the sequence name and one or more lines containing the actual sequence. More information on Wikipedia.
- FASTQ: For sequence datasets only. FASTQ is a plain-text file format for nucleotide (or amino acid) sequence information with additional per-base quality information. Each sequence consists of a header line (introduced by the symbol @) with information such as the sequence name, followed by a line containing the actual sequence, then another (redundant) header line (introduced by the symbol +) and a line containing the per-base quality values encoded as ASCII characters. FASTQ files exported by GeneProf use Sanger-style quality scores (cp. SubConcept 'Quality Scores and Formats'). More information on Wikipedia.
- Raw Sequences: For sequence datasets only. A simple format for sequence data, containing one nucleotide sequence per line.
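If you'd like to process an exported FASTQ file in your own scripts, the record layout described above is simple enough to parse by hand. The following Python sketch assumes exactly four lines per record and Sanger-style qualities (Phred score = ASCII code minus 33); the function name and the example read are made up, and for real work an established library is probably the better choice.

```python
import io

# Hypothetical sketch: read four-line FASTQ records with Sanger-style qualities.

def read_fastq(handle):
    """Yield (name, sequence, phred_scores) for each record in an open text stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' line, which may repeat the header
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("FASTQ header lines must start with '@'")
        # Sanger-style encoding: Phred score = ASCII code - 33
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores

example = io.StringIO("@read1\nACGT\n+read1\nII5!\n")
records = list(read_fastq(example))
```

Here the quality string 'II5!' decodes to the Phred scores 40, 40, 20 and 0.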
Dataset Details
This dialog displays the same information as the dataset overview page. Please refer to the documentation of that page for more information:
this page.
The dialog might sometimes be a little small to display the dataset appropriately. In this case we recommend opening the dataset in a new window using the New Window button at the bottom right of the dialog.
Plotting
GeneProf features a range of tools for dynamically creating publication-quality plots for your data.
Use this dialog to select an appropriate plot type.
Plotting: Boxplots
Use this dialog to create boxplots (also known as box-and-whisker plots) for your data (cp.
this figure). Boxplots summarize five important statistics about the distribution of values in a group of numbers. You can use them with several sets of values to quickly (yet meaningfully) compare several data columns, e.g. the distribution of expression values (read counts & co.) in various samples.
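The text does not name the five statistics, but conventionally they are the minimum, the lower quartile, the median, the upper quartile and the maximum. Under that assumption, a minimal Python sketch looks as follows. The function name is made up, and plotting software often computes quartiles and whiskers in slightly different ways, so the numbers in an actual plot may deviate.

```python
import statistics

# Hypothetical sketch: the conventional five-number summary behind a boxplot.

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

summary = five_number_summary([1, 2, 3, 4, 5])
```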
Select all data columns you'd like to include in the plot, by selecting them in the box on the right and clicking the Include selected columns in plot! button. By default, GeneProf will select all numeric columns which are displayed in the dataset's detail table. You can remove selections by clicking the Remove button and, if you wish, you may change the labels used in the plot by altering the values in the respective text boxes.
Below, you'll find several more options for customizing the output plot: You can change its appearance, by changing the plot title, colour and format (choose from a list of popular formats), define the resolution and size.
You might also like to apply a transformation to your data before you create the plot: Often converting raw numbers to a logarithmic scale will improve the visual appeal as well as expressiveness of a plot. You can also ignore data rows which have values of 0 for all selected columns or you may try to scale the data in such a way that all columns add up to the same value (1,000,000), although that's generally not a good idea and you should only do this in special cases (e.g. for raw gene expression read counts between different sequencing libraries).
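The three transformations mentioned above can be sketched in Python as shown below. This is an illustration only: the function names are made up, and the base of the logarithm and the pseudocount added to avoid taking the logarithm of 0 are assumptions, since the text does not say how GeneProf handles these details.

```python
import math

# Hypothetical sketch of the data transformations offered in the plot dialogs.
# A dataset is represented as a list of rows, each row a list of numbers.

def drop_all_zero_rows(rows):
    """Ignore rows that are 0 in every selected column."""
    return [row for row in rows if any(v != 0 for v in row)]

def scale_columns(rows, target=1_000_000):
    """Rescale each column so that its values add up to the same total."""
    totals = [sum(column) for column in zip(*rows)]
    return [[v * target / t if t else 0.0 for v, t in zip(row, totals)] for row in rows]

def log2_transform(rows, pseudocount=1.0):
    """Convert raw numbers to a logarithmic scale (base and pseudocount are assumptions)."""
    return [[math.log2(v + pseudocount) for v in row] for row in rows]

rows = [[10, 0], [0, 0], [30, 5]]
kept = drop_all_zero_rows(rows)
scaled = scale_columns(kept)
logged = log2_transform([[0, 1, 3]])
```

After scaling, every column of the example adds up to 1,000,000, regardless of how large the original totals were.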
When you click the Plot! button, your customized plot will be rendered in a new browser window or tab. You can then go back to this dialog and change the parameters to achieve the look you prefer.
Plotting: Co-Occurrence Heatmap
Use this dialog to create what we call a 'Co-Occurrence Heatmap' for your data (cp.
this figure). The plot takes a number of columns as input, interprets their contents as boolean values (yes / no, true / false; all numbers greater than 0 will be interpreted as true, all numbers smaller than or equal to 0 as false) and calculates the frequencies of
TRUE occurring together for each pair-wise comparison of columns. The frequencies are then depicted in a square heatmap with more intense colours representing a higher frequency of co-occurrence. In a way, these plots can serve as a replacement for Venn-diagrams if you have more than three datasets (cp.
this page).
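To illustrate the idea, here is a minimal Python sketch that counts, for each pair of columns, the rows in which both are TRUE. The names and the example data are made up, and whether GeneProf reports raw counts or some normalized frequency is an assumption left open here; the sketch simply uses counts.

```python
# Hypothetical sketch: pairwise co-occurrence counts of TRUE values.

def to_bool(value):
    """TRUE or any number greater than 0 counts as true, everything else as false."""
    if isinstance(value, bool):
        return value
    return value > 0

def co_occurrence(columns):
    """columns: dict label -> list of values. Returns dict (a, b) -> number of rows where both are true."""
    flags = {label: [to_bool(v) for v in values] for label, values in columns.items()}
    labels = list(flags)
    return {
        (a, b): sum(x and y for x, y in zip(flags[a], flags[b]))
        for a in labels
        for b in labels
    }

columns = {
    "A": [1, 0, 3, -2],
    "B": [True, True, False, False],
    "C": [0.5, 2, 0, 0],
}
matrix = co_occurrence(columns)
```

The resulting matrix is symmetric, and its diagonal holds the number of TRUE values in each individual column.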
Select all data columns you'd like to include in the plot, by selecting them in the box on the right and clicking the Include selected columns in plot! button. By default, GeneProf will select all boolean columns which are displayed in the dataset's detail table. You can remove selections by clicking the Remove button and, if you wish, you may change the labels used in the plot by altering the values in the respective text boxes.
Below, you'll find several more options for customizing the output plot: You can change its appearance, by changing the plot title, colour and format (choose from a list of popular formats), define the resolution and size.
When you click the Plot! button, your customized plot will be rendered in a new browser window or tab. You can then go back to this dialog and change the parameters to achieve the look you prefer.
Plotting: Correlation Heatmap
Use this dialog to create what we call a 'Correlation Heatmap' for your data (cp.
this figure). That's a visualization of a correlation matrix between different numeric columns (
read up about correlation on Wikipedia; GeneProf uses Spearman's correlation coefficients).
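For the curious, Spearman's coefficient is simply Pearson's correlation computed on the ranks of the values, with tied values sharing their average rank. The Python sketch below shows this under that textbook definition; the function names are made up and it is not GeneProf's own code.

```python
import math

# Hypothetical sketch: a Spearman correlation matrix between numeric columns.

def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson's correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def correlation_matrix(columns):
    """columns: dict label -> list of numbers. Returns dict (a, b) -> Spearman coefficient."""
    labels = list(columns)
    return {(a, b): spearman(columns[a], columns[b]) for a in labels for b in labels}

columns = {"x": [1, 2, 3, 4], "y": [1, 4, 9, 16], "z": [4, 3, 2, 1]}
matrix = correlation_matrix(columns)
```

Because only ranks matter, x and y are perfectly correlated in this example even though their relationship is not linear.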
Select all data columns you'd like to include in the plot, by selecting them in the box on the right and clicking the Include selected columns in plot! button. By default, GeneProf will select all numeric columns which are displayed in the dataset's detail table. You can remove selections by clicking the Remove button and, if you wish, you may change the labels used in the plot by altering the values in the respective text boxes.
Below, you'll find several more options for customizing the output plot: You can change its appearance, by changing the plot title, colour and format (choose from a list of popular formats), define the resolution and size.
When you click the Plot! button, your customized plot will be rendered in a new browser window or tab. You can then go back to this dialog and change the parameters to achieve the look you prefer.
Plotting: Heatmaps
Use this dialog to create heatmaps for your data (cp.
this figure). Heatmaps have become a popular tool in biology for the visualization of numeric data, in particular gene expression data. Heatmaps, when combined with hierarchical clustering, can help researchers to quickly spot patterns in the data at hand.
Select all data columns you'd like to include in the plot, by selecting them in the box on the right and clicking the Include selected columns in plot! button. By default, GeneProf will select all numeric columns which are displayed in the dataset's detail table. You can remove selections by clicking the Remove button and, if you wish, you may change the labels used in the plot by altering the values in the respective text boxes.
Below, you'll find several more options for customizing the output plot: You can change its appearance, by changing the plot title, colour and format (choose from a list of popular formats), define the resolution and size.
You might also like to apply a transformation to your data before you create the plot: Often converting raw numbers to a logarithmic scale will improve the visual appeal as well as expressiveness of a plot. You may try to scale the data in such a way that all columns add up to the same value (1,000,000), although that's generally not a good idea and you should only do this in special cases (e.g. for raw gene expression read counts between different sequencing libraries).
If you wish, you can hide the colour legend displayed in the upper left of the heatmap and you can disable the dendrograms (and hence the hierarchical clustering of the samples = columns and genes = rows).
When you click the Plot! button, your customized plot will be rendered in a new browser window or tab. You can then go back to this dialog and change the parameters to achieve the look you prefer.
Plotting: Histograms
Use this dialog to create histograms for your data (cp.
this figure). Histograms summarize the distribution of values in a single dataset, i.e. they divide the entire range of values into 'bins' and count the number of values falling into each of these bins. GeneProf can create a plot combining several histograms into one large panel plot.
Select all data columns you'd like to include in the plot, by selecting them in the box on the right and clicking the Include selected columns in plot! button. By default, GeneProf will select all numeric columns which are displayed in the dataset's detail table. You can remove selections by clicking the Remove button and, if you wish, you may change the labels used in the plot by altering the values in the respective text boxes.
Below, you'll find several more options for customizing the output plot: You can change its appearance, by changing the plot title, colour and format (choose from a list of popular formats), define the resolution and size.
You might also like to apply a transformation to your data before you create the plot: Often converting raw numbers to a logarithmic scale will improve the visual appeal as well as expressiveness of a plot. You can also ignore data rows which have values of 0 for all selected columns or you may try to scale the data in such a way that all columns add up to the same value (1,000,000), although that's generally not a good idea and you should only do this in special cases (e.g. for raw gene expression read counts between different sequencing libraries).
When you click the Plot! button, your customized plot will be rendered in a new browser window or tab. You can then go back to this dialog and change the parameters to achieve the look you prefer.
Plotting: PCA Plots
Use this dialog to create a 'PCA plot' for your data (cp.
this figure). PCA stands for 'principal component analysis' and, put simply, it's a way to mathematically decompose large datasets into a set of representative numeric values, called 'principal components' (abbreviated: PCs;
read up about PCA on Wikipedia). Practically, PCA can be a neat way to discover non-trivial relationships between datasets: If the data can be accurately represented by its PCs, similar datasets should cluster together in the plots produced. In some cases, one may even attempt to assign a biological meaning to the PCs serving as a separator between two classes of datasets -- but you should be careful about this as it's not generally advisable to put too much meaning into the PCs.
GeneProf's PCA plots are actually a panel of four sub-plots: The first plot on the top-right tells you how well the PCA worked for the data at hand. Each PC can be said to explain a certain percentage of the variation in the data. For the subsequent plots we use the first 3 PCs, so those should sensibly explain a good amount of this variation. The plot shows two lines: The black one gives the percentage of the variation explained by each individual PC and the red one the cumulative percentage explained by the first to n-th PC, i.e. if we look at the red line at position x=3, the value on the y-axis tells us which percentage of the variation is explained by the first three PCs taken together. The other three plots show the contribution of the individual datasets to the first three PCs. The first plot compares PC1 with PC2, the second PC2 with PC3 and the last PC1 with PC3.
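The two lines in the first sub-plot are easy to compute once the variance captured by each PC is known. The Python sketch below assumes you already have these per-PC variances (the example values are invented) and merely shows how the individual and cumulative percentages relate; it does not perform the PCA itself.

```python
from itertools import accumulate

# Hypothetical sketch: percentage of variation explained per PC and cumulatively.

def explained_variance(pc_variances):
    """Return (individual, cumulative) percentages for a list of per-PC variances."""
    total = sum(pc_variances)
    individual = [100 * v / total for v in pc_variances]  # the 'black line'
    cumulative = list(accumulate(individual))             # the 'red line'
    return individual, cumulative

# Invented variances for three PCs
individual, cumulative = explained_variance([6.0, 3.0, 1.0])
```

In this made-up case the first two PCs together already explain 90 percent of the variation.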
To get started, select all data columns you'd like to include in the plot, by selecting them in the box on the right and clicking the Include selected columns in plot! button. By default, GeneProf will select all numeric columns which are displayed in the dataset's detail table. You can remove selections by clicking the Remove button and, if you wish, you may change the labels used in the plot by altering the values in the respective text boxes.
Below, you'll find several more options for customizing the output plot: You can change its appearance, by changing the format (choose from a list of popular formats), define the resolution and size.
You might also like to apply a transformation to your data before you create the plot: Often converting raw numbers to a logarithmic scale will improve the visual appeal as well as expressiveness of a plot. For the PCA, data is, by default, centered and scaled (to make it better comparable); you can deactivate either of these transformations by ticking the respective checkbox.
When you click the Plot! button, your customized plot will be rendered in a new browser window or tab. You can then go back to this dialog and change the parameters to achieve the look you prefer.
Plotting: Pie Charts
Use this dialog to create pie charts for your data (cp.
this figure). Pie charts are popular for summarizing categorical data and how often each category is observed as a percentage of the whole data.
In GeneProf, a pie chart will be created by looking at all values of a column, finding all distinct values (this only makes sense if there's a rather limited range of values) and counting how many rows have each distinct value.
A sensible application might be to create a pie chart of the feature type (protein coding, miRNA, etc.) of all genes in a feature dataset, where the feature dataset is a selection of features of interest, e.g. differentially expressed genes or targets of a transcription factor.
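The counting step behind such a pie chart can be sketched in a few lines of Python. The function name and the example feature types are made up for illustration.

```python
from collections import Counter

# Hypothetical sketch: the share of each distinct value in a column.

def category_shares(values):
    """Return dict category -> percentage of rows having that value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}

feature_types = ["protein_coding", "miRNA", "protein_coding", "protein_coding"]
shares = category_shares(feature_types)
```

Each entry of the result corresponds to one slice of the pie.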
To create a pie chart, first select the column with the values that should be plotted (for feature datasets, we'll select 'feature type' by default), then customize the appearance of the pie chart, by choosing its colour scheme, resolution, size and image format.
When you click the Plot! button, your customized plot will be rendered in a new browser window or tab. You can then go back to this dialog and change the parameters to achieve the look you prefer.
Plotting: Scatter Plots
Use this dialog to create scatter plots for your data (cp.
this figure). Scatter plots are fantastic tools for comparing different sets of numeric values. You'll immediately get a feel for how different datasets compare. GeneProf can plot a panel of multiple scatter plots to compare several datasets in one plot. Rather than plotting the same scatter plot twice (for the scatter plot of A vs. B is the mirror image of B vs. A), the lower left triangle of the panel plot will give the Pearson correlation coefficients for the respective datasets.
Select all data columns you'd like to include in the plot, by selecting them in the box on the right and clicking the Include selected columns in plot! button. By default, GeneProf will select all numeric columns which are displayed in the dataset's detail table. You can remove selections by clicking the Remove button and, if you wish, you may change the labels used in the plot by altering the values in the respective text boxes.
Below, you'll find several more options for customizing the output plot: You can change its appearance, by changing the plot title, colour and format (choose from a list of popular formats), define the resolution and size.
You might also like to apply a transformation to your data before you create the plot: Often converting raw numbers to a logarithmic scale will improve the visual appeal as well as expressiveness of a plot. You can also ignore data rows which have values of 0 for all selected columns or you may try to scale the data in such a way that all columns add up to the same value (1,000,000), although that's generally not a good idea and you should only do this in special cases (e.g. for raw gene expression read counts between different sequencing libraries).
GeneProf plots up to 250,000 points, but you can reduce the number of points (a random subset will be taken) for visual clarity. Another way that might help to improve the plot is the use of a smooth scatter plot: Rather than plotting each individual point, smooth scatter plots show something similar to a cloud in the plot where areas with a higher density of points are darker (or have a more intense colour) than those areas with a low density of points. Which plot is suitable for you depends entirely on the nature of your data.
When you click the Plot! button, your customized plot will be rendered in a new browser window or tab. You can then go back to this dialog and change the parameters to achieve the look you prefer.
Plotting: Venn Diagrams
Use this dialog to create a Venn/Euler diagram for your data (cp.
this figure). Venn diagrams visualize overlaps between lists as overlapping circles. GeneProf can plot up to three lists at once and their sizes will be proportional to the size of the datasets.
This tool uses the
Vennerable package by Jonathan Swinton (
Vennerable homepage). Other online tools for plotting Venn diagrams, which we find very useful, are
Venny and
BioVenn.
Right, so to plot a Venn diagram, we need to define the groups that are to be compared. In GeneProf, we define up to three groups based on the columns of the dataset at hand. For each group, select a combination of columns. A row is interpreted as belonging to this particular group if any of the values in the selected columns evaluates to TRUE, that is, if its value is TRUE or a number greater than 0. Now, one sensible application of this would be to define one group each for boolean-value columns you've defined earlier, e.g. saying 'this is differentially expressed in condition A' and 'this is differentially expressed in condition B'. You can then easily visualize the overlap of these two lists of differentially expressed genes by defining two groups in the Venn diagram dialog based on these two columns. If you do not select any columns for a group, this group will be ignored. Thus, you can plot Venn diagrams for only two groups by simply leaving out one.
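The group definition above boils down to simple set operations. The following Python sketch illustrates the rule that a row belongs to a group if any of the selected columns is TRUE or greater than 0; the function names, column names and example rows are all made up.

```python
# Hypothetical sketch: derive Venn diagram groups from dataset columns.

def is_true(value):
    """A value counts as TRUE if it is the boolean True or a number greater than 0."""
    if isinstance(value, bool):
        return value
    return isinstance(value, (int, float)) and value > 0

def group_members(rows, columns):
    """rows: dict row id -> dict column -> value; columns: column names selected for one group."""
    return {
        row_id
        for row_id, row in rows.items()
        if any(is_true(row[column]) for column in columns)
    }

rows = {
    "gene1": {"de_A": True, "de_B": False},
    "gene2": {"de_A": True, "de_B": True},
    "gene3": {"de_A": False, "de_B": 1},
    "gene4": {"de_A": 0, "de_B": False},
}
group_a = group_members(rows, ["de_A"])
group_b = group_members(rows, ["de_B"])
overlap = group_a & group_b   # the intersection of the two circles
only_a = group_a - group_b
only_b = group_b - group_a
```

A group without any selected columns comes out empty in this sketch, which matches the behaviour of simply leaving a group out.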
You can further customize the output plot by changing its resolution, size and image format.
When you click the Plot! button, your customized plot will be rendered in a new browser window or tab. You can then go back to this dialog and change the parameters to achieve the look you prefer.
Calculator
This dialog presents an alternative interface to various methods for manipulating the dataset open in the data table browser page (
this page). You can choose from a range of common tasks often applied to large data tables.
When you finish configuring a task, GeneProf will create a new workflow component for it and add this component to the experiment's workflow. You then need to execute the workflow (i.e. enqueue it in the processing queue on the GeneProf servers) to perform the newly added calculations.
Be careful here: If you are currently editing the experiment's workflow in another browser window or tab, unsaved changes to the workflow will be lost! So, before you proceed, save any open workflows and close the browser windows and tabs no longer required.
Filters
This dialog presents an alternative interface to various methods for applying a permanent filter to the dataset open in the data table browser page (
this page). Mind, the filters will NEVER change the dataset at hand itself, but rather derive a new, filtered dataset from the data in this dataset.
When you finish configuring the filter, GeneProf will create a new workflow component for it and add this component to the experiment's workflow. You then need to execute the workflow (i.e. enqueue it in the processing queue on the GeneProf servers) to run the newly added filter.
Be careful here: If you are currently editing the experiment's workflow in another browser window or tab, unsaved changes to the workflow will be lost! So, before you proceed, save any open workflows and close the browser windows and tabs no longer required.
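To illustrate the point that a filter derives a new dataset rather than changing the existing one, here is a minimal Python sketch. It is not how GeneProf implements filters; the function `apply_filter`, the column names and the example rows are hypothetical.

```python
def apply_filter(dataset, predicate):
    """Return a NEW list with copies of the rows that pass; the input is left untouched."""
    return [dict(row) for row in dataset if predicate(row)]


# Hypothetical dataset: one row per gene.
dataset = [
    {"gene": "geneA", "fold_change": 2.5},
    {"gene": "geneB", "fold_change": 0.8},
]

# Derive a filtered dataset; `dataset` itself still contains both rows afterwards.
filtered = apply_filter(dataset, lambda row: row["fold_change"] > 2)
```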
Upload Main
This is where any data upload to one of your GeneProf experiments starts. Choose your preferred method from the available options by clicking on it.
Java File Uploader Tool
The uploader tool allows you to upload multiple files at a time to the GeneProf system. Files may be compressed in a variety of popular formats (ZIP, GZIP, BZ2) -- we prefer GZIP.
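If your files are not compressed yet, you can GZIP them before uploading. The following is a small sketch using Python's standard library; the helper name `gzip_file` and the example file are made up, and any other GZIP tool will do just as well.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_file(source):
    """Write a GZIP-compressed copy of `source` next to it and return the new path."""
    source = Path(source)
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams the data, so large files are fine
    return target


# Demonstration on a tiny, made-up FASTQ file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    fastq = Path(tmp) / "reads.fastq"
    fastq.write_text("@read1\nACGT\n+\nIIII\n")
    compressed = gzip_file(fastq)
    with gzip.open(compressed, "rt") as handle:
        round_trip = handle.read()  # decompress again to check the content
```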
To start uploading files, click the Add Files button and browse your computer to locate the files. You can add multiple files at once or one at a time -- whatever you prefer.
Keep repeating the procedure until you've selected all files you want to upload, then click the Upload button to start the uploads. GeneProf will launch a program that manages the uploads in the background (there should be a little test tube icon in your task bar). You can monitor the progress of your uploads via this tool or via the GeneProf website.
Once all uploads have been processed, your experiment will become editable again and you can proceed with your data analysis.
We've worked hard to make the uploader tool as accessible as possible, but unfortunately it's impossible to foresee all problems and there might be some operating system / browser combinations on which the uploader tool does not work correctly. If you experience trouble, the page provides suggestions on how to resolve your issues and links to alternative upload mechanisms. Please don't hesitate to get in touch if you can't upload your files!
Simple File Uploader Tool
If the Java-based uploader tool (above) does not work for you for some reason, this simple version of the uploader tool might be the solution. Files may be compressed in a variety of popular formats (ZIP, GZIP, BZ2) -- we prefer GZIP. If your file is still larger than the limit despite compression, you'll need to upload it to an external web server and use the URL uploader (see
this page) instead. We apologize for any inconvenience and assure you, that we're working on a better solution!
Using the uploader is really easy: Just click the upload button and select the file(s) you'd like to import into GeneProf. You can click the button repeatedly and select one file at a time, or you can select more than one file at once using your CTRL-key (Mac-users: CMD-key). The upload does not start immediately; instead, files are put into a queue. When you've completed your file selection, click the Start Upload button. The files will then be uploaded one at a time until all files have been completed. The upload may take a very long time depending on the size of the files, but you can leave the process running in the background and continue working in another browser window. Just don't close the uploader dialog or the browser window containing it, otherwise your upload will be interrupted!
When all uploads are complete, you can close the dialog. The experiment's main page will then be refreshed and you should be able to see the uploaded files in the Input Data section.
N.B. The uploader tool depends on
Flash technology. All modern browsers support
Flash via specific plugins -- which you will probably have installed already. If not, your browser software should ask you whether you want to install it automatically. Alternatively, you can download the latest version of the plugin from
Adobe's download page. If, for any reason, this fails or you do not wish to install
Flash, you can use a basic file uploader by clicking the
simple uploader link. Note, though, that you can only upload one file at a time with the basic uploader.
URL Uploader Tool
If the data you'd like to upload has been deposited on some sort of web server (HTTP or FTP -- that is, something you can access in a web browser by entering a URL), you can upload it easily to GeneProf by providing the respective URLs. In this case, no restrictions apply to the size of the files uploaded.
To start the upload, simply copy & paste all URLs into the big text box, one URL per line, and click the Upload Files button. The upload will start immediately, but the process may take a very long time depending on the size of the files. You can leave the process running in the background and continue working in another browser window. Just don't close the uploader dialog or the browser window containing it, otherwise your upload will be interrupted!
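Before pasting a long list, it can help to check that every line really is an HTTP or FTP URL. The sketch below shows one way to do this in Python; it is not part of GeneProf, the function name `parse_url_list` and the example.org URLs are placeholders, and restricting the check to exactly `http` and `ftp` is an assumption based on the description above (whether other schemes such as HTTPS are accepted is not stated here).

```python
from urllib.parse import urlparse

# Assumption: only the two protocols named in the text are accepted.
ALLOWED_SCHEMES = {"http", "ftp"}


def parse_url_list(text):
    """Split pasted text into URLs (one per line) and sort them into usable and rejected."""
    accepted, rejected = [], []
    for line in text.splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        parsed = urlparse(url)
        if parsed.scheme in ALLOWED_SCHEMES and parsed.netloc:
            accepted.append(url)
        else:
            rejected.append(url)
    return accepted, rejected


# Placeholder input, as it might be pasted into the text box.
pasted = """
http://example.org/data/sample1.fastq.gz
ftp://example.org/pub/sample2.fastq.gz
not a url
"""
accepted, rejected = parse_url_list(pasted)
```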
SRA Importer
If you'd like to re-analyse some published data, the SRA importer tool is probably the best way to quickly get hold of the data. Most published next-gen sequencing data nowadays will be deposited in the SRA [
15] and you can import this data directly into GeneProf. Papers usually state the SRA accession number(s) for the raw data, or some include an accession number from the Gene Expression Omnibus (GEO), whose entry in turn refers to the project in the SRA.
If you know the SRA accession number (ideally the SRA project ID, something starting with SRP and followed by 6 digits), just enter it in the search text box. Alternatively, you may also try to search for a word and see what the SRA holds in store. Mind, though, that the number of retrieved results is limited, so your search needs to be rather specific.
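The accession format mentioned above (SRP followed by 6 digits) is easy to check with a regular expression. This is a sketch only: the function name is hypothetical, the example IDs are made up, and the pattern covers just the project ID format described here, not other SRA identifier types.

```python
import re

# "SRP" followed by exactly six digits, as described in the text.
SRP_PATTERN = re.compile(r"SRP\d{6}")


def looks_like_sra_project_id(term):
    """Return True if the search term has the form of an SRA project ID."""
    return SRP_PATTERN.fullmatch(term.strip()) is not None


# Made-up examples: only the first has the expected form.
checks = {term: looks_like_sra_project_id(term)
          for term in ["SRP000123", "SRP123", "histone"]}
```

Anything that does not match can still be used as a free-text search term.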
After you've entered your search term, all matching entries (along with their identifiers, name and some annotation) will be listed in this dialog. The results are organised by SRA experiments, which usually correspond to one sequenced sample library. One project will consist of more than one of these experiments and each experiment will in turn contain one or more sequencing runs, i.e. the actual short read datasets.
When the search results have been displayed, GeneProf will initially show a question mark icon next to each sequencing run. GeneProf will go off and check whether the raw data for this run is actually available in a downloadable format (unfortunately, for some experiments it isn't) and dynamically update this image and complete the information about the file size. If the data files could be located successfully, the icon will be changed to a green tick mark, otherwise to a red X.
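The structure of the search results (a project containing experiments, each containing runs with an availability status) can be modelled as follows. This is a sketch for illustration, not GeneProf's data model: all class and function names are assumptions, and the accessions are placeholders in the usual SRA style (SRP for projects, SRX for experiments, SRR for runs).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Availability(Enum):
    UNKNOWN = "question mark"   # check still pending
    AVAILABLE = "green tick"    # data files located
    UNAVAILABLE = "red X"       # no downloadable data found


@dataclass
class Run:
    accession: str
    availability: Availability = Availability.UNKNOWN
    size_bytes: Optional[int] = None


@dataclass
class Experiment:
    accession: str
    runs: List[Run] = field(default_factory=list)


@dataclass
class Project:
    accession: str
    experiments: List[Experiment] = field(default_factory=list)


def update_availability(run, size_bytes):
    """Record the result of the availability check (None means nothing was found)."""
    if size_bytes is None:
        run.availability = Availability.UNAVAILABLE
    else:
        run.availability = Availability.AVAILABLE
        run.size_bytes = size_bytes


def importable_runs(project):
    """List all runs of a project whose data files were located."""
    return [run for experiment in project.experiments for run in experiment.runs
            if run.availability is Availability.AVAILABLE]


# Placeholder project with one experiment and two runs.
project = Project("SRP000000", [Experiment("SRX000001",
                                           [Run("SRR000001"), Run("SRR000002")])])
runs = project.experiments[0].runs
update_availability(runs[0], 1_500_000)  # found, with a made-up file size
update_availability(runs[1], None)       # not downloadable
ready = importable_runs(project)
```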
Give GeneProf a moment to update all the file information, then tick the checkboxes next to all datasets you'd like to import and click the Import the Selected Data! button. The upload will happen on one of the compute servers, which means that, at this point, your experiment will be entered into the processing queue. You cannot make any further modifications to the experiment until the process is complete; however, you can now close the browser window or navigate away from this page. You will be notified by email as soon as the upload is complete.
Import Data from Another Experiment (1)
GeneProf hosts an ever-increasing amount of ready-analysed next-gen data, and a lot of this data is public. You can import any dataset from a public experiment or from your own private, but finalised, experiments (cp.
SubConcept 'Finalization and Publication'). This can be useful if you'd like to try a different analysis method on some public data, if you'd like to perform a meta-analysis of data from different studies or if you'd simply like to compare different datasets in the scope of one experiment.
It should also be noted, that data imported from other experiments will not be replicated in the database, which means -- in practical terms and in your favour -- it will not count towards your disk space allowance!
To import data, first find the experiment you want to import data from by browsing through the list of available experiments. The list can be filtered dynamically and is divided into pages (cp.
this page).
Once you have located the experiment of interest, click the Import button. You'll be forwarded to another dialog.
Import Data from Another Experiment (2)
This dialog displays all datasets contained in the experiment you've chosen previously. The datasets are split up by their data type (cp.
Concept 'Datasets').
Simply tick the checkboxes next to the datasets you'd like to import and click the
Import Selected Dataset(s) button. The import will happen immediately and you'll be brought back to the experiment main page (cp.
this page).
Remove Input Data
We need your confirmation before an input dataset is to be removed from an experiment. The removal step is irreversible, so the only way to recover a removed dataset is to upload it to GeneProf again.
Also, consider which implications removing a dataset might have on the experiment's workflow. If you've previously run an analysis and other components of the workflow depend on the input data to be removed, those parts of the workflow (and all intermediate results produced) will be reset and lost.
If you're not sure, check the workflow (via the workflow designer:
this page) before you proceed. You can also remove input data directly in the workflow designer, in which case the consequences of the removal might be more evident.
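To get a feeling for which parts of a workflow would be reset, think of the workflow as a graph in which each component lists the components that consume its output. The sketch below walks this graph in Python; the function name and the example workflow are hypothetical and do not reflect GeneProf's internals.

```python
from collections import deque


def components_to_reset(dependents, removed):
    """Return every component that depends, directly or indirectly, on `removed`."""
    affected = set()
    queue = deque([removed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


# Hypothetical workflow: each key maps to the components that use its output.
workflow = {
    "reads_A": ["align_A"],
    "reads_B": ["align_B"],
    "align_A": ["compare"],
    "align_B": ["compare"],
    "compare": [],
}

# Removing the input "reads_A" would reset its alignment and the comparison.
affected = components_to_reset(workflow, "reads_A")
```

In this example, "align_B" is unaffected because it does not depend on the removed input.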
Workflow Execution
When you are done designing your data analysis workflow, all modules are connected and all parameters set to your satisfaction, you can submit your experiment for execution on one of GeneProf's number crunching servers.
Mind, the job will be entered into a queue and it might take a while before the experiment is actually picked up and processed. Also, the execution of entire workflows can take hours (sometimes days).
If you'd like, we can notify you by email when processing is complete. In fact, we'll do this by default, so if you do NOT want to be emailed, un-tick the checkbox prior to submitting the experiment to the queue.
Workflow Interruption
You can interrupt any of your experiments which are currently being executed on the server or pending execution. However, it can, at times, take a few minutes (occasionally even up to an hour) before the experiment becomes editable again, so please be patient.
VDE Data Selector
Use this dialog to select all data to be used in the plots created by the visual data explorer (VDE, cp.
this page).
The dialog presents a table of all datasets of the previously selected type and you can tick the checkboxes next to the datasets you'd like to use.
The table includes the sample annotation associated to each of the datasets to help you make a decision. There are a number of ways to quickly select all datasets of interest:
You can toggle the selection of all datasets that belong to a certain annotation category (more information about sample annotations here:
SubConcept 'Sample Annotation' and
this page), by selecting the annotation value of interest from one of the drop-down menus below the column labels. For example, if you were to select embryonic stem cell from the menu below the header Cell Type, all samples with this cell type would be toggled on or off (depending on whether they were selected before). You can use a combination of these selection methods to quickly select all datasets you're interested in. Just always bear in mind that the selection will be toggled, i.e. if you first select all samples with cell type X and then all samples from a technology platform Y, all those samples that have cell type X and platform Y will be de-selected, because they were first toggled on and then off again.
To reset all selections, click the Clear All link at the top left of the table.
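The toggle behaviour described above corresponds to a symmetric difference of sets. The following Python sketch reproduces the cell type X / platform Y example; the function name `toggle` and the sample table are made up for illustration.

```python
def toggle(selected, samples, category, value):
    """Flip the selection state of every sample whose annotation matches."""
    matching = {name for name, annotation in samples.items()
                if annotation.get(category) == value}
    return selected ^ matching  # symmetric difference: on becomes off, off becomes on


# Hypothetical samples with two annotation categories.
samples = {
    "s1": {"Cell Type": "X", "Platform": "Y"},
    "s2": {"Cell Type": "X", "Platform": "Z"},
    "s3": {"Cell Type": "W", "Platform": "Y"},
}

selected = set()                                          # nothing selected yet
selected = toggle(selected, samples, "Cell Type", "X")    # s1 and s2 switched on
selected = toggle(selected, samples, "Platform", "Y")     # s1 off again, s3 on
```

After both steps, s1 (cell type X and platform Y) is de-selected again, while s2 and s3 remain selected.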
Bibliography
[1] Mortazavi, A and Williams, BA and McCue, K and Schaeffer, L and Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq.
Nat. Methods (2008). PMID:
18516045.
[2] Zhang, Y and Liu, T and Meyer, CA and Eeckhoute, J and Johnson, DS and Bernstein, BE and Nusbaum, C and Myers, RM and Brown, M and Li, W and Liu, XS. Model-based analysis of ChIP-Seq (MACS).
Genome Biol. (2008). PMID:
18798982.
[3] Durinck, S and Bullard, J and Spellman, PT and Dudoit, S. GenomeGraphs: integrated genomic data visualization with R.
BMC Bioinformatics (2009). PMID:
19123956.
[4] Langmead, B and Trapnell, C and Pop, M and Salzberg, SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.
Genome Biol. (2009). PMID:
19261174.
[5] Trapnell, C and Pachter, L and Salzberg, SL. TopHat: discovering splice junctions with RNA-Seq.
Bioinformatics (2009). PMID:
19289445.
[6] Podicheti, R and Gollapudi, R and Dong, Q. WebGBrowse--a web server for GBrowse.
Bioinformatics (2009). PMID:
19357095.
[7] Flicek, P and Aken, BL and Ballester, B and Beal, K and Bragin, E and Brent, S and Chen, Y and Clapham, P and Coates, G and Fairley, S and Fitzgerald, S and Fernandez-Banet, J and Gordon, L and Gräf, S and Haider, S and Hammond, M and Howe, K and Jenkinson, A and Johnson, N and Kähäri, A and Keefe, D and Keenan, S and Kinsella, R and Kokocinski, F and Koscielny, G and Kulesha, E and Lawson, D and Longden, I and Massingham, T and McLaren, W and Megy, K and Overduin, B and Pritchard, B and Rios, D and Ruffier, M and Schuster, M and Slater, G and Smedley, D and Spudich, G and Tang, YA and Trevanion, S and Vilella, A and Vogel, J and White, S and Wilder, SP and Zadissa, A and Birney, E and Cunningham, F and Dunham, I and Durbin, R and Fernández-Suarez, XM and Herrero, J and Hubbard, TJ and Parker, A and Proctor, G and Smith, J and Searle, SM. Ensembl's 10th year.
Nucleic Acids Res. (2010). PMID:
19906699.
[8] Rhead, B and Karolchik, D and Kuhn, RM and Hinrichs, AS and Zweig, AS and Fujita, PA and Diekhans, M and Smith, KE and Rosenbloom, KR and Raney, BJ and Pohl, A and Pheasant, M and Meyer, LR and Learned, K and Hsu, F and Hillman-Jackson, J and Harte, RA and Giardine, B and Dreszer, TR and Clawson, H and Barber, GP and Haussler, D and Kent, WJ. The UCSC Genome Browser database: update 2010.
Nucleic Acids Res. (2010). PMID:
19906737.
[9] Ouyang, Z and Zhou, Q and Wong, WH. ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells.
Proc. Natl. Acad. Sci. U.S.A. (2009). PMID:
19995984.
[10] Nielsen, CB and Cantor, M and Dubchak, I and Gordon, D and Wang, T. Visualizing genomes: techniques and challenges.
Nat. Methods (2010). PMID:
20195257.
[11] Fiume, M and Williams, V and Brook, A and Brudno, M. Savant: genome browser for high-throughput sequencing data.
Bioinformatics (2010). PMID:
20562449.
[12] Fujita, PA and Rhead, B and Zweig, AS and Hinrichs, AS and Karolchik, D and Cline, MS and Goldman, M and Barber, GP and Clawson, H and Coelho, A and Diekhans, M and Dreszer, TR and Giardine, BM and Harte, RA and Hillman-Jackson, J and Hsu, F and Kirkup, V and Kuhn, RM and Learned, K and Li, CH and Meyer, LR and Pohl, A and Raney, BJ and Rosenbloom, KR and Smith, KE and Haussler, D and Kent, WJ. The UCSC Genome Browser database: update 2011.
Nucleic Acids Res. (2011). PMID:
20959295.
[13] Anders, S and Huber, W. Differential expression analysis for sequence count data.
Genome Biol. (2010). PMID:
20979621.
[14] Flicek, P and Amode, MR and Barrell, D and Beal, K and Brent, S and Chen, Y and Clapham, P and Coates, G and Fairley, S and Fitzgerald, S and Gordon, L and Hendrix, M and Hourlier, T and Johnson, N and Kähäri, A and Keefe, D and Keenan, S and Kinsella, R and Kokocinski, F and Kulesha, E and Larsson, P and Longden, I and McLaren, W and Overduin, B and Pritchard, B and Riat, HS and Rios, D and Ritchie, GR and Ruffier, M and Schuster, M and Sobral, D and Spudich, G and Tang, YA and Trevanion, S and Vandrovcova, J and Vilella, AJ and White, S and Wilder, SP and Zadissa, A and Zamora, J and Aken, BL and Birney, E and Cunningham, F and Dunham, I and Durbin, R and Fernández-Suarez, XM and Herrero, J and Hubbard, TJ and Parker, A and Proctor, G and Vogel, J and Searle, SM. Ensembl 2011.
Nucleic Acids Res. (2011). PMID:
21045057.
[15] Leinonen, R and Sugawara, H and Shumway, M. The sequence read archive.
Nucleic Acids Res. (2011). PMID:
21062823.
[16] Robinson, JT and Thorvaldsdóttir, H and Winckler, W and Guttman, M and Lander, ES and Getz, G and Mesirov, JP. Integrative genomics viewer.
Nat. Biotechnol. (2011). PMID:
21221095.