GeneProf Manual


Florian Halbritter, 27-Nov-2012

Table of Contents

Chapter: Introduction

Chapter: Concepts Explained

Chapter: Tutorials

Chapter: Modules

Chapter: Pages

Chapter: Advanced Topics

Chapter: Frequently Asked Questions (FAQ)

Tutorials


How to Create a GeneProf Experiment and Start Analyzing your Data

Getting started analyzing your own data with GeneProf couldn't be easier! All you need to do is create a new experiment. Experiments, in GeneProf, combine a collection of input data with a complete analysis workflow into one tightly coupled, meaningful unit (cp. Concept 'Experiments'). You will need to log in to your personal GeneProf user account first (cp. user registration and login pages, PageCategory 'User Pages').
To start a new experiment, access the experiment creation page (cp. this page) either through the navigation bar under the topic Public & Personal Data >> Start a New Experiment or from the main menu, where you will find a link in the Things you might want to do.. >> Analyse your own Next-Gen Data section. Alternatively, you can follow this link to the experiment creation page.
Enter a little information about your experiment into the respective text boxes (for details, see this page). Mind, only the name and a short description are mandatory. You can always return to fill in or edit this information later. Confirm your new experiment by clicking Save Changes.
That's it. Simple, isn't it? Having created your own experiment, you can always return to it via the navigation menu: Public & Personal Data >> My Experiments >> NAME-OF-YOUR-EXPERIMENT.
To get going with an actual data analysis, all you need to do now is upload some data to your new experiment (see the next tutorial, Tutorial 'Uploading Data to GeneProf') and create an analysis workflow. It's advisable to use one of the analysis wizards the first time you do this, cp. SubConcept 'Analysis Wizards' and PageCategory 'Wizard Pages'.

Summary:

  1. Go to the experiment creation page.
  2. Describe your experiment and confirm your input.
Where to go from here: Tutorial 'Uploading Data to GeneProf'

Uploading Data to GeneProf

Before you can get started analyzing next-gen data with GeneProf (whether your own or public data), you'll need to include it in your own experiment. There are a number of different ways for you to load data into your GeneProf experiment, which can all be accessed from the experiment main page (see this page).
Upload Data. Get your data into GeneProf by clicking this button. You can either upload files from your computer, import them from an internet database or from another GeneProf experiment.
Just below the general experiment details you'll find a section called Input Data. In a new experiment, this will be empty since there's no data in it yet, of course! Let's see how we can change this.
You'll notice a button in this section labeled Upload (also see this figure). If you click the button, a pop-up dialog will appear and present you with four options on how to proceed: Upload files from your computer or external URLs, import data from an existing experiment or import data from the NCBI's Sequence Read Archive. If you click any of these options, you'll be forwarded to another dialog that allows you to actually import the data. Please refer to the individual tutorials below for more information about each upload procedure.
Method 1: Upload Files from Your Local Computer
If the Java-based uploader tool (above) for some reason does not work for you, this simple version of the uploader tool might be the solution for you. Files may be compressed in a variety of popular formats (ZIP, GZIP, BZ2) -- we prefer GZIP. If your file, despite compression, is still larger than the limit, you'll need to upload it to an external web server and use the URL uploader (see this page) instead. We apologize for any inconvenience and assure you that we're working on a better solution!
So, using the uploader is really easy: Just click the upload button and select the file(s) you'd like to import into GeneProf. You can click the button repeatedly and select one file at a time, or you can select more than one file at once using your CTRL-key (Mac users: CMD-key). The upload does not start immediately; files are first put into a queue. When you've completed your file selection, click the Start Upload button. The files will then be uploaded one at a time until all of them are done. The upload may take a very long time depending on the size of the files, but you can leave the process running in the background and continue working in another browser window. Just don't close the uploader dialog or the browser window containing it, otherwise your upload will be interrupted!
When all uploads are complete, you can close the dialog. The experiment's main page will then be refreshed and you should be able to see the uploaded files in the Input Data section.
N.B. The uploader tool depends on Flash technology. All modern browsers support Flash via specific plugins -- which you will probably have installed already. If not, your browser software should ask you whether you want to install it automatically. Alternatively, you can download the latest version of the plugin from Adobe's download page. If, for any reason, this fails or you do not wish to install Flash, you can use a basic file uploader by clicking the simple uploader link. Note, though, that you can only upload one file at a time with the basic uploader.

Summary:

  1. Access the file upload utility from your own experiment: Upload >> Upload files from your computer.
  2. Click the 'Add Files' button and select the file(s) you want to upload.
  3. Repeat the previous step, if necessary, to include all files you need.
  4. Click Upload to start the upload and.. wait: This can take several hours. You can, however, leave the page and do something else; the GeneProf uploader tool runs in the background and will continue uploading your files so long as your computer is switched on (there should be a little GeneProf icon in your taskbar to indicate the uploader is active).
Method 2: Upload Files from A Web Resource
If the data you'd like to upload has been deposited on some sort of web server (HTTP or FTP -- that is, something you can access in a web browser by entering a URL), you can upload it easily to GeneProf by providing the respective URLs. In this case, no restrictions apply to the size of the files uploaded.
To start the upload, simply copy & paste all URLs into the big text box, one URL per line, and click the Upload Files button. The upload will start immediately, but the process may take a very long time depending on the size of the files. You can leave the process running in the background and continue working in another browser window. Just don't close the uploader dialog or the browser window containing it, otherwise your upload will be interrupted!

Summary:

  1. Copy & Paste the URLs of all files you'd like to upload into the text area. One URL per line.
  2. Click Upload Files to start the upload.
  3. Do not leave this page or switch off the computer. This may take several hours. Please note: If access to the URL requires some sort of authentication, downloading data directly might not work. In this case, you will have to download the files manually first and then upload them via the interface.
Method 3: Upload Sequences from the Sequence Read Archive
If you'd like to re-analyse some published data, the SRA importer tool is probably the best way to quickly get hold of the data. Most published next-gen sequencing data nowadays will be deposited in the SRA [14] and you can import this data directly into GeneProf. Papers usually state the SRA accession number(s) for the raw data, or include an accession number from the Gene Expression Omnibus (GEO), whose entry in turn refers to the project in the SRA.
If you know the SRA accession number (ideally the SRA project ID, something starting with SRP and followed by 6 digits), just enter it in the search text box. Alternatively, you may also try to search for a word and see what the SRA holds in store. Mind, though, that the number of retrieved results is limited, so your search needs to be rather specific.
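As a rough illustration of what these identifiers look like, the sketch below checks whether a string resembles a common SRA-style accession. The prefixes and the "six or more digits" rule are our assumption about the usual cases, not GeneProf's actual validation logic:

```python
import re

# Common SRA-style accession prefixes: SRP = study/project, SRX = experiment,
# SRR = run; the European and Japanese mirrors use ERP/ERX/ERR and DRP/DRX/DRR.
# Treating "prefix + at least six digits" as valid is an assumption made for
# this sketch, not a rule taken from GeneProf itself.
SRA_ACCESSION = re.compile(r"^(SRP|SRX|SRR|ERP|ERX|ERR|DRP|DRX|DRR)\d{6,}$")

def looks_like_sra_accession(text):
    """Return True if the given string resembles an SRA accession number."""
    return bool(SRA_ACCESSION.match(text.strip().upper()))
```

For example, 'SRP001234' would be accepted, while a GEO series identifier such as 'GSE12345' would not -- for GEO entries, you'd first look up the linked SRA project.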
After you've entered your search term, all matching entries (along with their identifiers, name and some annotation) will be listed in this dialog. The results are organised by SRA experiments, which usually correspond to one sequenced sample library. One project will consist of more than one of these experiments and each experiment will in turn contain one or more sequencing runs, i.e. the actual short read datasets.
When the search results have been displayed, GeneProf will initially show a question mark icon (?) next to each sequencing run. GeneProf will go off and check whether the raw data for this run is actually available in a downloadable format (unfortunately, for some experiments it isn't) and dynamically update this image and complete the information about the file size. If the data files could be located successfully, the icon will be changed to a green tick mark (green tick), otherwise a red X (red X).
Give GeneProf a moment to update all the file information, then tick the checkboxes next to all datasets you'd like to import and click the Import the Selected Data! button. The upload will happen on one of the compute servers, which means, at this point, your experiment will be entered into the processing queue. You cannot make any further modifications to the experiment until the process is complete, however, you can now close the browser window or navigate away from this page. You will be notified by email as soon as the upload is complete.

Summary:

  1. Search for SRA data by entering an accession number or search term into the text field.
  2. Tick the boxes next to all datasets you'd like to upload.
  3. Click Import the Selected Data! to start the upload.
  4. You may browse away from the page and even switch off your computer now: The download is executed on our compute cluster. You will be notified by email when it's complete.
Method 4: Import Data from Another Experiment
GeneProf hosts an ever-increasing wealth of ready-analysed next-gen data, and a lot of this data is public. You can import any dataset from a public experiment or from your own private, but finalised experiments (cp. SubConcept 'Finalization and Publication'). This can be useful if you'd like to try a different analysis method on some public data, if you'd like to perform a meta-analysis of data from different studies or if you'd simply like to compare different datasets in the scope of one experiment.
It should also be noted that data imported from other experiments will not be replicated in the database, which means -- in practical terms and in your favour -- it will not count towards your disk space allowance!
To import data, first find the experiment you want to import data from by browsing through the list of available experiments. The list can be filtered dynamically and is divided into pages (cp. this page).
Once you've located the experiment of interest, click the Import button. You'll be forwarded to another dialog.
This dialog displays all datasets contained in the experiment you've chosen previously. The datasets are split up by their data type (cp. Concept 'Datasets').
Simply tick the checkboxes next to the datasets you'd like to import and click the Import Selected Dataset(s) button. The import will happen immediately and you'll be brought back to the experiment main page (cp. this page).

Summary:

  1. Look for the experiment you want to import data from and click the import button.
  2. Tick the boxes next to all datasets you'd like to import.
  3. Click the Import Selected Dataset(s) button.
  4. The import should happen immediately and you'll be brought back to the experiment main page.
Where to go from here: this page (find out how to annotate the files you've uploaded), Tutorial 'Discovering Transcription Fact..', Tutorial 'Differential Expression Analys..'

Differential Expression Analysis (RNA-seq, Tag-seq)

In this tutorial, we'll create a new GeneProf experiment, upload some sample data to it, annotate these files and then use the All-in-one RNA-seq Analysis Wizard to create an analysis workflow for this data. We'll also see how one might go about interpreting the outputs produced. Let's get started!

RNA-seq Step 1: Obtain Sample Data, Create an Experiment and Upload the Files

In this tutorial, we will be using some mock sample data simulating a typical RNA-seq experiment for two conditions (A and B) in mouse (also cp. this figure). Please start by downloading the sample data from this link. The ZIP archive contains four files, namely sequences in FASTQ format (cp. SubConcept 'Quality Scores and Formats') for two biological conditions with two replicates each.
File                     Format        Sample Group / Condition   Number of Reads   Read Length
RNAseq-SampleData.zip    ZIP archive   -                          -                 -
-- A1.fq                 FASTQ         A                          25,000            25bp
-- A2.fq                 FASTQ         A                          25,000            25bp
-- B1.fq                 FASTQ         B                          25,000            25bp
-- B2.fq                 FASTQ         B                          25,000            25bp
RNA-seq Tutorial Step 1. Get the sample data and upload it to your own new experiment. Then annotate the samples.
You don't need to extract the files from the archive or do anything with it at all: GeneProf will later on automatically detect that it's a ZIP archive and extract the data contained itself.
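The files inside the archive use the standard FASTQ layout: four lines per read (an '@' header, the sequence, a '+' separator and a quality string of the same length). If you're curious what GeneProf's parsing step has to deal with, here is a minimal Python sketch of a FASTQ reader; the read name and quality scores below are made up for illustration, and GeneProf's actual parser is, of course, more robust:

```python
def parse_fastq(lines):
    """Parse FASTQ text into (name, sequence, quality) tuples.

    FASTQ stores each read as four lines: an '@'-prefixed header,
    the nucleotide sequence, a '+' separator line and a quality
    string of exactly the same length as the sequence.
    """
    lines = [line.rstrip("\n") for line in lines]
    records = []
    for i in range(0, len(lines), 4):
        header, seq, sep, qual = lines[i:i + 4]
        assert header.startswith("@") and sep.startswith("+")
        assert len(seq) == len(qual)
        records.append((header[1:], seq, qual))
    return records

# A single made-up 25bp read, as it might appear in A1.fq:
example = [
    "@read_0001\n",
    "ACGTTGCAGTCGATGCAGTCAGTCA\n",
    "+\n",
    "I" * 17 + "HHHHGGFF" + "\n",
]
```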
Right, so let's create a virtual experiment which we'll use for this exercise (call it anything you like). If you're not sure how to go about starting a new experiment, have a look at this tutorial first: Tutorial 'How to Create a GeneProf Exper..'.
By the way, a public version of this demo experiment is available under accession number gpXP_000119. Note, however, that the raw data files in this experiment have been replaced directly with the parsed sequence output, as is always done when an experiment is finalized and made public (cp. SubConcept 'Finalization and Publication').
After creating the new experiment, upload the sample data you've just obtained (i.e. the whole ZIP, or alternatively the FASTQ files contained). Not sure how to upload data? Please refer to this tutorial: Tutorial 'Uploading Data to GeneProf'.
Finally, we should annotate the data files we've just uploaded. Providing proper sample annotation will make it easier to run the analysis wizards in the next step.
In the experiment main page (cp. this page), scroll down until you find the section titled Sample Groups and Experimental Factors, then click the Annotate Samples button.
GeneProf will automatically create a default, blank annotation table for us with four rows and three annotation columns: Label, Organism and Platform (for details about sample annotation types and how to use this particular page, see this page). Having four different biological samples, this layout is indeed what we want, so we won't have to change much. GeneProf has already selected an association to one of the input files in each row, so we just need to fill in the blank cells.
Let's label the samples 'A1', 'A2', 'B1' and 'B2', respectively (make sure to fill in the label in the row that has been associated with the respective file!). They all come from mouse, so next click the pencil icon in the Organism column of one of the rows, type 'mouse' into the text box (or 'mus musculus', if you like) and select mouse from the dropdown list, then confirm your selection by clicking Ok. GeneProf will fill in the NCBI taxon ID for M. musculus (which is 10090). Rather than repeating this step for every other row, we can just double-click the value we filled in to copy it over to all other cells in this column. Next, choose the technology platform used to produce the data: Again, click one of the pencil icons, then select Illumina Genome Analyzer from the dropdown list and confirm with Ok. Fill in the entire column with the same value.
Now, we'd like to add one additional annotation. So far, there's no indication whatsoever that A1 and A2 (and B1 and B2) actually belong to the same biological condition (i.e. they are different samples, but replicates of each other). We'll add a Sample Group column to include this information. Click the Add Column button, choose Sample Group, then confirm with Ok. Enter 'A' for samples A1 and A2 and 'B' for samples B1 and B2.
That will do for this tutorial. In a real experiment, please try to add as much information as possible to the sample annotation!

Summary:

  1. Download the sample data.
  2. Create a new experiment for this tutorial.
  3. Upload the sample data to the new experiment.
  4. Annotate the data. Make sure to add a Sample Group column.

RNA-seq Step 2: Define an Analysis Workflow and Execute it on the Servers

RNA-seq Tutorial Step 2. Open the RNA-seq wizard, check all parameters and save and execute the workflow.
On the experiment main page, the experimental progress indicator will now show that the first two steps are complete (i.e. Inputs and Samples have been provided), so now it's time to define and run the data analysis. We'll use the All-in-one RNA-seq Analysis Wizard to do this quickly and easily.
GeneProf's analysis wizards (cp. SubConcept 'Analysis Wizards' and this figure) have been designed to make this step as painless and straightforward as possible, so really, it should take us only a few clicks!
Open the All-in-one RNA-seq Analysis Wizard from the experiment main page either by scrolling all the way to the bottom of the page and clicking the big RNA-seq icon or by choosing the wizard from the drop-down list in the section Things you might want to do.. (almost at the top of the page).
The short version of the tutorial at this step would go like this: Click the Accept Settings & Create Workflow button at the bottom of the page. We're done here. The default settings of the wizard will run exactly the type of analysis we want. Well, let's have a closer look at the page anyway:
The wizard requires us to answer a few questions about the nature of this experiment. We need to start by selecting the reference dataset to be used. Since we've filled in the organism in the sample annotation, GeneProf has already selected the recommended mouse reference for us and we don't need to do anything.
Since our input data consists of raw data files (GeneProf doesn't yet know they are sequences), we also need to tell GeneProf how to parse the files into a compatible dataset (cp. SubConcept 'Parsing'). Well, actually, GeneProf will have already guessed that the files are in FASTQ format (by looking at the top few lines of each file), so again, we don't need to change anything here. In general, it's worth double-checking GeneProf's guess, though -- sometimes it might be wrong!
We also need to let GeneProf know which input samples belong to the same group of data, so it knows how to compare them when checking for differential gene expression. Luckily, we've filled in the Sample Group annotation for our data, so GeneProf has already completed this step for us and created two groups (A and B) with the correct input data associated to each group.
The remaining steps of the wizard configuration allow us to customize the created analysis workflow. We're quite happy with the defaults, which will run a basic quality control step (we can always make this stricter after the first run of the workflow by modifying the parameters in the workflow designer, cp. this page), align the sequences to the mouse reference genome (with Bowtie [6]; if we had longer sequences (ours are only 25bp long) or paired-end reads, we'd choose TopHat [7] here instead) and then calculate gene expression counts and compare these with DESeq [12]. We could, for example, adjust the FDR (false discovery rate) threshold at this stage to make the selection of differentially expressed genes more (lower threshold) or less (higher threshold) strict, but we can easily change this later on, so let's just see what comes out of the analysis first.
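To make the FDR threshold a little less abstract: DESeq reports p-values that have been adjusted for multiple testing, and the wizard's cutoff is applied to these adjusted values. A minimal sketch of a Benjamini-Hochberg-style adjustment follows; this is our own illustration of the general idea, not DESeq's actual implementation, which handles ties and edge cases more carefully:

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg FDR adjustment (minimal sketch).

    Each p-value p_(i) (sorted ascending, rank i out of m) is scaled
    by m / i; a cumulative minimum taken from the largest rank down
    enforces monotonicity. Genes whose adjusted p-value falls below
    the chosen FDR threshold are called differentially expressed.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

With an FDR threshold of 0.05, a gene would then be called differentially expressed if its adjusted p-value is at most 0.05.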
Click the big Accept Settings & Create Workflow button at the bottom of the page to complete the wizard configuration and confirm the proposed workflow (an overview will be shown on the upcoming page) by clicking Save and execute this workflow now!.
The experiment will now be entered into the processing queue (cp. this page) and all analysis steps will soon be executed on one of the number-crunching servers behind GeneProf. Since we're dealing with some small demo datasets, this should actually not take very long at all, but we might have to wait a while before it's our turn in the queue.
In any case, GeneProf will send you an email as soon as the analysis is complete, so now might be a good time to take a little break :-). You can browse away from this page, log out of GeneProf or even shut down your computer. Of course, you can also stay on the processing queue page, which will automatically refresh every once in a while, so you can monitor the progress.

Summary:

  1. Open the All-in-one RNA-seq Analysis Wizard for your experiment.
  2. Confirm that all settings are what you want them to be.
  3. Click Accept Settings & Create Workflow and execute the experiment.

RNA-seq Step 3: Check the Outputs

RNA-seq Tutorial Step 3. Examine the summary statistics, open the complete results table and browse it. Optionally, create some extra plots.
Once you receive the email telling you that your analysis is complete, you can go back to have a look at the outputs (there should be a link in that email leading directly to your experiment; mind that you might have to log in again, though; cp. this figure).
Right, so on the experiment main page, scroll down to the Main Experimental Results section. Alongside the main part of the analysis, the wizard has also included several components that create informative summary statistics at various steps of the workflow. We'll use these to confirm that everything worked out well (or find out where there's room for improvement).
Check the tab labelled Sequence Data Statistics before Quality Control first. This overview summarizes the input data just after parsing it from the raw data files, i.e. before any quality control was performed. You'll see that the four input datasets contained 25,000 sequences each. To get an idea about the quality of the data, we'd recommend having a look at (a) the nucleotide distribution and (b) the quality scores across the length of the reads (i.e. with progressing sequencing cycle). Note, you can open the summary in a new browser window / tab by clicking the Too small? Open in a new page? link at the top of the tab -- this can make it a lot easier to look at all those plots which may easily come out a bit cramped in the small tab inside the experiment main page.
If you scroll down a bit, you'll see a plot with the title Nucleotide Composition per Cycle. This is our first stop. We would usually expect the distribution of nucleotides to be roughly uniform across the entire length of all reads -- hey, usually we'd be looking at millions of reads here, and chances are that the likelihood of observing any one nucleotide at any read position should be pretty much the same. Small deviations from uniformity are nothing to worry about, but sometimes we see striking non-uniformity, in particular at the beginning of reads, which might indicate that something has gone wrong there. We could then, for example, decide to trim off the first few bases of each read to improve subsequent alignment results. For our example data, the distribution is quite acceptable (the distributions look a bit skewed until about cycle 10 or 11, but let's hope that's not too big a problem). Note, however, how there is an accumulation of N's at positions 7 and 19 in dataset B2! This might indicate that something did go slightly wrong in the sequencing process at these sequencing cycles.
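The idea behind this plot is easy to reproduce: count how often each base occurs at each read position. A small sketch, assuming equal-length reads as in our sample data:

```python
from collections import Counter

def composition_per_cycle(reads):
    """Count nucleotide occurrences per sequencing cycle.

    Returns one Counter per cycle (read position). In a healthy
    dataset the counts are roughly uniform across cycles; a spike
    of 'N's at one position (as at cycles 7 and 19 of sample B2)
    stands out immediately.
    """
    cycles = [Counter() for _ in range(len(reads[0]))]
    for read in reads:
        for position, base in enumerate(read):
            cycles[position][base] += 1
    return cycles
```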
Next, we'd look at the plot just below the previous one: Average Quality Score per Cycle (cp. SubConcept 'Quality Scores and Formats'). We can instantly see that the reported read qualities drop drastically towards the ends of the reads. That's a common observation for all sequencing datasets, but modern sequencers should typically behave a lot better than this. Using GeneProf's quality control modules (cp. module 'Modify and Filter Sequences') we could now decide to trim reads further (by editing the parameters via the workflow designer, cp. this page), say after the quality score of a read drops below a certain threshold, but bear in mind that the sequences are already quite short and trimming them further might make it difficult to unambiguously identify where in the genome they came from.
Interestingly, note how there's also a little dip in the average read quality of B2 at position 19 -- just where we found an accumulation of N's.
The last plot we want to look at in this tutorial is titled Average Quality Scores. This plot can be useful for deciding on an appropriate threshold if you filter out low-quality reads by average read quality. The wizard, with the configuration we used, would discard all reads with an average quality of less than 8. That's pretty lenient, really, and in fact the graph shows that there were no sequences with an average quality that low in our dataset, so the quality control step shouldn't really have done anything. Let's confirm this by checking the main experimental results tab 'Sequence Data Statistics after Quality Control': Indeed, the filtered datasets are just the same size as the input data! All plots should also show exactly the same data.
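For reference, the 'average quality' of a read is simply the mean of its per-base Phred scores, which are encoded as ASCII characters in the FASTQ quality string. A short sketch, assuming the common Phred+33 ('Sanger') encoding -- an assumption on our part, since older Illumina data uses an offset of 64 instead (cp. SubConcept 'Quality Scores and Formats'):

```python
def average_quality(qual, offset=33):
    """Mean Phred score of a read, decoded from its quality string.

    Each character encodes a score Q = ord(char) - offset; 33 is the
    Sanger / modern-Illumina convention (assumed here -- older
    Illumina pipelines used an offset of 64 instead).
    """
    return sum(ord(c) - offset for c in qual) / len(qual)

def passes_average_quality_filter(qual, threshold=8, offset=33):
    # Mimics an average-quality cutoff like the wizard's default of 8.
    return average_quality(qual, offset) >= threshold
```

A read whose quality string is all 'I' characters averages Q40 and sails through; a read of all '#' characters averages Q2 and would be discarded.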
Of course, in a real experiment this wouldn't usually be the case, but yeah, we could probably be a little bit stricter here and try to eliminate more erroneous or artifactual reads.
Anyway, let's look at the Read Alignment Statistics and, in particular, at the Alignment Ambiguity plot. The plot shows which percentage of reads from each input library could be aligned successfully to the reference genome. Reads that could not be aligned uniquely to one position in the genome will also be included, so long as they could be aligned with an acceptable (low) degree of ambiguity (up to 10 possible alignments). In our experiment, about 40-50% of all reads could be aligned uniquely to one position in the genome and a further 10% or so could be aligned with a low degree of ambiguity, leaving about 40% which could not be aligned at all or were too ambiguous. That's not amazing, but acceptable. One could consider adjusting the alignment parameters (e.g. the number of mismatches allowed) or optimizing the quality of the input data (by quality control measures) to improve the results. Sometimes it might make sense to trim unaligned reads and try to re-align them afterwards. Such advanced steps can be implemented using the customizable workflow designer (cp. this page, e.g. with module 'Quality Control + Bowtie Itera..').
Ok, we're getting closer to seeing some actual biology now (or: as much biology as one can expect to see from a mock sample experiment..): The Gene Expression Statistics tab gives a summary of the expression estimates calculated on the basis of the aligned data. We'll see, for example, that the most strongly expressed genes seem to be fairly consistent within the samples of B, but not so much for A (always keep in mind that any expression measure calculated from only 25,000 reads will be less than reliable!). A hierarchically clustered heatmap shows us that the samples belonging to the same groups do indeed cluster together, and it does look like there might be several quite different groups of genes in this data. Further plots below (correlation matrix and PCA plot) confirm within-group similarities. We also see that most of the expression observed comes from protein-coding genes (as would be expected), but there is some evidence for the transcription of pseudogenes and other transcriptional features.
Finally, there's the actual, main output of the experiment: Gene Expression with Differential Expression Statistics. That's basically a big table, with the gene expression estimates (as raw read counts, reads per million = RPM, and reads per kilobase per million = RPKM) for each feature that was found to be expressed in the data at hand. The tab in the experiment main page only shows 10 rows from this table, but we can open the complete results in a new page by clicking Browse Complete Data Table (cp. this page). In the table, we can look for all the genes we might be interested in, or look at the ones most differentially expressed. Sort the table, for example, by the column adj. P A /B, which is the measure of significance (FDR-corrected) calculated by DESeq [12] that the genes are differentially expressed between conditions A and B. At this stage, we could also create additional plots to visualize the expression in our datasets (cp. this page) by clicking the Plots button. A filtered dataset containing only differentially expressed genes is available if you have a look at the workflow in detail (cp. this page). In this case, the 'Compliant Features' output of the Basic Feature Annotations Filter (cp. module 'Basic Feature Annotations Filter') module reports 389 genes differentially expressed (using an FDR threshold of 0.05 as defined earlier in the analysis wizard).
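The two normalized measures in the table are straightforward to derive from the raw counts. A sketch with illustrative numbers (note that the exact normalization details, e.g. which reads count as 'mapped', may differ from GeneProf's implementation):

```python
def rpm(read_count, total_mapped_reads):
    """Reads per million: scales a raw count by library size,
    making counts comparable between libraries of different depth."""
    return read_count / total_mapped_reads * 1e6

def rpkm(read_count, total_mapped_reads, feature_length_bp):
    """Reads per kilobase per million: additionally corrects for
    feature length, so long and short genes become comparable."""
    return rpm(read_count, total_mapped_reads) / (feature_length_bp / 1e3)
```

E.g. 100 reads falling on a 2 kb gene in a library of one million mapped reads give an RPM of 100 and an RPKM of 50.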
Another great way to look at the outputs of an experiment is to examine the aligned data in the genome browser. We'll not go into detail here, but please have a look at the documentation of the genome browser page (this page) or at this section of another tutorial (Subsection 'Method 2: Examining Public Dat..') for details.

Summary:

  1. Examine the Main Experimental Results of your experiment, in particular, the sequence data statistics before and after quality control -- what do you notice? The alignment statistics and gene expression statistics give you further clues about how well your analysis worked out.
  2. Have a look at the Gene Expression with Differential Expression Statistics table in detail. Sort the table by different criteria and try to find some 'interesting' genes.

RNA-seq Step 4: Adjust Parameters and Re-Run the Analysis

RNA-seq Tutorial Step 4. Open the workflow designer, modify the FDR filter threshold and re-run the analysis.
So, we've successfully run our initial analysis workflow and produced some results. In many cases, we might not be completely satisfied with these initial results, and a little further tweaking might significantly improve our outputs.
Rather than starting all over again from scratch, we can now use the workflow designer (cp. this page) to adjust parameters of individual components in the existing workflow. We can then re-run only the updated parts of the workflow (see this figure). Conveniently, any changes to the workflow can be tracked later on via the experiment history (cp. this page), so it's easy to keep track of what we've done!
We've pointed out a few places earlier on where one might take action to improve analysis results. For example, we frequently adjust the pre-processing / quality control parameters after the first run of an analysis pipeline.
For this tutorial, let's keep it as simple as possible, though, and adjust only one single parameter of the workflow: Say we're not happy with the differentially expressed genes we got in the end. Maybe we were a bit too strict with our FDR cutoff in the first place? Let's relax the criteria a bit and accept all genes with an FDR-adjusted p-value of 0.1 or less (rather than 0.05 as previously).
In the workflow designer, locate the Basic Feature Annotations Filter (cp. module 'Basic Feature Annotations Filter') component (it should be at the bottom-right of the workflow) and click the Module Parameters link. A new dialog will appear, allowing us to change parameters as we please. If you scroll down a little, you'll find the setting adj. P A / B is less or equal 0.05. Change the number to 0.1 and save the change by clicking Apply Changes. In the workflow designer, the workflow component will now have a grey dashed box around it rather than a solid black box -- this indicates that the component has not yet been executed. Note that changes won't be saved until you click Save in the menu bar, so if you've made a mistake, just refresh the page to return to the last saved version of the workflow.
This is the only change we want to make, so click the Execute button to save the workflow and submit the experiment for execution to the processing queue.
Running the modified component of the workflow should only take a few seconds (you'll get an email again to notify you when processing is complete) and afterwards you can examine the updated outputs as before.
In the same fashion, you can update all components of the workflow. More experienced users can even add new components to the workflow or replace existing ones as they see fit. Iteratively running and modifying the workflow is a great, explorative way to run your data analysis and optimize your results!

Summary:

  1. Open the workflow designer (this page).
  2. Locate the Basic Feature Annotations Filter component and relax the FDR filter criteria to 0.1 or less.
  3. Re-run the analysis workflow and have a look at the updated results.

Discovering Transcription Factor Binding Sites (ChIP-seq)

In this tutorial, we'll create a new GeneProf experiment, upload some sample data, annotate the uploaded files and then use the All-in-one ChIP-seq Analysis Wizard to create an analysis workflow for this data. We'll also see how one might go about interpreting the outputs produced. Let's get started!

ChIP-seq Step 1: Obtain Sample Data, Create an Experiment and Upload the Files

In this tutorial (cp. this figure), we will be using some mock sample data simulating a typical ChIP-seq experiment profiling the binding of two transcription factors (A and B) in mouse. Please start by downloading the sample data from this link. The ZIP archive contains three files with sequences in FASTQ format (cp. SubConcept 'Quality Scores and Formats'): one for each transcription factor (TFs A and B) and one control (say, binding of a non-specific antibody).
File                     Format        Transcription Factor / Antibody Target   Number of Reads   Read Length
ChIPseq-SampleData.zip   ZIP archive   -                                        -                 -
-- A.fq                  FASTQ         A                                        250,017           25bp
-- B.fq                  FASTQ         B                                        250,015           25bp
-- Control.fq            FASTQ         Control                                  250,004           25bp
ChIP-seq Tutorial Step 1. Download the sample data, create a new experiment, upload the data and annotate the files.
You don't need to extract the files from the archive or do anything else with it at all: GeneProf will later on automatically detect that it's a ZIP archive and extract the contained data itself.
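For the curious, FASTQ files store each read as four lines (header, sequence, separator, quality string). Here is a minimal, illustrative parser sketch (not GeneProf's actual code; the read name is made up):

```python
# Minimal sketch of the four-line FASTQ record layout GeneProf parses:
# '@'-prefixed header, sequence, '+' separator, per-base quality string.
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) tuples from FASTQ lines."""
    for i in range(0, len(lines), 4):
        header, seq, _sep, qual = lines[i:i + 4]
        yield header[1:], seq, qual

record = ["@read_1", "ACGTACGTACGTACGTACGTACGTA", "+", "IIIIIIIIIIIIIIIIIIIIIIIII"]
reads = list(parse_fastq(record))
# one read of length 25, matching the 25bp reads in the sample data
```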
Right, so let's create a virtual experiment which we will use for this exercise (call it anything you like). If you're not sure how to go about starting a new experiment, have a look at this tutorial first: Tutorial 'How to Create a GeneProf Exper..'.
By the way, a public version of this demo experiment is available under accession number gpXP_000120. Note, however, that the raw data files in this experiment have been replaced directly with the parsed sequence output, as is always done when an experiment is finalized and made public (cp. SubConcept 'Finalization and Publication').
After creating the new experiment, upload the sample data you've just obtained (i.e. the whole ZIP, or alternatively the FASTQ files contained). Not sure how to upload data? Please refer to this tutorial: Tutorial 'Uploading Data to GeneProf'.
Finally, we should annotate the data files we've just uploaded. Providing proper sample annotation will make it easier to run the analysis wizards in the next step.
On the experiment main page (cp. this page), scroll down until you find the section titled Sample Groups and Experimental Factors, then click the Annotate Samples button.
GeneProf will automatically create a default, blank annotation table for us with three rows and three annotation columns: Label, Organism and Platform (for details about sample annotation types and how to use this particular page, see this page). Having three different biological samples, this layout is indeed what we want, so we won't have to change much. GeneProf has already selected an association to one of the input files in each row, so we just need to fill in the blank cells.
Let's label the samples 'A', 'B' and 'Control', respectively (make sure to fill in the label in the row that has been associated with the respective file!). They all come from mouse, so next click the pencil icon in the Organism column of one of the rows, type 'mouse' into the text box (or 'mus musculus', if you like) and select mouse from the dropdown list, then confirm your selection by clicking Ok. GeneProf will fill in the NCBI taxon ID for M. musculus (which is 10090). Rather than repeating this step for every other row, we can just double-click the value we filled in to copy it over to all other cells in this column. Next, choose the technology platform used to produce the data: Again, click one of the pencil icons, then select Illumina Genome Analyzer from the dropdown list and confirm with Ok. Fill in the entire column with the same value.
Now, we'd like to add one additional annotation in order to link the datasets reporting transcription factors back to the genes that encode these factors, and thereby tie them in more closely with the data in the GeneProf database. We'll add a Gene column to include this information. Click the Add Column button, choose Gene, then confirm with Ok. Now, the mock data for this experiment does not actually represent any gene, so just pick one at random for datasets A and B. We'll arbitrarily choose dem1 and edem1 here.
We also want to add another column called Sample Group. This annotation type enables us to group several samples that represent the same factor into one. For the analysis, GeneProf will then merge the sequences of all these samples prior to further analysis. Now, in this dummy experiment there's actually only one sample per group, but let's add the sample group annotation anyway -- this will make running the wizard in the next step easier. Enter 'A', 'B' and 'Control' (just like the labels, or any other names you'd like to give the sample groups) into the respective fields.
That will do for this tutorial. In a real experiment, please try to add as much information as possible to the sample annotation!

Summary:

  1. Download the sample data.
  2. Create a new experiment for this tutorial.
  3. Upload the sample data to the new experiment.
  4. Annotate the data. Make sure to add a Gene and a Sample Group column.

ChIP-seq Step 2: Define an Analysis Workflow and Execute it on the Servers

ChIP-seq Tutorial Step 2. Open the ChIP-seq wizard, adjust the group selections and fragment size, then save and execute the workflow.
On the experiment main page, the experimental progress indicator will now show that the first two steps are complete (i.e. Inputs and Samples have been provided), so now it's time to define and run the data analysis. We'll use the All-in-one ChIP-seq Analysis Wizard to do this quickly and easily.
GeneProf's analysis wizards (cp. SubConcept 'Analysis Wizards' and this figure) have been designed to make this step as painless and straightforward as possible, so really, it should take us only a few clicks!
Open the All-in-one ChIP-seq Analysis Wizard from the experiment main page either by scrolling all the way to the bottom of the page and clicking the big ChIP-seq icon or by choosing the wizard from the drop-down list in the section Things you might want to do.. (almost at the top of the page).
The default settings of the wizard will be just right for most of our analysis, but let's check through all parameters and see where we have to change something.
The wizard requires us to answer a few questions about the nature of this experiment. We need to start by selecting the reference dataset to be used. Since we've filled in the organism in the sample annotation, GeneProf has already selected the recommended mouse reference for us and we don't need to do anything.
Since our input data consists of raw data files (GeneProf doesn't yet know they are sequences), we also need to tell GeneProf how to parse the files into a compatible dataset (cp. SubConcept 'Parsing'). Well, actually, GeneProf will have already guessed that the files are in FASTQ format (by looking at the top few lines of each file), so again, we don't need to change anything here. In general, it's worth double-checking GeneProf's guess, though -- sometimes it might be wrong!
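The kind of guess described here can be as simple as peeking at the first few lines of a file. An illustrative sketch (an assumption about the general idea, not GeneProf's actual detection logic):

```python
# Sketch: recognize the FASTQ layout from the first few lines of a file
# ('@'-prefixed header, a sequence line, then a '+' separator line).
def looks_like_fastq(first_lines):
    return (len(first_lines) >= 3
            and first_lines[0].startswith("@")
            and first_lines[2].startswith("+"))

assert looks_like_fastq(["@read_1", "ACGT", "+", "IIII"])
assert not looks_like_fastq([">read_1", "ACGT"])  # FASTA, not FASTQ
```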
We also need to let GeneProf know which input samples belong to the same group of data and which data defines the background signal. Luckily, we've filled in the Sample Group annotation for our data, so GeneProf has already done some of the work for us by creating three groups (A, B and Control) with the respective input data associated to each group. Now, this is not entirely correct: Each of the groups here is meant to correspond to one DNA-binding protein -- the 'Control' dataset, however, reports the background signal (non-specific binding activity) and should not be selected as a normal group, but in its own, special section (just below the group selection). So, what we want to do is remove the DNA-binding protein group 'Control' and add its data to the background signal selection instead. To do so, we simply click the clear link below the label of the 'Control' group: The association with the Control input dataset will now be removed. If there's no input data associated with a group, that group will be ignored. Now we need to add this dataset to the background signal: Just highlight the Control dataset in the select box in the section Which samples represent the background signal?.
The remaining steps of the wizard configuration allow us to customize the created analysis workflow. We're quite happy with the default, which will run a basic quality control step (we can always make this stricter after the first run of the workflow by modifying the parameters in the workflow designer, cp. this page) and align the sequences to the mouse reference genome (with Bowtie [6] and up to two mismatches between the input sequences and the reference genome).
GeneProf will look for enriched binding peaks in each of the input data groups (A and B) in comparison to the background signal (Control) and call these peaks putative transcription factor binding sites. You can customize the stringency of this peak detection step by adjusting the FDR (false discovery rate) threshold (higher threshold = more peaks, lower threshold = fewer peaks), but let's keep the default of 0.1 for now -- we can (and will) easily adjust that later on. One parameter that GeneProf unfortunately cannot guess from your input data is the average fragment length of the sequenced DNA fragments. That's usually somewhere between 200 and 500bp, but really depends on your sample preparation protocol and technology. We've set the default to 400bp, but if you know more about the fragment length in your own data, please do adjust this value to get better results. In the sample data, the average fragment length was only 200bp, so change the value to 200, please.
Peaks can be associated with known genes in order to infer potential regulatory links (i.e. does gene X have a binding site for transcription factor Y? Well, then X might potentially be regulated by Y). There's no definitive way of doing that, but GeneProf, by default, will look at all genes and, if there is a binding site up to 20kb upstream or 1kb downstream of a gene's transcription start site, the gene will be flagged as 'yes, there is a binding site'. These gene associations are inherently error-prone and should be treated with care. The wizard allows you to change the default behaviour by adjusting the window sizes. For the sake of this tutorial, let's keep the defaults.
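The default window logic can be sketched as follows (illustrative only; we assume strand-aware windows around the TSS, which is how such assignments are commonly made, though GeneProf's exact implementation may differ):

```python
# Sketch of the default peak-to-gene assignment: flag a gene if any
# binding peak lies within 20kb upstream to 1kb downstream of its
# transcription start site (TSS), respecting the gene's strand.
def has_binding_site(tss, strand, peaks, upstream=20000, downstream=1000):
    """peaks: iterable of (start, end) coordinates on the same chromosome."""
    if strand == "+":
        lo, hi = tss - upstream, tss + downstream
    else:  # reverse strand: 'upstream' means higher coordinates
        lo, hi = tss - downstream, tss + upstream
    return any(start <= hi and end >= lo for start, end in peaks)

# A peak 5kb upstream of a forward-strand TSS counts as a hit:
assert has_binding_site(100000, "+", [(94800, 95200)])
# ...but one 30kb upstream does not:
assert not has_binding_site(100000, "+", [(69800, 70200)])
```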
Click the big Accept Settings & Create Workflow button at the bottom of the page to complete the wizard configuration and confirm the proposed workflow (an overview will be shown on the upcoming page) by clicking Save and execute this workflow now!.
The experiment will now be entered into the processing queue (cp. this page) and all analysis steps will soon be executed on one of the number-crunching servers behind GeneProf. Since we're dealing with some small demo datasets, this should actually not take very long at all, but we might have to wait a while before it's our turn in the queue.
In any case, GeneProf will send you an email as soon as the analysis is complete, so now might be a good time to take a little break :-). You can browse away from this page, log out of GeneProf or even shut down your computer. Of course, you can also stay on the processing queue page, which will automatically refresh every once in a while, so you can monitor the progress.

Summary:

  1. Open the All-in-one ChIP-seq Analysis Wizard for your experiment.
  2. Clear the selections in the Control group to remove the group and add that dataset to the background signal instead.
  3. Change the average fragment length to 200bp.
  4. Click Accept Settings & Create Workflow and execute the experiment.

ChIP-seq Step 3: Check the Outputs

ChIP-seq Tutorial Step 3. Examine the summary statistics, open a complete results table and have a look at the alignments in the genome browser.
Once you receive the email telling you that your analysis is complete, you can go back to have a look at the outputs (there should be a link in that email leading directly to your experiment; mind that you might have to log in again, though; cp. this figure).
Right, so on the experiment main page scroll down to the Main Experimental Results section. Alongside the main part of the analysis, the wizard has also included several components in the workflow which create informative summary statistics at various steps of the workflow. We'll use these to confirm that everything worked out well (or find out where there's room for improvement).
Check the tab labelled Sequence Data Statistics before Quality Control first. This overview summarizes the input data just after parsing it from the raw data files, i.e. before any quality control was performed. You'll see that the three input datasets contained approximately 250,000 sequences each. To get an idea about the quality of the data, we'd recommend having a look at (a) the nucleotide distribution and (b) the quality scores across the length of the reads (i.e. with progressing sequencing cycle). Note, you can open the summary in a new browser window / tab by clicking the Too small? Open in a new page? link at the top of the tab -- this can make it a lot easier to look at all those plots, which may easily come out a bit cramped in the small tab inside the experiment main page.
If you scroll down a bit, you'll see a plot with the title Nucleotide Composition per Cycle. This is our first stop. We would usually expect the distribution of nucleotides to be overall uniform across the entire length of all reads -- after all, usually we'd be looking at millions of reads here, and the likelihood of observing any one nucleotide at any read position should be pretty much the same. Small deviations from uniformity are nothing to worry about, but sometimes we see striking non-uniformity, in particular at the beginning of reads, which might indicate that something has gone wrong there. We could then, for example, decide to trim off the first few bases of each read to improve subsequent alignment results. For our example data, the distribution is indeed very uniform and quite acceptable. Note, however, how the percentage of N's (i.e. uncertain nucleotides, where the sequencer could not accurately determine the correct nucleotide at one position) appears to build up a little towards the ends of the reads, especially in datasets B and Control.
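The per-cycle composition behind this plot can be sketched as follows (a toy computation over made-up 4bp reads; the real plot is computed over all ~250,000 25bp reads per dataset):

```python
# Sketch of the "Nucleotide Composition per Cycle" statistic: for each
# sequencing cycle (read position), count the fraction of A/C/G/T/N.
from collections import Counter

def composition_per_cycle(reads):
    """Return a list (one entry per cycle) of nucleotide fractions."""
    per_cycle = []
    for cycle_bases in zip(*reads):  # column i = base at cycle i in every read
        counts = Counter(cycle_bases)
        total = sum(counts.values())
        per_cycle.append({base: n / total for base, n in counts.items()})
    return per_cycle

reads = ["ACGT", "ACGA", "ACGN", "ACGT"]
comp = composition_per_cycle(reads)
# cycle 1 is all 'A'; cycle 4 mixes T, A and N -- a build-up of N's
# towards the read ends would show up here, as in datasets B and Control
```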
Next, we'd look at the plot just below the previous one: Average Quality Score per Cycle (cp. SubConcept 'Quality Scores and Formats'). We can instantly see that the reported read qualities drop drastically towards the ends of the reads. That's a common observation for all sequencing datasets, but modern sequencers should typically behave a lot better than this. Using GeneProf's quality control modules (cp. module 'Modify and Filter Sequences'), we could now decide to trim reads further (by editing the parameters via the workflow designer, cp. this page), say, after the quality score of a read drops below a certain threshold, but bear in mind that the sequences are already quite short and trimming them further might make it difficult to unambiguously identify where in the genome they came from. Worryingly, the overall quality of the Control sample is rather low (and notably lower than the quality of the other two datasets). We'll have to expect that many of the sequences in this dataset might be erroneous.
The last plot we want to look at in this tutorial is titled Average Quality Scores. This plot can be useful to decide on an appropriate threshold, if you filter out low-quality reads by average read quality. The wizard, with the configuration that we used, would discard all reads with an average quality of less than 8. That's pretty lenient really and, actually, the graph shows that there were only a few sequences with an average quality that low in datasets A and B, but quite a number in the Control dataset. Thus, the quality control step should have left datasets A and B nearly untouched, but discarded a part of the control data. Let's confirm this by checking the main experimental results tab 'Sequence Data Statistics after Quality Control': Indeed, datasets A and B are almost the same size as before, but nearly 100,000 control sequences were discarded! In a real experiment, this might be a reason for concern, because it becomes questionable whether such a reduced dataset would still provide an accurate measurement of the genome-wide background binding activity -- in this tutorial, let's just hope that we can still get something useful out of the data!
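The average-quality filter can be sketched as follows (assuming Sanger/Phred+33-encoded quality strings; an illustrative sketch, not GeneProf's actual code):

```python
# Sketch of the average-read-quality filter: decode Phred scores from a
# FASTQ quality string (assuming the Sanger/Phred+33 encoding) and
# discard reads whose mean score falls below the threshold of 8.
def mean_quality(qual, offset=33):
    return sum(ord(c) - offset for c in qual) / len(qual)

def quality_filter(reads, threshold=8):
    """reads: (sequence, quality-string) pairs; keep only the good ones."""
    return [(seq, q) for seq, q in reads if mean_quality(q) >= threshold]

reads = [("ACGT", "IIII"),   # 'I' = Phred 40: high quality, kept
         ("ACGT", "$$$$")]   # '$' = Phred 3: low quality, discarded
kept = quality_filter(reads)
```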
Anyway, let's look at the Read Alignment Statistics and, in particular, at the Alignment Ambiguity plot. The workflow constructed by the wizard does not allow any ambiguity in the alignment process, so all sequences for which no unique match in the genome could be identified will be reported as 'unaligned'. The plot shows that in our example experiment, only about 40-50% of all reads could be aligned uniquely to one position in the genome. That's certainly not amazing, but might be just what you get -- remember, the quality of our input reads was not particularly good, and a high percentage of erroneous reads will decrease the alignment success rate. One could consider adjusting the alignment parameters (e.g. the number of mismatches allowed) or optimizing the quality of the input data (by quality control measures) to improve the results. Sometimes it might make sense to trim unaligned reads and try to re-align them afterwards. Such advanced steps can be implemented using the customizable workflow designer (cp. this page), e.g. with module 'Quality Control + Bowtie Itera..'.
Ok, we're getting closer to seeing some actual biology now (or: as much biology as one can expect to see from a mock sample experiment..): The TFBS Statistics tab gives a summary of the putative binding sites identified on the basis of enriched binding peaks in the aligned data. The peak finding algorithm MACS [4] has called 157 peaks for A and 188 peaks for B. Almost 60% of A's peaks are located in the promoter regions of known genes (for B, it's less than 40%), but many of the peaks are also intragenic or far away from any known gene. The last four plots of the summary page tell us a little about how similar the binding sites of the different transcription factors are: We count the co-localization frequency for each transcription factor (i.e. how often the binding sites of one factor overlap with the binding sites of the other). Each of the plots is progressively more permissive in the way these overlaps are calculated: The first plot looks for peaks which are EXACTLY the same, i.e. with exactly the same start and end coordinates. The next plot looks for overlapping binding sites or binding sites within 100bp distance of each other. The other two plots allow for a distance of at most 250 or 500bp, respectively. In our example, there's apparently not much similarity between the binding sites of the two factors: Only 2% of the binding sites are within 100bp of each other. In real experiments, we frequently observe that many binding sites seem to be occupied by multiple transcription factors.
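The co-localization counts described here can be sketched roughly as follows (function and peak coordinates are made up for illustration; GeneProf's actual overlap computation may differ):

```python
# Sketch of the co-localization counts behind the last four TFBS plots:
# how many peaks of factor A lie within a given distance of any peak of
# factor B (distance 0 = touching/overlapping peaks only).
def colocalized(peaks_a, peaks_b, max_dist):
    """Count peaks in peaks_a with a peak of peaks_b within max_dist bp."""
    hits = 0
    for a_start, a_end in peaks_a:
        if any(b_start - max_dist <= a_end and b_end + max_dist >= a_start
               for b_start, b_end in peaks_b):
            hits += 1
    return hits

peaks_a = [(1000, 1200), (5000, 5300)]
peaks_b = [(1250, 1400), (9000, 9100)]
# the first A peak ends 50bp before a B peak starts:
assert colocalized(peaks_a, peaks_b, 100) == 1
assert colocalized(peaks_a, peaks_b, 0) == 0
```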
Finally, there's the actual, main output of the experiment: Regulated Genes and TFAS. These are basically both big tables, which report for each gene and each factor (A and B) whether there is a putative binding site within the proximity of its transcription start site (Regulated Genes; see the window size criteria set in the wizard earlier) or a numeric score for how strongly a transcription factor associates with each gene in the reference (TFAS; calculated on the basis of the fold-change enrichments in all binding peaks identified, cp. module 'Calculate TFAS'). You can open each of these tables in a new window by clicking Browse Complete Data Table (cp. this page). In the table, you can look for all the genes you might be interested in, or look at the ones with the highest association score (TFAS) for factors A and B by sorting the table, for example, by the column TFAS: Fold Change (A). At this stage, we could also create additional plots to visualize the binding patterns in our datasets (cp. this page) by clicking the Plots button. By the way, a genomic region dataset with all identified binding peaks can be found if you have a look at the workflow in detail (cp. this page).
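To give an intuition for how such an association score works, here is a rough sketch in the spirit of the TFAS [9]; the decay constant d0 and all peak data are made-up illustration values, not GeneProf's actual parameters:

```python
# Sketch in the spirit of a transcription factor association strength:
# sum the enrichment (fold change) of all binding peaks, each weighted
# by an exponential decay of its distance to the gene's TSS.
import math

def tfas(tss, peaks, d0=5000.0):
    """peaks: (peak_center, fold_change) pairs on the same chromosome.
    d0 is an assumed decay constant for illustration only."""
    return sum(fc * math.exp(-abs(center - tss) / d0)
               for center, fc in peaks)

# A strong peak right at the TSS dominates an equally strong distant one:
score = tfas(100000, [(100100, 20.0), (150000, 20.0)])
```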
Another great way to look at the outputs of an experiment is to examine the aligned data in the genome browser. Open the genome browser for the data in this experiment from the experiment's main page (use the Browse button in the general info section or the in-built genome browser link in the Some things you might want to do.. section; cp. this page and this page). Use the genome browser to have a look at the region 120,593,000 to 120,599,000 on chromosome 4. Include the alignment tracks for A, B and Control (see this page for information about how to select tracks and navigate the browser) and set the 'Expand Regions' option to 100bp for each track (i.e. half the average fragment length). You'll see a rather neat binding peak for both factors (A and B) just upstream of the Dem1 gene (Dem1 is located on the reverse strand; mind, this is mock data! We've made up these peaks!). It seems that both transcription factors occupy the same binding site and could potentially regulate the expression of the Dem1 gene.
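One way to think about the 'Expand Regions' option (our interpretation, sketched here; GeneProf's exact implementation may differ): each aligned 25bp read marks only the 5' end of a longer sequenced fragment, so stretching each read towards its 3' end by half the average fragment length better approximates the true fragment coverage.

```python
# Hypothetical sketch: extend an aligned read towards 3' by half the
# average fragment length (200bp / 2 = 100bp for the sample data).
def expand(read_start, read_end, strand, bp=100):
    """Approximate the sequenced fragment from a short aligned read."""
    if strand == "+":
        return (read_start, read_end + bp)
    return (read_start - bp, read_end)  # reverse strand: extend leftwards

assert expand(1000, 1025, "+") == (1000, 1125)
assert expand(1000, 1025, "-") == (900, 1025)
```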

Summary:

  1. Examine the Main Experimental Results of your experiment, in particular, the sequence data statistics before and after quality control -- what do you notice? The alignment statistics and TFBS statistics give you further clues about how well your analysis worked out.
  2. Have a look at the Regulated Genes and TFAS tables in detail. Sort the table by different criteria and try to find some 'interesting' genes.
  3. Open the alignment tracks for A, B and Control in the genome browser and have a look at region chr4:120,593,000-120,599,000.

ChIP-seq Step 4: Adjust Parameters and Re-Run the Analysis

ChIP-seq Tutorial Step 4. Open the workflow designer, modify the Assign TFBS to Genes component, set the upstream window size to 2000 and re-run the analysis.
So, we've successfully run our initial analysis workflow and produced some results. In many cases, we might not be completely satisfied with these initial results and a little further tweaking might significantly improve our outputs.
Rather than starting all over again from scratch, we can now use the workflow designer (cp. this page) to adjust parameters of individual components in the existing workflow. We can then re-run only the updated parts of the workflow (see this figure). Conveniently, any changes to the workflow can be tracked later on via the experiment history (cp. this page), so it's easy to keep track of what we've done!
We've pointed out a few points earlier on, where one might take action to improve analysis results. For example, we frequently adjust the pre-processing / quality control parameters after the first run of an analysis pipeline.
For this tutorial, let's keep it as simple as possible, though, and adjust only one single parameter of the workflow: Say, we might not be happy with the way binding peaks were associated with genes. From previous experiments, we might know that our transcription factors only regulate genes which are bound directly in their promoter region, so associating genes with factors that bind up to 20kb upstream would not make sense.
In the workflow designer, locate the Assign TFBS to Genes (cp. module 'Assign TFBS to Genes') component (it should be at the bottom-right of the workflow) and click the Module Parameters link. A new dialog will appear allowing us to change parameters as we please. Let's narrow the window size by dragging the slider control so as to allow only 2000bp upstream of a TSS. Save the change by clicking Apply Changes. In the workflow designer, the workflow component will now have a grey dashed box around it rather than a solid black box -- this indicates that the component has not yet been executed. Note that changes won't be saved until you click Save in the menu bar, so if you've made a mistake, just refresh the page to return to the last saved version of the workflow.
This is the only change we want to make, so click the Execute button to save the workflow and submit the experiment for execution to the processing queue.
Running the modified component of the workflow should only take a few seconds (you'll get an email again to notify you when processing is complete) and afterwards you can examine the updated outputs as before.
In the same fashion, you can update all components of the workflow. More experienced users can even add new components to the workflow or replace existing ones as they see fit. Iteratively running and modifying the workflow is a great, explorative way to run your data analysis and optimize your results!

Summary:

  1. Open the workflow designer (this page).
  2. Locate the Assign TFBS to Genes component and change the window size to 2000 bp upstream.
  3. Re-run the analysis workflow and have a look at the updated results.

Examining Public Next-Gen Data using GeneProf

GeneProf hosts a wealth of public, ready-analysed next-gen data in its databases. There are a number of ways to explore this data, and in this tutorial we will look at some of them and see how you might derive useful biological knowledge using them.

Method 1: Retrieving Information about a Particular Gene

A quick and easy way to dig into the data in GeneProf is to search for information on a gene of your interest. GeneProf incorporates a wealth of static information gathered from Ensembl [13,8] and combines this information with the latest analysis results from GeneProf.
Finding Info about a Gene. Reports about gene-centric information in GeneProf can be retrieved quickly and contain rich information about the gene's expression and potential gene regulatory factors involved with it.
So, let's get started shall we? Pick your favourite gene (e.g. 'nanog') and search for it using the gene search tool (Public Data >> Search for a Gene; please refer to the documentation of the specific page to find out more about searching: this page). Choose the Nanog gene from mouse for this exercise.
The gene report page (also cp. this page) summarises all the information we have available on Nanog. We'll find, amongst other things, that there are five known transcript variants with varying 3' and 5' ends and that the gene has been implicated in embryonic development and stem cell maintenance (as by annotation from the Gene Ontology [1]).
Now, you can find this sort of information in many other databases and, in fact, some of these are linked, such as Ensembl (where most of the data comes from) and iHOP, where you can find more information about this gene in the literature. The really interesting part -- and what's unique to GeneProf -- is what comes below, under the heading Public GeneProf Data for this Feature: Next-gen data and results of GeneProf analyses that are concerned with this gene.
This information is dynamic and constantly changing as new data is analysed and added to the public repository, but the information shown should include at least the following:
Right at the top of the section, we list GeneProf experiments that specifically deal with this particular gene, e.g. for Nanog you'll find the experiments based on the great studies by Chen et al. ([2] and gpXP_000012) and Marson et al. ([3] and gpXP_000028), which both performed ChIP-seq to study (amongst others) the binding of the transcription factor Nanog. We could follow the links here to look at these studies in detail, but let's skip that for the moment and see what else is on offer.
The next section details the expression of the gene at hand by cell type (default). The plot is based on expression values (in reads-per-million; the higher the number, the stronger the gene is expressed) from all public GeneProf experiments. You can easily arrange the expression by another category by clicking on one of the links under Group expression by.., e.g. Tissue. You can find out more details about where the numbers come from by clicking Click here for more details!, which will expand a panel containing a table with all datasets, some of their annotations and the RPM values. Again, you can click the relevant links to examine each dataset in detail.
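The reads-per-million values used in these plots follow a simple library-size normalization, roughly like this (an illustrative sketch with made-up numbers):

```python
# Sketch of a reads-per-million (RPM) expression value: normalize a
# gene's read count by the total number of aligned reads in the library.
def rpm(gene_reads, total_reads):
    """Reads mapping to the gene, per million aligned reads in total."""
    return gene_reads * 1_000_000 / total_reads

# e.g. 500 reads on a gene in a library of 10 million aligned reads:
value = rpm(500, 10_000_000)  # -> 50.0
```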
As we mentioned before, Nanog is a transcription factor and the next section summarises what GeneProf knows about its DNA-binding activity. Firstly, there is a table that lists all datasets reporting genes potentially bound by this factor. Mind, 'potentially regulated' means that in the data, we observed an enrichment for binding of the factor (a 'binding peak') in the proximity of the transcription start site (TSS) of the gene. We usually use a window size of 20kb upstream and 1kb downstream of the TSS for this assignment, although these thresholds might have been adjusted for certain analyses. The binary gene assignments should be taken with care: An enrichment of binding does not necessarily imply regulation (nor does a lack of such binding in this limited window mean that there is no regulation happening). The numbers are furthermore subject to thresholding and differences between libraries and might, at times, vary drastically between different studies. The reasons for this are not (yet) entirely clear and deserve further investigation.
Next, GeneProf lists the 25 most strongly bound genes. The binding strength is given by the TFAS (transcription factor association strength [9]), which we calculate from the enrichment (fold change) of all binding sites near the TSS of a gene as a function of their distance to the TSS. For Nanog, we see that the most strongly bound gene appears to be Pou5f1, another well-known stem cell master regulator.
Lastly, we're looking at transcription factor binding the other way round: which DNA-binding proteins (transcription factors, histone modifications, ..) show an enrichment in the neighbourhood of Nanog's TSS? Mind that this association of binding factors is subject to the same uncertainty as mentioned earlier. For Nanog, amongst others, there are apparent binding sites for the stem cell regulators Pou5f1, Sox2 and Klf4 near Nanog's promoter.
However, at this point, we do not know whether these binding sites are actually shared / overlapping. We'll look at this question in the next part of this tutorial.

Summary:

  1. Open the gene search page: Public Data >> Search for a Gene.
  2. Search for a gene (e.g. 'nanog') and pick the exact gene you want from the list.
  3. Look at the information about the gene's expression in different tissues / conditions and DNA-binding activity by and near this gene.

Method 2: Examining Public Data in the Genome Browser

Genome browsers are great visual tools to explore genomic data such as the coverage of aligned next-gen sequencing reads along chromosomes. The intensity of alignment signal per genomic region can give clues about gene expression (RNA-seq; the more reads align to the exons of a gene, the stronger the gene is expressed) or DNA-binding (ChIP-seq; many aligned read fragments piling up in one place indicate that there is probably a region specifically bound by the protein studied, e.g. a transcription factor binding site).
Exploring Genomic Data. GeneProf's genome browser can help to explore what's going on near your favourite gene.
GeneProf has a simple genome browser (cp. this page; based on [5]) built into its web interface, which you can use to easily and quickly examine genomic data (GeneProf provides many public tracks from analysed experiments) in regions of interest (say, your favourite genes and their neighbourhood). For users planning to do extensive genome browsing, we'd recommend exporting the tracks they are interested in in a popular format (BED, WIG) via the batch exporter tool (cp. this page) and importing those tracks into a specialised genome browser software -- loading times in GeneProf's browser can be a bit long and those specialised tools provide additional functionality which GeneProf does not attempt to rival.
In this tutorial, we'll be looking at some transcription factor binding near the Nanog gene in mouse (see previous section). Let's first open the genome browser by clicking Public Data >> Genome Browser in the navigation bar and selecting the mouse reference dataset (cp. SubConcept 'Reference Data') as the genome we want to look at.
The genome browser will initially display one arbitrary gene locus. Let's jump to the Nanog locus by typing 'nanog' into the gene search box. GeneProf will suggest all genes matching the search term (in this case, only one, Nanog) and when you click the gene in the list, the chromosome, start and end coordinate fields will be filled in automatically to enclose that gene's locus. The browser graphic will not yet be updated, so we can make further adjustments to the locus. Adjust the start coordinate to extend the displayed region some 7.5kb upstream of Nanog's TSS, to 122650000, and click Update Display.
After a few seconds, the browser graphic will be refreshed and should now display the Nanog gene.
This is not very exciting so far, so let's add some additional tracks. Open the track selector dialog by clicking Choose Tracks & Other Display Options. You can browse through, search and filter loads of public genomic data here. Pick the track called 'Chen2008 Data: Nanog', which describes the binding pattern observed in the study mentioned earlier [2]. Also, add some expression data: Filter for 'guttman2010' in the Expression category and look for 'Guttman2010 Data: NPC.1' and 'Guttman2010 Data: ESC.1' (both from gpXP_000106, which is based on data from [10]) and then add them to the track selection, too. Finally, click Refresh at the bottom of the panel or Update Display (as before) to update the browser graphic.
The plot now shows three additional tracks: The binding pattern for Nanog near its own TSS suggests that there might be a (possibly auto-regulatory) binding site some 5kb upstream of Nanog (where there is a visible peak in the binding). The expression tracks, below, confirm that the gene is highly expressed in embryonic stem cells (ESC), with expression being evident across the gene region, peaking in all exons, and not expressed in neural progenitor cells (NPC; the few dots in the plots are likely to arise either from measurement errors or from random sporadic / partial transcription).
In a similar fashion, you can examine the local neighbourhood of the genes you might be studying in the lab. Have a look at the repertoire of tracks available; maybe there's something else of interest to you. We constantly add more public tracks as new analyses are completed.

Summary:

  1. Open the genome browser page: Public Data >> Genome Browser.
  2. Select an organism of your choice.
  3. Open the Choose Tracks & Other Display Options panel.
  4. Select all datasets you want to look at by ticking the boxes next to them.
  5. Specify the genomic region you want to look at by entering a gene name into the text box and selecting one from the pop-up list.
  6. Click Update Display to refresh the browser graphic.

Method 3: Discover Patterns in Large Datasets using the Visual Data Explorer

There's a lot of data publicly available in GeneProf and the collection is rapidly growing. The Visual Data Explorer (VDE; cp. also this page) has been designed to make it easy to discover patterns across many of these datasets in an intuitive, visual way.
For the purpose of this tutorial, we'll be looking at transcription in different brain tissues. To get started, open the VDE by selecting Public Data >> Visual Data Explorer in the navigation bar. Choose the mouse reference dataset (cp. SubConcept 'Reference Data') and keep the default data type 'Expression [RPM]', that is, expression intensities reported in reads per million (with a basic normalisation applied to account for differences in library size).
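The RPM normalisation mentioned above is simple to state explicitly; a sketch of the idea (illustrative only, not GeneProf's actual implementation):

```python
def rpm(read_count, library_size):
    """Reads per million: scale a raw read count by the total number of
    reads in the library, so libraries sequenced to different depths
    become comparable."""
    return read_count * 1e6 / library_size
```

For example, 500 reads on a gene in a library of 10 million reads give 50 RPM, the same value the gene would get with 5,000 reads in a 100-million-read library.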
For now, leave the plot type as default (Principal Component Analysis). We now need to choose the datasets we'd like to examine. Click the Select Dataset(s).. button, which will open a dialog listing all the public data of the chosen type. You can either select individual datasets by ticking the checkboxes next to them, or -- more quickly -- toggle the selection of all datasets of a certain type using the dropdown menus under each column header. Let's select all datasets that report expression in the CastEiJ strain by choosing CastEiJ from the dropdown list under the header Strain and confirming the selection by clicking Apply Changes.
Visual Data Explorer. The Visual Data Explorer enables you to quickly discover patterns in large collections of datasets.
At the time of writing this tutorial, doing so selects 23 datasets which all come from gpXP_000059 (based on data from [11]). Click the Update Plot button to refresh the VDE plot.
You'll see a panel plot consisting of four sub-plots, visualising the results of a principal component analysis (PCA; see this Wikipedia article). The first plot gives you an idea about how well the PCA worked for the data at hand. Importantly, the red line summarises which percentage of the variance of the data could be explained by the first 1, 2, 3, .. (given on the x-axis) principal components. In this case, the first three principal components represent just over 60% of the variation in the data, with most of this fraction covered by the first two principal components (PCs). That's not bad. The three remaining graphs plot the contribution of each dataset to the first three PCs (PC1 vs. PC2 = top-right, PC2 vs. PC3 = bottom-left and PC1 vs. PC3 = bottom-right). Apparently, the first two PCs separate the data quite well into three groups. Does this grouping possibly have a biological meaning?
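The explained-variance curve in that first sub-plot can be reproduced for any expression matrix; a minimal numpy sketch (illustrative only, not GeneProf's actual implementation):

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component.
    X: rows = datasets (samples), columns = genes."""
    Xc = X - X.mean(axis=0)                  # centre each gene (column)
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values of centred matrix
    var = s ** 2                             # per-component variances (unscaled)
    return var / var.sum()
```

`np.cumsum(explained_variance_ratio(X))` then corresponds to the red cumulative line described above.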
Choose to Group the data by.. >> Tissue and update the plot. Eureka! The three groups correspond to three distinct regions of the brain, which are now highlighted using a different colour and symbol for each group.
In a similar way, you can use the VDE to plot correlation matrices (which can be another way of discovering clusters of datasets) and further plot types will be added soon (you may even wish to suggest new types, see Section 'Reporting Bugs and Feature Req..'). Grouping your data can make it easy to discover patterns. The quality of the grouped plots, however, depends heavily on the quality of the annotation available for the datasets. As a rule of thumb, the Cell Type, Tissue and Platform tend to be well-annotated. Also, for DNA-binding data you might choose to group by Gene, which, for transcription factors, denotes the gene product targeted by the ChIP antibody.

Summary:

  1. Open the VDE page: Public Data >> Visual Data Explorer.
  2. Select mouse as the reference organism. Leave plot type and data type as default.
  3. Click Select Dataset(s).. and select all datasets with strain CastEiJ.
  4. Group the data by Tissue.
  5. Click Update Plot to refresh the graphic.

Method 4: Examine Public Experiments in Detail

One of the great points about GeneProf is that all data analysed in the system can be examined in detail and each analysis is fully transparent and can be traced back through every single step. In this part of the tutorial, we'll check one particular experiment more closely to give you a feel for how to go about looking at experiments in general.
Open the experiment browser by clicking Public Data >> Browse Experiments in the navigation bar. You can either flick through all experiments listed page by page or you can filter the table to quickly search for a particular experiment. Here, we want to look at an experiment based on data from an RNA-seq experiment in mouse embryonic stem cells, fibroblasts and neural progenitor cells by Guttman et al. [10]. Search for the study by entering 'guttman' into the filter textbox under the header Name and press your ENTER-key. The study we're looking for has the GeneProf experiment number gpXP_000106 and should now show up in the filtered list of experiments. Open the experiment by clicking on the magnifying glass icon.
You can now read through the general description of the experiment. You'll see that this is a 'secondary experiment', i.e. it's based on data produced independently from the analysis carried out here. There's also a link to the publication in which the data were originally described, if you'd like to find out more.
Below the general information, GeneProf highlights a few popular actions that you might want to perform with this experiment, e.g. you can export a PDF report summarising the entire experiment, inputs, outputs and the analysis workflow, you can examine the genomic data in the experiment in the genome browser or you can export these data in a variety of formats to import them into another specialised genome browser software.
Next, the page provides an overview of all input data and their annotation. You'll see that there are 7 input sequence datasets (imported from the SRA) coming from 4 biological samples from the aforementioned cell types.
Now -- and most importantly -- the next part lists the main outputs of the experiment. The creators of the experiment decided which datasets produced as part of the analysis workflow were the most important results and those will be shown here, each dataset in a different tab. For this experiment, these are: sequence data statistics before and after quality control, read alignment statistics, gene expression statistics and gene expression with differential expression.
The first four datasets (i.e. the summary statistics, also cp. ModuleCategory 'Parsers') are very useful to quickly get an idea about what's going on. The sequence data statistics are very important when you're actually performing the analysis, in order to make sure that the data you base your conclusions on are reliable. We would be looking to confirm acceptable quality scores (cp. SubConcept 'Quality Scores and Formats') and a roughly uniform distribution of nucleotides across the entire length of the sequencing reads. It's important to note that we frequently observe slightly skewed nucleotide distributions at the beginning (first 1-5bp) of the read sequences and that quality scores in general drop with progressing sequencing cycles. This is, mostly, nothing to worry about, unless it's getting too bad. In this experiment, the nucleotide distribution is indeed slightly uneven; however, quality scores are consistently good and no advanced measures were required. In other cases, it might be a good idea to trim off the beginnings or ends of read sequences, and the experiment creators can easily try out different settings.
In the alignment statistics, we usually look foremost at the Alignment Ambiguity plot, which gives an overview of what percentage of read sequences from each library could be successfully mapped back to the genome and how many of those reads mapped to repetitive sequences (cp. SubConcept 'Alignment'). If too few reads could be aligned, the quality of the input data might have been questionable or it might have been advisable to pre-process the reads in some way. In this experiment, however, the alignment worked very well, with about 60% to 77% of all reads aligning uniquely to one position in the genome and only few reads without any reported alignment.
Now, the previously described statistics lay the groundwork and may help you to judge whether the data in an experiment is trustworthy, but what we're really interested in are the results, aren't we? So let's look at the Gene Expression Statistics! The list of most highly expressed genes, the clustered expression heatmap, the correlation matrix and the PCA plot all confirm a very high degree of similarity between the samples belonging to the same sample group (cell type) -- which is what we expect to see since they're really technical replicates of each other. We can also see that NPCs appear to be more similar to ESCs in terms of gene expression than MLFs (this, again, works to confirm that the experiment did actually produce sensible gene expression measures). The majority of expression stems from protein coding genes, yet there is considerable evidence for the transcription of pseudo-genes and various forms of short RNAs with the latter contributing more strongly to MLF and NPC expression than to ESCs.
Finally, the Gene Expression with Differential Expression tab gives us the actual main output of the experiment. This dataset contains a complete table with expression values for all genes and all samples along with fold-changes and p-values marking their differential expression between the sample groups. The tab on this page will only show a selection of 10 rows from this table, but you can open the complete results table by clicking Browse Complete Data Table. The dynamic table browser (cp. this page) will open in a new browser tab / window and you can use it to search, filter and sort the results in whatever way you like. You can also add additional columns to the displayed table (e.g. additional annotations about the genes) by clicking the Display options button (cp. this page). Let's search for Nanog by typing 'nanog' into the text box under the Name header and applying the filter by pressing the ENTER-key. As expected, the gene is highly expressed in ESCs (with about 330 RPKM), but there is little to no transcription in NPCs and MLFs.
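RPKM, the unit quoted for Nanog's expression, additionally accounts for gene length on top of sequencing depth; a sketch of the standard formula (illustrative only):

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads:
    normalises a raw read count for both transcript length and
    sequencing depth, so long and short genes become comparable."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)
```

For example, 1,000 reads on a 2kb gene in a library of 10 million mapped reads correspond to 50 RPKM.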
If you're interested in how the data was analysed, you can have a look at the analysis workflow (cp. Concept 'Workflows'). A simplified, schematic representation of the workflow is included in the experiment main page: The input sequence data has been aligned to the mouse reference genome using a meta-module combining quality control with the TopHat alignment software (cp. module 'Quality Control + Tophat 1.20 ..'), from the alignment coverage we calculated gene expression values (cp. module 'Quantitate Gene Expression'), which were used to assess differential expression with DESeq (cp. module 'DESeq'), and on the basis of DESeq's assessment we've filtered the data into sets of differentially expressed genes.
You can investigate the analysis further by opening the workflow designer for this experiment (cp. this page), which allows you to see every single step of the analysis, how the outputs of one process were linked to the inputs of another and each parameter used. You can also view many additional datasets (intermediate results, etc.) that have not been included in the experiment's main page.
Another way to check how the data was analysed is to look at the experiment's history (click show history? in the general experiment description section or the History button in the workflow designer; cp. this page). You'll see that we initially imported data from the SRA, then created the analysis workflow using the RNA-seq wizard (cp. this page) and then executed the entire workflow. Note that the programmatic output of each sub-process is available as part of the history. Finally, the experiment has been finalised and made public (cp. SubConcept 'Finalization and Publication').

Summary:

  1. Find an experiment using the experiment browser: Public Data >> Browse Experiments and open it.
  2. Read through the description and examine inputs and sample annotation.
  3. Have a look at the main outputs, in particular, the summary statistics (sequence quality before and after QC, alignment stats, etc.).
  4. If you're particularly keen: Have a look at the analysis workflow and the experiment history to find out how exactly the data was processed.

Creating a Custom Reference Set

GeneProf comes with quite a range of ready-made, pre-defined reference datasets for many popular model organisms such as human, mouse, Arabidopsis or C. elegans. Don't panic, though, if your organism of choice is not in the list of organisms supported by default, it's fairly straightforward to add your own, custom references!
We're currently concentrating on well-annotated organisms with a good genome and transcriptome assembly and to define a reference set you will need, at least, the following:
  1. A good genome assembly, that is sequences of all chromosomes or major scaffolds. Each chromosome sequence must have a recognisable name.
  2. Genomic annotations for all (known) genes or other transcriptionally active units (e.g. short RNAs). These need to cover at least the following information: a unique identifier for each gene, as well as a name, start and end coordinate, chromosome (where the chromosome name has to match EXACTLY one of the chromosomes above), strand and a category of 'feature type', e.g. 'protein_coding'. The categories can be anything you like, but we recommend following Ensembl nomenclature (see below).
  3. cDNA sequences of all known transcripts (essentially, a transcriptome assembly). There needs to be at least one transcript sequence for each gene in (2). Each transcript sequence must have a unique identifier.
  4. Coordinates of all exons (start, end, chromosome, strand) plus the identifier of the corresponding gene in (2) and transcript in (3). This will establish a link between genes and all transcript variants.
Hopefully, we'll be able to expand support in future for organisms for which not all of this information is available (e.g. if you've got only a transcriptome, but no genome assembly). If you're interested in these kinds of applications or have any suggestions, please support the case for easier reference dataset generation via the feature tracker (Section 'Reporting Bugs and Feature Req..').
Alright, the easiest way to get your hands on the files required is from Ensembl (http://www.ensembl.org) or Ensembl Genomes (http://www.ensemblgenomes.org). So the first step should always be to check whether your favourite organism is already hosted at one of the Ensembl projects, if so, great, otherwise you might have to find another way to create the files we'd usually get from them (please get in touch to discuss).
To get all required files from Ensembl proceed as follows:
  1. Find your organism in one of these lists: Ensembl FTP, Ensembl Bacteria FTP, Ensembl Plants FTP, Ensembl Fungi FTP, Ensembl Metazoa FTP or Ensembl Protists FTP.
  2. Download the DNA sequence first by clicking the 'FASTA (DNA)' link. Select the file ending with 'YOURORGANISMNAME.dna.toplevel.fa.gz'. This is the genome assembly file.
  3. Next, download the cDNA sequence by clicking the 'FASTA (cDNA)' link. Select the file ending with 'YOURORGANISMNAME.cdna.all.fa.gz'. This is the transcriptome assembly file.
  4. The remainder of data will be downloaded via Ensembl's BioMart utility. At the top of the page, click on the 'BioMart' link. Choose the correct 'Genes' database and then your specific organism.
  5. We'll first get the basic gene annotations (that is, (2) from the first list above): Click 'Attributes' on the left, now un-tick the box next to 'Ensembl Transcript ID' and instead tick the boxes: 'Description', 'Chromosome Name', 'Gene Start (bp)', 'Gene End (bp)', 'Strand', 'Associated Gene Name' and 'Gene Biotype'. Also make sure the box for 'Ensembl Gene ID' is ticked! Download the file by clicking the 'Results' button and selecting 'Compressed file (.gz)' and 'TSV', then click the 'Go!' button. Downloading the file should not take very long.
  6. Now we need to get the exon annotations. Switch back to the 'Attributes' panel and choose 'Structures' (instead of 'Features'). Expand the 'Exon' sub-panel and tick the boxes next to: 'Exon Chr Start (bp)' and 'Exon Chr End (bp)', then expand the 'Gene' sub-panel and also tick: 'Ensembl Gene ID', 'Ensembl Transcript ID', 'Chromosome Name' and 'Strand'. Again, download the file as a compressed, tab-delimited text file by clicking 'Results', selecting 'Compressed file (.gz)' and 'TSV' and clicking the 'Go!' button. The download of this file might take slightly longer.
  7. OPTIONALLY, you may download additional annotations, e.g. further external identifiers or whatever you like. Additional annotations can be tab-delimited text files and MUST contain a reference to the gene they are for (using the primary unique ID used earlier, i.e. the Ensembl Gene ID) as one column.
Ok, you've now got all the files required. Now you just need to create a new experiment and upload your files (see Tutorial 'How to Create a GeneProf Exper..' and Tutorial 'Uploading Data to GeneProf').
Once your files are in the experiment, open the workflow designer (this page) from the experiment main page (this page).
You'll need to use three modules to create the reference: First, parse the sequence information from the chromosome and transcript files using the FASTA parser module (module 'FASTA Parser'). Drag and drop two instances of this module (from 'General' > 'Parsers') onto the workflow canvas and wire them up to the respective input files. You don't need to configure any parameters, but you might want to change the output names.
Next, drag and drop an instance of the 'Define a new Reference Set' module (module 'Define a new Reference Set') onto the canvas. Connect (in order) the following inputs: The gene annotations file, the parsed chromosome sequences, the exon coordinates file and, finally, the parsed transcript sequences.
Save the workflow, then click 'Module Parameters' to configure the reference generation:
Define the organism the reference is for by entering its name and selecting the correct match from the drop-down list.
Now you need to tell GeneProf about the structure of the gene annotations file. There will be a little sample of the first few lines of this file in the configuration dialog box. Tick the 'Use first row as header' checkbox and then select the correct types for all columns, i.e. 'Ensembl Gene ID' for the column holding the unique identifier, 'Description' for the description, 'Chromosome', 'Start', 'End' and 'Strand' for chromosome, gene start (bp), gene end (bp) and strand, respectively, 'Feature Type' for gene biotype, and 'Name' for associated gene name.
Finally, we need to do pretty much the same for the exons file. Select 'Feature ID' for the Ensembl Gene ID column, 'Ensembl Transcript ID' for the transcript ID and 'Chromosome', 'Strand', 'Start' and 'End' for chromosome name, strand, exon chr start and exon chr end, respectively.
Lastly, define a sensible name for the output dataset (in the 'Inputs and Outputs' tab). We roughly follow the naming convention 'SOURCE - VERSION - TRIVIAL NAME Genes - ASSEMBLY Assembly', e.g. 'Ensembl 58 Mouse Genes, NCBIM37 Assembly'.
That's it. Execute the workflow and wait until the processing has completed.
You could now start using the reference right away in the same experiment, but we'd recommend you instead finalise this experiment at this stage. That way, you can use the reference in many other experiments and keep everything nice and tidy. Before finalising the experiment (SubConcept 'Finalization and Publication'), annotate the input files with a nice label and description for each input (this page), so you know later on what it's all about.

Bibliography


[1] Ashburner, M and Ball, CA and Blake, JA and Botstein, D and Butler, H and Cherry, JM and Davis, AP and Dolinski, K and Dwight, SS and Eppig, JT and Harris, MA and Hill, DP and Issel-Tarver, L and Kasarskis, A and Lewis, S and Matese, JC and Richardson, JE and Ringwald, M and Rubin, GM and Sherlock, G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. (2000). PMID: 10802651.
[2] Chen, X and Xu, H and Yuan, P and Fang, F and Huss, M and Vega, VB and Wong, E and Orlov, YL and Zhang, W and Jiang, J and Loh, YH and Yeo, HC and Yeo, ZX and Narang, V and Govindarajan, KR and Leong, B and Shahab, A and Ruan, Y and Bourque, G and Sung, WK and Clarke, ND and Wei, CL and Ng, HH. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell (2008). PMID: 18555785.
[3] Marson, A and Levine, SS and Cole, MF and Frampton, GM and Brambrink, T and Johnstone, S and Guenther, MG and Johnston, WK and Wernig, M and Newman, J and Calabrese, JM and Dennis, LM and Volkert, TL and Gupta, S and Love, J and Hannett, N and Sharp, PA and Bartel, DP and Jaenisch, R and Young, RA. Connecting microRNA genes to the core transcriptional regulatory circuitry of embryonic stem cells. Cell (2008). PMID: 18692474.
[4] Zhang, Y and Liu, T and Meyer, CA and Eeckhoute, J and Johnson, DS and Bernstein, BE and Nusbaum, C and Myers, RM and Brown, M and Li, W and Liu, XS. Model-based analysis of ChIP-Seq (MACS). Genome Biol. (2008). PMID: 18798982.
[5] Durinck, S and Bullard, J and Spellman, PT and Dudoit, S. GenomeGraphs: integrated genomic data visualization with R. BMC Bioinformatics (2009). PMID: 19123956.
[6] Langmead, B and Trapnell, C and Pop, M and Salzberg, SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. (2009). PMID: 19261174.
[7] Trapnell, C and Pachter, L and Salzberg, SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (2009). PMID: 19289445.
[8] Flicek, P and Aken, BL and Ballester, B and Beal, K and Bragin, E and Brent, S and Chen, Y and Clapham, P and Coates, G and Fairley, S and Fitzgerald, S and Fernandez-Banet, J and Gordon, L and Gräf, S and Haider, S and Hammond, M and Howe, K and Jenkinson, A and Johnson, N and Kähäri, A and Keefe, D and Keenan, S and Kinsella, R and Kokocinski, F and Koscielny, G and Kulesha, E and Lawson, D and Longden, I and Massingham, T and McLaren, W and Megy, K and Overduin, B and Pritchard, B and Rios, D and Ruffier, M and Schuster, M and Slater, G and Smedley, D and Spudich, G and Tang, YA and Trevanion, S and Vilella, A and Vogel, J and White, S and Wilder, SP and Zadissa, A and Birney, E and Cunningham, F and Dunham, I and Durbin, R and Fernández-Suarez, XM and Herrero, J and Hubbard, TJ and Parker, A and Proctor, G and Smith, J and Searle, SM. Ensembl's 10th year. Nucleic Acids Res. (2010). PMID: 19906699.
[9] Ouyang, Z and Zhou, Q and Wong, WH. ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells. Proc. Natl. Acad. Sci. U.S.A. (2009). PMID: 19995984.
[10] Guttman, M and Garber, M and Levin, JZ and Donaghey, J and Robinson, J and Adiconis, X and Fan, L and Koziol, MJ and Gnirke, A and Nusbaum, C and Rinn, JL and Lander, ES and Regev, A. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. (2010). PMID: 20436462.
[11] Gregg, C and Zhang, J and Weissbourd, B and Luo, S and Schroth, GP and Haig, D and Dulac, C. High-resolution analysis of parent-of-origin allelic expression in the mouse brain. Science (2010). PMID: 20616232.
[12] Anders, S and Huber, W. Differential expression analysis for sequence count data. Genome Biol. (2010). PMID: 20979621.
[13] Flicek, P and Amode, MR and Barrell, D and Beal, K and Brent, S and Chen, Y and Clapham, P and Coates, G and Fairley, S and Fitzgerald, S and Gordon, L and Hendrix, M and Hourlier, T and Johnson, N and Kähäri, A and Keefe, D and Keenan, S and Kinsella, R and Kokocinski, F and Kulesha, E and Larsson, P and Longden, I and McLaren, W and Overduin, B and Pritchard, B and Riat, HS and Rios, D and Ritchie, GR and Ruffier, M and Schuster, M and Sobral, D and Spudich, G and Tang, YA and Trevanion, S and Vandrovcova, J and Vilella, AJ and White, S and Wilder, SP and Zadissa, A and Zamora, J and Aken, BL and Birney, E and Cunningham, F and Dunham, I and Durbin, R and Fernández-Suarez, XM and Herrero, J and Hubbard, TJ and Parker, A and Proctor, G and Vogel, J and Searle, SM. Ensembl 2011. Nucleic Acids Res. (2011). PMID: 21045057.
[14] Leinonen, R and Sugawara, H and Shumway, M. The sequence read archive. Nucleic Acids Res. (2011). PMID: 21062823.