GeneProf Manual


Florian Halbritter, 27-Nov-2012

Table of Contents

Chapter: Introduction

Chapter: Concepts Explained

Chapter: Tutorials

Chapter: Modules

Chapter: Pages

Chapter: Advanced Topics

Chapter: Frequently Asked Questions (FAQ)

Advanced Topics


Installing GeneProf Locally

Considering setting up a local installation of GeneProf? If you are experiencing long waiting times or are dealing with large quantities of data, installing a copy of GeneProf on a server (or a cluster of servers) in your local department might be a solution worth considering. We provide the GeneProf web application free of charge, but, naturally, our resources are limited and we might not be able to satisfy your requirements adequately.
Also, you might be interested in developing additional GeneProf-compatible modules and/or integrating your existing stand-alone software into GeneProf's framework (cp. Section 'Module Development: Adding new..'). In this case, you might be well-advised to install a local version of the software to test-drive your new code before submitting it for inclusion in the live version of GeneProf.
Before we proceed, a word of warning: Setting up a local copy of GeneProf is not exactly an easy process and will require a fair level of IT expertise. We are currently working on packaging the required environment as a distributable disk image, so the installation process should hopefully become much simpler in the future.
Note: We recommend a Linux operating system and the instructions below are tailored accordingly. We have currently not tested this software on any other system and cannot provide any support.

Overview of GeneProf Components

The entire GeneProf system consists of three (-ish) components:
  - the GeneProf web application,
  - the GeneProf databases, and
  - one or more GeneProf job agencies (processing nodes).
In theory, all components may be installed on just one computer, but since we're dealing with large-scale genomics data, we recommend having at least one computer dedicated to running the web application and databases and at least one separate processing node ('job agency'; how many you actually need depends entirely on your data volume).

Setting up the Environment

In order to set up a working copy of GeneProf you will first need to set up quite a few external programs, which are not currently shipped directly with GeneProf.
Please check the list below and set up all required software according to the software providers' instructions. Where possible, please try to use the versions stated in this overview: avoid using newer versions if the old one is still available and, most importantly, never use older versions than the ones recommended below.
Note: Only Java, a Java web container and a database system are strictly required components; R, ImageMagick, TeX Live, GraphViz and other external components make additional functionality available and are highly recommended.
On the computer dedicated to running the GeneProf Web Application you will need:
  - Java 1.6+ (Sun/Oracle VM recommended)
  - Apache Tomcat 6+ (or a compatible Java application container)
  - MySQL Connector 5.1.12 (or an alternative connector for another database)
  - R 2.12+
  - ImageMagick 6.5.7-8+
  - TeX Live, latest version (full install)
  - GraphViz / dot, latest version
On the computer(s) dedicated to running the GeneProf Databases you will need:
  - MySQL 5.1+ (or a compatible RDBMS)
  - MySQL Connector 5.1.12 (or an alternative connector for another database)
  - any FTP server (optional; alternatively, share a network folder)
On the computer(s) dedicated to running the GeneProf Job Agencies you will need:
  - MySQL Connector 5.1.12 (or an alternative connector for another database)
  - R 2.12+
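Once installed, a quick sanity check from the shell confirms that the core external tools are available (standard version flags; adjust the calls if the tools are not on your PATH):

check-environment.sh.code:


$ java -version        # expect 1.6+
$ R --version          # expect 2.12+
$ convert -version     # ImageMagick
$ pdflatex --version   # part of TeX Live
$ dot -V               # GraphViz
$ mysql --version      # MySQL client (on the database computer)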
Apart from the external components listed above, GeneProf makes use of a number of great third-party libraries which are included in the GeneProf downloads. Please check the terms and conditions page for a full list.

Installing the GeneProf Databases

Having set up all the required and recommended software, we can install the actual GeneProf system. Let's start by installing the GeneProf databases.

Summary:

  1. Make sure your RDBMS (e.g. MySQL) is running.
  2. Download the latest version of the GeneProf Database Skeleton from the downloads section onto your designated database computer.
  3. Extract the archive into a folder which will become the home of your GeneProf data (let's call this folder DBHOME), like this: $ tar -xvzf GeneProfDatabaseSkeleton.tar.gz
  4. Modify the setup script with your database credentials: edit the file setupGeneProfDatabase.sh and fill in your details for the variables MYSQLUSER, MYSQLPWD and MYSQLHOST (see the sketch after this list).
  5. Execute the setup script: $ sh setupGeneProfDatabase.sh
  6. You can now delete all temporary files.
  7. Now also set up an FTP server. The FTP server needs to provide access to a drive with lots of storage space. No other setup is required. Take a note of the FTP home folder FTPHOME (as seen by the user after login) and your login credentials.
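For step 4, the top of setupGeneProfDatabase.sh might then look like the following sketch (the variable names are those mentioned above; all values are placeholders for your own credentials):

setupGeneProfDatabase.sh.code:


MYSQLUSER=geneprof
MYSQLPWD=mysecretpassword
MYSQLHOST=localhost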

Installing the GeneProf Web App

Next, we can set up the GeneProf web application:

Summary:

  1. Shut down Tomcat (or whichever web app container you are using).
  2. Download the latest version of the GeneProf web application from the downloads section onto your designated web application computer.
  3. Extract the archive into your Tomcat web root folder (usually tomcat/webapps/), like this: $ tar -xvzf GeneProfWeb.tar.gz
  4. Open and edit the file GeneProf/META-INF/context.xml, filling in the correct parameters for your database and external software setup and any other configuration parameters you might want to set (a rough sketch follows below this list).
  5. Launch your Tomcat server.
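To illustrate step 4, a pared-down context.xml might look roughly like the sketch below. The jdbc/geneprof resource name and the parameter names are taken from the configuration reference in Section 'Configuration Parameters'; the concrete values, and the assumption that parameters are set via standard Tomcat <Resource> and <Parameter> elements, are illustrative only -- always start from the template shipped with the release:

context.xml.code:


<Context>
    <!-- JNDI resource for the database connection: -->
    <Resource name="jdbc/geneprof" auth="Container"
              type="javax.sql.DataSource"
              driverClassName="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/gp?autoReconnect=true"
              username="geneprof" password="mysecretpassword"/>
    <!-- further GeneProf configuration parameters (see the reference below): -->
    <Parameter name="FTP_HOST" value="ftp.example.org"/>
    <Parameter name="FTP_USER" value="geneprof"/>
    <Parameter name="FTP_PWD" value="mysecretpassword"/>
    <Parameter name="FTP_ROOT_FOLDER" value="/home/geneprof/gpftp/"/>
    <Parameter name="HOME_URL" value="http://geneprof.example.org"/>
</Context>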

Installing GeneProf Job Agencies

Finally, we can set up the GeneProf job agencies. Repeat these steps on each computer you want to use as a processing node:

Summary:

  1. Download the latest version of the GeneProf job agency from the downloads section onto the designated processing node.
  2. Extract the archive into a local folder, in which all your GeneProf files will reside, like this: $ tar -xvzf GeneProfJobAgency.tar.gz
  3. Open and edit the file GeneProfJobAgency/gp.cfg, filling in the correct parameters for your database and external software setup and any other configuration parameters you might want to set.
  4. Open and edit the file GeneProfJobAgency/startGeneProf.sh updating the parameter GP_DL_URL to point to your local GeneProf instance.
  5. Make sure you're allowed to 'run' the startup script: $ chmod u+x startGeneProf.sh
  6. Start the job agency: $ nohup ./startGeneProf.sh &
  7. Optional: You may want to add the job agency startup script to the list of applications automatically started at system startup (one option is sketched below).
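For step 7, one simple option is a cron @reboot entry for the user account running the job agency (an illustrative sketch only; the installation path is a placeholder and your system may offer cleaner init/service mechanisms):

crontab.code:


@reboot cd /path/to/GeneProfJobAgency && nohup ./startGeneProf.sh &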

Configuration Parameters

Here's a list of all config parameters that may or must be set, either in the gp.cfg file (job agencies) or in the context.xml file (web server). Each entry gives the parameter name and description and, in brackets, the default value and whether the parameter is required by the web app and/or the job agencies ('-' = not applicable):
jdbc/geneprof: JNDI resource for the database connection. [default: (not set) | web app: required | job agencies: -]
DB_HOST: Complete host URL for the RDBMS, e.g. jdbc:mysql://localhost:3306/gp?autoReconnect=true. [default: (not set) | web app: - | job agencies: required]
DB_USER: Fully privileged user for access to the RDBMS. [default: (not set) | web app: - | job agencies: required]
DB_PWD: Password to go with DB_USER. [default: (not set) | web app: - | job agencies: required]
FTP_HOST: URL of the FTP server, e.g. myftp.myurl.com. [default: (not set) | web app: required | job agencies: required]
FTP_PORT: Port that the FTP server can be reached on. [default: 21 | web app: required | job agencies: required]
FTP_USER: Fully privileged user for access to the FTP server. [default: (not set) | web app: required | job agencies: required]
FTP_PWD: Password to go with FTP_USER. [default: (not set) | web app: required | job agencies: required]
FTP_ROOT_FOLDER: The root folder for GeneProf data as visible after logging in with the given credentials, e.g. /home/geneprof/gpftp/. [default: (not set) | web app: required | job agencies: required]
GENEPROF_TMP_FOLDER: Folder used to store temporary files in. [default: ./geneprof_tmp_files/ | web app: required | job agencies: required]
HOME_URL: Home URL of the GeneProf web app in non-encrypted mode. [default: http://www.geneprof.org | web app: required | job agencies: required]
SECURE_URL: Home URL of the GeneProf web app in encrypted mode. [default: https://www.geneprof.org | web app: required | job agencies: -]
SECURE_IP_RANGE: Restrict access to the administrator section to connections coming from either localhost or IP addresses starting with this string (an empty string matches all IPs!). [default: (empty) | web app: optional | job agencies: -]
WEB_API_TRUSTED_CLIENT: Accept WebAPI connections from these IP addresses without requiring an API key and allowing access to all data. Accepts a comma-separated list of IP addresses. Use with care. [default: (not set)]
EMAIL_FROM_HOST: Mail server that emails sent out from GeneProf are marked to originate from, e.g. mail.mygeneprof.com. [default: (not set) | web app: required | job agencies: required]
EMAIL_VIA_SMTP: Outgoing email SMTP server, e.g. smtp.myemailserver.com. [default: (not set) | web app: required | job agencies: required]
NOREPLY_EMAIL_ADDRESS: No-reply email address used for sending out emails from GeneProf. [default: (not set) | web app: required | job agencies: required]
WEB_API_KEY_REQUIRED: If set to false, access to public data via the WebAPI does not require an access key. [default: false | web app: required | job agencies: -]
MAX_GENE_SEARCH_RESULTS: Maximum number of results to display in gene search queries. [default: 50 | web app: required | job agencies: -]
MAX_N_GBROWSER_TRACKS: Maximum number of genome browser tracks supported in the same plot. [default: 60 | web app: required | job agencies: -]
REDIRECT_SERVLET: The mapping/name of the servlet forwarding any sort of accession ID to its respective entry page (default implementation: org.stembio.geneprof.servlets.RedirectServlet). [default: show | web app: required | job agencies: -]
REBUILD_PDF_MANUAL: Automatically rebuild the PDF version of the GeneProf manual at server launch. [default: false | web app: required | job agencies: -]
REBUILD_HTML_MANUAL: Automatically rebuild the HTML/JSP version of the GeneProf manual at server launch. [default: false | web app: required | job agencies: -]
PRECALCULATE_OVERVIEW_PAGES: Prepare data for public experiments overview pages on server startup. [default: false | web app: required | job agencies: -]
PLOT_SUPPORT_PNG: If set to true, the web app will assume it has been set up so it can use the Cairo PNG output device in R. [default: false | web app: required | job agencies: -]
PLOT_SUPPORT_PDF: If set to true, the web app will assume it has been set up so it can use the pdf(..) output device in R. [default: false | web app: optional | job agencies: -]
PLOT_SUPPORT_JPEG: If set to true, the web app will assume it has been set up so it can use the Cairo JPEG output device in R. [default: false | web app: optional | job agencies: -]
PLOT_SUPPORT_BMP: If set to true, the web app will assume it has been set up so it can use the bmp(..) output device in R. [default: false | web app: optional | job agencies: -]
PLOT_SUPPORT_PS: If set to true, the web app will assume it has been set up so it can use the Cairo PostScript output device in R. [default: false | web app: optional | job agencies: -]
PLOT_SUPPORT_SVG: If set to true, the web app will assume it has been set up so it can use the svg(..) output device in R. [default: false | web app: optional | job agencies: -]
PLOT_SUPPORT_TIFF: If set to true, the web app will assume it has been set up so it can use the Cairo TIFF output device in R. [default: false | web app: optional | job agencies: -]
OS: The operating system the software is running on. Scarcely used at the moment. The only valid values are Unix and Windows. [default: Unix | web app: required | job agencies: required]
CPUS: The number of CPU cores usable by GeneProf (used for multi-threading). For job agencies, this is the number PER WORKER. [default: 1 | web app: required | job agencies: required]
QUEUE_RELOAD_INTERVAL: Refresh interval for the job queue (in seconds). [default: 60 | web app: required | job agencies: -]
DEFAULT_DISK_QUOTA: Default disk quota allocated to new users (in megabytes). [default: 25000 | web app: required | job agencies: required]
MIN_FREE_QUOTA_FOR_UPLOAD: Minimum free disk quota to allow any sort of uploads (in megabytes). [default: 50 | web app: required | job agencies: -]
HARD_DISK_QUOTA_EXTRA: A percentage added on top of the user quota. As long as the 'hard limit' is not yet exceeded, the user may still run analyses, but not upload any more data. [default: 0.25 | web app: required | job agencies: required]
R: The command used to start up R on the system (may be an absolute path). [default: R | web app: required | job agencies: required]
GRAPHVIZ_LAYOUTER: The command used for rendering GraphViz plots (e.g. for the schematic workflow images). [default: dot | web app: required | job agencies: -]
SH_CMD: The command used for the shell command interpreter. [default: /bin/sh | web app: required | job agencies: required]
TAR: The command used for creating and extracting TAR archives. [default: tar | web app: required | job agencies: -]
UNZIP: The command used for extracting ZIP archives. [default: unzip | web app: required | job agencies: -]
AGENCY_ID: The name/ID assigned to this job agency. Important: job agency IDs MUST be unique! [default: a dynamic ID based on the system time at startup | web app: - | job agencies: required]
NUMBER_OF_WORKERS: The number of separate worker processes managed by this job agency. [default: 1 | web app: - | job agencies: required]
JOB_UPDATE_FREQUENCY: The refresh interval between queries for new jobs (in milliseconds). [default: 10000 | web app: - | job agencies: required]
ALIVE_SIGNAL_FREQUENCY: The frequency with which 'alive' signals are reported back to the server (in multiples of JOB_UPDATE_FREQUENCY). [default: 8 | web app: - | job agencies: required]
JAVA_EXECUTABLE: The command used to execute Java. [default: java | web app: - | job agencies: required]
SUBPROCESS_JVM_XMS: Minimum memory allocation per worker thread. [default: 1024m | web app: - | job agencies: required]
SUBPROCESS_JVM_XMX: Maximum memory allocation per worker thread. [default: 1024m | web app: - | job agencies: required]
WORKER_JAR: The location of the GeneProf worker thread JAR file relative to the job agency home folder. [default: dist/gp/GeneProfWorker.jar | web app: - | job agencies: required]
PERL: The command used to execute Perl scripts. [default: /usr/bin/perl | web app: - | job agencies: optional]
PYTHON: The command used to execute Python (2.7) scripts. [default: (not set) | web app: - | job agencies: optional]
CANVAS_X: Default width of the workflow designer canvas (in pixels). [default: 5000 | web app: required / cosmetic | job agencies: -]
CANVAS_Y: Default height of the workflow designer canvas (in pixels). [default: 5000 | web app: required / cosmetic | job agencies: -]
DEFAULT_X: Default X coordinate for new components in the workflow designer (in pixels). [default: 16 | web app: required / cosmetic | job agencies: -]
DEFAULT_Y: Default Y coordinate for new components in the workflow designer (in pixels). [default: 16 | web app: required / cosmetic | job agencies: -]
DEFAULT_W: Default width for new components in the workflow designer (in pixels). [default: 97 | web app: required / cosmetic | job agencies: -]
DEFAULT_H: Default height for new components in the workflow designer (in pixels). [default: 56 | web app: required / cosmetic | job agencies: -]
DEFAULT_SPACING: Default spacing between components in the workflow designer (in pixels). [default: 56 | web app: required / cosmetic | job agencies: -]
RIGID_ID_PATTERN_EXPERIMENT: (undocumented) [default: (not set)]
RIGID_ID_PATTERN_DATASET: (undocumented) [default: (not set)]
RIGID_ID_PATTERN_FEATURE: (undocumented) [default: (not set)]
DATE_FORMAT: (undocumented) [default: (not set)]
DATE_TIME_FORMAT: (undocumented) [default: (not set)]
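By way of illustration, a minimal gp.cfg for a job agency might contain entries like the following (a sketch assuming a simple key=value format; all values are placeholders, so check the template shipped with the job agency for the exact syntax):

gp.cfg.code:


DB_HOST=jdbc:mysql://dbserver.example.org:3306/gp?autoReconnect=true
DB_USER=geneprof
DB_PWD=mysecretpassword
FTP_HOST=ftp.example.org
FTP_PORT=21
FTP_USER=geneprof
FTP_PWD=mysecretpassword
FTP_ROOT_FOLDER=/home/geneprof/gpftp/
HOME_URL=http://geneprof.example.org
AGENCY_ID=agency-01
NUMBER_OF_WORKERS=2
CPUS=4
JAVA_EXECUTABLE=java
R=/usr/bin/R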
The following config parameters need to be set for specific workflow modules by specifying them in the gp.cfg file of all job agencies:
module 'BEDTools: intersectBed' -- BEDTOOLS_FOLDER: The folder containing the BEDTools installation.
module 'Find Peaks with CCAT' -- CCAT: Full path to the CCAT executable.
module 'Find Peaks with ChIPSeqPeakFin..' -- CHIPSEQ_PEAK_FINDER: Full path to the ChIPseqPeakFinder executable.
module 'Find Peaks with MACS' -- MACS14_LIB: Full path of the MACS library folder.
module 'Find Peaks with MACS' -- MACS14_CMD: Full path to the MACS 1.4 executable.
module 'Find Peaks with MACS' -- PYTHON: Full path to the Python 2.7 executable.
module 'MEME Motif Discovery' -- MEME481_CMD: Full path to the MEME 4.8.1 executable.
module 'Find Peaks with SISSRs v1.4' -- SISSRS_V14_EXE: Full path to the SISSRs 1.4 executable.
module 'FASTX Toolkit: Artifacts Filter' -- FASTXTOOLKIT_FOLDER: Full path to the folder containing the FASTX-Toolkit executables.
module 'FASTX Toolkit: Clip Adapter Se..' -- FASTXTOOLKIT_FOLDER: Full path to the folder containing the FASTX-Toolkit executables.
module 'FASTX Toolkit: Reverse Complem..' -- FASTXTOOLKIT_FOLDER: Full path to the folder containing the FASTX-Toolkit executables.
module 'Align against DNA with Bowtie ..' -- BOWTIE_INDEX_FOLDER: Full path to the folder containing the Bowtie index files.
module 'Align against DNA with Bowtie ..' -- BOWTIE_FOLDER: Full path to the folder containing the Bowtie executables.
module 'Align against cDNA with Bowtie..' -- BOWTIE_INDEX_FOLDER: Full path to the folder containing the Bowtie index files.
module 'Align against cDNA with Bowtie..' -- BOWTIE_FOLDER: Full path to the folder containing the Bowtie executables.
module 'Align against Sequences with B..' -- BOWTIE_INDEX_FOLDER: Full path to the folder containing the Bowtie index files.
module 'Align against Sequences with B..' -- BOWTIE_FOLDER: Full path to the folder containing the Bowtie executables.
module 'TopHat 1.2 Alignment' -- BOWTIE_INDEX_FOLDER: Full path to the folder containing the Bowtie index files.
module 'TopHat 1.2 Alignment' -- BOWTIE_FOLDER: Full path to the folder containing the Bowtie executables.
module 'TopHat 1.2 Alignment' -- SAMTOOLS_FOLDER: Full path to the folder containing the SAMTools executables.
module 'TopHat 1.2 Alignment' -- TOPHAT120_EXECUTABLE: Full path to the TopHat 1.20 executable.
module 'Quality Control + Bowtie Itera..' -- BOWTIE_FOLDER: Full path to the folder containing the Bowtie executables.
module 'Quality Control + Bowtie Itera..' -- BOWTIE_INDEX_FOLDER: Full path to the folder containing the Bowtie index files.
module 'Quality Control + Bowtie Align..' -- BOWTIE_FOLDER: Full path to the folder containing the Bowtie executables.
module 'Quality Control + Bowtie Align..' -- BOWTIE_INDEX_FOLDER: Full path to the folder containing the Bowtie index files.
module 'Quality Control + Tophat 1.20 ..' -- BOWTIE_FOLDER: Full path to the folder containing the Bowtie executables.
module 'Quality Control + Tophat 1.20 ..' -- BOWTIE_INDEX_FOLDER: Full path to the folder containing the Bowtie index files.
module 'Quality Control + Tophat 1.20 ..' -- SAMTOOLS_FOLDER: Full path to the folder containing the SAMTools executables.
module 'Quality Control + Tophat 1.20 ..' -- TOPHAT120_EXECUTABLE: Full path to the TopHat 1.20 executable.
module 'MACS + Gene Association + Stat..' -- MACS14_LIB: Full path of the MACS library folder.
module 'MACS + Gene Association + Stat..' -- MACS14_CMD: Full path to the MACS 1.4 executable.
module 'MACS + Gene Association + Stat..' -- PYTHON: Full path to the Python 2.7 executable.
module 'SRA File Parser' -- SRA_TOOLS_2_3_3_3: Full path to the folder containing the SRA Toolkit executables.
module 'DESeq' -- RLIB_DESEQ: A boolean flag indicating that the DESeq library is installed.
module 'DESeq' -- R: Full path to the R executable.
module 'DESeq (for Region Data)' -- RLIB_DESEQ: A boolean flag indicating that the DESeq library is installed.
module 'DESeq (for Region Data)' -- R: Full path to the R executable.
module 'EdgeR' -- RLIB_EDGER: A boolean flag indicating that the edgeR library is installed.
module 'EdgeR' -- R: Full path to the R executable.
module 'EdgeR (for Region Data)' -- RLIB_EDGER: A boolean flag indicating that the edgeR library is installed.
module 'EdgeR (for Region Data)' -- R: Full path to the R executable.
module 'GOSeq Enrichment Analysis' -- GO_ORG_RN_EG_DB: A boolean flag indicating that the org.Rn.eg.db library is installed.
module 'GOSeq Enrichment Analysis' -- GO_ORG_MM_EG_DB: A boolean flag indicating that the org.Mm.eg.db library is installed.
module 'GOSeq Enrichment Analysis' -- GO_ORG_HS_EG_DB: A boolean flag indicating that the org.Hs.eg.db library is installed.
module 'GOSeq Enrichment Analysis' -- R: Full path to the R executable.
module 'GOSeq Enrichment Analysis' -- GO_ORG_DR_EG_DB: A boolean flag indicating that the org.Dr.eg.db library is installed.
module 'GOSeq Enrichment Analysis' -- GOSEQ: A boolean flag indicating that the goseq library is installed.
module 'GOSeq Enrichment Analysis' -- GODB: A boolean flag indicating that the GO.db library is installed.
module 'GOSeq Enrichment Analysis' -- GO_ORG_DM_EG_DB: A boolean flag indicating that the org.Dm.eg.db library is installed.
module 'Quantile Normalization' -- R: Full path to the R executable.
module 'Quantile Normalization' -- R_PREPROCESSCORE: A boolean flag indicating that the preprocessCore library is installed.
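For example, to enable the MACS peak-finding modules on a job agency, the corresponding gp.cfg entries might look like this (all paths are placeholders for your local installation):

gp.cfg.code:


MACS14_LIB=/opt/MACS-1.4/lib
MACS14_CMD=/opt/MACS-1.4/bin/macs14
PYTHON=/usr/bin/python2.7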

Managing GeneProf as an Administrator

Dealing with Publication Requests

Publication of experiments in GeneProf requires approval by the administrators (see SubConcept 'Finalization and Publication'). This is our way of ensuring that only high-quality data is added to the public section of the GeneProf databases.
When a user issues a request to make an experiment public, all administrators should automatically receive an email notification. All experiments whose publication is pending are also listed on the main admin page.
The administrators should open the experiment and check that the sample annotations are satisfactory (in terms of completeness and style) and that the experimental workflow is formally correct (we do not evaluate the scientific soundness of the experiment, which is the job of the reviewers of published articles; rather, we only check that there are no formal mistakes in the way GeneProf has been used for the analysis).
If there are any flaws, the administrators may get in touch with the owner of the experiment and (temporarily) reject the publication of the experiment.
After any mistakes have been rectified, an administrator can decide to accept the publication of the experiment. The experiment is then marked as public and can be accessed by all users of GeneProf. The sample annotation is likewise transferred to the database table storing information for all public samples. However, the data from the experiment will not automatically be added to the public summary pages, nor will any genome browser tracks be added to the public repository -- a selection of high-quality datasets for both sections may be chosen by the administrators, as outlined in the next section.

Adding Data to the Public Summary Pages

As discussed above, making an experiment public does not automatically warrant inclusion of the experiment's data in the public summary pages (such as the gene summary page or the VDE); rather, GeneProf gives the administrator a further level of more fine-grained control over which data will appear in these pages.
Two types of data may be added to the public data assortment: feature datasets (gene expression values as raw read counts, RPM or RPKM, transcription factor association scores (TFAS) and binary binding-protein to target-gene assignments) and genomic region datasets as genome browser tracks. In both cases, the procedure for selecting datasets is very similar: The administrator imports all relevant datasets from the published experiment into one of the administrator's own experiments and uses one of the admin-only modules to include the data in the public repository:

Adding Public Feature Data

Adding public gene-centric data:

  1. Create a new experiment or re-use a previously created experiment for admin functionality.
  2. Import all relevant feature datasets from the public experiment into this experiment. For RNA-seq experiments, the 'relevant dataset' would be the first one containing expression values for all samples studied. For ChIP-seq experiments, there would usually be two relevant datasets: one containing binary protein-to-target-gene assignments and one containing TFAS scores for all studied proteins.
  3. Open the workflow designer and drag & drop a Make Annotations Public module onto the workflow canvas.
  4. Connect the imported datasets (all from the same source experiment) to the new workflow module and save the workflow.
  5. Now configure the module. The config dialog will list all columns from all input datasets and ask you which you would like to include in the public repository. To include a column, select the correct data type, i.e. [RC] for raw read counts, [RPM] for RPM-transformed counts, [RPKM] for RPKM-transformed counts, TFAS for TFAS scores, Has TFBS? for binary protein-to-target assignments and Expression for other expression values, e.g. from microarrays. For each column, select the sample annotation matching the data from the dropdown list on the right (if the column names match the sample names, they will be selected by default). Lastly, you may change the labels for the columns used.
  6. Config hint: To automatically prepend a fixed string to the titles of all columns, enter the string to be prepended into the text box at the very bottom and double-click it. To change the sample annotation of all columns, select it from the select box at the very bottom of the page.
  7. After configuring the module, save and execute the workflow. Execution should only take a few seconds, unless you have included RPM values in the annotation, in which case cross-correlations between all genes have to be recalculated, which may take a few minutes.

Adding Public Genome Browser Tracks

Adding public genome browser tracks:

  1. Create a new experiment or re-use a previously created experiment for admin functionality.
  2. Import all relevant genomic region datasets from the public experiment into this experiment. For RNA-seq and ChIP-seq experiments alike these would usually be the datasets coming straight out of the alignment modules (usually Bowtie or Tophat).
  3. Open the workflow designer and drag & drop a Make Tracks Public module onto the workflow canvas.
  4. Connect the imported datasets (all from the same source experiment) to the new workflow module and save the workflow.
  5. Now configure the module. The config dialog will list all input datasets and let you choose the track type and other settings to be used for each dataset. We use AllWIG as the track type for RNA-seq alignments and DistinctWIG for ChIP-seq alignments. Next, select a category for each track and enter an 'extension' (fragment length) for each track: keep it at -1 for RNA-seq data and use a consistent value (e.g. 200) for ChIP-seq data, unless more accurate information about the fragment length is available. For each dataset, select the sample annotation matching the data from the dropdown list on the right (if the dataset names match the sample names, they will be selected by default). Lastly, you may change the labels for the genome browser tracks.
  6. Config hint: To quickly change the values for all datasets enter or select the correct value from the text or select box at the very bottom and double-click it.
  7. After configuring the module, save and execute the workflow. Execution should only take a few seconds.

Adding Public Gene/Genome References

New reference datasets created in any experiment may later be added to the list of GeneProf-recommended reference datasets by using the Make Reference Public admin-only module.

Making a reference dataset 'GeneProf-recommended':

  1. Create a new experiment or re-use a previously created experiment for admin functionality.
  2. Import the reference dataset.
  3. Open the workflow designer and drag & drop a Make Reference Public module onto the workflow canvas.
  4. Connect the imported dataset to the new workflow module and save the workflow.
  5. Now configure the module. All you need to do at this stage is to decide on an identifier to be used for the public reference dataset. GeneProf will automatically prefix the ID you enter with gpDS_pub_!
  6. After configuring the module, save and execute the workflow.
  7. Important: Any existing reference dataset with the same ID will be overwritten without further confirmation, so be careful!
  8. Optional: In order to display a custom image for the reference dataset in workflows or other pages using it, create a PNG image called org_tXYZ.png (where XYZ is the taxon ID of the reference) and place the file into the imgs folder of the web app. The image should be approx. 75x70 px in width and height.

Restoring Serialised Experiments

Experiments that have been serialised to a binary file using the Export Experiment module can be restored into a working GeneProf experiment using the admin-only Import Experiment module. For this to work, no experiment with the same accession number may exist yet and all external datasets that the experiment depends on must exist prior to import (that is, reference datasets and datasets imported from other experiments).

Importing a serialised experiment:

  1. Make sure the serialised experiment file to be imported (a .gpx file) has been placed into the experiment-io subfolder of the FTP server.
  2. Create a new experiment or re-use a previously created experiment for admin functionality.
  3. Open the workflow designer and drag & drop an Import Experiment module onto the workflow canvas.
  4. Save the workflow, then open the config dialog for the new module.
  5. Select the file to be imported. If all dependencies exist and the experiment doesn't exist yet, you can now click OK. Otherwise, please resolve all dependencies first.
  6. After configuring the module, save and execute the workflow.

Other Administrator Functionality

Further administrator functionality can be accessed from the administrator main page (itself accessible via a link in the navigation bar, but only to users with an administrator role and only from a range of IP addresses deemed secure, see SECURE_IP_RANGE in Section 'Installing GeneProf Locally').

Managing User Disk Quota

After registering for a new GeneProf account, each new user is automatically allocated a certain amount of disk space on the GeneProf servers (the default may be changed by altering the value of the DEFAULT_DISK_QUOTA parameter in the context.xml).
Administrators may change the quota allocation later on via the 'User Disk Quota Usage' category of the admin page. Just find the user to be updated in the page and enter the new allocation (in megabytes) into the text box next to the user name, then click the update button.

Updating the Manual

There are two ways to update the online version of the manual (and the PDF version) after making changes to the manual source code:
All files can be set to be automatically rebuilt at launch of the web app by setting the configuration parameters REBUILD_HTML_MANUAL and/or REBUILD_PDF_MANUAL (in the context.xml) to true. Alternatively, the files can be rebuilt and permanently replaced via the admin section, under the category 'Maintenance', by clicking the link 'Build Manual'.
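In terms of the context.xml sketched in Subsection 'Installing the GeneProf Web App', the first option corresponds to parameter entries like these:

context.xml.code:


<Parameter name="REBUILD_HTML_MANUAL" value="true"/>
<Parameter name="REBUILD_PDF_MANUAL" value="true"/>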

Web API: Retrieving Data from GeneProf

GeneProf allows programmatic access to a large part of the data stored in its databases via a RESTful web API. This component of GeneProf is now documented separately at: GeneProf Web Services Documentation.

Web API: Advanced Examples

Advanced usage examples for the RESTful web API are likewise provided in the separate documentation: GeneProf Web Services Documentation.

Module Development: Adding new Functionality to GeneProf

Thinking of adding your own algorithm or program to GeneProf? That's great -- and really shouldn't be all that difficult to do, either!
So, as you're surely aware by now, most of GeneProf's analysis components are implemented as so-called workflow 'modules' (cp. SubConcept 'Workflow Modules'). These are (usually rather small) programs that receive user-selected input data, do some sort of magic with these data and write a defined set of outputs back to the GeneProf workflow environment. From a more technical point of view, each GeneProf module is a program implemented in the Java programming language (don't worry, though: if you're not familiar with Java, you can quite easily wrap any sort of commandline program from within it). Each module is a class inheriting from the ComponentHandler class. To create a new module, only a few abstract methods need to be implemented, most notably the execute(..) method. This is where all the actual action happens.
Let's have a look at the ComponentHandler class in more detail:

ComponentHandler.java.code:


package org.stembio.geneprof.workflow;

public abstract class ComponentHandler implements Serializable,
                                                 Comparable<ComponentHandler> {


    public ComponentHandler( ComponentHandlerInfo info,
                             InputPort[] inputs,
                             OutputPort[] outputs,
                             Parameter[] params ) {
        [..] // set parameters
    }

    public abstract void execute(Experiment exp,
                 Map<String, DatasetHandler[]> inp,
                 Map<String, DatasetHandler> outp,
                 Map<String, String> parameters)
                          throws Exception; // (throws clause abbreviated here)

    [.. implemented methods omitted ..]

}
There's only one method that needs to be implemented, execute(..), which runs the actual analysis code. You will also have to pass a few arguments to the constructor of the superclass to identify and describe the module (such as a name and description for the module and, importantly, the definition of inputs, outputs and parameters). Additionally, there are a few more methods that you might wish to override to further customize your module, but more about that later.
Alright, the remainder of this chapter is structured as follows: First, we will explain how to set up a development environment for programming new modules and how to install all required software and dependencies (cp. Subsection 'Setting up a Development Envir..'). We'll then dive straight into a simple example showing, in code, how things might be done (cp. Subsection 'Getting started: A Simple Exam..'). We continue by explaining common design principles, strategies and pitfalls you might encounter when writing a GeneProf module (cp. Subsection 'Tips, Tricks and Pitfalls'). Lastly, we provide the JavaDoc documentation of all GeneProf classes (cp. Subsection 'Source Code & JavaDoc') and instructions detailing how to go about getting your work into the public GeneProf instance (cp. Subsection 'Getting your Modules into Gene..').

Setting up a Development Environment

In order to test the code you will write, it will be necessary for you to run your new module prior to submitting it to the GeneProf team (cp. Subsection 'Getting your Modules into Gene..'). There are two ways to go about setting up an environment for running GeneProf modules:
Installing GeneProfDirect and the Truncated GeneProf Databases
Note: We recommend a Linux operating system for development work and the instructions below are tailored accordingly. We assume a basic level of familiarity with the use of commandline tools and general system administration. We have currently not tested this software on any other system. However, if you do want to try installing GeneProfDirect on another OS, please make us aware of any problems you encounter (see Section 'Reporting Bugs and Feature Req..').
Following the steps outlined below will provide you with all the basic setup required to start developing GeneProf modules. You will download GeneProfDirect as well as the GeneProf Java libraries and a basic setup of the GeneProf databases (these are split across a local shared folder used for storing binary flat files -- or, alternatively, an FTP server -- and a relational database, preferably MySQL).

GeneProfDirect Installation:

  1. Download the latest version of the GeneProf Databases (see Subsection 'Installing the GeneProf Databa..') as well as all requirements for the database and job agency computers as listed in Subsection 'Setting up the Environment'.
  2. Download the latest version of GeneProfDirect from the downloads section.
  3. Choose a location and extract the archive like this: $ tar -xvzf GeneProfDirect.tar.gz
  4. Modify the file GeneProfDirect/run.cfg: This is a configuration file telling GeneProf where certain programs are installed, the access credentials for the database, etc. Update the values DB_USER, DB_PWD and DB_HOST with your MySQL username, password and address, respectively. Please also set either SHARED_ROOT_FOLDER to the shared database folder or set FTP_USER, FTP_PWD, FTP_HOST and FTP_ROOT_FOLDER, if you are using an FTP server.
  5. If you're using any non-standard command to execute Java or R (anything other than java and R), please also update the configuration parameters with the full path name of the executables (the parameters are called JAVA_EXECUTABLE and R).
  6. Test your GeneProfDirect installation by executing: $ java -jar GeneProfDirect.jar
  7. Hopefully, the execution will complete with the message ALL TESTS COMPLETE!; otherwise, please follow the instructions on the screen or get in touch if problems persist.
Obtaining GeneProf Libraries
In order to develop GeneProf modules, you'll also need to obtain a copy of the GeneProf java libraries. These are split across several projects: GeneProfCoreClasses -- essential Java classes used throughout the entirety of GeneProf, GeneProfWorkflowModules -- the code behind all GeneProf modules and SequenceEncoder -- a small utility class used for encoding and decoding nucleotide sequences in a more compact representation format.
Copies of the latest version of these files (packaged as .jar files) can be obtained from the downloads section. Include those in your classpath to make use of the included classes in your modules.
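For instance, compiling the example module from the next subsection on the commandline might look like this (the JAR file names are placeholders; substitute the file names of the versions you downloaded):

compile-module.sh.code:


$ javac -cp GeneProfCoreClasses.jar:GeneProfWorkflowModules.jar:SequenceEncoder.jar \
    ExampleModuleRandomLengthSeqs.java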
Other Recommended Development Tools
We do almost all our Java development work with the Netbeans IDE and recommend using this environment for coding, although we have also used Eclipse and found no problems with it. To examine database contents outside the GeneProf web app, we find DBVisualizer of great use, although you can, of course, use a commandline client or any other graphical application if you prefer.

Getting started: A Simple Example Module

Rather than going into a long-winded explanation of the system architecture and design philosophy, we'll proceed straight to an example at this point.
So, let's imagine we want to define a new module that takes nucleotide sequences, extracts random sub-sequences and writes those to a new output dataset. We start by creating a new Java class that inherits from ComponentHandler and define a constructor calling the super-constructor:

example-module.part1.code:


public class ExampleModuleRandomLengthSeqs extends ComponentHandler {


    public ExampleModuleRandomLengthSeqs() {
        super(
                new ComponentHandlerInfo(
                                    "Example: Random Subsequences",
                                    "Cut out sub-sequences to random length.",
                                    "2011-08-16",
                                    new User("John","Doe","contact@me.com")
                ),
                new InputPort[] {
                    new InputPort(
                            "Input Sequences", "A set of input sequences.",
                            DataType.SEQUENCES, 1
                    )
                },
                new OutputPort[] {
                    new OutputPort(
                            "Random Subsequences",
                            "Randomly cut-out subsequences, one per input.",
                            DataType.SEQUENCES, Port.UNLIMITED_BRANCHING
                    )
                },
                new Parameter[] {
                    new Parameter(
                            "minlen", "Min. Length",
                            "Minimum length of random sequences.",
                            3, ParameterType.INT
                    )
                }
        );
    }

    [..]
}
In the constructor we use an anonymous instance of ComponentHandlerInfo to specify some generic information about the module. There are several overloaded versions of this constructor and we could furthermore pass in additional arguments, but let's keep it simple for now and only provide (a) a name for the new module, (b) a short description, (c) a version number (or code) and (d) information about the developer ('John Doe' in the example, you in a real module!).
The next argument to the constructor is an array of type InputPort. Each input port corresponds to one connector in the workflow designer. It is of a defined type (from the DataType enumeration) and has a branching factor (can more than one input be connected to the same port? -1 would mean 'unlimited'). Each input port also has a name (first argument) and a description (second argument). The name is important! It serves as a unique identifier for this port, so you MUST NOT use the same port name twice! Our example module takes only one input of type DataType.SEQUENCES and we call the port Input Sequences.
Next, we define output ports in a similar fashion. The constructor is identical to the one for input ports. Note how we set the branching factor in such a way, that the output of this module may be used in an unlimited number of other processes. We set the name of the output port to Random Subsequences.
The last argument defines all user-defined parameters passed into the module. Again, we only specify one parameter for the example module: an (integer) number defining the minimum length of the sequences created. The constructor for the Parameter class takes the following arguments (in order): a unique key, a (short) label, a (slightly longer) description, a default value (here 3) and a data type (from the ParameterType enumeration).
This lays the groundwork for our new module. Now we need to add the actual functionality of the module by implementing the execute(..) method. We start by adding this code:

example-module.part2.code:


@Override
public void execute(Experiment exp,
                    Map<String, DatasetHandler[]> inp,
                    Map<String, DatasetHandler> outp,
                    Map<String, String> parameters)
                 throws RestrictedDataException, SQLException, IOException {

    SequenceDataHandler inSeqs =
                             (SequenceDataHandler) inp.get("Input Sequences")[0];
    SequenceDataHandler outSeqs =
                                (SequenceDataHandler) outp.get("Random Subsequences");

    int minLen = getParameterInteger(parameters, "minlen", true);

    Random rnd = Utility.getRandom();

    [..]
}
To start, we retrieve a reference to the input and output dataset using the arguments passed to the method and the port names defined earlier. Since input ports might receive more than one dataset per port, the value is an array, but in our case we know that there's only one, so we just take the entry at index position 0. We cast these dataset references (instances of a super-class called DatasetHandler) to a more specific sub-class, namely SequenceDataHandler, which we can do safely because we have defined the input ports to accept only data of type DataType.SEQUENCES. It's ok to assume that only the correct types of data are passed in at this point. Similar 'handlers' exist for other types, e.g. GenomicRegionDataHandler or FeatureDataHandler.
We then retrieve the user-defined value for the only parameter we defined, 'minlen'. The true at the end means that we want to use the default value if no user-defined value has been passed in.
Lastly, we get a random number generator from one of GeneProf's utility classes (adequately named Utility), which will be used later on to get some random numbers. Now we have everything we need to get going:

example-module.part3.code:


[..]

for(Sequence s : inSeqs) {

    int randomLen =
     minLen + rnd.nextInt(s.getFullSequenceLength() - minLen);
    int start = rnd.nextInt(s.getFullSequenceLength() - randomLen + 1);

    String nucSeq = s.getFullSequence().substring(start, start + randomLen);
    String qualSeq = null;
    if(s.hasQualities()) {
        qualSeq = s.getFullQualities().substring(start, start + randomLen);
    }

    Sequence newSeq = new Sequence(
            s.getId(),
            nucSeq,
            qualSeq
    );

    outSeqs.addData(newSeq);
}
What we do here is iterate over the input data -- one sequence at a time -- decide on a random fragment to cut out from the sequence, then trim out this sequence, creating a new Sequence data object. This sequence is then added to the output dataset. Pretty straightforward, isn't it? Rest assured, it won't always be quite that simple, but the fundamental idea is always pretty much like this.
Let's put this all together, add some comments and define constants for the port and parameter names:

ExampleModuleRandomLengthSeqs.java.code:


import java.io.IOException;
import java.sql.SQLException;
import java.util.*;
import org.stembio.geneprof.core.Constants.DataType;
import org.stembio.geneprof.data.*;
import org.stembio.geneprof.data.db.*;
import org.stembio.geneprof.exceptions.*;
import org.stembio.geneprof.convenience.Utility;
import org.stembio.geneprof.users.User;
import org.stembio.geneprof.workflow.*;
import org.stembio.geneprof.workflow.Parameter.ParameterType;

public class ExampleModuleRandomLengthSeqs extends ComponentHandler {

    public static final String INPUT_NAME = "Input Sequences";
    public static final String OUTPUT_NAME = "Random Subsequences";
    public static final String PARAM_MIN_LENGTH = "minlen";

    public ExampleModuleRandomLengthSeqs() {
        super(
                // general information:
                new ComponentHandlerInfo(
                                    "Example: Random Subsequences",
                                    "Cut out sub-sequences to random length.",
                                    "2011-08-16",
                                    new User("John","Doe","contact@me.com")
                ),
                // one input dataset (nucleotide sequences):
                new InputPort[] {
                    new InputPort(
                            INPUT_NAME, "A set of input sequences.",
                            DataType.SEQUENCES, 1
                    )
                },
                // one output dataset (also, sequences):
                new OutputPort[] {
                    new OutputPort(
                            OUTPUT_NAME,
                            "Randomly cut-out subsequences, one per input.",
                            DataType.SEQUENCES, Port.UNLIMITED_BRANCHING
                    )
                },
                // one parameter: minimum length of output sequences
                new Parameter[] {
                    new Parameter(
                            PARAM_MIN_LENGTH, "Min. Length",
                            "Minimum length of random sequences.",
                            3, ParameterType.INT
                    )
                }
        );
    }

    @Override
    public void execute(Experiment exp,
                        Map<String, DatasetHandler[]> inp,
                        Map<String, DatasetHandler> outp,
                        Map<String, String> parameters)
                     throws RestrictedDataException, SQLException, IOException {

        // get reference to input data:
        SequenceDataHandler inSeqs =
                                 (SequenceDataHandler) inp.get(INPUT_NAME)[0];
        // get reference to output data:
        SequenceDataHandler outSeqs =
                                    (SequenceDataHandler) outp.get(OUTPUT_NAME);

        // retrieve the value for the minimum length parameter (using the
        // default, if it has not been set):
        int minLen = getParameterInteger(parameters, PARAM_MIN_LENGTH, true);

        // get a (seeded) random number generator:
        Random rnd = Utility.getRandom();

        // then iterate over the sequences in the input data:
        for(Sequence s : inSeqs) {

            // calculate a random length and random offset:
            // calculate a random length and random offset:
            int randomLen =
             minLen + rnd.nextInt(s.getFullSequenceLength() - minLen);
            int start = rnd.nextInt(s.getFullSequenceLength() - randomLen + 1);

            // cut out the new nucleotide sequence from the input sequence:
            String nucSeq = s.getFullSequence().substring(start, start + randomLen);
            // .. if there are associated quality values, cut those, too:
            String qualSeq = null;
            if(s.hasQualities()) {
                qualSeq = s.getFullQualities().substring(start, start + randomLen);
            }

            // create a new sequence data object:
            Sequence newSeq = new Sequence(
                    s.getId(),
                    nucSeq,
                    qualSeq
            );

            // .. and, finally, add the sequence to the output dataset:
            outSeqs.addData(newSeq);
        }
    }

}

Tips, Tricks and Pitfalls

As we have seen in the example above, most of the action in a GeneProf module happens in the execute(..) method. From the arguments of this method, you can retrieve references to the input and output datasets. These are passed in in the form of Maps pointing from the identifier of a port to a DatasetHandler -- or, for inputs, to an array of those, because there might be more than one if you allowed the port to accept multiple inputs. These DatasetHandlers are generic interfaces providing functionality applicable to all data, regardless of the specific type. Often you will find it more useful to cast these handlers to a more specific, data type-specific sub-class. Here's a mapping of data types to their respective handlers:
SEQUENCES: org.stembio.geneprof.data.db.SequenceDataHandler
GENOMIC_REGIONS: org.stembio.geneprof.data.db.GenomicRegionDataHandler
FEATURES: org.stembio.geneprof.data.db.FeatureDataHandler
REFERENCE: org.stembio.geneprof.data.db.ReferenceDataHandler
FILE: org.stembio.geneprof.data.db.FileDataHandler
SPECIAL: org.stembio.geneprof.data.db.SpecialDataHandler
Furthermore, you might want to use the convenience methods for retrieving parameter values more easily and quickly (listed with the parameter types they apply to):
getParameter(..): STR (or any other type)
getParameterDouble(..): DBL, INT
getParameterInteger(..): INT
getParameterBoolean(..): BOOL
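As an illustration, here is how those getters might be used inside execute(..). The getParameterInteger(..) call is exactly as in the examples in this chapter; the signatures of the other three getters are assumed to follow the same pattern, with the final boolean requesting the declared default value when the user has not set the parameter:

parameter-getters.code:


// retrieve parameter values, falling back to the declared defaults:
String  label     = getParameter(parameters, "label", true);
double  threshold = getParameterDouble(parameters, "threshold", true);
int     minLen    = getParameterInteger(parameters, "minlen", true);
boolean verbose   = getParameterBoolean(parameters, "verbose", true);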
Now, most modules won't be quite as simple as the example module above, and often the structure of the output dataset depends heavily on what's going on inside the module. Most GeneProf data is essentially tabular, i.e. there are rows of data with each row representing a separate entry (e.g. a genomic interval, sequence or gene) and each column storing a different type of value. In GeneProf, each column is represented by an AnnotationType object, which stores metadata about the column such as the data type, label and a description. Each row, in turn, is represented by an AnnotationValues object -- essentially a map from the IDs of annotation types to the value of the respective row. These values are typically retrieved iteratively rather than all at once, since genomic data can be huge and quite heavy on the system memory. GeneProf automatically adds all columns required to store the defining information for a given data type to output datasets. Exactly which these are depends on the type, but references to the columns can easily be retrieved from the specific data handlers, e.g. org.stembio.geneprof.data.db.GenomicRegionDataHandler has a method called getStartAnnotationType(), which refers to the column storing the left-most coordinate of a region.
For datasets of type FEATURES and GENOMIC_REGIONS it is possible to add an arbitrary assortment of additional columns. These need to be added to the output dataset handler before any data is fed into it (use the addAnnotationType(..) method).
Rather than working with AnnotationValues, most data can be iterated and retrieved in more convenient formats, i.e. SEQUENCES as Sequence objects and GENOMIC_REGIONS as IdentifiableComplexLocus objects (ok, ok, ugly name.. still better than AnnotationValues..). A combination of basic values and additional columns is usually stored in an AnnotatedEntry<T> object, which is a tuple consisting of either an IdentifiableComplexLocus or a Sequence plus an AnnotationValues.
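To condense the points above into code, here is a minimal sketch that reuses the API exactly as it appears in the advanced example below; only AnnotationValueType.DBL is an assumption, chosen by analogy with the BOOL flag used there:

custom-column-sketch.code:


// add a custom (double-valued) column to a GENOMIC_REGIONS output dataset,
// *before* feeding any data into it:
GenomicRegionDataHandler out =
        (GenomicRegionDataHandler) outp.get(OUTPUT_PORT);
AnnotationType atScore = new AnnotationType(out.getData(),
        out.getData() + "_SCORE", "Score",
        "An illustrative per-region score.",
        AnnotationValueType.DBL); // assumed by analogy with BOOL
out.addAnnotationType(atScore);

// then, for each region to be written to the output:
AnnotationValues avs = new AnnotationValues(locus.getIdentifier());
avs.addValue(atScore.getAnnotationTypeId(), 0.75);
out.addData(new AnnotatedEntry<IdentifiableComplexLocus>(locus, avs));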
This might (ok, WILL) all sound a bit confusing in the beginning, but will soon start making sense after looking at some other modules. We're demonstrating some of these features in the next example.
More information will be added to this section in the future.

Advanced Example: Integrating External Software

The bioinformatics community has developed a wide range of tools for next-generation sequencing and functional genomic data analysis. Many of these tools could sensibly be integrated into the GeneProf environment and, in fact, we have already done so (cp. e.g. module 'Align against DNA with Bowtie ..' or module 'DESeq').
The key to integrating external software is to write a small 'wrapper' module, which understands what sort of input the tool requires, what parameters might be set and which outputs are produced. The module then has to convert the GeneProf input data into the correct format, execute the program (with user-defined parameters) and parse the output back into a GeneProf-compliant data format.
Probably the best way to demonstrate this is to look at another example. In this example, we will run a simple shell script to manipulate some GeneProf data. This is the script:

MyShellScript.sh.code:


perl -p -i -e 's/^(\d+)\t3\t/$1\t2\t/' $1
All it does is look for entries that are on the chromosome with the ID 3 (note that the chromosome ID is not (always) equivalent to the chromosome name!) and place them onto the chromosome with ID 2; e.g. an entry on chromosome 3 spanning positions 1000-2000 would simply be re-assigned to chromosome 2. Pretty useless, but it'll do as an example. Now, how do we use this script in a GeneProf module? Well, here's the code:

ExampleModuleExternalScript.java.code:


import java.io.*;
import java.sql.SQLException;
import java.util.*;
import org.stembio.geneprof.annotations.*;
import org.stembio.geneprof.core.Constants.*;
import org.stembio.geneprof.core.*;
import org.stembio.geneprof.data.*;
import org.stembio.geneprof.data.db.*;
import org.stembio.geneprof.exceptions.*;
import org.stembio.geneprof.features.*;
import org.stembio.geneprof.interfaces.commandline.CommandLineInterface;
import org.stembio.geneprof.users.User;
import org.stembio.geneprof.workflow.*;


public class ExampleModuleExternalScript extends ComponentHandler {

    // configuration file parameter:
    public static final String MY_GREAT_SCRIPT = "MY_GREAT_SCRIPT";

    // port names:
    public static final String INPUT_PORT = "Input Regions";
    public static final String OUTPUT_PORT = "Processed Regions";

    public ExampleModuleExternalScript() {
        super(
                new ComponentHandlerInfo( "Example: External Script",
                        "Runs a shell script processing a BED file.",
                        "2011-08-16", new User("John","Doe","contact@me.com") ),
                new InputPort[] { new InputPort(INPUT_PORT,
                         "A set of genomic regions.",
                         DataType.GENOMIC_REGIONS, 1) },
                new OutputPort[] { new OutputPort(OUTPUT_PORT,
                         "Genomic regions processed using my great script.",
                         DataType.GENOMIC_REGIONS, Port.UNLIMITED_BRANCHING) },
                new Parameter[0]
        );
    }

    @Override
    public void execute(Experiment exp, Map<String, DatasetHandler[]> inp,
                        Map<String, DatasetHandler> outp,
                        Map<String, String> parameters)
                                 throws RestrictedDataException, SQLException,
                                             IOException, InterruptedException {
        // get configuration:
        String sh = LocalConfig.get(LocalConfig.SH);
        String script = LocalConfig.get(MY_GREAT_SCRIPT);

        // get input and output handlers:
        GenomicRegionDataHandler in =
                             (GenomicRegionDataHandler) inp.get(INPUT_PORT)[0];
        GenomicRegionDataHandler out =
                             (GenomicRegionDataHandler) outp.get(OUTPUT_PORT);

        // define additional new column for the output dataset:
        AnnotationType atFlag = new AnnotationType(out.getData(),
                                        out.getData() + "_FLAG",
                                        "Processed",
                                        "Flag indicating that this record has "
                                        + "been processed using " + script,
                                        AnnotationValueType.BOOL);
        out.addAnnotationType(atFlag);

        // Simple way to export data in standard formats:
        // RegionDataExporter.exportToBED(..);
        // RegionDataExporter.exportToWIG(..);
        // ..

        File tmpFile = null;
        PrintWriter pw = null;
        BufferedReader br = null;
        try {
            // export the input data in a tab-delimited text file:
            // ID <TAB> CHR <TAB> START <TAB> END <TAB> STRAND
            tmpFile =
                 File.createTempFile("tmp", "txt", LocalConfig.getTempFolder());
            tmpFile.deleteOnExit();
            pw = new PrintWriter(new FileWriter(tmpFile));
            String linePattern = "%d\t%d\t%d\t%d\t%s";
            for(IdentifiableComplexLocus locus : in) {
                pw.println(String.format(linePattern,
                        locus.getIdentifier(), locus.getChromosome(),
                        locus.getStart(), locus.getEnd(),
                        Strand.getStrandString(locus.isPlusStrand()))
                );
            }
            pw.flush();
            pw.close();
            pw = null;

            // execute the shell script using the `sh` command and pass in the
            // absolute path of the temporary file as a parameter:
            CommandLineInterface.runCommand(
                            sh + " " + script + " " + tmpFile.getAbsolutePath()
            );

            // parse output (same format as input):
            br = new BufferedReader(new FileReader(tmpFile));
            String line;
            String[] tokens;
            while((line = br.readLine()) != null) {
                tokens = line.split("\\t");
                // create a new region description from the parsed data:
                IdentifiableComplexLocus locus = new IdentifiableComplexLocus(
                        Integer.parseInt(tokens[0]), // id
                        Integer.parseInt(tokens[1]), // chromosome
                        Strand.getStrand(tokens[4]), // strand
                        Integer.parseInt(tokens[2]), // start
                        Integer.parseInt(tokens[3])  // end
                );
                // store additional data in this map-like object:
                AnnotationValues avs =
                                    new AnnotationValues(locus.getIdentifier());
                // set the processed-flag to true:
                avs.addValue(atFlag.getAnnotationTypeId(), true);

                // add both to the output:
                out.addData(
                        new AnnotatedEntry<IdentifiableComplexLocus>(locus, avs)
                );
            }

        }
        finally {
            // always tidy up temporary files, streams, etc.!
            if(pw != null) try {
                pw.close();
            } catch(Throwable t) { /* don't worry about exception here.. */ }
            if(br != null) try {
                br.close();
            } catch(Throwable t) { /* don't worry about exception here.. */ }
            if(tmpFile != null && tmpFile.isFile()) tmpFile.delete();
        }
    }

    @Override
    public Map<String,String> getRequiredEnvironmentVariables() {
        Map<String,String> envVars = new HashMap<String,String>();
        envVars.put(LocalConfig.SH, "Shell script execution command.");
        envVars.put(MY_GREAT_SCRIPT, "The external script to be executed.");
        return envVars;
    }

}
Like in our earlier example, we first define the general configuration of the module by calling the constructor of the super-class. Note that this module deals with GENOMIC_REGIONS rather than SEQUENCES, so the DataType of the in- and output ports is different from before! This module does not take any parameters.
Importantly, this module depends on external software (our little script) being installed on any processing node attempting to run the module. To check that this setup is present, we need to specify all configuration options that must be set to run the module. We do this by overriding the getRequiredEnvironmentVariables() method. The returned map contains the names of all configuration variables that need to be set, each mapped to a short description. Here we require a shell (sh) and a custom config option that we called MY_GREAT_SCRIPT.
Alright, let's dive into the depths of the execute(..) method! In addition to running our script, we want to extend each processed entry with a new annotation value, a simple flag which indicates that the entry has been processed with our script (so we might later on tell that it's bogus..). To do so, we define a new AnnotationType and add it to the output dataset, before we start any further processing. That done, we export the input data into a tab-separated text file by creating a temporary file and printing selected information from the input data into it -- one entry per line.
Subsequently, we execute the script using the shell (sh) and provide the full path to the temporary file as an argument, so it can be referred to from within the script.
The script modifies the file in place, so afterwards we can parse the temporary file and write its content, after converting it into GeneProf-compliant data, to the output dataset. Note how we make use of an AnnotatedEntry<IdentifiableComplexLocus> object here to bundle the essential genomic region information with the additional annotation value for the new flag column.

Source Code & JavaDoc

Code documentation for GeneProf classes and modules, as well as source code for many of the classes, can be obtained from the downloads section.

Getting your Modules into GeneProf

Once you've completed a new module and tested it (extensively), you might want to consider getting your piece of work integrated into the live GeneProf application and future releases of the software. In fact, we're very keen to hear from you! What we'll do is have a look at your code, just to check that there are no obvious flaws or security risks, and maybe we can suggest some improvements or alterations. Once both parties are satisfied, we'll add your module to the repository of tools and deploy it alongside the next scheduled update to the GeneProf system.
A few further notes here: The module code remains your (intellectual) property, and throughout the system and documentation module developers are appropriately attributed at all times. If you ever feel it necessary to update your module, it is of paramount importance to consider what effect that might have on existing results -- possibly produced by others using your module. Minor adjustments that are guaranteed never (ever) to change or invalidate existing results may be made in place (e.g. performance improvements and the like, or adding an additional parameter which would change the way things are processed, but whose default value is set in such a way that older results would be reproduced perfectly). All other changes necessitate the creation of a new module. The older version can then be marked as 'out-dated', but ongoing workflows depending on it can still use it without impairing the integrity of their results. The same considerations apply to updated versions of external programs!