Documentation

This documentation is currently being migrated from the manual accompanying the package. It will be updated and eventually replace the manual.


 


Tutorial

Overview

The abcsysbio module within abc-sysbio can be used as a standalone Python library. To facilitate the most common applications of the package, abc-sysbio is also supplied with two scripts, abc-sysbio-sbml-sum and run-abc-sysbio, which combine the functions in the package to implement the ABC algorithms.

run-abc-sysbio runs all the simulation, parameter inference and model selection algorithms and takes as input a .xml configuration file.

abc-sysbio-sbml-sum parses SBML files to produce a model summary containing information on the structure of the models. It also provides a template .xml configuration file for the user to edit and supply to run-abc-sysbio.

Conversion of SBML models to Python/C++/CUDA code

When run-abc-sysbio is run, model(s) written in SBML format are used to generate an appropriate code module representing the model, via a call to abcsysbio_parser.ParseandWrite. The format of the code written depends on the integration type, which also informs the program which solver to use to simulate the model (see Section “Extending the library”).

In an SBML model, values can be assigned to compartment sizes, to parameters and to species. When the Python module representing the model is written, the model is inspected to determine how values should be assigned during the course of the simulation. Compartments typically have a constant size and are considered as an additional parameter to the model, albeit one that the user may not want to infer. Some species may be outside of the scope of the reactions in the model, having either a constant value or a value assigned by an assignment rule.

Conversely, SBML has the capacity for rate rules to be used to assign values to species, parameters or compartment sizes. Rate rules are time-dependent rules used to assign values to variables. Because our solvers typically assign values to species over time to compute a trajectory, species, parameters and compartment sizes with rate rules are treated as ‘species’ in a Python module representing the SBML model.

The user input file

Not all the information required for the algorithms is intrinsic to the model. A .xml file that contains the additional information (such as initial conditions and parameter values) must therefore be supplied to the script. This file, the “user input file”, must be written in a specific format. Examples are included in the “Examples” folder which is supplied along with the package; details of the input file are given here.

Required arguments

  • modelnumber Number of the model for which details are described in this input file
  • epsilon This allows for the specification of the tolerance schedules. Often only one epsilon schedule is required, but in more complex design problems or with inferences using summary statistics, multiple epsilon schedules may be desired. Within the epsilon tags each schedule can be specified via a whitespace delimited list of values between two tags, e.g. <e1> </e1>. Note that the parser ignores the names of the tags and so will always read them in order. If multiple schedules are provided then the user must specify a custom distance function (see Section “Extending the library”).
  • autoepsilon This is an experimental feature used instead of the epsilon option where the epsilon schedule is automated. The vector of final epsilons (whitespace separated) can be specified within <finalepsilon> </finalepsilon> tags. The <alpha> tag specifies the quantile of the previous population distance distribution to choose as the next epsilon. The optimal value of this parameter will depend on the models and data.
  • particles Number of particles to accept
  • beta Number of times to simulate each sampled parameter set. For deterministic systems beta is set to 1. For analysis of stochastic systems beta can be chosen larger than 1.
  • dt The internal time step for the solvers.
  • data This section is divided into <times> and <variables>. Both describe the experimental data for which the parameters or models respectively have to be analyzed.
    • times The time points are given within <times> tags as a whitespace delimited list.
    • variables The species concentrations (in the case of an ODE or SDE simulation) or molecule numbers (in the case of a Gillespie simulation) are also given as whitespace delimited lists and denoted as <v1>, <v2> .. <vN> for N different species. Note that the names of the tags are ignored and the parser always reads the species in order. Missing data can be specified provided that the first entry is present. Missing data is denoted by ‘NA’.
  • models Each model is contained within tags <modeli> i =1,…,M, where M is the total number of models to be investigated.
    • name The name of the model, which will be the name of the code written (with suitable extension .py, .cpp, .cu) to represent the model in a format that is interpretable by the solvers. If option --localcode is given, this is instead the name of the existing code file for the model.
    • source The name of the .xml file containing an SBML representation of the model. Can be left blank if option --localcode is given.
    • type The simulation type. One of ODE, SDE or Gillespie.
    • fit Denotes the correspondence between the species defined in the SBML model and the experimental data. If this keyword is not given, or if fit is None, all the species in the model are fitted to the data in the order that they are listed in the model. Otherwise, a whitespace-delimited list of fitting instructions with the same length as the dimensions of the data can be supplied. Simple arithmetic operations can be performed on the species from the SBML model. To denote the Nth species in the SBML model, use speciesN. For example, to fit the sum of the first two species in the model, write species1+species2.
    • parameters, initial Prior specifications on parameters and initial conditions. Note that the tag names for each parameter and species initial condition are ignored and are always read in the order specified. The prior is specified within the tags via a whitespace delimited list:
      • constant x constant parameter with value x
      • normal a b normal distribution with location a and variance b
      • uniform a b uniform distribution on the interval [a, b]
      • lognormal a b lognormal distribution with location a and variance b
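
To make the prior format concrete, the following is a minimal sketch of how such a whitespace delimited specification could be read and sampled using only the Python standard library. The function names parse_prior and sample_prior are hypothetical and are not part of abcsysbio. The sketch also assumes that b is a variance (so the standard deviation is sqrt(b)) and that, for the lognormal case, a and b describe the underlying normal distribution; the package itself may interpret these values differently.

```python
import math
import random

PRIOR_ARITY = {"constant": 1, "normal": 2, "uniform": 2, "lognormal": 2}


def parse_prior(spec):
    """Split a prior string such as 'uniform 0 10' into (kind, [floats])."""
    fields = spec.split()
    if not fields or fields[0] not in PRIOR_ARITY:
        raise ValueError("unknown prior specification: %r" % spec)
    kind = fields[0]
    values = [float(v) for v in fields[1:]]
    if len(values) != PRIOR_ARITY[kind]:
        raise ValueError("%s expects %d value(s)" % (kind, PRIOR_ARITY[kind]))
    return kind, values


def sample_prior(spec, rng=random):
    """Draw one value from the prior described by spec."""
    kind, values = parse_prior(spec)
    if kind == "constant":
        return values[0]
    a, b = values
    if kind == "uniform":
        return rng.uniform(a, b)
    if kind == "normal":
        # Assumption: b is a variance, so the standard deviation is sqrt(b).
        return rng.gauss(a, math.sqrt(b))
    # Assumption: a and b parameterise the underlying normal distribution.
    return rng.lognormvariate(a, math.sqrt(b))
```

For instance, sample_prior("constant 2.5") always returns 2.5, while sample_prior("uniform 0 10") returns a value in the interval [0, 10].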

Optional arguments

  • modelkernel Used in model selection. This controls the model perturbation probability (default =0.7).
  • modelprior This specifies the prior over the models. The default is a uniform prior over the model space.
  • kernel The implemented ABC SMC algorithms compute the perturbation kernels after each population, dependent on the previous particle distributions. Implemented distributions for the perturbation kernels are (default uniform)
    • uniform component-wise uniform kernels
    • normal component-wise normal kernels
    • multiVariateNormal multivariate normal kernel whose covariance is based on the whole previous population
    • multiVariateNormalKNeigh multivariate normal kernel where the covariance is based on the K nearest neighbours of the particle
    • multiVariateNormalOCM multivariate normal kernel whose covariance is the optimal covariance matrix (OCM)
  • rtol, atol For models to be simulated as an ODE system these two keywords can be used to set the relative and absolute error tolerances for the numeric simulation. For stiff models, this may be necessary for successful simulation.
  • restart Frequently in the implementation of the ABC SMC algorithm, the epsilon schedule selected in the first instance might be sub-optimal, leading to a high acceptance rate and too wide a posterior distribution. In addition this makes parameter inference computationally expensive. To avoid wasting the information from initial attempts at parameter inference, it is possible to make a backup that stores the information about each population after it has been completed. With this backup one can stop the program, change the maximum distances or any other parameters and restart the program with the results of the last population. To do this add: <restart> True </restart>. When restarting from a backup population, it is important not to increase the population size and to keep the structure of the models constant. Permitted changes include epsilon, beta, dt, rtol and atol, the values in data (but not the structure), the initial concentrations, the prior distributions (for constant parameters). Which of these changes will make the inference more informative, we will leave the user to decide.
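
As an illustration of the kernel option, the following is a minimal sketch of a component-wise uniform perturbation kernel that is recomputed from the previous population. It assumes that the kernel for each parameter extends half the range of the previous population on either side of the particle; the scaling actually used by abcsysbio may differ. The function names uniform_kernel_bounds and perturb are hypothetical.

```python
import random


def uniform_kernel_bounds(population):
    """Return one (low, high) offset pair per parameter.

    population is a list of accepted parameter vectors from the
    previous population. Assumption: the half-width of each kernel
    is half the range spanned by that parameter.
    """
    bounds = []
    for component in zip(*population):
        half_width = (max(component) - min(component)) / 2.0
        bounds.append((-half_width, half_width))
    return bounds


def perturb(particle, bounds, rng=random):
    """Perturb each parameter independently with its uniform kernel."""
    return [value + rng.uniform(low, high)
            for value, (low, high) in zip(particle, bounds)]
```

In an ABC SMC scheme a perturbed particle that falls outside the support of the prior has zero prior density and would normally be discarded or redrawn; that check is omitted here.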

abc-sysbio-sbml-sum

This user input file can be written by hand; however, we foresee attempts to hand-write the user input file leading to errors. Therefore, the script abc-sysbio-sbml-sum is provided. This script generates two text files. The first is a summary of all SBML models to be investigated. The second is a template for the user input file. This template includes all necessary keywords with instructions on how to fill in the information. It already has the correct structure for the models and provides default values for several keywords. Additionally, if a data file is specified then the data will be passed into the template .xml file.

Command line options

The current options are listed below and can also be displayed using the --help option.

  • --files a comma separated list of xml model files
  • --data a data file with columns time, variable1, variable2, variable3 … (optional)
  • --input_file_name the name of the xml file to write (default input_file_template.xml)
  • --summary_file_name the name of the summary file to write (default model_summary.txt)

For example, to run abc-sysbio-sbml-sum with three SBML models, type

> abc-sysbio-sbml-sum --files source1.xml,source2.xml,source3.xml

to produce the output files model_summary.txt and input_file_template.xml. The model summary is very useful for examining the model structure and is required to identify the order of the parameters.

run-abc-sysbio

To run run-abc-sysbio, use

> run-abc-sysbio -i user_input_file.xml

Command line options

When running run-abc-sysbio, several options are possible. To see a list of these, type

> run-abc-sysbio --help

The current options are listed below. Most options have both a short and a long form.

  • -i, --infile declaration of the input file. This input file has to be provided to run the program!
  • -lc, --localcode do not import the model from SBML; instead use a .py, .cpp or CUDA file
  • -sd, --setseed seed the random number generator in numpy with an integer, e.g. -sd=2 or --setseed=2
  • -tm, --timing print timing information
  • --c++ use C++ implementation
  • -cu, --cuda use CUDA implementation
  • -of, --outfolder write results to folder, e.g. -of=/full/path/to/folder (default is _results_ in current directory)
  • -f, --fulloutput print epsilon, sampling steps and acceptance rates after each population
  • -s, --save no backup after each population
  • -S, --simulate simulate the model over the range of timepoints, using parameters sampled from the priors
  • -d, --diagnostic disable printing of diagnostic plots
  • -t, --timeseries disable plotting of simulation results after each population
  • -p, --plotdata disable plotting of given data points
  • -h, --help print this list of options.
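
When run-abc-sysbio is driven from another Python script, the options above can be assembled programmatically. The sketch below is one possible way to do this; build_command and run are hypothetical helpers, not part of the package, and only flags listed above are used, in the forms shown in this section (-i followed by the file name, -of=... and -sd=...).

```python
import subprocess


def build_command(infile, outfolder=None, seed=None, cuda=False, timing=False):
    """Assemble a run-abc-sysbio command line as a list of arguments."""
    command = ["run-abc-sysbio", "-i", infile]
    if outfolder is not None:
        command.append("-of=%s" % outfolder)
    if seed is not None:
        command.append("-sd=%d" % seed)
    if cuda:
        command.append("-cu")
    if timing:
        command.append("-tm")
    return command


def run(command):
    """Launch the command; requires run-abc-sysbio to be on the PATH."""
    return subprocess.call(command)
```

For example, build_command("user_input_file.xml", seed=2) gives the argument list for the command run-abc-sysbio -i user_input_file.xml -sd=2. Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.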

Output

The outputs from running the ABC SMC algorithm are saved in a folder specified via the --outfolder option.

  • _data.png, a scatter plot of your input data.
  • rates.txt containing population number, epsilon value, number of sampled particles, acceptance rate and time to complete in seconds
  • ModelDistribution_1.png and ModelDistribution.txt Histograms of the posterior distribution of accepted models after each population. Above each histogram the population number, epsilon, and acceptance rate for that population are displayed.
  • One text file per population, distance_PopulationN.txt, listing the distances of the accepted particles together with the model number of the accepted model.
  • One text file per population, traj_PopulationN.txt, the trajectories of the accepted particles. Each line contains:
    accepted particle number, replicate number (==0 if beta=1), model id, fitted species id, X(t=1), X(t=2) ……
  • One sub-folder per model. These sub-folders, suffixed with the model name, contain sub-folders for each population, population_N. Each contains:
    • data_PopulationN.txt, the accepted parameter sets
    • data_WeightsN.txt, the accepted parameter weights
    • ScatterPlotPopulationN.png, scatter plots of all accepted parameters (see the corresponding figure in the package manual).
    • TimeseriesPopulationN.png, simulations of the model using ten accepted parameter sets, to compare with the data.
    • weightedHistograms_PopulationN.png, histograms showing accepted parameter distributions.
  • copy contains, in binary form, the information required to restart the ABC SMC algorithm using the last population. These files are not human-readable but are read into Python if the algorithm is restarted from a previous population. See Example 2.
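
The trajectory files can be post-processed with a few lines of Python. The following is a minimal sketch that follows the column order described above (particle number, replicate number, model id, fitted species id, then the simulated values). It assumes that the fields are separated by whitespace or commas, which is not stated here and should be checked against an actual output file. The function names are hypothetical.

```python
import re


def parse_trajectory_line(line):
    """Turn one line of a traj_PopulationN.txt file into a dictionary."""
    # Assumption: fields are separated by commas and/or whitespace.
    fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
    if len(fields) < 5:
        raise ValueError("expected four identifiers and at least one value")
    return {
        "particle": int(float(fields[0])),
        "replicate": int(float(fields[1])),
        "model": int(float(fields[2])),
        "species": int(float(fields[3])),
        "values": [float(v) for v in fields[4:]],
    }


def read_trajectories(path):
    """Read every non-empty line of a trajectory file."""
    with open(path) as handle:
        return [parse_trajectory_line(line) for line in handle if line.strip()]
```

The returned dictionaries can then be grouped by model or species, for example to overlay the accepted trajectories on the data.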

Simulation mode

Besides implementing the ABC SMC algorithms, the program run-abc-sysbio provides an easy way to simulate biochemical systems directly from the SBML source. Simulation mode requires the user input file, but the epsilon schedule and the variables section of data are not required and are ignored if present. Here particles specifies the number of simulations to perform and parameters are sampled from their priors. Multiple models can be specified, together with modelprior, so that model averaging can be performed.

The output folder in simulation mode contains

  • particles.txt the simulated parameter sets
  • trajectories.txt the trajectories of the accepted particles. Each line contains:
    accepted particle number, replicate number (==0 if beta=1), model id, fitted species id, X(t=1), X(t=2) ……
  • One .png file for each model plotting the simulated timeseries

Installation

Overview

The package abc-sysbio provides a Python module, abcsysbio, containing functions for parameter inference and model selection. Together with the scripts abc-sysbio-sbml-sum and run-abc-sysbio it forms a user-friendly tool. The biochemical network simulation algorithms are now implemented in Python, C++ and CUDA (through cuda-sim). The package is developed for the Linux operating system but can also run on Mac OS X.

abcsysbio has a number of dependencies which change as different parts of the software are used. numpy and matplotlib are essential. libsbml must be installed if import from SBML is desired. We advocate the use of CUDA if possible, for which cuda-sim must be installed, or C++, which requires the GSL library to be installed. scipy must be installed if the Python ODE solver is to be used. To make installation easier, it is advisable to use the package together with the Enthought Python Distribution, which contains matplotlib, numpy and scipy.
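
Before installing, it can be useful to check which of these dependencies are visible to the Python interpreter that will be used. The following sketch simply tries to import each module. The import name cudasim for cuda-sim is an assumption, and check_dependencies is a hypothetical helper rather than part of the package.

```python
# Module name -> why it is needed (taken from the list above).
DEPENDENCIES = {
    "numpy": "essential",
    "matplotlib": "essential",
    "libsbml": "import from SBML",
    "scipy": "Python ODE solver",
    "cudasim": "CUDA implementation (import name assumed)",
}


def check_dependencies(modules):
    """Return a dictionary mapping each module name to True or False."""
    status = {}
    for name in modules:
        try:
            __import__(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status
```

Calling check_dependencies(DEPENDENCIES) with the same interpreter that will run setup.py shows which optional parts of the software will be usable.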

Once the dependent packages have been installed (see the README.txt in the package distribution), abc-sysbio is installed via the standard distutils interface. Download and unpack the code abc-sysbio-XX.tar.gz where XX is the version number.

> tar -xzf abc-sysbio-XX.tar.gz
> cd abc-sysbio-XX
> python setup.py install

Linux details

On Linux this will copy the module abcsysbio into the lib/pythonXX/site-packages directory corresponding to the python version that was used to invoke the installation. In addition it will copy the scripts run-abc-sysbio and abc-sysbio-sbml-sum into the same bin directory as the python version that was used to invoke the installation. For example

> /usr/local/bin/python2.6 setup.py install

places run-abc-sysbio and abc-sysbio-sbml-sum into /usr/local/bin/ and abcsysbio into /usr/local/lib/python2.6/site-packages. If run-abc-sysbio and abc-sysbio-sbml-sum are already in the path then it should now be sufficient to issue the command

> run-abc-sysbio -h

to get a list of options. Alternatively if they are not in the path, they can be added by issuing

> export PATH=<dir>:$PATH

for ‘sh’ type shells or

> setenv PATH <dir>:$PATH

for ‘c’ type shells. This must be done each time you open a new shell unless you add the line to your .bashrc / .cshrc files. Alternatively you can just call the scripts using the full path, e.g. /usr/local/bin/run-abc-sysbio.

To run the C++ version of the code, you must also set two environment variables, GSL_INC and GSL_LIB, which point to the directories containing the GSL include files and libraries respectively.
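
The environment described above can be checked from Python before a run. The sketch below looks for the two scripts on the PATH and tests whether GSL_INC and GSL_LIB point to existing directories. The helper names find_on_path and check_gsl_environment are hypothetical, and the check does not verify that the directories actually contain GSL.

```python
import os


def find_on_path(program, path=None):
    """Return the full path of an executable found on PATH, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def check_gsl_environment(environ=None):
    """Report whether GSL_INC and GSL_LIB are set to existing directories."""
    if environ is None:
        environ = os.environ
    status = {}
    for name in ("GSL_INC", "GSL_LIB"):
        value = environ.get(name)
        status[name] = bool(value) and os.path.isdir(value)
    return status
```

For example, find_on_path("run-abc-sysbio") returns None when the script directory still has to be added to the PATH.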

Mac OS X details

In principle the installation should be identical to Linux. However on the Mac the location of the resulting files can be different depending on which flavour of python you are using. The output from the install command can be used to identify the location of the scripts and this directory should be added to the PATH as above.
See the README.txt in the package distribution for one way to install abc-sysbio on the Mac.

Reference