GPML Pathway Format

Overview

This page describes every aspect of GPML, how it came to be, how it works and how you can use it.

GPML is simply an XML-based format. You can use it to define a pathway consisting of purely graphical elements (such as lines and shapes) or graphical elements with added biological information (such as genes, proteins and other datanodes).

GPML has very strong ties to the GenMAPP MAPP format. This is important to realize because it explains some of the idiosyncrasies in the definition, most of which are there for backwards compatibility with GenMAPP.

For an example of a real-life GPML file, take a look at Hs_Apoptosis.gpml.

Naming

GPML stands for GenMAPP Pathway Markup Language. Originally this file format was named GMML, for GenMAPP Markup Language. GMML was renamed to GPML to make it more distinct from both GML and XGMML, two markup languages used extensively by the Cytoscape community.

Structure of a GPML file

Overview

The root element is always the <Pathway> element. Below the <Pathway> element there are three important types of elements:

  • pure Graphical elements: Shape, Label
  • elements with a biological context: DataNode
  • the only element that can connect elements: Line

Contrary to most XML definitions, in GPML all elements and attributes start with an uppercase letter.

Root level: Pathway

At the root there is always one Pathway element.

  • Subelement Comment
  • Subelement Graphics
    • BoardWidth -> width of the drawing field; all elements should fit in it.
    • BoardHeight -> height of the drawing field; all elements should fit in it.
    • WindowWidth -> exists only for backwards compatibility with GenMAPP.
    • WindowHeight -> exists only for backwards compatibility with GenMAPP.
  • Name -> Pathway title. Will be displayed in the infobox in PathVisio.
  • Organism -> The full Latin name, e.g. "Homo sapiens". There is no equivalent in the MAPP format, where the file name determines the organism.
  • Data-Source -> e.g. 'Kegg', 'GenMAPP' etc.
  • Version -> GenMAPP version; use for MAPPs exported from GenMAPP only.
  • Author -> Name of the author of this MAPP.
  • Maintainer -> Maintainer, if different from the author. The maintainer is "GenMAPP.org" for many GenMAPP pathways.
  • Email -> Email address of the maintainer.
  • Copyright -> Our policy is to use Creative Commons licensing.
  • Last-Modified -> Last time this pathway was updated.
  • BiopaxRef -> reference to a BioPAX element.
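To make the root-level attributes more concrete, here is a minimal sketch that reads a few of them with the DOM parser shipped with the JDK. The class name PathwayInfoSketch and the sample document are invented for this illustration, and the code is not how PathVisio itself loads pathways.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of PathVisio: reads attributes of the root Pathway element.
final class PathwayInfoSketch {

    // Parses a GPML document held in a string and returns its root element.
    static Element parseRoot(String gpml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(gpml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    // Returns the value of a root attribute, or an empty string when it is absent.
    static String rootAttribute(String gpml, String name) {
        return parseRoot(gpml).getAttribute(name);
    }

    public static void main(String[] args) {
        // Made-up sample document, only for demonstration.
        String gpml = "<Pathway xmlns=\"http://genmapp.org/GPML/2008a\""
                + " Name=\"Example\" Organism=\"Homo sapiens\" Data-Source=\"GenMAPP\">"
                + "<Graphics BoardWidth=\"100\" BoardHeight=\"80\"/></Pathway>";
        System.out.println(rootAttribute(gpml, "Name"));
        System.out.println(rootAttribute(gpml, "Organism"));
    }
}
```

Attributes that are missing come back as an empty string, so a caller that needs to distinguish "absent" from "empty" would have to check with hasAttribute first.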

First level: Pathway Elements

Below the Pathway element, the following elements appear (in order):

Comment, Graphics, DataNode, Line, Label, Link, Shape, Group, InfoBox, Legend, Biopax.

Note that the first two elements, Comment and Graphics, are not specific to the first level, so they are described in the section "Shared attributes and minor elements" further down.

DataNode

  • Comment
  • Graphics
    • CenterX, CenterY, Width, Height
    • Color
  • Xref
    • Database
    • ID
  • BiopaxRef
  • GraphId
  • GroupRef
  • ObjectType
  • TextLabel
  • BackpageHead
  • GenMAPP-Xref -> deprecated, there for backwards compatibility
  • Type -> one of Unknown, Gene, Protein, GeneProduct or Metabolite

Line

  • Comment subelement
  • Graphics subelement
    • Point subelement
      • x, y -> coordinates of the point
      • GraphRef
      • GraphId
      • ArrowHead -> determines the presence and type of arrowhead. Possible values include: Line, Arrow, Receptor, ReceptorRound, ReceptorSquare, LigandRound, LigandSquare, TBar. Other values are possible too, but they are not recognized by GenMAPP.
      • Head -> deprecated, use ArrowHead instead. Will be removed in the future.
    • Color -> either six hexadecimal digits specifying an "HTML color" (3 groups of 2 digits, representing the red, green and blue color levels as values from 00 to FF in hexadecimal) or one of several named colors, including the special color "Transparent".
  • Style -> The line style, one of "Solid" or "Broken"
  • GroupId
  • GraphId
  • BiopaxRef
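The Color rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name GpmlColorSketch is invented, and because this page does not list the named colors, the sketch recognizes only "Transparent" and rejects every other name.

```java
import java.util.Arrays;

// Hypothetical helper, not part of PathVisio: interprets a GPML Color value.
final class GpmlColorSketch {

    // Returns {red, green, blue} for a six-digit hexadecimal color,
    // or null for the special named color "Transparent".
    static int[] parse(String value) {
        if (value.equals("Transparent")) {
            return null;
        }
        if (!value.matches("[0-9A-Fa-f]{6}")) {
            // Assumption: other named colors exist but are not handled in this sketch.
            throw new IllegalArgumentException("unsupported color: " + value);
        }
        int rgb = Integer.parseInt(value, 16);
        return new int[] { (rgb >> 16) & 0xFF, (rgb >> 8) & 0xFF, rgb & 0xFF };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("FF8000")));
        System.out.println(parse("Transparent") == null);
    }
}
```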

Label

  • Comment
  • Graphics
    • CenterX, CenterY, Width, Height
    • Color
    • FontName
    • FontStyle
    • FontDecoration
    • FontStrikethru
    • FontWeight
    • FontSize

Link

Very similar to Label, with only one additional, optional attribute:

  • Href -> a URL pointing to another pathway. This will probably be done using pathway identifiers.

The idea of Link is that it will represent links between pathways, i.e. labels in blue, underlined, that you can click on to take you to another pathway. This feature is currently implemented in neither GenMAPP nor PathVisio.

This part of the spec may change in the future.

Shape

  • subelement Graphics
    • CenterX, CenterY, Width, Height
    • Color -> default is Black
    • Rotation
    • FillColor -> default is Transparent
  • subelement Comment
  • BiopaxRef
  • GraphId
  • GroupRef
  • ObjectType -> "Node", "Edge" or "Annotation". Defaults to Annotation. Currently unused.
  • Style -> Solid or Broken, like Line.Style

Group

Groups are used to group elements together, making it possible to select them as a unit. Grouped elements can share a functional relation, e.g. proteins that form a protein complex can be grouped. Because a group has a GroupId, it can itself be part of another group, so groups can be nested.

  • Comment
  • GroupId
  • GroupRef
  • Style
  • TextLabel
  • GraphId
  • BiopaxRef

InfoBox

InfoBox is used by GenMAPP and is currently ignored by PathVisio.

  • CenterX, CenterY

Legend

Legend is used by GenMAPP and is currently ignored by PathVisio.

  • CenterX, CenterY

Biopax

A pool of BioPAX objects that other elements can refer to. These objects have to be part of the BioPAX namespace: http://www.biopax.org/release/biopax-level2.owl. PathVisio itself does not test that the objects are valid BioPAX; as long as they are well-formed XML and in the right namespace, PathVisio will accept them.

Shared attributes and minor elements

Graphics

Many elements have a Graphics subelement. Each Graphics subelement has an implementation that is specific to the super element, i.e. the Graphics subelement of a DataNode is totally different from the Graphics subelement of a Label. The only shared feature is that the Graphics subelement groups purely graphical attributes together. Different Graphics subelements are described in the sections of the super elements.

Comment

All pathway elements and the Pathway element itself can have zero or more Comment subelements. They have one attribute:

  • Source -> value designating the source of the comment, i.e. the program or script that added the comment. For pathways converted from GenMAPP, the Source value is either "GenMAPP notes" when a comment came from the notes column or "GenMAPP remarks" when it came from the remarks column.

GroupIds

GPML allows nested groups of elements.
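As a rough illustration of nesting, the sketch below walks from an element's GroupRef up through the enclosing groups. It assumes that a GroupRef holds the GroupId of the directly enclosing group; the class name GroupNestingSketch and the map from a group's GroupId to its own GroupRef are invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PathVisio: resolves nested group membership.
final class GroupNestingSketch {

    // Follows GroupRef links upward and returns the enclosing GroupIds, innermost first.
    // parentOfGroup maps a group's GroupId to that group's own GroupRef (absent if top level).
    static List<String> enclosingGroups(String groupRef, Map<String, String> parentOfGroup) {
        List<String> chain = new ArrayList<>();
        String current = groupRef;
        while (current != null) {
            if (chain.contains(current)) {
                throw new IllegalStateException("cyclic group nesting at " + current);
            }
            chain.add(current);
            current = parentOfGroup.get(current);
        }
        return chain;
    }

    public static void main(String[] args) {
        Map<String, String> parentOfGroup = new HashMap<>();
        parentOfGroup.put("complex", "compartment"); // group "complex" sits inside "compartment"
        System.out.println(enclosingGroups("complex", parentOfGroup));
    }
}
```

The cycle check is a defensive choice: a hand-edited file could contain groups that refer to each other, and without the check the loop would never end.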

GraphRefs/GraphIds

GraphIds have the XML Schema ID type. They have to be identifiers consisting of a sequence of letters, digits and underscores, not starting with a digit. They are unique within the document.

In the current implementation of PathVisio, GraphIds are randomly generated hexadecimals in the ranges A0000 to FFFFF and a00 to FFF. This is only a quick & dirty approach to generating identifiers that comply with the XML Schema ID type. GraphIds do not have to be interpretable as a hexadecimal number and applications should not rely on this aspect.
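The two points above, the identifier rule and the random hexadecimal generation, can be sketched as follows. This is not PathVisio's actual code: the class name GraphIdSketch is invented, the validity check encodes only the rule as worded on this page, and the generator covers only the A0000 to FFFFF range.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical helper, not part of PathVisio: checks and generates GraphId values.
final class GraphIdSketch {

    // Letters, digits and underscores, not starting with a digit.
    static boolean isValid(String id) {
        return id.matches("[A-Za-z_][A-Za-z0-9_]*");
    }

    // Draws random hexadecimal values between A0000 and FFFFF until one is unused.
    // The chosen id is added to the 'used' set before it is returned.
    static String newId(Set<String> used, Random random) {
        String id;
        do {
            int value = 0xA0000 + random.nextInt(0xFFFFF - 0xA0000 + 1);
            id = Integer.toHexString(value);
        } while (!used.add(id));
        return id;
    }

    public static void main(String[] args) {
        Set<String> used = new HashSet<>();
        String id = newId(used, new Random());
        System.out.println(id + " valid: " + isValid(id));
    }
}
```

Every value in that range starts with a hexadecimal letter (a to f), which is why the result never begins with a digit.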

GraphIds are used to link elements together. Elements that have a GraphRef attribute use it to refer to the GraphId of another element.

GraphIds are only meaningful within a pathway: if an element is copied from one pathway to another, its GraphId may be changed. It is possible that two pathways have totally different elements with the same GraphId by coincidence; in fact, it would be impossible to prevent this from happening.
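Because a GraphRef is only useful when some element in the same document carries the matching GraphId, a simple consistency check is to look for references without a target. The following sketch does this with the JDK's DOM parser; the class name GraphRefSketch and the sample document are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of PathVisio: finds GraphRef values without a target.
final class GraphRefSketch {

    static List<String> danglingGraphRefs(String gpml) {
        Element root;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(gpml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        NodeList all = root.getElementsByTagName("*"); // every descendant element

        // First pass: collect every GraphId in the document.
        Set<String> graphIds = new HashSet<>();
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            if (element.hasAttribute("GraphId")) {
                graphIds.add(element.getAttribute("GraphId"));
            }
        }
        // Second pass: report each GraphRef that matches no GraphId.
        List<String> dangling = new ArrayList<>();
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            if (element.hasAttribute("GraphRef")
                    && !graphIds.contains(element.getAttribute("GraphRef"))) {
                dangling.add(element.getAttribute("GraphRef"));
            }
        }
        return dangling;
    }

    public static void main(String[] args) {
        // Made-up sample: one Point refers to an existing DataNode, one to nothing.
        String gpml = "<Pathway xmlns=\"http://genmapp.org/GPML/2008a\">"
                + "<DataNode GraphId=\"abc01\"/>"
                + "<Line><Graphics><Point GraphRef=\"abc01\"/><Point GraphRef=\"fff99\"/>"
                + "</Graphics></Line></Pathway>";
        System.out.println(danglingGraphRefs(gpml));
    }
}
```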

BiopaxRef

Reference to any BioPAX element stored in the Biopax object pool. This is the means to link GPML elements to BioPAX definitions.

Schema and Validation

The GPML format is defined by an XML Schema definition (XSD). "XML Schema" is a standard defined by the W3C, but it is not the only schema language (there are also DTD and Relax NG, for example). XML Schema is certainly not without problems, but so far it has always been possible to work around them. Having an XML Schema for validation makes it much easier to get information in and out of GPML format in a sensible way.

The latest version of GPML.xsd can always be found in our svn repository, directly accessible at  http://svn.bigcat.unimaas.nl/pathvisio/trunk/GPML.xsd

A simple tool for validation is xmllint, available on Windows and Unix. For Windows you can get binaries here: ftp://xmlsoft.org/libxml2/win32/libxml2-2.6.23.win32.zip. You may also need to install zlib1.dll and iconv.dll, which can easily be found with Google.

An example of xmllint usage: <code>xmllint --noout --schema GPML.xsd my-pathway.gpml</code>
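The same kind of check can be done from Java with the javax.xml.validation API that ships with the JDK. The sketch below is a generic illustration, not PathVisio's validation code; the class name SchemaValidationSketch is invented, and the tiny schema in main is a stand-in for the real GPML.xsd.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of PathVisio: validates a document against an XML Schema.
final class SchemaValidationSketch {

    // Returns true when 'xml' is valid according to the schema text in 'xsd'.
    static boolean isValid(String xsd, String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the schema could not be read or the document is invalid
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Toy schema for demonstration only; it is NOT the GPML schema.
        String xsd = "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
                + "<xs:element name=\"Pathway\"><xs:complexType>"
                + "<xs:attribute name=\"Name\" type=\"xs:string\" use=\"required\"/>"
                + "</xs:complexType></xs:element></xs:schema>";
        System.out.println(isValid(xsd, "<Pathway Name=\"Example\"/>"));
        System.out.println(isValid(xsd, "<Pathway/>"));
    }
}
```

To validate real files, the two StreamSource objects can be created from java.io.File instances pointing at GPML.xsd and the pathway file instead of from strings.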

One of the annoying features of XML Schema is that you have to define an order for element siblings. For example, in GPML, DataNode elements always have to come before Line elements (if any), which have to come before Label elements, etc. The complete order for all subelements of Pathway is: Comment, Graphics, DataNode, Line, Label, Link, Shape, Group, InfoBox, Legend, Biopax.
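A writer that emits GPML has to respect this sibling order, and a quick way to test a list of child element names against it is sketched here. The class name ElementOrderSketch is invented, and the choice to skip names that are not in the list is an assumption of this sketch, since this page does not say where other elements belong.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PathVisio: checks the sibling order under Pathway.
final class ElementOrderSketch {

    static final List<String> ORDER = Arrays.asList(
            "Comment", "Graphics", "DataNode", "Line", "Label", "Link",
            "Shape", "Group", "InfoBox", "Legend", "Biopax");

    // Returns true when the known names appear in schema order; repeats are allowed.
    static boolean inSchemaOrder(List<String> childNames) {
        int last = -1;
        for (String name : childNames) {
            int index = ORDER.indexOf(name);
            if (index < 0) {
                continue; // assumption: names outside the list are ignored
            }
            if (index < last) {
                return false;
            }
            last = index;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(inSchemaOrder(Arrays.asList("Comment", "Graphics", "DataNode", "Line")));
        System.out.println(inSchemaOrder(Arrays.asList("Line", "DataNode")));
    }
}
```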

Versioning

As more and more people are interested in WikiPathways and PathVisio, the risk grows that we break other people's code by changing something in GPML. Being clear about versioning will at least help. In the current implementation, the XML namespace is used for versioning, which is a common practice (see http://www.ibm.com/developerworks/xml/library/x-tipnamsp.html).

The namespace of the old schema was: http://genmapp.org/GPML/2007 . Until now we haven't been very strict about versioning, so if you look at older revisions you will find small differences between schemas with the same version. From now on every change, no matter how small, should be coupled to a change in the schema version.

The namespace of the new schema is:  http://genmapp.org/GPML/2008a . The difference with 2007 is the addition of the State element.

The idea is that from now on, the GPML schema version will be  http://genmapp.org/GPML/ + the year + a letter. So if we make another revision this year, it will be 2008b, the first one next year will be 2009a. From now on, there will always be a letter after the year, even for the first revision. Note that it's OK to use 2008x in 2009 too if there are no changes, i.e. we don't have to do a rev every year on January 1st.
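The naming rule can be written down as a small function. This is merely a sketch of the scheme as described here: the class name GpmlNamespaceSketch is invented, and the sketch does not handle running past the letter z within one year.

```java
// Hypothetical helper, not part of PathVisio: derives the next schema namespace.
final class GpmlNamespaceSketch {

    static final String BASE = "http://genmapp.org/GPML/";

    // Returns the namespace for the next revision made in the given year.
    static String next(String current, int year) {
        if (!current.startsWith(BASE)) {
            throw new IllegalArgumentException("not a GPML namespace: " + current);
        }
        String version = current.substring(BASE.length()); // e.g. "2008a" or "2007"
        int currentYear = Integer.parseInt(version.substring(0, 4));
        boolean hasLetter = version.length() > 4;
        if (currentYear == year && hasLetter) {
            // Another revision in the same year: move to the next letter.
            char nextLetter = (char) (version.charAt(4) + 1);
            return BASE + year + nextLetter;
        }
        // First revision of a new year (or first lettered revision): start at "a".
        return BASE + year + "a";
    }

    public static void main(String[] args) {
        System.out.println(next("http://genmapp.org/GPML/2008a", 2008));
        System.out.println(next("http://genmapp.org/GPML/2008a", 2009));
    }
}
```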

Each namespace is defined in a different XML Schema. The latest is always defined in GPML.xsd (to be found at http://svn.bigcat.unimaas.nl/pathvisio/trunk/GPML.xsd). To be able to read and validate older files, you still need the older schemas. The one of 2007 is still available as GPML2007.xsd. The idea is that when the next change is made, the current GPML.xsd will be copied to GPML2008a.xsd, and the changes are made in GPML.xsd.

How does this work in Java code? Upon reading the DOM tree, PathVisio checks the namespace of the root element. This is done in org.pathvisio.model.GpmlFormat. Based on this namespace, the proper validation and loading code is run. Because 2007 and 2008a are really very similar, both are handled by the same class: org.pathvisio.model.GpmlFormatImpl1. For bigger changes in the future it may make sense to implement an org.pathvisio.model.GpmlFormatImpl2.

Here is the corresponding code in GpmlFormat.java:

    Namespace ns = root.getNamespace();
    GpmlFormatImpl1[] formats = new GpmlFormatImpl1[] {
        GpmlFormatImpl1.Gpml2007, GpmlFormatImpl1.Gpml2008a
    };
    for (GpmlFormatImpl1 format : formats) {
        if (ns.equals(format.getGpmlNamespace())) {
            // ... read the pathway ...
        }
    }

To keep things simple, writing is always done to the 2008a namespace. This means that, for example, all pathways on WikiPathways will be automatically converted to the new format when somebody edits them. This also means that the change is backward compatible but not forward compatible: a pathway saved with the latest PathVisio can't be opened by somebody running an older version of PathVisio.
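For readers without the PathVisio sources at hand, the read-both, write-latest behaviour can be imitated with the JDK's DOM parser alone. This is a self-contained sketch, not the PathVisio implementation; the class name NamespaceDispatchSketch and its constants are invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of PathVisio: decides by namespace whether a file is readable.
final class NamespaceDispatchSketch {

    static final String GPML_2007 = "http://genmapp.org/GPML/2007";
    static final String GPML_2008A = "http://genmapp.org/GPML/2008a";
    // Writing always uses the newest namespace.
    static final String WRITE_NAMESPACE = GPML_2008A;

    // Returns the namespace URI of the root element, or null when it has none.
    static String rootNamespace(String gpml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(gpml)))
                    .getDocumentElement();
            return root.getNamespaceURI();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    // Both known namespaces are accepted for reading.
    static boolean isReadable(String gpml) {
        String ns = rootNamespace(gpml);
        return GPML_2007.equals(ns) || GPML_2008A.equals(ns);
    }

    public static void main(String[] args) {
        System.out.println(isReadable("<Pathway xmlns=\"http://genmapp.org/GPML/2007\"/>"));
        System.out.println(isReadable("<Pathway xmlns=\"http://example.org/other\"/>"));
    }
}
```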

The examples below show how the namespace appears in practice in GPML:

<?xml version="1.0" encoding="ISO-8859-1"?>
<Pathway xmlns="http://genmapp.org/GPML/2007" Name="Apoptosis Mechanisms" Data-Source="GenMAPP 2.0" Version="20041216" Author="Alexander C. Zambon and Beth Lawlor" Maintainer="Alexander C. Zambon" Email="azambon@…" Organism="Homo sapiens">
  <Comment Source="GenMAPP notes"></Comment>
  <Comment Source="GenMAPP remarks"></Comment>

<?xml version="1.0" encoding="ISO-8859-1"?>
<Pathway xmlns="http://genmapp.org/GPML/2008a" Name="Apoptosis Mechanisms" Data-Source="GenMAPP 2.0" Version="20041216" Author="Alexander C. Zambon and Beth Lawlor" Maintainer="Alexander C. Zambon" Email="azambon@…" Organism="Homo sapiens">
  <Comment Source="GenMAPP notes"></Comment>
  <Comment Source="GenMAPP remarks"></Comment>

History

GMML was first developed by Lynn Ferrante for the GenMAPP group as a thesis project. However, GPML has changed a lot since then.

Future

GPML is still evolving. Since the 1.0 release of PathVisio, we try hard to keep all changes backwards compatible, meaning that you'll always be able to load old pathways with the latest version of GPML.

The next version of GPML will be released around Feb / Mar 2010. Proposed changes are tracked here: GPMLChangeProposal