Downloading data and mmCIF files of interfaces or assemblies

The EPPIC web interface is powered by a REST API that is also publicly available for data retrieval. Endpoints exist for the different data objects that can be retrieved: interfaces, assemblies, residues, multiple sequence alignments, and mmCIF files per assembly or interface. Data can be obtained for either precomputed PDB ids or user jobs (use the long alphanumeric job id as the identifier). The data provided by the REST services is offered in JSON format.

The interface ids are those calculated by EPPIC, numbered from the largest interface (1) to the smallest (n). The assembly ids are sorted from lower to higher stoichiometries. Note that in the downloaded mmCIF files the B-factor column is replaced by the corresponding per-residue sequence entropy values. Chains that are transformed with a rotation operator (symmetry partners) are named <original_chain_id>_<operator_id>.
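As an illustration of the chain naming convention above, the following minimal Python sketch splits a chain name from a downloaded mmCIF file into its original chain id and operator id. The function name `split_chain_name` is hypothetical and not part of EPPIC. The sketch assumes that the operator id is whatever follows the last underscore and that untransformed chains carry no underscore suffix.

```python
from typing import Optional, Tuple


def split_chain_name(name: str) -> Tuple[str, Optional[str]]:
    """Split '<original_chain_id>_<operator_id>' into its two parts.

    Assumption: the operator id follows the LAST underscore. Names without
    an underscore are treated as untransformed chains (operator id None).
    """
    chain_id, sep, operator_id = name.rpartition("_")
    if not sep or not chain_id or not operator_id:
        # No usable underscore suffix: return the name unchanged.
        return name, None
    return chain_id, operator_id


print(split_chain_name("A_1"))  # ('A', '1')
print(split_chain_name("B"))    # ('B', None)
```

The operator id is kept as a string here because its exact format is not specified in this text.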

Software and source code

The EPPIC web server is a web GUI to the EPPIC command line program, which is written in Java. The latest version of it is available here. If you need to run it often or want to tweak the parameters, we recommend that you use the command line version. It has been tested on Linux only, but it should also work on macOS. Blast and Clustal Omega are required for it to work.

You will need Java 17 (or newer) to be able to run the command-line EPPIC program.

The source code is available under the GPL license. You can get it with the following Git command:

git clone https://github.com/eppic-team/eppic

EPPIC uses the open source BioJava library.

Please contact us if you have problems with it or if you want to send any kind of feedback.

Datasets

The datasets used for developing the EPPIC method (see the paper) can be downloaded as plain text files:

  • DCxtal set: a set of crystal contacts with large interface areas (>1000 Å²)
  • DCbio set: a set of biologically relevant interfaces with relatively small interface areas (<2000 Å²)

The area distributions of the DCxtal and DCbio interfaces, as seen in this plot, overlap substantially. This is a distinctive feature of the sets, since in general crystal interfaces tend to be small and biologically relevant ones tend to be large. Also note that all entries in the sets were selected for crystallographic quality by resolution and Rfree filtering.

The files contain lists of PDB codes with lists of interface identifiers as calculated by EPPIC, i.e. id 1 corresponds to the largest interface in the crystal, with increasing ids for smaller interfaces. If no interface id is given in a line, then interface 1 is implied. Lines starting with "#" are comments.
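The file format described above can be read with a few lines of Python. This is a minimal sketch, not an official parser. It assumes that each non-comment line starts with a PDB code followed by zero or more integer interface ids, separated by whitespace or commas. The exact delimiter is not stated here, so check it against the downloaded files. The function name `parse_dataset` and the PDB codes in the sample are hypothetical.

```python
import re
from typing import Dict, Iterable, List


def parse_dataset(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Map each PDB code to its list of EPPIC interface ids."""
    entries: Dict[str, List[int]] = {}
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comment lines starting with "#".
        if not line or line.startswith("#"):
            continue
        # Assumed delimiters: whitespace and/or commas.
        tokens = [t for t in re.split(r"[\s,]+", line) if t]
        pdb_code = tokens[0]
        # If no interface id is given, interface 1 is implied.
        interface_ids = [int(t) for t in tokens[1:]] or [1]
        entries.setdefault(pdb_code, []).extend(interface_ids)
    return entries


# Hypothetical sample content, for illustration only.
sample = [
    "# PDB code followed by interface ids",
    "1abc",
    "2xyz 1 3",
]
print(parse_dataset(sample))  # {'1abc': [1], '2xyz': [1, 3]}
```

To read a downloaded file, pass an open file handle in place of `sample`, since the function accepts any iterable of lines.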

We further compiled (see paper) a new dataset of experimentally validated transmembrane protein oligomeric structures. It can also be downloaded as a text file here:

  • TMPbio set: a set of biological interfaces spanning the transmembrane region, from both alpha and beta TMP subclasses

We next automatically obtained two large-scale datasets of crystal and biological contacts, called XtalMany and BioMany, respectively (Baskaran et al. 2014), which contain nearly 3000 entries each. XtalMany is based on the concept of operators leading to infinite assemblies. BioMany is mainly based on the concept of shared interfaces across crystal forms: it is a subset of ProtCID from the Dunbrack group, selected with very stringent parameters. In addition, it contains interfaces from dimeric structures that were solved both by crystallography and NMR: here the idea is that an NMR dimer validates the dimeric biounit of the corresponding crystal structure. XtalMany and BioMany can be downloaded as text files here:

In Bliven et al. 2018 we used a dataset of assemblies extracted from the bioassembly annotations in the PDB (only the first bioassembly, "PDB1", was used). Bioassemblies with good consensus within their 70% sequence clusters were taken; see the paper for full details.

Please note that the original publications also contain the datasets, including our full annotations. However, we cannot update those if we find any mistakes. The datasets linked here represent the most up-to-date and best validated sets. Please use these in preference to the ones available in the original publications.