Guide to Understanding PDB Data

Primary Sequences

The primary sequences of the polymeric molecules in an entry are presented in the _entity_poly and _entity_poly_seq categories of the PDBx/mmCIF file. These listings include the sequence of each chain of linear, covalently linked standard or modified amino acids or nucleotides. They may also include other residues that are linked to the standard backbone in the polymer.

As described in the “Beginner’s Guide to PDB Structures and the PDBx/mmCIF Format”, a tabular style is used as there are multiple values for each token. Here, a loop_ token is followed by rows of data item names and then white-space delimited data values. Additional information (and correspondence with legacy PDB format) can be found in the PDBx/mmCIF User Guide and complete file format documentation is available.
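
To make the loop_ layout concrete, here is a minimal Python sketch that reads a simple loop into a list of dictionaries. The function name parse_simple_loop is hypothetical, and the sketch assumes that every row fits on one line, so it does not handle multi-line values delimited by semicolons. It also uses shell-style quote handling, which only approximates the mmCIF quoting rules. The example data is a reduced, illustrative subset of the _chem_comp columns.

```python
import shlex

def parse_simple_loop(text):
    """Parse a loop_ block whose rows each fit on a single line.

    Assumption: no multi-line (semicolon-delimited) values are present.
    """
    names, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "loop_" or line.startswith("#"):
            continue
        if line.startswith("_") and not rows:
            # Header section: one data item name per line.
            names.append(line)
        else:
            # Data section: whitespace-delimited values, quotes respected.
            values = shlex.split(line)
            rows.append(dict(zip(names, values)))
    return rows

example = """\
loop_
_chem_comp.id
_chem_comp.type
_chem_comp.name
ALA  'L-peptide linking'  ALANINE
ASP  'L-peptide linking'  'ASPARTIC ACID'
HOH  non-polymer          WATER
"""

rows = parse_simple_loop(example)
print(rows[1]["_chem_comp.name"])  # ASPARTIC ACID
```

For real files, a dedicated mmCIF parsing library is a safer choice than this sketch.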

The example below from entry 4HHB shows the one-letter code sequence given in the _entity_poly category. The residues of chains A and C (entity 1), and then of chains B and D (entity 2), are listed in sequential order in _entity_poly.pdbx_seq_one_letter_code. Modified residues are listed using their canonical parent residue in _entity_poly.pdbx_seq_one_letter_code_can.

loop_
_entity_poly.entity_id
_entity_poly.type
_entity_poly.nstd_linkage
_entity_poly.nstd_monomer
_entity_poly.pdbx_seq_one_letter_code
_entity_poly.pdbx_seq_one_letter_code_can
_entity_poly.pdbx_strand_id
_entity_poly.pdbx_target_identifier
1  'polypeptide(L)'  no  no
;VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNAL
SALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
;
;VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNAL
SALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
;
A,C  ?
2  'polypeptide(L)'  no  no
;VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDN
LKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
;
;VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDN
LKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
;
B,D  ?
#
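
Because one entity can describe several chains, a per-chain view has to be derived from _entity_poly.pdbx_strand_id. The Python sketch below illustrates one way to do this with the 4HHB values shown above. The dictionary keys "strand_ids" and "seq" and the function name sequences_by_chain are hypothetical shorthand, not part of the file format.

```python
def sequences_by_chain(entities):
    """Map each chain ID to the sequence of the entity it belongs to."""
    chains = {}
    for entity in entities:
        # Long sequences are wrapped over several lines in the file;
        # remove the line breaks to recover one continuous string.
        sequence = "".join(entity["seq"].split())
        for chain_id in entity["strand_ids"].split(","):
            chains[chain_id.strip()] = sequence
    return chains

entities = [
    {"strand_ids": "A,C", "seq": """
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNAL
SALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
"""},
    {"strand_ids": "B,D", "seq": """
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDN
LKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
"""},
]

chains = sequences_by_chain(entities)
print(sorted(chains))                      # ['A', 'B', 'C', 'D']
print(len(chains["A"]), len(chains["B"]))  # 141 146
```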

The sequence in three-letter code format can be found in the _entity_poly_seq category. Here is the beginning of the listing, again from entry 4HHB:

loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
_entity_poly_seq.hetero
1  1    VAL  n
1  2    LEU  n
1  3    SER  n
1  4    PRO  n
1  5    ALA  n
1  6    ASP  n
1  7    LYS  n
1  8    THR  n
1  9    ASN  n
1  10   VAL  n
1  11   LYS  n
1  12   ALA  n
1  13   ALA  n
1  14   TRP  n
1  15   GLY  n
1  16   LYS  n
1  17   VAL  n
1  18   GLY  n
1  19   ALA  n
<<truncated for brevity>>

The _entity_poly and _entity_poly_seq categories provide correspondence between the 1-letter and 3-letter formats for primary sequence and are equivalent to what is reported in the FASTA sequence and in the “SEQRES” records of the legacy PDB file format.
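
The correspondence between the two formats can be sketched in a few lines of Python. The mapping below covers only the 20 common amino acids, and the function name one_letter_code is hypothetical. Any other monomer is written as its three-letter code in parentheses, which mirrors the convention used in _entity_poly.pdbx_seq_one_letter_code.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def one_letter_code(mon_ids):
    """Convert a list of _entity_poly_seq.mon_id values to one-letter code."""
    return "".join(THREE_TO_ONE.get(m, f"({m})") for m in mon_ids)

# First 19 residues of entity 1 in entry 4HHB, as listed above.
first_19 = ["VAL", "LEU", "SER", "PRO", "ALA", "ASP", "LYS", "THR", "ASN",
            "VAL", "LYS", "ALA", "ALA", "TRP", "GLY", "LYS", "VAL", "GLY",
            "ALA"]
print(one_letter_code(first_19))               # VLSPADKTNVKAAWGKVGA
print(one_letter_code(["PRO", "HYP", "GLY"]))  # P(HYP)G
```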

The three-letter residue code found in the _entity_poly_seq.mon_id item is a pointer to _chem_comp.id in the chem_comp category. This is analogous to _atom_site.label_comp_id in the atom_site category, which is also a pointer to _chem_comp.id in the chem_comp category.

Here is an example from entry 4HHB:

loop_
_chem_comp.id
_chem_comp.type
_chem_comp.mon_nstd_flag
_chem_comp.name
_chem_comp.pdbx_synonyms
_chem_comp.formula
_chem_comp.formula_weight
ALA  'L-peptide linking'  y  ALANINE                           ?    'C3 H7 N O2'          89.093
ARG  'L-peptide linking'  y  ARGININE                          ?    'C6 H15 N4 O2 1'      175.209
ASN  'L-peptide linking'  y  ASPARAGINE                        ?    'C4 H8 N2 O3'         132.118
ASP  'L-peptide linking'  y  'ASPARTIC ACID'                   ?    'C4 H7 N O4'          133.103
CYS  'L-peptide linking'  y  CYSTEINE                          ?    'C3 H7 N O2 S'        121.158
GLN  'L-peptide linking'  y  GLUTAMINE                         ?    'C5 H10 N2 O3'        146.144
GLU  'L-peptide linking'  y  'GLUTAMIC ACID'                   ?    'C5 H9 N O4'          147.129
GLY  'peptide linking'    y  GLYCINE                           ?    'C2 H5 N O2'          75.067
HEM  non-polymer          .  'PROTOPORPHYRIN IX CONTAINING FE' HEME 'C34 H32 Fe N4 O4'    616.487
HIS  'L-peptide linking'  y  HISTIDINE                         ?    'C6 H10 N3 O2 1'      156.162
HOH  non-polymer          .  WATER                             ?    'H2 O'                18.015
LEU  'L-peptide linking'  y  LEUCINE                           ?    'C6 H13 N O2'         131.173
LYS  'L-peptide linking'  y  LYSINE                            ?    'C6 H15 N2 O2 1'      147.195
MET  'L-peptide linking'  y  METHIONINE                        ?    'C5 H11 N O2 S'       149.211
PHE  'L-peptide linking'  y  PHENYLALANINE                     ?    'C9 H11 N O2'         165.189
PO4  non-polymer          .  'PHOSPHATE ION'                   ?    'O4 P -3'             94.971
PRO  'L-peptide linking'  y  PROLINE                           ?    'C5 H9 N O2'          115.130
SER  'L-peptide linking'  y  SERINE                            ?    'C3 H7 N O3'          105.093
THR  'L-peptide linking'  y  THREONINE                         ?    'C4 H9 N O3'          119.119
TRP  'L-peptide linking'  y  TRYPTOPHAN                        ?    'C11 H12 N2 O2'       204.225
TYR  'L-peptide linking'  y  TYROSINE                          ?    'C9 H11 N O3'         181.189
VAL  'L-peptide linking'  y  VALINE                            ?    'C5 H11 N O2'         117.146
<<truncated for brevity>>
    

In many cases, the coordinates in the atom_site records of an mmCIF file do not exactly match the sequence in the _entity_poly and _entity_poly_seq categories. The ends of chains and mobile loops are often not observed in PDB experimental structures, so no coordinates are included for them as atom_site records. However, these residues are generally still included in the sequence records, since that portion of the chain was present during the experiment. In these cases, the pdbx_unobs_or_zero_occ_residues category identifies each missing residue. This category is analogous to the information presented in REMARK 465 of legacy PDB format files.
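
The relationship between the sequence records and the coordinates can be checked with a simple set comparison. The Python sketch below uses a hypothetical 10-residue chain, not data from a real entry, and the function name unobserved_residues is an assumption. It only detects residues with no atom_site rows at all and does not look at occupancy values.

```python
def unobserved_residues(entity_poly_seq_nums, observed_label_seq_ids):
    """Return sequence positions that have no atom_site coordinates."""
    observed = set(observed_label_seq_ids)
    return [n for n in entity_poly_seq_nums if n not in observed]

# Hypothetical chain of 10 residues (_entity_poly_seq.num values 1 to 10).
seq_nums = range(1, 11)
# Hypothetical _atom_site.label_seq_id values, one per atom record.
observed = [5, 5, 5, 6, 6, 7, 9, 10]

print(unobserved_residues(seq_nums, observed))  # [1, 2, 3, 4, 8]
```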

You may also notice some differences with sequences in other databases. For example, a researcher may change or mutate particular residues to see the effect this will have on the overall structure, or a particular portion of it. The _struct_ref_seq category (corresponding to the DBREF record in legacy PDB format files) provides cross-reference information between the sequence studied and a corresponding database sequence.

Here is an example from entry 4HHB:

loop_
_struct_ref_seq.align_id
_struct_ref_seq.ref_id
_struct_ref_seq.pdbx_PDB_id_code
_struct_ref_seq.pdbx_strand_id
_struct_ref_seq.seq_align_beg
_struct_ref_seq.pdbx_seq_align_beg_ins_code
_struct_ref_seq.seq_align_end
_struct_ref_seq.pdbx_seq_align_end_ins_code
_struct_ref_seq.pdbx_db_accession
_struct_ref_seq.db_align_beg
_struct_ref_seq.pdbx_db_align_beg_ins_code
_struct_ref_seq.db_align_end
_struct_ref_seq.pdbx_db_align_end_ins_code
_struct_ref_seq.pdbx_auth_seq_align_beg
_struct_ref_seq.pdbx_auth_seq_align_end
1  1  4HHB  A  1  ?  141  ?  P69905  2  ?  142  ?  1  141
2  2  4HHB  B  1  ?  146  ?  P68871  2  ?  147  ?  1  146
3  1  4HHB  C  1  ?  141  ?  P69905  2  ?  142  ?  1  141
4  2  4HHB  D  1  ?  146  ?  P68871  2  ?  147  ?  1  146
<<truncated for brevity>>
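
As a rough illustration of how these alignment ranges can be used, the Python sketch below converts a sequential residue number in the PDB entry to the corresponding position in the database sequence. It assumes the aligned range contains no gaps, which the start and end values alone cannot guarantee. The function name db_position is hypothetical.

```python
def db_position(seq_num, seq_align_beg, seq_align_end, db_align_beg):
    """Map an entry sequence number to a database sequence position.

    Returns None when seq_num lies outside the aligned range.
    Assumption: the alignment is ungapped within the range.
    """
    if not seq_align_beg <= seq_num <= seq_align_end:
        return None
    return db_align_beg + (seq_num - seq_align_beg)

# Chain A of 4HHB: residues 1-141 align to positions 2-142 of P69905.
print(db_position(1, 1, 141, 2))    # 2
print(db_position(141, 1, 141, 2))  # 142
print(db_position(150, 1, 141, 2))  # None
```

In this example the offset of one is consistent with the first residue of the database sequence being absent from the entry.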
    

The _struct_ref_seq_dif category (corresponding to the SEQADV record in legacy PDB format files) identifies differences between the sequence records of the PDB entry (_entity_poly and _entity_poly_seq categories) and the sequence database entry given in _struct_ref_seq.

Here is an example from entry 8JK1:

loop_
_struct_ref_seq_dif.align_id
_struct_ref_seq_dif.pdbx_pdb_id_code
_struct_ref_seq_dif.mon_id
_struct_ref_seq_dif.pdbx_pdb_strand_id
_struct_ref_seq_dif.seq_num
_struct_ref_seq_dif.pdbx_pdb_ins_code
_struct_ref_seq_dif.pdbx_seq_db_name
_struct_ref_seq_dif.pdbx_seq_db_accession_code
_struct_ref_seq_dif.db_mon_id
_struct_ref_seq_dif.pdbx_seq_db_seq_num
_struct_ref_seq_dif.details
_struct_ref_seq_dif.pdbx_auth_seq_num
_struct_ref_seq_dif.pdbx_ordinal
1  8JK1  GLN  A  1   ?  UNP  A0A024R9I8  ?    ?   'expression tag'      -1   1
1  8JK1  ALA  A  2   ?  UNP  A0A024R9I8  ?    ?   'expression tag'      0    2
1  8JK1  SER  A  30  ?  UNP  A0A024R9I8  CYS  28  'engineered mutation' 28   3
1  8JK1  SER  A  95  ?  UNP  A0A024R9I8  ASN  93  variant               93   4
1  8JK1  ILE  A  118 ?  UNP  A0A024R9I8  PHE  116 variant               116  5
1  8JK1  CYS  A  136 ?  UNP  A0A024R9I8  ARG  134 variant               134  6
2  8JK1  GLN  B  1   ?  UNP  A0A024R9I8  ?    ?   'expression tag'      -1   7
2  8JK1  ALA  B  2   ?  UNP  A0A024R9I8  ?    ?   'expression tag'      0    8
2  8JK1  SER  B  30  ?  UNP  A0A024R9I8  CYS  28  'engineered mutation' 28   9
2  8JK1  SER  B  95  ?  UNP  A0A024R9I8  ASN  93  variant               93   10
2  8JK1  ILE  B  118 ?  UNP  A0A024R9I8  PHE  116 variant               116  11
2  8JK1  CYS  B  136 ?  UNP  A0A024R9I8  ARG  134 variant               134  12
<<truncated for brevity>>
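
One way to get an overview of such a table is to count the differences by their annotation and to print each substitution. The Python sketch below does this for the chain A rows shown above. The tuple layout (mon_id, db_mon_id, details, auth_seq_num) is a hypothetical simplification of the full set of columns.

```python
from collections import Counter

# (mon_id, db_mon_id, details, pdbx_auth_seq_num) for chain A of 8JK1.
differences = [
    ("GLN", "?",   "expression tag",      -1),
    ("ALA", "?",   "expression tag",      0),
    ("SER", "CYS", "engineered mutation", 28),
    ("SER", "ASN", "variant",             93),
    ("ILE", "PHE", "variant",             116),
    ("CYS", "ARG", "variant",             134),
]

# Count how many differences fall under each annotation.
by_type = Counter(details for _, _, details, _ in differences)
print(by_type["variant"], by_type["expression tag"])  # 3 2

# Describe substitutions; "?" means there is no database residue.
for mon_id, db_mon_id, details, auth_num in differences:
    if db_mon_id != "?":
        print(f"{db_mon_id} {auth_num} -> {mon_id} ({details})")
```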
    

Structural biologists often work with fragments of macromolecules, which are more amenable to study than the full macromolecule. Thus, the _entity_poly, _entity_poly_seq, and atom_site records may include only a portion of the molecule, not the whole protein. The numbering of residues can add a further complication. In some cases, researchers number atom_site records based on the numbering of the whole protein, while in other cases, they number the chain based on the fragment. Any number (negative, 0, positive) can be used. The numbering in the _entity_poly_seq.num and _atom_site.label_seq_id items is always sequential beginning with “1”, while the _atom_site.auth_seq_id item provides the author’s residue numbering, with any insertion codes given in _atom_site.pdbx_PDB_ins_code.

The example below from entry 5JZY shows author residue numbering (_atom_site.auth_seq_id) in the coordinates starting at 0, while the sequential numbering (_atom_site.label_seq_id) of the first observed residue is 5. Residues 1-4 in sequential numbering (corresponding to residues -4, -3, -2, and -1 in author numbering) were not experimentally observed. An insertion begins at sequential residue number 23 (author number 14, with insertion codes A, B, and C given in _atom_site.pdbx_PDB_ins_code).

loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.pdbx_formal_charge
_atom_site.auth_seq_id
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_atom_id
_atom_site.pdbx_PDB_model_num
ATOM   1    N   N    .   GLY   A   1   5   ?   2.001  16.815  86.112  1.00  44.36  ?   0   GLY   L   N    1 
ATOM   2    C   CA   .   GLY   A   1   5   ?   1.482  18.136  86.547  1.00  45.80  ?   0   GLY   L   CA   1 
ATOM   3    C   C    .   GLY   A   1   5   ?   1.277  18.094  88.040  1.00  46.03  ?   0   GLY   L   C    1 
ATOM   4    O   O    .   GLY   A   1   5   ?   0.165  17.865  88.516  1.00  47.41  ?   0   GLY   L   O    1 
<<truncated for brevity>>
ATOM   317  N   N    .   LYS   A   1   23  A  -18.855  6.491   92.634  1.00  13.38  ?   14  LYS   L   N    1 
ATOM   318  C   CA   .   LYS   A   1   23  A  -19.863  5.707   93.333  1.00  15.01  ?   14  LYS   L   CA   1 
ATOM   319  C   C    .   LYS   A   1   23  A  -19.372  5.069   94.627  1.00  14.76  ?   14  LYS   L   C    1 
ATOM   320  O   O    .   LYS   A   1   23  A  -20.210  4.671   95.446  1.00  16.30  ?   14  LYS   L   O    1 
<<truncated for brevity>>
ATOM   330  N   N    .   THR   A   1   24  B  -18.053  4.956   94.853  1.00  12.77  ?   14  THR   L   N    1 
ATOM   331  C   CA   .   THR   A   1   24  B  -17.592  4.286   96.064  1.00  13.26  ?   14  THR   L   CA   1 
ATOM   332  C   C    .   THR   A   1   24  B  -16.544  5.065   96.851  1.00  12.53  ?   14  THR   L   C    1 
ATOM   333  O   O    .   THR   A   1   24  B  -16.062  4.553   97.860  1.00  12.69  ?   14  THR   L   O    1 
<<truncated for brevity>>
ATOM   344  N   N    .   GLU   A   1   25  C  -16.216  6.305   96.471  1.00  11.57  ?   14  GLU   L   N    1 
ATOM   345  C   CA   .   GLU   A   1   25  C  -15.205  7.037   97.226  1.00  10.78  ?   14  GLU   L   CA   1 
ATOM   346  C   C    .   GLU   A   1   25  C  -15.638  7.307   98.654  1.00  10.66  ?   14  GLU   L   C    1 
ATOM   347  O   O    .   GLU   A   1   25  C  -14.787  7.406   99.548  1.00  11.35  ?   14  GLU   L   O    1 
<<truncated for brevity>>
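
The two numbering schemes in this example can be related with a small lookup table. The Python sketch below combines the author residue number and insertion code into a single label, treating "?" and "." as meaning that no insertion code is present. The function name author_label and the tuple layout are hypothetical, and the rows are the residues shown above.

```python
def author_label(auth_seq_id, ins_code):
    """Combine author residue number and insertion code, e.g. 14 + A -> '14A'."""
    suffix = "" if ins_code in ("?", ".") else ins_code
    return f"{auth_seq_id}{suffix}"

# (label_seq_id, pdbx_PDB_ins_code, auth_seq_id) for residues shown from 5JZY.
residues = [
    (5,  "?", 0),
    (23, "A", 14),
    (24, "B", 14),
    (25, "C", 14),
]

label_to_author = {
    label_seq_id: author_label(auth_seq_id, ins_code)
    for label_seq_id, ins_code, auth_seq_id in residues
}
print(label_to_author)  # {5: '0', 23: '14A', 24: '14B', 25: '14C'}
```

Keeping the sequential number as the dictionary key avoids ambiguity, since several residues can share one author number.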
    

For more information, see the “Missing Loops and Tails” and “Fragments and Domains” sections (in this Guide) and the “Macromolecules” >> “Sample Sequence” section of the PDBx/mmCIF User Guide.

Amino Acid and Nucleotide Nomenclature

In the SEQRES records, the standard 3-character code is used for standard amino acids, and standard nucleotides are specified by 1 or 2 characters:

Standard (L-) Amino Acids

ALA, CYS, ASP, GLU, PHE, GLY, HIS, ILE, LYS, LEU, MET, ASN, PRO, GLN, ARG, SER, THR, VAL, TRP, TYR, PYL (pyrrolysine)*, SEC (selenocysteine)*

D-Amino Acids (present in the PDB Archive)

DAL ('ALA'), DSN ('SER'), DCY ('CYS'), DPR ('PRO'), DVA ('VAL'), DTH ('THR'), DLE ('LEU'), DIL ('ILE'), DSG ('ASN'), DAS ('ASP'), MED ('MET'), DGN ('GLN'), DGL ('GLU'), DLY ('LYS'), DHI ('HIS'), DPN ('PHE'), DAR ('ARG'), DTY ('TYR'), DTR ('TRP')

Deoxyribonucleotides

DA, DC, DG, DT, DI

Ribonucleotides

A, C, G, U, I

* SEC and PYL are considered as standard amino acids as announced by the wwPDB.

Other codes are used for modified amino acids (such as MSE for selenomethionine) and for modified nucleotides (such as CBR for bromocytosine).
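
The code lists above translate directly into lookup tables. The Python sketch below classifies a residue code using only the codes listed in this section. The function name classify and the returned strings are hypothetical, and anything not listed here (such as MSE or CBR) is simply reported as "other".

```python
STANDARD_AA = {
    "ALA", "CYS", "ASP", "GLU", "PHE", "GLY", "HIS", "ILE", "LYS", "LEU",
    "MET", "ASN", "PRO", "GLN", "ARG", "SER", "THR", "VAL", "TRP", "TYR",
    "PYL", "SEC",
}
# D-amino acid code -> parent L-amino acid code.
D_AMINO_ACIDS = {
    "DAL": "ALA", "DSN": "SER", "DCY": "CYS", "DPR": "PRO", "DVA": "VAL",
    "DTH": "THR", "DLE": "LEU", "DIL": "ILE", "DSG": "ASN", "DAS": "ASP",
    "MED": "MET", "DGN": "GLN", "DGL": "GLU", "DLY": "LYS", "DHI": "HIS",
    "DPN": "PHE", "DAR": "ARG", "DTY": "TYR", "DTR": "TRP",
}
DEOXYRIBONUCLEOTIDES = {"DA", "DC", "DG", "DT", "DI"}
RIBONUCLEOTIDES = {"A", "C", "G", "U", "I"}

def classify(code):
    """Classify a residue code using the lists given in this section."""
    if code in STANDARD_AA:
        return "standard amino acid"
    if code in D_AMINO_ACIDS:
        return f"D-amino acid (parent {D_AMINO_ACIDS[code]})"
    if code in DEOXYRIBONUCLEOTIDES:
        return "deoxyribonucleotide"
    if code in RIBONUCLEOTIDES:
        return "ribonucleotide"
    return "other"

for code in ["SEC", "DAL", "DA", "U", "MSE"]:
    print(code, "->", classify(code))
```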

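Several additional records define modifications as they appear in the ATOM records. In PDBx/mmCIF files these are given in the pdbx_struct_mod_residue category (corresponding to the MODRES record in legacy PDB format files).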

As an example, here are the records that describe HYP (hydroxyproline, a modified version of PRO, or proline) in the ATOM records for collagen entry 1CAG:

loop_
_pdbx_struct_mod_residue.id
_pdbx_struct_mod_residue.label_asym_id
_pdbx_struct_mod_residue.label_comp_id
_pdbx_struct_mod_residue.label_seq_id
_pdbx_struct_mod_residue.auth_asym_id
_pdbx_struct_mod_residue.auth_comp_id
_pdbx_struct_mod_residue.auth_seq_id
_pdbx_struct_mod_residue.PDB_ins_code
_pdbx_struct_mod_residue.parent_comp_id
_pdbx_struct_mod_residue.details
1   A  HYP  2   A  HYP  2   ?   PRO   4-HYDROXYPROLINE
2   A  HYP  5   A  HYP  5   ?   PRO   4-HYDROXYPROLINE
3   A  HYP  8   A  HYP  8   ?   PRO   4-HYDROXYPROLINE
4   A  HYP  11  A  HYP  11  ?   PRO   4-HYDROXYPROLINE
5   A  HYP  14  A  HYP  14  ?   PRO   4-HYDROXYPROLINE
6   A  HYP  17  A  HYP  17  ?   PRO   4-HYDROXYPROLINE
7   A  HYP  20  A  HYP  20  ?   PRO   4-HYDROXYPROLINE
8   A  HYP  23  A  HYP  23  ?   PRO   4-HYDROXYPROLINE
9   A  HYP  26  A  HYP  26  ?   PRO   4-HYDROXYPROLINE
10  A  HYP  29  A  HYP  29  ?   PRO   4-HYDROXYPROLINE
11  B  HYP  2   B  HYP  32  ?   PRO   4-HYDROXYPROLINE
12  B  HYP  5   B  HYP  35  ?   PRO   4-HYDROXYPROLINE
<<truncated for brevity>>
  

Additional specifics about the nature of the modified residue can be found in the _chem_comp category:

loop_
_chem_comp.id
_chem_comp.type
_chem_comp.mon_nstd_flag
_chem_comp.name
_chem_comp.pdbx_synonyms
_chem_comp.formula
_chem_comp.formula_weight
ACY  non-polymer         .  'ACETIC ACID'    ?              'C2 H4 O2'   60.052  
ALA  'L-peptide linking' y  ALANINE          ?              'C3 H7 N O2' 89.093  
GLY  'peptide linking'   y  GLYCINE          ?              'C2 H5 N O2' 75.067  
HOH  non-polymer         .  WATER            ?              'H2 O'       18.015  
HYP  'L-peptide linking' n  4-HYDROXYPROLINE HYDROXYPROLINE 'C5 H9 N O3' 131.130
PRO  'L-peptide linking' y  PROLINE          ?              'C5 H9 N O2' 115.130
    

A modified residue is presented in _entity_poly.pdbx_seq_one_letter_code as its three-letter code in parentheses. Again from entry 1CAG:

_entity_poly.entity_id                      1
_entity_poly.type                           'polypeptide(L)'
_entity_poly.nstd_linkage                   no
_entity_poly.nstd_monomer                   yes
_entity_poly.pdbx_seq_one_letter_code       'P(HYP)GP(HYP)GP(HYP)GP(HYP)GP(HYP)AP(HYP)GP(HYP)GP(HYP)GP(HYP)GP(HYP)G'
_entity_poly.pdbx_seq_one_letter_code_can   PPGPPGPPGPPGPPAPPGPPGPPGPPGPPG
_entity_poly.pdbx_strand_id                 A,B,C
_entity_poly.pdbx_target_identifier         ?
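
Because modified residues take more than one character in this notation, the string cannot be read one character per residue. The Python sketch below splits it into one token per residue with a regular expression. The function name tokenize is hypothetical, and the pattern assumes that parentheses are used only to enclose residue codes.

```python
import re

# A parenthesized code such as (HYP), or a single letter.
TOKEN = re.compile(r"\(([^)]+)\)|([A-Za-z])")

def tokenize(one_letter_code):
    """Split pdbx_seq_one_letter_code into one token per residue."""
    code = "".join(one_letter_code.split())  # drop line breaks and spaces
    return [m.group(1) or m.group(2) for m in TOKEN.finditer(code)]

# Sequence of entity 1 in entry 1CAG, as shown above.
code_1cag = ("P(HYP)GP(HYP)GP(HYP)GP(HYP)GP(HYP)AP(HYP)GP(HYP)GP(HYP)"
             "GP(HYP)GP(HYP)G")

tokens = tokenize(code_1cag)
print(tokens[:6])           # ['P', 'HYP', 'G', 'P', 'HYP', 'G']
print(len(tokens))          # 30
print(tokens.count("HYP"))  # 10
```

The token count matches the length of the canonical sequence in _entity_poly.pdbx_seq_one_letter_code_can, where each HYP is replaced by its parent residue P.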