Primary Sequences
The primary sequences of the polymeric molecules contained in an entry are presented primarily in the _entity_poly and entity_poly_seq categories of the PDBx/mmCIF file. These listings include the sequence of each chain of linear, covalently linked standard or modified amino acids or nucleotides. They may also include other residues that are linked to the standard backbone in the polymer.
As described in the “Beginner’s Guide to PDB Structures and the PDBx/mmCIF Format”, a tabular style is used when there are multiple values for each data item. Here, a loop_ token is followed by rows of data item names and then whitespace-delimited data values. Additional information (and correspondence with the legacy PDB format) can be found in the PDBx/mmCIF User Guide, and complete file format documentation is available.
The example below from entry 4HHB shows the one-letter code sequence given in the _entity_poly category. The residues of chains A and C (entity 1), and then of chains B and D (entity 2), are listed in sequential order in _entity_poly.pdbx_seq_one_letter_code. Modified residues are listed using their canonical parent residue in _entity_poly.pdbx_seq_one_letter_code_can.
loop_
_entity_poly.entity_id
_entity_poly.type
_entity_poly.nstd_linkage
_entity_poly.nstd_monomer
_entity_poly.pdbx_seq_one_letter_code
_entity_poly.pdbx_seq_one_letter_code_can
_entity_poly.pdbx_strand_id
_entity_poly.pdbx_target_identifier
1 'polypeptide(L)' no no
;VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNAL
SALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
;
;VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNAL
SALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
;
A,C ?
2 'polypeptide(L)' no no
;VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDN
LKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
;
;VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDN
LKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
;
B,D ?
#
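As a rough illustration of how such a loop_ block can be read, the sketch below tokenizes the entity 1 row shown above and pairs each value with its data item name. The function names (tokenize, parse_loop) are hypothetical, and the quoting rules are deliberately simplified; a production reader should rely on a dedicated mmCIF parser.

```python
import re

# Simplified token pattern: quoted strings or bare words. Real mmCIF quoting
# has extra rules (a quote only closes before whitespace) that are ignored here.
_TOKEN = re.compile(r"'([^']*)'|\"([^\"]*)\"|(\S+)")

def tokenize(text):
    """Split mmCIF text into tokens, keeping ;-delimited text fields whole."""
    tokens = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        i += 1
        if line.startswith(";"):
            # Multi-line text field: runs until the next line starting with ';'
            parts = [line[1:]]
            while i < len(lines) and not lines[i].startswith(";"):
                parts.append(lines[i])
                i += 1
            tokens.append("\n".join(parts))
            # Anything after the closing ';' is tokenized as ordinary values
            line = lines[i][1:] if i < len(lines) else ""
            i += 1
        if line.startswith("#"):
            continue  # comment / category separator line
        for m in _TOKEN.finditer(line):
            tokens.append(next(g for g in m.groups() if g is not None))
    return tokens

def parse_loop(text):
    """Return one dict per row of a single loop_ block (hypothetical helper)."""
    tokens = tokenize(text)
    if not tokens or tokens[0] != "loop_":
        raise ValueError("expected a loop_ block")
    idx = 1
    names = []
    while idx < len(tokens) and tokens[idx].startswith("_"):
        names.append(tokens[idx])
        idx += 1
    values = tokens[idx:]
    width = len(names)
    return [dict(zip(names, values[r:r + width]))
            for r in range(0, len(values), width)]

SAMPLE = """\
loop_
_entity_poly.entity_id
_entity_poly.type
_entity_poly.nstd_linkage
_entity_poly.nstd_monomer
_entity_poly.pdbx_seq_one_letter_code
_entity_poly.pdbx_seq_one_letter_code_can
_entity_poly.pdbx_strand_id
_entity_poly.pdbx_target_identifier
1 'polypeptide(L)' no no
;VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNAL
SALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
;
;VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNAL
SALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
;
A,C ?
#
"""

rows = parse_loop(SAMPLE)
# Line breaks inside the text field are layout only, so drop them
sequence = rows[0]["_entity_poly.pdbx_seq_one_letter_code"].replace("\n", "")
print(rows[0]["_entity_poly.pdbx_strand_id"], len(sequence))
```

The line breaks inside a semicolon-delimited sequence field are removed before use, since they only wrap the text and are not part of the sequence.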
The sequence in three-letter code format can be found in the entity_poly_seq category, shown here again for entry 4HHB.
loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
_entity_poly_seq.hetero
1 1 VAL n
1 2 LEU n
1 3 SER n
1 4 PRO n
1 5 ALA n
1 6 ASP n
1 7 LYS n
1 8 THR n
1 9 ASN n
1 10 VAL n
1 11 LYS n
1 12 ALA n
1 13 ALA n
1 14 TRP n
1 15 GLY n
1 16 LYS n
1 17 VAL n
1 18 GLY n
1 19 ALA n
<<truncated for brevity>>
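The correspondence between the three-letter and one-letter listings can be checked with a short sketch like the one below, which converts the 19 rows shown above into a one-letter string. The helper name one_letter_sequence is hypothetical; non-standard components are written in parentheses, and rows flagged with hetero "y" (microheterogeneity) are not handled.

```python
# One-letter codes for the 20 standard amino acids
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

# The entity_poly_seq rows shown above: entity_id, num, mon_id, hetero
ROWS_TEXT = """\
1 1 VAL n
1 2 LEU n
1 3 SER n
1 4 PRO n
1 5 ALA n
1 6 ASP n
1 7 LYS n
1 8 THR n
1 9 ASN n
1 10 VAL n
1 11 LYS n
1 12 ALA n
1 13 ALA n
1 14 TRP n
1 15 GLY n
1 16 LYS n
1 17 VAL n
1 18 GLY n
1 19 ALA n
"""

def one_letter_sequence(rows_text, entity_id):
    """Build a one-letter string for one entity (hypothetical helper)."""
    residues = []
    for line in rows_text.splitlines():
        ent, num, mon_id, _hetero = line.split()
        if ent == entity_id:
            residues.append((int(num), mon_id))
    residues.sort()  # order by _entity_poly_seq.num
    # Components without a one-letter code are written as "(XXX)"
    return "".join(THREE_TO_ONE.get(mon, "(" + mon + ")") for _, mon in residues)

fragment = one_letter_sequence(ROWS_TEXT, "1")
print(fragment)
```

The result matches the first 19 characters of the one-letter sequence listed for entity 1 in _entity_poly.pdbx_seq_one_letter_code.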
The _entity_poly and entity_poly_seq categories provide correspondence between the 1-letter and 3-letter formats for primary sequence and are equivalent to what is reported in the FASTA sequence and the “SEQRES” records found in the legacy PDB file format.
The three-letter residue code found in the _entity_poly_seq.mon_id item is a pointer to _chem_comp.id in the chem_comp category. This is analogous to _atom_site.label_comp_id in the atom_site category, which is also a pointer to _chem_comp.id in the chem_comp category.
Here is an example from entry 4HHB:
_chem_comp.id
_chem_comp.type
_chem_comp.mon_nstd_flag
_chem_comp.name
_chem_comp.pdbx_synonyms
_chem_comp.formula
_chem_comp.formula_weight
ALA 'L-peptide linking' y ALANINE ? 'C3 H7 N O2' 89.093
ARG 'L-peptide linking' y ARGININE ? 'C6 H15 N4 O2 1' 175.209
ASN 'L-peptide linking' y ASPARAGINE ? 'C4 H8 N2 O3' 132.118
ASP 'L-peptide linking' y 'ASPARTIC ACID' ? 'C4 H7 N O4' 133.103
CYS 'L-peptide linking' y CYSTEINE ? 'C3 H7 N O2 S' 121.158
GLN 'L-peptide linking' y GLUTAMINE ? 'C5 H10 N2 O3' 146.144
GLU 'L-peptide linking' y 'GLUTAMIC ACID' ? 'C5 H9 N O4' 147.129
GLY 'peptide linking' y GLYCINE ? 'C2 H5 N O2' 75.067
HEM non-polymer . 'PROTOPORPHYRIN IX CONTAINING FE' HEME 'C34 H32 Fe N4 O4' 616.487
HIS 'L-peptide linking' y HISTIDINE ? 'C6 H10 N3 O2 1' 156.162
HOH non-polymer . WATER ? 'H2 O' 18.015
LEU 'L-peptide linking' y LEUCINE ? 'C6 H13 N O2' 131.173
LYS 'L-peptide linking' y LYSINE ? 'C6 H15 N2 O2 1' 147.195
MET 'L-peptide linking' y METHIONINE ? 'C5 H11 N O2 S' 149.211
PHE 'L-peptide linking' y PHENYLALANINE ? 'C9 H11 N O2' 165.189
PO4 non-polymer . 'PHOSPHATE ION' ? 'O4 P -3' 94.971
PRO 'L-peptide linking' y PROLINE ? 'C5 H9 N O2' 115.130
SER 'L-peptide linking' y SERINE ? 'C3 H7 N O3' 105.093
THR 'L-peptide linking' y THREONINE ? 'C4 H9 N O3' 119.119
TRP 'L-peptide linking' y TRYPTOPHAN ? 'C11 H12 N2 O2' 204.225
TYR 'L-peptide linking' y TYROSINE ? 'C9 H11 N O3' 181.189
VAL 'L-peptide linking' y VALINE ? 'C5 H11 N O2' 117.146
<<truncated for brevity>>
In many cases, the coordinates presented in the atom_site records of the mmCIF format file do not exactly match the sequence in the _entity_poly and entity_poly_seq categories. The ends of chains and mobile loops are often not observed in PDB experimental structures, so no coordinates for them are included as atom_site records in the file. However, these amino acids will often be included in the sequence records, since that portion of the chain was present during the experiment. In these cases, information will be included in the pdbx_unobs_or_zero_occ_residues category to identify each missing residue. This category is analogous to the information presented in REMARK 465 of legacy PDB format files.
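The comparison between sequence records and coordinates can be sketched as follows: given the length of the sequence and the set of _atom_site.label_seq_id values that actually have coordinates, the gaps are the unobserved residues. The function name and the 12-residue example are made up for illustration and do not come from a real entry.

```python
def unobserved_ranges(seq_length, observed):
    """Return (first, last) ranges of sequence positions lacking coordinates.

    seq_length: number of residues in entity_poly_seq for the chain
    observed:   set of label_seq_id values present in atom_site
    """
    missing = [n for n in range(1, seq_length + 1) if n not in observed]
    ranges = []
    for n in missing:
        if ranges and n == ranges[-1][1] + 1:
            # Extend the current run of consecutive missing residues
            ranges[-1] = (ranges[-1][0], n)
        else:
            ranges.append((n, n))
    return ranges

# Hypothetical chain: 12 residues in the sequence, both tails and one loop missing
gaps = unobserved_ranges(12, {3, 4, 5, 6, 9, 10})
print(gaps)
```

In a real file the same residues would normally also be listed in the pdbx_unobs_or_zero_occ_residues category, so this kind of check is mainly useful as a consistency test.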
You may also notice some differences with sequences in other databases. For example, a researcher may change or mutate particular residues to see the effect this will have on the overall structure, or a particular portion of it. The _struct_ref_seq category (corresponding to the DBREF record in legacy PDB format files) provides cross-reference information between the sequence studied and a corresponding database sequence.
Here is an example from entry 4HHB:
_struct_ref_seq.align_id
_struct_ref_seq.ref_id
_struct_ref_seq.pdbx_PDB_id_code
_struct_ref_seq.pdbx_strand_id
_struct_ref_seq.seq_align_beg
_struct_ref_seq.pdbx_seq_align_beg_ins_code
_struct_ref_seq.seq_align_end
_struct_ref_seq.pdbx_seq_align_end_ins_code
_struct_ref_seq.pdbx_db_accession
_struct_ref_seq.db_align_beg
_struct_ref_seq.pdbx_db_align_beg_ins_code
_struct_ref_seq.db_align_end
_struct_ref_seq.pdbx_db_align_end_ins_code
_struct_ref_seq.pdbx_auth_seq_align_beg
_struct_ref_seq.pdbx_auth_seq_align_end
1 1 4HHB A 1 ? 141 ? P69905 2 ? 142 ? 1 141
2 2 4HHB B 1 ? 146 ? P68871 2 ? 147 ? 1 146
3 1 4HHB C 1 ? 141 ? P69905 2 ? 142 ? 1 141
4 2 4HHB D 1 ? 146 ? P68871 2 ? 147 ? 1 146
<<truncated for brevity>>
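The four rows above can be used to translate a residue position in the entry into the numbering of the referenced database sequence. The sketch below assumes each aligned segment is gapless, so a simple offset applies; the Alignment type and to_db_numbering function are hypothetical names, and only a subset of the columns is carried over.

```python
from typing import NamedTuple, Optional, Tuple

class Alignment(NamedTuple):
    strand_id: str        # _struct_ref_seq.pdbx_strand_id
    seq_align_beg: int    # _struct_ref_seq.seq_align_beg
    seq_align_end: int    # _struct_ref_seq.seq_align_end
    db_accession: str     # _struct_ref_seq.pdbx_db_accession
    db_align_beg: int     # _struct_ref_seq.db_align_beg
    db_align_end: int     # _struct_ref_seq.db_align_end

# Values copied from the 4HHB rows shown above
ALIGNMENTS = [
    Alignment("A", 1, 141, "P69905", 2, 142),
    Alignment("B", 1, 146, "P68871", 2, 147),
    Alignment("C", 1, 141, "P69905", 2, 142),
    Alignment("D", 1, 146, "P68871", 2, 147),
]

def to_db_numbering(alignments, strand_id, seq_num) -> Optional[Tuple[str, int]]:
    """Map an entry sequence position to (accession, database position).

    Assumes the aligned segment has no internal gaps.
    """
    for a in alignments:
        if a.strand_id == strand_id and a.seq_align_beg <= seq_num <= a.seq_align_end:
            return a.db_accession, a.db_align_beg + (seq_num - a.seq_align_beg)
    return None  # position lies outside every aligned segment

print(to_db_numbering(ALIGNMENTS, "B", 146))
```

For 4HHB the offset is one residue in every chain: position 1 in the entry aligns with position 2 of the database sequence.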
The _struct_ref_seq_dif category (corresponding to the SEQADV record in legacy PDB format files) identifies differences between sequence information in the sequence records (_entity_poly and entity_poly_seq categories) of the PDB entry and the sequence database entry given in _struct_ref_seq.
Here is an example from entry 8JK1:
_struct_ref_seq_dif.align_id
_struct_ref_seq_dif.pdbx_pdb_id_code
_struct_ref_seq_dif.mon_id
_struct_ref_seq_dif.pdbx_pdb_strand_id
_struct_ref_seq_dif.seq_num
_struct_ref_seq_dif.pdbx_pdb_ins_code
_struct_ref_seq_dif.pdbx_seq_db_name
_struct_ref_seq_dif.pdbx_seq_db_accession_code
_struct_ref_seq_dif.db_mon_id
_struct_ref_seq_dif.pdbx_seq_db_seq_num
_struct_ref_seq_dif.details
_struct_ref_seq_dif.pdbx_auth_seq_num
_struct_ref_seq_dif.pdbx_ordinal
1 8JK1 GLN A 1 ? UNP A0A024R9I8 ? ? 'expression tag' -1 1
1 8JK1 ALA A 2 ? UNP A0A024R9I8 ? ? 'expression tag' 0 2
1 8JK1 SER A 30 ? UNP A0A024R9I8 CYS 28 'engineered mutation' 28 3
1 8JK1 SER A 95 ? UNP A0A024R9I8 ASN 93 variant 93 4
1 8JK1 ILE A 118 ? UNP A0A024R9I8 PHE 116 variant 116 5
1 8JK1 CYS A 136 ? UNP A0A024R9I8 ARG 134 variant 134 6
2 8JK1 GLN B 1 ? UNP A0A024R9I8 ? ? 'expression tag' -1 7
2 8JK1 ALA B 2 ? UNP A0A024R9I8 ? ? 'expression tag' 0 8
2 8JK1 SER B 30 ? UNP A0A024R9I8 CYS 28 'engineered mutation' 28 9
2 8JK1 SER B 95 ? UNP A0A024R9I8 ASN 93 variant 93 10
2 8JK1 ILE B 118 ? UNP A0A024R9I8 PHE 116 variant 116 11
2 8JK1 CYS B 136 ? UNP A0A024R9I8 ARG 134 variant 134 12
<<truncated for brevity>>
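A listing like this is often easier to read once summarized. The sketch below tallies the chain A rows shown above by their details value and labels each substitution as database residue, database position, observed residue. That label format is an illustrative convention chosen here, not something defined by the PDBx/mmCIF dictionary, and the helper names are hypothetical.

```python
import shlex
from collections import Counter

# Column order of the _struct_ref_seq_dif loop shown above
FIELDS = [
    "align_id", "pdbx_pdb_id_code", "mon_id", "pdbx_pdb_strand_id", "seq_num",
    "pdbx_pdb_ins_code", "pdbx_seq_db_name", "pdbx_seq_db_accession_code",
    "db_mon_id", "pdbx_seq_db_seq_num", "details", "pdbx_auth_seq_num",
    "pdbx_ordinal",
]

# Chain A rows from entry 8JK1, copied from the listing
ROWS_TEXT = """\
1 8JK1 GLN A 1 ? UNP A0A024R9I8 ? ? 'expression tag' -1 1
1 8JK1 ALA A 2 ? UNP A0A024R9I8 ? ? 'expression tag' 0 2
1 8JK1 SER A 30 ? UNP A0A024R9I8 CYS 28 'engineered mutation' 28 3
1 8JK1 SER A 95 ? UNP A0A024R9I8 ASN 93 variant 93 4
1 8JK1 ILE A 118 ? UNP A0A024R9I8 PHE 116 variant 116 5
1 8JK1 CYS A 136 ? UNP A0A024R9I8 ARG 134 variant 134 6
"""

def parse_rows(text):
    """Turn each data row into a dict; shlex handles the quoted details field."""
    return [dict(zip(FIELDS, shlex.split(line)))
            for line in text.splitlines() if line.strip()]

def substitutions(records):
    """Label rows that have a database residue, e.g. CYS28SER."""
    return [r["db_mon_id"] + r["pdbx_seq_db_seq_num"] + r["mon_id"]
            for r in records if r["db_mon_id"] != "?"]

records = parse_rows(ROWS_TEXT)
counts = Counter(r["details"] for r in records)
print(dict(counts))
print(substitutions(records))
```

Rows whose db_mon_id is "?" (the expression tag residues) have no counterpart in the database sequence and are therefore left out of the substitution list.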
Structural biologists often work with fragments of macromolecules which are more amenable to study than the full macromolecule. Thus, the _entity_poly, entity_poly_seq, and atom_site records may include only a portion of the molecule, not the whole protein. The numbering of residues can add a further complication. In some cases, researchers number atom_site records based on the numbering of the whole protein, while in other cases, they number the chain based on the fragment. Any number (negative, 0, positive) can be used. The numbering in the _entity_poly_seq.num and _atom_site.label_seq_id data items is always sequential beginning with “1”, while the _atom_site.auth_seq_id data item provides the author’s residue numbering, with any insertion codes given in _atom_site.pdbx_PDB_ins_code.
The example below from entry 5JZY shows author residue numbering in the coordinates (_atom_site.auth_seq_id) starting at 0 and sequential numbering (_atom_site.label_seq_id) starting with 5. Residues 1-4 in sequential numbering (corresponding to residues -4, -3, -2, and -1 in author numbering) were not experimentally observed. An insertion (_atom_site.pdbx_PDB_ins_code) is shown beginning with sequential residue number 23 (corresponding to author numbering 14 with insertion code A).
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.pdbx_formal_charge
_atom_site.auth_seq_id
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_atom_id
_atom_site.pdbx_PDB_model_num
ATOM 1 N N . GLY A 1 5 ? 2.001 16.815 86.112 1.00 44.36 ? 0 GLY L N 1
ATOM 2 C CA . GLY A 1 5 ? 1.482 18.136 86.547 1.00 45.80 ? 0 GLY L CA 1
ATOM 3 C C . GLY A 1 5 ? 1.277 18.094 88.040 1.00 46.03 ? 0 GLY L C 1
ATOM 4 O O . GLY A 1 5 ? 0.165 17.865 88.516 1.00 47.41 ? 0 GLY L O 1
<<truncated for brevity>>
ATOM 317 N N . LYS A 1 23 A -18.855 6.491 92.634 1.00 13.38 ? 14 LYS L N 1
ATOM 318 C CA . LYS A 1 23 A -19.863 5.707 93.333 1.00 15.01 ? 14 LYS L CA 1
ATOM 319 C C . LYS A 1 23 A -19.372 5.069 94.627 1.00 14.76 ? 14 LYS L C 1
ATOM 320 O O . LYS A 1 23 A -20.210 4.671 95.446 1.00 16.30 ? 14 LYS L O 1
<<truncated for brevity>>
ATOM 330 N N . THR A 1 24 B -18.053 4.956 94.853 1.00 12.77 ? 14 THR L N 1
ATOM 331 C CA . THR A 1 24 B -17.592 4.286 96.064 1.00 13.26 ? 14 THR L CA 1
ATOM 332 C C . THR A 1 24 B -16.544 5.065 96.851 1.00 12.53 ? 14 THR L C 1
ATOM 333 O O . THR A 1 24 B -16.062 4.553 97.860 1.00 12.69 ? 14 THR L O 1
<<truncated for brevity>>
ATOM 344 N N . GLU A 1 25 C -16.216 6.305 96.471 1.00 11.57 ? 14 GLU L N 1
ATOM 345 C CA . GLU A 1 25 C -15.205 7.037 97.226 1.00 10.78 ? 14 GLU L CA 1
ATOM 346 C C . GLU A 1 25 C -15.638 7.307 98.654 1.00 10.66 ? 14 GLU L C 1
ATOM 347 O O . GLU A 1 25 C -14.787 7.406 99.548 1.00 11.35 ? 14 GLU L O 1
<<truncated for brevity>>
For more information, see the “Missing Loops and Tails” and “Fragments and Domains” sections (in this Guide) and the “Macromolecules” > “Sample Sequence” section of the PDBx/mmCIF User Guide.
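The relationship between the two numbering schemes in the 5JZY example can be tabulated with a short sketch. It reads the CA atom rows from the listing, looks up columns by the data item order given there, and maps each sequential number to the author number plus insertion code. The function name numbering_map is hypothetical.

```python
# Data item order of the atom_site listing shown above (prefix omitted)
COLUMNS = [
    "group_PDB", "id", "type_symbol", "label_atom_id", "label_alt_id",
    "label_comp_id", "label_asym_id", "label_entity_id", "label_seq_id",
    "pdbx_PDB_ins_code", "Cartn_x", "Cartn_y", "Cartn_z", "occupancy",
    "B_iso_or_equiv", "pdbx_formal_charge", "auth_seq_id", "auth_comp_id",
    "auth_asym_id", "auth_atom_id", "pdbx_PDB_model_num",
]

# CA atom rows copied from the 5JZY excerpt
CA_ROWS = [
    "ATOM 2 C CA . GLY A 1 5 ? 1.482 18.136 86.547 1.00 45.80 ? 0 GLY L CA 1",
    "ATOM 318 C CA . LYS A 1 23 A -19.863 5.707 93.333 1.00 15.01 ? 14 LYS L CA 1",
    "ATOM 331 C CA . THR A 1 24 B -17.592 4.286 96.064 1.00 13.26 ? 14 THR L CA 1",
    "ATOM 345 C CA . GLU A 1 25 C -15.205 7.037 97.226 1.00 10.78 ? 14 GLU L CA 1",
]

def numbering_map(rows):
    """Map label_seq_id -> author number with insertion code appended."""
    seq_i = COLUMNS.index("label_seq_id")
    ins_i = COLUMNS.index("pdbx_PDB_ins_code")
    auth_i = COLUMNS.index("auth_seq_id")
    mapping = {}
    for row in rows:
        fields = row.split()
        ins = fields[ins_i]
        # '?' and '.' mean that no insertion code is present
        suffix = "" if ins in ("?", ".") else ins
        mapping[int(fields[seq_i])] = fields[auth_i] + suffix
    return mapping

mapping = numbering_map(CA_ROWS)
print(mapping)
```

Sequential residues 23, 24, and 25 all share author number 14 and are distinguished only by their insertion codes, which is why the author number alone cannot serve as a unique key.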
Amino Acid and Nucleotide Nomenclature
In the SEQRES records, the standard 3-character code is used for standard amino acids, and standard nucleotides are specified by 1 or 2 characters:
Standard (L-) Amino Acids |
ALA, CYS, ASP, GLU, PHE, GLY, HIS, ILE, LYS, LEU, MET, ASN, PRO, GLN, ARG, SER, THR, VAL, TRP, TYR, PYL (pyrrolysine)*, SEC (selenocysteine)* |
D-Amino Acids (present in the PDB Archive) |
DAL ('ALA'), DSN ('SER'), DCY ('CYS'), DPR ('PRO'), DVA ('VAL'), DTH ('THR'), DLE ('LEU'), DIL ('ILE'), DSG ('ASN'), DAS ('ASP'), MED ('MET'), DGN ('GLN'), DGL ('GLU'), DLY ('LYS'), DHI ('HIS'), DPN ('PHE'), DAR ('ARG'), DTY ('TYR'), DTR ('TRP') |
Deoxyribonucleotides |
DA, DC, DG, DT, DI |
Ribonucleotides |
A, C, G, U, I |
* SEC and PYL are considered as standard amino acids as announced by the wwPDB.
Other codes are used for modified amino acids (such as MSE for selenomethionine) and for modified nucleotides (such as CBR for bromocytosine).
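The table above translates directly into lookup sets. The sketch below classifies a residue code into one of the listed groups and reports the L-parent for D-amino acids; anything not in the table falls into a catch-all "other" group, which is where modified residues such as MSE would land. The classify function and its group labels are hypothetical names chosen for this example.

```python
# Codes taken from the nomenclature table above
STANDARD_L = set(
    "ALA CYS ASP GLU PHE GLY HIS ILE LYS LEU MET ASN PRO GLN ARG "
    "SER THR VAL TRP TYR PYL SEC".split()
)
# D-amino acid code -> corresponding L-amino acid code
D_AMINO = {
    "DAL": "ALA", "DSN": "SER", "DCY": "CYS", "DPR": "PRO", "DVA": "VAL",
    "DTH": "THR", "DLE": "LEU", "DIL": "ILE", "DSG": "ASN", "DAS": "ASP",
    "MED": "MET", "DGN": "GLN", "DGL": "GLU", "DLY": "LYS", "DHI": "HIS",
    "DPN": "PHE", "DAR": "ARG", "DTY": "TYR", "DTR": "TRP",
}
DEOXYRIBONUCLEOTIDES = {"DA", "DC", "DG", "DT", "DI"}
RIBONUCLEOTIDES = {"A", "C", "G", "U", "I"}

def classify(code):
    """Return the table group for a residue code (labels are illustrative)."""
    if code in STANDARD_L:
        return "standard amino acid"
    if code in D_AMINO:
        return "D-amino acid (parent " + D_AMINO[code] + ")"
    if code in DEOXYRIBONUCLEOTIDES:
        return "deoxyribonucleotide"
    if code in RIBONUCLEOTIDES:
        return "ribonucleotide"
    return "other"  # modified residues, ligands, water, and so on

for code in ("ALA", "DAL", "DA", "A", "MSE"):
    print(code, "->", classify(code))
```

For components in the "other" group, the _chem_comp category of the entry remains the authoritative source for the component type and name.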
Several additional records are included in the PDB format to define modifications as they appear in the ATOM records.
As an example, here are the records that describe HYP (hydroxyproline, a modified version of PRO, or proline) in the ATOM records for collagen entry 1CAG:
_pdbx_struct_mod_residue.id
_pdbx_struct_mod_residue.label_asym_id
_pdbx_struct_mod_residue.label_comp_id
_pdbx_struct_mod_residue.label_seq_id
_pdbx_struct_mod_residue.auth_asym_id
_pdbx_struct_mod_residue.auth_comp_id
_pdbx_struct_mod_residue.auth_seq_id
_pdbx_struct_mod_residue.PDB_ins_code
_pdbx_struct_mod_residue.parent_comp_id
_pdbx_struct_mod_residue.details
1 A HYP 2 A HYP 2 ? PRO 4-HYDROXYPROLINE
2 A HYP 5 A HYP 5 ? PRO 4-HYDROXYPROLINE
3 A HYP 8 A HYP 8 ? PRO 4-HYDROXYPROLINE
4 A HYP 11 A HYP 11 ? PRO 4-HYDROXYPROLINE
5 A HYP 14 A HYP 14 ? PRO 4-HYDROXYPROLINE
6 A HYP 17 A HYP 17 ? PRO 4-HYDROXYPROLINE
7 A HYP 20 A HYP 20 ? PRO 4-HYDROXYPROLINE
8 A HYP 23 A HYP 23 ? PRO 4-HYDROXYPROLINE
9 A HYP 26 A HYP 26 ? PRO 4-HYDROXYPROLINE
10 A HYP 29 A HYP 29 ? PRO 4-HYDROXYPROLINE
11 B HYP 2 B HYP 32 ? PRO 4-HYDROXYPROLINE
12 B HYP 5 B HYP 35 ? PRO 4-HYDROXYPROLINE
<<truncated for brevity>>
Additional specifics about the nature of the modified residue can be found in the _chem_comp category:
_chem_comp.id
_chem_comp.type
_chem_comp.mon_nstd_flag
_chem_comp.name
_chem_comp.pdbx_synonyms
_chem_comp.formula
_chem_comp.formula_weight
ACY non-polymer . 'ACETIC ACID' ? 'C2 H4 O2' 60.052
ALA 'L-peptide linking' y ALANINE ? 'C3 H7 N O2' 89.093
GLY 'peptide linking' y GLYCINE ? 'C2 H5 N O2' 75.067
HOH non-polymer . WATER ? 'H2 O' 18.015
HYP 'L-peptide linking' n 4-HYDROXYPROLINE HYDROXYPROLINE 'C5 H9 N O3' 131.130
PRO 'L-peptide linking' y PROLINE ? 'C5 H9 N O2' 115.130
A modified residue is presented in _entity_poly.pdbx_seq_one_letter_code by its three-letter code in parentheses. Again from entry 1CAG:
_entity_poly.entity_id 1
_entity_poly.type 'polypeptide(L)'
_entity_poly.nstd_linkage no
_entity_poly.nstd_monomer yes
_entity_poly.pdbx_seq_one_letter_code 'P(HYP)GP(HYP)GP(HYP)GP(HYP)GP(HYP)AP(HYP)GP(HYP)GP(HYP)GP(HYP)GP(HYP)G'
_entity_poly.pdbx_seq_one_letter_code_can PPGPPGPPGPPGPPAPPGPPGPPGPPGPPG
_entity_poly.pdbx_strand_id A,B,C
_entity_poly.pdbx_target_identifier ?
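Because parenthesized codes span several characters, the length of pdbx_seq_one_letter_code is not the number of residues. The sketch below splits the 1CAG string into one token per residue and rebuilds the canonical sequence from a parent lookup. The function name split_residues is hypothetical, and the PARENTS table holds only the HYP to PRO relationship given in the records above.

```python
import re

def split_residues(one_letter_code):
    """Split a pdbx_seq_one_letter_code string into one token per residue."""
    text = "".join(one_letter_code.split())  # drop any line wrapping
    # A residue is either a parenthesized component code or a single letter
    return [m.group(1) or m.group(2)
            for m in re.finditer(r"\(([^)]+)\)|([A-Z])", text)]

CODE = "P(HYP)GP(HYP)GP(HYP)GP(HYP)GP(HYP)AP(HYP)GP(HYP)GP(HYP)GP(HYP)GP(HYP)G"
CANONICAL = "PPGPPGPPGPPGPPAPPGPPGPPGPPGPPG"

# Parent residue one-letter codes; only HYP -> P is taken from the 1CAG records
PARENTS = {"HYP": "P"}

residues = split_residues(CODE)
rebuilt = "".join(PARENTS.get(r, r) for r in residues)
print(len(residues), residues.count("HYP"), rebuilt == CANONICAL)
```

The rebuilt string agrees with _entity_poly.pdbx_seq_one_letter_code_can, which is the form to use when a plain one-letter sequence is needed for alignment or searching.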