Help Summary
Pfam 35.0 (Nov 2021, 19632 families)
Proteins are generally comprised of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and a hidden Markov model (HMM).
Each Pfam family, often referred to as a Pfam-A entry, consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.
Pfam entries are classified in one of six ways:
- Family: A collection of related protein regions
- Domain: A structural unit
- Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
- Motif: A short unit found outside globular domains
- Coiled-Coil: Regions that predominantly contain coiled-coil motifs, i.e. regions that typically contain alpha-helices coiled together in bundles of 2-7
- Disordered: Regions that are conserved, yet are either shown or predicted to contain biased sequence composition and/or are intrinsically disordered (non-globular)
Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.
Pfam Read The Docs
This Pfam help documentation is also available at readthedocs, where it can be searched, printed or downloaded for offline reading.
Pfam Changes
This section details the changes that we plan to make or have made to Pfam. This includes changes to the flatfiles, MySQL database and the public website.
Latest changes to Pfam data
Changes between Pfam 31 and 32
Release 32.0 contains a total of 17929 families, with 1229 new families and 12 families killed since the last release. 74.5% of all proteins in Pfamseq contain a match to at least one Pfam domain. 50.1% of all residues in the sequence database fall within Pfam domains.
Latest changes to website
Release 4.0 (29th April 2014)
This change coincided with the move from Sanger to EBI. There were no major changes to the website, though the underlying search system was entirely re-written to use the EBI external services web framework.
- Remove references to mirrors: Pfam is now served from pfam.xfam.org and the mirror sites have been shut down.
- Documentation changes: we're gradually working through the documentation on our help pages, changing links and re-writing text to reflect our new home.
Site organisation

The family page is the major page for accessing information contained within Pfam as it describes the Pfam family entries. Most referring sites link to this page. Alternatively, users can navigate to family pages by entering the Pfam identifier or accession number, either via the home page, the "Jump-to" boxes or the keyword search box, or by clicking on a domain name or graphic from anywhere on the website. As with all Pfam pages, there is the context-sensitive icon bar in the top right hand corner that provides a quick overview about the contents of the tabs. The tabs on the family page cover the following topics: functional annotation; domain organisation or architectures; alignments; HMM logo; trees; curation and models; species distribution; interactions; and structures.

Using the "Jump to" search
Many pages in the site include a small search box, entitled "Jump to...". The "Jump to..." box allows you to go immediately to the page for any entry in the Pfam site, including Pfam families, clans and UniProt sequence entries.
The "Jump to..." search understands accessions and IDs for most types of entry. For example, you can enter either a Pfam family accession, e.g. PF02171, or, if you find it easier to remember, a family ID, such as piwi. Note that the search is case insensitive.
Because some identifiers can be ambiguous, the "Jump to..." search may need to test several types of identifier to find the entry that you're looking for. For example, Pfam A family IDs (e.g. Kazal_1) and Pfam clan IDs (e.g. Kazal) aren't easily distinguished, so if you enter kazal, the search will first look for a family called kazal and, if it doesn't find one, will then look for a clan called kazal. If all of the guesses fail, you'll see an error message saying "Entry not found".
The order in which the search tries the various types of ID and accession is given below:
- Pfam A accession, e.g. PF02171
- Pfam A identifier, e.g. piwi
- UniProt sequence ID, e.g. CANX_CHICK
- NCBI "GI" number, e.g. 113594566
- NCBI secondary accession, e.g. BAF18440.1
- Pfam clan accession, e.g. CL0005
- metaseq ID, e.g. JCVI_ORF_1096665732460
- metaseq accession, e.g. JCVI_PEP_1096665732461
- Pfam clan accession, e.g. CL0005
- Pfam clan ID, e.g. Kazal
- PDB entry, e.g. 2abl
- Proteome species name, e.g. Homo sapiens
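The lookup order above amounts to a first-match-wins chain. The following is a minimal Python sketch of that idea, not the website's actual code: the `LOOKUPS` tables, their contents and the `jump_to` function are hypothetical, and only three of the entry types are shown.

```python
from typing import Tuple

# Hypothetical lookup tables, listed in the order in which they are tried.
# Keys are stored in lower case because the search is case insensitive.
LOOKUPS = [
    ("Pfam A accession", {"pf02171": "family page for PF02171"}),
    ("Pfam A identifier", {"piwi": "family page for PF02171",
                           "kazal_1": "family page for Kazal_1"}),
    ("Pfam clan ID", {"kazal": "clan page for Kazal"}),
]


def jump_to(query: str) -> Tuple[str, str]:
    """Return (entry type, page) from the first table that knows the query."""
    key = query.strip().lower()
    for entry_type, table in LOOKUPS:
        if key in table:
            return entry_type, table[key]
    # Every guess failed, mirroring the "Entry not found" message.
    raise LookupError("Entry not found")
```

With these made-up tables, `jump_to("kazal")` finds no family called kazal and so falls through to the clan table, as described above.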

Keyword search
Every page in the Pfam site includes a search box in the page header. You can use this to find Pfam-A families which match a particular keyword. The search includes several different areas of the Pfam database:
- text fields in Pfam entries, e.g. family descriptions
- UniProt sequence entry description and species fields
- HEADER and TITLE fields from PDB entries
- Gene Ontology IDs and terms
- InterPro entry abstracts
Each Pfam-A entry is listed only once in the results table, although it might have been found in more than one area of the database.
Searching a protein sequence against Pfam
Searching a protein sequence against the Pfam library of HMMs will enable you to find out the domain architecture of the protein. If your protein is present in the version of UniProt, NCBI Genpept or the metagenomic sequence set that we used to make the current release of Pfam, we have already calculated its domain architecture. You can access this by entering the sequence accession or ID in the 'view a sequence' box on the Pfam homepage.
If your sequence is not in the Pfam database, you could perform a single-sequence or a batch search by clicking on the 'Search' link at the top of the Pfam page.
Single protein search
If your protein is not recognised by Pfam, you will need to paste the protein sequence into the search page. We will search your sequence against our HMMs and instantly display the matches for you.
Batch search
If you have a large number of sequences to search (up to several thousand), you can use our batch upload facility. This allows you to upload a file of your sequences in FASTA format, and we will run them against our HMMs and email the results back to you, usually within 48 hours. We request that you put a maximum of 5000 sequences in each file.
Local protein searches
If you have a very large number of protein searches to perform, or you do not wish to post your sequence across the web, it may be more convenient to run the Pfam searches locally using the 'pfam_scan.pl' script. To do this you will need the HMMER3 software, the Pfam HMM libraries and a couple of additional data files from the Pfam website. You will also need to download a few modules from CPAN, most notably Moose.
Full details on how to get 'pfam_scan.pl' up and running can be found on our FTP site.
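For orientation only, the sketch below builds a command line that calls HMMER3's `hmmscan` directly against the Pfam HMM library. This is an assumption about how one might run a raw search, not a replacement for 'pfam_scan.pl', which applies additional Pfam-specific post-processing. The file names are placeholders, and the library is assumed to have been indexed with `hmmpress` beforehand.

```python
import shlex
from typing import List


def hmmscan_command(hmm_library: str, fasta_file: str, table_out: str) -> List[str]:
    """Build an hmmscan command that applies each family's gathering threshold."""
    return [
        "hmmscan",
        "--cut_ga",                # use the GA cutoffs stored in the HMM library
        "--domtblout", table_out,  # write a per-domain results table
        hmm_library,               # e.g. Pfam-A.hmm, indexed with hmmpress
        fasta_file,
    ]


if __name__ == "__main__":
    # Placeholder file names; pass the list to subprocess.run(..., check=True)
    # to execute it on a machine where HMMER3 is installed.
    command = hmmscan_command("Pfam-A.hmm", "my_proteins.fasta", "matches.domtblout")
    print(shlex.join(command))
```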
Proteome analysis
Pfam pre-calculates the domain compositions and architectures for all the proteomes present in our snapshot of UniProt proteomes. To see the list of proteomes, click on the 'browse' link at the top of the Pfam website, and click on a letter of the alphabet in the 'proteomes' section. By clicking on a particular organism, you will be able to view the proteome page for that organism. From here you can view the domain organisation and the domain composition for that proteome.
The taxonomy query allows quick identification of families/domains which are present in one species but are absent from another. It can also be used to find families/domains that are unique to a particular species (note this can be very slow).
Finding proteins with a specific set of domain combinations ('architectures')
Pfam allows you to retrieve all of the proteins with a particular domain combination (e.g. proteins containing both a CBS domain and an IMPDH domain) using the domain query tool. For a more detailed study of domain architectures you can use PfamAlyzer. PfamAlyzer allows you to find proteins which contain a specific combination of domains and to specify particular species and the evolutionary distances allowed between domains.
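Conceptually, a domain query is a subset test on each protein's set of domains. The sketch below illustrates this on made-up data; the `ARCHITECTURES` table and the `proteins_with_domains` function are hypothetical and are not how the domain query tool or PfamAlyzer is implemented.

```python
from typing import Dict, Iterable, List, Set

# Made-up example data: protein accession -> Pfam family IDs found on it.
ARCHITECTURES: Dict[str, Set[str]] = {
    "protein_1": {"IMPDH", "CBS"},
    "protein_2": {"CBS"},
    "protein_3": {"IMPDH", "CBS", "Kazal_1"},
}


def proteins_with_domains(architectures: Dict[str, Set[str]],
                          required: Iterable[str]) -> List[str]:
    """Return the proteins that carry every one of the required domains."""
    wanted = set(required)
    return sorted(acc for acc, domains in architectures.items()
                  if wanted <= domains)
```

Here `proteins_with_domains(ARCHITECTURES, ["CBS", "IMPDH"])` returns the two proteins that contain both domains and skips the one that has only CBS.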
Wikipedia annotation
The Pfam consortium is now coordinating the annotation of Pfam families via Wikipedia. On the summary tab of some family pages, you'll find the text from a Wikipedia article that we feel provides a good description of the Pfam family. If a family has a Wikipedia article assigned to it, we now show the text of that article on the summary tab, in preference to the traditional Pfam annotation text.
If a family does not yet have a Wikipedia article assigned to it, there are several ways for you to help us add one. You can find much more information about the process in the Pfam and Wikipedia tab.
Pfam Quick tour
This quick tour provides a brief introduction to the Pfam website and database.
Creating Families
Creating families provides a tutorial on how to create a Pfam entry.
Repeats in Pfam
Repeats describes how repeats are represented in Pfam.
Contents:
- What is Pfam?
- What is on a Pfam-A family page?
- What is a clan?
- What criteria do you use for putting families into clans?
- What happened to the Pfam_ls and Pfam_fs files?
- I was wondering if it is possible to build Wise2 with HMMER3 support?
- How can I search Pfam locally?
- Why doesn't Pfam include my sequence?
- Why is there apparent redundancy of UniProt IDs in the full-length FASTA sequence file?
- How many accurate alignments do you have?
- How can I submit a new domain?
- Can I search my protein against Pfam?
- Why do I get slightly different results when I search my sequence against Pfam versus when I look up a sequence on the Pfam website?
- What is the difference between the '-' and '.' characters in your full alignments?
- What do the SS lines in the alignment mean?
- You don't have domain YYYY in Pfam!
- Are there other databases which do this?
- So which database is better?
What is Pfam?
Pfam is a collection of multiple sequence alignments and profile hidden Markov models (HMMs). Each Pfam HMM represents a protein family or domain. By searching a protein sequence against the Pfam library of HMMs, you can determine which domains it carries, i.e. its domain architecture. Pfam can also be used to analyse proteomes and to answer questions about more complex domain architectures.
For each Pfam accession we have a family page, which can be accessed in several ways: from the 'View a Pfam Family' search box on the HOME page, by clicking on any graphical image of a domain, by searching for a particular family using the 'Keyword Search' box on the top right hand corner of most website pages, or by pasting the family identifier or accession into the 'JUMP TO' box that is present on most pages in the site.
What is on a Pfam-A family page?
From the family page you can view the Pfam annotation for a family. We also provide access to many other sources of information, including annotation from the InterPro database, where available, cross-links to other databases and other tools for protein analysis. Since release 25.0 we have also started displaying relevant articles from Wikipedia where available.
Via the tabs on the left-hand side of the page, you can view:
- the domain architectures in which this family is found
- the alignments for the family in various formats, including alignments of matches to the NCBI and metagenomic sets, as well as in 'heat-map' format. All alignments can be downloaded
- the phylogenetic and species distribution trees, either as a traditional, interactive tree or as a "sunburst" plot
- the HMM logo
- the structural information for each family where available
What is a clan?
Some of the Pfam families are grouped into clans. Pfam defines a clan as a collection of families that have arisen from a single evolutionary origin. Evidence of their evolutionary relationship can be in the form of similarity in tertiary structures, or, when structures are not available, from common sequence motifs. The seed alignments for all families within a clan are aligned and the resulting alignment (called the clan alignment) can be accessed from a link on the clan page. Each clan page includes a clan alignment, a description of the clan and database links, where appropriate. The clan pages can be accessed by following a link from the family page, or alternatively they can be accessed by clicking on 'clans' under the 'browse' by menu on the top of any Pfam page.
What criteria do you use for putting families into clans?
We use a variety of measures. Where possible we do use structures to guide us and that is always the gold standard. In the absence of a structure we use:
- profile comparisons such as HHsearch
- the fact that a sequence significantly matches two HMMs in the same region of the sequence
- a method called SCOOP, that looks for common matches in search results that may indicate a relationship
All of this information is then used by one of our curators to decide whether families are related, and we strive to find information in the literature that supports the relationship, e.g. common function.
I was wondering if it is possible to build Wise2 with HMMER3 support?
We get round the difference in HMMER versions by converting the HMMs that are in HMMER3 format to HMMER2 format using the HMMER3 program "hmmconvert" (with the -2 flag). To make the searches feasible, we screen the DNA for potential domains using ncbi-blast with Pfam-A.fasta as the target library. GeneWise is then used to compare the resulting subset of HMMs against the DNA. There is some down-weighting of the bits-per-position between H2 and H3 HMMs that the conversion does not account for, leading inevitably to some false negatives for some families/sequences. However, until GeneWise is patched to deal with HMMER3 models, this is the best course of action.
What happened to the Pfam_ls and Pfam_fs files?
In the past, each Pfam family was represented by two profile-hidden Markov models (HMMs). One of these could match partially to a family and was called local or fs mode, the other required a sequence to match to the whole length of the HMM, and was called glocal or ls mode. With HMMER2, we found that the combination of the two models gave us the most sensitive searches. However, HMMER3 models are only available for searching in local (fs) mode. Because of the improvements in HMMER3, this single model is as sensitive as the two combined HMMER2 models. This means that we no longer provide two HMM libraries called 'HMM_ls' and 'HMM_fs'. Instead, a single library is available called 'Pfam-A.hmm'.
How can I search Pfam locally?
If you have a large number of sequences or you don't want to post your sequence across the web, you can search your sequence locally using the 'pfam_scan.pl' script.
In terms of HMMs and formats, Pfam is based around the HMMER3 package, which will need to be installed on your local machine. You will also need to download the Pfam HMM libraries from the FTP site, as well as a few modules from CPAN, most notably Moose.
Full details on how to get 'pfam_scan.pl' up and running can be found on our FTP site.
Why doesn't Pfam include my sequence?
Pfam is built from a fixed release of UniProt. At each Pfam release we incorporate sequences from the latest release of UniProt. This means that, at any time, the sequences used by Pfam might be several months behind those in the most up-to-date versions of the sequence databases. If your sequence isn't in Pfam, you can still find out what domains it contains by pasting it into the sequence search box on the search page.
Why is there apparent redundancy of UniProt IDs in the full-length FASTA sequence file?
A given Pfam family may match a single protein sequence multiple times, if the domain/family is a repeating unit, for example, or when the HMM matches only to short stretches of the sequence but matches several times. In such cases the FASTA file with the full length sequences will contain multiple copies of the same sequence.
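If the repeated records are unwanted, they can be collapsed after download. The sketch below keeps the first record seen for each identifier. It assumes that the identifier is the first whitespace-delimited token of each FASTA header, which may not match the exact header layout of the Pfam files, and both function names are illustrative.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_repeated_ids(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the first record for each sequence identifier."""
    seen = set()
    kept = []
    for header, sequence in records:
        tokens = header.split()
        seq_id = tokens[0] if tokens else ""
        if seq_id not in seen:
            seen.add(seq_id)
            kept.append((header, sequence))
    return kept
```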
How many accurate alignments do you have?
Release 35.0 has 19632 families. Over 75.2% of the proteins in SWISSPROT 2021_03 and TrEMBL 2021_03 have at least one match to a Pfam-A family.
How can I submit a new domain?
If you know of a domain that is not present in Pfam, you can submit it to us by email (pfam-help@ebi.ac.uk) and we will endeavour to build a Pfam entry for it. We ask that you supply us with a multiple sequence alignment of the domain, and associated literature evidence if available. Please send the alignment as a text file (e.g. .txt) and not in the format of a specific application such as Microsoft Word (e.g. a .doc file).
Can I search my protein against Pfam?
Of course! Please use this search form.
Why do I get slightly different results when I search my sequence against Pfam versus when I look up a sequence on the Pfam website?
When a sequence region has overlapping matches to more than one family within the same clan, we only show one of those matches. If the sequence region is also in the seed alignment for a family, only the match to that family is shown. Otherwise we show the family that corresponds to the match with the lowest E-value.
There are cases where a sequence region is in the seed alignment of a Pfam family (family A), but it does not have a significant match to that family’s HMM. Occasionally, the same sequence region has a significant match to another family (family B) in the same clan. In this situation, the Pfam website will not show the match to family B as it is present in the seed alignment of family A. The sequence search will however show the match to family B as the seed alignment information is unknown. This scenario, where the sequence search shows a match that the Pfam website does not, is very rare (affecting less than 0.01% of all matches in the Pfam database).
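The selection rule described above can be sketched as follows. This is an illustrative greedy implementation under stated assumptions, not the website's code: the `Match` class, its field names and the family and clan names used in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Match:
    family: str
    clan: Optional[str]   # None when the family is not in a clan
    start: int
    end: int
    evalue: float
    in_seed: bool = False  # True if the region is in the family's seed alignment


def overlaps(a: Match, b: Match) -> bool:
    """True if the two matches share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def resolve_clan_overlaps(matches: List[Match]) -> List[Match]:
    """Keep one match per overlapping group within a clan.

    Seed-alignment matches are preferred; otherwise the lowest E-value wins.
    """
    ranked = sorted(matches, key=lambda m: (not m.in_seed, m.evalue))
    kept: List[Match] = []
    for m in ranked:
        clash = any(m.clan is not None and m.clan == k.clan and overlaps(m, k)
                    for k in kept)
        if not clash:
            kept.append(m)
    return sorted(kept, key=lambda m: m.start)
```

A sequence search has no seed alignment information, which corresponds to calling this function with `in_seed` left as `False` for every match. That is why the two routes can occasionally disagree.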
What is the difference between the '-' and '.' characters in your full alignments?
The '-' and '.' characters both represent gaps, but they tell you different things about how the HMM generated the alignment. A '-' symbol marks a position where the alignment of the sequence used a delete state in the HMM to jump past a match state. This means that the sequence is missing a column that the HMM was expecting to be there. The '.' character is used to pad gaps where another sequence in the alignment has residues emitted from the HMM's insert state. See the alignment below, where both characters are used. The HMM states emitting each column are shown. Note that residues emitted from the Insert (I) state are in lower case.
FLPA_METMA/1-193   ---MPEIRQLSEGIFEVTKD.KKQLSTLNLDPGKVVYGEKLISVEGDE
FBRL_XENLA/86-317  RKVIVEPHR-HEGIFICRGK.EDALVTKNLVPGESVYGEKRISVEDGE
FBRL_MOUSE/90-321  KNVMVEPHR-HEGVFICRGK.EDALFTKNLVPGESVYGEKRVSISEGD
O75259/81-312      KNVMVEPHR-HEGVFICRGK.EDALVTKNLVPGESVYGEKRVSISEGD
FBRL_SCHPO/71-303  AKVIIEPHR-HAGVFIARGK.EDLLVTRNLVPGESVYNEKRISVDSPD
O15647/71-301      GKVIVVPHR-FPGVYLLKGK.SDILVTKNLVPGESVYGEKRYEVMTED
FBRL_TETTH/64-294  KTIIVK-HR-LEGVFICKGQ.LEALVTKNFFPGESVYNEKRMSVEENG
FBRL_LEIMA/57-291  AKVIVEPHMLHPGVFISKAK.TDSLCTLNMVPGISVYGEKRIELGATQ
Q9ZSE3/38-276      SAVVVEPHKVHAGIFVSRGKsEDSLATLNLVPGVSVYGEKRVQTETTD
HMM STATES         MMMMMMMMMMMMMMMMMMMMIMMMMMMMMMMMMMMMMMMMMMMMMMMM
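The convention can be checked mechanically. The sketch below labels a column as an insert column when every character in it is either '.' or a lower-case residue, and as a match column otherwise. This heuristic is an assumption derived from the description above rather than code from HMMER, and it uses three rows copied from the example alignment.

```python
from typing import List

# Three rows copied from the example alignment above.
ROWS = [
    "---MPEIRQLSEGIFEVTKD.KKQLSTLNLDPGKVVYGEKLISVEGDE",  # FLPA_METMA/1-193
    "AKVIVEPHMLHPGVFISKAK.TDSLCTLNMVPGISVYGEKRIELGATQ",  # FBRL_LEIMA/57-291
    "SAVVVEPHKVHAGIFVSRGKsEDSLATLNLVPGVSVYGEKRVQTETTD",  # Q9ZSE3/38-276
]


def column_states(rows: List[str]) -> str:
    """Label each alignment column 'I' (insert) or 'M' (match)."""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows must have the same length")
    states = []
    for column in zip(*rows):
        # Insert columns hold only '.' padding and lower-case residues.
        is_insert = all(c == "." or c.islower() for c in column)
        states.append("I" if is_insert else "M")
    return "".join(states)


def count_deletions(row: str) -> int:
    """Count the match columns that a sequence skipped via delete states."""
    return row.count("-")
```

For these rows, `column_states(ROWS)` marks only the column containing the lower-case `s` as an insert column, and `count_deletions(ROWS[0])` counts the three leading '-' characters in the first sequence.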
What do the SS lines in the alignment mean?
These lines are structural information. The SS stands for secondary structure, and this is taken from DSSP. The following list gives the definitions for each code letter:
- C: random Coil
- H: alpha-helix
- G: 3(10) helix
- I: pi-helix
- E: hydrogen bonded beta-strand (extended strand)
- B: residue in isolated beta-bridge
- T: h-bonded turn (3-turn, 4-turn, or 5-turn)
- S: bend (five-residue bend centered at residue i)
You don't have domain YYYY in Pfam!
We are very keen to be alerted to new domains. If you can provide us with a multiple alignment then we will try hard to incorporate it into the database. If you know of a domain but don't have a multiple alignment, we still want to know: for simple families, just one sequence is enough. Again, please email pfam-help@ebi.ac.uk.
Are there other databases which do this?
To a certain extent yes, there are a number of "second generation" databases which are trying to organise protein space into evolutionarily conserved regions. Examples include:
- PROSITE: This originally was based around regular expression patterns but now also includes profiles.
- PRINTS: This is based around protein "finger-prints" of a series of small conserved motifs making up a domain.
- SMART: This is a database concentrating on extracellular modules and signaling domains.
- ADDA: This is an automatic algorithm for domain decomposition and clustering of protein domain families.
- InterPro: Combines information from Pfam, Prints, SMART, Prosite and PRODOM.
- CDD: The Conserved Domain Database is derived from Pfam and SMART databases.
So which database is better?
As with everything, it depends on your problem: we would certainly suggest using more than one method. Pfam is likely to provide more interpretable results, with crisp definitions of domains in a protein.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website.
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein.
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit.
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function.
Envelope coordinates
See Alignment coordinates.
Family
A collection of related protein regions.
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values, a sequence cutoff and a domain cutoff.
HMMER
The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
HMMER3
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets.
Motif
A short unit found outside globular domains.
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment.
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a 'match' or 'insert' state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with '*' being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability, and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit scores of the lowest scoring match in the full alignment.
Help With Pfam HMM scores
E-values and Bit-scores
Pfam-A is based around hidden Markov model (HMM) searches, as provided by the HMMER3 package. In HMMER3, like BLAST, E-values (expectation values) are calculated. The E-value is the number of hits that would be expected to have a score equal to or better than this value by chance alone. A good E-value is much less than 1. A value of 1 is what would be expected just by chance. In principle, all you need to decide on the significance of a match is the E-value.
E-values are dependent on the size of the database searched, so we use a second system in-house for maintaining Pfam models, based on a bit score (see below), which is independent of the size of the database searched. For each Pfam family, we set a bit score gathering (GA) threshold by hand, such that all sequences scoring at or above this threshold appear in the full alignment. It works out that a bit score of 20 equates to an E-value of approximately 0.1, and a score of 25 to approximately 0.01. From the gathering threshold both a "trusted cutoff" (TC) and a "noise cutoff" (NC) are recorded automatically. The TC is the score for the next highest scoring match above the GA, and the NC is the score for the sequence next below the GA, i.e. the highest scoring sequence not included in the full alignment.
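The relationship between the three cutoffs can be sketched in a few lines of Python. The function name and the scores in the usage note are made up for illustration, and the sketch handles a single list of scores, whereas Pfam records separate sequence and domain cutoffs.

```python
from typing import Iterable, Optional, Tuple


def trusted_and_noise_cutoffs(scores: Iterable[float],
                              ga: float) -> Tuple[Optional[float], Optional[float]]:
    """Derive (TC, NC) from bit scores and a gathering threshold.

    TC is the lowest score at or above the GA (the weakest included match);
    NC is the highest score below the GA (the strongest excluded match).
    Either value is None when no score falls on that side of the GA.
    """
    scores = list(scores)
    included = [s for s in scores if s >= ga]
    excluded = [s for s in scores if s < ga]
    tc = min(included) if included else None
    nc = max(excluded) if excluded else None
    return tc, nc
```

For example, with made-up scores `[35.2, 27.1, 19.8, 12.0]` and a GA of 25.0, the TC is 27.1 and the NC is 19.8.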
Sequence versus domain scores
There's an additional wrinkle in the scoring system. HMMER3 calculates two kinds of scores, the first for the sequence as a whole and the second for the domain(s) on that sequence. The "sequence score" is the total score of a sequence aligned to the model (the HMM); the "domain score" is the score for a single domain — these two scores are virtually identical where only one domain is present on a sequence. Where there are multiple occurrences of the domain on a sequence any individual match may be quite weak, but the sequence score is the sum of all the individual domain scores, since finding multiple instances of a domain increases our confidence that that sequence belongs to that protein family, i.e. truly matches the model.
Meaning of bit-score for non-mathematicians
A bit score of 0 means that the likelihood of the match having been emitted by the model is equal to that of it having been emitted by the Null model (by chance). A bit score of 1 means that the match is twice as likely to have been emitted by the model as by the Null. A bit score of 2 means that the match is 4 times as likely to have been emitted by the model as by the Null. So, a bit score of 20 means that the match is 2 to the power 20 times as likely to have been emitted by the model as by the Null.
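The arithmetic is simply a power of two, as the short sketch below shows. The function names are illustrative.

```python
import math


def odds_ratio(bit_score: float) -> float:
    """How many times more likely the match is under the model than the Null."""
    return 2.0 ** bit_score


def bit_score_from_odds(ratio: float) -> float:
    """Inverse conversion: the bit score is the base-2 log of the odds ratio."""
    return math.log2(ratio)
```

So `odds_ratio(20)` is 1048576, i.e. roughly a million to one in favour of the model.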
The Pfam consortium is now coordinating the annotation of Pfam families via Wikipedia. This is some background on the process.
A new approach to annotation
Pfam families have traditionally been annotated by our curators as they were added into the Pfam database, but the annotation step has become by far the most time-consuming part of building a Pfam family. As we adapt the Pfam model to cope with the dramatic increases in sequence data that are on the horizon, we have had to consider ways to make this step quicker and more efficient. We are also striving constantly to improve the quality and depth of our annotations, and to this end we have now adopted the Wikipedia model of annotation that was pioneered by the Rfam resource.
In this approach we will gradually reduce the prominence of our traditional, curator-produced family annotations, replacing them with Wikipedia articles. Starting from Pfam release 25.0, you will see that some family pages in the Pfam website show Wikipedia content rather than our own annotations. Ultimately we hope to be able to assign a detailed, high-quality Wikipedia article to every family in Pfam, but we need the help of the Pfam and Wikipedia communities to make this happen.
Wikipedia content in the Pfam website

When we build a new Pfam family, we try to find a Wikipedia article that describes the family and provides what we feel to be a valuable annotation for it. If we can't find a suitable article we will, in many cases, generate a new Wikipedia article ourselves. Hence, most new families will be assigned a Wikipedia article as soon as they are created.
Where a Wikipedia article has been assigned to a family, the main summary tab of the family page will show the content of the article, rather than the Pfam annotation. You can still see the old Pfam annotation, along with the InterPro annotation text, in adjacent tabs. Note that we will no longer be updating Pfam annotation text for any family that has a Wikipedia article. Instead we will try to make improvements or corrections to the article itself and will encourage our users to do the same.

Unfortunately, while new Pfam families will have Wikipedia articles assigned when they are created, we simply do not have the resources to be able to revisit older, pre-existing Pfam families. The family pages for these families still continue to show the Pfam annotation, but we hope to replace this with Wikipedia content wherever possible.
Contributing annotations
You can now contribute to the improvement of Pfam annotations in several ways. First and foremost, if you come across a family that does not yet have a Wikipedia article assigned to it, we would really like to add one. If you know of an article that would provide a useful description of a family, please let us know via our annotation submission form (click the "Add annotation" button on the family page) or by email. You can find our email address at the bottom of every page.
One of the advantages of using Wikipedia to provide our annotations is that any user can now contribute to that annotation text. In many cases, families that do not yet have a Wikipedia article can be assigned an article that already exists. In some cases, however, no suitable article exists, and in that case we would encourage you to consider adding one to Wikipedia yourself.
Editing Wikipedia articles
You can see these notes on every family page by clicking "More" on the Wikipedia content tab.
Before you edit for the first time
Wikipedia is a free, online encyclopedia. Although anyone can edit or contribute to an article, Wikipedia has some strong editing guidelines and policies, which promote the Wikipedia standard of style and etiquette. Your edits and contributions are more likely to be accepted (and remain) if they are in accordance with this policy.
You should take a few minutes to view the following pages:
How your contribution will be recorded
Anyone can edit a Wikipedia entry. You can do this either as a new user or you can register with Wikipedia and log on. When you click on the "Edit Wikipedia article" button, your browser will direct you to the edit page for this entry in Wikipedia. If you are a registered user and currently logged in, your changes will be recorded under your Wikipedia user name. However, if you are not a registered user or are not logged on, your changes will be logged under your computer's IP address. This has two main implications. Firstly, as a registered Wikipedia user your edits are more likely to be seen as valuable contributions (although all edits are open to community scrutiny regardless). Secondly, if you edit under an IP address you may be sharing this IP address with other users. If your IP address has previously been blocked (due to being flagged as a source of 'vandalism') your edits will also be blocked. You can find more information on this and on creating a user account at Wikipedia.
If you have problems editing a particular page, contact us at pfam-help@ebi.ac.uk and we will try to help.
Does Pfam agree with the content of the Wikipedia entry?
Pfam has chosen to link families to Wikipedia articles. In some cases we have created or edited these articles, but in many other cases we have not made any direct contribution to the content of the article. The Wikipedia community does monitor edits to try to ensure that (a) the quality of article annotation increases, and (b) vandalism is very quickly dealt with. However, we would like to emphasise that Pfam does not curate the Wikipedia entries and we cannot guarantee the accuracy of the information on the Wikipedia page.
Contact us
Community annotation is a new facility of the Pfam web site. If you have problems editing or experience problems with these pages please contact us.
Linking to the Pfam website
Linking to family pages
You can refer to Pfam families either by accession or ID. You can also refer to a family by "entry", although this is a convenience that should be used only if you're not sure if what you have is an accession or an ID.
Pfam accession numbers are more stable between releases than IDs and we strongly recommend that you link by accession number.
Here are some examples of linking to Pfam families:
- Directly, using either accession or ID:
- /family/PF00002 or /family/7tm_2
- Using a parameter, by accession:
- /family?acc=PF00002
- Using a parameter, by ID:
- /family?id=7tm_2
- Using the "entry" parameter with either accession or ID:
- /family?entry=PF00002 or /family?entry=7tm_2
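The linking styles above can also be generated programmatically. The following is a minimal Python sketch, not part of the Pfam site itself: the function name `page_url` and the host in `BASE_URL` are placeholders chosen for illustration, and only the path and parameter names come from the examples above.

```python
from urllib.parse import quote, urlencode

# Assumption: placeholder host. Replace it with the address of the Pfam site
# that you are linking to.
BASE_URL = "https://pfam.example.org"


def page_url(section, value, style="direct", base_url=BASE_URL):
    """Build a link to a family or protein page.

    section: "family" or "protein".
    style:   "direct" puts the value in the path; "acc", "id" and "entry"
             pass it as a URL parameter of that name.
    """
    if section not in ("family", "protein"):
        raise ValueError(f"unknown section: {section!r}")
    if style == "direct":
        return f"{base_url}/{section}/{quote(value, safe='')}"
    if style in ("acc", "id", "entry"):
        return f"{base_url}/{section}?{urlencode({style: value})}"
    raise ValueError(f"unknown link style: {style!r}")
```

Since accessions are more stable than IDs, a call such as `page_url("family", "PF00002", "acc")` is the safer choice for long-lived links.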
Linking to protein sequence pages
As for Pfam family pages, you can refer to protein sequence pages by accession, ID or entry. Protein IDs are unstable and do change between releases, so, again, we strongly recommend that you use protein accessions where possible.
Here are some examples of linking to protein sequence pages at EBI:
- Directly:
- /protein/P15498 or /protein/VAV_HUMAN
- By accession:
- /protein?acc=P15498
- By ID:
- /protein?id=VAV_HUMAN
- Using "entry":
- /protein?entry=P15498 or /protein?entry=VAV_HUMAN
Linking to the "jump to" search
The Pfam website features a search tool that tries to guess the type of any accession or ID that it is given. For example, if given "VAV_HUMAN", the search returns the URL for the protein sequence page for the VAV_HUMAN entry. If given "1w9h", the search returns the URL for the PDB entry (structure) 1w9h.
You can use the "jump to" search if you need to link to Pfam but can't be sure what type of accession or ID you will be using in your link.
By default, the search returns the URL that it has found, as a simple, plain text HTTP response. Adding the parameter redirect=1 will make the "jump to" tool redirect to the URL that it finds or, if it couldn't find an appropriate URL, to the Pfam homepage.
- Return URL:
- /search/jump?entry=P15498
- Redirect:
- /search/jump?entry=P15498&redirect=1
Note that, although it may be convenient to link to Pfam using this search tool, there is no error reporting for your users if the search fails to find an appropriate URL in the Pfam site. It is much safer to link directly to the correct section of the site. Please contact us if you need help with building specific links.
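As an illustration, a "jump to" link could be assembled as in the Python sketch below. The function name `jump_url` and the host in `BASE_URL` are assumptions made for the example; the path and the `entry` and `redirect` parameters are the ones described above.

```python
from urllib.parse import urlencode

# Assumption: placeholder host, not the real address of the Pfam site.
BASE_URL = "https://pfam.example.org"


def jump_url(entry, redirect=False, base_url=BASE_URL):
    """Build a link to the "jump to" search for any accession or ID.

    With redirect=False the service answers with the target URL as plain
    text; with redirect=True it redirects the client to that URL.
    """
    params = {"entry": entry}
    if redirect:
        params["redirect"] = 1
    return f"{base_url}/search/jump?{urlencode(params)}"
```

Because the tool falls back to the homepage without reporting an error, a script that needs to detect failed lookups would probably call it without `redirect` and inspect the returned text itself.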
One of the visualisations provided by the Pfam website is a graphical representation of the features found within a sequence, termed domain graphics. There are a variety of different shapes and styles and each one has a particular meaning. This page gives an in-depth description of the elements of Pfam domain graphics.
The library that generates the images in this page and throughout the Pfam site uses a JSON string to describe the domain graphic. Each of the example graphics in this page is followed by a link that can be used to show the JSON snippet that produced it.
Generating graphics
You can try generating your own graphics using the domain graphics generator. The JSON descriptions in this page can be pasted directly into the generator to produce the graphics that you see here.
You can also generate the domain graphics for specific sequences, using the UniProt graphics generator.
Using the domain graphics code
Finally, if you would like to use the javascript library in your own site, we have put together an example page, showing how to set up the library and its dependencies. Look at the source code of the page for an explanation.
The sequence
The base sequence, undecorated by any domains or features, is represented by a plain grey bar:

The length of the domain graphic that is drawn is proportional to the length of the sequence itself. The graphics in this page are drawn with an X-scale of 0.5 pixels per amino-acid, so that a 400 residue sequence will result in a 200 pixel-wide image. Any domains or features which are drawn on the sequence are also scaled by the same factor.
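The scaling arithmetic can be written out as a short Python sketch. The function names are hypothetical, and the exact rounding and offset conventions of the drawing library are not described here, so the values are left as unrounded floats.

```python
X_SCALE = 0.5  # pixels per amino acid, the value used for the graphics on this page


def scaled_width(sequence_length, x_scale=X_SCALE):
    """Width in pixels of the bar drawn for a sequence of the given length."""
    return sequence_length * x_scale


def scaled_span(start, end, x_scale=X_SCALE):
    """Pixel positions of a feature covering residues start..end.

    Assumption: the feature coordinates are simply multiplied by the same
    factor as the sequence; the real library may round or offset them.
    """
    return (start * x_scale, end * x_scale)
```

For example, `scaled_width(400)` gives 200.0 pixels, matching the 400 residue case above, and a domain covering residues 100 to 300 would be drawn between pixels 50.0 and 150.0.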
Pfam-A
The high quality, curated Pfam-A domains are classified into one of six different types: family, domain, coiled-coil, disordered, repeat and motif (more details). These different classification types are rendered slightly differently.
Family/domain
It is possible for a sequence to match either the full length of a Pfam HMM (a full length match), or to match a portion of an HMM (a fragment match). The two types of match are rendered differently.
Both family and domain entries are rendered as rectangles with curved ends when the sequence is a full length match. Different types of domain are displayed with different colours. When the domain image is long enough, the domain name is shown within the domain itself. In most cases, you can click on the domains to visit the "family page" for that domain. Moving the mouse over the domain image should also display a tooltip showing the domain name, as well as the start and end positions of the domain.

From Pfam 24.0 onwards, Pfam has been generated using HMMER3, which introduces the concept of "envelope coordinates" for a match. Envelope regions are represented in domain graphics as lighter coloured regions. The graphic above shows short envelope regions at the ends of both domains.
When the sequence does not match the full length of the HMM that models a Pfam entry, matching domain fragments are shown. When a sequence match does not pass through the first position in the HMM, the N-terminal side of the domain graphic is drawn with a jagged edge instead of a curved edge. Similarly, when a sequence match does not pass through the last position of the HMM, the C-terminal side of the domain graphic is drawn with a jagged edge. In some rarer cases, the sequence match may not pass through either of the first or last positions of the HMM, in which case both sides are drawn with jagged edges. Examples of all three cases are shown here:
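The rule for choosing between curved and jagged ends can be sketched in Python as follows. This is an illustration of the logic described above rather than the code used by the Pfam site; the function name, the argument names and the style strings "curved" and "jagged" are all assumptions.

```python
def edge_styles(hmm_start, hmm_end, model_length):
    """Pick the end styles for a family or domain match.

    hmm_start and hmm_end are the first and last HMM positions covered by
    the match (1-based); model_length is the number of positions in the HMM.
    Returns a (N-terminal style, C-terminal style) pair.
    """
    if not 1 <= hmm_start <= hmm_end <= model_length:
        raise ValueError("HMM coordinates must satisfy 1 <= start <= end <= length")
    # A match that does not pass through the first HMM position is a fragment
    # on the N-terminal side; likewise for the last position and the C-terminus.
    n_terminal = "curved" if hmm_start == 1 else "jagged"
    c_terminal = "curved" if hmm_end == model_length else "jagged"
    return n_terminal, c_terminal
```

For repeats and motifs the same test would apply, with a straight edge taking the place of the curved one.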

Repeat/motif
Repeats and motifs are types of Pfam domain which do not form independently folded units. In order to distinguish them from domains of type family and domain, repeats and motifs are represented by rectangles with straight edges. As for families and domains, partial matches are represented with jagged edges.

Discontinuous nested domains
Some domains in Pfam are disrupted by the insertion of another domain (or domains) within them. A number of names have been given to this arrangement: discontinuous (referring to the outer domain), inserted or nested (both referring to the inner domain). For example, in many sequences containing an IMPDH domain, the IMPDH domain is continuous along the primary sequence. However, in some cases the linear sequence of the IMPDH domain is broken by the insertion of a CBS domain, as shown below.
Where three-dimensional structures are available for representatives of a Pfam domain, it is generally clear that the three-dimensional arrangement of the domain containing the nested domain is maintained. Typically the nested domain is found inserted within a surface exposed loop, having little or no effect on the structure of the other domain. Such an arrangement explains why and how these nested domains can be functionally tolerated.
To represent this arrangement of domain graphically, the discontinuous domain is represented in two parts (as shown below). These two parts are joined by a line bridging them.

Other sequence motifs
In addition to domains, smaller sequence motifs are represented by the domain graphics. Currently the following motifs are represented: signal peptides, low complexity regions, coiled-coils and transmembrane regions. These usually take lower priority than other regions that are drawn and they are therefore often obscured by, for example, a Pfam-A graphic being drawn over the top of them. An example of each motif is shown here.

Signal peptides
Signal peptides are short regions (<60 residues long) found at the N-terminus of proteins, which direct the post-translational transport of a protein and are subsequently removed by peptidases. More specifically, a signal peptide is characterised by a short hydrophobic helix (approximately 7-15 residues). This helix is preceded by a slightly positively charged region of highly variable length (approximately 1-12 residues). Between the hydrophobic helix and the cleavage site is a somewhat polar and uncharged region, of between 3 and 8 amino-acids. In Pfam, we use Phobius for the prediction of signal peptides and represent them graphically by a small orange box.
Low complexity regions
Low complexity regions are regions of biased sequence composition, usually comprised of different types of repeats. These regions have been shown to be functionally important in some proteins, but they are generally not well understood and are masked out to focus on globular domains within the protein.
Within Pfam, we use SEG to calculate low complexity regions. The presence of a low complexity region is indicated by a cyan rectangle.
Disordered regions
We use the IUPred method for the prediction of disordered regions in the query sequence. The IUPred server provides more detailed disorder prediction results than currently offered here.
Coiled-coils
Coiled-coils are motifs found in proteins that structurally form alpha-helices that wrap or wind around each other. Normally, two to three helices are involved, but cases of up to seven alpha-helices have been reported. Coiled-coils are found in a wide variety of proteins, many of which are functionally very important. In Pfam we use ncoils to identify these motifs. Coiled-coils are represented by a small lime-green rectangle.
Transmembrane regions
Integral membrane proteins contain one or more transmembrane regions that are comprised of an alpha-helix that passes through or "spans" a membrane. Transmembrane helices are quite variable in length, with the average being about 20 amino-acids in length. Again, Phobius is used for the prediction of transmembrane regions, which are represented by a red rectangle.
Other sequence features
Below is a demonstration of how disulphide bridges and active site residues are represented in Pfam. Each of these features can appear above or below the sequence, but in this case the disulphide bridges are shown above the sequence and the active site residues below it.

Disulphide bridges
Disulphide bridges play a fundamental role in the folding and stability of some proteins. They are formed by covalent bonding between the thiol groups from two cysteine residues. The disulphide bridge annotations used in Pfam come from UniProt and are represented by a solid bridge-shaped line. When multiple disulphide bonds occur, the heights of the bridges are adjusted to avoid overlaps between them. Inter-protein disulphides are represented by single vertical lines. As always, moving the mouse over the "bridge graphic" shows the details of the bond in a tooltip.
Active site residues
Within an enzyme, a small number of residues are directly involved in catalysis of a reaction. These are termed active site residues. Within Pfam there are three categories of active site: those that are experimentally determined, those that are predicted by UniProt and those predicted by Pfam. All three types are represented by a "lollipop" with a diamond head. The head is coloured red, pink and purple for each of the three types respectively.
Pfam-predicted active sites are determined by using the experimental data and transferring these annotations through a Pfam alignment.
"Lollipops"
A wide range of different lollipop styles can be created by combining different line and head colours with different drawing styles. The lollipop head can be drawn as a square, circle or diamond, as a simple coloured bar, or as an arrow (pointing away from the sequence) or a "pointer" (an arrow pointing towards the sequence).

Tooltips
If appropriate metadata are present in the sequence description, the domain graphics library can also add tooltips to the image. The example below is a "live" domain graphic and its description includes the necessary metadata for generating tooltips; move your mouse over the various domains and sequence features to see them.
This is an introduction to the "RESTful" interface to the Pfam website. REST (or Representational State Transfer) refers to a style of building websites which makes it easy to interact programmatically with the services provided by the site. A programmatic interface, commonly called an Application Programming Interface (API), allows users to write scripts or programs to access data, rather than having to rely on a browser to view a site.
Basic concepts
URLs
A RESTful service typically sends and receives data over HTTP, the same protocol that's used by websites and browsers. As such, the services provided through a RESTful interface are identified using URLs.
In the Pfam website we use the same basic URL to provide both the standard HTML representation of Pfam data and the alternative XML representation. To see the data for a particular Pfam-A family, you would visit the following URL in your browser:
/family/Piwi
To retrieve the data in XML format, just add an extra parameter, output=xml, to the URL:
/family/Piwi?output=xml
The response from the server will now be an XML document, rather than an HTML page.
Sending requests
Using curl
Although you can use a browser to retrieve family data in XML format, it's most useful to send requests and retrieve XML programmatically. The simplest way to do this is using a Unix command line tool such as curl.
Note: we have recently changed the web server that we use for serving the Pfam site. Due to a bug in the server itself, requests that come from curl are normally rejected. The current work-around is to add an extra parameter to the curl command line: -H 'Expect:'. This should avoid problems with requests being rejected.
Using a script
Most programming languages have the ability to send HTTP requests and receive HTTP responses. A Perl script to retrieve data about a Pfam family might be as trivial as this:
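The same kind of request can be sketched in Python using only the standard library. This is an illustrative example rather than an official client: the host in `BASE_URL` is a placeholder and the function names are hypothetical. The `opener` argument exists only so that the function can be exercised without network access.

```python
import urllib.request
from urllib.parse import quote

# Assumption: placeholder host, not the real address of the Pfam site.
BASE_URL = "https://pfam.example.org"


def family_xml_url(entry, base_url=BASE_URL):
    """URL of the XML representation of a Pfam-A family page."""
    return f"{base_url}/family/{quote(entry, safe='')}?output=xml"


def fetch_family_xml(entry, base_url=BASE_URL, opener=urllib.request.urlopen):
    """Send an HTTP GET request and return the XML document as text."""
    with opener(family_xml_url(entry, base_url), timeout=30) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_family_xml("Piwi")` against a real server would return the XML document for the family as a string.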
Retrieving data
Although XML is just plain text and therefore human-readable, it's intended to be parsed into a data structure. Extending the Perl script above, we can add the ability to parse the XML using an external Perl module, XML::LibXML:
This script now prints out the accession for the family "Piwi" (PF02171).
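A comparable parsing step in Python might look like the sketch below, which uses the standard library module xml.etree.ElementTree. The layout of `SAMPLE_XML` (an `entry` element carrying `accession` and `id` attributes, inside a namespaced root element) is an assumption made for illustration; check the XML schema for the real structure before relying on it.

```python
import xml.etree.ElementTree as ET

# Assumption: a simplified, hypothetical response. The element names,
# attributes and namespace are illustrative only.
SAMPLE_XML = """<pfam xmlns="https://pfam.example.org/">
  <entry entry_type="Pfam-A" accession="PF02171" id="Piwi"/>
</pfam>"""


def family_accession(xml_text):
    """Return the accession of the first <entry> element, or None."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # ElementTree reports namespaced tags as "{namespace}name";
        # compare only the local part of the name.
        if element.tag.rsplit("}", 1)[-1] == "entry":
            return element.get("accession")
    return None


print(family_accession(SAMPLE_XML))
```

Matching on the local tag name keeps the sketch independent of whichever namespace the server declares.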
Available services
The following is a list of the sections of the website which are currently available as RESTful services.
Pfam ID/accession conversion
This is a simple service to return the accession and ID for a Pfam family, given either the ID or accession as input. Any of the following URLs will return the same simple XML document:
/family/acc?id=Piwi&output=xml
/family/Piwi/acc?output=xml
/family/id?output=xml&acc=PF02171
/family/Piwi/id?output=xml
/family?entry=Piwi&output=xml
You can see the XML schema for this XML document here.
Note that, as a convenience, you can also omit the output=xml parameter and the response will contain only the ID or accession, as a plain text string.
Pfam-A annotations
You can retrieve a sub-set of the data in a Pfam-A family page as an XML document using any of the following styles of URL:
/family?id=Piwi&output=xml
/family?output=xml&acc=PF02171
/family?entry=Piwi&output=xml
/family/Piwi?output=xml
The last two styles, using the entry parameter or an extended URL, accept either accessions or identifiers. The accession/ID is case-insensitive in all cases.
You can see the XML schema for this XML document here.
Some Pfam families are removed or merged into others, in which case they become "dead" families. If you try to retrieve annotation information about a dead family, you'll get a simple XML document that only includes information on the replacement (if any) for the family:
You can see the XML schema for this XML document here.
Pfam-A family list
You can retrieve a list of all Pfam-A families in the latest Pfam release, either as an XML document or as a tab-delimited text file. Both formats contain the Pfam-A accession, Pfam-A identifier and description:
/families?output=xml
/families?output=text
You can also view the list in a web browser by removing the output=xml parameter from the URL.
You can see the XML schema for this XML document here.
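The tab-delimited form of the family list is straightforward to read in a script. The Python sketch below assumes three tab-separated columns in the order given above (accession, identifier, description) and assumes that any header lines start with "#"; both the sample text and the function name are illustrative.

```python
# Assumption: illustrative sample of the text output, including a
# hypothetical comment line at the top.
SAMPLE_TEXT = (
    "# hypothetical header line\n"
    "PF00001\t7tm_1\t7 transmembrane receptor (rhodopsin family)\n"
    "PF02171\tPiwi\tPiwi domain\n"
)


def parse_family_list(text):
    """Parse the tab-delimited family list into a list of dictionaries."""
    families = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # Split on the first two tabs only, in case a description contains one.
        accession, identifier, description = line.split("\t", 2)
        families.append(
            {"acc": accession, "id": identifier, "description": description}
        )
    return families
```

A lookup table from identifier to accession can then be built with `{f["id"]: f["acc"] for f in parse_family_list(text)}`.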
Protein sequence data
You can retrieve a sub-set of the data in a protein page as an XML document using any of the following styles of URL:
/protein?id=CANX_CHICK&output=xml
/protein?output=xml&acc=P00789
/protein?entry=P00789&output=xml
/protein/P00789?output=xml
As for Pfam-A families, arguments are all case-insensitive and the entry parameter accepts either ID or accession.
You can see the XML schema for this XML document here.
Sequence searches
The Pfam website includes a form that allows users to upload a protein sequence and see a list of the Pfam domains that are found on their search sequence. We've now implemented a RESTful interface to this search tool, making it possible to run single-sequence Pfam searches programmatically.
Running a search is a two step process:
- submit the search sequence and specify search parameters
- retrieve search results in XML format
The reason for separating the operation into two steps rather than performing a search in a single operation is that the time taken to perform a sequence search will vary according to the length of the sequence searched. Most web clients, browsers or scripts, will simply time-out if a response is not received within a short time period, usually less than a minute. By submitting a search, waiting and then retrieving results as a separate operation, we avoid the risk of a client reaching a time-out before the results are returned.
The following example uses simple command-line tools to submit the search and retrieve results, but the whole process is easily transferred to a single script or program.
Save your sequence to file
It is usually most convenient to save your sequence into a plain text file, something like this:
The sequence should contain only valid sequence characters, i.e. letters, excluding "J" and "O". You can break the sequence across multiple lines to make it easier to handle.
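Before submitting a search it can be useful to check the sequence locally. The Python sketch below applies the rule stated above (letters only, excluding "J" and "O") after joining lines. Treating lower-case letters as acceptable and converting them to upper case is an assumption, as are the function names.

```python
import string

# Every letter except "J" and "O" counts as a valid sequence character.
VALID_RESIDUES = set(string.ascii_uppercase) - {"J", "O"}


def normalise_sequence(raw):
    """Join a sequence that was broken across lines and upper-case it."""
    return "".join(raw.split()).upper()


def invalid_characters(raw):
    """Return the sorted list of characters that are not valid residues."""
    return sorted(set(normalise_sequence(raw)) - VALID_RESIDUES)


def is_valid_sequence(raw):
    """True if the sequence is non-empty and has only valid characters."""
    sequence = normalise_sequence(raw)
    return bool(sequence) and not invalid_characters(sequence)
```

A sequence file that still contains a FASTA header line would fail this check, because the header contains characters such as ">".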
Submit the search
You can see the XML schema for this XML document here.
When using curl the value of the parameter "seq" needs to be quoted so that its value is taken correctly from the file "test.seq". The second parameter can also be added directly to the URL, as a regular CGI-style parameter, if you prefer.
The search service accepts the following parameters (you can see a more complete description of these settings here):
Parameter | Description | Accepted values | Default | Notes |
---|---|---|---|---|
evalue | use this E-value cut-off | valid float < 10.0 | none | the default is to have no E-value and to use the gathering threshold. See note below. If an E-value is given, it will be used, regardless of the value of "ga" |
ga | use gathering threshold | 0 or 1 | 1 | |
seq | protein sequence | valid sequence characters | none | required |
Note: this documentation previously suggested that searches submitted through the RESTful interface used an E-value cut-off of 1.0 by default. This is incorrect. RESTful searches use the gathering threshold and not an E-value of 1.0. This is the opposite of the behaviour of the searches run through the web interface. We apologise for the inconsistency.
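The parameter rules in the table can be checked on the client side before a search is submitted. The following Python sketch builds the set of form parameters; the function name is hypothetical, and the checks implement only what the table states (for instance, no lower bound is enforced on the E-value because none is given).

```python
def search_parameters(seq, evalue=None, ga=None):
    """Validate search settings and return them as a dictionary.

    seq:    required protein sequence.
    evalue: optional E-value cut-off, a float below 10.0. If given, the
            server uses it regardless of the value of "ga".
    ga:     optional flag, 0 or 1. The server default is 1, so it is only
            sent when set explicitly.
    """
    if not seq:
        raise ValueError('the "seq" parameter is required')
    params = {"seq": seq}
    if evalue is not None:
        evalue = float(evalue)
        if not evalue < 10.0:
            raise ValueError('"evalue" must be a float below 10.0')
        params["evalue"] = evalue
    if ga is not None:
        if ga not in (0, 1):
            raise ValueError('"ga" must be 0 or 1')
        params["ga"] = ga
    return params
```

Leaving both `evalue` and `ga` unset reproduces the default behaviour described in the note above, in which the gathering threshold is used.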
Wait for the search to complete
Although you can check for results immediately, if you poll before your job has completed, you won't receive an XML document. Instead, the HTTP response to your request will have its status set appropriately and the body of the response will contain only a string giving the status. You should ideally check the HTTP status of the response, rather than relying on the body of the response.
These are the possible status codes for the response:
HTTP status code | Status description | Response body | Notes |
---|---|---|---|
202 | Accepted | PEND / RUN | The job has been accepted by the search system and is either pending (waiting to be started) or running. After a short delay, your script should check for results again |
502 | Bad gateway | FAIL | There was a problem scheduling or running the job. The job has failed and will not produce results. There is no need to check the status again |
503 | Service unavailable | HOLD | Your job was accepted but is on hold. This status will not be assigned by the search system, but by an administrator. There is probably a problem with the job and you should contact the help desk for assistance with it |
410 | Gone | DEL | Your job was deleted from the search system. This status will not be assigned by the search system, but by an administrator. There was probably a problem with the job and you should contact the help desk for assistance with it |
500 | Internal server error | Error message | There was some problem with running your job, but it does not fall into any of the other categories. The body of the response will contain an error message from the server. Contact the help desk for assistance with the problem |
When writing a script to submit searches and retrieve results, please add a short delay between the submission and the first attempt to retrieve results. Most search jobs are returned within four to five seconds of submission, depending greatly on the length of the sequence to be searched.
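A polling loop that follows the status codes above could be sketched in Python as shown below. The sketch assumes that a completed job answers with HTTP status 200 and the XML results, which the table does not state explicitly. The `fetch` argument is a hypothetical callable returning a `(status_code, body)` pair, so that any HTTP client can be plugged in; the delay values are illustrative.

```python
import time


def wait_for_results(fetch, result_url, initial_delay=5.0, poll_interval=2.0,
                     max_attempts=30, sleep=time.sleep):
    """Poll a result URL until the search has finished.

    fetch(result_url) must return a (status_code, body) tuple.
    Returns the response body once the job is done, raises RuntimeError if
    the job failed, was held or was deleted, and raises TimeoutError if it
    is still pending after max_attempts polls.
    """
    sleep(initial_delay)  # short delay before the first attempt
    for _ in range(max_attempts):
        status, body = fetch(result_url)
        if status == 200:  # assumption: 200 means the results are ready
            return body
        if status == 202:  # PEND or RUN: wait and check again
            sleep(poll_interval)
            continue
        # 502 (FAIL), 503 (HOLD), 410 (DEL), 500 and anything unexpected:
        # checking again will not help.
        raise RuntimeError(f"search failed with HTTP status {status}: {body.strip()}")
    raise TimeoutError(f"no results after {max_attempts} attempts")
```

Checking the numeric status rather than the body follows the recommendation above, and the injected `sleep` makes the loop easy to test without real delays.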
Retrieve results
The XML that was returned from the first query includes one or more URLs from which you can now retrieve results, given in the <result_url> elements. You can now poll these URLs to retrieve XML documents with the search hits.
You can see the XML schema for this XML document here.
Since the search is performed by the same server as searches in the Pfam website, you can view your results in a web page by modifying the URL slightly:
/search/sequence/resultset/adabec68-703f-48c4-bec7-07f1ab965fbb
Note that old search results are generally cleared out after some time, so if you wait too long before trying to view your hits in the website, you may find that they are already gone.
Retrieve domain graphics description
When you run a sequence search via the browser, the results page includes a Pfam domain graphic, showing the locations of any matching Pfam families on your search sequence. When running a search via the RESTful interface, you can't retrieve the domain graphic directly, since it's generated using a javascript class in the browser. However, you can retrieve the JSON string that describes the graphic:
/search/sequence/graphic/adabec68-703f-48c4-bec7-07f1ab965fbb
Check the domain graphics documentation for details on how you can use the JSON string locally.
The Pfam MySQL database contains all of the data accessible via the website. The database currently consists of 65 tables. Below is some basic documentation on the schema layout and how smaller numbers of tables can be put together to enable access to a subset of the data. At the time of writing, between releases 28.0 and 29.0, the fields within the tables and the results of queries are correct. The data within the tables will change with each release. Although we do not anticipate any major changes to the database, we reserve the right to make changes with or without warning; we will endeavour to update this document if such changes are made.
A red diamond in the images below indicates a foreign key. In some images there are tables which appear not to be linked to any other table in the image. This is due to a foreign key being populated late in production of the database. The 'floating' table can still be joined and example queries of how to do so are given under each image.
VERSION table

The version table contains information that relates to a particular Pfam release. It contains the version number of the Pfam database, the version numbers of the Swiss-Prot and TrEMBL databases that were used to build Pfam, and some statistics about the number of families and coverage. This table is stand-alone and does not link to any of the other tables.
Domain information

Two of the central tables in the Pfam database are pfamseq, which contains UniProtKB reference proteomes and pfamA, which contains information about the Pfam-A families. Most of the other tables in the database link to one or both of these tables, either directly or indirectly. Note that prior to Pfam 29.0, the pfamseq table contained the whole of UniProtKB. From Pfam 29.0, this table contains only the reference proteome portion of UniProtKB. The full alignments in Pfam are based on the sequences in the pfamseq table.
The table pfamA_reg_seed contains the Pfam regions that are present in a seed alignment. All sequences in pfamA_reg_seed are in the pfamseq table or the uniprot table (the uniprot table contains all the sequences in UniProtKB). The pfamA_reg_full_significant table contains all of the sequence regions from the pfamseq table that match the HMM and score above the curated threshold, i.e. are significant matches, for each family. There is also a table named pfamA_reg_full_insignificant which contains, as the name suggests, all the insignificant matches for each family. Insignificant matches are those which match the HMM with an E-value less than 1000, but score below the curated bit score threshold for each family.
In addition to providing matches to the sequences in the pfamseq table, we also provide the significant matches for the sequences in the uniprot table. These can be found in the table uniprot_reg_full.
The tables pfamA_reg_full_significant and uniprot_reg_full contain a column called 'in_full'. The matches that are present in the full alignment for a Pfam family have this column set to 1, while those that are not present in the full alignment have the 'in_full' column set to 0. A significant match will only be excluded from the full alignment (in_full = 0) if it matches a family that belongs to a clan, and the match overlaps with another more significant (lower E-value) match to a family within the clan.
For each sequence match we store two sets of coordinates, the envelope coordinates and the alignment coordinates. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate where HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates. In the database, envelope start and end positions are stored in the seq_start and seq_end columns, and the alignment coordinates are stored in the ali_start and ali_end columns.
The Pfam database has historically been built on the UniProtKB database. However, as of release 22.0 we also provide Pfam domain data for the NCBI sequence database (GenPept) and a set of metagenomics sequences. As of release 28.0, we no longer store Pfam information at the sequence level for the NCBI and metagenomics data sets in the MySQL database, but we still provide the family alignments for them in the alignment_and_tree table. The Pfam website can still be queried using NCBI and metagenomics accessions.
To report all of the overlapping domains within any clans, leave out the 'in_full =1' clause. More information on clans can be found later in this document.
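The effect of the in_full column can be demonstrated on a toy table. The Python sketch below uses an in-memory SQLite database in place of MySQL so that it runs without a Pfam installation. The columns seq_start, seq_end, ali_start, ali_end and in_full are those described above, while the column names pfamA_acc and pfamseq_acc, the accessions and the rows are assumptions made purely for illustration.

```python
import sqlite3

# Assumption: toy rows with made-up accessions. The second match is treated
# as outcompeted by an overlapping match to another family in the same clan,
# so its in_full flag is 0.
TOY_ROWS = [
    ("PF99999", "SEQ0001", 10, 120, 12, 118, 1),
    ("PF99999", "SEQ0002", 40, 150, 45, 148, 0),
]


def build_toy_database():
    """Create an in-memory table shaped like pfamA_reg_full_significant."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pfamA_reg_full_significant ("
        " pfamA_acc TEXT, pfamseq_acc TEXT,"  # assumed column names
        " seq_start INTEGER, seq_end INTEGER,"
        " ali_start INTEGER, ali_end INTEGER, in_full INTEGER)"
    )
    conn.executemany(
        "INSERT INTO pfamA_reg_full_significant VALUES (?, ?, ?, ?, ?, ?, ?)",
        TOY_ROWS,
    )
    return conn


def family_regions(conn, pfama_acc, only_in_full=True):
    """Envelope coordinates of the significant matches for one family."""
    query = (
        "SELECT pfamseq_acc, seq_start, seq_end"
        " FROM pfamA_reg_full_significant WHERE pfamA_acc = ?"
    )
    if only_in_full:
        query += " AND in_full = 1"  # keep only matches in the full alignment
    query += " ORDER BY pfamseq_acc, seq_start"
    return conn.execute(query, (pfama_acc,)).fetchall()
```

With the toy rows, the default call returns only the SEQ0001 match, whereas passing `only_in_full=False` also returns the overlapping SEQ0002 match.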
Pfamseq - other tables

This section contains a few tables that link to the pfamseq table, but don't fit nicely into any of the sections described above.
The pfam_annseq table contains binary Perl data structures which are used internally to generate the Pfam domain graphics. This table is not intended for use by Pfam users, as it is very dependent on Perl module versions.
The evidence table contains the UniProtKB evidence code key that is used in the evidence field in the pfamseq and uniprot tables.
UniProtKB sequences have secondary accessions if they have been merged or split. Secondary accession numbers are stored in the table called secondary_pfamseq_acc.
Other regions

These tables contain sequence-specific information about the reference proteome sequences. The other_regions table contains coiled-coil, low complexity, signal peptide, transmembrane and disordered regions data. The pfamseq_markup table contains active site information which is taken from the UniProtKB feature table. Additional active site residues are predicted by Pfam based on conserved residues in a Pfam alignment. The pfamseq_disulphide table contains disulphide bond information from the UniProtKB feature table.
Architecture information for a family

In Pfam, an architecture is the combination of domains that are present on a protein. The architecture table can be used to find out which combinations of domains are found on particular sets of proteins, or to find out which proteins share the same domain architecture.
Annotation information for a family

In addition to the Pfam annotation, we also store InterPro annotation and the associated GO terms for each family. Links to other databases (e.g. SCOP) are also stored where appropriate. The pfamA table contains the GA, TC and NC cut-offs for each family, and additional information surrounding the Pfam-A family, including the number of sequences in the seed and full alignment. The pfamA_interactions table contains, where data are available, pairs of interacting Pfam domains. The data in this table are taken from the iPfam resource, which describes physical interactions between Pfam domains that have a representative structure in the PDB.
Note: The other_params column contains 'fa;' where the Pfam family corresponds to a SCOP family, and 'sf;' where the Pfam family corresponds to a SCOP superfamily.
Clan data

A Pfam clan is a set of related Pfam-A families. The information we use to determine which families belong to the same clan includes related structure, related function, matching of the same sequence to HMMs from different families, and profile-profile comparisons. Note that not all Pfam-A families belong to a clan and that a Pfam-A family cannot belong to more than one clan.
Dead families and clans

Sometimes we find that two or more Pfam-A families can be merged into a single family, which leads to the deletion of Pfam-A families. Likewise we might merge two clans together, which results in the deletion of a clan. The dead_family and dead_clan tables contain information about Pfam-A families and clans that have been deleted. These tables may be of use if you need to track what happened to the members of a particular family/clan that is no longer in Pfam.
Nested domains

Some Pfam-A domains are disrupted by the insertion of another domain (or domains) within them. The domain that is inserted into another is known as a nested domain. The nested_locations table stores all the nested Pfam-A domains. It also stores the coordinates of the nested domain with respect to a sequence that is present in the seed alignment of the domain in which it nests.
Structural data

In order for the Protein Data Bank (PDB) information to be useful to Pfam, we need to map between PDB residues and UniProtKB sequence residues, which is not a trivial task. We store the residue-by-residue mapping that is provided by the PDBe group in the pdb_residue_data table. Note that the pdb_pfamA_reg table is based on sequences in the uniprot table, and not the pfamseq table. This is to maximise the number of structures that can be mapped to each Pfam entry.
Proteomes

As of Pfam 29.0, all sequences in the pfamseq table belong to a reference proteome, and therefore a complete proteome. Prior to Pfam 29.0 this was not the case. The complete_proteomes table contains statistics about the number of families and coverage. The tables in this section allow you to retrieve domain information about a particular species, or to retrieve all of the species which contain a particular Pfam domain.
Note: The ncbi_code for the species 'Arabidopsis thaliana' is 3702. This information can be found in the ncbi_taxonomy table.
Related families

SCOOP and HHsearch are two pieces of software that we use to help to determine which Pfam-A families are related. The scores from these programs have been a very useful aid in deciding which Pfam-A families should belong to the same clan. As a rough guide, a SCOOP score greater than 50 or an HHsearch E-value of less than 0.01 is an indication that two families are closely related.
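The rough guide above can be written as a small check. This is only a sketch of the stated thresholds, not part of any Pfam tool; the function and constant names are invented for the example, and a pair that passes should still be reviewed before being treated as related.

```python
from typing import Optional

SCOOP_MIN_SCORE = 50.0      # "greater than 50" in the rough guide
HHSEARCH_MAX_EVALUE = 0.01  # "less than 0.01" in the rough guide


def looks_related(scoop_score: Optional[float] = None,
                  hhsearch_evalue: Optional[float] = None) -> bool:
    """Return True if either score passes the rough guide.

    A score that is not available (None) is simply ignored.
    """
    scoop_ok = scoop_score is not None and scoop_score > SCOOP_MIN_SCORE
    hh_ok = hhsearch_evalue is not None and hhsearch_evalue < HHSEARCH_MAX_EVALUE
    return scoop_ok or hh_ok


if __name__ == "__main__":
    # Invented scores, for illustration only.
    print(looks_related(scoop_score=75.2))       # passes on the SCOOP score
    print(looks_related(hhsearch_evalue=0.5))    # fails on the HHsearch E-value
```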
Data Files - Alignments, trees and HMMs

The seed, full, UniProtKB, NCBI, representative proteome and metaseq alignments are all stored as gzipped files in the database, as is the HMM for each family. Note that the NCBI and metaseq alignments may contain overlapping matches to Pfam-A families that belong to the same clan, whereas the UniProtKB alignments (seed, full, uniprot and representative proteome sets) will not. This is because we have performed a clan filtering step on the UniProtKB data such that, where there are overlapping Pfam-A matches within a clan, only the match with the lowest E-value is included in the full alignment.
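For users who want to apply a similar filter to the NCBI or metaseq matches, the clan filtering step could be sketched as follows. This is an illustration of the rule described above, not the code Pfam runs: the Match tuple, the greedy best-first order and the example clan and family names are assumptions.

```python
from __future__ import annotations

from typing import NamedTuple, Optional


class Match(NamedTuple):
    """One Pfam-A match on a single sequence (hypothetical structure)."""
    family: str
    clan: Optional[str]  # None when the family does not belong to a clan
    start: int
    end: int
    evalue: float


def overlaps(a: Match, b: Match) -> bool:
    """True if the two matches share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def clan_filter(matches: list[Match]) -> list[Match]:
    """Among overlapping matches from the same clan, keep the lowest E-value.

    Matches are visited from best to worst E-value (a greedy choice assumed
    for this sketch); a match is dropped if it overlaps an already kept
    match from the same clan. Matches without a clan are always kept.
    """
    kept: list[Match] = []
    for m in sorted(matches, key=lambda x: x.evalue):
        clash = any(
            m.clan is not None and m.clan == k.clan and overlaps(m, k)
            for k in kept
        )
        if not clash:
            kept.append(m)
    return sorted(kept, key=lambda x: x.start)


if __name__ == "__main__":
    # Invented matches on one sequence, for illustration only.
    hits = [
        Match("FamA", "CLAN_X", 10, 100, 1e-20),
        Match("FamB", "CLAN_X", 50, 120, 1e-5),   # overlaps FamA, same clan
        Match("FamC", None, 60, 90, 1e-3),        # no clan, always kept
        Match("FamD", "CLAN_X", 150, 200, 1e-2),  # same clan, no overlap
    ]
    for hit in clan_filter(hits):
        print(hit.family, hit.start, hit.end, hit.evalue)
```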
Pfam FTP site
The Pfam FTP site is organised into the following structure:
|
+- Tools/
|
+- papers/
|
+- current_release/
|   |
|   +- database_files/
|
+- releases/
    |
    +- Pfam23.0/
    |   |
    |   +- database_files/
    |
    +- Pfam22.0/
    |   |
    |   +- database_files/
    |
    +- ...
    |
    +- Pfam1.0/
The most important directory is probably the current_release directory. This contains the flat-files for the current release. Some of these files may be very large (of the order of several hundred megabytes). Please check the sizes on the FTP site before trying to download them over a slow connection. The files, most of which are compressed using gzip, are:
- Pfam-A.dead.gz
- Listing of families that have been deleted from the database
- Pfam-A.fasta.gz
- A 90% non-redundant set of FASTA-formatted sequences for each Pfam-A family. The sequences are only the regions hit by the model and not full-length protein sequences.
- Pfam-A.full.gz
- The full alignments of the curated families, searched against pfamseq/UniProtKB reference proteomes (prior to Pfam 29.0, this file contained matches against the whole of UniProtKB).
- Pfam-A.full.uniprot.gz
- The full alignments of the curated families, searched against UniProtKB.
- Pfam-A.full.metagenomics.gz
- The full alignments of the curated families, searched against Metagenomic proteins.
- Pfam-A.full.ncbi.gz
- The full alignments of the curated families, searched against NCBI GenPept proteins.
- Pfam-A.hmm.dat.gz
- A data file that contains information about each Pfam-A family
- Pfam-A.hmm.gz
- The Pfam HMM library for Pfam-A families
- Pfam-A.seed.gz
- The seed alignments of the curated families
- Pfam-C.gz
- Contains the information about clans and their Pfam-A membership
- active_site.dat.gz
- Tar-ball of data required for the predictions of active sites by Pfam scan.
- database.tar
- A tar-ball of the database_files directory.
- database_files
- Directory containing two files per table from the MySQL database. The .sql.gz file contains the table structure; the .txt.gz file contains the content of the table as a tab-delimited file with each field enclosed by single quotes (').
- diff.gz
- Stores the change status of entries between this release and last.
- metapfam.gz
- ASCII representation of the domain structure of Metagenomic proteins according to Pfam
- metaseq.gz
- Metagenomic sequence database used in this release
- ncbi.gz
- NCBI GenPept sequence database used in this release.
- ncbipfam.gz
- ASCII representation of the domain structure of GenPept proteins according to Pfam
- pdbmap.gz
- Mapping between PDB structures and Pfam domains.
- pfamseq.gz
- A fasta version of Pfam's underlying sequence database
- relnotes.txt
- Release notes
- swisspfam.gz
- ASCII representation of the domain structure of UniProt proteins according to Pfam
- uniprot_sprot.dat.gz
- Data files from UniProt containing SwissProt annotations.
- uniprot_trembl.dat.gz
- Data files from UniProt containing TrEMBL annotations.
- userman.txt
- File containing information about the flatfile format
- Pfam-A.regions.tsv.gz
- A tab separated file containing UniProtKB sequences and Pfam-A family information
- Pfam-A.clans.tsv.gz
- A tab separated file containing Pfam-A family and clan information for all Pfam-A families
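As a minimal sketch, the tab-delimited, single-quoted .txt.gz table files in database_files could be read with Python's standard csv and gzip modules as shown below. The function names and example values are invented, and the sketch makes no attempt to handle escaped quote characters inside a field or NULL markers, so check the conventions used in the real files before relying on it.

```python
from __future__ import annotations

import csv
import gzip
from typing import Iterable, Iterator


def parse_rows(lines: Iterable[str]) -> Iterator[list[str]]:
    """Split tab-delimited lines whose fields are enclosed in single quotes."""
    reader = csv.reader(lines, delimiter="\t", quotechar="'")
    for row in reader:
        yield row


def read_table_dump(path: str) -> Iterator[list[str]]:
    """Stream rows from a gzipped table dump without unpacking it to disk."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        yield from parse_rows(handle)


if __name__ == "__main__":
    # Two invented rows in the quoting style described above.
    sample = ["'ACC1'\t'first family'\t'12'\n", "'ACC2'\t'second family'\t'7'\n"]
    for fields in parse_rows(sample):
        print(fields)
```

The column order for each table is given by the matching .sql.gz file, so the two files are best read together.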
The papers directory contains each NAR database issue article describing Pfam. For a detailed description of the latest changes to Pfam, please consult (and cite) these papers.
The releases directory contains all the flat files and database dumps (where appropriate) for all versions of Pfam to date. The files in more recent releases are the same as described for the current release, but in older releases the contents do change.
The Tools directory contains code for running pfam_scan.pl. The README file in this directory contains detailed information on how to install and run the script. Note that we have gone for a modular design for the script, enabling the functionality of the script to be easily incorporated into other Perl scripts. The ChangeLog file lists the versions and changes to the current version of pfam_scan.pl (and modules). There is also an archived version of pfam_scan.pl that works with HMMER2. This is no longer supported. There is also Perl code for predicting active sites in the ActSitePred directory, the functionality of which has been rolled into the latest version of pfam_scan.pl.
The top level directory also contains the following two files:
- COPYRIGHT
- Copyright notice for Pfam
- GNULICENSE
- The full text of the GNU Library General Public License under which Pfam is licensed
It also contains a further directory, sitesearch, that contains a subset of information from Pfam in an XML file. This XML file, primarily for use by the Sanger Web Team, is indexed using Lucene and used in the WTSI site search. This is updated at each release.
Privacy issues
This section outlines the ways in which the Pfam website handles information about users. This should not be read as a legal document, but as a description of how we handle information that could be considered sensitive. It should be read in conjunction with the privacy policy documents of the individual Pfam consortium member sites. If you have any concerns about the way that information is used in the website, please contact us at the address given at the bottom of the page and we will be more than happy to discuss your concerns.
Although we make every possible effort to keep this site and the data that it manipulates safe and secure, we make no claim to be able to protect sensitive or privileged information. If you are at all concerned about sensitive information being released, please do not use the site and consider installing the Pfam database and/or this website locally.
Google analytics
We use Google analytics (GA) to track the usage of this website. GA uses a single-pixel "web bug" image, which is served from every page, a JavaScript script that collects information about each request, and cookies that maintain information about your usage of the site between visits. You can read more about how GA works on the GA website, which includes a detailed description of how traffic is tracked and analysed.
We use the information generated by GA purely for audit and accounting purposes, and to help us assess the usefulness and popularity of different features of the site. It does not provide the ability to track individual users' usage of the site. However, GA does provide a high-level overview of the traffic that passes through the site, including such information as the approximate geographical location of users, how often and for how long they visited the site, etc.
We understand that this level of tracking may be worrying to some of our users. If you have any concerns about our use of Google analytics, please feel free to contact us.
Browsing
All web servers maintain fairly detailed logs of their activity. This includes keeping a record of every request that they serve, usually along with the IP address of the client that made the request. This is true of the web servers that host the Pfam websites.
Although our servers do collect information about your IP address during the normal process of serving the Pfam website, we do not use this information explicitly. The Pfam group uses server logs only to help with development and debugging of the site.
Searches
The sequence search feature of the site allows you to upload a protein or DNA sequence to be searched against our library of HMMs. The sequence that you upload is stored in a database and is retrieved by a set of scripts that actually perform the search. Although we do not have any information that could be used to link that sequence to you personally, you should be aware that the sequence itself is accessible to systems administrators and other users who maintain the Pfam site.
The batch search function allows you to submit larger searches, the results of which are emailed to you. Obviously, this requires you to provide identifiable information, namely an email address. However, beyond the routine backups of our databases, we do not store any information about email addresses and sequences in the longer term and we make no attempt to keep track of the searches that a particular user may be performing.
Information from other types of search, such as a keyword search, is held only in the web server logs but, as described above, no attempt is made to interpret these logs except as part of development or debugging of the site.
Cookies
We use the following cookies to maintain some information about you between your visits to the site. The information that is stored cannot be used to identify you personally and cannot be used to track your usage of the site.
Cookie name | Purpose | Criteria |
ts | Timestamp when annotation submission form was loaded in browser | Required |
hide_posts | Keep track of whether blog posts have been hidden in home page | Optional |
In addition to these Pfam-specific cookies, GA uses a series of cookies. You can read more about these in the GA documentation, or in EMBL-EBI's cookie policy.
If you are at all concerned about the use of cookies in the Pfam site, you are free to block all cookies from this site and you should not experience any problems. You may see some unintended behaviour, such as being notified of all new features every time you visit the index page, but the core functionality of the site should be unaffected.
Third-party javascript libraries
This site makes heavy use of JavaScript and relies on JavaScript libraries that are developed by various groups and companies. In order to improve the performance of the Pfam website, we no longer serve these files ourselves, but rely on files that are hosted on third-party web servers. In particular, we use various files that are provided by the AJAX libraries APIs, hosted by Google Code, and components of the Yahoo! User Interface Library (YUI), hosted by Yahoo!.
As these services are provided by commercial sites, it's likely that their usage will be carefully monitored by the companies that provide them. Although the Pfam site does not pass any information about you to these third-party sites, the sites themselves may use cookies to track your usage of the files that they serve. If you are concerned about the privacy implications of this monitoring, you may want to block cookies from the third-party hosting sites.
The Pfam Consortium
Pfam is the product of an international consortium of researchers that was born out of its original development by Erik Sonnhammer, Sean Eddy and Richard Durbin. The current consortium members, their institutes and primary roles are listed below.
European Bioinformatics Institute (EMBL-EBI), UK
- Rob Finn - Team leader
- Alex Mitchell - Curator, lead
- Amaia Sangrador-Vegas - Curator
- Sara El-Gebali - Curator
- Lorna Richardson - Curator
- Simon Potter - Developer, Lead
- Jaina Mistry - Developer
- Matloob Qureshi - Web developer, Lead
- Gustavo Salazar - Web developer
- Aurélien Luciani - Web developer
- Alex Bateman - EMBL-EBI Protein and Protein Families cluster head
Harvard University, USA
- Sean Eddy - Founding developer and author of HMMER software
Stockholm Bioinformatics Center, Sweden
- Erik Sonnhammer - Coordinator of Pfam-Sweden and founding developer
External contributors
Pfam includes families that have been built by external contributors:
NCBI, USA
- Lakshminarayan Iyer
- L. Aravind
- Zhang Dapeng
- Vivek Anantharaman
Sanford-Burnham Medical Research Institute, USA
- Adam Godzik
Previous contributors
- Gabriel Aldam
- Shimelis Assefa
- Matthew Bashton
- Ewan Birney
- Lorenzo Cerrutti
- Yuanyuan Chang
- Jody Clements
- Penny Coggill
- Lachlan Coin
- Robson De Souza
- Richard Durbin
- Kyle Ellrott
- Matthew Fenech
- Kristoffer Forslund
- O. Luke Gavin
- Prasad Gunasekaran
- Sam Griffiths-Jones
- Kevin Howe
- Lukasz Jaroszewski
- Nicola Kerrison
- Marta Llagostera
- Mhairi Marshall
- Nina Mian
- William Mifsud
- Simon Moxon
- Joanne Pollington
- Marco Punta
- Stephen-John Sammut
- Benjamin Schuster-Böckler
- David Studholme
- John Tate
- Benjamin Vella-Briffa
- Corin Yeats
- Arthur Wuster
Pfam is a collaborative venture and we hope to be able to interact with as many people as possible, in order to provide a quality database. Please get in touch with any one of us for more information about Pfam. You can email Pfam using the address found at the bottom of the page.
How to contact Pfam
Contact Pfam
You can contact us in various ways. At the end of every page, you can find a contact email address. Use this email address to contact the Pfam team with a specific query or problem.
We run a central helpdesk, which handles annotation comments, data enquiries and general problems with the Pfam database and websites. We use a request tracking system to monitor emails to the helpdesk, so you should receive an automated response to your email, letting you know that the system has logged your mail and notified us of its arrival.
Mailing list
The Pfam mailing list is a low traffic list that has important announcements, such as releases or major changes.
To join the mailing list send a mail to pfamlist-subscribe@ebi.ac.uk.
If you should want to unsubscribe from the list send a mail to pfamlist-unsubscribe@ebi.ac.uk.
Xfam blog
The Pfam group contributes to the Xfam blog. The blog is used to announce releases, new features and important changes to Pfam, as well as for posts discussing general issues surrounding the Pfam resource. You can see blog posts that are specific to Pfam here.
RSS feed
You can keep in touch with the latest goings on by subscribing to the RSS feed from the Xfam blog.