...
Please READ: Community Standards for Arabidopsis Genetics [download PDF]. Standards of gene nomenclature have been adopted by the Arabidopsis community and should be followed in publications and presentations.
TAIR provides a Gene Class Symbol Registry. You can register a gene symbol currently in use by your lab (e.g. STM) or reserve a gene class symbol (e.g. CYP) here.
Please contact TAIR to request a new locus identifier (i.e., AGI Locus code). Consistency of locus identifiers and avoidance of duplication can only be achieved if individuals do not assign locus identifiers on their own. Once you have the new identifier please contact TAIR to provide any information you have about the function or expression pattern of the new gene. If you are registered at TAIR, you can submit functional annotation information directly or register a new gene symbol. Registration for a TAIR account is free.
...
Designation of unique locus identifiers is performed as part of the genome sequence annotation at TAIR. The following section describes the syntax of chromosome based locus nomenclature and how locus identifiers are assigned. In some cases locus identifiers have been made obsolete. If you have information about a sequenced locus that has not been given a locus identifier, please contact curator@arabidopsis.org.
Guidelines for use of unique gene
...
ids (modified from MIPS)
- Format of chromosomal based nomenclature
- AT (Arabidopsis thaliana)
- 1,2,3,4,5 (chromosome number) or M for mitochondrial or C for chloroplast.
- G (gene; other letters are possible for repeats, etc.)
- 12300 (five-digit code, numbered from top/north to bottom/south of chromosome)
- Chromosome based locus identifiers are assigned to
- protein-coding genes
- RNA coding genes (sn, r, tRNAs)
- pseudogenes
- Chromosome based locus identifiers are not assigned to
- transposons
- Usage
- The first AGI locus identifier release used locus identifiers ending in zero (e.g. 10010, 10020, 10030 and so on) so that intervening numbers could be used for newly discovered genes.
- Where there are gaps in the sequence, the first release skipped at least 200 codes for each 100 kb of gap.
- In the first release, some genes were present as fragments because they lie across the boundary of two BACs. Each fragment got its own locus identifier if there was no way to represent the whole gene. These gene fragments were merged into a single locus in later releases, and one of the AGI locus identifiers became obsolete.
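The identifier format described above can be checked mechanically. The following is a minimal sketch, not a TAIR tool: the names `AgiLocus` and `parse_agi_locus` are hypothetical, and accepting any uppercase letter in the feature position is an assumption based on the note that letters other than G are possible.

```python
import re
from typing import NamedTuple


class AgiLocus(NamedTuple):
    """Parts of a chromosome-based locus identifier (hypothetical helper type)."""
    chromosome: str  # "1".."5", "M" (mitochondrial) or "C" (chloroplast)
    feature: str     # "G" for gene; other letters are assumed to be allowed
    number: int      # the five-digit code, as an integer


# AT + chromosome + feature letter + five digits
_AGI_RE = re.compile(r"^AT([1-5MC])([A-Z])(\d{5})$")


def parse_agi_locus(identifier: str) -> AgiLocus:
    """Split an identifier such as AT2G18190 into its parts."""
    match = _AGI_RE.match(identifier.strip().upper())
    if match is None:
        raise ValueError(f"not a chromosome-based locus identifier: {identifier!r}")
    chromosome, feature, digits = match.groups()
    return AgiLocus(chromosome, feature, int(digits))


print(parse_agi_locus("AT2G18190"))
```

Because the sketch normalizes case before matching, lowercase input such as `at4g00650` is accepted as well; drop the `.upper()` call if strict matching is preferred.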
...
- Adding new genes
- If there are free ATxGxxxx0 locus identifiers, those will be assigned first, following the rules above. If not, the last digit will be used, leaving space as appropriate, i.e. ...5 if the new gene is in the middle or ...8 if it is close to the neighbor with the higher identifier. If there are no free identifiers between the neighboring genes at all, the nearest free identifier will be used. As far as possible, the sequential numbering of genes along the chromosome will not be disturbed, but users should be aware that adjacent loci are often not in sequential order. This may be due to reorientation of BACs, or to genes being added in an interval in which no sequential identifiers remain.
- Deleting genes
- Deleted genes are kept in the database so they can be retrieved by searching for the identifier, but are marked "obsolete" and do not appear in database displays. Identifiers from deleted genes are not used again.
- Editing genes
- Consensus in the AGI was that identifiers should be kept constant as long as there are no major changes in the gene model. As long as modifications in the gene model do not lead to a completely new protein (e.g. through use of a different reading frame), the identifier will be kept, even if exon boundaries change or individual exons are added/removed.
- Merging and splitting genes
- Splitting genes: When it is determined that a locus identifier actually refers to more than one gene (e.g. two genes were mistakenly predicted to be one gene), one of the genes will retain the original locus identifier and the second will get a new one. The gene that contains the majority of the sequence from the original locus retains the original identifier.
- Merging genes: Where experimental evidence indicates that two genes are actually a single locus (e.g. a full-length cDNA is obtained), the two locus entries will be merged into one, and the identifier of the locus with the majority of the sequence will be retained. The second locus identifier will be made obsolete (but kept associated with the locus identifier of the merged gene).
Tracking history
- Notes about splits and merges will be kept, as well as the different versions of the locus sequence. Versions are identified by locus identifier, source, and date. For example, AT2G18190 was later split into two entries, AT2G18190 and AT2G18193, with a note indicating that the second entry resulted from a split of AT2G18190. You can search TAIR for locus histories by AGI identifier using Advanced Gene Search > Get Locus History, and download lists of locus names that are obsolete or in use.
- What terms in history tracking refer to:
- delete means a gene model has been eliminated
- merge means a gene model has been merged with another gene but retained old name
- mergedelete means a gene model has been merged but its name has not been retained
- insert means a gene model has been inserted from scratch
- split means a gene model has been split but has retained its name
- splitinsert means a gene model has been split and has a new name
- new means a gene model has been generated
- obsoleted means a gene model has disappeared
- The terms new and obsoleted may describe MIPS data when it is unknown if an insert or delete was due to a splitinsert or mergedelete.
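The history terms above determine whether a locus name is still in use after an event. A minimal sketch of that reading follows. The `(locus, term)` record layout and the function name `locus_status` are assumptions made for illustration, not the format of TAIR's locus history files, and the second identifier pair in the example is made up.

```python
# For each history term, does the locus name remain in use after the event?
# This mapping is an interpretation of the definitions listed above.
NAME_IN_USE_AFTER = {
    "delete": False,       # gene model eliminated
    "merge": True,         # merged with another gene, old name retained
    "mergedelete": False,  # merged, name not retained
    "insert": True,        # inserted from scratch
    "split": True,         # split, name retained
    "splitinsert": True,   # split, carries a new name
    "new": True,           # gene model generated
    "obsoleted": False,    # gene model disappeared
}


def locus_status(history):
    """Return {locus: "in use" | "obsolete"} from chronological (locus, term) records."""
    status = {}
    for locus, term in history:
        if term not in NAME_IN_USE_AFTER:
            raise ValueError(f"unknown history term: {term!r}")
        # Later events for the same locus overwrite earlier ones.
        status[locus] = "in use" if NAME_IN_USE_AFTER[term] else "obsolete"
    return status


history = [
    ("AT2G18190", "split"),        # example from the text: keeps its name
    ("AT2G18193", "splitinsert"),  # example from the text: new name from the split
    ("AT1G99991", "merge"),        # made-up identifier, for illustration only
    ("AT1G99992", "mergedelete"),  # made-up identifier, for illustration only
]
print(locus_status(history))
```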
...
Other Notes:
Generally, the idea is to be as conservative as possible. The identifiers should identify a specific chromosome locus, not a particular protein. Even if an identifier was used in an old publication, it should still direct a user to the current annotation for that locus, so that they will be able to see that the protein sequence has changed in the meantime. This is preferable to assigning a new identifier after modifications, where the user would first have to look up the current annotation for this locus. Keeping backwards-compatible versions of all entries cannot be achieved, and identifiers should not be a way of "versioning" genes.
...
The 'G' convention is useful as repeats (r) will soon be annotated, initially as markers. Pseudogenes will be numbered like functional genes.
Genes are numbered in order from the top to the bottom of the chromosomes. In the case of chr 2 and 4 this boundary is known due to the presence of rDNA clusters. Gene AT4G00010 is the first gene south of the cluster. Gene order is defined in units of 10, i.e. 00010, 00020, 00030, etc., allowing 9000 genes per chromosome.
If new genes are found between two annotated genes, either by experiment or by improved gene-finding programs, these will be numbered as: 00010, 00012, 00013, 00014, and so on up to 00019. This gives plenty of room for expansion.
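The interpolation idea can be sketched in code, combined with the last-digit convention given under "Adding new genes" (free identifiers ending in zero first, otherwise ...5 for a gene in the middle or ...8 for one close to the higher neighbor). This is an illustrative reading of those rules, not the procedure TAIR curators actually run; the function names are hypothetical, and choosing the free number closest to the target is an assumption.

```python
from typing import Optional, Set


def suggest_locus_number(lower: int, upper: int, used: Set[int],
                         near_higher: bool = False) -> Optional[int]:
    """Suggest a five-digit code for a new gene between two neighboring codes.

    Returns None when nothing is free between the neighbors; the text says
    the nearest free identifier is then used, which this sketch leaves to
    the caller.
    """
    free = [n for n in range(lower + 1, upper) if n not in used]
    if not free:
        return None
    # Aim for the middle of the interval, or just below the higher neighbor.
    target = upper - 2 if near_higher else (lower + upper) // 2
    # Codes ending in zero are assigned first when any are free.
    tens = [n for n in free if n % 10 == 0]
    pool = tens or free
    return min(pool, key=lambda n: abs(n - target))


def format_locus(chromosome: str, number: int) -> str:
    """Build a gene locus identifier such as AT4G00650."""
    return f"AT{chromosome}G{number:05d}"


used = {10010, 10020}
print(suggest_locus_number(10010, 10020, used))                    # 10015
print(suggest_locus_number(10010, 10020, used, near_higher=True))  # 10018
print(format_locus("4", 650))                                      # AT4G00650
```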
Different versions of a gene product, e.g. from a differentially spliced gene, are denoted as 00010.1, 00010.2, 00010.3, etc.
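When handling such names in scripts, the version suffix can be separated from the locus identifier. A minimal sketch follows; the helper name `split_gene_model` is hypothetical.

```python
def split_gene_model(name: str):
    """Split a name like AT4G00650.2 into (locus identifier, version number)."""
    locus, dot, version = name.partition(".")
    if not dot or not version.isdecimal():
        raise ValueError(f"no numeric version suffix in {name!r}")
    return locus, int(version)


print(split_gene_model("AT4G00650.2"))  # ('AT4G00650', 2)
```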
Where there are sequence gaps, often of uncertain size and content (e.g. CEN2 and CEN4), the sequence groups will leave a space equivalent to 100 - 200 genes. Where the top arm telomeres have not yet been reached, a gap equivalent to about 50 genes should be left, i.e. numbering will start 05000, 05010, etc.
The numbering of repeats will follow an independent system, where repeat ids are not interpolated between gene identities.
Please don't worry that the BAC naming conventions will be lost or erased from the records. We realize these are presently the most commonly used names, therefore the databases will have a simple way of relating the two naming conventions. Note that a single "AT4G00650" gene can have two BAC names, due to overlaps, and this is one of the reasons for implementing the new nomenclature. You will be able to search for an individual gene with this new name.
We believe this system conforms to that used in other organisms, and will be very useful to the community.
...
A major source of problems occurs when more than one published name is associated with the same gene or when the same gene symbol is assigned to more than one gene. An example of the former is EMB30, which is also known as GNOM, and of the latter is the symbol FDH, which has been used for both FORMATE DEHYDROGENASE and FIDDLEHEAD. These problems have been addressed, in part, by the establishment of a gene name registry for genes identified by mutation (Meinke and Koornneef, 1997). As there are many cases where the same gene has been published under several names, TAIR maintains a list of aliases associated with each gene (see below). See section: Choosing a unique gene symbol.
...
Naming genes based upon mutant phenotype
Please refer to Meinke and Koornneef, 1997 for a discussion and examples of naming genes based upon mutant phenotype. This manuscript provides instructions for developing mutant gene names/symbols, proper nomenclature for publication and community standards for genetic analysis of mutant phenotypes. Mutant gene names are generally based upon one or more aspects of the mutant phenotype (e.g. NON-PHOTOTROPIC HYPOCOTYL1) or a genetic interaction such as SUPPRESSOR OF PHYA-105. Gene symbols are three letters and may or may not derive from the full name (e.g. NON-PHOTOTROPIC HYPOCOTYL1; NPH1 or ENHANCER OF AGAMOUS; HUA). For publications and presentations, mutant gene names and symbols are lowercase and italicized, and wild type alleles are uppercase and italicized. Protein products of genes are uppercase and not italicized. To help alleviate the problems associated with duplication of gene names, a mutant gene name registry has been created. Names and symbols for mutant genes should be registered with the curator of mutant gene names (Dr. David Meinke) along with map location and a description of the mutant phenotype (http://mutant.lse.okstate.edu/genepage/genepage.html).
...
Before selecting a gene name/symbol, check for the name/symbol on the Mutant Gene Symbol list or use Arabidopsis GeneHunter. The GeneHunter program is a text-based searching tool that scans TAIR, the Mutant Gene Name Registry, GenBank, PubMed, Swiss-Prot, PIR, MIPS, AGR, Mendel-CPGN and the journals Plant Cell and Plant Physiology for the input string (e.g. gene name or symbol) and, where appropriate, the term Arabidopsis thaliana. Also double-check Google Scholar for names that may not have made it into TAIR yet. Do not use names or symbols for Arabidopsis genes that are already in use by other researchers.
...
The following table lists prefixes used by functional genomics projects for naming polymorphisms. This information can be used to search for all of the polymorphisms identified by a project. For example, to find all of the deletions identified by the Stanford Genome Sequencing Center you can search by Polymorphism name starts with SGC and type is deletion.
Prefix | Source | Comment |
---|---|---|
SGC | Stanford Genome Center | Includes insertions, deletions and single nucleotide polymorphisms |
CER | Cereon Genomics | Includes insertions, deletions and single nucleotide polymorphisms. Available only to registered users from non-profit and academic institutions. |
Names used for large sets of T-DNA or transposon insertions
...
The following table lists prefixes used by functional genomics projects for naming T-DNA insertions. This information can be used to search for all of the insertion lines generated by a project. For example, to find all of the T-DNA insertion lines generated by Joe Ecker's group at the Salk Institute you can search by Polymorphism name starts with SALK.
Prefix | Source | Comment |
---|---|---|
SALK | Joe Ecker et al. | Sequence-indexed library of insertion mutations generated using the pROK2 T-DNA vector. |
SGT | V. Sundaresan et al. | Gene trap lines from the Institute for Molecular Agrobiology (IMA) |
SET | V. Sundaresan et al. | Enhancer trap lines from the Institute for Molecular Agrobiology (IMA) |
Clone and Vector Names
Arabidopsis clones are usually named with the acronym of the vector followed by the plate and row numbers of the isolated clone. For example, CIC (YAC), T (TAMU BAC) and F (IGF BAC) are some common vector acronyms. The following table gives information about nomenclature for clones and vectors in TAIR. You can use the prefix in a wild card search for all clones from a particular source or vector. For example, you can use the DNA search to find all TAMU clones by choosing clone name [starts with] T.
Vector type | Clone Prefix | Vector Name | Source | Description |
---|---|---|---|---|
BAC | T | pBeLoBAC11 | TAMU (Texas A&M University) | from bacterial artificial chromosome library used for genomic sequencing |
BAC | F | pBELoBACkan | IGF (Institut fur Genbiologische) | from bacterial artificial chromosome library used for genomic sequencing |
P1 | M | pAd10sacBII | Mitsui et al. | from bacteriophage P1 library used for genomic sequencing |
TAC | K | pTAC-YL7 | Kazusa | transformation-competent bacterial artificial chromosome vector |
YAC | CIC | pYAC4 | CEPH/INRA/CNRS | from yeast artificial chromosome library |
YAC | EG | pYAC41 | Grill and Somerville | from EG1 yeast artificial chromosome library |
YAC | EW | pYAC3 | E. Ward et al. | from yeast artificial chromosome library |
YAC | yUP | pYAC4 | Joe Ecker et al. | from yeast artificial chromosome library |
Cosmid | G | | Howard Goodman et al. | from cosmid library prepared by H. Goodman et al. |
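The prefix convention in the table above lends itself to a simple lookup. The sketch below is an illustration only: the function name `classify_clone` is hypothetical, the clone names in the example are illustrative and were not checked against TAIR, and a single-letter match is necessarily loose because any name beginning with that letter will match.

```python
# Clone prefix -> vector type, taken from the table above.
CLONE_PREFIXES = {
    "T": "BAC",
    "F": "BAC",
    "M": "P1",
    "K": "TAC",
    "CIC": "YAC",
    "EG": "YAC",
    "EW": "YAC",
    "yUP": "YAC",
    "G": "Cosmid",
}


def classify_clone(name: str):
    """Return (prefix, vector type) for a clone name, or None if no prefix matches."""
    # Try longer prefixes first so a multi-letter prefix is never shadowed
    # by a single-letter one. Matching is case-sensitive (note "yUP").
    for prefix in sorted(CLONE_PREFIXES, key=len, reverse=True):
        if name.startswith(prefix):
            return prefix, CLONE_PREFIXES[prefix]
    return None


for clone in ["CIC5B3", "F3F20", "yUP8H12", "X1"]:  # illustrative names
    print(clone, classify_clone(clone))
```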
GenBank Accessions
Certain objects such as genes, clones, clone ends and some insertions in TAIR's database can be accessed by searching with the associated GenBank accession number. Each accession number in GenBank is unique. See http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#AccessionB for information about GenBank accession numbers.