...
Please READ: Community Standards for Arabidopsis Genetics [download PDF]. Standards of gene nomenclature have been adopted by the Arabidopsis community and should be followed in publications and presentations.
TAIR provides a Gene Class Symbol Registry. You can register a gene symbol currently in use by your lab (e.g. STM) or reserve a gene class symbol (e.g. CYP) here.
Please contact TAIR to request a new locus identifier (i.e., AGI Locus code). Consistency of locus identifiers and avoidance of duplication can only be achieved if individuals do not assign locus identifiers on their own. Once you have the new identifier please contact TAIR to provide any information you have about the function or expression pattern of the new gene. If you are registered at TAIR, you can submit functional annotation information directly or register a new gene symbol. Registration for a TAIR account is free.
...
- Format of chromosomal based nomenclature
- AT (Arabidopsis thaliana)
- 1,2,3,4,5 (chromosome number) or M for mitochondrial or C for chloroplast.
- G (gene; other letters are possible for repeats, etc.)
- 12300 (five-digit code, numbered from top/north to bottom/south of chromosome)
- Chromosome based locus identifiers are assigned to
- protein-coding genes
- RNA-coding genes (snRNAs, rRNAs, tRNAs)
- pseudogenes
- Chromosome based locus identifiers are not assigned to
- transposons
- Usage
- The first AGI locus identifier release used locus identifiers ending in zero (e.g. 10010, 10020, 10030 and so on) so that intervening numbers could be used for newly discovered genes.
- Where there are gaps in the sequence, the first release skipped at least 200 codes for each 100 kb of gap.
- In the first release, some genes were present as fragments because they lie across the boundary of two BACs. Each fragment got its own locus identifier if there was no way to represent the whole gene. These gene fragments were merged into a single locus in later releases, and one of the AGI locus identifiers became obsolete.
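The identifier format described above can be checked mechanically. The following is a minimal sketch, not a TAIR tool: the regular expression is an assumption derived from the format listed above (AT, a chromosome of 1-5, M, or C, one feature-type letter, and a five-digit code), and the names `AGI_PATTERN`, `AgiLocus`, and `parse_agi_locus` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, read off the format description: "AT", chromosome
# (1-5, M, or C), a single feature-type letter (G for genes), five digits.
AGI_PATTERN = re.compile(r"^AT([1-5MC])([A-Z])(\d{5})$", re.IGNORECASE)


class AgiLocus(NamedTuple):
    chromosome: str    # "1".."5", "M" (mitochondrial), or "C" (chloroplast)
    feature_type: str  # "G" for gene; other letters are possible
    code: int          # five-digit position code as an integer


def parse_agi_locus(identifier: str) -> Optional[AgiLocus]:
    """Return the parts of an AGI locus identifier, or None if it does not fit."""
    match = AGI_PATTERN.match(identifier.strip())
    if match is None:
        return None
    chromosome, feature_type, code = match.groups()
    return AgiLocus(chromosome.upper(), feature_type.upper(), int(code))


print(parse_agi_locus("AT2G18190"))
print(parse_agi_locus("AT6G00010"))  # chromosome 6 does not exist, so None
```

The sketch only validates the shape of an identifier; it does not check whether the locus exists or is obsolete.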
...
- Adding new genes
- If there are free ATxGxxxx0 locus identifiers, those will be assigned first, following the rules above. If not, the last digit will be used, leaving space as appropriate, i.e. ...5 if the new gene is in the middle or ...8 if it is close to the neighbor with the higher identifier. If there are no free identifiers at all between the neighboring genes, the nearest free identifier will be used. As often as possible, the sequential numbering of genes along the chromosome will not be disturbed, but users should be aware that adjacent loci are often not in sequential order. This may result from reorientation of BACs or from genes being added to an interval in which no sequential identifiers remain.
- Deleting genes
- Deleted genes are kept in the database so they can be retrieved by searching for the identifier, but are marked "obsolete" and do not appear in database displays. Identifiers from deleted genes are not used again.
- Editing genes
- Consensus in the AGI was that identifiers should be kept constant as long as there are no major changes in the gene model. As long as modifications in the gene model do not lead to a completely new protein (e.g. through use of a different reading frame), the identifier will be kept, even if exon boundaries change or individual exons are added/removed.
- Merging and splitting genes
- Splitting Genes: When it is determined that a locus identifier actually refers to more than one gene (e.g. two genes were mistakenly predicted to be one gene), one of the genes will retain the original gene name and the second will get a new gene name. The gene that retains the original identifier is the one that contains the majority of the sequence from the original locus.
- Merging Genes: When experimental evidence indicates that two genes are actually a single locus (e.g. a full-length cDNA is obtained), the two locus entries will be merged into one and the name that corresponds to the locus with the majority of the sequence will be retained. The second locus identifier will be made obsolete (but kept associated with the locus identifier of the merged gene).
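The majority-of-sequence rule for merges can be illustrated with a short sketch. This is an illustration under stated assumptions, not TAIR's procedure: the function `merge_loci` is hypothetical, "majority of sequence" is approximated by comparing sequence lengths, and the identifiers and lengths in the usage line are invented for the example.

```python
from typing import Dict, List, Tuple


def merge_loci(sequence_lengths: Dict[str, int]) -> Tuple[str, List[str]]:
    """Pick the identifier to retain when loci are merged.

    sequence_lengths maps each locus identifier to the amount of sequence
    it contributes (assumed here to be a plain length in base pairs).
    Returns the retained identifier and the identifiers made obsolete.
    """
    if len(sequence_lengths) < 2:
        raise ValueError("need at least two loci to merge")
    # The locus contributing the most sequence keeps its identifier.
    retained = max(sequence_lengths, key=lambda locus: sequence_lengths[locus])
    # Every other identifier becomes obsolete and is never reused.
    obsolete = sorted(locus for locus in sequence_lengths if locus != retained)
    return retained, obsolete


# Hypothetical identifiers and lengths, for illustration only.
retained, obsolete = merge_loci({"AT1G00010": 1200, "AT1G00020": 450})
print(retained, obsolete)
```

In a real record the obsolete identifier would stay associated with the retained one so that searches for it still resolve, as the text above describes.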
- Tracking history
- Notes about splits and merges will be kept, as well as the different versions of the locus sequence. Versions are identified by locus identifier, source, and date. For example, AT2G18190 was later split into two entries, AT2G18190 and AT2G18193, with a note indicating that the second entry resulted from a split of AT2G18190. You can look up the history of an AGI identifier in TAIR using Advanced Gene Search > Get Locus History, and download lists of locus names that are obsolete or in use.
- What terms in history tracking refer to:
- delete means a gene model has been eliminated
- merge means a gene model has been merged with another gene but retained old name
- mergedelete means a gene model has been merged but its name has not been retained
- insert means a gene model has been inserted from scratch
- split means a gene model has been split but has retained its name
- splitinsert means a gene model has been split and has a new name
- new means a gene model has been generated
- obsoleted means a gene model has disappeared
- The terms new and obsoleted may describe MIPS data when it is unknown if an insert or delete was due to a splitinsert or mergedelete.
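The history vocabulary above can be summarized as a lookup from term to whether the identifier named in the record is still in use after the event. This is a sketch based on one reading of the definitions above, not an official TAIR table; `HISTORY_TERMS` and `identifier_in_use` are hypothetical names.

```python
# True means the identifier in the history record remains in use after
# the event; False means it was retired. The True/False values are an
# interpretation of the term definitions, not data supplied by TAIR.
HISTORY_TERMS = {
    "delete": False,       # gene model eliminated
    "merge": True,         # merged with another gene, old name retained
    "mergedelete": False,  # merged, name not retained
    "insert": True,        # inserted from scratch
    "split": True,         # split, name retained
    "splitinsert": True,   # split, carries a new name
    "new": True,           # gene model generated
    "obsoleted": False,    # gene model disappeared
}


def identifier_in_use(term: str) -> bool:
    """Report whether a history term leaves the recorded identifier in use."""
    key = term.strip().lower()
    if key not in HISTORY_TERMS:
        raise ValueError(f"unknown history term: {term!r}")
    return HISTORY_TERMS[key]


print(identifier_in_use("split"))
print(identifier_in_use("mergedelete"))
```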
...
Other Notes:
Generally, the idea is to be as conservative as possible. The identifiers should identify a specific chromosome locus, not a particular protein, and even if an identifier is used in an old publication, it should still direct a user to the current annotation for that locus, so that they will be able to see that the protein sequence has changed in the meantime. This is preferable to assigning a new identifier after modifications, where the user would first have to look up the current annotation for this locus. Keeping backwards-compatible versions of all entries cannot be achieved, and identifiers should not be a way of "versioning" genes.
...
The 'G' convention is useful as repeats (r) will soon be annotated, initially as markers. Pseudogenes will be numbered like functional genes.
Genes are numbered in order from the top to the bottom of the chromosomes. In the case of chr 2 and 4, this boundary is known due to the presence of rDNA clusters. Gene AT4G00010 is the first gene south of the cluster. Gene order is defined in units of 10, i.e. 00010, 00020, 00030, etc., allowing 9000 genes per chromosome.
If new genes are found between two annotated genes, either by experiment or by improved gene-finding programs, these will be numbered as: 00010, 00012,3,4,-9. This gives plenty of room for expansion.
Different versions of a gene product, e.g. from a differentially spliced gene, are denoted as 00010.1,2,3 etc.
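The version suffix can be separated from the locus identifier with a simple split on the dot. The following is a minimal sketch; `split_gene_model` is a hypothetical helper, and the full name `AT4G00010.2` is an assumed example of the suffix attached to a complete identifier.

```python
from typing import Optional, Tuple


def split_gene_model(name: str) -> Tuple[str, Optional[int]]:
    """Split a name such as 'AT4G00010.2' into its locus and version number.

    Returns (locus, None) when there is no version suffix.
    """
    locus, dot, version = name.strip().partition(".")
    if not dot:
        return locus, None
    if not version.isdigit():
        raise ValueError(f"version suffix is not a number: {name!r}")
    return locus, int(version)


print(split_gene_model("AT4G00010.2"))
print(split_gene_model("AT4G00010"))
```

Keeping the locus and the version separate reflects the principle stated above: the identifier names a chromosomal locus, while the suffix distinguishes individual gene products at that locus.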
Where there are sequence gaps, often of uncertain size and content (e.g. CEN2 and CEN4), the sequence groups will leave a space equivalent to 100 - 200 genes. Where the top arm telomeres have not yet been reached, a gap equivalent to about 50 genes should be left, i.e. numbering will start 05000, 05010, etc.
The numbering of repeats will follow an independent system, where repeat ids are not interpolated between gene identities.
Please don't worry that the BAC naming conventions will be lost or erased from the records. We realize these are presently the most commonly used names, therefore the databases will have a simple way of relating the two naming conventions. Note that a single "AT4G00650" gene can have two BAC names, due to overlaps, and this is one of the reasons for implementing the new nomenclature. You will be able to search for an individual gene with this new name.
We believe this system conforms to that used in other organisms, and will be very useful to the community.
...
Before selecting a gene name or symbol, check whether it is already listed on the Mutant Gene Symbol list or search for it with Arabidopsis GeneHunter. GeneHunter is a text-based search tool that scans TAIR, the Mutant Gene Name Registry, GenBank, PubMed, Swiss-Prot, PIR, MIPS, AGR, Mendel-CPGN, and the journals Plant Cell and Plant Physiology for the input string (e.g. gene name or symbol) and, where appropriate, the term Arabidopsis thaliana. Also double-check Google Scholar for names that may not have made it into TAIR yet. Do not use names or symbols for Arabidopsis genes that are already in use by other researchers.
...
Certain objects in TAIR's database, such as genes, clones, clone ends, and some insertions, can be accessed by searching with the associated GenBank accession number. Each accession number in GenBank is unique. See http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#AccessionB for information about GenBank accession numbers.