Motif Finder


Using TAIR's Motif Finder

This program compares the frequency of 6-mer 'words' in a set of query sequences against the frequency of the same words in the upstream sequences of the most recent genome sequence release. The input sequences can be obtained by downloading 500 or 1000 base pairs upstream of a locus, or you can input your own set of sequences. A word can be found on the forward or reverse strand. The program was written by Dr. Rob Ewing (formerly of the Arabidopsis Functional Genomics Consortium).

A. Defining the query sequence set

The Motif Finder takes either sequences or a list of locus identifiers as input. Sequences should be in FASTA format. AGI locus identifiers (e.g. AT1g01010) must be separated by tabs, commas, carriage returns or spaces. Select a dataset (500 or 1000 bp upstream) to use. If you use locus IDs, the software automatically extracts either 500 or 1000 base pairs upstream of each locus, depending on your selection. To obtain other sets of sequences, such as 3000 base pairs upstream or intron sequences, use TAIR's gene search to download bulk sequences in FASTA format.

B. Choosing a dataset

For example, to compare the frequency of a 6-mer in 500 base pairs of upstream sequence, choose the 500 bp upstream dataset.

C. Understanding the results set

The results of the query can be displayed as an HTML file in your browser window or as a tab-delimited file which you can download and open in a spreadsheet program such as Excel. The results show how many upstream sequences were analyzed in total and include only oligos found more than three times. The results are sorted by significance (p-value). The first column is the 6-mer sequence. The second column shows the number of times the 6-mer was found in the query set. The third column lists the total number of times the 6-mer was found in the genomic set.
The fourth and fifth columns show the ratio of the 6-mer frequency to the number of sequences in the query and genomic datasets, respectively. The sixth column shows the probability score for the 6-mer. Scores closer to zero are considered more relevant (less likely to be due to random chance). The last column lists the names of the query sequences that contain the motif.

How p-values are calculated

The motif counts are first calculated for the background dataset, i.e. all the upstream sequences (separately for the 1000 bp and 500 bp sets). The counts are generated the same way for both the background sequences and the query sequences (the ones the user submits). This is done as follows:

1. All possible oligomers of 1-7 bp are generated as a preprocessing step. They are stored in a hash table with the oligomer sequence as the key and a unique index as the value.

2. For each sequence in the set being counted (whether the background or the query), a sliding window (or frame) is used, starting at the first base pair and moving over one bp at a time up to 6 (or whatever the oligo length is), to capture overlapping occurrences of oligomers of size 6. For each of these iterations, all 6-mers within that particular subsequence are found and their counts are stored. At the end of the count for each sequence, the counts are clipped at 1 to give the normalized count (i.e. if the oligomer occurs one or more times in the same sequence, it still counts only once for that sequence).

3. The background normalized counts and the query normalized counts, along with the size of each set (background and query), are then passed to the binomial probability calculator, which computes the probability of the observed count for each oligomer in the query set, using the probability of that oligomer in the background set.
The probability in the background set is simply the normalized count of the oligomer (as above) divided by the total number of sequences in the background set.

4. The actual binomial formula used is:

choose(n,k) * p^k * (1-p)^(n-k)

where

choose(n,k) = the number of ways to pick k objects from a set of n objects when order is not considered (the script computes combinations in the standard way)
n = the total number of sequences in the query set
k = the normalized count of the current oligomer in the query set
p = the probability of the current oligomer occurring in the background (total) set of upstream sequences (normalized)

Note: P-values are intended for ranking and identifying the most likely candidate motifs; no p-value correction is made for multiple testing.
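
The counting and scoring steps described above can be sketched in Python. This is only an illustrative sketch, not the original script: the function names (reverse_complement, presence_counts, binomial_probability, motif_probabilities) are hypothetical, every overlapping window is scanned directly instead of pre-generating a hash table of all oligomers, and counting a word on both strands by adding its reverse complement is an assumption about how the forward/reverse strand search might be handled. The score is the binomial point probability exactly as written in the formula, not a cumulative tail.

```python
from math import comb

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def presence_counts(sequences, k=6, both_strands=True):
    """Normalized counts: for each k-mer, the number of sequences
    that contain it at least once (per-sequence count clipped at 1)."""
    counts = {}
    for seq in sequences:
        seq = seq.upper()
        seen = set()  # a set clips repeated hits in one sequence to 1
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):  # skip windows with N, gaps, etc.
                seen.add(word)
                if both_strands:  # assumption: reverse strand also counts
                    seen.add(reverse_complement(word))
        for word in seen:
            counts[word] = counts.get(word, 0) + 1
    return counts


def binomial_probability(n, k, p):
    """choose(n,k) * p^k * (1-p)^(n-k), as in the formula above."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def motif_probabilities(query, background, k=6, both_strands=True):
    """Score every k-mer found in the query set against the background."""
    if not background:
        raise ValueError("background set must not be empty")
    bg_counts = presence_counts(background, k, both_strands)
    q_counts = presence_counts(query, k, both_strands)
    n = len(query)
    scores = {}
    for word, count in q_counts.items():
        p = bg_counts.get(word, 0) / len(background)  # background probability
        scores[word] = binomial_probability(n, count, p)
    return scores


if __name__ == "__main__":
    # Toy sequences for illustration only; real input would be upstream
    # sequences in FASTA format.
    background = ["AAAAAAA", "AAAAAAC", "CCCCCCC", "GGGGGGG"]
    query = ["AAAAAAC"]
    scores = motif_probabilities(query, background, both_strands=False)
    for word, score in sorted(scores.items(), key=lambda item: item[1]):
        print(word, score)
```

Sorting the scores in ascending order mirrors the way the results are ranked, with values closer to zero listed first. Because no multiple-testing correction is applied, the values are best used for ranking candidates only.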