BLAST stands for Basic Local Alignment Search Tool. It was developed by Altschul et al. (1990) and significantly improved by Altschul et al. (1997). It is a very fast search algorithm that is used to search protein or DNA databases separately. BLAST is best used for sequence similarity searching rather than for motif searching. For searches using a query sequence of fewer than twenty residues, PatMatch is the best choice. To search nonplant datasets, try NCBI BLAST.
A fairly complete on-line guide to BLAST searching can be found at the NCBI BLAST Help Manual. For a theoretical overview of BLAST, see the NCBI BLAST Course.
BLAST Methods
The NCBI BLAST family of programs includes:
- blastp: compares an amino acid query sequence against a protein sequence database.
- blastn: compares a nucleotide query sequence against a nucleotide sequence database.
- blastx: compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
- tblastn: compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
- tblastx: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
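The choice of program follows from the type of the query and the type of the database. A minimal sketch of that rule is shown here; the function name `choose_program` and the `translate_both` flag are hypothetical names used only for illustration and are not part of any BLAST tool.

```python
# Hypothetical lookup: (query type, database type) -> BLAST program.
PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_program(query_type, database_type, translate_both=False):
    """Return the program name for the given query and database types.

    translate_both=True selects tblastx, which compares six-frame
    translations of a nucleotide query and a nucleotide database.
    """
    key = (query_type, database_type)
    if translate_both:
        if key != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    if key not in PROGRAMS:
        raise ValueError(f"unknown combination: {key}")
    return PROGRAMS[key]
```

For example, `choose_program("protein", "nucleotide")` selects tblastn, the program for a protein query against a translated nucleotide database.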
TAIR Datasets
The following datasets are used in NCBI BLAST, PatMatch and the bulk sequence download tools.
Section 1: Araport11 datasets (updated June 2016) |
---|
Dataset | Type | Description | Source |
---|
Araport11 Transcripts (- introns, + UTRs) | DNA | All Arabidopsis transcripts including predicted sequences. This dataset contains the UTRs but not the introns. Note that not ALL transcript sequences will include UTRs. | Araport11 (June 2016) |
Araport11 CDS (- introns, - UTRs) | DNA | All Arabidopsis coding sequences including predicted sequences. Similar to the transcript files but lacking the 5' and 3' UTRs. | Araport11 (June 2016) |
Araport11 Genes (+ introns, + UTRs) | DNA | All Arabidopsis transcription unit (gene) sequences. | Araport11 (June 2016) |
Araport11 Proteins | Protein | All Arabidopsis Protein sequences. | Araport11 (June 2016) |
Araport11 Loci Upstream Sequences-500bp | DNA | 500bp of sequence preceding the 5' end of each transcription unit. Note: The sequences in this dataset are immediately upstream of the 5'UTR for those genes with annotated UTRs and upstream of the translational start for the remainder. | Araport11 (June 2016) |
Araport11 Loci Upstream Sequences-1000bp | DNA | 1000bp of sequence preceding the 5' end of each transcription unit. Note: The sequences in this dataset are immediately upstream of the 5'UTR for those genes with annotated UTRs and upstream of the translational start for the remainder. | Araport11 (June 2016) |
Araport11 Loci Upstream Sequences-3000bp | DNA | 3000bp of sequence preceding the 5' end of each transcription unit. Note: The sequences in this dataset are immediately upstream of the 5'UTR for those genes with annotated UTRs and upstream of the translational start for the remainder. | Araport11 (June 2016) |
Araport11 Loci Downstream Sequences-500bp | DNA | 500bp of sequence following the 3' end of each transcription unit was used if it had an annotated 3' UTR, otherwise the sequence after the stop codon was used. | Araport11 (June 2016) |
Araport11 Loci Downstream Sequences-1000bp | DNA | 1000bp of sequence following the 3' end of each transcription unit was used if it had an annotated 3' UTR, otherwise the sequence after the stop codon was used. | Araport11 (June 2016) |
Araport11 Loci Downstream Sequences-3000bp | DNA | 3000bp of sequence following the 3' end of each transcription unit was used if it had an annotated 3' UTR, otherwise the sequence after the stop codon was used. | Araport11 (June 2016) |
Araport11 Intergenic | DNA | Contains the intergenic sequence between all the genes in the Arabidopsis genome. All the sequences are taken from the Watson strand irrespective of the direction of the annotated genes. | Araport11 (June 2016) |
Araport11 Intron | DNA | Contains all the introns of every annotated gene in the Arabidopsis genome. | Araport11 (June 2016) |
Araport11 5' UTRs | DNA | Processed 5' UTRs for all Arabidopsis genes with full length cDNA or EST sequences. | Araport11 (June 2016) |
Araport11 3' UTRs | DNA | Processed 3' UTRs for all Arabidopsis genes with full length cDNA or EST sequences. | Araport11 (June 2016) |
Section 2: TAIR10 datasets |
---|
Dataset | Type | Description | Source |
---|
TAIR10 Transcripts (- introns, + UTRs) | DNA | All Arabidopsis transcripts including predicted sequences. This dataset contains the UTRs but not the introns. Note that not ALL transcript sequences will include UTRs. | TAIR10 (November 2010) |
TAIR10 CDS (- introns, - UTRs) | DNA | All Arabidopsis coding sequences including predicted sequences. Similar to the transcript files but lacking the 5' and 3' UTRs. | TAIR10 (November 2010) |
TAIR10 Genes (+ introns, + UTRs) | DNA | All Arabidopsis transcription unit (gene) sequences. | TAIR10 (November 2010) |
TAIR10 Proteins | Protein | All Arabidopsis Protein sequences. | TAIR10 (November 2010) |
TAIR10 Whole genome (BAC clones) | DNA | Arabidopsis genomic sequences obtained from TIGR, originally sequenced by the Arabidopsis Genome Initiative (AGI) genome sequencing project. The sequences are from BAC, cosmid, TAC, P1, and YAC clones. The ends of these genomic clones were extended by TIGR in some cases using sequence from an adjacent clone to improve overlaps for annotation, resulting in differences when compared with the original GenBank records. | TAIR10 (November 2010) |
TAIR10 Loci Upstream Sequences-500bp | DNA | 500bp of sequence preceding the 5' end of each transcription unit. Note: The sequences in this dataset are immediately upstream of the 5'UTR for those genes with annotated UTRs and upstream of the translational start for the remainder. | TAIR10 (November 2010) |
TAIR10 Loci Upstream Sequences-1000bp | DNA | 1000bp of sequence preceding the 5' end of each transcription unit. Note: The sequences in this dataset are immediately upstream of the 5'UTR for those genes with annotated UTRs and upstream of the translational start for the remainder. | TAIR10 (November 2010) |
TAIR10 Loci Upstream Sequences-3000bp | DNA | 3000bp of sequence preceding the 5' end of each transcription unit. Note: The sequences in this dataset are immediately upstream of the 5'UTR for those genes with annotated UTRs and upstream of the translational start for the remainder. | TAIR10 (November 2010) |
TAIR10 Loci Downstream Sequences-500bp | DNA | 500bp of sequence following the 3' end of each transcription unit was used if it had an annotated 3' UTR, otherwise the sequence after the stop codon was used. | TAIR10 (November 2010) |
TAIR10 Loci Downstream Sequences-1000bp | DNA | 1000bp of sequence following the 3' end of each transcription unit was used if it had an annotated 3' UTR, otherwise the sequence after the stop codon was used. | TAIR10 (November 2010) |
TAIR10 Loci Downstream Sequences-3000bp | DNA | 3000bp of sequence following the 3' end of each transcription unit was used if it had an annotated 3' UTR, otherwise the sequence after the stop codon was used. | TAIR10 (November 2010) |
TAIR10 Intergenic | DNA | Contains the intergenic sequence between all the genes in the Arabidopsis genome. All the sequences are taken from the Watson strand irrespective of the direction of the annotated genes. | TAIR10 (November 2010) |
TAIR10 Intron | DNA | Contains all the introns of every annotated gene in the Arabidopsis genome. | TAIR10 (November 2010) |
TAIR10 5' UTRs | DNA | Processed 5' UTRs for all Arabidopsis genes with full length cDNA or EST sequences. | TAIR10 (November 2010) |
TAIR10 3' UTRs | DNA | Processed 3' UTRs for all Arabidopsis genes with full length cDNA or EST sequences. | TAIR10 (November 2010) |
Section 3: A. thaliana GenBank & Uniprot (last updated in 2010) |
---|
Dataset | Type | Description | Source |
---|
A. thaliana Insertion Flanks | DNA | T-DNA insertion flanking sequences. | Salk Institute (Ecker and colleagues) and Institute of Molecular Agrobiology (IMA, Sundaresan and colleagues), GABI-KAT, Syngenta, GenBank and TAIR user submissions |
A. thaliana UniProt | Protein | All Arabidopsis proteins | UniProt |
A. thaliana GB derived from mRNA | protein | Arabidopsis thaliana protein sequences translated from experimentally isolated mRNA. Excludes proteins predicted from genomic sequences and third party (tpa) annotations. Entrez query used is ‘Arabidopsis thaliana[orgn] NOT "refseq"[Filter] NOT "tpa"[Properties] AND "derived from mrna"[Properties]’ | GenBank |
A. thaliana GB refseq/tpa | protein | Arabidopsis thaliana protein sequences predicted from the whole genome sequence. This is nearly equivalent to the “TAIR8 Proteins” set above (which are the source of the GenBank refseq records) except it also includes a few additional third party (tpa) annotations and the fasta header includes a link to GenBank rather than TAIR. Entrez query used is ‘Arabidopsis thaliana[orgn] AND ( "refseq"[Filter] OR "tpa"[Properties])’ | GenBank |
A. thaliana GB all | protein | All Arabidopsis thaliana proteins from GenBank. Entrez query used is ‘Arabidopsis thaliana[orgn]’ | GenBank |
A. thaliana GB experimental cDNA/EST | DNA | Arabidopsis thaliana experimentally isolated cDNA and EST sequences. Excludes cDNAs predicted from genomic sequences. This set combines sequences from the Core Nucleotide and EST sections of GenBank. Entrez query used is ‘Arabidopsis thaliana[orgn] AND "mrna"[Filter] NOT "refseq"[Filter] NOT "tpa"[Properties]’ | GenBank |
A. thaliana GB refseq/tpa cDNA | DNA | Arabidopsis thaliana cDNA sequences predicted from the whole genome sequence. This is nearly equivalent to the “TAIR8 Transcripts” set above (which is the source of the GenBank refseq records) except it also includes a few additional third party (tpa) annotations and the fasta header includes a link to GenBank rather than TAIR. Entrez query used is ‘Arabidopsis thaliana[orgn] AND "mrna"[Filter] AND ("refseq"[Filter] OR "tpa"[Properties])’ | GenBank |
A. thaliana GB genomic | DNA | All Arabidopsis thaliana genomic sequences from Core Nucleotide and GSS GenBank sections. Includes full BAC sequences, BAC ends, and many other types of genomic sequences. Entrez query used is ‘Arabidopsis thaliana[orgn] AND "genomic dna rna"[Filter]’ | GenBank |
Section 4: Green plant GenBank minus A. thaliana (last updated in 2010) |
---|
Dataset | Type | Description | Source |
---|
Green plant GB derived from mRNA | protein | Viridiplantae (excluding Arabidopsis thaliana) protein sequences translated from experimentally isolated mRNA. Excludes proteins predicted from genomic sequences and third party (tpa) annotations. Entrez query used is ‘viridiplantae[orgn] NOT “Arabidopsis thaliana”[orgn] NOT "refseq"[Filter] NOT "tpa"[Properties] AND "derived from mrna"[Properties]’ | GenBank |
Green plant GB refseq/tpa | protein | Viridiplantae (excluding Arabidopsis thaliana) protein sequences from the GenBank Reference Sequence section. These are mainly predicted proteins from large genomic sequencing and annotation projects. Entrez query used is ‘viridiplantae[orgn] NOT “Arabidopsis thaliana”[orgn] AND ( "refseq"[Filter] OR "tpa"[Properties])’ | GenBank |
Green plant GB all | protein | All Viridiplantae (excluding Arabidopsis thaliana) proteins from GenBank. Entrez query used is ‘viridiplantae[orgn] NOT “Arabidopsis thaliana”[orgn]’ | GenBank |
Green plant GB experimental cDNA/EST | DNA | Viridiplantae (excluding Arabidopsis thaliana) experimentally isolated cDNA and EST sequences. Excludes cDNAs predicted from genomic sequences. This set combines sequences from the Core Nucleotide and EST sections of GenBank. Entrez query used is ‘viridiplantae[orgn] AND "mrna"[Filter] NOT “Arabidopsis thaliana”[orgn] NOT "refseq"[Filter] NOT "tpa"[Properties]’ | GenBank |
Green plant GB refseq/tpa cDNA | DNA | Viridiplantae cDNA sequences (excluding Arabidopsis thaliana) from the GenBank Reference Sequence section. These are mainly predicted cDNAs from large genomic sequencing and annotation projects. Entrez query used is ‘viridiplantae[orgn] AND "mrna"[Filter] NOT “Arabidopsis thaliana”[orgn] AND ("refseq"[Filter] OR "tpa"[Properties])’ | GenBank |
Green plant GB genomic | DNA | All Viridiplantae genomic sequences (excluding Arabidopsis thaliana) from Core Nucleotide and GSS GenBank sections. Includes full BAC sequences, BAC ends, and many other types of genomic sequences. Entrez query used is ‘viridiplantae[orgn] AND "genomic dna rna"[Filter] NOT “Arabidopsis thaliana”[orgn]’ | GenBank |
Entering query sequences
When pasting sequences into the text box, be aware that a single sequence is limited to 7000 characters in length; when you paste multiple sequences (up to five are allowed) you are limited to a total of 15,000 characters. These limitations may be changed in the future. If you have a longer sequence, or many sequences, use the file upload feature. This feature is not supported on some versions of Microsoft's Internet Explorer web browser. If you do not see a "Browse..." button near the file upload text box (that displays your computer's filesystem directory when clicked), we suggest using Netscape or another browser supporting file uploading.
Multiple query sequences
To submit multiple query sequences, paste up to 5 sequences into the input box, or upload a file containing the concatenated sequences in FASTA format. For this option the files cannot be in raw format because they will be interpreted as a single query sequence. For NCBI BLAST it may be possible to upload more than five sequences, depending on the length of the query sequences and the size of the target database.
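The paste limits described above (7000 characters for a single sequence, 15,000 characters in total, at most five sequences) can be checked before submitting. The sketch below is illustrative only: the function names are hypothetical, and it assumes the limits count sequence characters rather than FASTA description lines, which the text does not specify.

```python
MAX_SINGLE = 7000     # limit for one pasted sequence
MAX_TOTAL = 15000     # limit for several pasted sequences together
MAX_SEQUENCES = 5     # number of sequences that may be pasted


def split_fasta(text):
    """Split pasted text into (description, sequence) records.

    Text without any '>' line is treated as one raw sequence, which
    mirrors how raw input is read as a single query.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None or chunks:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None or chunks:
        records.append((header, "".join(chunks)))
    return records


def check_paste_limits(text):
    """Return a list of problems; an empty list means the paste fits."""
    records = split_fasta(text)
    problems = []
    if len(records) > MAX_SEQUENCES:
        problems.append(f"{len(records)} sequences pasted; at most {MAX_SEQUENCES} are allowed")
    for header, sequence in records:
        if len(sequence) > MAX_SINGLE:
            problems.append(f"sequence {header or '(raw)'} has {len(sequence)} characters; limit is {MAX_SINGLE}")
    total = sum(len(sequence) for _, sequence in records)
    if len(records) > 1 and total > MAX_TOTAL:
        problems.append(f"{total} characters in total; limit is {MAX_TOTAL}")
    return problems
```

If `check_paste_limits` reports a problem, the file upload feature is the alternative suggested above.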
Using the Browse option to upload a local file
NOTE: If you are uploading a file, make sure the file is in text format. If your file is a Word document, open the file in Word and save it again in text-only format.
- Macintosh
- Click on the Browse button
- Click on folders to open them, and on the file to upload it
- PC
- Click on the Browse button
- Change the file type from "HTML" to "all files"
- Click on folders to open them, and on the file to upload it
- UNIX
- Click on the Browse button
- Change *.html to * at the end of the string in the Filter box
- Click on a folder and then the Filter button to open the folder
- Click on a file and then the OK button to upload it
Word documents will not work unless saved as text first.
For NCBI BLAST, limits are imposed on the size of the input files based upon the type of query being performed and the size of the dataset being searched. For example, the limit for TBLASTX against a large dataset such as GenBank AGI sequences is 1000 characters, whereas for a small dataset like TIGR CDS sequences the limit is 3000 characters. The following table lists the search types and limits for NCBI BLAST.
Search Type | Large data set input character limit | Small data set input character limit |
BLASTN | 25000 | 25000 |
BLASTX | 25000 | 25000 |
BLASTP | 5000 | 5000 |
TBLASTN | 1000 | 3000 |
TBLASTX | 1000 | 3000 |
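The table above can be expressed as a small lookup. This is only a sketch: the names `INPUT_LIMITS`, `input_limit` and `within_limit` are hypothetical, and the values are copied from the table.

```python
# Search type -> (large dataset limit, small dataset limit), in characters.
INPUT_LIMITS = {
    "BLASTN": (25000, 25000),
    "BLASTX": (25000, 25000),
    "BLASTP": (5000, 5000),
    "TBLASTN": (1000, 3000),
    "TBLASTX": (1000, 3000),
}


def input_limit(program, large_dataset):
    """Return the input character limit for a search type."""
    large, small = INPUT_LIMITS[program.upper()]
    return large if large_dataset else small


def within_limit(program, large_dataset, query):
    """True if the query text fits the limit for this search."""
    return len(query) <= input_limit(program, large_dataset)
```

For instance, a 2000-character query passes `within_limit("tblastx", False, query)` but fails the same check against a large dataset.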
- Raw text format
An example sequence in raw format is:
GGAAAAATCGAAGGATAATCTGTTTCTTCCAGCACAAGTTAACTTGCAAGAGAGAGCT
CAAAGATGGAACCAACAGAAAAACCATCGACCAAACCATCTTCTCGGACTCTACCTAG
AGACACTCGTGGCTCTCTCGAAGTATTCAACCCGTCAACTCAGCTGACCCGACCCGAT
AACCCGGTGTTCCGTCCTGAACCACCAGCGTGGCAAAACTTGAGTGATCCACGTGGCA
CCAGTCCTCAACCCCGACCACAACAAGAACCAGCTCCATCCAACCCTGTTCGGTCTGA
TCAAGAAATCGCTGTCACGACCTCATGGATGGCTCTGAAAGACCCATCACCGGAGACA
ATCTCCAAG
- FASTA format
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:
>gi|1122533|gb|AAB05099.1| BELL1
MARDQFYGHNNHHHQEQQHQMINQIQGFDETNQNPTDHHHYNHQIFGSNSNMGMMIDFSKQQQIRMTSGS
DHHHHHHQTSGGTDQNQLLEDSSSAMRLCNVNNDFPSEVNDERPPQRPSQGLSLSLSSSNPTSISLQSFE
LRPQQQQQGYSGNKSTQHQNLQHTQMMMMMMNSHHQNNNNNNHQHHNHHQFQIGSSKYLSPAQELLSEFC
SLGVKESDEEVMMMKHKKKQKGKQQEEWDTSHHSNNDQHDQSATTSSKKHVPPLHSLEFMELQKRKAKLL
SMLEELKRRYGHYREQMRVAAAAFEAAVGLGGAEIYTALASRAMSRHFRCLKDGLVGQIQATSQALGERE
EDNRAVSIAARGETPRLRLLDQALRQQKSYRQMTLVDAHPWRPQRGLPERAVTTLRAWLFEHFLHPYPSD
VDKHILARQTGLSRSQVSNWFINARVRLWKPMIEEMYCEETRSEQMEITNPMMIDTKPDPDQLIRVEPES
LSSIVTNPTSKSGHNSTHGTMSLGSTFDFSLYGNQAVTYAGEGGPRGDVSLTLGLQRNDGNGGVSLALSP
VTAQGGQLFYGRDHIEEGPVQYSASMLDDDQVQNLPYRNLMGAQLLHDIV
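A raw sequence can be turned into FASTA format by adding a description line and wrapping the sequence data. The sketch below assumes a hypothetical helper named `to_fasta` and a default line width of 70 characters, chosen only to stay under the recommended 80.

```python
def to_fasta(description, sequence, width=70):
    """Format a raw sequence as a FASTA record.

    The description goes on a single line starting with '>', and the
    sequence is wrapped into lines of at most `width` characters.
    """
    if not 0 < width < 80:
        raise ValueError("width should be between 1 and 79")
    # Drop spaces and line breaks that raw text may contain.
    cleaned = "".join(sequence.split())
    lines = [">" + description]
    lines.extend(cleaned[i:i + width] for i in range(0, len(cleaned), width))
    return "\n".join(lines) + "\n"
```

Several records produced this way can be concatenated into one file for upload as multiple query sequences.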
- GCG format
An example sequence in GCG format is:
!!NA_SEQUENCE 1.0
nga361
nga361.seq Length: 204 February 22, 1999 12:09 Type: N Check: 234 ..
1 TTATATGATA TATATAGTTA TGTATGTTNC AAGAATNCGA TATGGNACGC
51 ATGATTGAAG AATAATGATT GAGGAATTTT NCTGTAACAA AAAAATTNGA
101 NATAAACAAN TNTGTGGCTA AGAACTTAAC AAGGNCACAT GTTGATATGT
151 GAANTAGGAA TCTCATNATA AGGANCACAC GGTTGACAGC AAACGGGCNT
201 NTAC
- RSF format
A Rich Sequence Format (RSF) file contains one or more sequences that may or may not be related. In addition to the sequence data, each sequence can be richly annotated with descriptive sequence information such as creator/author of the sequence, sequence weight, creation date, one-line description of the sequence, offset, or the number of leading gaps in a sequence that is part of an alignment or fragment assembly project, and known sequence features. An example sequence in RSF format is:
!!RICH_SEQUENCE 1.0
..
{
name Hs70_Plafa
descrip PileUp of: @Hsp70.List
type PROTEIN
longname Gendocdisk:[Gcgdoc.Program_Manual]Hsp70.Msf{Hs70_Plafa}
checksum 1012
creation-date 10/15/96 8:40:33
strand 1
sequence
~~~~~~~~~~~~~~~MASAKGSKPNLPESNIAIGIDLGTTYSCVGVWRNENVDIIANDQG
NRTTPSYVAFT.DTERLIGDAAKNQVARNPENTVFDAKRLIGRKFTESSVQSDMKHWPFT
VKSGVDEKPMIEVTYQGEKKLFHPEEISSMVLQKMKENAEAFLGKSIKNAVITVPAYFND
SQRQATKDAGTIAGLNVMRIINEPTAAAIAYGLHKKG..KGEKNILIFDLGGGTFDVSLL
TIED...G.IFEVKATAGDTHLGGEDFDNRLVNFCVEDFKRKNRGKDLSKNSRALRRLRT
QCERAKRTLSSSTQATIEIDSLFEGID....YSVTVSRARFEELCIDYFRDTLIPVEKVL
KDAMMDKKSVHEVVLVGGSTRIPKIQTLIKEFFNGKEACRSINPDEAVAYGAAVQAAILS
G.DQSNAVQDLLLLDVCSLSLGLETAGGVMTKLIERNTTIPAKKSQIFTTYADNQPGVLI
QVYEGERALTKDNNLLGKFHLDGIPPAPRKVPQIEVTFDIDANGILNVTAVEKSTGKQNH
ITITNDKGRLSQDEIDRMVNDAEKYKAEDEENRKRIEARNSLENYCYGVKSSLEDQKIKE
KLQPAEIETCMKTITTILEWLEKN.QLAGKDEYEAKQKEAESVCAPIMSKIYQDAAGAAG
.GMPGGMP..GGMPGGMPSGMPGGMNFPGGMPGAGMPGNAPAGSGPTVEEVD~~~~~~
}
Filtering
Filtering masks off segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton & Federhen (Computers and Chemistry, 1993) or, for BLASTN, by the DUST program of Tatusov and Lipman (in preparation). Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.
Filtering is only applied to the query sequence (or its translation products), not to database sequences. Default filtering is DUST for BLASTN, SEG for other programs. It is not unusual for nothing at all to be masked by SEG, when applied to sequences in SWISS-PROT, so filtering should not be expected to always yield an effect. Furthermore, in some cases, sequences are masked in their entirety, indicating that the statistical significance of any matches reported against the unfiltered query sequence should be suspect.
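To give a feel for what masking a low-complexity segment means, here is a deliberately simplified sketch. It is not the SEG or DUST algorithm: it slides a fixed window along the query, computes the Shannon entropy of each window, and replaces residues in low-entropy windows with a mask character. The window size, entropy cutoff and function names are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits per residue) of the letters in a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(sequence, window=12, min_entropy=2.2, mask_char="X"):
    """Mask every window whose entropy falls below min_entropy.

    Sequences shorter than the window are returned unchanged.
    """
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)
```

A proline-rich stretch is masked completely, while a window of twelve different residues is left untouched. As with real filtering, only the query would be masked, not the database sequences.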
Results Options
Output title: Type in the title you would like to appear at the top of your BLAST output.
E-mail address: Entering your email address is suggested even if requesting a browser reply, and may be mandatory depending on server configuration options. Depending on server loading, or the expected execution time of your request, the server may convert your request to an email reply and auto-select the E-mail URL option. This avoids the frustration of wondering when your job will finish, and allows you to request more jobs more quickly.
Return Results:
- To your web browser: Your job will run immediately and return results directly to your Web browser in HTML format. Accession numbers that appear in the query and target loci names and descriptions will be hyperlinked to allow easy access to additional sequence information. NOTE: Attempting to view large result files may cause your browser to "blank out". If you have trouble viewing results, particularly if you've asked for many scores and alignments, or submitted many queries in one job, your browser's "memory" and "disk" cache settings may need to be increased. See your browser's help and preference menus for details. To prevent loss of a large HTML result file, you might wish to request emailing a URL to it rather than a browser reply. This way you'll be able to experiment with your browser's cache settings and retrieve the output as many times as you wish without waiting for your request to be re-run.
- By E-mail message: The results are sent within the body of a normal email message, to the email address you enter. Any comments you type will appear as the Subject of the email message. Generally, you should use this option only if your mail system can handle large messages, and you've asked for textual output. If requesting HTML format by email, either of the following choices may be better if your mail reader software is not HTML-aware.
Result Formats:
- HTML hypertext (file type "htm"): HTML format is used by web browsers. Accession numbers that appear in the query and target loci names and descriptions will be hyperlinked to allow easy access to additional sequence information. Embedded Java "Applets" may be used to render graphical information (e.g., the ClustalW dendrogram), which will not appear unless your browser is set to permit these applets to run.
- Normal text (file type "txt"): The results are returned as conventional human-readable text.
Protein:
RANK, STATUS, SCORE, E-VALUE, PROGRAM, Gap Penalties (Existence), Gap Penalty (Extension), EMPTY, EMPTY, MATRIX, TEMPFILENAME, QUERY LENGTH, empty, QUERY NAME, DATASET, Target length, empty, DESCRIPTION, empty, empty, empty, empty, empty, empty, empty, empty, empty, Identities, Positives, Gaps, Percentage ratio of identical matches to the length of the alignment, Percentage ratio of identical matches to the length of the query, unknown, unknown, Percentage ratio of identical matches to the length of the target, unknown, unknown, Query Start, Query End, Target Start, Target End, empty, QUERY NT, COMPARISON, TARGET NT
Advanced parameters
- Max Scores: Restricts the number of short descriptions of matching sequences reported to the number specified; default limit is 100 descriptions. See also Expectation.
- Max Alignments: Restricts database sequences to the number specified for which high-scoring segment pairs (HSPs) are reported; the default limit is 50. If more database sequences than this happen to satisfy the statistical significance threshold for reporting (see Expectation), only the matches ascribed the greatest statistical significance are reported.
- Expectation: The statistical significance threshold for reporting matches against database sequences; the default value is 10, such that 10 matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is greater than the EXPECTATION threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported. Fractional values are acceptable.
- Query and Database Genetic Code: Genetic code to be used in BLASTX translation of the query.
- Gapped Alignments: Whether to allow gapped alignments; either ON or OFF.
- Gap Opening Penalty (Default Setting; option name: gapopen): Cost to open a gap; a 0 in this field means to use the default. Supported values for BLASTP, BLASTX, TBLASTN, and TBLASTX are limited.
- Gap Extension Penalty (Default Setting; option name: gapextend): Cost to extend a gap; a 0 in this field means to use the default. Supported values for BLASTP, BLASTX, TBLASTN, and TBLASTX are limited.
- Nucleic Mismatch: Penalty for a mismatch in the BLAST™ portion of the run.
- Nucleic Match: Reward for a match in the BLAST™ portion of the run.
- Matrix: The amino acid substitution matrix to be used for protein comparisons. Both BLOSUM and PAM matrices are available at several different levels of evolutionary distance.
- Extension Threshold (Default Setting; option name: threshold): The threshold above which BLAST™ will extend a hit found. The hit is based on finding a word of a certain size (see Word Size).
- Word Size (Default Setting; option name: word_size): The size of the initial word that must be matched between the database and the query sequence.
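The interaction of Expectation, Max Scores and Max Alignments can be pictured as a filter followed by two cutoffs. The sketch below is an illustration under assumptions, not the server's actual code: the `Hit` record and the `select_hits` function are hypothetical, and the defaults (10, 100, 50) are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    target: str     # name of the matching database sequence
    score: float    # alignment score
    evalue: float   # expected number of chance matches this good


def select_hits(hits, expectation=10.0, max_scores=100, max_alignments=50):
    """Return (descriptions, alignments) as two lists of hits.

    Hits whose E-value is greater than the expectation threshold are
    dropped; the rest are ordered from most to least significant and
    truncated to the Max Scores and Max Alignments limits.
    """
    significant = sorted(
        (hit for hit in hits if hit.evalue <= expectation),
        key=lambda hit: hit.evalue,
    )
    return significant[:max_scores], significant[:max_alignments]
```

Lowering `expectation` makes the filter more stringent, so fewer chance matches survive to either list.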