[med-svn] r13007 - in trunk/packages/meme/trunk/debian: . meme_manpages

Andreas Tille tille at alioth.debian.org
Thu Feb 14 13:58:05 UTC 2013


Author: tille
Date: 2013-02-14 13:58:05 +0000 (Thu, 14 Feb 2013)
New Revision: 13007

Added:
   trunk/packages/meme/trunk/debian/meme_manpages/mast.1
Removed:
   trunk/packages/meme/trunk/debian/mast_manual.txt
   trunk/packages/meme/trunk/debian/meme_manual.txt
Log:
Remove unneeded text copies of manuals (meme.1 existed, mast.1 just written)


Deleted: trunk/packages/meme/trunk/debian/mast_manual.txt
===================================================================
--- trunk/packages/meme/trunk/debian/mast_manual.txt	2013-02-14 12:38:03 UTC (rev 13006)
+++ trunk/packages/meme/trunk/debian/mast_manual.txt	2013-02-14 13:58:05 UTC (rev 13007)
@@ -1,413 +0,0 @@
-USAGE:
-	mast <mfile> [optional arguments ...]
-
-	<mfile>		file containing motifs to use; may be a MEME output
-			file or a file with the format given below 
-	[<database>] 	or 
-	[-d <database>] database to search with motifs or
-	[-stdin]	read database from standard input; 
-			Default: reads database specified inside <mfile>
-	[-c <count>]	only use the first <count> motifs
-	[-a <alphabet>]	<mfile> is assumed to contain motifs in the
-			format output by bin/make_logodds
-			and <alphabet> is their alphabet; -d <database>
-			or -stdin must be specified when this option is used
-	[-stdout]	print output to standard output instead of file
-	[-text]		output in text (ASCII) format;
-			(default: hypertext (HTML) format)
-	
-	[-sep]		score reverse complement DNA strand as a separate 
-			sequence
-	[-norc]		do not score reverse complement DNA strand
-	[-dna]		translate DNA sequences to protein
-	[-comp]		adjust p-values and E-values for sequence composition
-	[-rank <rank>]	print results starting with <rank> best (default: 1)
-	[-smax <smax>]	print results for no more than <smax> sequences
-			(default: all)
-	[-ev <ev>]	print results for sequences with E-value < <ev>
-			(default: 10)
-	[-mt <mt>]	show motif matches with p-value < mt (default: 0.0001)
-	[-w]		show weak matches (mt<p-value<mt*10) in angle brackets
-	[-bfile <bfile>]	read background frequencies from <bfile>
-	[-seqp]		use SEQUENCE p-values for motif thresholds
-			(default: use POSITION p-values)
-	[-mf <mf>]	print <mf> as motif file name
-	[-df <df>]	print <df> as database name
-	[-minseqs <minseqs>]	lower bound on number of sequences in db
-	[-mev <mev>]+	use only motifs with E-values less than <mev>
-	[-m <m>]+	use only motif(s) number <m> (overrides -mev)
-	[-diag <diag>]	nominal order and spacing of motifs
-	[-best]		include only the best motif in diagrams
-	[-remcorr]	remove highly correlated motifs from query
-	[-brief]	brief output--do not print documentation
-	[-b]		print only sections I and II
-	[-nostatus]	do not print progress report
-	[-hit_list]	print hit_list instead of diagram; implies -text
-
-  
-  MAST: Motif Alignment and Search Tool
-  
-  MAST is a tool for searching biological sequence databases for sequences
-  that contain one or more of a group of known motifs. 
-  
-  A motif is a sequence pattern that occurs repeatedly in a group of related
-  protein or DNA sequences. Motifs are represented as position-dependent
-  scoring matrices that describe the score of each possible letter at each
-  position in the pattern. Individual motifs may not contain gaps. Patterns with
-  variable-length gaps must be split into two or more separate motifs before
-  being submitted as input to MAST. 
-  
-  MAST takes as input a file containing the descriptions of one or more motifs
-  and searches a sequence database that you select for sequences that match
-  the motifs. The motif file can be the output of the MEME motif discovery tool 
-  or any file in the appropriate format. 
-  
-  MAST outputs three things: 
-  
-    1. The names of the high-scoring sequences sorted by the strength of the
-       combined match of the sequence to all of the motifs in the group. 
-    2. Motif diagrams showing the order and spacing of the motifs within each
-       matching sequence. 
-    3. Detailed annotation of each matching sequence showing the sequence
-       and the locations and strengths of matches to the motifs. 
-  
-  MAST works by calculating match scores for each sequence in the database
-  compared with each of the motifs in the group of motifs you provide. For each
-  sequence, the match scores are converted into various types of p-values and
-  these are used to determine the overall match of the sequence to the group of
-  motifs and the probable order and spacing of occurrences of the motifs in the
-  sequence. 
-  
-  MAST outputs a file containing:
-  
-      * the version of MAST and the date it was built, 
-      * the reference to cite if you use MAST in your research, 
-      * a description of the database and motifs used in the search, 
-      * an explanation of the results,
-      * high-scoring sequences--sequences matching the group of motifs
-        above a stated level of statistical significance, 
-      * motif diagrams showing the order and spacing of occurrences of the
-        motifs in the high-scoring sequences and 
-      * annotated sequences showing the positions and p-values of all motif
-        occurrences in each of the high-scoring sequences. 
-  
-  Each section of the results file contains an explanation of how to interpret
-  it. 
-  
-    Match Scores
-  
-  The match score of a motif to a position in a sequence is the sum of the
-  score from each column of the position-dependent scoring matrix
-  corresponding to the letter at that position in the sequence. For example, if
-  the sequence is 
-  
-  TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC
-     ========
-  
-  and the motif is represented by the position-dependent scoring matrix (where
-  each row of the matrix corresponds to a position in the motif) 
-  
-  =========|=================================
-  POSITION |   A        C        G        T
-  =========|=================================
-    1      | 1.447    0.188   -4.025   -4.095 
-    2      | 0.739    1.339   -3.945   -2.325 
-    3      | 1.764   -3.562   -4.197   -3.895 
-    4      | 1.574   -3.784   -1.594   -1.994 
-    5      | 1.602   -3.935   -4.054   -1.370 
-    6      | 0.797   -3.647   -0.814    0.215 
-    7      |-1.280    1.873   -0.607   -1.933 
-    8      |-3.076    1.035    1.414   -3.913 
-  =========|=================================
-  
-  then the match score of the fourth position in the sequence (underlined)
-  would be found by summing the score for T in position 1, G in position 2 and
-  so on until G in position 8. So the match score would be 
-  
-    score = -4.095 + -3.945 + -3.895 + -1.994
-            + -4.054 + -0.814 + -1.933 + 1.414 
-          = -19.316
-  
-  The match scores for other positions in the sequence are calculated in the
-  same way. Match scores are only calculated if the match completely fits within
-  the sequence. Match scores are not calculated if the motif would overhang
-  either end of the sequence. 
-  
-    P-values
-  
-  MAST reports all matches of a sequence to a motif or group of motifs in terms
-  of the p-value of the match. MAST considers the p-values of four types of
-  events: 
-  
-      position p-value: the match of a single position within a sequence to
-      	a given motif, 
-      sequence p-value: the best match of any position within a sequence
-      	to a given motif, 
-      combined p-value: the combined best matches of a sequence to a
-      	group of motifs, and 
-      E-value: observing a combined p-value at least as small in a random
-      	database of the same size. 
-  
-  All p-values are based on a random sequence model that assumes each
-  position in a random sequence is generated according to the average letter
-  frequencies of all sequences in the appropriate (peptide or nucleotide)
-  non-redundant database (ftp://ncbi.nlm.nih.gov/blast/db/) on September 22,
-  1996.  This can be overridden in two ways:
-  
-  	1) -bfile <bfile>
-  	The random model uses the letter frequencies given in <bfile> 
-  	instead of the non-redundant database frequencies.
-  	The format of <bfile> is the same as that for the MEME -bfile option; 
-  	see the MEME documentation for details.  (Sample files are given in 
-  	directory tests: tests/nt.freq and tests/na.freq.) 
-  	
-  	2) -comp
-  	The random model uses the letter frequencies in the current target
-  	sequence instead of the non-redundant database frequencies.  This
-  	causes p-values and E-values to be compensated individually for the 
-  	actual composition of each sequence in the database.  This option
-  	can increase search time substantially due to the need to compute
-  	a different score distribution for each high-scoring sequence.
-  
-  
-      Position p-value
-  
-      The p-value of a match of a given position within a sequence to a
-      motif is defined as the probability of a randomly selected position in a
-      randomly generated sequence having a match score at least as large
-      as that of the given position. 
-  
-      Sequence p-value
-  
-      The p-value of a match of a sequence to a motif is defined as the
-      probability of a randomly generated sequence of the same length
-      having a match score at least as large as the largest match score of
-      any position in the sequence. 
-  
-      Combined p-value
-  
-      The p-value of a match of a sequence to a group of motifs is defined
-      as the probability of a randomly generated sequence of the same
-      length having sequence p-values whose product is at least as small
-      as the product of the sequence p-values of the matches of the motifs
-      to the given sequence. 
-  
-      E-value
-  
-      The E-value of the match of a sequence in a database to a group
-      of motifs is defined as the expected number of sequences in a random
-      database of the same size that would match the motifs as well as the
-      sequence does and is equal to the combined p-value of the sequence
-      times the number of sequences in the database. 
-  
-    High-scoring Sequences
-  
-  MAST lists the names and part of the descriptive text of all sequences
-  whose E-value is less than E. Sequences shorter than one or more of the
-  motifs are skipped. The sequences are sorted by increasing E-value. The
-  value of E is set to 10 for the WEB server but is user-selectable in the
-  down-loadable version of MAST. 
-  
-    Motif Diagrams
-  
-  Motif diagrams show the order and spacing of non-overlapping matches to
-  the motifs in each high-scoring sequence. Motif occurrences are determined
-  based on the position p-value of matches to the motif. Strong matches
-  (p-value < M) are shown in square brackets (`[ ]'), weak matches (M <
-  p-value < M × 10) are shown in angle brackets (`< >') and the length of
-  non-motif sequence ("spacer") is shown between dashes (`-'). For example, 
-  
-          27-[3]-44-<4>-99-[1]-7
-  
-  shows an initial spacer of length 27, followed by a strong match to motif 3, a
-  spacer of length 44, a weak match to motif 4, a spacer of length 99, a strong
-  match to motif 1 and a final non-motif sequence of length 7. The value of M is
-  0.0001 for the WEB server but is user-selectable in the down-loadable
-  version of MAST. 
-  
-  Note: If you specify the -hit_list switch to MAST, the motif "diagram" takes the form
-  of a comma separated list of motif occurrences ("hits").  Each "hit" has the format:
-  	<strand><motif> <start> <end> <p-value>
-  where 
-          <strand>        is the strand (+ or - for DNA, blank for protein),
-          <motif>         is the motif number,
-          <start>         is the starting position of the hit,
-          <end>           is the ending position of the hit, and
-          <p-value>       is the position p-value of the hit.
-  
-    Annotated Sequences
-  
-  MAST annotates each high-scoring sequence by printing the sequence
-  along with the position and strength of all the non-overlapping motif
-  occurrences. The four lines above each motif occurrence contain,
-  respectively, 
-  
-      the motif number of the occurrence, 
-      the position p-value of the occurrence, 
-      the best possible match to the motif, and 
-      a plus sign (`+') above each letter in the occurrence that has a positive
-      match score to the motif. 
-  
-  The best possible match to a motif is the sequence of letters which would
-  achieve the highest match score. 
-  
-  
-  MOTIF FORMAT 
-  
-  MAST can search using (multiple) motifs contained in 
-  
-      a MEME output file, 
-      a GCG profile file, 
-      two or more GCG profile files concatenated together, or 
-      a file with the following format. 
-  
-                    Motif file format
-  
-       ALPHABET= alphabet
-       log-odds matrix: alength= alength w= w
-       row_1
-       row_2
-       ...
-       row_w  
-  
-  
-  
-      A motif is represented by a position-dependent scoring matrix. 
-      A scoring matrix is preceded by a line starting with the words
-      log-odds matrix: and specifying alength, the length of
-      the alphabet (number of columns in the scoring matrix), and the w, the
-      width of the motif (number of rows in the scoring matrix). 
-      The following w lines (no blank lines allowed) contain the rows of the
-      scoring matrix. Row i, column j of the matrix gives the score for the j-th
-      letter in alphabet appearing at position i in an occurrence of the
-      motif. 
-      The spaces after the equals signs and the colon are required. 
-      The number of letters in alphabet must equal alength. 
-      Any number of additional motifs may follow the first one. 
-      The motif file must contain a line starting with 
-  
-              ALPHABET= 
-  
-      followed by alphabet, a list containing the letters used in the motifs. 
-      The order of the letters in alphabet must be the same as the order of the
-      columns of scores in the motifs. The order need not be alphabetical
-      and case does not matter, but there should be no spaces in alphabet.
-      The letters in alphabet must be a subset of either the IUB/IUPAC DNA
-      (ABCDGHKMNRSTUVWY) or protein
-      (ABCDEFGHIKLMNPQRSTUVWXYZ) alphabets. DNA alphabets
-      must contain at least the letters ACGT. Protein alphabets must contain
-      at least the letters ACDEFGHIKLMNPQRSTVWY. All other letters in
-      the alphabets are optional. If any of the optional letters are missing 
-      from alphabet, MAST automatically generates scores for them by taking the
-      weighted average of the scores for the letters which the missing letter
-      could match. (The weights are the frequencies of the replaced letters in
-      the appropriate non-redundant database.) Replacements for the
-      optional letters are given in the following table. 
-  
-             LETTERS MATCHED BY OPTIONAL LETTERS
-      =================================================
-      optional          matches 
-      letter      DNA             protein 
-      =================================================
-       B          CGT             DN 
-       D          AGT
-       H          ACT
-       K          GT
-       M          AC
-       N          ACGT
-       R          AG
-       S          CG
-       U          T               ACDEFGHIKLMNPQRSTVWY 
-       V          CAG
-       W          AT
-       X                          ACDEFGHIKLMNPQRSTVWY 
-       Y          CT
-       Z                          EQ 
-       *          ACGT            ACDEFGHIKLMNPQRSTVWY
-       -          ACGT            ACDEFGHIKLMNPQRSTVWY
-      =================================================
-  
-  
-  EXAMPLE 
-  
-  Here is an example of a DNA motif file that contains two motifs. 
-  
-                    Sample motif file 
-  
-          ALPHABET= ACGT
-          log-odds matrix: alength= 4 w= 9
-           -4.275  -0.182  -4.195   1.408
-           -4.296  -1.487   1.880  -0.816
-           -2.160  -1.492  -4.171   1.474
-           -0.810  -4.076   1.872  -2.164
-            1.537  -1.487  -4.195  -4.205
-            0.113   0.340  -0.237  -0.209
-           -0.454   0.923   0.390  -0.834
-           -1.336  -0.082   0.905   0.100
-            0.674  -4.183   0.130  -0.201
-          log-odds matrix: alength= 4 w= 6
-           -2.032   0.324   1.371  -0.781
-           -0.409   0.560  -0.250   0.119
-           -4.274  -0.519  -0.260   1.167
-           -2.188   2.300  -4.191  -2.465
-            1.265  -4.111  -0.267  -2.180
-           -1.977   2.158  -1.661  -2.071 
-  
-  
-  
-  In the example above, because the order of the letters in alphabet is
-  ACGT, the first column of each motif gives the scores for the letter A at each
-  position in the motif, the second column gives the scores for C and so forth.
-  
-  Note: If -d <database> is not given, MAST looks for database
-  	specified inside of <mfile>
-  
-  Creates file (unless [-stdout] given) after stripping ".html" from the end of
-  <mfile>:
-  	mast.<mfile>[.<database>][.c<count>][.m<motif>]+[.rank<rank>][.ev<ev>][.mt<mt>][.b]
-  
-  EXAMPLES:
-  
-  The following examples assume that file "meme.results" is the
-  output of a MEME run containing at least 3 motifs and file
-  SwissProt is a copy of the Swiss-Prot database on your local disk.
-  DNA_DB is a copy of a DNA database on your local disk.
-   
-  1) Annotate the training set:
-   
-  	mast meme.results
-   
-  2) Find sequences matching the motif and annotate them in
-  the SwissProt database:
-   
-  	mast meme.results -d SwissProt
-   
-  3) Show sequences with weaker combined matches to motifs.
-   
-  	mast meme.results -d SwissProt -ev 200
-   
-  4) Indicate weaker matches to single motifs in the annotation so
-  that sequences with weak matches to the motifs (but perhaps with
-  the "correct" order and spacing) can be seen:
-  
-  	mast meme.results -d SwissProt -w
-   
-  5) Include a nominal order and spacing of the first three motifs
-  in the calculation of the sequence p-values to increase the
-  sensitivity of the search for matching sequences:
-   
-  	mast meme.results -d SwissProt -diag "9-[2]-61-[1]-62-[3]-91"
-   
-  6) Use only the first and third motifs in the search:
-   
-  	mast meme.results -d SwissProt -m 1 -m 3
-   
-  7) Use only the first two motifs in the search:
-   
-  	mast meme.results -d SwissProt -c 2
-  
-  8) Search DNA sequences using protein motifs, adjusting p-values and E-values 
-  for each sequence by that sequence's composition:
-  
-  	mast meme.results -d DNA_DB -dna -comp
-  

Added: trunk/packages/meme/trunk/debian/meme_manpages/mast.1
===================================================================
--- trunk/packages/meme/trunk/debian/meme_manpages/mast.1	                        (rev 0)
+++ trunk/packages/meme/trunk/debian/meme_manpages/mast.1	2013-02-14 13:58:05 UTC (rev 13007)
@@ -0,0 +1,475 @@
+.\" DO NOT MODIFY THIS FILE!  It was generated by help2man 1.40.10.
+.TH MAST "1" "February 2013" "Motif Alignment and Search Tool" "User Commands"
+.SH NAME
+MAST \- Motif Alignment and Search Tool
+.SH SYNOPSIS
+.B mast <motif file> <sequence file>
+[\fIoptions\fR]
+.SH DESCRIPTION
+MAST: Motif Alignment and Search Tool
+.SS
+Inputs
+.TP
+\fB<motif file>\fR
+file containing motifs to use; normally a MEME output file
+.TP
+\fB<sequence file>\fR
+search sequences in FASTA\-formatted database with motifs
+.TP
+\fB\-bfile <file>\fR
+read background frequencies from <file>
+.TP
+\fB\-dblist\fR
+read the <sequence file> as a list of FASTA\-formatted databases
+.SS
+Outputs
+.TP
+\fB\-o <dir>\fR
+directory to output mast results; directory must not exist
+.TP
+\fB\-oc <dir>\fR
+directory to output mast results with overwriting allowed
+.TP
+\fB\-hit_list\fR
+print a machine\-readable list of all hits only; outputs to standard out and overrides \fB\-seqp\fR
+.SS
+Which Motifs To Use
+.TP
+\fB\-remcorr\fR
+remove highly correlated motifs from query
+.TP
+\fB\-m <m>+\fR
+use only motif number \fB<m>\fR (overrides \fB\-mev\fR); this can be
+repeated to select multiple motifs
+.TP
+\fB\-c <count>\fR
+only use the first \fB<count>\fR motifs or all motifs when \fB<count>\fR is zero (default: 0)
+.TP
+\fB\-mev <mev>\fR
+use only motifs with E\-values less than \fB<mev>\fR
+.TP
+\fB\-diag <diag>\fR
+nominal order and spacing of motifs is specified by \fB<diag>\fR which is a block diagram
+.SS
+DNA\-Only Options
+.TP
+\fB\-norc\fR
+do not score reverse complement DNA strand
+.TP
+\fB\-sep\fR
+score reverse complement DNA strand as a separate sequence
+.TP
+\fB\-dna\fR
+translate DNA sequences to protein; motifs must be protein; sequences must be DNA
+.TP
+\fB\-comp\fR
+adjust p\-values and E\-values for sequence composition
+.SS
+Which Results To Print
+.TP
+\fB\-ev <ev>\fR
+print results for sequences with E\-value < \fB<ev>\fR (default: 10)
+.SS
+Appearance Of Block Diagrams
+.TP
+\fB\-mt <mt>\fR
+show motif matches with p\-value < \fB<mt>\fR (default: 0.0001)
+.TP
+\fB\-w\fR
+show weak matches (\fB<mt>\fR < p\-value < \fB<mt>\fR*10) in angle brackets in the hit list or when the xml is converted to text
+.TP
+\fB\-best\fR
+include only the best motif hits in \fB\-hit_list\fR diagrams
+.TP
+\fB\-seqp\fR
+use SEQUENCE p\-values for motif thresholds (default: use POSITION p\-values)
+.SS
+Miscellaneous
+.TP
+\fB\-mf <mf>\fR
+in results use \fB<mf>\fR as motif file name
+.TP
+\fB\-df <df>\fR
+in results use \fB<df>\fR as database name (ignored when \fB\-dblist\fR)
+.TP
+\fB\-dl <dl>\fR
+in results use \fB<dl>\fR as link to search sequence names; token
+SEQUENCEID is replaced with the FASTA sequence ID (ignored when \fB\-dblist\fR)
+.TP
+\fB\-minseqs <ms>\fR
+lower bound on number of sequences in db
+.TP
+\fB\-nostatus\fR
+do not print progress report
+.TP
+\fB\-notext\fR
+do not create text output
+.TP
+\fB\-nohtml\fR
+do not create html output
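+.SS
+Example Invocations
+.PP
+The following command lines are illustrative sketches only. They combine
+the synopsis and options listed above; the file names meme.results,
+SwissProt and DNA_DB and the directory name mast_out are placeholders,
+not files shipped with MAST.
+.IP
+mast meme.results SwissProt \-oc mast_out \-ev 200
+.IP
+mast meme.results DNA_DB \-dna \-comp
+.IP
+mast meme.results SwissProt \-hit_list
+.PP
+The first writes results to mast_out, overwriting earlier results, and
+reports sequences with E\-value below 200; the second searches DNA
+sequences with protein motifs and adjusts for sequence composition; the
+third prints only the machine\-readable hit list to standard output.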
+.SS
+Description
+.P
+MAST is a tool for searching biological sequence databases for
+sequences that contain one or more of a group of known motifs.
+.PP
+A motif is a sequence pattern that occurs repeatedly in a group of
+related protein or DNA sequences. Motifs are represented as
+position\-dependent scoring matrices that describe the score of each
+possible letter at each position in the pattern. Individual motifs may
+not contain gaps. Patterns with variable\-length gaps must be split into
+two or more separate motifs before being submitted as input to MAST.
+.PP
+MAST takes as input a file containing the descriptions of one or more
+motifs and searches a sequence database that you select for sequences
+that match the motifs. The motif file can be the output of the MEME
+motif discovery tool or any file in the appropriate format.
+.PP
+MAST outputs an xml file which can then be converted into html or text
+format. The xml file is designed for machine processing and the html
+file is designed for human viewing. The text format is available for
+backwards compatibility. However, because of design decisions made to
+optimise the xml for html generation, the output for separate scoring
+mode is not identical and some options were removed. The text format
+will be unsupported in future releases, so we recommend that you migrate
+any programs reading mast output to the xml format.
+.PP
+MAST outputs three things:
+.IP
+1. The names of the high\-scoring sequences sorted by the strength of
+the combined match of the sequence to all of the motifs in the
+group.
+.IP
+2. Motif diagrams showing the order and spacing of the motifs within
+each matching sequence.
+.IP
+3. Detailed annotation of each matching sequence showing the sequence
+and the locations and strengths of matches to the motifs.
+.PP
+MAST works by calculating match scores for each sequence in the
+database compared with each of the motifs in the group of motifs you
+provide. For each sequence, the match scores are converted into various
+types of p\-values and these are used to determine the overall match of
+the sequence to the group of motifs and the probable order and spacing
+of occurrences of the motifs in the sequence.
+.PP
+MAST generates a human readable file from the xml output containing:
+.IP
+* the version of MAST and the date it was built,
+.IP
+* the reference to cite if you use MAST in your research,
+.IP
+* a description of the databases and motifs used in the search,
+.IP
+* an explanation of the result,
+.IP
+* the identifier and score of each sequence that matches the group of
+motifs above a stated level of statistical significance, sorted by score,
+.IP
+* motif diagrams showing the order and spacing of occurrences of the
+motifs in the significant sequences, and
+.IP
+* annotated sequences showing the positions and p\-values of all motif
+occurrences in each of the high\-scoring sequences.
+.PP
+The html version is the recommended version for human reading and has
+all sections documented; the text version, however, has no documentation
+for the first section. That section lists each motif along with the
+sequence that would achieve the best possible match score. In order to
+avoid biased scores when multiple motif scores are combined, MAST also
+computes the pairwise correlations between each pair of motifs. The
+correlation between two motifs is the maximum sum of Pearson's
+correlation coefficients for aligned columns divided by the width of
+the shorter motif. The maximum is found by trying all alignments of the
+two motifs. Motifs with correlations below 0.60 have little effect on
+the accuracy of the combined scores. Pairs of motifs with higher
+correlations should be removed from the query.
+.SS
+Match Scores
+.PP
+The match score of a motif to a position in a sequence is the sum of
+the score from each column of the position\-dependent scoring matrix
+corresponding to the letter at that position in the sequence. For
+example, if the sequence is
+.IP
+TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC
+.br
+\ \ \ ========
+.PP
+and the motif is represented by the position\-dependent scoring matrix
+(where each row of the matrix corresponds to a position in the motif)
+.TP
+Position
+A      C      G      T
+.TP
+1
+1.447  0.188  \-4.025 \-4.095
+.TP
+2
+0.739  1.339  \-3.945 \-2.325
+.TP
+3
+1.764  \-3.562 \-4.197 \-3.895
+.TP
+4
+1.574  \-3.784 \-1.594 \-1.994
+.TP
+5
+1.602  \-3.935 \-4.054 \-1.370
+.TP
+6
+0.797  \-3.647 \-0.814 0.215
+.TP
+7
+\-1.280 1.873  \-0.607 \-1.933
+.TP
+8
+\-3.076 1.035  1.414  \-3.913
+.PP
+then the match score of the fourth position in the sequence
+(underlined) would be found by summing the score for T in position 1, G
+in position 2 and so on until G in position 8. So the match score would
+be
+.IP
+score = \-4.095 + \-3.945 + \-3.895 + \-1.994
+.IP
++ \-4.054 + \-0.814 + \-1.933 + 1.414
+.IP
+= \-19.316
+.PP
+The match scores for other positions in the sequence are calculated in
+the same way. Match scores are only calculated if the match completely
+fits within the sequence. Match scores are not calculated if the motif
+would overhang either end of the sequence.
+.SS
+P\-values
+.PP
+MAST reports all matches of a sequence to a motif or group of motifs in
+terms of the p\-value of the match. MAST considers the p\-values of four
+types of events:
+.IP
+* position p\-value: the match of a single position within a sequence
+to a given motif,
+.IP
+* sequence p\-value: the best match of any position within a sequence
+to a given motif,
+.IP
+* combined p\-value: the combined best matches of a sequence to a
+group of motifs, and
+.IP
+* E\-value: observing a combined p\-value at least as small in a random
+database of the same size.
+.PP
+All p\-values are based on a random sequence model that assumes each
+position in a random sequence is generated according to the average
+letter frequencies of all sequences in the appropriate (peptide or
+nucleotide) non\-redundant database (ftp://ncbi.nlm.nih.gov/blast/db/)
+on September 22, 1996. This can be overridden by specifying the \fB\-bfile\fR
+or \fB\-comp\fR options (see below). For DNA sequences, unless \fB\-norc\fR is given,
+the positive and reverse complement strand frequencies are averaged
+together.
+.IP
+1. \fB\-bfile\fR <bfile> The random model uses the letter frequencies given
+in <bfile> instead of the non\-redundant database frequencies. The
+format of <bfile> is the same as that for the MEME \fB\-bfile\fR option;
+see the MEME documentation for details. You can create files in the
+appropriate format based on the base/residue composition of your
+own FASTA sequence files using the command "fasta\-get\-markov"
+included in the MEME distribution. Type fasta\-get\-markov on the
+command line for documentation. (Sample files are also given in
+directory tests: tests/nt.freq and tests/na.freq.)
+.IP
+2. \fB\-comp\fR The random model uses the letter frequencies in the current
+target sequence instead of the non\-redundant database frequencies.
+This causes p\-values and E\-values to be compensated individually
+for the actual composition of each sequence in the database. This
+option can increase search time substantially due to the need to
+compute a different score distribution for each high\-scoring
+sequence. With this option and DNA sequences, the positive and
+reverse complement strand frequencies are not averaged together.
+.SS
+Position p\-value
+.PP
+The p\-value of a match of a given position within a sequence to a motif
+is defined as the probability of a randomly selected position in a
+randomly generated sequence having a match score at least as large as
+that of the given position. Note: If MAST is combining reverse
+complement DNA strands, the position p\-value is not corrected for
+multiple tests.
+.SS
+Sequence p\-value
+.PP
+The p\-value of a match of a sequence to a motif is defined as the
+probability of a randomly generated sequence of the same length having
+a match score at least as large as the largest match score of any
+position in the sequence.
+.SS
+Combined p\-value
+.PP
+The p\-value of a match of a sequence to a group of motifs is defined as
+the probability of a randomly generated sequence of the same length
+having sequence p\-values whose product is at least as small as the
+product of the sequence p\-values of the matches of the motifs to the
+given sequence.
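+.PP
+As a worked sketch, assuming the sequence p\-values of n motifs behave
+like independent uniform random values (a textbook assumption stated
+here for illustration, not taken from this manual), the probability
+that their product is at most x is
+.IP
+x * (1 + L + L^2/2! + ... + L^(n\-1)/(n\-1)!), where L = \-ln(x)
+.PP
+For example, three motifs with assumed sequence p\-values 0.01, 0.002
+and 0.05 give x = 1.0e\-06, L = 13.8 and a combined p\-value of about
+1.1e\-04, which is noticeably larger than the raw product.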
+.SS
+E\-value
+.PP
+The E\-value of the match of a sequence in a database to a group of
+motifs is defined as the expected number of sequences in a random
+database of the same size that would match the motifs as well as the
+sequence does and is equal to the combined p\-value of the sequence
+times the number of sequences in the database.
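+.PP
+As a worked illustration with assumed numbers (not output from a real
+search): a sequence with a combined p\-value of 2.0e\-06, searched in a
+database of 50000 sequences, has
+.IP
+E\-value = 2.0e\-06 x 50000 = 0.1
+.PP
+that is, on average one match this good would be expected in every ten
+random databases of that size.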
+.SS
+High\-scoring Sequences
+.PP
+MAST lists the names and part of the descriptive text of all sequences
+whose E\-value is less than E. Sequences shorter than one or more of the
+motifs are skipped. The sequences are sorted by increasing E\-value. The
+value of E is set to 10 for the WEB server but is user\-selectable in
+the downloadable version of MAST.
+.SS
+Motif Diagrams
+.PP
+Motif diagrams show the order and spacing of non\-overlapping matches to
+the motifs in each high\-scoring sequence. Motif occurrences are
+determined based on the position p\-value of matches to the motif.
+Strong matches (p\-value < M) are shown in square brackets (`[ ]'), weak
+matches (M < p\-value < M x 10) are shown in angle brackets (`< >') and
+the length of non\-motif sequence ("spacer") is shown between
+underscores (`_'). For example,
+.IP
+27_[3]_44_<4>_99_[1]_7
+.PP
+shows an initial spacer of length 27, followed by a strong match to
+motif 3, a spacer of length 44, a weak match to motif 4, a spacer of
+length 99, a strong match to motif 1 and a final non\-motif sequence of
+length 7. The value of M is 0.0001 for the WEB server but is
+user\-selectable in the downloadable version of MAST.
+.SS
+Annotated Sequences
+.PP
+MAST annotates each high\-scoring sequence by printing the sequence
+along with the position and strength of all the non\-overlapping motif
+occurrences. The four lines above each motif occurrence contain,
+respectively,
+.IP
+* the motif number of the occurrence,
+.IP
+* the position p\-value of the occurrence,
+.IP
+* the best possible match to the motif, and
+.IP
+* a plus sign (`+') above each letter in the occurrence that has a
+positive match score to the motif.
+.PP
+The best possible match to a motif is the sequence of letters which
+would achieve the highest match score.
+.SS
+Hit List
+.PP
+If you specify the \fB\-hit_list\fR switch to MAST, MAST outputs ONLY a list
+of "hits" in easily machine\-readable format. Each line corresponds to
+one motif occurrence in one sequence. The format of the hit lines is
+.IP
+[<sequence_name> <strand><motif> <start> <end> <score> <p\-value>]+
+.PP
+where
+.TP
+<sequence_name> is the name of the sequence containing the hit
+.TP
+<strand>        is the strand (+ or \- for DNA, blank for protein),
+.TP
+<motif>         is the motif number,
+.TP
+<start>         is the starting position of the hit,
+.TP
+<end>           is the ending position of the hit,
+.TP
+<score>         is the score of the hit, and
+.TP
+<p\-value>       is the position p\-value of the hit.
+.PP
+Two comment lines (starting with "#") are written above the list of
+hits, and the MAST command line is printed as a comment line after the
+list. An example of the output using the \fB\-hit_list\fR switch to MAST is:
+.IP
+# All non\-overlapping hits in all sequences.
+.IP
+# sequence_name motif hit_start hit_end score hit_p\-value
+.IP
+ce1cg \-2 8 22  1459.90 1.67e\-06
+.IP
+ara +2 2 16  1661.18 5.04e\-08
+.IP
+bglr1 +2 1 15  1274.97 1.42e\-05
+.IP
+cya \-2 19 33  1101.37 6.64e\-05
+.IP
+gale +2 5 19  1076.21 8.11e\-05
+.IP
+ilv \-2 6 20  1098.85 6.78e\-05
+.IP
+malk +2 37 51  1085.02 7.56e\-05
+.IP
+ompa +2 5 19  1583.18 2.43e\-07
+.IP
+# mast tests/meme/meme.crp0.oops tests/common/crp0.s \fB\-hit_list\fR \fB\-m\fR 2
+.SS
+Loading Multiple Sequence Databases
+.PP
+Multiple sequence databases can be loaded by MAST by putting the file
+names into a file and specifying that file instead of the sequence
+database with the option \fB\-dblist\fR.
+.PP
+The file list has one file name on each line with the optional name and
+link as follows:
+.IP
+<file> [<name> <link>]
+.IP
+\&...
+.IP
+\&...
+.PP
+If the name is specified, it will be used instead of the file name
+in the output. If the link is specified then all sequences for that
+database in the html output will have a hyperlink to the URL specified
+with the text SEQUENCEID replaced with the FASTA sequence id.
+.SH
+EXAMPLES:
+.PP
+The following examples assume that file "meme.results" is the output of
+a MEME run containing at least 3 motifs which was created on the
+training set "training.fasta" and file SwissProt is a copy of the
+Swiss\-Prot database on your local disk. DNA_DB is a copy of a DNA
+database on your local disk.
+.IP
+1. Annotate the training set:
+.IP
+mast meme.results training.fasta
+.IP
+2. Find sequences matching the motif and annotate them in the
+SwissProt database:
+.IP
+mast meme.results SwissProt
+.IP
+3. Show sequences with weaker combined matches to motifs.
+.IP
+mast meme.results SwissProt \fB\-ev\fR 200
+.IP
+4. Include a nominal order and spacing of the first three motifs in
+the calculation of the sequence p\-values to increase the
+sensitivity of the search for matching sequences:
+.IP
+mast meme.results SwissProt \fB\-diag\fR "9\-[2]\-61\-[1]\-62\-[3]\-91"
+.IP
+5. Use only the first and third motifs in the search:
+.IP
+mast meme.results SwissProt \fB\-m\fR 1 \fB\-m\fR 3
+.IP
+6. Use only the first two motifs in the search:
+.IP
+mast meme.results SwissProt \fB\-c\fR 2
+.IP
+7. Search DNA sequences using protein motifs, adjusting p\-values and
+E\-values for each sequence by that sequence's composition:
+.IP
+mast meme.results DNA_DB \fB\-dna\fR \fB\-comp\fR
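
For readers who want to post-process the -hit_list output documented in
the new mast.1, a minimal Python sketch follows. It assumes the fields
are whitespace-separated exactly as in the example in the man page
(sequence name, strand fused to the motif number, start, end, score,
p-value) and that comment lines start with "#". The names Hit and
parse_hit_list are illustrative only and are not part of MAST.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    sequence_name: str
    strand: str      # "+" or "-" for DNA, "" for protein
    motif: int
    start: int
    end: int
    score: float
    p_value: float


def parse_hit_list(lines: Iterable[str]) -> List[Hit]:
    """Parse hit lines, skipping blank lines and '#' comment lines."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, motif_field, start, end, score, p_value = line.split()
        # The strand sign, when present, is fused to the motif number.
        strand = motif_field[0] if motif_field[0] in "+-" else ""
        motif = int(motif_field[len(strand):])
        hits.append(Hit(name, strand, motif, int(start), int(end),
                        float(score), float(p_value)))
    return hits


# Two hit lines copied from the man page example (roff escapes removed).
example = """\
# All non-overlapping hits in all sequences.
# sequence_name motif hit_start hit_end score hit_p-value
ce1cg -2 8 22  1459.90 1.67e-06
ara +2 2 16  1661.18 5.04e-08
# mast tests/meme/meme.crp0.oops tests/common/crp0.s -hit_list -m 2
"""
hits = parse_hit_list(example.splitlines())
```

Keeping the strand separate from the motif number makes it easy to
filter hits by strand without re-parsing the sign.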

Deleted: trunk/packages/meme/trunk/debian/meme_manual.txt
===================================================================
--- trunk/packages/meme/trunk/debian/meme_manual.txt	2013-02-14 12:38:03 UTC (rev 13006)
+++ trunk/packages/meme/trunk/debian/meme_manual.txt	2013-02-14 13:58:05 UTC (rev 13007)
@@ -1,650 +0,0 @@
-USAGE:
-	meme	<dataset> [optional arguments]
-
-	<dataset> 		file containing sequences in FASTA format
-	[-h]			print this message
-	[-dna]			sequences use DNA alphabet
-	[-protein]		sequences use protein alphabet
-	[-mod oops|zoops|anr]	distribution of motifs
-	[-nmotifs <nmotifs>]	maximum number of motifs to find
-	[-evt <evt>]		stop if motif E-value greater than <evt>
-	[-nsites <sites>]	number of sites for each motif
-	[-minsites <minsites>]	minimum number of sites for each motif
-	[-maxsites <maxsites>]	maximum number of sites for each motif
-	[-wnsites <wnsites>]	weight on expected number of sites
-	[-w <w>]		motif width
-	[-minw <minw>]		minimum motif width
-	[-maxw <maxw>]		maximum motif width
-	[-nomatrim]		do not adjust motif width using multiple
-				alignment
-	[-wg <wg>]		gap opening cost for multiple alignments
-	[-ws <ws>]		gap extension cost for multiple alignments
-	[-noendgaps]		do not count end gaps in multiple alignments
-	[-bfile <bfile>]	name of background Markov model file
-	[-revcomp]		allow sites on + or - DNA strands
-	[-pal]			force palindromes (requires -dna)
-	[-maxiter <maxiter>]	maximum EM iterations to run
-	[-distance <distance>]	EM convergence criterion
-	[-prior dirichlet|dmix|mega|megap|addone]
-				type of prior to use
-	[-b <b>]		strength of the prior
-	[-plib <plib>]		name of Dirichlet prior file
-	[-spfuzz <spfuzz>]	fuzziness of sequence to theta mapping
-	[-spmap uni|pam]	starting point seq to theta mapping type
-	[-cons <cons>]		consensus sequence to start EM from
-	[-text]			output in text format (default is HTML)
-	[-maxsize <maxsize>]	maximum dataset size in characters
-	[-nostatus]		do not print progress reports to terminal
-	[-p <np>]		use parallel version with <np> processors
-	[-time <t>]		quit before <t> CPU seconds consumed
-	[-sf <sf>]		print <sf> as name of sequence file
-
-  MEME -- Multiple EM for Motif Elicitation
-   
-  MEME is a tool for discovering motifs in a group of related DNA or protein
-  sequences.
-   
-  A motif is a sequence pattern that occurs repeatedly in a group of related
-  protein or DNA sequences. MEME represents motifs as position-dependent
-  letter-probability matrices which describe the probability of each possible
-  letter at each position in the pattern. Individual MEME motifs do not 
-  contain gaps. Patterns with variable-length gaps are split by MEME into two 
-  or more separate motifs.
-   
-  MEME takes as input a group of DNA or protein sequences (the training set)
-  and outputs as many motifs as requested. MEME uses statistical modeling
-  techniques to automatically choose the best width, number of occurrences,
-  and description for each motif.
-   
-  MEME outputs its results as a hypertext (HTML) document.
-  
-  The MEME results consist of:
-  
-         The version of MEME and the date it was released. 
-  
-         The reference to cite if you use MEME in your research. 
-  
-         A description of the sequences you submitted (the "training set")
-         showing the name, "weight" and length of each sequence. 
-  
-         The command line summary detailing the parameters with which you
-         ran MEME. 
-  
-         Information on each of the motifs MEME discovered, including: 
-             1.A summary line showing the width, number of occurrences, log
-                likelihood ratio and statistical significance of the motif. 
-             2.A simplified position-specific probability matrix. 
-             3.A diagram showing the degree of conservation at each motif
-                position. 
-             4.A multilevel consensus sequence showing the most conserved
-                letter(s) at each motif position. 
-             5.The occurrences of the motif sorted by p-value and aligned with
-                each other. 
-             6.Block diagrams of the occurrences of the motif within each
-                sequence in the training set. 
-             7.The motif in BLOCKS format. 
-             8.A position-specific scoring matrix (PSSM) for use by the
-                MAST database search program. 
-             9.The position specific probability matrix (PSPM) describing the
-                motif. 
-  
-         A summary of motifs showing an optimized (non-overlapping) tiling of
-         all of the motifs onto each of the sequences in the training set. 
-  
-         The reason why MEME stopped and the name of the CPU on which it
-         ran. 
-  
-         This explanation of how to interpret MEME results.  
-  
-  REQUIRED ARGUMENTS:
-  	<dataset>       The name of the file containing the training set 
-  			sequences.  If <dataset> is the word "stdin", MEME
-  			reads from standard input.  
-  
-  			The sequences in the dataset should be in 
-  			Pearson/FASTA format.  For example:
-  
-  			>ICYA_MANSE INSECTICYANIN A FORM (BLUE BILIPROTEIN)
-  			GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAK
-  			LPLENENQGKCTIAEYKYDGKKASVYNSFVSNGVKEYMEGDLEIAPDA
-  			>LACB_BOVIN BETA-LACTOGLOBULIN PRECURSOR (BETA-LG) 
-  			MKCLLLALALTCGAQALIVTQTMKGLDI
-  			QKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKW
-  				
-  			Sequences start with a header line followed by
-  			sequence lines.  A header line has
-  			the character ">" in position one, followed by
-  			a unique name without any spaces, followed by
-  			(optional) descriptive text.  After the header line 
-  			come the actual sequence lines.  Spaces and blank 
-  			lines are ignored.  Sequences may be in capital or 
-  			lowercase or both.  
-  
-  			MEME uses the first word in the header line of each 
-  			sequence, truncated to 24 characters if necessary,
-  			as the name of the sequence. This name must be unique. 
-  			Sequences with duplicate names will be ignored. 
-  			(The first word in the title line is 
-  			everything following the ">" up to the first blank.)
-  
-  			Sequence weights may be specified in the dataset
-  			file by special header lines where the unique name
-  			is "WEIGHTS" (all caps) and the descriptive 
-  			text is a list of sequence weights. 
-  			Sequence weights are numbers in the range 0 < w <=1.
-  			All weights are assigned in order to the
-  			sequences in the file. If there are more sequences
-  			than weights, the remainder are given weight one.
-  			Weights must be greater than zero and less than
-  			or equal to one.  Weights may be specified by
-  			more than one "WEIGHTS" entry which may appear
-  			anywhere in the file.  When weights are used, 
-  			sequences will contribute to motifs in proportion
-  			to their weights.  Here is an example for a file
-  			of three sequences where the first two sequences are 
-  			very similar and it is desired to down-weight them:
-  
-  			>WEIGHTS 0.5 .5 1.0 
-  			>seq1
-  			GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAK
-  			>seq2
-  			GDMFCPGYCPDVKPVGDFDLSAFAGAWHELAK
-  			>seq3
-  			QKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKW
-  
-  
-  OPTIONAL ARGUMENTS:
-   
-  MEME has a large number of optional inputs that can be used
-  to fine-tune its behavior.  To make these easier to understand
-  they are divided into the following categories:
-   
-  		ALPHABET	- control the alphabet for the motifs
-  				  (patterns) that MEME will search for
-   
-  		DISTRIBUTION	- control how MEME assumes the occurrences
-  				  of the motifs are distributed throughout
-  				  the training set sequences
-   
-  		SEARCH		- control how MEME searches for motifs
-   
-                  SYSTEM          - the -p <np> argument causes a version of MEME
-                                    compiled for a parallel CPU architecture
-                                    to be run.  (By placing <np> in quotes you
-                                    may pass installation specific switches to
-  				  the 'mpirun' command.  The number of 
-                                    processors to run on must be the first 
-  				  argument following -p).
-  
-   
-  In what follows, <n> is an integer, <a> is a decimal number, and <string> 
-  is a string of characters.
-   
-  ALPHABET
-  --------
-  MEME accepts either DNA or protein sequences, but not both in the same run.
-  By default, sequences are assumed to be protein.  The sequences must be in 
-  FASTA format.
-  
-  DNA sequences must contain only the letters "ACGT", plus the ambiguous
-  letters "BDHKMNRSUVWY*-". 
-  Protein sequences must contain only the letters "ACDEFGHIKLMNPQRSTVWY",
-  plus the ambiguous letters "BUXZ*-".
-  
-  MEME converts all ambiguous letters to "X", which is treated as "unknown".
-   
-  	-dna		Assume sequences are DNA; default: protein sequences
-  	-protein	Assume sequences are protein
-  
-   
-  DISTRIBUTION
-  ------------
-  If you know how occurrences of motifs are distributed in the training set 
-  sequences, you can specify it with the following optional switches.  The 
-  default distribution of motif occurrences is assumed to be zero or one 
-  occurrence per sequence.
-   
-  	-mod <string>   The type of distribution to assume.
-  			oops    One Occurrence Per Sequence
-  				MEME assumes that each sequence in the dataset
-  				contains exactly one occurrence of each motif.
-  				This option is the fastest and most sensitive
-  				but the motifs returned by MEME may be 
-  				"blurry" if any of the sequences is missing
-  				them. 	
-   
-  			zoops   Zero or One Occurrence Per Sequence
-  				MEME assumes that each sequence may contain at
-  				most one occurrence of each motif. This option
-  				is useful when you suspect that some motifs
-  				may be missing from some of the sequences. In
-  				that case, the motifs found will be more
-  				accurate than using the first option. This
-  				option takes more computer time than the
-  				first option (about twice as much) and is
-  				slightly less sensitive to weak motifs present
-  				in all of the sequences.
-   
-  			anr 	Any Number of Repetitions
-  				MEME assumes each sequence may contain any
-  				number of non-overlapping occurrences of each
-  				motif. This option is useful when you suspect
-  				that motifs repeat multiple times within a
-  				single sequence. In that case, the motifs 
-  				found will be much more accurate than using 
-  				one of the other options. This option can also
-  				be used to discover repeats within a single
-  				sequence. This option takes much more
-  				computer time than the first option (about ten
-  				times as much) and is somewhat less sensitive
-  				to weak motifs which do not repeat within a
-  				single sequence than the other two options.
-   
-   
-  SEARCH
-  ------
-  
-  A) OBJECTIVE FUNCTION
-  
-  MEME uses an objective function on motifs to select the "best" motif.
-  The objective function is based on the statistical significance of the 
-  log likelihood ratio (LLR) of the occurrences of the motif.  
-  The E-value of the motif is an estimate of the number of motifs (with the 
-  same width and number of occurrences) that would have equal or higher log 
-  likelihood ratio if the training set sequences had been generated randomly 
-  according to the (0-order portion of the) background model. 
-  
-  MEME searches for the motif with the smallest E-value.
-  It searches over different motif widths, numbers of occurrences, and
-  positions in the training set for the motif occurrences.
-  The user may limit the range of motif widths and number of occurrences
-  that MEME tries using the switches described below.  In addition,
-  MEME trims the motif (using a dynamic programming multiple alignment) to 
-  eliminate any positions where there is a gap in any of the occurrences.  
-  
-  The log likelihood ratio of a motif is
-  	llr = log (Pr(sites | motif) / Pr(sites | back))
-  and is a measure of how different the sites are from the background model.
-  Pr(sites | motif) is the probability of the occurrences given a model
-  consisting of the position-specific probability matrix (PSPM) of the motif.
-  (The PSPM is output by MEME).
-  Pr(sites | back) is the  probability of the occurrences given the background
-  model.  The background model is an n-order Markov model.  By default,
-  it is a 0-order model consisting of the frequencies of the letters in
-  the training set.  A different 0-order Markov model or higher order Markov 
-  models can be specified to MEME using the -bfile option described below.
-  
-  The E-value reported by MEME is actually an approximation of the E-value
-  of the log likelihood ratio.  (An approximation is used because it is far
-  more efficient to compute.)  The approximation is based on the fact that
-  the log likelihood ratio of a motif is the sum of the log 
-  likelihood ratios of each column of the motif.  Instead of computing the 
-  statistical significance of this sum (its p-value), MEME computes the 
-  p-value of each column and then computes the significance of their product.  
-  Although not identical to the significance of the log likelihood ratio, this 
-  easier to compute objective function works very similarly in practice.
-  
-  The motif significance is reported as the E-value of the motif.  
-  The statistical significance of a motif is computed based on:
-  	1) the log likelihood ratio,
-  	2) the width of the motif,
-  	3) the number of occurrences,
-  	4) the 0-order portion of the background model,
-  	5) the size of the training set, and
-  	6) the type of model (oops, zoops, or anr, which determines the
-  	   number of possible different motifs of the given width and
-  	   number of occurrences).
-  
-  MEME searches for motifs by performing Expectation Maximization (EM) on a 
-  motif model of a fixed width and using an initial estimate of the number of 
-  sites.  It then sorts the possible sites according to their probability 
-  according to EM.  MEME then calculates the E-values of the first n sites 
-  in the sorted list for different values of n.  This procedure (first EM, 
-  followed by computing E-values for different numbers of sites) is repeated 
-  with different widths and different initial estimates of the number of 
-  sites.  MEME outputs the motif with the lowest E-value.
-  
-   
-  B) NUMBER OF MOTIFS
-   
-  	-nmotifs <n>    The number of *different* motifs to search
-  			for.  MEME will search for and output <n> motifs.
-  			Default: 1
-   
-  	-evt <p>	Quit looking for motifs if E-value exceeds <p>.
-  			Default: infinite (so by default MEME never quits
-  			before -nmotifs <n> have been found.)
-   
-   
-  C) NUMBER OF MOTIF OCCURRENCES
-   
-  	-nsites <n>
-  	-minsites <n>
-  	-maxsites <n>
-  			The (expected) number of occurrences of each motif.
-  			If -nsites is given, only that number of occurrences
-  			is tried.  Otherwise, numbers of occurrences between
-  			-minsites and -maxsites are tried as initial guesses
-  			for the number of motif occurrences.  These
-  			switches are ignored if mod = oops.
-   
-  			Default: -minsites sqrt(number sequences)
-  				 -maxsites Default:
-  					zoops 	# of sequences
-  					anr	MIN(5*#sequences, 50)
-  
-  	-wnsites <n>	The weight on the prior on nsites.  This controls
-  			how strong the bias towards motifs with exactly
-  			nsites sites (or between minsites and maxsites sites)
-  			is.  It is a number in the range [0..1).  The
-  			larger it is, the stronger the bias towards 
-  			motifs with exactly nsites occurrences is.
-  			Default: 0.8
-   
-  D) MOTIF WIDTH
-   
-  	-w <n>
-  	-minw <n>
-  	-maxw <n>
-  
-  			The width of the motif(s) to search for.
-  			If -w is given, only that width is tried.
-  			Otherwise, widths between -minw and -maxw are tried.
-  			Default: -minw  8, -maxw 50 (defined in user.h)
-  
-  			Note: If <n> is greater than the length of the shortest 
-  			sequence in the dataset, <n> is reset by MEME to 
-  			that value. 
-  
-  	-nomatrim
-  	-wg <a>
-  	-ws <a>
-  	-noendgaps
-  			These switches control trimming (shortening) of
-  			motifs using the multiple alignment method.
-  			Specifying -nomatrim causes MEME to skip this and
-  			causes the other switches to be ignored.
-  			MEME finds the best motif
-  			found and then trims (shortens) it using the multiple 
-  			alignment method (described below). The number of 
-  			occurrences is then adjusted to minimize the motif 
-  			E-value, and then the motif width is further
-  			shortened to optimize the E-value.
-  
-  			The multiple alignment method performs a separate 
-  			pairwise alignment of the site with the highest
-  			probability and each other possible site.
-  			(The alignment includes width/2 positions on either 
-  			side of the sites.) The pairwise alignment
-  			is controlled by the switches:
-  				-wg <a> (gap cost; default: 11), 
-  				-ws <a> (space cost; default 1), and, 
-  				-noendgaps (do not penalize endgaps; default: 
-  					penalize endgaps).  
-  			The pairwise alignments are then combined and the 
-  			method determines the widest section of the motif with 
-  			no insertions or deletions.  If this alignment
-  		        is shorter than <minw>, it tries to find an alignment
-  			allowing up to one insertion/deletion per motif
-  			column.  This continues (allowing up to 2, 3 ...
-  			insertions/deletions per motif column) until an 
-  			alignment of width at least <minw> is found. 
-  
-  
-  E) BACKGROUND MODEL
-  	-bfile <bfile>	The name of the file containing the background model
-  			for sequences.  The background model is the model
-  			of random sequences used by MEME.  The background 
-  			model is used by MEME 
-  				1) during EM as the "null model",
-  				2) for calculating the log likelihood ratio
-  				   of a motif,
-  				3) for calculating the significance (E-value) 
-  				   of a motif, and, 
-  				4) for creating the position-specific scoring
-  				   matrix (log-odds matrix).
-  
-  			By default, the background model is a 0-order Markov 
-  			model based on the letter frequencies in the training 
-  			set.  
-  
-  			Markov models of any order can be specified in <bfile> 
-  			by listing frequencies of all possible tuples of 
-  			length up to order+1.  
-  
-  			Note that MEME uses only the 0-order portion (single
-  			letter frequencies) of the background model for
-  			purposes 3) and 4), but uses the full-order model
-  			for purposes 1) and 2), above.
-  
-  			Example: To specify a 1-order Markov background model
-  		 		 for DNA, <bfile> might contain the following
-  				 lines.  Note that optional comment lines are
-  				 marked by "#" and are ignored by MEME.
-  
-  				# tuple   frequency_non_coding
-  				a       0.324
-  				c       0.176
-  				g       0.176
-  				t       0.324
-  				# tuple   frequency_non_coding
-  				aa      0.119
-  				ac      0.052
-  				ag      0.056
-  				at      0.097
-  				ca      0.058
-  				cc      0.033
-  				cg      0.028
-  				ct      0.056
-  				ga      0.056
-  				gc      0.035
-  				gg      0.033
-  				gt      0.052
-  				ta      0.091
-  				tc      0.056
-  				tg      0.058
-  				tt      0.119
-  
-  Sample -bfile files are given in directory tests: 
-  	tests/nt.freq (DNA), and 
-  	tests/na.freq (amino acid).
-  
-  F) DNA PALINDROMES AND STRANDS
-   
-  	-revcomp	motif occurrences may be on the given DNA strand
-  			or on its reverse complement.
-  			Default: look for DNA motifs only on the strand given 
-  			in the training set.
-   
-  	-pal		
-  			Choosing -pal causes MEME to look for palindromes in 
-  			DNA datasets.  
-  
-  			MEME averages the letter frequencies in corresponding 
-  			columns of the motif (PSPM) together. For instance, 
-  			if the width of the motif is 10, columns 1 and 10, 2 
-  			and 9, 3 and 8, etc., are averaged together.  The 
-  			averaging combines the frequency of A in one column 
-  			with T in the other, and the frequency of C in one 
-  			column with G in the other.  
-  			If this option is not chosen, MEME does not 
-  			search for DNA palindromes.
-  
-  
-  G) EM ALGORITHM
-   
-  	-maxiter <n>    The number of iterations of EM to run from
-  			any starting point.
-  			EM is run for <n> iterations or until convergence
-  			(see -distance, below) from each starting point.
-  			Default: 50
-   
-  	-distance <a>   The convergence criterion.  MEME stops
-  			iterating EM when the change in the
-  			motif frequency matrix is less than <a>.
-  			(Change is the Euclidean distance between
-  			two successive frequency matrices.)
-  			Default: 0.001
-   
-  	-prior <string> The prior distribution on the model parameters:
-  			dirichlet       simple Dirichlet prior
-  					This is the default for -dna and 
-  					-alph.  It is based on the 
-  					non-redundant database letter
-  					frequencies.
-  			dmix		mixture of Dirichlets prior
-  					This is the default for -protein. 
-  			mega		extremely low variance dmix;
-  					variance is scaled inversely with
-  					the size of the dataset.
-  			megap		mega for all but last iteration
-  					of EM; dmix on last iteration.
-  			addone		add +1 to each observed count
-   
-  	-b <a>	  The strength of the prior on model parameters:
-  				<a> = 0 means use intrinsic strength of prior
-  					for prior = dmix.
-  			Defaults:
-  				0.01 if prior = dirichlet
-  				0 if prior = dmix
-   
-  	-plib <string>  The name of the file containing the Dirichlet prior
-  			in the format of file prior30.plib.
-   
-   
-  H) SELECTING STARTS FOR EM
-   
-  The default is for MEME to search the dataset for good starts for EM.  How 
-  the starting points are derived from the dataset is specified by the 
-  following switches.
-   
-  The default type of mapping MEME uses is:
-  		-spmap uni for -dna and -alph <string>
-  		-spmap pam for -protein
-   
-  	-spfuzz <a>     The fuzziness of the mapping.
-  			Possible values are greater than 0.  Meaning
-  			depends on -spmap, see below.
-   
-  	-spmap <string> The type of mapping function to use.
-  			uni     Use add-<a> prior when converting a substring
-  				to an estimate of theta.
-  				Default -spfuzz <a>: 0.5
-  			pam     Use columns of PAM <a> matrix when converting
-  				a substring to an estimate of theta.
-  				Default -spfuzz <a>: 120 (PAM 120)
-   
-  			Other types of starting points
-  			can be specified using the following switches.
-   
-  	-cons <string>  Override the sampling of starting points
-  			and just use a starting point derived from
-  			<string>.
-  			This is useful when an actual occurrence of
-  			a motif is known and can be used as the
-  			starting point for finding the motif.
-  
-  EXAMPLES:
-  
-  The following examples use data files provided in this release of MEME.  
-  MEME writes its output to standard output, so you will want to redirect it 
-  to a file in order to use it with MAST.
-   
-  1) A simple DNA example:
-   
-  	 meme crp0.s -dna -mod oops -pal > ex1.html
-   
-  MEME looks for a single motif in the file crp0.s which contains DNA 
-  sequences in FASTA format.  The OOPS model is used so MEME assumes that 
-  every sequence contains exactly one occurrence of the motif.  The 
-  palindrome switch is given so the motif model (PSPM) is converted into a 
-  palindrome by combining corresponding frequency columns.  MEME automatically 
-  chooses the best width for the motif in this example since no width was 
-  specified.
-   
-  2) Searching for motifs on both DNA strands:
-  
-           meme crp0.s -dna -mod oops -revcomp > ex2.html
-  
-  This is like the previous example except that the -revcomp switch tells
-  MEME to consider both DNA strands, and the -pal switch is absent so the
-  palindrome conversion is omitted.  When MEME uses both DNA strands, motif
-  occurrences on the two strands may not overlap.  That is, any position
-  in the sequence given in the training set may be contained in an occurrence
-  of a motif on the positive strand or the negative strand, but not both.
-  
-  3) A fast DNA example:
-   
-  	meme crp0.s -dna -mod oops -revcomp -w 20 > ex3.html
-   
-  This example differs from example 1) in that MEME is told to only 
-  consider motifs of width 20.  This causes MEME to execute about 10 
-  times faster.  The -w switch can also be used with protein datasets if 
-  the width of the motifs is known in advance.
-  
-  4) Using a higher-order background model:
-  
-  	meme INO_up800.s -dna -mod anr -revcomp -bfile yeast.nc.6.freq > ex4.html
-  
-  In this example we use -mod anr and -bfile yeast.nc.6.freq.  This specifies 
-  that
-  	a) the motif may have any number of occurrences in each sequence, and,
-  	b) the Markov model specified in yeast.nc.6.freq is used as the 
-  	   background model.  This file contains a fifth-order Markov model 
-             for the non-coding regions in the yeast genome.
-  Using a higher order background model can often result in more sensitive
-  detection of motifs.  This is because the background model more accurately
-  models non-motif sequence, allowing MEME to discriminate against it and find 
-  the true motifs.
-  
-  5) A simple protein example:
-   
-  	meme lipocalin.s -mod oops -maxw 20 -nmotifs 2 > ex5.html
-   
-  The -dna switch is absent, so MEME assumes the file lipocalin.s contains 
-  protein sequences.  MEME searches for two motifs each of width less than or 
-  equal to 20.
-  (Specifying -maxw 20 makes MEME run faster since it does not have to 
-  consider motifs longer than 20.) Each motif is assumed to occur in each 
-  of the sequences because the OOPS model is specified.
-   
-  6) Another simple protein example:
-   
-  	meme farntrans5.s -mod anr -maxw 40 -maxsites 50 > ex6.html
-   
-  MEME searches for a motif of width up to 40 with up to 50 occurrences in
-  the entire training set.  The ANR sequence model is specified,
-  which allows each motif to have any number of occurrences in each sequence.  
-  This dataset contains motifs that repeat multiple times in each 
-  sequence.  This example is fairly time consuming due to the fact that the 
-  time required to initialize the motif probability tables is proportional 
-  to <maxw> times <maxsites>.  By default, MEME only looks for motifs up to 
-  29 letters wide with a maximum total number of occurrences equal to twice 
-  the number of sequences or 30, whichever is less.
-  
-  7) A much faster protein example:
-  
-  	meme farntrans5.s -mod anr -w 10 -maxsites 30 -nmotifs 3 > ex7.html
-  
-  This time MEME is constrained to search for three motifs of width exactly 
-  ten.  The effect is to break up the long motif found in the previous 
-  example.  The -w switch forces motifs to be *exactly* ten letters wide.
-  This example is much faster because, since only one width is considered, the
-  time to build the motif probability tables is only proportional to 
-  <maxsites>.
-  
-  8) Splitting the sites into three:
-  
-  	meme farntrans5.s -mod anr -maxw 12 -nsites 24 -nmotifs 3 > ex8.html
-  
-  This forces each motif to have 24 occurrences, exactly, and be up to 12 
-  letters wide.
-  
-  9) A larger protein example with E-value cutoff:
-  
-  	meme adh.s -mod zoops -nmotifs 20 -evt 0.01 > ex9.html
-  
-  In this example, MEME looks for up to 20 motifs, but stops when a motif is
-  found with E-value greater than 0.01.  Motifs with large E-values are likely
-  to be statistical artifacts rather than biologically significant.
-
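
As a reading aid for the -bfile format described in the removed
meme_manual.txt, here is a small Python sketch. It assumes one
"tuple frequency" pair per line with "#" starting a comment line, as in
the manual's example, and infers the Markov order as the longest tuple
length minus one. The function names parse_background and summarize are
hypothetical and do not come from MEME.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def parse_background(lines: Iterable[str]) -> Dict[str, float]:
    """Map each tuple (e.g. 'a', 'ac') to its frequency."""
    freqs = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        tup, value = line.split()
        freqs[tup.lower()] = float(value)
    return freqs


def summarize(freqs: Dict[str, float]) -> Tuple[int, Dict[int, float]]:
    """Return (Markov order, frequency total per tuple length)."""
    totals = defaultdict(float)
    for tup, freq in freqs.items():
        totals[len(tup)] += freq
    order = max(totals) - 1  # tuples of length up to order+1 are listed
    return order, dict(totals)


# The 1-order DNA example from the manual.
example = """\
# tuple   frequency_non_coding
a       0.324
c       0.176
g       0.176
t       0.324
# tuple   frequency_non_coding
aa      0.119
ac      0.052
ag      0.056
at      0.097
ca      0.058
cc      0.033
cg      0.028
ct      0.056
ga      0.056
gc      0.035
gg      0.033
gt      0.052
ta      0.091
tc      0.056
tg      0.058
tt      0.119
"""
order, totals = summarize(parse_background(example.splitlines()))
```

Checking the per-length totals is a quick sanity test: each should be
close to 1, allowing for rounding in the listed frequencies.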



