[med-svn] [mhap] 01/04: Imported Upstream version 2.1.1+dfsg

Afif Elghraoui afif at moszumanska.debian.org
Sun Oct 9 19:02:34 UTC 2016


This is an automated email from the git hooks/post-receive script.

afif pushed a commit to branch master
in repository mhap.

commit 6f4bf90ddaac4f517b3bd044438713519e99b2d1
Author: Afif Elghraoui <afif at debian.org>
Date:   Sun Oct 9 11:15:12 2016 -0700

    Imported Upstream version 2.1.1+dfsg
---
 README.md                                          |  4 +-
 docs/source/installation.rst                       | 18 ++++-----
 docs/source/quickstart.rst                         |  7 +++-
 docs/source/utilities.rst                          |  6 +--
 pom.xml                                            |  2 +-
 .../mhap/impl/MinHashBitSequenceSubSketches.java   |  7 ++--
 .../edu/umd/marbl/mhap/impl/MinHashSearch.java     |  7 +---
 .../edu/umd/marbl/mhap/impl/SequenceSketch.java    | 17 ++++----
 .../marbl/mhap/impl/SequenceSketchStreamer.java    | 45 ++++++++++++++++------
 .../java/edu/umd/marbl/mhap/main/AlignmentTry.java |  3 +-
 .../java/edu/umd/marbl/mhap/main/EstimateROC.java  |  2 +-
 .../edu/umd/marbl/mhap/main/KmerStatSimulator.java | 13 ++++---
 .../java/edu/umd/marbl/mhap/main/MhapMain.java     |  5 +--
 ...edNGramHashes.java => BottomOverlapSketch.java} | 16 ++++----
 .../sketch/{BottomHash.java => BottomSketch.java}  |  8 ++--
 .../edu/umd/marbl/mhap/sketch/FrequencyCounts.java | 34 +++++++++-------
 .../umd/marbl/mhap/sketch/MinHashBitSketch.java    |  2 +-
 .../edu/umd/marbl/mhap/sketch/MinHashSketch.java   | 41 +++++++++++---------
 .../mhap/sketch/ZeroNGramsFoundException.java      | 43 +++++++++++++++++++++
 19 files changed, 177 insertions(+), 103 deletions(-)
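
Reviewer note: the main behavioural change in this import appears to be that
a read too short to yield any n-gram now raises the checked
ZeroNGramsFoundException, which SequenceSketchStreamer.enqueueUntilFound
logs before moving on to the next read. The sketch below is a minimal,
self-contained illustration of that skip-and-continue pattern, not the
upstream code. Every name in it (SkipShortReadsSketch, NoNGramsException,
countNGrams, sketchAll) is hypothetical, and the "sketch" is reduced to a
plain n-gram count so the example stays runnable on its own.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Hypothetical illustration only; none of these names exist in MHAP.
final class SkipShortReadsSketch
{
    /** Stand-in for a checked exception that carries the offending read. */
    static final class NoNGramsException extends Exception
    {
        private static final long serialVersionUID = 1L;
        private final String sequence;

        NoNGramsException(String message, String sequence)
        {
            super(message);
            this.sequence = sequence;
        }

        String getSequenceString()
        {
            return this.sequence;
        }
    }

    private final Queue<String> input;
    private final List<Integer> output = new ArrayList<>();
    private final int kmerSize;

    SkipShortReadsSketch(List<String> reads, int kmerSize)
    {
        this.input = new ArrayDeque<>(reads);
        this.kmerSize = kmerSize;
    }

    /** Toy "sketch": the number of n-grams in the read; throws if there are none. */
    static int countNGrams(String seq, int kmerSize) throws NoNGramsException
    {
        int count = seq.length() - kmerSize + 1;
        if (count <= 0)
            throw new NoNGramsException("Sequence length must be greater or equal to n-gram size " + kmerSize + ".", seq);
        return count;
    }

    /** Processes one read; returns false once the input is exhausted. */
    private boolean enqueue() throws NoNGramsException
    {
        // The read is removed before sketching, so a failure cannot be retried on the same read.
        String seq = this.input.poll();
        if (seq == null)
            return false;
        this.output.add(countNGrams(seq, this.kmerSize));
        return true;
    }

    /** Keeps calling enqueue() until a read is accepted or the input runs out. */
    boolean enqueueUntilFound()
    {
        while (true)
        {
            try
            {
                return enqueue();
            }
            catch (NoNGramsException e)
            {
                System.err.println("Skipping read with zero valid n-grams: " + e.getSequenceString());
            }
        }
    }

    /** Sketches every read, silently dropping the ones that are too short. */
    static List<Integer> sketchAll(List<String> reads, int kmerSize)
    {
        SkipShortReadsSketch streamer = new SkipShortReadsSketch(reads, kmerSize);
        while (streamer.enqueueUntilFound())
        {
        }
        return streamer.output;
    }

    public static void main(String[] args)
    {
        // "AC" is shorter than the n-gram size and is skipped; prints [5, 2]
        System.out.println(sketchAll(Arrays.asList("ACGTACGT", "AC", "ACGTA"), 4));
    }
}
```

Making the exception checked forces each caller either to declare it or to
decide how to recover, which is presumably why so many signatures in the
diff gain a throws clause while only the streamer catches it.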

diff --git a/README.md b/README.md
index c54debd..f994920 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # MHAP
 
-MinHash alignment process (MHAP pronounced MAP): locality sensitive hashing to detect overlaps and utilities. This is the development branch, please use the [latest tagged](https://github.com/marbl/MHAP/releases/tag/v2.1).
+MinHash alignment process (MHAP pronounced MAP): locality sensitive hashing to detect overlaps and utilities. This is the development branch, please use the [latest tagged](https://github.com/marbl/MHAP/releases/tag/v2.1.1).
 
 ## Build
 
@@ -13,7 +13,7 @@ You must have a recent  [JDK](http://www.oracle.com/technetwork/java/javase/down
 For a quick user-quide, run:
 
     cd target
-    java -jar mhap-2.1.jar
+    java -jar mhap-2.1.1.jar
 
 ## Docs
 For the full documentation information please see http://mhap.readthedocs.io/en/latest/
diff --git a/docs/source/installation.rst b/docs/source/installation.rst
index 0ff9a46..0abefe2 100644
--- a/docs/source/installation.rst
+++ b/docs/source/installation.rst
@@ -28,19 +28,19 @@ The pre-compiled version is recommended to users who want to run MHAP, without d
 
 .. code-block:: bash
 
-    $ wget https://github.com/marbl/MHAP/releases/download/v2.1/mhap-2.1.tar.gz
+    $ wget https://github.com/marbl/MHAP/releases/download/v2.1.1/mhap-2.1.1.tar.gz
 
 And if ``wget`` not available, you can use ``curl`` instead:
 
 .. code-block:: bash
 
-    $ curl -L https://github.com/marbl/MHAP/releases/download/v2.1/mhap-2.1.tar.gz > mhap-2.1.tar.gz
+    $ curl -L https://github.com/marbl/MHAP/releases/download/v2.1.1/mhap-2.1.1.tar.gz > mhap-2.1.1.tar.gz
 
 Then run
 
 .. code-block:: bash
 
-   $ tar xvzf mhap-2.1.tar.gz
+   $ tar xvzf mhap-2.1.1.tar.gz
 
 Source
 -----------------
@@ -49,7 +49,7 @@ To build the code from the release:
 
 .. code-block:: bash
 
-    $ wget https://github.com/marbl/MHAP/archive/v2.1.zip
+    $ wget https://github.com/marbl/MHAP/archive/v2.1.1.zip
 
 If you see a certificate not trusted error, you can add the following option to wget:
 
@@ -61,22 +61,22 @@ And if ``wget`` not available, you can use ``curl`` instead:
 
 .. code-block:: bash
 
-    $ curl -L https://github.com/marbl/MHAP/archive/v2.1.zip > v2.1.zip
+    $ curl -L https://github.com/marbl/MHAP/archive/v2.1.1.zip > v2.1.zip
 
-You can also browse the https://github.com/marbl/MHAP/tree/v2.1
+You can also browse the https://github.com/marbl/MHAP/tree/v2.1.1
 and click on Downloads. 
 
 Once downloaded, extract to unpack:
 
 .. code-block:: bash
 
-    $ unzip v2.1.zip
+    $ unzip v2.1.1.zip
 
 Change to MASH directory:
 
 .. code-block:: bash
 
-    $ cd MHAP-2.1
+    $ cd MHAP-2.1.1
 
 Once inside the directory, run:
 
@@ -84,4 +84,4 @@ Once inside the directory, run:
 
     $ maven install
 
-This will compile the program and create a target/mhap-2.1.jar file which you can use to run MHAP. The quick-start instructions assume you are in the target directory when running the program. You can also use the target/mhap-2.1.jar file to copy MHAP to a different system or directory. If you would like to run the `validation utilties <utilities.html>`_ you must also download and build the `SSW Library <https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library>`_. Follow the in [...]
+This will compile the program and create a target/mhap-2.1.1.jar file which you can use to run MHAP. The quick-start instructions assume you are in the target directory when running the program. You can also use the target/mhap-2.1.1.jar file to copy MHAP to a different system or directory. If you would like to run the `validation utilties <utilities.html>`_ you must also download and build the `SSW Library <https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library>`_. Follow th [...]
diff --git a/docs/source/quickstart.rst b/docs/source/quickstart.rst
index 2492065..d0d0032 100644
--- a/docs/source/quickstart.rst
+++ b/docs/source/quickstart.rst
@@ -9,7 +9,7 @@ Running MHAP provides command-line documenation if you run it without parameters
  
 .. code-block:: bash
 
-    $ java -jar mhap-2.1.jar
+    $ java -jar mhap-2.1.1.jar
 
 MHAP has two main usage modes, the main finds all overlaps between the input sequences. The second  only constructs an index which can be subsequently reused. 
 
@@ -18,7 +18,7 @@ Finding overlaps
 
 .. code-block:: bash
 
-   $ java -Xmx32g -server -jar mhap-2.1.jar -s<fasta/dat from/self file> [-q<fasta/dat to file or directory>] [-f<kmer filter list, must be sorted>]
+   $ java -Xmx32g -server -jar mhap-2.1.1.jar -s<fasta/dat from/self file> [-q<fasta/dat to file or directory>] [-f<kmer filter list, must be sorted>]
 
 Both the -s and -q options can accept either FastA sequences or binary dat files (generated as described below). The -q option can accept either a file or a directory, in which case all FastA/dat files in the specified directory will be used. By default, only the sequences specified by -s are indexed and the sequences in -q are streamed against the constructed index. Generally, 32GB of RAM is sufficient to index 40K sequences. If you have more sequences, you can partition your data and r [...]
 
@@ -101,6 +101,8 @@ The full list of options is available via command-line help (--help or -h). Belo
 			[int] The size of k-mers used in the ordered second stage filter.
 		--ordered-sketch-size, default = 1536
 			[int] The sketch size for second stage filter.
+		--repeat-idf-scale, default = 3.0
+			[double] The upper range of the idf (from tf-idf) scale. The full scale will be [1,X], where X is the parameter.
 		--repeat-weight, default = 0.9
 			[double] Repeat suppression strength for tf-idf weighing. <0.0 do unweighted MinHash (version 1.0), >=1.0 do only the tf weighing. To perform no idf weighting, do no supply -f option. 
 		--settings, default = 0
@@ -125,3 +127,4 @@ The full list of options is available via command-line help (--help or -h). Belo
 			Usage 1: The FASTA file of reads, or a directory of files, that will be compared to the set of reads in the box (see -s). Usage 2: The output directory for the binary formatted dat files.
 		-s, default = ""
 			Usage 1 only. The FASTA or binary dat file (see Usage 2) of reads that will be stored in a box, and that all subsequent reads will be compared to.
+
diff --git a/docs/source/utilities.rst b/docs/source/utilities.rst
index f041afb..705e067 100644
--- a/docs/source/utilities.rst
+++ b/docs/source/utilities.rst
@@ -14,7 +14,7 @@ Assuming you have a mapping of sequences to a truth (such as a reference genome)
 
 .. code-block:: bash
 
-   $ java -cp mhap-2.1.jar edu.umd.marbl.mhap.main.EstimateROC <reference mapping M4> <overlaps M4/MHAP> <fasta of sequences> [minimum overlap length to evaluate] [number of random trials] [use dynamic programming] [verbose] [minimum identity of overlap] [maximum different between expected overlap and reported] [load all overlaps]
+   $ java -cp mhap-2.1.1.jar edu.umd.marbl.mhap.main.EstimateROC <reference mapping M4> <overlaps M4/MHAP> <fasta of sequences> [minimum overlap length to evaluate] [number of random trials] [use dynamic programming] [verbose] [minimum identity of overlap] [maximum different between expected overlap and reported] [load all overlaps]
 
 The default minimum overlap length is 2000 and default number of trials is 10000. This will estimate sensitivity/specificity to within 1%. It can be increased at the expense of runtime. Specifying 0 will examine all possible N^2 overlap pairs. 
 
@@ -41,12 +41,12 @@ MHAP includes a tool to simulate sequencing data with random error as well as es
 
 .. code-block:: bash
 
-   $ java -cp mhap-2.1.jar edu.umd.marbl.mhap.main.KmerStatSimulator <# sequences> <sequence length (bp)> <insertion error rate> <deletion error rate> <substitution error rate> [reference genome]
+   $ java -cp mhap-2.1.1.jar edu.umd.marbl.mhap.main.KmerStatSimulator <# sequences> <sequence length (bp)> <insertion error rate> <deletion error rate> <substitution error rate> [reference genome]
 
 The error rates must be between 0 and 1 and are additive. Specifying 10% insertion, 2% deletion, and 1% substitution will result in sequences with a 13% error rate. If no reference sequence is given, completely random sequences are generated and errors added. Otherwise, random sequences are drawn from the reference and errors added. Errors are added randomly with no bias.
 
 .. code-block:: bash
 
-   $  java -cp mhap-2.1.jar edu.umd.marbl.mhap.main.KmerStatSimulator <# trials> <kmer size> <sequence length> <overlap length> <insertion error rate> <deletion error rate> <substitution error rate> [one-sided error] [reference genome] [kmer filter]
+   $  java -cp mhap-2.1.1.jar edu.umd.marbl.mhap.main.KmerStatSimulator <# trials> <kmer size> <sequence length> <overlap length> <insertion error rate> <deletion error rate> <substitution error rate> [one-sided error] [reference genome] [kmer filter]
 
 This usage will output a distribution of Jaccard similarity between a pair of overlapping sequences with the specified error rate (when using the specified k-mer size) and two random sequences of the same length. If no reference sequence is given, completely random sequences are generated and errors added, otherwise sequences are drawn from the reference. When one-sided error is specified (by typing true for the parameter), only one of the two sequences will have error simulated, matchin [...]
diff --git a/pom.xml b/pom.xml
index 008edef..defbee2 100644
--- a/pom.xml
+++ b/pom.xml
@@ -3,7 +3,7 @@
 	<modelVersion>4.0.0</modelVersion>
 	<groupId>mhap</groupId>
 	<artifactId>mhap</artifactId>
-	<version>2.1</version>
+	<version>2.1.1</version>
 	<name>MinHash Alignment Process</name>
 	<build>
 		<resources>
diff --git a/src/main/java/edu/umd/marbl/mhap/impl/MinHashBitSequenceSubSketches.java b/src/main/java/edu/umd/marbl/mhap/impl/MinHashBitSequenceSubSketches.java
index 24aa93b..913d25e 100644
--- a/src/main/java/edu/umd/marbl/mhap/impl/MinHashBitSequenceSubSketches.java
+++ b/src/main/java/edu/umd/marbl/mhap/impl/MinHashBitSequenceSubSketches.java
@@ -36,12 +36,13 @@ import edu.umd.marbl.mhap.align.AlignElementDoubleSketch;
 import edu.umd.marbl.mhap.align.Aligner;
 import edu.umd.marbl.mhap.sketch.MinHashBitSketch;
 import edu.umd.marbl.mhap.sketch.MinHashSketch;
+import edu.umd.marbl.mhap.sketch.ZeroNGramsFoundException;
 
 public final class MinHashBitSequenceSubSketches
 {
 	private final AlignElementDoubleSketch<MinHashBitSketch> alignmentSketch;
 	
-	public final static MinHashBitSketch[] computeSequences(String seq, int nGramSize, int stepSize, int numWords)
+	public final static MinHashBitSketch[] computeSequences(String seq, int nGramSize, int stepSize, int numWords) throws ZeroNGramsFoundException
 	{
 		int remainder = seq.length()%stepSize;
 		
@@ -70,7 +71,7 @@ public final class MinHashBitSequenceSubSketches
 		return sequence;
 	}
 	
-	public final static MinHashBitSketch[] computeSequencesDouble(String seq, int nGramSize, int stepSize, int numWords)
+	public final static MinHashBitSketch[] computeSequencesDouble(String seq, int nGramSize, int stepSize, int numWords) throws ZeroNGramsFoundException
 	{
 		int remainder = seq.length()%stepSize;
 		
@@ -137,7 +138,7 @@ public final class MinHashBitSequenceSubSketches
 		this.alignmentSketch = new AlignElementDoubleSketch<>(sketches, stepSize, seqLength);
 	}
 	
-	public MinHashBitSequenceSubSketches(String seq, int kmerSize, int stepSize, int numWords)
+	public MinHashBitSequenceSubSketches(String seq, int kmerSize, int stepSize, int numWords) throws ZeroNGramsFoundException
 	{
 		this.alignmentSketch = new AlignElementDoubleSketch<>(computeSequencesDouble(seq, kmerSize, stepSize, numWords), stepSize, seq.length());
 	}
diff --git a/src/main/java/edu/umd/marbl/mhap/impl/MinHashSearch.java b/src/main/java/edu/umd/marbl/mhap/impl/MinHashSearch.java
index 5929561..3b34589 100644
--- a/src/main/java/edu/umd/marbl/mhap/impl/MinHashSearch.java
+++ b/src/main/java/edu/umd/marbl/mhap/impl/MinHashSearch.java
@@ -158,12 +158,7 @@ public final class MinHashSearch extends AbstractMatchSearch
 			throw new MhapRuntimeException("Number of hashes does not match. Stored size " + this.hashes.size()
 					+ ", input size " + minHash.numHashes() + ".");
 		
-		//estimate size
-		long numLookups = this.getNumberSequencesSearched();
-		long numProcessed = this.numberElementsProcessed.get();
-		int mapSize = Math.max(256, (int)(4.0*(double)numProcessed/(double)numLookups));
-
-		Map<SequenceId, HitCounter> bestSequenceHit = new Object2ObjectOpenHashMap<>(mapSize);
+		Map<SequenceId, HitCounter> bestSequenceHit = new Object2ObjectOpenHashMap<>(256);
 		int[] minHashes = minHash.getMinHashArray();
 		
 		int hashIndex = 0;
diff --git a/src/main/java/edu/umd/marbl/mhap/impl/SequenceSketch.java b/src/main/java/edu/umd/marbl/mhap/impl/SequenceSketch.java
index d304cdf..eec82b7 100644
--- a/src/main/java/edu/umd/marbl/mhap/impl/SequenceSketch.java
+++ b/src/main/java/edu/umd/marbl/mhap/impl/SequenceSketch.java
@@ -38,7 +38,8 @@ import java.io.Serializable;
 
 import edu.umd.marbl.mhap.sketch.FrequencyCounts;
 import edu.umd.marbl.mhap.sketch.MinHashSketch;
-import edu.umd.marbl.mhap.sketch.OrderedNGramHashes;
+import edu.umd.marbl.mhap.sketch.BottomOverlapSketch;
+import edu.umd.marbl.mhap.sketch.ZeroNGramsFoundException;
 
 public final class SequenceSketch implements Serializable
 {
@@ -49,7 +50,7 @@ public final class SequenceSketch implements Serializable
 
 	private final SequenceId id;
 	private final MinHashSketch mainHashes;
-	private final OrderedNGramHashes orderedHashes;
+	private final BottomOverlapSketch orderedHashes;
 	//private final MinHashBitSequenceSubSketches alignmentSketches;
 	private final int sequenceLength;
 
@@ -78,8 +79,8 @@ public final class SequenceSketch implements Serializable
 			if (mainHashes == null)
 				throw new MhapRuntimeException("Unexpected data read error.");
 
-			OrderedNGramHashes orderedHashes = null;
-			orderedHashes = OrderedNGramHashes.fromByteStream(input);
+			BottomOverlapSketch orderedHashes = null;
+			orderedHashes = BottomOverlapSketch.fromByteStream(input);
 			if (orderedHashes == null)
 				throw new MhapRuntimeException("Unexpected data read error when reading ordered k-mers.");
 
@@ -92,7 +93,7 @@ public final class SequenceSketch implements Serializable
 		}
 	}
 
-	public SequenceSketch(SequenceId id, int sequenceLength, MinHashSketch mainHashes, OrderedNGramHashes orderedHashes)
+	public SequenceSketch(SequenceId id, int sequenceLength, MinHashSketch mainHashes, BottomOverlapSketch orderedHashes)
 	{
 		this.sequenceLength = sequenceLength;
 		this.id = id;
@@ -100,13 +101,13 @@ public final class SequenceSketch implements Serializable
 		this.orderedHashes = orderedHashes;
 	}
 
-	public SequenceSketch(Sequence seq, int kmerSize, int numHashes, int orderedKmerSize, int orderedSketchSize, FrequencyCounts kmerFilter, double repeatWeight)
+	public SequenceSketch(Sequence seq, int kmerSize, int numHashes, int orderedKmerSize, int orderedSketchSize, FrequencyCounts kmerFilter, double repeatWeight) throws ZeroNGramsFoundException
 	{
 		this.sequenceLength = seq.length();
 		this.id = seq.getId();
 		this.mainHashes = new MinHashSketch(seq.getSquenceString(), kmerSize, numHashes, kmerFilter, repeatWeight);
 		
-		this.orderedHashes = new OrderedNGramHashes(seq.getSquenceString(), orderedKmerSize, orderedSketchSize);
+		this.orderedHashes = new BottomOverlapSketch(seq.getSquenceString(), orderedKmerSize, orderedSketchSize);
 	}
 
 	public SequenceSketch createOffset(int offset)
@@ -145,7 +146,7 @@ public final class SequenceSketch implements Serializable
 		return this.mainHashes;
 	}
 
-	public OrderedNGramHashes getOrderedHashes()
+	public BottomOverlapSketch getOrderedHashes()
 	{
 		return this.orderedHashes;
 	}
diff --git a/src/main/java/edu/umd/marbl/mhap/impl/SequenceSketchStreamer.java b/src/main/java/edu/umd/marbl/mhap/impl/SequenceSketchStreamer.java
index 61fa704..665db82 100644
--- a/src/main/java/edu/umd/marbl/mhap/impl/SequenceSketchStreamer.java
+++ b/src/main/java/edu/umd/marbl/mhap/impl/SequenceSketchStreamer.java
@@ -48,6 +48,7 @@ import java.util.concurrent.TimeUnit;
 import java.util.concurrent.atomic.AtomicLong;
 
 import edu.umd.marbl.mhap.sketch.FrequencyCounts;
+import edu.umd.marbl.mhap.sketch.ZeroNGramsFoundException;
 import edu.umd.marbl.mhap.utils.ReadBuffer;
 import edu.umd.marbl.mhap.utils.Utils;
 
@@ -57,16 +58,16 @@ public class SequenceSketchStreamer
 	private final FastaData fastaData;
 	private final FrequencyCounts kmerFilter;
 	private final int kmerSize;
+	private final int minOlapLength;
 	private final AtomicLong numberProcessed;
 	private final int numHashes;
 	private final int offset;
-	private final double repeatWeight;
-	private final int minOlapLength;
-
 	private final int orderedKmerSize;
+
 	private final int orderedSketchSize;
 	private boolean readClosed;
 	private final boolean readingFasta;
+	private final double repeatWeight;
 	private final ConcurrentLinkedQueue<SequenceSketch> sequenceHashList;
 
 	public SequenceSketchStreamer(String file, int minOlapLength, int offset) throws FileNotFoundException
@@ -111,12 +112,12 @@ public class SequenceSketchStreamer
 
 	public SequenceSketch dequeue(boolean fwdOnly, ReadBuffer buf) throws IOException
 	{
-		enqueue(fwdOnly, buf);
+		enqueueUntilFound(fwdOnly, buf);
 
 		return this.sequenceHashList.poll();
 	}
-
-	private boolean enqueue(boolean fwdOnly, ReadBuffer buf) throws IOException
+	
+	private boolean enqueue(boolean fwdOnly, ReadBuffer buf) throws IOException, ZeroNGramsFoundException
 	{
 		SequenceSketch seqHashes;
 		if (this.readingFasta)
@@ -189,7 +190,7 @@ public class SequenceSketchStreamer
 
 					try
 					{
-						while (enqueue(fwdOnly, buf))
+						while (enqueueUntilFound(fwdOnly, buf))
 						{
 						}
 					}
@@ -217,6 +218,26 @@ public class SequenceSketchStreamer
 		}
 	}
 
+	private boolean enqueueUntilFound(boolean fwdOnly, ReadBuffer buf) throws IOException
+	{
+		boolean getNext = true;
+		boolean returnValue = false;
+		while(getNext)
+		{
+			try
+			{
+				returnValue = enqueue(fwdOnly, buf);
+				getNext = false;
+			}
+			catch (ZeroNGramsFoundException e)
+			{
+				System.err.println("Could not process sketch for a read because zero valid n-grams found: "+e.getSequenceString());
+			}
+		}
+		
+		return returnValue; 
+	}
+
 	public Iterator<SequenceSketch> getDataIterator()
 	{
 		return this.sequenceHashList.iterator();
@@ -230,15 +251,15 @@ public class SequenceSketchStreamer
 		return this.fastaData.getNumberProcessed();
 	}
 
-	public SequenceSketch getSketch(Sequence seq)
+	public int getNumberProcessed()
 	{
-		// compute the hashes
-		return new SequenceSketch(seq, this.kmerSize, this.numHashes, this.orderedKmerSize, this.orderedSketchSize, this.kmerFilter, this.repeatWeight);
+		return this.numberProcessed.intValue();
 	}
 
-	public int getNumberProcessed()
+	public SequenceSketch getSketch(Sequence seq) throws ZeroNGramsFoundException
 	{
-		return this.numberProcessed.intValue();
+		// compute the hashes
+		return new SequenceSketch(seq, this.kmerSize, this.numHashes, this.orderedKmerSize, this.orderedSketchSize, this.kmerFilter, this.repeatWeight);
 	}
 
 	protected void processAddition(SequenceSketch seqHashes)
diff --git a/src/main/java/edu/umd/marbl/mhap/main/AlignmentTry.java b/src/main/java/edu/umd/marbl/mhap/main/AlignmentTry.java
index f5009e5..905edff 100644
--- a/src/main/java/edu/umd/marbl/mhap/main/AlignmentTry.java
+++ b/src/main/java/edu/umd/marbl/mhap/main/AlignmentTry.java
@@ -35,12 +35,13 @@ import edu.umd.marbl.mhap.align.Alignment;
 import edu.umd.marbl.mhap.impl.MinHashBitSequenceSubSketches;
 import edu.umd.marbl.mhap.impl.OverlapInfo;
 import edu.umd.marbl.mhap.sketch.MinHashBitSketch;
+import edu.umd.marbl.mhap.sketch.ZeroNGramsFoundException;
 import edu.umd.marbl.mhap.utils.RandomSequenceGenerator;
 
 public class AlignmentTry
 {
 
-	public static void main(String[] args)
+	public static void main(String[] args) throws ZeroNGramsFoundException
 	{
 		String a = "bcdefghij1234567890";
 		String b = "abcdefghij1234567890";
diff --git a/src/main/java/edu/umd/marbl/mhap/main/EstimateROC.java b/src/main/java/edu/umd/marbl/mhap/main/EstimateROC.java
index 9ad26b7..012276d 100755
--- a/src/main/java/edu/umd/marbl/mhap/main/EstimateROC.java
+++ b/src/main/java/edu/umd/marbl/mhap/main/EstimateROC.java
@@ -713,7 +713,7 @@ public class EstimateROC {
 				break;
 			case 'M':
 				for (int i = 0; i < cLen; i++) {
-					if (ref.charAt(refPos) != qry.charAt(qryPos)) {
+					if (ref.toUpperCase().charAt(refPos) != qry.toUpperCase().charAt(qryPos)) {
 						errors++;
 					} else {
 						// do nothing
diff --git a/src/main/java/edu/umd/marbl/mhap/main/KmerStatSimulator.java b/src/main/java/edu/umd/marbl/mhap/main/KmerStatSimulator.java
index 6d11188..da37e6d 100644
--- a/src/main/java/edu/umd/marbl/mhap/main/KmerStatSimulator.java
+++ b/src/main/java/edu/umd/marbl/mhap/main/KmerStatSimulator.java
@@ -41,9 +41,10 @@ import java.io.BufferedReader;
 import java.io.PrintStream;
 
 import edu.umd.marbl.mhap.impl.FastaData;
-import edu.umd.marbl.mhap.sketch.BottomHash;
+import edu.umd.marbl.mhap.sketch.BottomSketch;
 import edu.umd.marbl.mhap.sketch.MinHashSketch;
-import edu.umd.marbl.mhap.sketch.OrderedNGramHashes;
+import edu.umd.marbl.mhap.sketch.BottomOverlapSketch;
+import edu.umd.marbl.mhap.sketch.ZeroNGramsFoundException;
 import edu.umd.marbl.mhap.utils.Utils;
 
 public class KmerStatSimulator {
@@ -186,13 +187,13 @@ public class KmerStatSimulator {
 	}
 	
 	public double compareMinHash(String first, String second) {
-		BottomHash h1 = new BottomHash(first, this.kmer, 1256);
-		BottomHash h2 = new BottomHash(second, this.kmer, 1256);
+		BottomSketch h1 = new BottomSketch(first, this.kmer, 1256);
+		BottomSketch h2 = new BottomSketch(second, this.kmer, 1256);
 		
 		return h1.jaccard(h2);
 	}
 	
-	public double compareMinHash2(String first, String second) {
+	public double compareMinHash2(String first, String second) throws ZeroNGramsFoundException {
 		MinHashSketch h1 = new MinHashSketch(first, this.kmer, 1256, null, 1.0);
 		MinHashSketch h2 = new MinHashSketch(second, this.kmer, 1256, null, 1.0);
 		
@@ -459,7 +460,7 @@ public class KmerStatSimulator {
 			System.out.println(this.sharedMerCounts.get(i) + "\t"
 					+ this.sharedJaccard.get(i) + "\t"
 					+ this.sharedMinHash.get(i) + "\t"
-					+ OrderedNGramHashes.jaccardToIdentity(this.sharedMinHash.get(i), this.kmer) + "\t"
+					+ BottomOverlapSketch.jaccardToIdentity(this.sharedMinHash.get(i), this.kmer) + "\t"
 					+ this.randomMerCounts.get(i) + "\t"
 					+ this.randomJaccard.get(i) + "\t"
 					+ this.randomMinHash.get(i));
diff --git a/src/main/java/edu/umd/marbl/mhap/main/MhapMain.java b/src/main/java/edu/umd/marbl/mhap/main/MhapMain.java
index 43e2d08..1391b54 100644
--- a/src/main/java/edu/umd/marbl/mhap/main/MhapMain.java
+++ b/src/main/java/edu/umd/marbl/mhap/main/MhapMain.java
@@ -63,10 +63,9 @@ public final class MhapMain
 	private final String toFile;
 	private final double repeatWeight;
 
-	//private static final double DEFAULT_OVERLAP_ACCEPT_SCORE = 0.024;
 	private static final double DEFAULT_OVERLAP_ACCEPT_SCORE = 0.78;
 
-	private static final double DEFAULT_REPEAT_WEIGHT= 0.0;
+	private static final double DEFAULT_REPEAT_WEIGHT= 0.9;
 
 	private static final double DEFAULT_FILTER_CUTOFF = 1.0e-5;
 
@@ -117,7 +116,7 @@ public final class MhapMain
 		options.addOption("--no-self", "Do not compute the overlaps between sequences inside a box. Should be used when the to and from sequences are coming from different files.", false);
 		options.addOption("--store-full-id", "Store full IDs as seen in FASTA file, rather than storing just the sequence position in the file. Some FASTA files have long IDS, slowing output of results. This options is ignored when using compressed file format.", false);
 		options.addOption("--supress-noise", "[int] 0) Does nothing, 1) completely removes any k-mers not specified in the filter file, 2) supresses k-mers not specified in the filter file, similar to repeats. ", 0);
-		options.addOption("--no-tf", "Do not perform the tf weighing, of the tf-idf weighing.", false);
+		options.addOption("--no-tf", "Do not perform the tf weighing, in the tf-idf weighing.", false);
 		options.addOption("--settings", "Set all unset parameters for the default settings. Same defaults are applied to Nanopore and Pacbio reads. 0) None, 1) Default, 2) Fast, 3) Sensitive.", 0);
 		
 		if (!options.process(args))
diff --git a/src/main/java/edu/umd/marbl/mhap/sketch/OrderedNGramHashes.java b/src/main/java/edu/umd/marbl/mhap/sketch/BottomOverlapSketch.java
similarity index 95%
rename from src/main/java/edu/umd/marbl/mhap/sketch/OrderedNGramHashes.java
rename to src/main/java/edu/umd/marbl/mhap/sketch/BottomOverlapSketch.java
index de35494..4b3380e 100644
--- a/src/main/java/edu/umd/marbl/mhap/sketch/OrderedNGramHashes.java
+++ b/src/main/java/edu/umd/marbl/mhap/sketch/BottomOverlapSketch.java
@@ -41,7 +41,7 @@ import java.util.Arrays;
 import edu.umd.marbl.mhap.impl.OverlapInfo;
 import edu.umd.marbl.mhap.utils.Utils;
 
-public final class OrderedNGramHashes
+public final class BottomOverlapSketch
 {
 	private final static class EdgeData
 	{
@@ -74,7 +74,7 @@ public final class OrderedNGramHashes
 		private final int seqLength1;
 		private final int seqLength2;
 
-		public MatchData(OrderedNGramHashes o1, OrderedNGramHashes o2, double maxShiftPercent)
+		public MatchData(BottomOverlapSketch o1, BottomOverlapSketch o2, double maxShiftPercent)
 		{
 			this.seqLength1 = o1.getSequenceLength();
 			this.seqLength2 = o2.getSequenceLength();
@@ -342,7 +342,7 @@ public final class OrderedNGramHashes
 		return score;
 	}
 
-	public final static OrderedNGramHashes fromByteStream(DataInputStream input) throws IOException
+	public final static BottomOverlapSketch fromByteStream(DataInputStream input) throws IOException
 	{
 		try
 		{
@@ -358,7 +358,7 @@ public final class OrderedNGramHashes
 				orderedHashes[iter][1] = input.readInt();
 			}
 
-			return new OrderedNGramHashes(seqLength, kmerSize, orderedHashes);
+			return new BottomOverlapSketch(seqLength, kmerSize, orderedHashes);
 
 		}
 		catch (EOFException e)
@@ -442,20 +442,20 @@ public final class OrderedNGramHashes
 		}
 	}
 
-	private OrderedNGramHashes(int seqLength, int kmerSize, int[][] orderedHashes)
+	private BottomOverlapSketch(int seqLength, int kmerSize, int[][] orderedHashes)
 	{
 		this.seqLength = seqLength;
 		this.orderedHashes = orderedHashes;
 		this.kmerSize = kmerSize;
 	}
 
-	public OrderedNGramHashes(String seq, int kmerSize, int sketchSize)
+	public BottomOverlapSketch(String seq, int kmerSize, int sketchSize) throws ZeroNGramsFoundException
 	{
 		this.kmerSize = kmerSize;
 		this.seqLength = seq.length() - kmerSize + 1;
 		
 		if (this.seqLength<=0)
-			throw new SketchRuntimeException("Sequence length must be greater or equal to n-gram size.");
+			throw new ZeroNGramsFoundException("Sequence length must be greater or equal to n-gram size "+kmerSize+".", seq);
 		
 		// compute just direct hash of sequence
 		int[] hashes = HashUtils.computeSequenceHashes(seq, kmerSize);
@@ -516,7 +516,7 @@ public final class OrderedNGramHashes
 		return this.orderedHashes[index][0];
 	}
 	
-	public OverlapInfo getOverlapInfo(OrderedNGramHashes toSequence, double maxShiftPercent)
+	public OverlapInfo getOverlapInfo(BottomOverlapSketch toSequence, double maxShiftPercent)
 	{
 		if (this.kmerSize!=toSequence.kmerSize)
 			throw new SketchRuntimeException("Sketch k-mer size does not match between the two sequences.");
diff --git a/src/main/java/edu/umd/marbl/mhap/sketch/BottomHash.java b/src/main/java/edu/umd/marbl/mhap/sketch/BottomSketch.java
similarity index 85%
rename from src/main/java/edu/umd/marbl/mhap/sketch/BottomHash.java
rename to src/main/java/edu/umd/marbl/mhap/sketch/BottomSketch.java
index a25434a..eaf7a17 100644
--- a/src/main/java/edu/umd/marbl/mhap/sketch/BottomHash.java
+++ b/src/main/java/edu/umd/marbl/mhap/sketch/BottomSketch.java
@@ -2,7 +2,7 @@ package edu.umd.marbl.mhap.sketch;
 
 import it.unimi.dsi.fastutil.ints.IntArrays;
 
-public class BottomHash implements Sketch<BottomHash>
+public class BottomSketch implements Sketch<BottomSketch>
 {
 	private final int[] hashPositions;
 	
@@ -11,7 +11,7 @@ public class BottomHash implements Sketch<BottomHash>
 	 */
 	private static final long serialVersionUID = 9035607728472270206L;
 
-	public BottomHash(String str, int nGramSize, int k)
+	public BottomSketch(String str, int nGramSize, int k)
 	{
 		int[] hashes = HashUtils.computeSequenceHashes(str, nGramSize);
 		
@@ -34,7 +34,7 @@ public class BottomHash implements Sketch<BottomHash>
 
 	}
 	
-	public double jaccard(BottomHash sh)
+	public double jaccard(BottomSketch sh)
 	{
 		//make sure you look at same number
 		int k = Math.min(this.hashPositions.length, sh.hashPositions.length);
@@ -64,7 +64,7 @@ public class BottomHash implements Sketch<BottomHash>
 	}
 
 	@Override
-	public double similarity(BottomHash sh)
+	public double similarity(BottomSketch sh)
 	{
 		return jaccard(sh);
 	}
diff --git a/src/main/java/edu/umd/marbl/mhap/sketch/FrequencyCounts.java b/src/main/java/edu/umd/marbl/mhap/sketch/FrequencyCounts.java
index 4ba06f4..115b9b3 100644
--- a/src/main/java/edu/umd/marbl/mhap/sketch/FrequencyCounts.java
+++ b/src/main/java/edu/umd/marbl/mhap/sketch/FrequencyCounts.java
@@ -63,6 +63,13 @@ public final class FrequencyCounts
 	
 	public FrequencyCounts(BufferedReader bf, double filterCutoff, double offset, int removeUnique, boolean noTf, int numThreads) throws IOException
 	{
+		//removeUnique = 0: do nothing extra to k-mers not specified in the file
+		//removeUnique = 1: remove k-mers not specified in the file from the sketch
+		//removeUnique = 2: supress k-mers not specified in the file the same as max supression
+		
+		if (removeUnique<0 || removeUnique>2)
+			throw new MhapRuntimeException("Unknown removeUnique option "+removeUnique+".");
+		
 		if (offset<0.0 || offset>=1.0)
 			throw new MhapRuntimeException("Offset can only be between 0 and 1.0.");
 
@@ -72,11 +79,12 @@ public final class FrequencyCounts
 		
 		// generate hashset
 		Long2DoubleOpenHashMap validMap = new Long2DoubleOpenHashMap();
-		BloomFilter<Long> validMers = null;
+		BloomFilter<Long> validMers;
 
 		//the max value observed in the list
 		AtomicReference<Double> maxValue = new AtomicReference<Double>(Double.NEGATIVE_INFINITY);
 
+		//read in the first line to generate the bloom filter
 		String line = bf.readLine();
 		try
 		{
@@ -100,8 +108,11 @@ public final class FrequencyCounts
 				}
 			}
 			
-			if (removeUnique>1)
+			//if doing nothing, no need to store the whole list
+			if (removeUnique>0)
 				validMers = BloomFilter.create((value, sink) -> sink.putLong(value), size, 1.0e-5);
+			else
+				validMers = null;
 		}
 		catch (Exception e)
 		{
@@ -111,8 +122,6 @@ public final class FrequencyCounts
 		final ThreadPoolExecutor executor = new ThreadPoolExecutor(numThreads, numThreads, 100L, TimeUnit.MILLISECONDS,
 				new LinkedBlockingQueue<Runnable>(10000), new ThreadPoolExecutor.CallerRunsPolicy());
 		
-		BloomFilter<Long> currValidMers = validMers;
-
 		line = bf.readLine();			
 		while (line != null)
 		{
@@ -140,7 +149,7 @@ public final class FrequencyCounts
 						double percent = Double.parseDouble(str[1]);
 						
 						// if greater, add to hashset
-						if (percent > filterCutoff)
+						if (percent >= filterCutoff)
 						{
 							maxValue.getAndUpdate(v -> Math.max(v, percent));
 							
@@ -154,9 +163,9 @@ public final class FrequencyCounts
 		
 					//store in the bloom filter
 					if (removeUnique>0)
-						synchronized (currValidMers)
+						synchronized (validMers)
 						{
-							currValidMers.put(hash[0]);							
+							validMers.put(hash[0]);							
 						}
 				}
 				catch (Exception e)
@@ -216,7 +225,7 @@ public final class FrequencyCounts
 	
 	public double idf(double freq)
 	{
-		return Math.log(this.maxValue/freq-offset);
+		return Math.log(this.maxValue/freq-this.offset);
 		//return Math.log1p(this.maxValue/freq);
 	}
 	
@@ -238,13 +247,10 @@ public final class FrequencyCounts
 
 	public boolean keepKmer(long hash)
 	{
-		if (this.removeUnique==0 || this.removeUnique==2)
-			return true;
-		
-		if (this.validMers==null)
-			return false;
+		if (this.removeUnique==1)
+			return this.validMers.mightContain(hash);
 			
-		return this.validMers.mightContain(hash);
+		return true;
 	}
 	
 	public double maxIdf()
diff --git a/src/main/java/edu/umd/marbl/mhap/sketch/MinHashBitSketch.java b/src/main/java/edu/umd/marbl/mhap/sketch/MinHashBitSketch.java
index de0f91a..d16a840 100644
--- a/src/main/java/edu/umd/marbl/mhap/sketch/MinHashBitSketch.java
+++ b/src/main/java/edu/umd/marbl/mhap/sketch/MinHashBitSketch.java
@@ -75,7 +75,7 @@ public final class MinHashBitSketch extends AbstractBitSketch<MinHashBitSketch>
 		super(getAsBits(minHashes));
 	}
 	
-	public MinHashBitSketch(String seq, int nGramSize, int numWords)
+	public MinHashBitSketch(String seq, int nGramSize, int numWords) throws ZeroNGramsFoundException
 	{
 		super(getAsBits(new MinHashSketch(seq, nGramSize, numWords*64).getMinHashArray()));
 	}
diff --git a/src/main/java/edu/umd/marbl/mhap/sketch/MinHashSketch.java b/src/main/java/edu/umd/marbl/mhap/sketch/MinHashSketch.java
index 12d3d05..a75f657 100644
--- a/src/main/java/edu/umd/marbl/mhap/sketch/MinHashSketch.java
+++ b/src/main/java/edu/umd/marbl/mhap/sketch/MinHashSketch.java
@@ -49,19 +49,21 @@ public final class MinHashSketch implements Sketch<MinHashSketch>
 	private static final long serialVersionUID = 8846482698636860862L;
 	
 	private final static int[] computeNgramMinHashesWeighted(String seq, final int nGramSize, final int numHashes,
-			FrequencyCounts kmerFilter, double repeatWeight)
+			FrequencyCounts kmerFilter, double repeatWeight) throws ZeroNGramsFoundException
 	{
 		final int numberNGrams = seq.length() - nGramSize + 1;
 	
 		if (numberNGrams < 1)
-			throw new SketchRuntimeException("N-gram size bigger than string length.");
+			throw new ZeroNGramsFoundException("N-gram size bigger than string length.", seq);
 	
+		//if (repeatWeight>=1.0)
+		//	throw new SketchRuntimeException("repeatWeight cannot be >=1.");
+
 		// get the kmer hashes
 		final long[] kmerHashes = HashUtils.computeSequenceHashesLong(seq, nGramSize, 0);
 		
 		//now compute the counts of occurance
 		Long2ObjectLinkedOpenHashMap<HitCounter> hitMap = new Long2ObjectLinkedOpenHashMap<HitCounter>(kmerHashes.length);
-		int maxCount = 0;
 		for (long kmer : kmerHashes)
 		{
 			//do not add unique kmers to the sketch
@@ -76,14 +78,11 @@ public final class MinHashSketch implements Sketch<MinHashSketch>
 			}
 			else
 				counter.addHit();
-
-			if (maxCount<counter.count)
-				maxCount = counter.count;
 		}
 		
-		//make sure don't create a non-zero value
+		//make sure don't create a zero value
 		if (hitMap.isEmpty())
-			hitMap.put(kmerHashes[0], new HitCounter(1));
+			throw new ZeroNGramsFoundException("Found zero unfiltered n-grams in the string.", seq);
 	
 		//allocate the space
 		int[] hashes = new int[Math.max(1,numHashes)];		
@@ -91,11 +90,14 @@ public final class MinHashSketch implements Sketch<MinHashSketch>
 		Arrays.fill(best, Long.MAX_VALUE);
 
 		//go through all the k-mers and find the min values
+		int numberValid = 0;
+		
 		for (Entry<Long, HitCounter> kmer : hitMap.entrySet())
 		{
 			long key = kmer.getKey();
 			int weight = kmer.getValue().count;
 			
+			//original version of MHAP
 			if (repeatWeight<0.0)
 			{
 				weight = 1;
@@ -109,27 +111,25 @@ public final class MinHashSketch implements Sketch<MinHashSketch>
 				if (repeatWeight>=0.0 && repeatWeight<1.0)
 				{
 					//compute the td part
-					double td = (double)kmerFilter.tfWeight(weight);
+					double tf = (double)kmerFilter.tfWeight(weight);
 					
 					//compute the idf part, 1-3
 					double idf = kmerFilter.scaledIdf(key);
 					
 					//compute td-idf
-					weight = (int)Math.round(td*idf);
+					weight = (int)Math.round(tf*idf);
 					if (weight<1)
 						weight = 1;
 				}
-				else
-				if (repeatWeight>=1.0)
-				{
-					if (kmerFilter.isPopular(key))
-						weight = 0;
-				}
 			}
+			//keep the tf weight otherwise			
 						
 			if (weight<=0)
 				continue;
-		
+			
+			//increment valid counter
+			numberValid++;
+			
 			//set the initial shift value
 			long x = key;
 			for (int word = 0; word < numHashes; word++)
@@ -153,6 +153,9 @@ public final class MinHashSketch implements Sketch<MinHashSketch>
 			}
 		}
 		
+		if (numberValid<=0)
+			throw new ZeroNGramsFoundException("Found zero unfiltered n-grams in the string.", seq);
+
 		//now combine into super shingles
 		/*
 		HashFunction hf = Hashing.murmur3_32(0);
@@ -202,12 +205,12 @@ public final class MinHashSketch implements Sketch<MinHashSketch>
 		this.minHashes = minHashes;
 	}
 	
-	public MinHashSketch(String str, int nGramSize, int numHashes)
+	public MinHashSketch(String str, int nGramSize, int numHashes) throws ZeroNGramsFoundException
 	{
 		this.minHashes = MinHashSketch.computeNgramMinHashesWeighted(str, nGramSize, numHashes, null, -1.0);
 	}
 	
-	public MinHashSketch(String seq, int nGramSize, int numHashes, FrequencyCounts freqFilter, double repeatWeight)
+	public MinHashSketch(String seq, int nGramSize, int numHashes, FrequencyCounts freqFilter, double repeatWeight) throws ZeroNGramsFoundException
 	{
 		this.minHashes = MinHashSketch.computeNgramMinHashesWeighted(seq, nGramSize, numHashes, freqFilter, repeatWeight);
 	}
diff --git a/src/main/java/edu/umd/marbl/mhap/sketch/ZeroNGramsFoundException.java b/src/main/java/edu/umd/marbl/mhap/sketch/ZeroNGramsFoundException.java
new file mode 100644
index 0000000..cb3cac3
--- /dev/null
+++ b/src/main/java/edu/umd/marbl/mhap/sketch/ZeroNGramsFoundException.java
@@ -0,0 +1,43 @@
+package edu.umd.marbl.mhap.sketch;
+
+public class ZeroNGramsFoundException extends Exception
+{
+
+	private final String seqString;
+	
+	/**
+	 * 
+	 */
+	private static final long serialVersionUID = -3655558540692106680L;
+
+	public ZeroNGramsFoundException(String message, String seqString)
+	{
+		super(message);
+		this.seqString = seqString;
+	}
+
+	public ZeroNGramsFoundException(String message, Throwable cause, boolean enableSuppression,
+			boolean writableStackTrace, String seqString)
+	{
+		super(message, cause, enableSuppression, writableStackTrace);
+		this.seqString = seqString;
+	}
+
+	public ZeroNGramsFoundException(String message, Throwable cause, String seqString)
+	{
+		super(message, cause);
+		this.seqString = seqString;
+	}
+
+	public ZeroNGramsFoundException(Throwable cause, String seqString)
+	{
+		super(cause);
+		this.seqString = seqString;
+	}
+	
+	public String getSequenceString()
+	{
+		return this.seqString;
+	}
+
+}
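
Note on the sketch constructors above: they now declare the checked
ZeroNGramsFoundException, so callers have to catch or propagate it. Below is
a minimal sketch of one way a caller might skip a read that yields no
n-grams instead of aborting the run. It is not MHAP code. The names
SketchDemo, countNGrams and sketchAll are hypothetical, and the nested
exception class is a stand-in that only mirrors the shape shown in the diff
(a message plus the offending sequence, read back via getSequenceString()).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SketchDemo
{
	// Stand-in mirroring the shape of the exception added in this commit (assumption: not the real class).
	static class ZeroNGramsFoundException extends Exception
	{
		private static final long serialVersionUID = 1L;
		private final String seqString;

		ZeroNGramsFoundException(String message, String seqString)
		{
			super(message);
			this.seqString = seqString;
		}

		String getSequenceString()
		{
			return this.seqString;
		}
	}

	// Hypothetical helper: returns the number of n-grams in seq, or throws when there are none.
	static int countNGrams(String seq, int nGramSize) throws ZeroNGramsFoundException
	{
		int numberNGrams = seq.length() - nGramSize + 1;
		if (numberNGrams < 1)
			throw new ZeroNGramsFoundException("N-gram size bigger than string length.", seq);
		return numberNGrams;
	}

	// Processes every sequence; too-short ones are recorded in skipped rather than stopping the loop.
	static int sketchAll(List<String> seqs, int nGramSize, List<String> skipped)
	{
		int total = 0;
		for (String seq : seqs)
		{
			try
			{
				total += countNGrams(seq, nGramSize);
			}
			catch (ZeroNGramsFoundException e)
			{
				// the exception carries the offending sequence, so it can be reported or logged
				skipped.add(e.getSequenceString());
			}
		}
		return total;
	}

	public static void main(String[] args)
	{
		List<String> skipped = new ArrayList<>();
		int total = sketchAll(Arrays.asList("ACGTACGT", "AC", "ACGTA"), 4, skipped);
		System.out.println(total + " n-grams, skipped " + skipped);
	}
}
```

Catching per sequence keeps one unusable read from ending the whole batch.
Whether to skip, log, or rethrow is a design choice left to the caller.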

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/mhap.git


