[med-svn] [Git][med-team/proteinortho][upstream] New upstream version 5.16+dfsg

Andreas Tille gitlab at salsa.debian.org
Wed Feb 21 07:45:29 UTC 2018


Andreas Tille pushed to branch upstream at Debian Med / proteinortho


Commits:
6ecf7dce by Andreas Tille at 2018-02-21T08:33:57+01:00
New upstream version 5.16+dfsg
- - - - -


12 changed files:

- Makefile
- manual.html
- proteinortho5.pl
- proteinortho5_clean_edges2.pl
- proteinortho5_clustering.cpp
- test/C.faa
- test/E.faa
- test/L.faa
- + tools/check_graph_duplicate_ids.pl
- + tools/extract_from_graph.pl
- + tools/graph_conv.pl
- + tools/remove_graph_duplicates.pl


Changes:

=====================================
Makefile
=====================================
--- a/Makefile
+++ b/Makefile
@@ -2,7 +2,7 @@
 INSTALLDIR=/usr/local/bin
 
 CPP = g++
-CPPFLAGS += -Wall -O3 -Wno-unused-result
+CPPFLAGS += -Wall -O2 -std=gnu++11 -Wno-unused-result
 
 all: proteinortho5_clustering proteinortho5_tree
 


=====================================
manual.html
=====================================
--- a/manual.html
+++ b/manual.html
@@ -4,7 +4,7 @@
 	</head>
 <body>
 <h1>Proteinortho Manual / PoFF Manual</h1>
-<small>This manual corresponds to version 5.15</small>
+<small>This manual corresponds to version 5.16</small>
 <h2>Introduction</h2>
 Proteinortho is a tool to detect orthologous genes within different species.
 For doing so, it compares similarities of given gene sequences and clusters them to find significant groups.
@@ -154,17 +154,27 @@ They are:
 <h2>Hints</h2>
 <ul>
 <li>Using <i>.faa</i> to indicate that your file contains amino acids and <i>.fna</i> to show it contains nucleotides makes life much easier.
-<li>Each sequence ID in your FASTA files must be unique. Consider renaming the otherwise.
+<li>Sequence IDs must be unique within a single FASTA file. Consider renaming otherwise. <br>Note: Till version 5.15 sequences IDs had to be unique among the whole dataset. Proteinortho now keeps track of name and species to avoid the necessissity of renaming.
 <li>You need write permissions in the directory of your FASTA files as Proteinortho will create blast databases. If this is not the case, consider using symbolic links to the FASTA files.
-<li>The directory <i>tools</i> cotains useful tools, e.g. <code>grab_proteins.pl</code> which fetches protein sequences of orthologous groups from Proteinortho output table
+<li>The directory <i>tools</i> contains useful tools, e.g. <code>grab_proteins.pl</code> which fetches protein sequences of orthologous groups from Proteinortho output table
 </ul>
 
 <h2>Usage</h2>
 <code>proteinortho5.pl [OPTIONS] FASTA1 FASTA2 [FASTAn...]</code> <p>
 
-<table border="1" cellspacing="0" cellpadding="2">
+<table border="0" cellspacing="2" cellpadding="2">
 <tr><td><b>Option</b></td><td><b>Default value</b></td><td><b>Description</b></td></tr>
+<tr><td colspan=3><hr><b>[General options]</b><hr></td></tr>
 
+<tr><td>-project=</td><td>myproject</td><td> prefix for all result file names</td></tr>
+<tr><td>-cpus=</td><td>auto</td><td>use the given number of threads</td></tr>
+<tr><td>-verbose</td><td> </td><td>give a lot of information about the current progress</td></tr>
+<tr><td>-keep</td><td>-</td><td>store temporary blast results for reuse (advisable for larger jobs)</td></tr>
+<tr><td>-temp=</td><td>working directory</td><td> path for temporary files</td></tr>
+<tr><td>-force</td><td> </td><td>force recalculation of blast results in any case</td></tr>
+<tr><td>-clean</td><td> </td><td>remove all unnecessary files after processing</td></tr>
+
+<tr><td colspan=3><hr><b>[Search options]</b><hr></td></tr>
 <tr><td>-e=</td><td>1e-05</td><td>E-value for blast</td></tr>
 <tr><td>-p=</td><td>blastp+</td><td>blast program to use:
 <ul>
@@ -179,29 +189,34 @@ in case you only have access to blast legacy:
 <li>blastp → sequences are given as amino acids
 <li>tblastx → sequences are given as nucleotides and will be interpreted as (translated) amino acids in all three reading frames
 </ul>
-
 </td></tr>
-
-<tr><td>-project=</td><td>myproject</td><td> prefix for all result file names</td></tr>
-<tr><td>-temp=</td><td>working directory</td><td> path for temporary files</td></tr>
-<tr><td>-synteny</td><td>-</td><td>activate PoFF extension to separate similar sequences using synteny (requires a GFF file for each FASTA file)</td></tr>
-<tr><td>-nograph</td><td>-</td><td>do not generate .graph files with pairwise orthology data<br>saves some time</td></tr>
-<tr><td>-keep</td><td>-</td><td>store temporary blast results for reuse</td></tr>
-<tr><td>-clean</td><td>-</td><td>remove all unnecessary files after processing</td></tr>
-<tr><td>-desc</td><td>-</td><td>write gene description file (XXX.descriptions); works only with NCBI-formated FASTA entries currently</td></tr>
-<tr><td>-force</td><td>-</td><td>force recalculation of blast results in any case</td></tr>
-<tr><td>-cpus=</td><td>auto</td><td>use the given number of threads</td></tr>
-<tr><td>-selfblast</td><td>-</td><td>apply selfblast to directly paralogs; normally these are inferred indirectly from orthology data to other species (experimental!)</td></tr>
-<tr><td>-singles</td><td>-</td><td>Also report genes without orthologs in table output</td></tr>
+<tr><td>-selfblast</td><td> </td><td>apply selfblast to directly paralogs; normally these are inferred indirectly from orthology data to other species (experimental!)</td></tr>
+<tr><td>-sim=</td><td>0.95</td><td>min. similarity for additional hits</td></tr>
 <tr><td>-identity=</td><td>25</td><td>min. percent identity of best blast alignments</td></tr>
 <tr><td>-cov=</td><td>50</td><td>min. coverage of best blast alignments in percent</td></tr>
+<tr><td>-subpara=</td><td> </td><td>additional parameters for blast; set these in quotes (e.g. -subpara='-seg no') <br>This parameter was named -blastParameters in earlier versions</td></tr>
+
+
+<tr><td colspan=3><hr><b>[Synteny options]</b><hr></td></tr>
+<tr><td>-synteny</td><td>-</td><td>activate PoFF extension to separate similar sequences using synteny (requires a GFF file for each FASTA file)</td></tr>
+<tr><td>-dups=</td><td>0</td><td> applied in combination with -synteny; number of reiterations for adjacencies heuristic to determine duplicated regions; <br>
+if set to a higher number, co-orthologs will tend to get clustered together rather than getting separated
+</td></tr>
+<tr><td>-cs=</td><td>3</td><td> applied in combination with -synteny; size of a maximum common substring (MCS) for adjacency matches; <br> the longer this value becomes the longer syntenic regions need to be in order to be detected </td></tr>
+<tr><td>alpha=</td><td>0.5</td><td>weight of adjacencies vs. sequence similarity</td></tr>
+
+<tr><td colspan=3><hr><b>[Clustering options]</b><hr></td></tr>
+<tr><td>-singles</td><td> </td><td>also report genes without orthologs in table output</td></tr>
+<tr><td>-purity=</td><td>1</td><td>avoid spurious graph assignments [range: 0.01-1, default 0.75]</td></tr>
 <tr><td>-conn=</td><td>0.1</td><td>min. algebraic connectivity of orthologous groups during clustering</td></tr>
-<tr><td>-sim=</td><td>0.95</td><td>min. similarity for additional hits</td></tr>
-<tr><td>-blastpath=</td><td>-</td><td>path to your local blast installation (if not in present in default paths)</td></tr>
-<tr><td>-blastParameters=</td><td>-</td><td>additional parameters for blast; set these in quotes (e.g. -blastParameters='-seg no')</td></tr>
-<tr><td>-verbose</td><td>-</td><td>give a lot of information about the current progress</td></tr>
-<tr><td>-debug</td><td>-</td><td>gives detailed information for bug tracking</td></tr>
+<tr><td>-nograph</td><td>-</td><td>do not generate .graph files with pairwise orthology data<br>saves some time</td></tr>
 
+<tr><td colspan=3><hr><b>[Misc options]</b><hr></td></tr>
+<tr><td>-desc</td><td> </td><td>write gene description file (XXX.descriptions); works only with NCBI-formated FASTA entries currently</td></tr>
+<tr><td>-blastpath=</td><td> </td><td>path to your local blast installation (if not in present in default paths)</td></tr>
+<tr><td>-debug</td><td> </td><td>gives detailed information for bug tracking</td></tr>
+
+<tr><td colspan=3><hr><b>[Large compute jobs]</b><hr></td></tr>
 <tr><td colspan=3><b>Parameters needed to distribute the runs over several machines</b></td></tr>
 <tr><td>-step=</td><td>0</td><td>
 perform only specific steps of the analysis
@@ -211,31 +226,32 @@ perform only specific steps of the analysis
 <li>3 → perform clustering
 <li>0 → perform all steps
 </ul></td></tr>
-<tr><td>startat=</td><td>0</td><td>file number to start with</td></tr>
-<tr><td>stopat=</td><td>-1</td><td>file number to end with (very last one = -1)</td></tr>
-
-<tr><td colspan=3><b>Parameters applied in combination with -synteny<b></td></tr>
-<tr><td>-dups=</td><td>0</td><td> applied in combination with -synteny; number of reiterations for adjacencies heuristic to determine duplicated regions; <br>
-if set to a higher number, co-orthologs will tend to get clustered together rather than getting separated
-</td></tr>
-<tr><td>-cs=</td><td>3</td><td> applied in combination with -synteny; size of a maximum common substring (MCS) for adjacency matches; <br> the longer this value becomes the longer syntenic regions need to be in order to be detected </td></tr>
-<tr><td>alpha=</td><td>0.5</td><td>weight of adjacencies vs. sequence similarity</td></tr>
+<tr><td>jobs=N/M</td><td> </td><td>distribute blast step into M subsets and run job number N out of M in this very process, only works in combination with -step=2</td></tr>
 </table>
 
 <h3>Using several machines</h3>
-If you want to involve multiple machines or separate a Proteinortho run into smaller chunks, use the <i>-steps=</i> option.
+If you want to involve multiple machines or separate a Proteinortho run into smaller chunks, use the <i>-jobs=M/N</i> option.
 First, run <code>proteinortho5.pl -steps=1 ...</code> to generate the indices.
-Then you can run <code>proteinortho5.pl -steps=2 -startat=X -stopat=Y ...</code> to run small chunks separately.
-Instead of X and Y numbers must be set representing the file number to be calculated. X and Y can be equal.
-E.g. you have four fasta files, the you can to run <br>
+Then you can run <code>proteinortho5.pl -steps=2 -jobs=M/N ...</code> to run small chunks separately.
+Instead of M and N numbers must be set representing the number of jobs you want to divide the run into (M) and the job division to be performed by the process.
+E.g. to divide a Proteinortho run into 4 jobs to run on several machines, use <br>
+
+<code>
+proteinortho5.pl -steps=1 ... <br>
+</code>
+
+on a single PC, then <br>
+
 <code>
-proteinortho5.pl -steps=2 -startat=0 -stopat=0 ... <br>
-proteinortho5.pl -steps=2 -startat=1 -stopat=1 ... <br>
-proteinortho5.pl -steps=2 -startat=2 -stopat=2 ... <br>
-proteinortho5.pl -steps=2 -startat=3 -stopat=3 ... <br>
+proteinortho5.pl -steps=2 -jobs=1/4 ... <br>
+proteinortho5.pl -steps=2 -jobs=2/4 ... <br>
+proteinortho5.pl -steps=2 -jobs=3/4 ... <br>
+proteinortho5.pl -steps=2 -jobs=4/4 ... <br>
 </code>
-seperateltly on different machines (within the same shared working directory).
-After all step 2 runs are done, <code>proteinortho5.pl -steps=3 ...</code> can be executed to perform the clustering and merge all calculations.
+separately on different machines (can be run in parallel or iteratively within the same shared working directory).
+After all step 2 runs are done, run 
+<br><code>proteinortho5.pl -steps=3 ...</code>
+<br> to perform the clustering and merge all calculations on a single PC.
 </body>
 
 </html>


=====================================
proteinortho5.pl
=====================================
--- a/proteinortho5.pl
+++ b/proteinortho5.pl
@@ -30,7 +30,7 @@
 # @authors Marcus Lechner, Clemens Elias Thoelken
 # @email lechner at staff.uni-marburg.de
 # @company University of Maruburg
-# @date 2016-08-26
+# @date 2017-04-18
 #
 ##########################################################################################
 
@@ -47,7 +47,7 @@ use Thread::Queue;
 ##########################################################################################
 # Variables
 ##########################################################################################
-our $version = "5.15";
+our $version = "5.16";
 our $step = 0;		# 0/1/2/3	-> do all / only apply step 1 / only apply step 2 / only apply step 3
 our $verbose = 1;	# 0/1		-> don't / be verbose
 our $debug = 0;		# 0/1		-> don't / show debug data
@@ -60,14 +60,15 @@ our $alpha = 0.5;	# Alpha value for ffadj_mcs.py
 our $connectivity = 0.1;# Algebr. connectivity threshold
 our $cpus   = 0;	# 0 = autodetect
 our $evalue = "1e-05";
+our $purity = 0.75;
 our $coverage = 50;	# Percent coverage threshold for two proteins
 our $identity = 25;	# Percent identity threshold for two proteins
 our $blastmode = "blastp+";
 #our $tmpdir = "./";	# Dir for tmp-files
 our $sim = 0.95;
 our $report = 3;
-our $startat = 0;
-our $stopat = -1;
+our $startat = undef;	# removed 5.16
+our $stopat = undef;	# removed 5.16
 our $keep = 0;
 our $force = 0;
 our $selfblast = 0;
@@ -98,6 +99,12 @@ our $run_id = "";
 our %gene_counter;		# Holds the number of genes for each data file (for sorting)
 our $threads_per_process :shared = 1;	# Number of subthreads for blast
 
+# Split work
+our $split_to_X_jobs = -1;
+our $jobnumber = -1;
+our $part = -1;
+
+
 ##########################################################################################
 # Parameters
 ##########################################################################################
@@ -117,6 +124,7 @@ foreach my $option (@ARGV) {
   elsif ($option =~ m/^--?cpus=(\d*)$/)    		{ $cpus = $1; }
   elsif ($option =~ m/^--?cpus=auto$/)	     		{ $cpus = 0; }
   elsif ($option =~ m/^--?alpha=([0-9\.]+)$/)    	{ $alpha = $1; }
+  elsif ($option =~ m/^--?purity=([0-9\.]+)$/)    	{ $purity = $1; }
   elsif ($option =~ m/^--?report=([0-9]+)$/)   		{ $report = $1; }  
   elsif ($option =~ m/^--?conn=([0-9\.]+)$/)   		{ $connectivity = $1; }
   elsif ($option =~ m/^--?cov=([0-9]+)$/)    		{ $coverage = $1; }
@@ -126,6 +134,7 @@ foreach my $option (@ARGV) {
   elsif ($option =~ m/^--?sim=([0-9\.]+)$/)    		{ $sim = $1; }
   elsif ($option =~ m/^--?startat=([0-9]+)$/)    	{ $startat = $1; }
   elsif ($option =~ m/^--?stopat=([0-9]+)$/)    	{ $stopat = $1; }
+  elsif ($option =~ m/^--?jobs?=([\d]+)\/([\d]+)$/) 	{ $jobnumber = $1; $split_to_X_jobs = $2; }  
   elsif ($option =~ m/^--?selfblast$/) 			{ $selfblast = 1; }  
   elsif ($option =~ m/^--?selfblast=(0|1)$/) 		{ $selfblast = $1; }  
   elsif ($option =~ m/^--?singles$/) 			{ $singles = 1; }  
@@ -145,7 +154,7 @@ foreach my $option (@ARGV) {
   elsif ($option =~ m/^--?graph/)			{ $nograph = 0; }  
   elsif ($option =~ m/^--?desc/)			{ $desc = 1; }  
   elsif ($option =~ m/^--?project=(.*)$/)		{ $project = $1; }
-  elsif ($option =~ m/^--?blastParameters=(.*)$/i)	{ $blastOptions = $1;}
+  elsif ($option =~ m/^--?subpara=(.*)$/i)	{ $blastOptions = $1;}
   elsif ($option !~ /^-/)				{ push(@files,$option); }
   else  {&print_usage(); die "Invalid command line option: \'$option\'!\n"; }
 }
@@ -153,13 +162,23 @@ foreach my $option (@ARGV) {
 ##########################################################################################
 # Check parameters
 ##########################################################################################
-if ($startat != 0 || $stopat != -1) {
-	if ($step != 2) {
-		&Error("Parameters -startat and -stopat only work for step 2!");
-	}
-	$run_id = "_$startat-$stopat";
+if (defined($startat) || defined($stopat)) {
+	&Error("Sorry, -startat and -stopat were removed. Please use -jobs=M/N for more flexible job splitting.");	
+}
+
+if ($split_to_X_jobs == 0) {
+	&Error("Job parameter use incorrectly. Number of jobs cannot be 0. Valid format: M/N");
+}
+
+if ($jobnumber != -1 && ($jobnumber == 0 || $jobnumber > $split_to_X_jobs)) {
+	&Error("Job parameter use incorrectly. Job number cannot be 0 or greater than number of jobs. Valid format: M/N");
+}
+if ($jobnumber != -1) {
+	if ($step != 2) {&Error("Parameter -jobs only works for step 2!");}
+	$run_id = "_$jobnumber"."_".$split_to_X_jobs;
 }
 
+
 our $simgraph = "$project.blast-graph$run_id";		# Output file graph
 our $syngraph = "$project.ffadj-graph$run_id";		# Output file synteny
 
@@ -175,7 +194,7 @@ our $desctable = "$project.descriptions";			# Output file seq descriptions
 
 if (-e "$project.proteinortho") {
 	print STDERR "!!!\nWarning: Data files for project '$project' already exists. Previous output files might be overwritten.\n";
-	print STDERR "Press CTRL + C to prevent me from proceeding)\nWaiting for 10 seconds...\n!!!\n";
+	print STDERR "Press CTRL + C to prevent me from proceeding\nWaiting for 10 seconds...\n!!!\n";
 	sleep(10);
 	print STDERR "\nWell then, proceeding...\n\n";
 }
@@ -190,7 +209,7 @@ if (-e "$project.proteinortho") {
 # Always do
 &check_files;		# Check files, count sequences
 @files = ();
-foreach my $file (sort { $gene_counter{$b} <=> $gene_counter{$a} } keys %gene_counter) {push(@files,$file);}	# Biggest first
+foreach my $file (sort { if ($gene_counter{$a} == $gene_counter{$b}) {$a cmp $b;} else {$gene_counter{$b} <=> $gene_counter{$a};} } keys %gene_counter) {push(@files,$file);}	# Biggest first # Alphabet otherwise 5.16
 
 
 # Step 1, check files and make indices
@@ -246,36 +265,36 @@ sub clean {
 
 sub cluster {
 	print STDERR "Clustering by similarity (Proteinortho mode)\n";
-	system("$po_path/proteinortho5_clustering -verbose $verbose -conn $connectivity -rmgraph '$rm_simgraph' $simgraph* >'$simtable'");
+	system("$po_path/proteinortho5_clustering -verbose $verbose -conn $connectivity -purity $purity -rmgraph '$rm_simgraph' $simgraph* >'$simtable'");
 	if ($singles) {
 		print STDERR "Adding singles...\n";
 		my $fastas = "'".join("' '", at files)."'";
 		system("$po_path/proteinortho5_singletons.pl $fastas <'$simtable' >>'$simtable'");
 	}
-        print STDERR "-> Output written to $simtable\n";
+        print STDERR "[OUTPUT] -> written to $simtable\n";
 	unless ($nograph) {
 		print STDERR "Writing graph...\n";
 #		system("$po_path/proteinortho5_clean_edges -e '$rm_simgraph' $simgraph* >'$csimgraph'");
 		system("$po_path/proteinortho5_clean_edges2.pl '$rm_simgraph' $simgraph* >'$csimgraph'");
 		unless ($keep) {unlink($rm_simgraph);}
-		print STDERR "-> Output written to $csimgraph\n";
+		print STDERR "[OUTPUT] -> written to $csimgraph\n";
 	}
 
 	if ($synteny) {
 		print STDERR "\nClustering by gene-order (POFF mode)\n";
-		system("$po_path/proteinortho5_clustering -verbose $verbose -conn $connectivity -rmgraph '$rm_syngraph' $syngraph* >'$syntable'");
+		system("$po_path/proteinortho5_clustering -verbose $verbose -conn $connectivity -purity $purity -rmgraph '$rm_syngraph' $syngraph* >'$syntable'");
 		if ($singles) {
 		print STDERR "Adding singles...\n";
 			my $fastas = "'".join("' '", at files)."'";
 			system("$po_path/proteinortho5_singletons.pl $fastas <'$syntable' >>'$syntable'");
 		}
-		print STDERR "-> Output written to $syntable\n";
+		print STDERR "[OUTPUT] -> written to $syntable\n";
 		unless ($nograph) {
 			print STDERR "Writing graph...\n";
 #			system("$po_path/proteinortho5_clean_edges -e '$rm_syngraph' $syngraph* >'$csyngraph'");
 			system("$po_path/proteinortho5_clean_edges2.pl '$rm_syngraph' $syngraph* >'$csyngraph'");
 			unless ($keep) {unlink($rm_syngraph);}
-			print STDERR "-> Output written to $csyngraph\n";
+			print STDERR "[OUTPUT] -> written to $csyngraph\n";
 		}
 	}
 }
@@ -290,11 +309,32 @@ sub print_usage {
 print STDERR << "JUS";
 
 Usage: proteinortho5.pl [OPTIONS] FASTA1 FASTA2 [FASTA...]
-Options: -e=          E-value for blast [default: 1e-05]
-         -p=          blast program [default: blastp+]
-                      {blastn|blastp|tblastx|blastn+|blastp+|tblastx+}
+Options:
+         [General options]
          -project=    prefix for all result file names [default: myproject]
+         -cpus=       number of processors to use [default: auto]
+         -verbose     keeps you informed about the progress
          -temp=       path for temporary files [default: working directory]
+         -keep        stores temporary blast results for reuse
+         -force       forces recalculation of blast results in any case
+         -clean       remove all unnecessary files after processing
+         -step=       1 -> generate indices
+                      2 -> run blast (and ff-adj, if -synteny is set)
+                      3 -> clustering
+                      0 -> all (default)
+
+         [Search options]
+         -p=          blast program [default: blastp+]
+         -e=          E-value for blast [default: 1e-05]
+                      {blastn|blastp|tblastx|blastn+|blastp+|tblastx+}
+         -selfblast   apply selfblast, detects paralogs without orthologs
+         -sim=        min. similarity for additional hits (0..1) [default: 0.95]
+         -identity=   min. percent identity of best blast hits [default: 25]
+         -cov=        min. coverage of best blast alignments in % [default: 50]
+         -subpara=    additional parameters for the search tool
+                      example -subpara='-seg no'
+
+         [Synteny options]
          -synteny     activate PoFF extension to separate similar sequences
                       by contextual adjacencies (requires .gff for each .fasta)
          -dups=       PoFF: number of reiterations for adjacencies heuristic,
@@ -303,33 +343,30 @@ Options: -e=          E-value for blast [default: 1e-05]
                       adjacency matches (default: 3)
          -alpha=      PoFF: weight of adjacencies vs. sequence similarity
                       (default: 0.5)
-         -desc        write description files (for NCBI FASTA input only)
-         -keep        stores temporary blast results for reuse
-         -force       forces recalculation of blast results in any case
-         -cpus=       number of processors to use [default: auto]
-         -selfblast   apply selfblast, detects paralogs without orthologs
+
+         [Clustering options]
          -singles     report singleton genes without any hit
-         -identity=   min. percent identity of best blast hits [default: 25]
-         -cov=        min. coverage of best blast alignments in % [default: 50]
+         -purity=     avoid spurious graph assignments [0.01-1, default 0.75]
          -conn=       min. algebraic connectivity [default: 0.1]
-         -sim=        min. similarity for additional hits (0..1) [default: 0.95]
-         -step=       1 -> generate indices
-                      2 -> run blast (and ff-adj, if -synteny is set)
-                      3 -> clustering
-                      0 -> all (default)
+         -nograph     do not generate .graph file (pairwise orthology relations)
+
+         [Misc options]
+         -desc        write description files (for NCBI FASTA input only)
          -blastpath=  path to your local blast (if not installed globally)
-         -verbose     keeps you informed about the progress
-         -clean       remove all unnecessary files after processing
-         -graph       generate .graph files (pairwise orthology relations)
          -debug       gives detailed information for bug tracking
 
-         More specific blast parameters can be defined by
-         -blastParameters='[parameters]' (e.g. -blastParameters='-seg no')
-
-         In case jobs should be distributed onto several machines, use
-         -startat=    File number to start with (default: 0)
-         -stopat=     File number to end with (default: -1)
-
+         [Large compute jobs]
+          In case jobs should be distributed onto several machines, use
+         -jobs=M/N    N defines the number of defined job groups (e.g. PCs)
+                      M defines the set of jobs to run in this process
+
+         Example:     run with 
+                        -step=1
+                      to prepare data then to split a run on two machines use
+                        -jobs=1/2 -step=2 on PC one and
+                        -jobs=2/2 -step=2 on PC two
+                      finally run with
+                        -step=3 to finalize
 JUS
 }
 
@@ -373,7 +410,23 @@ sub set_threads_per_process {
 sub run_blast {
 	# Jobs todo
 	$jobs_todo = 0;
-	for (my $k = $startat; $k < scalar(@files); $k++) {$jobs_todo += scalar(@files)-1-$k+$selfblast;if ($stopat > -1 && $stopat <= $k) {last;}}
+	for (my $i = 0; $i < scalar(@files)-1+$selfblast; $i++) {$jobs_todo += scalar(@files)-$i-1+$selfblast;}
+
+	# Divide 5.16
+	$part = int($jobs_todo/$split_to_X_jobs);	# Round up to not miss anything
+#	print STDERR "$jobs_todo/$split_to_X_jobs = $part\n";
+	my $from = ($jobnumber-1)*($part)+1;
+	if ($jobnumber == 1) {$from = 1;}
+	my $to = ($jobnumber-1)*($part)+$part;
+	if ($jobnumber == $split_to_X_jobs) {$to = $jobs_todo;}
+
+	$part = 1+$to-$from;	# real part
+
+	if ($split_to_X_jobs != -1 && $part < 1) {&Error("I have problems coordinating $split_to_X_jobs groups for $jobs_todo jobs.");}
+	if ($split_to_X_jobs <= 1) {$from = -1; $to = -1; $split_to_X_jobs = -1; $part = -1;}
+
+#	print STDERR "$from - $to (TODO: $jobs_todo)\n";
+
 	&set_threads_per_process(0);	# Check if we can apply more threads, nothing is running so far
 	&print_blast_stats();
 
@@ -381,26 +434,26 @@ sub run_blast {
 	for (my $i = 0; $i < $cpus; $i++) {threads->create('workerthread');}
 
 	# For each file against each other file
-	for (my $i = $startat; $i < scalar(@files); $i++) {
+	my $job_number = 0;
+	SPEC: for (my $i = 0; $i < scalar(@files)-1+$selfblast; $i++) {
 		for (my $j = $i+1-$selfblast; $j < scalar(@files); $j++) {
+			$job_number++;
+			if ($from != -1 && $job_number < $from) {next;}
 			# Wait for queque to get empty (with some buffer)
-			while ($jobque->pending() > 256 + 2*$cpus) {
-				sleep(1);
-			}
+			while ($jobque->pending() > 32 + 2*$cpus) {sleep(1);}
 			# Syncronize with other processes
 			$jobque->enqueue("$i $j");
+#			print STDERR "EN: $job_number -> $i $j\n";
+			if ($to != -1 && $job_number >= $to) {last SPEC;}
 		}
-		# Start-Stop
-		if ($stopat > -1 && $stopat <= $i) {last;}
 	}
-
 	# Tell all threads they can stop
 	{lock($all_jobs_submitted); $all_jobs_submitted = 1;}
 
 	# Wait until all jobs are done
 	foreach (threads->list()) 	{$_->join();}
 	&print_blast_stats();		print STDERR "\n";
-	print STDERR "-> Output written to $simgraph\n";
+	print STDERR "[OUTPUT] -> written to $simgraph\n";
 }
 
 sub workerthread {
@@ -423,7 +476,7 @@ sub workerthread {
 
 				{
 				lock($jobs_done);							# 	Productive
-				if ($jobs_done >= $jobs_todo) {						# 	Productive
+				if ($jobs_done >= $jobs_todo || ($part != -1 && $jobs_done >= $part)) {	# 	Productive
 #				lock($all_jobs_submitted);						# 	DEBUGGING
 #				if ($all_jobs_submitted) {						# 	DEBUGGING
 					if ($debug) {print STDERR "Thread $thread_id\tis leaving\n";}
@@ -459,6 +512,8 @@ sub workerthread {
 			$result_ji = &blast($file_j,$file_i);
 		}
 
+		if ($file_i eq $file_j && !$selfblast) {die("Selfblast is disabled but I want to check $file_i vs $file_j");}
+
 		my %lengths;
 		if ($file_i eq $file_j) {
 			# Get lengths once is enough (selfblast)
@@ -490,7 +545,7 @@ sub workerthread {
 			my %full_id_map_i;
 			my %full_id_map_j;
 			foreach (sort keys %reciprocal_matches) {
-				my ($a, $b) = split(" ",$_);
+				my ($a, $b) = split(/\s/,$_);
 				my $aa = &convertNCBI($a);
 				my $bb = &convertNCBI($b);
 				if ($aa ne $a) {
@@ -694,13 +749,13 @@ sub synteny_matches {
 	my @multis;	# array that contains all multi-edges
 	# Convert reciprocal matches to ffadj input
 	foreach (keys %reciprocal_matches) {
-		my @values = split(" ",$reciprocal_matches{$_});
-		my ($a, $b) = split(" ",$_);
+		my @values = split(/\s/,$reciprocal_matches{$_});
+		my ($a, $b) = split(/\s/,$_);
 		unless (defined($order{$a})) {$a = &convertNCBI($a);}
 		unless (defined($order{$b})) {$b = &convertNCBI($b);}
-		my @a = split(" ",$order{$a});
-		my @b = split(" ",$order{$b});
-		unless (defined($a[0])) {die();}
+		my @a = split(/\s/,$order{$a});
+		my @b = split(/\s/,$order{$b});
+		unless (defined($a[0])) {die("Failed parsing gene IDs from blast output/gff input\n");}
 
 		unless (defined($multis[$a[0]])) {$multis[$a[0]] = $b[0];}
 		else 				 {$multis[$a[0]] .= ','.$b[0];}
@@ -765,7 +820,14 @@ sub print_blast_stats {
 		lock($jobs_done);
 		my $percent = int($jobs_done/$jobs_todo*10000)/100;
 		print STDERR "\r                                                                               ";
-		print STDERR "\rRunning blast analysis: $percent% ($jobs_done/$jobs_todo)";
+	
+		if ($split_to_X_jobs == -1) {
+			print STDERR "\rRunning blast analysis: $percent% ($jobs_done/$jobs_todo)";
+		}
+		else {	# 5.16
+			$percent = int($jobs_done/$part*10000)/100;
+			print STDERR "\rRunning blast analysis: $percent% ($jobs_done/$part, $jobs_todo in total)";
+		}
 	}
 }
 
@@ -823,8 +885,10 @@ sub get_legal_matches {
 		if ($blastmode eq "tblastx+" || $blastmode eq "tblastx") {$alignment_length *= 3;}
 		if ($alignment_length < $length{$query_id}*($coverage/100)+0.5) 		{next;}
 		if ($alignment_length < $length{$subject_id}*($coverage/100)+0.5) 		{next;}
+
 		# It hit itself (only during selfblast)
-		if ($query_id eq $subject_id) 							{next;}
+		# if ($selfblast && $query_id eq $subject_id) 					{next;}	# 5.16 reuse IDs
+		## Listing them in the graph is okay, clustering will ignore them
 
 		# Similar hits? Take the better one
 		if (defined($result{"$query_id $subject_id"})) {
@@ -844,17 +908,22 @@ sub get_legal_matches {
 # Auto set the number of CPUs
 sub auto_cpus {
 	if ($cpus == 0) {
-		my $cpu_x = 0;
-		# Linux
-		if (-e "/proc/cpuinfo") {
-			$cpu_x = qx(grep processor /proc/cpuinfo | wc -l);
+		my $cpu_x = qx(getconf _NPROCESSORS_ONLN); 		$cpu_x =~ s/[^0-9]//g;
+		# Fallback
+		if (length($cpu_x) == 0 || $cpu_x == 0) {
+			# Linux
+			if (-e "/proc/cpuinfo") {
+				$cpu_x = qx(grep processor /proc/cpuinfo | wc -l);
+			}
+			# Try Mac
+			else {
+				$cpu_x = qx(system_profiler | grep CPUs:);
+			}
+			$cpu_x =~ s/[^0-9]//g;
 		}
-		# Try Mac
-		else {
-			$cpu_x = qx(system_profiler | grep CPUs:);
+		if (length($cpu_x) == 0 || $cpu_x == 0) {
+			print STDERR "failed! Use 1 core only\n";$cpu_x = 1;
 		}
-		$cpu_x =~ s/[^0-9]//g;
-		if (length($cpu_x) == 0 || $cpu_x == 0) {print STDERR "failed! Use 1 core only\n";$cpu_x = 1;}
 		print STDERR "Detected $cpu_x available CPU threads, ";
 		$cpus = $cpu_x;
 	}
@@ -892,8 +961,9 @@ sub blast {
 
 	# File does not exists yet or I am forced to rewrite it
 	if (!(-s $bla) || $force) {
-		if ($debug) {print STDERR "$command >$bla\n";}
-		system("$command | sort -k11,11g >'$bla'");		# run blast and presort (speeds up best alignment search but is NOT mandatory)
+		if ($debug) {print STDERR "$command >$bla.tmp\n";}											# 5.16
+		system("$command | sort -k11,11g >'$bla.tmp'");		# run blast and presort (speeds up best alignment search but is NOT mandatory)	# 5.16
+		system("mv '$bla.tmp' '$bla'");				# move when done, aids when job fails (no half bla files)			# 5.16
 		my @data = &readFile($bla);
 		unless ($keep) {unlink($bla);}				# delete tmp file
 		return \@data;
@@ -956,11 +1026,12 @@ sub check_blast {
 
 # Check plausibility of files
 sub check_files {
-	my %ids;
 	if (scalar(@files) < 2 && $step != 3)		{&print_usage; &Error("I need at least two files to compare something!");}
 	print STDERR "Checking input files\n";
 	foreach my $file (@files) {
-		&read_details($file,\%ids);
+		if ($verbose) {print STDERR "Checking $file... ";}
+		&read_details($file,1);
+		if ($verbose) {print STDERR "ok\n";}
 	}
 }
 
@@ -972,12 +1043,13 @@ sub convertNCBI {
 }
 
 sub read_details {
-	my %ids;
+	my %ids;				# local test for duplicated IDs
 	my %genes;
 	my $file = shift;
 	my $test = 0;
-	if (defined($_[0])) {$test = 1; %ids = %{(shift)};}	# if no ID Hash is give, we do not want to test but to fetch the gff data
+	if (defined($_[0])) {$test = 1;}	# if no ID Hash is give, we do not want to test but to fetch the gff data
 
+#	print STDERR "TEST: $test\n";
 	if (!-e $file)		{&Error("File '$file' not found!");}
 	open(FASTA,"<$file") || &Error("Could not open '$file': $!");
 	while (<FASTA>) {
@@ -987,7 +1059,8 @@ sub read_details {
 			$_ =~ s/^>//;
 			$_ =~ s/\s.*//;
 			if ($test) {
-				if (defined($ids{$_}))	{&Error("Gene ID '$_' is defined at least twice:\n$ids{$_}\n$file");}
+#				print STDERR "ID: $_\n";
+				if (defined($ids{$_}))	{&Error("Gene ID '$_' is defined at least twice in $file");}
 				$ids{$_} = $file;
 			}
 			if ($synteny) {
@@ -1001,10 +1074,11 @@ sub read_details {
 	unless ($synteny) {return;}
 
 	my %coordinates;
-	if ($verbose && $test) {print STDERR "$file\t".scalar(keys %genes)." genes\n";}
+	if ($verbose && $test) {print STDERR "$file\t".scalar(keys %genes)." genes\t";}
 	my $gff = &gff4fasta($file);
 	open(GFF,"<$gff") || &Error("Could not open '$gff': $!");
 	while (<GFF>) {
+		if ($_ =~ /^##FASTA/) {last;}	# deal with prokka gffs, thx 2 Ben Woodcroft
 		if ($_ =~ /^#/) {next;}
 		# e.g. NC_009925.1	RefSeq	CDS	9275	10096	.	-	0	ID=cds8;Name=YP_001514414.1;Parent=gene9;Dbxref=Genbank:YP_001514414.1,GeneID:5678848;gbkey=CDS;product=signal peptide peptidase SppA;protein_id=YP_001514414.1;transl_table=11
 		my @col = split(/\t+/,$_);
@@ -1033,7 +1107,7 @@ sub read_details {
 }
 
 sub Error {
-	print STDERR "Error: ".$_[0]."\n";
+	print STDERR "\n[ERROR] ".$_[0]."\n";
 	exit 0;
 }
 
@@ -1075,6 +1149,6 @@ sub write_descriptions {
 			}
 		}
 	}
-	print STDERR "-> Output written to $desctable\n";
+	print STDERR "[OUTPUT] -> written to $desctable\n";
 
 }


=====================================
proteinortho5_clean_edges2.pl
=====================================
--- a/proteinortho5_clean_edges2.pl
+++ b/proteinortho5_clean_edges2.pl
@@ -11,13 +11,19 @@ if (!defined($ARGV[1])) {
 
 # Proteinortho Output
 print STDERR "Cleaning edge list...\n";
+# Reading removal list
 my %map;
 open(IN,"<$ARGV[0]") || die("Could not open file $ARGV[0]: $!");
 while (<IN>) {
 	chomp;
-	my ($a, $b) = sort split(' ',$_,2);
-	unless ($b) {die("Line does not match filter list pattern\n");}
+	my ($ta, $sa, $tb, $sb) = split("\t",$_,4);	# name, species, name, species
+	unless ($sb) {die("Line does not match filter list pattern\n");}
+
+	$ta .= " ".$sa;
+	$tb .= " ".$sb;
+	my ($a, $b) = sort($ta,$tb); # bidirectional edges store once is fine
 	$map{$a.' '.$b} = 1;
+	
 }
 close(IN);
 
@@ -25,20 +31,48 @@ close(IN);
 shift(@ARGV);
 my $rm = 0;
 my $all = 0;
+# For all blastgraphs
+my $filea = "";
+my $fileb = "";
 foreach my $file (@ARGV) {
 	open(IN,"<$file") || die("Could not open file $file: $!");
 	while (<IN>) {
-		if ($_ =~ /^#/) {print $_; next;}
-		my ($a,$b,undef) = split("\t",$_,3);
-		unless ($b) {die("Line does not match Proteinortho graph format\n");}
-		$all++;
-		($a,$b) = sort($a, $b);
-
-		if (exists($map{$a.' '.$b})) {$rm++; next;}
-		print $_;
+		if ($_ =~ /^#/) {
+			print $_; 
+			chomp;
+			$_ =~ s/^#\s*//;
+			my ($a,$b,undef) = split("\t",$_,3);
+			if ($a eq "a" || $a eq "file_a") {next;}
+			unless (defined($b)) {die("Line does not match Proteinortho graph format. Filename missing.\n");}
+			# Keep track of files/species
+			$filea = $a;
+			$fileb = $b;
+		}
+		else {
+			unless ($_ =~ /\t/) {next;}
+			my ($ta,$tb,undef) = split("\t",$_,3);
+			unless ($tb) {die("Line does not match Proteinortho graph format\n");}
+			$all++;
+	
+			$ta .= " ".$filea;
+			$tb .= " ".$fileb;
+
+			my ($a, $b) = sort($ta,$tb); # bidirectional edges store once is fine
+			($a,$b) = sort($a, $b);
+			if (exists($map{$a.' '.$b})) {
+				$rm++;
+				next;
+			}
+	
+	
+#			if ($a eq $b) {$all--; next;}	# Selfhits are ignored
+			print $_;
+		}
 	}
 	close(IN);
 }
 
 print STDERR "Removed $rm / $all edges\n";
 print STDERR "Done.\n";
+
+## TODO: Adjust to remove selfblast hits and watch


=====================================
proteinortho5_clustering.cpp
=====================================
--- a/proteinortho5_clustering.cpp
+++ b/proteinortho5_clustering.cpp
@@ -3,7 +3,7 @@
  *	Reads edge list and splits connected components
  *	according to algebraic connectivity threshold
  *
- *	Last updated: 2016/04/22		
+ *	Last updated: 2017/03/20		
  *	Author: Marcus Lechner
  */
 
@@ -40,6 +40,8 @@ vector<unsigned int> get_deg_one (vector<unsigned int>& );
 // Parameters
 bool param_verbose 		= false;
 double param_con_threshold 	= 0.1;
+unsigned int debug_level	= 0;
+double param_sep_purity 	= 0.75;
 string param_rmgraph            = "remove.graph";
 unsigned int min_iter 		= 100;		// min number of alg con iterations
 unsigned int max_iter 		= 10000;	// max number of alg con iterations
@@ -98,6 +100,14 @@ int main(int argc, char *argv[]) {
 			paras++;
 			param_con_threshold = string2double(string(argv[paras]));
 		}
+		else if (parameter == "-purity") {
+			paras++;
+			param_sep_purity = string2double(string(argv[paras]));
+		}
+		else if (parameter == "-debug") {
+			paras++;
+			debug_level = int(string2double(string(argv[paras])));
+		}
 		else if (parameter == "-rmgraph") {
 			paras++;
 			param_rmgraph = string(argv[paras]);
@@ -109,9 +119,14 @@ int main(int argc, char *argv[]) {
 		}
 	}
 	
+	if (debug_level > 0) cerr << "[DEBUG] Debug level " << debug_level << endl;
 
 	// Parse files
-	for (vector<string>::iterator it=files.begin() ; it != files.end(); it++) parse_file(*it);
+	for (vector<string>::iterator it=files.begin() ; it != files.end(); it++) {
+		if (debug_level > 0) cerr << "[DEBUG] Parsing file " << *it << endl;
+		parse_file(*it);
+		if (debug_level > 0) cerr << "[DEBUG] I know " << species_counter <<  " species with " << protein_counter << " proteins and " << edges << " edges in sum" << endl;
+	};
 
 	// Free memory
 	files.clear();
@@ -122,6 +137,7 @@ int main(int argc, char *argv[]) {
 	if (param_verbose) cerr << species_counter << " species" << endl << protein_counter << " paired proteins" << endl << edges << " bidirectional edges" << endl;
 
 	// Prepare sort of output
+	if (debug_level > 0) cerr << "[DEBUG] Sorting known species" << endl;
 	sort_species();
 
 	// Write output header
@@ -131,6 +147,7 @@ int main(int argc, char *argv[]) {
 	graph_clean.open(param_rmgraph.c_str());
 
 	// Clustering
+	if (debug_level > 0) cerr << "[DEBUG] Clustering" << endl;
 	partition_graph();
 	graph_clean.close();
 	}
@@ -283,14 +300,19 @@ void partition_graph() {
 
 		// Do not report singles
 		if (current_group.size() < 2) {
+			if (debug_level > 1) cerr << "[DEBUG] Single gene skipped" << endl;
 			protein_id++;
 			continue;
 		}
 
 		// Connectivity analysis
+		if (debug_level > 0) cerr << "[DEBUG] Calculating connectivity of a group. " << current_group.size() << " proteins (ID: " << protein_id << ")" << endl;
+		if (current_group.size() > 16000) cerr << "[WARN] Current connected component conatins " << current_group.size() << " proteins. That might be way to much for me. Try raising the e-value!" << endl;
 		double connectivity = getConnectivity(current_group);
+		if (debug_level > 0) cerr << "[DEBUG] Connectivity was " << connectivity << endl;
 
 		if (connectivity < param_con_threshold && current_group.size() > 3) {
+			if (debug_level > 0) cerr << "[DEBUG] Reiterating" << endl;
 			// Split groups is done in getConnectivity function
 			// Reset flags and repeat without incrementing protein counter
 			for (unsigned int i = 0; i < current_group.size(); i++) done[current_group[i]] = false;
@@ -432,7 +454,8 @@ void removeExternalEdges(map<unsigned int,bool>& a) {
 			for (vector<unsigned int>::iterator ita = graph[protein].edges.begin(); ita != graph[protein].edges.end(); ita++) {
 				// If it is not present the own group, set flag
 				if (a.find(*ita) == a.end()) {
-					graph_clean << graph[protein].full_name << "\t" << graph[*ita ].full_name << endl; // Improved graph cleaning
+					// 5.16 store name AND species in removal list
+					graph_clean << graph[protein].full_name << "\t" << species[graph[protein].species_id] << "\t" << graph[*ita ].full_name << "\t" << species[graph[*ita ].species_id] << endl; // Improved graph cleaning
 					swap = true;
 				}
 				// Otherwise, add it to the new edge list
@@ -497,12 +520,16 @@ void splitGroups(vector<double>& y, vector<unsigned int>& nodes){
 //		return;
 //	}
 
-	// Store data about two groups
-	map<unsigned int,bool> groupA, groupB;
+	// Store data about two groups (Zero cannot be assigned with certainty)
+	map<unsigned int,bool> groupA, groupB, groupZero;
 	for (unsigned int i = 0; i < y.size(); i++) {
-//		cerr << i << " = " << y[i] << endl;		
-		if (y[i] < 0) 		{groupA[nodes[i]] = true;} // cerr << graph[nodes[i]].full_name << " {color:#95cde5}" << endl; }
-		else              	{groupB[nodes[i]] = true;} // cerr << graph[nodes[i]].full_name << " {color:#b01700}" << endl; }
+		unsigned int style = 1;
+		if (y[i] < 0) style = 2;
+		if (abs(y[i]) < param_sep_purity) style = 0;
+		if (debug_level > 2) cerr << "[DEBUG L3] " << i << " = " << y[i] << " -> " << style << endl;
+		if      (abs(y[i]) < param_sep_purity)	{groupZero[nodes[i]] = true;} // cerr << graph[nodes[i]].full_name << " {color:#95cde5}" << endl; }
+		else if (y[i] < 0) 				{groupA[nodes[i]] = true;} // cerr << graph[nodes[i]].full_name << " {color:#95cde5}" << endl; }
+		else              				{groupB[nodes[i]] = true;} // cerr << graph[nodes[i]].full_name << " {color:#b01700}" << endl; }
 	}
 
 //	cerr << groupA.size() << " -- " << groupB.size() << endl;
@@ -512,22 +539,18 @@ void splitGroups(vector<double>& y, vector<unsigned int>& nodes){
 		throw "Failed to partition subgraph! This might lead to an infinit loop. Please submit the .blastgraph file to lechner at staff.uni-marburg.de to help fixing this issue.";
 	}
 
+	removeExternalEdges(groupZero);
 	removeExternalEdges(groupA);
 	removeExternalEdges(groupB);
 }
 
 double getConnectivity(vector<unsigned int>& nodes) {
-	// Special cases hotfix
-	if (nodes.size() == 2) {return 1;}
-
-	if (nodes.size() == 3) {
-		vector<unsigned int> min = get_deg_one(nodes);
-		// fully connected
-		if (min.size() == 0) {return 1;}
-		// not
-		else {return 1/3;}
+	// Special case quicky
+	if (nodes.size() == 2) {
+		if (debug_level > 1) cerr << "[DEBUG L2] Shortcut for two-piece component -> conn = 1" << endl;
+		return 1;
 	}
-	// Hotfix end
+	// quicky end
 
 	// Get max degree of nodes
 	unsigned int max_degree = max_deg(nodes);
@@ -546,7 +569,9 @@ double getConnectivity(vector<unsigned int>& nodes) {
 
 	// Repeat until difference < epsilon
 	unsigned int iter = 0;							// catch huge clustering issues by keeping track here
+	
 	while(iter++ < max_iter) { 
+		if (debug_level > 2 && iter%100 == 0) cerr << "[DEBUG L2] Step " << iter << " / " << max_iter << endl;
 		last_length = current_length;
 		// Get a new x
 		x = get_new_x(norm, nodes, mapping);
@@ -557,9 +582,10 @@ double getConnectivity(vector<unsigned int>& nodes) {
 		// Get lenght (lambda) & normalize vector
 		norm = nomalize(x_hat, &current_length);
 ///		cerr << "I " << iter << ": " << abs(current_length-last_length) << endl; 
-		if (abs(current_length-last_length) < 0.00001 && iter >= min_iter) break;	// min 100 iterations (prevent convergence by chance), converge to 1e-6
+		if (abs(current_length-last_length) < 0.0001 && iter >= min_iter) break;	// min 100 iterations (prevent convergence by chance), converge to 1e-6
 	}
 ///		cerr << nodes.size() << " nodes done after " << iter << " iterations" << endl;
+	if (debug_level > 0) cerr << "[DEBUG] " << iter << " / " << max_iter << " iterations required (error is " << abs(current_length-last_length) << ")" << endl;
 
 	double connectivity = (-current_length+2*max_degree)/(nodes.size());
 
@@ -622,13 +648,13 @@ void parse_file(string file) {
 
 				if (file_a == "file_a" && file_b == "file_b") continue;	// Init Header
 
-				// Map species a
+				// Map species a if not know so far
 				if (species2id.find(file_a) == species2id.end())	{
 						species.push_back(file_a);
 //						cerr << "Species " << species_counter << ": " << file_a << endl;
 						species2id[file_a] = species_counter++;
 				}
-				// Map species b
+				// Map species b if not know so far
 				if (species2id.find(file_b) == species2id.end())	{
 //						cerr << "Species " << species_counter << ": " << file_b << endl;
 						species.push_back(file_b);
@@ -644,17 +670,27 @@ void parse_file(string file) {
 
 //				cerr << protein_counter << ": " << fields[0] << " <-> " << fields[1] << endl;
 
+				// 5.16 deal with duplicated IDs by adding file ID to protein ID
+				string ida = fields[0];
+				string idb = fields[1];
+				fields[0] += " "; fields[0] += file_a_id;
+				fields[1] += " "; fields[1] += file_b_id;
+
+				// 5.16 do not point to yourself
+				if (!fields[0].compare(fields[1])) {continue;}
+
 				// A new protein
 				if (protein2id.find(fields[0]) == protein2id.end())	{
 					protein a;
-					a.full_name	= fields[0];
+					a.full_name	= ida;
 					a.species_id	= file_a_id;
 					protein2id[fields[0]] = protein_counter++;
 					graph.push_back(a);
 				}
 				if (protein2id.find(fields[1]) == protein2id.end())	{
 					protein b;
-					b.full_name	= fields[1];	b.species_id	= file_b_id;
+					b.full_name	= idb;
+					b.species_id	= file_b_id;
 					protein2id[fields[1]] = protein_counter++;
 					graph.push_back(b);
 				}


=====================================
test/C.faa
=====================================
The diff for this file was not included because it is too large.

=====================================
test/E.faa
=====================================
The diff for this file was not included because it is too large.

=====================================
test/L.faa
=====================================
--- a/test/L.faa
+++ b/test/L.faa
@@ -477,14 +477,6 @@ GRNCKAADQPGILTHNKMCQCGDFGLPHEKYGFAHLSRSSFGVSCTACSNPTIVSALHRA
 ESIHRSGAWLNSQDSTETGTNLFGPTSKESRESICGNQFSVTATIGQLGYVTDVLAHGMV
 LAFCDFAKTASAGPPKEKVNCDDGHGGLMLEANVFSEESDVRKHLGLSEKGYHCVLTSKT
 LVKFLIKNQTFHCGAEGCVPTGSGAFGD
->L_7
-KEDAIRWPQLGGVSLYGEDANMELGADGVPSTSALGMPWPVLFNANLSGKQCAQCRIFIV
-CHLTQPHFVGAMQVGMSGDDELKDQVKLNGGACKDEFRGNRTLMAMYPFARVLFATPSTV
-SFDKFILKEGFGFLGRCAAKVAATAPLNGVTTACPVQVVSKCCNKGKKELEPLFTAREHA
-ADGCGYASAFTVALEKDSYHCDYYSDHAQYASKSYAKSSRALASYFLIQFISCTKGLCSE
-SHECVKNEFLVKIWPGSKMGASIPTNYVMLSDPAYPYECDDHQNNHCSGEKMPKLQNHPG
-YSAFTQRQKSFTKHLTPAKGERKFDMNGVHLDIGKVTPTTSDEYVEALSLVHPTALNPAF
-GMKAIYYLKRAYKKGLFGLDVHVSKNNTEASKKDYHVVT
 >L_8
 CRLPGGVNERAFGHVNDKDLCMLPSPVFCTQKGKEHPPNKEYQMAGSKPPTWPAAQVVDC
 RHELRTYTGQLPMRSATDLGPNVKSIYTVSERGFKVGQTNHAVCFEAGNVEDKKWKMCHG


=====================================
tools/check_graph_duplicate_ids.pl
=====================================
--- /dev/null
+++ b/tools/check_graph_duplicate_ids.pl
@@ -0,0 +1,46 @@
+#!/usr/bin/perl
+
+use warnings;
+use strict;
+
+my %genehash;
+
+if (defined($ARGV[0])) {
+	print STDERR "Usage: remove_graph_duplicates.pl <GRAPH\n\nChecks a given graph input for duplicates and removes them (species only)\nReads from STDIN\n";
+	exit;
+}
+
+my $a = "unk";
+my $b = "unk";
+while (<STDIN>) {
+	my @row = split(/\s+/);
+	if ($_ =~ /^#/) {
+		unless ($row[1]) {die();}
+		unless ($row[1] eq "file_a" || $row[1] eq "a") {
+			$a = $row[1];
+			$b = $row[2];
+		}
+	}
+	else {
+		unless ($row[1]) {next;}
+
+#		print "$a: $row[0]\n$b: $row[1]\n";
+
+		if (defined($genehash{$row[0]})) {
+			if ($genehash{$row[0]} ne $a) {
+			print "$row[0] was used with $genehash{$row[0]} and $a\n";
+			}
+		}
+		else {
+			$genehash{$row[0]} = $a;
+		}
+		if (defined($genehash{$row[1]})) {
+			if ($genehash{$row[1]} ne $b) {
+			print "$row[1] was used with $genehash{$row[1]} and $b\n";
+			}
+		}
+		else {
+			$genehash{$row[1]} = $b;
+		}
+	}
+}


=====================================
tools/extract_from_graph.pl
=====================================
--- /dev/null
+++ b/tools/extract_from_graph.pl
@@ -0,0 +1,61 @@
+#!/usr/bin/perl
+
+use warnings;
+use strict;
+
+unless (defined($ARGV[0])) {
+	print STDERR "Usage: extract_from_graph.pl PROTEINORTHO_TABLE <GRAPH\n\nExtracts an orthology group from a given graph (STDIN)\n";
+	exit;
+}
+
+my %use_gene;
+my @species;
+open(PO,"<$ARGV[0]") || die($!);
+while (<PO>) {
+	chomp;
+	my @list = split(/\t+/);
+	if ($list[0] eq "# Species") {
+		@species = @list;
+		shift @species;
+		shift @species;
+		shift @species;
+	}
+	else {
+		shift @list;
+		shift @list;
+		shift @list;
+		for (my $i = 0; $i < scalar(@list); $i++) {
+			foreach my $gene (split(",",$list[$i])) {
+				my $id = $gene.' '.$species[$i];
+				$use_gene{$id}++;
+			}
+		}
+	}
+}
+close(PO);
+
+print STDERR scalar(keys %use_gene);
+print STDERR " genes found\n";
+
+my $file_a;
+my $file_b;
+while (<STDIN>) {
+	if ($_ =~ /^#/) {
+		print $_;
+		my @row = split(/\s+/);
+		unless ($row[1]) {die();}
+		unless ($row[1] eq "file_a" || $row[1] eq "a") {
+			$file_a = $row[1];
+			$file_b = $row[2];
+		}
+	}
+	else {
+		my @row = split(/\s+/);
+		unless ($row[1]) {next;}
+		my $ida = $row[0].' '.$file_a;
+		my $idb = $row[1].' '.$file_b;
+		if ($use_gene{$ida} || $use_gene{$idb}) {
+			print $_;
+		}
+	}
+}


=====================================
tools/graph_conv.pl
=====================================
--- /dev/null
+++ b/tools/graph_conv.pl
@@ -0,0 +1,15 @@
+#!/usr/bin/perl
+
+use warnings;
+use strict;
+
+print "graph G {\n";
+while (<STDIN>) {
+	chomp;
+	if ($_ =~ /#/) {next;}
+	my @x = split(/\s/);
+	if ($x[1]) {
+		print " $x[0] -- $x[1];\n"
+	}
+}
+print " }";


=====================================
tools/remove_graph_duplicates.pl
=====================================
--- /dev/null
+++ b/tools/remove_graph_duplicates.pl
@@ -0,0 +1,33 @@
+#!/usr/bin/perl
+
+use warnings;
+use strict;
+
+my %filehash;
+my $flag;
+
+if (defined($ARGV[0])) {
+	print STDERR "Usage: remove_graph_duplicates.pl <GRAPH\n\nChecks a given graph input for duplicates and removes them (species only)\nReads from STDIN\n";
+	exit;
+}
+
+while (<STDIN>) {
+	if ($_ =~ /^#/) {
+		$flag = 1;
+		my @row = split(/\s+/);
+		unless ($row[1]) {die();}
+		unless ($row[1] eq "file_a" || $row[1] eq "a") {
+			my $a = $row[1];
+			my $b = $row[2];
+			my $id = join(",",sort($a,$b));
+
+			if (++$filehash{$id} > 1) {
+				$flag = 0;
+				print STDERR "Found duplicate: $a vs $b\n";
+			}
+		}
+	}
+	if ($flag) {
+		print $_;
+	}
+}



View it on GitLab: https://salsa.debian.org/med-team/proteinortho/commit/6ecf7dce0bb2fb4706bc16e7989b138f814daab2

---
View it on GitLab: https://salsa.debian.org/med-team/proteinortho/commit/6ecf7dce0bb2fb4706bc16e7989b138f814daab2
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.alioth.debian.org/pipermail/debian-med-commit/attachments/20180221/ea1dd82e/attachment-0001.html>


More information about the debian-med-commit mailing list