[med-svn] [blobology] 01/02: New upstream version 0.0+20151216

Andreas Tille tille at debian.org
Fri Oct 21 10:50:41 UTC 2016


This is an automated email from the git hooks/post-receive script.

tille pushed a commit to branch master
in repository blobology.

commit 4a92238a5c1b51acf4b33059a52ffa6286981584
Author: Andreas Tille <tille at debian.org>
Date:   Fri Oct 21 12:11:52 2016 +0200

    New upstream version 0.0+20151216
---
 LICENSE                                            |  20 ++
 README.md                                          |  85 +++++
 adapters.fa                                        |   8 +
 blobology.bash                                     | 172 ++++++++++
 blobologyMethodOverview.png                        | Bin 0 -> 190751 bytes
 ...ie2_extract_reads_mapped_to_specific_contigs.pl | 139 +++++++++
 cleank61.blobplot.txt.taxlevel_order.01.png        | Bin 0 -> 313792 bytes
 dev/README.md                                      |  33 ++
 dev/gc_cov_annotate_eval.pl                        | 347 +++++++++++++++++++++
 fastaqual_select.pl                                | 189 +++++++++++
 fgrep.pl                                           |  37 +++
 gc_cov_annotate.pl                                 | 236 ++++++++++++++
 makeblobplot.R                                     |  86 +++++
 ...-scaffolds.fa.bowtie2.txt.taxlevel_order.01.png | Bin 0 -> 398402 bytes
 separate_reads.bash                                |  51 +++
 shuffleSequences_fastx.pl                          |  17 +
 16 files changed, 1420 insertions(+)

diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..23decbb
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,20 @@
+The MIT License (MIT)
+
+Copyright (c) 2013 blaxterlab
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of
+this software and associated documentation files (the "Software"), to deal in
+the Software without restriction, including without limitation the rights to
+use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
+the Software, and to permit persons to whom the Software is furnished to do so,
+subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
+FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
+COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
+IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..349e795
--- /dev/null
+++ b/README.md
@@ -0,0 +1,85 @@
+Update
+------
+
+### 2015-12-16: For more features and better plots please use Dominik Laetsch's blobtools at https://drl.github.io/blobtools/
+
+
+Blobology
+=========
+
+Blaxter Lab, Institute of Evolutionary Biology, University of Edinburgh
+
+**Goal**: To create blobplots or Taxon-Annotated-GC-Coverage plots (TAGC plots) to visualise the contents of genome assembly data sets as a QC step.
+
+This repository accompanies the paper:  
+**Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots.**
+*Sujai Kumar, Martin Jones, Georgios Koutsovoulos, Michael Clarke, Mark Blaxter*  
+(submitted 2013-10-01 to *Frontiers in Bioinformatics and Computational Biology special issue: Quality assessment and control of high-throughput sequencing data*).
+
+It contains bash/perl/R scripts for running the analysis presented in the paper: creating a preliminary assembly, collating GC content, read coverage and taxon annotation for that assembly, and visualising the result, as in Figure 2a from the paper, which shows TAGC plots/blobplots for *Caenorhabditis* sp. 5:
+![Figure 2a. Caenorhabditis sp. 5 prelim assembly blobplot](https://raw.github.com/blaxterlab/blobology/master/pa61-scaffolds.fa.bowtie2.txt.taxlevel_order.01.png)
+
+**Note**: This is an update to the code at github.com/sujaikumar/assemblage, which was used in my thesis. I could have updated the code in that repository, but enough things have changed (the basic file formats as well) that I thought it made sense to create a new repo. Please use this version from now on.
+
+Installation
+------------
+
+    git clone git://github.com/blaxterlab/blobology.git
+    # add this directory to your path, e.g.:
+    # export PATH=$PATH:/path/to/blobology
+
+You also need the following software in your path:
+
+1.  samtools (tested with version 0.1.19) http://sourceforge.net/projects/samtools/files/samtools/0.1.19/
+2.  R (tested with version 2.15.2)
+3.  ggplot2, an R graphics package (tested with ggplot2_0.9.3.1)
+4.  ABySS - optional if you already have a preliminary assembly from another assembler (tested with version 1.3.6, compiled with mpi) - http://www.bcgsc.ca/platform/bioinfo/software/abyss/releases/1.3.6
+5.  fastq-mcf - not needed if you already have quality- and adapter-trimmed reads (from the ea-utils suite, tested with version 1.1.2-537) - https://ea-utils.googlecode.com/files/ea-utils.1.1.2-537.tar.gz
+
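+To check that these are on your PATH, a minimal sketch (tool names assumed from
+the version notes above; ABySS and fastq-mcf are optional, as noted):
+
+    for tool in samtools R abyss-pe fastq-mcf; do
+        command -v "$tool" >/dev/null || echo "$tool not found in PATH"
+    done
+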
+And the following databases:
+
+1.  NCBI nt blasted (download from ftp://ftp.ncbi.nlm.nih.gov/blast/db/ nt.??.tar.gz)
+
+        wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.??.tar.gz"
+        for a in nt.*.tar.gz; do tar xzf $a; done
+        # or, if you have gnu parallel installed, highly recommended:
+        # parallel tar xzf ::: nt.*.tar.gz
+
+2.  NCBI taxonomy dump (download from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz)
+
+        wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
+        tar xzf taxdump.tar.gz
+
+    Only the nodes.dmp and names.dmp files are needed (nodes.dmp stores the taxon ids and their parent-child relationships, whereas names.dmp stores their common and scientific names).
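+
+    Both files are pipe-delimited text; the scripts here only read the first
+    fields of each nodes.dmp line (taxid, parent taxid, rank). A line looks
+    roughly like this (illustrative):
+
+        2   |   131567  |   superkingdom    |   ...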
+
+Rerun example with Caenorhabditis sp. 5
+-----------------------------------------
+
+Run [blobology.bash](https://github.com/blaxterlab/blobology/blob/master/blobology.bash) from this repository. Comment out the lines that you don't need (e.g., if you prefer using a different assembler for the preliminary assembly, or a different alignment tool for mapping the reads).
+
+Broad overview of the pipeline (Figure 1 in the paper)
+
+<img src="blobologyMethodOverview.png" alt="Figure 1. Broad overview of pipeline" width="50%" height="50%"/>
+
+Run the blobology pipeline for your own sequence data
+-----------------------------------------------------
+
+Run [blobology.bash](https://github.com/blaxterlab/blobology/blob/master/blobology.bash) from this repository. The only things you should really need to change are the read files in the ABySS and Bowtie 2 steps. Even these steps won't be needed if you already have a preliminary assembly and BAM files from aligning your raw reads back to this assembly.
+
+A tab-separated values (TSV) text file is created by [gc_cov_annotate.pl](https://github.com/blaxterlab/blobology/blob/master/gc_cov_annotate.pl) and plotted with the ggplot2-based R script [makeblobplot.R](https://github.com/blaxterlab/blobology/blob/master/makeblobplot.R).
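+
+The columns are the contig id, length, GC proportion, one coverage column per
+BAM/cov file, and one column per requested taxonomic level, e.g. (tab-separated;
+file names and values illustrative):
+
+    seqid    len   gc    cov_lib1.bam  taxlevel_order
+    contig1  1000  0.56  100.2         Rhabditida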
+
+Visualise data using Blobsplorer
+--------------------------------
+
+The Blobsplorer visualiser for the TSV file created above was coded by Martin Jones and is available at github.com/mojones/blobsplorer
+
+
+Separate contigs and reads based on blobplot visualisations
+-----------------------------------------------------------
+
+See [separate_reads.bash](https://github.com/blaxterlab/blobology/blob/master/separate_reads.bash) from this repository for 
+the commands that were used to remove contaminants from the *Caenorhabditis* sp. 5 preliminary assembly.
+
+Your own data set will require you to devise your own positive (to keep contigs, and reads, of interest) and negative (to discard 
+contigs and reads belonging to contaminants) filters.
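+
+As a minimal sketch of a positive filter (taxon name, file names and the column
+number are illustrative; with two coverage columns, taxlevel_order is column 7):
+
+    # keep the ids of contigs annotated as the order of interest
+    awk -F'\t' '$7=="Rhabditida" {print $1}' blobplot.txt >keep.contigids.txt
+    # pull out the read pairs that mapped to those contigs
+    bowtie2_extract_reads_mapped_to_specific_contigs.pl \
+        -s <(samtools view assembly.lib1.bowtie2.bam) -id keep.contigids.txt -u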
diff --git a/adapters.fa b/adapters.fa
new file mode 100644
index 0000000..6fca4ed
--- /dev/null
+++ b/adapters.fa
@@ -0,0 +1,8 @@
+>peP1
+AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
+>peP2
+AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
+>peA1
+ACACTCTTTCCCTACACGACGCTCTTCCGATCT
+>peA2
+CTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
diff --git a/blobology.bash b/blobology.bash
new file mode 100644
index 0000000..8963a25
--- /dev/null
+++ b/blobology.bash
@@ -0,0 +1,172 @@
+#!/bin/bash
+
+## Blobology pipeline. 2013-09-12 Sujai Kumar 
+## sujai.kumar at gmail.com github.com/blaxterlab/blobology
+
+## Needs:
+## Read files (unless you are providing assembly as fasta and
+## read alignments as BAM).
+## ABySS in path (unless you are providing your own assembly)
+## NCBI blast+ suite (2.2.28 or above, which provides taxids in hit)
+## NCBI formatted nt blast database (because it includes taxon information)
+## NCBI taxonomy dump (to calculate higher taxon levels from hit taxids)
+## samtools in path (to manipulate bam files)
+## github blaxterlab/blobology scripts in path
+## fastq-mcf from the ea-utils suite (version 1.1.2-537) if you want to
+## quality- and adapter- trim your raw reads
+
+####==========================================================================
+#### Download read files from SRA
+####==========================================================================
+
+## Uncomment the lines below if you are replicating the results in the
+## Blobology paper. Otherwise provide your own reads/assembly in the assembly/
+## alignment steps.
+## ERR138445 is a 300 bp paired-end library
+## ERR138446 is a 600 bp paired-end library
+
+# wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR138/ERR138445/ERR138445_1.fastq.gz
+# wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR138/ERR138445/ERR138445_2.fastq.gz
+# wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR138/ERR138446/ERR138446_1.fastq.gz
+# wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR138/ERR138446/ERR138446_2.fastq.gz
+
+## Uncomment if you want to do quality- and adapter- trimming using fastq-mcf
+
+#fastq-mcf -o >(gzip -c >ERR138445_1.mcf.fastq.gz) -o >(gzip -c >ERR138445_2.mcf.fastq.gz) -l 50 -q 20 --qual-mean 20 -R adapters.fa \
+#    ERR138445_1.fastq.gz ERR138445_2.fastq.gz &>ERR138445.mcf.err
+#fastq-mcf -o >(gzip -c >ERR138446_1.mcf.fastq.gz) -o >(gzip -c >ERR138446_2.mcf.fastq.gz) -l 50 -q 20 --qual-mean 20 -R adapters.fa \
+#    ERR138446_1.fastq.gz ERR138446_2.fastq.gz &>ERR138446.mcf.err
+
+####==========================================================================
+#### GENERAL CONFIG VARIABLES
+####==========================================================================
+
+## Num of processors. Or, uncomment the line below and enter manually.
+
+NUMPROC=$(grep -c "^processor" /proc/cpuinfo)
+#NUMPROC=16
+
+## K-mer size for prelim assembly. Not needed if you are providing your own
+## assembly fasta file.
+## You can crudely estimate this for your data using
+## http://dna.med.monash.edu.au/~torsten/velvet_advisor/
+
+KMER=61
+
+## Location of local installation of nt blast database
+## (not needed if using blast remotely, which is slower).
+## The NCBI nt databases can be downloaded from
+## ftp://ftp.ncbi.nlm.nih.gov/blast/db/ using the following command:
+
+# wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.*.tar.gz"
+# for a in nt.*.tar.gz; do tar xzf $a; done
+
+BLASTDB=~/scratch/blastdb
+
+## Location of NCBI tar gunzipped directory downloaded from
+## ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
+## Default: current directory
+
+TAXDUMP=.
+
+####==========================================================================
+#### STEP 1, run ABySS assembler, or provide assembly fasta file
+####==========================================================================
+
+## If providing own assembly fasta file, uncomment and insert filename,
+## and comment out the remaining commands in this section:
+
+#ASSEMBLY=assembly.fasta
+
+## If using ABySS, uncomment the commands below and insert your own read files
+## in the section 'lib="..."'.
+## See http://www.bcgsc.ca/downloads/abyss/doc/#assemblingapaired-endlibrary
+## for details. We also recommend running the
+## command below on a cluster if possible (with np= the number of cores
+## that you can request and j= the number of processors on a single machine;
+## e.g., if you have access to 6 nodes each with 8 processors, use np=48 j=8).
+## Even if you have only one machine with many cores, compile ABySS with
+## mpi support so that it can run many processes in parallel.
+
+NAME=pa${KMER}
+abyss-pe name=${NAME} k=${KMER} np=60 j=12 lib="lib1 lib2" lib1="ERR138445_1.mcf.fastq.gz ERR138445_2.mcf.fastq.gz" lib2="ERR138446_1.mcf.fastq.gz ERR138446_2.mcf.fastq.gz"
+
+## At the end of the abyss step, the assembled sequence will be in
+## ${NAME}-scaffolds.fa
+## ABySS does not automatically discard short contigs, so keep only contigs >=200 bp:
+
+fastaqual_select.pl -l 200 -f ${NAME}-scaffolds.fa >${NAME}-scaffolds.fa.l200 
+
+## Now, set the ASSEMBLY environment variable to that filename, so it can be used for subsequent steps
+
+ASSEMBLY=${NAME}-scaffolds.fa.l200
+
+####==========================================================================
+#### STEP 2, find best blast hits of a random sample of contigs
+####==========================================================================
+
+## the following command selects 10000 contigs at random:
+
+RND_ASSEMBLY=random10k.$ASSEMBLY
+fastaqual_select.pl -f $ASSEMBLY -s r -n 10000 >$RND_ASSEMBLY
+
+## Change $BLASTDB to match your own path location of the nt blast database
+## If running blast remotely on NCBI's servers, use -remote -db nt
+## instead of -db $BLASTDB/nt
+## -outfmt 6 qseqid staxids gives only a two-column output table with the
+## query id and the taxid of the best hit
+
+blastn -task megablast -query $RND_ASSEMBLY -db $BLASTDB/nt -evalue 1e-5 -num_threads $NUMPROC -max_target_seqs 1 -outfmt '6 qseqid staxids' -out $RND_ASSEMBLY.nt.1e-5.megablast
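+
+## The resulting table then has two tab-separated columns, the query id and the
+## taxid of the best hit, e.g. (illustrative): contig-123 <tab> 6239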
+
+####==========================================================================
+#### STEP 3, map reads back to assembly using bowtie2
+####==========================================================================
+
+## This step is not needed if you already have BAM files with the read alignments.
+## The BAM files should have interleaved reads (i.e. read 1 followed by read 2, ...)
+## even if the reads are mapped unpaired, because a later script will use that
+## ordering to pick out read pairs even if only one read matched a given contig.
+
+## By mapping the two libraries separately, we can check if the blobplots (TAGC plots)
+## are library specific, which is a useful QC feature
+
+bowtie2-build $ASSEMBLY $ASSEMBLY
+
+for LIBNAME in ERR138445 ERR138446
+do
+    bowtie2 -x $ASSEMBLY --very-fast-local -k 1 -t -p 12 --reorder --mm \
+        -U <(shuffleSequences_fastx.pl 4 <(zcat ${LIBNAME}_1.mcf.fastq.gz) <(zcat ${LIBNAME}_2.mcf.fastq.gz)) \
+        | samtools view -S -b -T $ASSEMBLY - >$ASSEMBLY.$LIBNAME.bowtie2.bam
+done
+
+# Note: shuffleSequences_fastx.pl works on uncompressed files, hence the need to zcat inline
+
+####==========================================================================
+#### STEP 4, combine BAM files and blast hits to create data file for plotting
+####==========================================================================
+
+## ABySS will create one or more BAM files. Replace *.bam with own BAM files
+## if you have aligned them separately without using ABySS
+
+gc_cov_annotate.pl \
+  --blasttaxid $RND_ASSEMBLY.nt.1e-5.megablast \
+  --assembly $ASSEMBLY --bam *.bam --out blobplot.txt \
+  --taxdump $TAXDUMP --taxlist species order phylum superkingdom
+
+# Note: change --taxdump . to point to the directory where you unpacked NCBI's taxdump.tar.gz
+# Note: change --taxlist to any combination of NCBI's taxonomy levels (lowercase) such as class genus family, etc.
+# The default is species order phylum superkingdom
+
+####==========================================================================
+#### STEP 5, create blobplots using R, or visualise using blobsplorer
+####==========================================================================
+
+## Tested with R 2.15.2 and ggplot2 0.9.3.1
+## If you used a different --out option for naming the output data file in
+## gc_cov_annotate.pl, use that in place of blobplot.txt below
+## 0.01 is the threshold - if fewer than this proportion of annotated contigs
+## have a particular annotation, they are not shown in the legend
+
+makeblobplot.R blobplot.txt 0.01 taxlevel_order
+
+## The output file blobplot.txt can also be loaded into Blobsplorer - see github.com/mojones/blobsplorer
+
diff --git a/blobologyMethodOverview.png b/blobologyMethodOverview.png
new file mode 100644
index 0000000..606794c
Binary files /dev/null and b/blobologyMethodOverview.png differ
diff --git a/bowtie2_extract_reads_mapped_to_specific_contigs.pl b/bowtie2_extract_reads_mapped_to_specific_contigs.pl
new file mode 100755
index 0000000..e3ae2b5
--- /dev/null
+++ b/bowtie2_extract_reads_mapped_to_specific_contigs.pl
@@ -0,0 +1,139 @@
+#!/usr/bin/env perl
+
+=head1 NAME
+
+bowtie2_extract_reads_mapped_to_specific_contigs.pl
+
+=head1 SYNOPSIS
+
+bowtie2_extract_reads_mapped_to_specific_contigs.pl -s samfile -id contigids1.txt -id contigids2.txt [-u]
+
+=head1 DESCRIPTION
+
+- Takes a .sam file as input (you can provide a bam file using BASH subprocesses or pipes)
+- Takes multiple files with contigids
+- Assumes the bowtie2 .bam/sam file has the full read stored inside it (i.e. uses soft clipping)
+- Assumes reads are interleaved/paired in the sam file
+- Reads are returned in pairs. If a /1 read maps to one file and /2 to another file, then both are put in /1's file
+- If the same contigid is present in two files, both mapped read output files will contain those reads
+  (i.e. reads will be duplicated, so it is the user's responsibility to ensure contig ids are not duplicated)
+- -u will also print unmapped.reads (reads that were either not mapped to ANY contig, or reads that were not mapped to
+  any of the contigid lists)
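+
+For example, a minimal sketch (streaming a BAM in via a bash process
+substitution, as mentioned above; file names are placeholders):
+
+    bowtie2_extract_reads_mapped_to_specific_contigs.pl \
+        -s <(samtools view mapping.bam) -id contigids1.txt -id contigids2.txt -u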
+
+Needs:
+
+=head1 AUTHORS
+
+sujai.kumar at ed.ac.uk 2012.04.28
+
+=cut
+
+use strict;
+use warnings;
+use Getopt::Long qw(:config pass_through no_ignore_case);
+use IO::File;
+
+my $samfile;
+my @contigidfiles;
+my $unmapped = "";
+my $output_prefix = "";
+
+GetOptions (
+    "samfile=s"    => \$samfile,
+    "idfiles=s{,}" => \@contigidfiles,
+    "unmapped"     => \$unmapped,
+    "out=s"        => \$output_prefix,
+);
+
+if (not $samfile or not @contigidfiles) {
+    print STDERR "bowtie2_extract_reads_mapped_to_specific_contigs.pl -s samfile -id contigids1.txt -id contigids2.txt [-u]\n";
+    exit;
+}
+
+#-------------------------------
+
+#-------------------------------
+# load each contigidfile (that we want to extract reads for) into memory
+# create a filehandle to write to for each contigidfile called output_prefix.contigidfile.readnums
+
+my %include_ids; # two level hash 
+my %mapped_reads_fh;
+
+for my $contigidfile (@contigidfiles) {
+    $mapped_reads_fh{$contigidfile} = IO::File->new;
+    open $mapped_reads_fh{$contigidfile}, "| gzip >$output_prefix$contigidfile.mapped.reads.fq.gz" or die $!;
+    open ID, "<$contigidfile" or die $!;
+    while (<ID>) {
+        /^>?(\S+)/ and $include_ids{$contigidfile}{$1} = 1;
+    }
+}
+$mapped_reads_fh{unmapped} = IO::File->new;
+if ($unmapped) {
+    open $mapped_reads_fh{unmapped}, "| gzip >${output_prefix}unmapped.reads.fq.gz" or die $!;
+}
+
+#-------------------------------
+# Now, go through each line of the sam file
+# and each time you see a contig id, check which of the contigidfiles it belongs in,
+# and print the read to that file
+
+open SAM, "<$samfile" or die $!;
+my ($fr, $fc, $rr, $rc); # forward read, forward contig, etc...
+while (<SAM>) {
+    next if /^@/ or /^#/;
+    # get forward read, forward contig, reverse read, reverse contig;
+    my @F=split(/\t/);
+    die "Does not seem to be a valid SAM file" unless $F[1] =~ /^\d+$/;
+    if ($F[1] & 16) {
+        $fr = "\@$F[0]\n" . &revcomp($F[9]) . "\n\+\n" . reverse($F[10]) . "\n";
+    }
+    else {
+        $fr = "\@$F[0]\n$F[9]\n\+\n$F[10]\n";
+    }
+    $fc = $F[2];
+    $_ = <SAM>;
+    @F=split(/\t/);
+    if ($F[1] & 16) {
+        $rr = "\@$F[0]\n" . &revcomp($F[9]) . "\n\+\n" . reverse($F[10]) . "\n";
+    }
+    else {
+        $rr = "\@$F[0]\n$F[9]\n\+\n$F[10]\n";
+    }
+    $rc = $F[2];
+    if ($fc eq "*" and $rc eq "*" and $unmapped) {
+        print {$mapped_reads_fh{unmapped}} "$fr$rr";
+        next;
+    }
+    $fc = $rc if $fc eq "*";
+    my $found = 0;
+    for my $contigidfile (@contigidfiles) {
+        if ($include_ids{$contigidfile}{$fc}) {
+            print {$mapped_reads_fh{$contigidfile}} "$fr$rr";
+            $found = 1;
+        }
+    }
+    if (not $found and $unmapped) {
+        print {$mapped_reads_fh{unmapped}} "$fr$rr";
+    }
+}
+close SAM;
+
+#############################################################################
+
+sub read_fh {
+    my $filename = shift @_;
+    my $filehandle;
+    if ($filename =~ /gz$/) {
+        open $filehandle, "gunzip -dc $filename |" or die $!;
+    }
+    else {
+        open $filehandle, "<$filename" or die $!;
+    }
+    return $filehandle;
+}
+
+sub revcomp {
+    $_ = shift @_;
+    tr/atgcATGC/tacgTACG/;
+    return scalar reverse $_;
+}
diff --git a/cleank61.blobplot.txt.taxlevel_order.01.png b/cleank61.blobplot.txt.taxlevel_order.01.png
new file mode 100644
index 0000000..9a5081c
Binary files /dev/null and b/cleank61.blobplot.txt.taxlevel_order.01.png differ
diff --git a/dev/README.md b/dev/README.md
new file mode 100644
index 0000000..da5cd42
--- /dev/null
+++ b/dev/README.md
@@ -0,0 +1,33 @@
+README.md of /blobology/dev
+
+---
+
+Input formats
+
+legacy format:
+
+| columns |    ID   |  len |   gc  |  cov1 |  cov2 |  cov3 |     taxlevel1    |    taxlevel2   |    taxlevel3    |       taxlevel4       |
+|:-------:|:-------:|:----:|:-----:|:-----:|:-----:|:-----:|:----------------:|:--------------:|:---------------:|:---------------------:|
+|   type  |   str   |  int | float | float | float | float |        str       |       str      |       str       |          str          |
+| example |    ID   |  len |   gc  |  cov1 |  cov2 |  cov3 | taxlevel_species | taxlevel_order | taxlevel_phylum | taxlevel_superkingdom |
+|         | contig1 | 1000 |  0.56 |  100  | 150.2 |  80.3 |    Danio rerio   |  Cypriniformes |     Chordata    |       Eukaryota       |
+|         | contig2 |  500 |  0.54 |  12.9 |  11.1 |  15.7 |   Not Annotated  |  Not Annotated |  Not Annotated  |     Not Annotated     |
+
+
+- eval format as in gc_cov_annotate_eval.pl --evalue (requires BLAST -outfmt '6 qseqid staxids std')
+
+| columns | ID      | len  | gc    | cov1  | cov2  | cov3  | taxlevel1        | taxlevel2      | taxlevel3       | taxlevel4             | eval       |
+|---------|---------|------|-------|-------|-------|-------|------------------|----------------|-----------------|-----------------------|------------|
+| type    | str     | int  | float | float | float | float | str              | str            | str             | str                   | scientific |
+| example | ID      | len  | gc    | cov1  | cov2  | cov3  | taxlevel_species | taxlevel_order | taxlevel_phylum | taxlevel_superkingdom | eval       |
+|         | contig1 | 1000 | 0.56  | 100   | 150.2 | 80.3  | Danio rerio      | Cypriniformes  | Chordata        | Eukaryota             | 1e-25      |
+|         | contig2 | 500  | 0.54  | 12.9  | 11.1  | 15.7  | Not Annotated    | Not Annotated  | Not Annotated   | Not Annotated         | N/A        |
+
+- novel format (to be implemented)
+
+| columns |    ID   |  len |   gc  |              cov              |                                            tax                                            | eval       |
+|:-------:|:-------:|:----:|:-----:|:-----------------------------:|:-----------------------------------------------------------------------------------------:|------------|
+|   type  |   str   |  int | float |             float             |                                            str                                            | scientific |
+| example |    ID   |  len |   gc  |              cov              |                                            tax                                            | eval       |
+|         | contig1 | 1000 |  0.56 | cov1=100;cov2=150.2;cov3=80.3 |       species=Danio rerio;order=Cypriniformes;phylum=Chordata;superkingdom=Eukaryota      | 1e-25      |
+|         | contig2 |  500 |  0.54 | cov1=12.9;cov2=11.1;cov3=15.7 | species=Not Annotated;order=Not Annotated;phylum=Not Annotated;superkingdom=Not Annotated | N/A        |
diff --git a/dev/gc_cov_annotate_eval.pl b/dev/gc_cov_annotate_eval.pl
new file mode 100755
index 0000000..2ef1819
--- /dev/null
+++ b/dev/gc_cov_annotate_eval.pl
@@ -0,0 +1,347 @@
+#!/usr/bin/perl
+
+=head1 NAME
+
+gc_cov_annotate_eval.pl
+
+=head1 SYNOPSIS
+
+gc_cov_annotate_eval.pl [--evalue] [--newformat] --blasttaxid CONTIGTAXIDFILE --assembly ASSEMBLYFASTAFILE [--taxdump TAXDUMPDIR] [--cas CASFILE...] [--bam BAMFILE...] [--cov COVFILE...]
+
+=head1 AUTHORS
+
+sujai.kumar at zoo.ox.ac.uk 2013.09.15
+dominik.laetsch at google.com 2014.05.08
+
+=cut
+
+use strict;
+use warnings;
+use Getopt::Long qw(:config pass_through no_ignore_case);
+
+my ($blasttaxid_file, $taxdump_dir, $assembly_file, $output_file, $evalue, $newformat) = ("",".","","", 0, 0);
+my @tax_list;
+my @cas_files;
+my @bam_files;
+my @cov_files;
+
+GetOptions (
+    "blasttaxid=s" => \$blasttaxid_file,
+    "newformat" => \$newformat, #DRL
+    "evalue"    => \$evalue, #DRL
+    "assembly=s"   => \$assembly_file,
+    "out:s"        => \$output_file,
+    "taxdump:s"    => \$taxdump_dir,
+    "cas:s{,}"     => \@cas_files,
+    "bam:s{,}"     => \@bam_files,
+    "cov:s{,}"     => \@cov_files,
+    "taxlist:s{,}" => \@tax_list,
+);
+
+if (not @tax_list) {@tax_list = ("species","order","phylum","superkingdom")};
+my %tax_levels;
+foreach (@tax_list) {$tax_levels{$_}=1}
+
+if (not $output_file) {$output_file = $assembly_file . ".txt"};
+
+print "Usage: gc_cov_annotate.pl --evalue --newformat --blasttaxid CONTIGTAXIDFILE --assembly ASSEMBLYFASTAFILE [--taxdump TAXDUMPDIR] [--cas BAMFILE...] [--cov COVFILES...] [--taxlist species...]\n" .
+    "--newformat : newformat switch. Prints output in new format'.\n" . 
+    "--evalue : evalue switch. Puts evalue from blast output in column before taxonomy. If evalue is specified, the script expects the blast result to be in the format '6 qseqid staxids std'.\n" . 
+    "--taxdump is '.' by default, i.e. the files nodes.dmp and names.dmp from the NCBI taxonomy database ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz are expected to be in the current directory\n" . 
+    "--taxlist: default is: species order phylum superkingdom, but you can add any other NCBI taxlevel such as class family suborder etc\n" unless
+    (-r $blasttaxid_file or $blasttaxid_file eq "-") and -r "$taxdump_dir/nodes.dmp" and -r "$taxdump_dir/names.dmp" and 
+    (-r $assembly_file or $assembly_file eq "-");
+
+die "Please specify --blasttaxid\n" unless (-r $blasttaxid_file or $blasttaxid_file eq "-");
+die "Please specify --taxdump\n" unless -r "$taxdump_dir/nodes.dmp" and -r "$taxdump_dir/names.dmp";
+die "Please specify --assembly\n" unless (-r $assembly_file or $assembly_file eq "-");
+
+#-----------------------------------------------------------------
+# Get taxon annotation info from contig-taxid file
+#-----------------------------------------------------------------
+
+my (%taxid_has_parent, %taxid_has_taxlevel, %taxid_has_name);
+if ($evalue){
+    print STDERR scalar localtime() . " - Start : Evalue will be parsed from blasttaxid (expects '6 qseqid staxids std') ...\n";    
+}
+else{
+    print STDERR scalar localtime() . " - Start : No Evalue parsing ...\n";       
+}
+print STDERR scalar localtime() . " - Loading $taxdump_dir/nodes.dmp and $taxdump_dir/names.dmp into memory ...\n";
+&load_nodes_names ("$taxdump_dir/nodes.dmp","$taxdump_dir/names.dmp");
+print STDERR scalar localtime() . " - Loading $taxdump_dir/nodes.dmp and $taxdump_dir/names.dmp into memory ... DONE\n";
+
+my $blasttaxid_fh = &read_fh($blasttaxid_file);
+my %contig_taxinfo;
+my %contig_evalinfo; # DRL
+while (<$blasttaxid_fh>) {
+    if ($evalue){ # DRL
+        die "Contig-taxid file $blasttaxid_file does not seem to have the blast output format '6 qseqid staxids std'" unless 
+            /^(\S+)\t(\d+);*\d*\t\S+\t\S+\t\d+\t\d+\t\d+\t\d+\t\d+\t\d+\t\d+\t(\S+)/;
+        my $contig = $1;
+        my $taxid = $2;
+        my $eval = $3;
+        if (exists $contig_taxinfo{$contig}){
+            next;
+        }
+        else{
+            $contig_taxinfo{$contig} = &taxonomy_report($taxid);
+            $contig_evalinfo{$contig} = $eval;
+        }
+    }
+    else{
+        die "Contig-taxid file $blasttaxid_file does not seem to have the blast output format '6 qseqid staxids ...'" unless 
+                /^(\S+)\t(\d+)/;  
+        $contig_taxinfo{$1} = &taxonomy_report($2);
+    }
+} 
+close $blasttaxid_fh;
+
+#-----------------------------------------------------------------
+# Calculate GC, len, for assembly file, get coverage from casfiles
+#-----------------------------------------------------------------
+
+# load assembly file into memory:
+print STDERR scalar localtime() . " - Loading assembly fasta file $assembly_file into memory ...\n";
+my $fastahash = &fastafile2hash ($assembly_file);
+print STDERR scalar localtime() . " - Loading assembly fasta file $assembly_file into memory ... DONE\n";
+
+# calculate coverage for each cas file
+for my $cas_file (@cas_files) {
+    my $reads_processed = 0;
+    my ($refstart, $refend, $readid, @F); # declaring here to avoid having to declare in each instance of loop
+    open CAS, "clc_mapping_table -n $cas_file |" or die $!; 
+    print STDERR scalar localtime() . " - Reading $cas_file ...\n";
+    while (<CAS>) {
+        # ignore comment lines/ CAS headers
+        next if /^\s*$/ or /^@/ or /^#/;
+        chomp;
+        @F=split/\t/;
+        if (($F[9] & 4) != 4) { # DRL
+            $refstart =  $F[6];
+            $refend   =  $F[7] - 1;
+            while ($F[5] =~ /(\d+)[MDN=X]/g) { $refend += $1 } # M/D/N/=/X CIGAR operations consume reference bases (I/S/H/P do not)
+            die "ContigID $F[2] in casfile $cas_file, but not in assembly file $assembly_file\n" if not exists $$fastahash{$F[2]};
+            $$fastahash{$F[2]}{$cas_file} += ($refend - $refstart + 1)/$$fastahash{$F[2]}{len};
+        }
+        $reads_processed++;
+        print STDERR "Processed $reads_processed reads by " . localtime() . "...\n" if $reads_processed % 1000000 == 0;
+    }
+    print STDERR scalar localtime() . " - Reading $cas_file ... DONE\n";
+}
+
+# calculate coverage for each bam file
+for my $bam_file (@bam_files) {
+    my $reads_processed = 0;
+    my ($refstart, $refend, $readid, @F); # declaring here to avoid having to declare in each instance of loop
+    open SAM, "samtools view $bam_file |" or die $!; 
+    print STDERR scalar localtime() . " - Reading $bam_file ...\n";
+    while (<SAM>) {
+        # ignore comment lines/ SAM headers
+        next if /^\s*$/ or /^@/ or /^#/;
+        chomp;
+        @F=split/\t/;
+        if (($F[1] & 4) != 4) {
+            $refstart =  $F[3];
+            $refend   =  $F[3] - 1;
+            while ($F[5] =~ /(\d+)[MDN=X]/g) { $refend += $1 } # M/D/N/=/X CIGAR operations consume reference bases (I/S/H/P do not)
+            die "ContigID $F[2] in bamfile $bam_file, but not in assembly file $assembly_file\n" if not exists $$fastahash{$F[2]};
+            $$fastahash{$F[2]}{$bam_file} += ($refend - $refstart + 1)/$$fastahash{$F[2]}{len};
+        }
+        $reads_processed++;
+        print STDERR "Processed $reads_processed reads by " . localtime() . "...\n" if $reads_processed % 1000000 == 0;
+    }
+    print STDERR scalar localtime() . " - Reading $bam_file ... DONE\n";
+}
+# calculate coverage for each cov file (two col file with seqid in first col, average read coverage depth in second col)
+for my $cov_file (@cov_files) {
+    open COV, $cov_file or die $!;
+    print STDERR scalar localtime() . " - Reading $cov_file ...\n";
+    while (<COV>) {
+        /^(\S+)\s+(\S+)/ and $$fastahash{$1}{$cov_file} = $2;
+    }
+    print STDERR scalar localtime() . " - Reading $cov_file ... DONE\n";
+}
+
+# print len gc cov (one for each cas, and each cov) and taxon annotation info in one large file (that can be used for plotting by R)
+print STDERR scalar localtime() . " - Making len gc cov taxon annotation data file $output_file for plotting ...\n";
+open  LENCOVGC, ">$output_file" or die $!;
+
+# for each contig/seqid:
+my ($seqid, $length, $gccount, $nonatgc, $totalcov, $cov, $tax); # declared outside loop for efficiency
+if ($newformat){
+    # header row:
+    print LENCOVGC "id\tlen\tgc";
+    print LENCOVGC "\tcov";
+    #foreach (@cas_files) {print LENCOVGC "\tcov_$_"};
+    print LENCOVGC "\ttax";
+    #foreach (@tax_list)  {print LENCOVGC "\ttaxlevel_$_"};
+    if ($evalue){# DRL
+        print LENCOVGC "\teval"; 
+    }
+    print LENCOVGC "\n";
+    for $seqid (keys %{$fastahash}) { 
+        $length   = $$fastahash{$seqid}{len};
+        $gccount  = $$fastahash{$seqid}{gc};
+        $nonatgc  = $$fastahash{$seqid}{nonatgc};
+        print LENCOVGC "$seqid\t$length\t". ($gccount/($length-$nonatgc))."\t";
+        for my $cas_file (@cas_files) {
+            $cov = (exists($$fastahash{$seqid}{$cas_file}) ? $$fastahash{$seqid}{$cas_file} : 0);
+            print LENCOVGC $cas_file."=".$cov.";";
+        }
+        for my $cov_file (@cov_files) {
+            $cov = (exists($$fastahash{$seqid}{$cov_file}) ? $$fastahash{$seqid}{$cov_file} : 0);
+            print LENCOVGC $cov_file."=".$cov.";";
+        }
+        for my $bam_file (@bam_files) {
+            $cov = (exists($$fastahash{$seqid}{$bam_file}) ? $$fastahash{$seqid}{$bam_file} : 0);
+            print LENCOVGC $bam_file."=".$cov.";";
+        }
+        print LENCOVGC "\t";
+        for my $tax_level (@tax_list) {
+            $tax = (exists(${$contig_taxinfo{$seqid}}{$tax_level}) ? ${$contig_taxinfo{$seqid}}{$tax_level} : "Not annotated");
+            $tax =~ s/;|=|#//g;
+            print LENCOVGC $tax_level."=".$tax.";";
+        }
+        print LENCOVGC "\t";
+        if ($evalue){ # DRL
+            print LENCOVGC (exists($contig_evalinfo{$seqid}) ? $contig_evalinfo{$seqid} : "N/A"); 
+        }
+        print LENCOVGC "\n";
+    }
+}
+else{
+    # header row:
+    print LENCOVGC "seqid\tlen\tgc";
+    foreach (@cas_files) {print LENCOVGC "\tcov_$_"};
+    foreach (@cov_files) {print LENCOVGC "\tcov_$_"};
+    foreach (@bam_files) {print LENCOVGC "\tcov_$_"};
+    foreach (@tax_list)  {print LENCOVGC "\ttaxlevel_$_"};
+    if ($evalue){# DRL
+        print LENCOVGC "\teval"; 
+    }
+    print LENCOVGC "\n";
+    for $seqid (keys %{$fastahash}) { 
+        $length   = $$fastahash{$seqid}{len};
+        $gccount  = $$fastahash{$seqid}{gc};
+        $nonatgc  = $$fastahash{$seqid}{nonatgc};
+        print LENCOVGC "$seqid\t$length\t". ($gccount/($length-$nonatgc));
+        for my $cas_file (@cas_files) {
+            $cov = (exists($$fastahash{$seqid}{$cas_file}) ? $$fastahash{$seqid}{$cas_file} : 0);
+            print LENCOVGC "\t" . $cov;
+        }
+        for my $cov_file (@cov_files) {
+            $cov = (exists($$fastahash{$seqid}{$cov_file}) ? $$fastahash{$seqid}{$cov_file} : 0);
+            print LENCOVGC "\t" . $cov;
+        }
+        for my $bam_file (@bam_files) {
+            $cov = (exists($$fastahash{$seqid}{$bam_file}) ? $$fastahash{$seqid}{$bam_file} : 0);
+            print LENCOVGC "\t" . $cov;
+        }
+        #print LENCOVGC "\t" . $cov;
+        for my $tax_level (@tax_list) {
+            #print LENCOVGC "\t" . (exists(${$contig_taxinfo{$seqid}}{$tax_level}) ? ${$contig_taxinfo{$seqid}}{$tax_level} : "Not annotated"); 
+            $tax = (exists(${$contig_taxinfo{$seqid}}{$tax_level}) ? ${$contig_taxinfo{$seqid}}{$tax_level} : "Not annotated");
+            $tax =~ s/#//g;
+            print LENCOVGC "\t".$tax;
+        }
+        if ($evalue){ # DRL
+            print LENCOVGC "\t" . (exists($contig_evalinfo{$seqid}) ? $contig_evalinfo{$seqid} : "N/A"); 
+        }
+        print LENCOVGC "\n";
+    }
+}
+print STDERR scalar localtime() . " - Making len gc cov taxon annotation data file $output_file for plotting ... DONE\n";
+
+############################################################
+
+sub taxonomy_report {
+    my $hit_taxid = shift @_;
+    my @parents = &get_parents($hit_taxid);
+    # convert @parents to tax names:
+    my %taxinfo;
+    # my $taxonomy_report_string = "";
+    for my $parent (@parents) {
+        if (exists $taxid_has_taxlevel{$parent} and exists $tax_levels{$taxid_has_taxlevel{$parent}}) {
+            $taxinfo{$taxid_has_taxlevel{$parent}} = $taxid_has_name{$parent};
+        }
+    }
+    return \%taxinfo;
+    # for my $tax_level (keys %hit_counts) {
+        # for my $tax_name (keys %{$hit_counts{$tax_level}}) {
+            # $taxonomy_report_string .= "$tax_level\t$tax_name\t";
+        # }
+    # }
+    # return $taxonomy_report_string . "\n";
+}
+
+############################################################
+
+sub get_parents {
+    my @all = @_;
+    my $current_id = $all[0];
+    if (exists $taxid_has_parent{$current_id} and $current_id ne $taxid_has_parent{$current_id}) {
+        unshift @all, $taxid_has_parent{$current_id};
+        @all = &get_parents(@all);
+    }
+    return @all;
+}
+
+############################################################
+
+sub load_nodes_names {
+    my $fh;
+    my $nodesfile = shift @_;
+    my $namesfile = shift @_;
+    $fh = &read_fh($nodesfile);
+    while (my $line = <$fh>) {
+        # line in nodes.dmp should match the regexp below.
+        # Change the regexp if NCBI changes their file format
+        next if $line !~ /^(\d+)\s*\|\s*(\d+)\s*\|\s*(.+?)\s*\|/;
+        $taxid_has_parent{$1} = $2;
+        $taxid_has_taxlevel{$1} = $3;
+    }
+    close $fh;
+    
+    $fh = &read_fh($namesfile);
+    while (my $line = <$fh>) {
+        next unless $line =~ /^(\d+)\s*\|\s*(.+?)\s*\|.+scientific name/;
+        $taxid_has_name{$1} = $2;
+    }
+}
+
+############################################################
+
+sub fastafile2hash {
+    my $fastafile = shift @_;
+    my %sequences;
+    my $fh = &read_fh($fastafile);
+    my $seqid;
+    while (my $line = <$fh>) {
+        if ($line =~ /^>(\S+)(.*)/) {
+            $seqid = $1;
+            $sequences{$seqid}{desc} = $2;
+        }
+        else {
+            chomp $line;
+            $sequences{$seqid}{seq}     .= $line;
+            $sequences{$seqid}{len}     += length $line;
+            $sequences{$seqid}{gc}      += ($line =~ tr/gcGC/gcGC/);
+            $line =~ s/[^atgc]/N/ig;
+            $sequences{$seqid}{nonatgc} += ($line =~ tr/N/N/);
+        }
+    }
+    close $fh;
+    return \%sequences;
+}
+
+############################################################
+
+sub read_fh {
+    my $filename = shift @_;
+    my $filehandle;
+    if ($filename =~ /gz$/) {
+        open $filehandle, "gunzip -dc $filename |" or die $!;
+    }
+    else {
+        open $filehandle, "<$filename" or die $!;
+    }
+    return $filehandle;
+}
+
+############################################################
+
diff --git a/fastaqual_select.pl b/fastaqual_select.pl
new file mode 100755
index 0000000..4d32282
--- /dev/null
+++ b/fastaqual_select.pl
@@ -0,0 +1,189 @@
+#!/usr/bin/perl
+
+use strict;
+use warnings;
+use Getopt::Long qw(:config pass_through no_ignore_case no_auto_abbrev);
+
+my ($fastafile,$sort,$numfasta,$length,$prefix,$regexp,$headerfile,$includefile,$excludefile,$delimiter,$case,$interval,$rename) = ("","S","","","","","","",""," ","N","","");
+GetOptions (
+  "f|fastafile:s" => \$fastafile,
+  "s|sort:s" => \$sort,
+  "n|numfasta:i" => \$numfasta,
+  "l|length:i" => \$length,
+  "prefix:s" => \$prefix,
+  "regexp:s" => \$regexp,
+  "i|inc|include|includefile:s" => \$includefile,
+  "e|ex|exc|exclude|excludefile:s" => \$excludefile,
+  "d|delimiter:s" => \$delimiter,
+  "c|case:s" => \$case,
+  "int|interval" => \$interval,
+  "rename:s" => \$rename,
+);
+
+if (not $fastafile) {
+    print STDERR "Usage:
+  -f|fastafile <fastafile> MANDATORY
+  -s|sort [S|R|L|I|A]      default S (same as input order); R random; L length (longest first); A alphabetical; I include-file order
+  -n|numfasta <number>     number of fasta sequences (default all)
+  -l|length <number>       minimum length of fasta sequence to select (default 0)
+  -prefix <string>         prefix to add before sequence id (default none)
+  -regexp <string>         only select those sequences that match this regexp (seq id or sequence)
+  -i|inc|include <file>    file with list of seqids to include/select - one per line
+  -e|ex|exc|exclude <file> file with list of seqids to exclude - one per line
+  -d|delimiter <char>      character for delimiting seqid start end, in case of intervals
+  -c|case [U|L|I]          convert to upper or lower case, or I to invert case
+  -int|interval            use if the include file is specified as seqid start end (i.e. if you want to extract intervals)
+  -rename <string>         use to rename each sequence with a five digit suffix. eg, -rename LSIG will create >LSIG00001 >LSIG00002 etc
+";
+   exit;
+}
+
+$case = uc(substr($case,0,1)) if $case;
+
+my (%include_headers, @include_headers_ordering, %exclude_headers);
+if ($includefile)
+{
+    open FILE,"<$includefile" or die "Couldn't open includes file $includefile\n";
+    while (<FILE>)
+    {
+        chomp;
+        if ($interval and (/^>?(\S+)\s(\d+)\s(\d+)\b/ or /^>?(\S+)_(\d+)_(\d+)\b/))
+        {
+            push @{$include_headers{$1}}, [$2,$3];
+            push @include_headers_ordering, $1;
+        } elsif (/^>?(\S+)/)
+        {
+            $include_headers{$1}=1;
+            push @include_headers_ordering, $1;
+        }
+    }
+    close FILE;
+}
+if ($excludefile)
+{
+    open FILE,"<$excludefile" or die "Couldn't open excludes file $excludefile\n";
+    while (<FILE>)
+    {
+        if (/^>?(\S+)/)
+        {
+            $exclude_headers{$1}=1
+        }
+    }
+    close FILE;
+}
+
+my $seqs = &fastafile2hash($fastafile,$case,$sort);
+my @sortkeys;
+
+if (uc($sort) =~ /^[RS]$/) {
+    @sortkeys = sort {$$seqs{$a}{order} <=> $$seqs{$b}{order}} keys %{$seqs};
+} elsif (uc($sort) eq "A") {
+    @sortkeys = sort keys %{$seqs};
+} elsif (uc($sort) eq "I") {
+    @sortkeys = @include_headers_ordering;
+} else {
+    @sortkeys = sort {length($$seqs{$b}{seq}) <=> length($$seqs{$a}{seq})} keys %{$seqs};
+}
+
+my $num_printed = 0;
+my $toprint_index = 1;
+for my $chrontig (@sortkeys)
+{
+    last if $numfasta and $numfasta == $num_printed;
+    if ($$seqs{$chrontig}{seq}=~ /\d+/) {
+        # it's a qual file
+        my @qual = split(/\s+/,$$seqs{$chrontig}{seq});
+        next if $length and scalar @qual < $length;
+    }
+    else {
+        next if $length and length($$seqs{$chrontig}{seq}) < $length;
+    }   
+    next if $excludefile and exists $exclude_headers{$chrontig};
+    next if $includefile and not exists $include_headers{$chrontig};
+    my $toprint;
+    if (%include_headers and ref($include_headers{$chrontig}) eq "ARRAY")
+    {
+        for my $interval (@{$include_headers{$chrontig}})
+        {
+            $toprint  = ">$prefix$chrontig";
+            $toprint  = ">$rename" . sprintf("%05d",$toprint_index) if $rename;
+            $toprint_index++;
+            my ($start, $stop) = @{$interval};
+            $toprint .= "$delimiter$start$delimiter$stop";
+            $toprint .= " $$seqs{$chrontig}{desc}" if $$seqs{$chrontig}{desc};
+            my $seq = substr($$seqs{$chrontig}{seq}, $start - 1, $stop - $start + 1);
+            if ($start > $stop) {
+                $seq = substr($$seqs{$chrontig}{seq}, $stop - 1, $start - $stop + 1);
+                $seq =~ tr/ACGTacgt/TGCAtgca/;
+                $seq = (scalar reverse $seq);
+            }
+            $toprint .= "\n$seq\n";
+            next if $regexp and $toprint !~ /$regexp/;
+            print $toprint;
+            $num_printed++;
+        }
+    }
+    else 
+    {
+        $toprint  = ">$prefix$chrontig";
+        $toprint  = ">$rename" . sprintf("%05d",$toprint_index) if $rename;
+        $toprint_index++;
+        $toprint .= " $$seqs{$chrontig}{desc}" if $$seqs{$chrontig}{desc};
+        $$seqs{$chrontig}{seq} = lc($$seqs{$chrontig}{seq}) if $case eq "L";
+        $$seqs{$chrontig}{seq} = uc($$seqs{$chrontig}{seq}) if $case eq "U";
+        $$seqs{$chrontig}{seq} =~ tr/[A-Z][a-z]/[a-z][A-Z]/ if $case eq "I";
+        $toprint .= "\n$$seqs{$chrontig}{seq}\n";
+        next if $regexp and $toprint !~ /$regexp/;
+        print $toprint;
+        $num_printed++;
+    }
+}
+
+#############################################################################
+
+sub fastafile2hash
+{
+	my $fastafile  = shift @_;
+	my $changecase = "N";
+	my $order      = "S"; # S = same as input, or R = random
+	$changecase    = substr(uc(shift @_),0,1) if @_;
+	$order         = substr(uc(shift @_),0,1) if @_;
+	my %sequences;
+	my $fh = &read_fh($fastafile);
+	my $seqid;
+	my $seq_counter;
+	while (<$fh>)
+	{
+		if (/^>(\S+)(.*)/) {
+		    $seqid = $1;
+		    $sequences{$seqid}{desc} = $2;
+		    $sequences{$seqid}{order} = $order eq "S" ? $seq_counter++ : rand;
+		}
+		else {
+		    if (/\d/) {
+		        chomp($sequences{$seqid}{seq} .= " $_"); # add space to sep qual values
+		        $sequences{$seqid}{seq} =~ s/^\s+//;
+		        $sequences{$seqid}{seq} =~ s/\s+$//;
+		        next;
+		    }
+		    chomp($sequences{$seqid}{seq} .= lc($_)) if $changecase eq "L";
+		    chomp($sequences{$seqid}{seq} .= uc($_)) if $changecase eq "U";
+		    chomp($sequences{$seqid}{seq} .= $_    ) if $changecase eq "N";
+		}
+	}
+	return \%sequences;
+}
+
+#############################################################################
+
+sub read_fh {
+    my $filename = shift @_;
+    my $filehandle;
+    if ($filename =~ /gz$/) {
+        open $filehandle, "gunzip -dc $filename |" or die $!;
+    }
+    else {
+        open $filehandle, "<$filename" or die $!;
+    }
+    return $filehandle;
+}
diff --git a/fgrep.pl b/fgrep.pl
new file mode 100755
index 0000000..aa4f422
--- /dev/null
+++ b/fgrep.pl
@@ -0,0 +1,37 @@
+#!/usr/bin/perl
+
+use strict;
+use warnings;
+use Getopt::Long qw(:config pass_through no_ignore_case);
+
+my ($file, $remove, $delim) = ("", 0, "\t");
+GetOptions (
+  "file:s" => \$file,
+  "v|remove" => \$remove,
+  "delim:s" => \$delim,
+);
+
+die "Usage: fgrep.pl [-v] [-d <delimiter] -f <filewithpatterns> <file to be searched>\n" unless $file;
+
+open FILE, "<$file" or die "Could not open file with patterns\n";
+my %hash;
+while (<FILE>)
+{
+	chomp;
+	$hash{$_}++
+}
+while (my $line = <>)
+{
+	chomp $line;
+	my $found = 0;
+	foreach (split(/$delim/,$line))
+	{
+		 if (exists $hash{$_})
+		 {
+		 	$found = 1;
+		 	last
+		 }
+	}
+	print "$line\n" if $found and not $remove;
+	print "$line\n" if not $found and $remove;
+}
diff --git a/gc_cov_annotate.pl b/gc_cov_annotate.pl
new file mode 100755
index 0000000..026373c
--- /dev/null
+++ b/gc_cov_annotate.pl
@@ -0,0 +1,236 @@
+#!/usr/bin/perl
+
+=head1 NAME
+
+gc_cov_annotate.pl
+
+=head1 SYNOPSIS
+
+gc_cov_annotate.pl --blasttaxid CONTIGTAXIDFILE --assembly ASSEMBLYFASTAFILE [--taxdump TAXDUMPDIR] [--bam BAMFILE...] [--cov COVFILE...] [--taxlist TAXLEVEL...]
+
+=head1 AUTHORS
+
+sujai.kumar at zoo.ox.ac.uk 2013.09.15
+
+=cut
+
+use strict;
+use warnings;
+use Getopt::Long qw(:config pass_through no_ignore_case);
+
+my ($blasttaxid_file, $taxdump_dir, $assembly_file, $output_file) = ("",".","","");
+my @tax_list;
+my @bam_files;
+my @cov_files;
+
+GetOptions (
+    "blasttaxid=s" => \$blasttaxid_file,
+    "assembly=s"   => \$assembly_file,
+    "out:s"        => \$output_file,
+    "taxdump:s"    => \$taxdump_dir,
+    "bam:s{,}"     => \@bam_files,
+    "cov:s{,}"     => \@cov_files,
+    "taxlist:s{,}" => \@tax_list,
+);
+
+if (not @tax_list) {@tax_list = ("species","order","phylum","superkingdom")};
+my %tax_levels;
+foreach (@tax_list) {$tax_levels{$_}=1}
+
+if (not $output_file) {$output_file = $assembly_file . ".txt"};
+
+print "Usage: gc_cov_annotate.pl --blasttaxid CONTIGTAXIDFILE --assembly ASSEMBLYFASTAFILE [--taxdump TAXDUMPDIR] [--bam BAMFILE...] [--cov COVFILES...] [--taxlist species...]\n" .
+    "--taxdump is '.' by default, i.e. the files nodes.dmp and names.dmp from the NCBI taxonomy database ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz are expected to be in the current directory\n" . 
+    "--taxlist: default is: species order phylum superkingdom, but you can add any other NCBI taxlevel such as class family suborder etc\n" unless
+    (-r $blasttaxid_file or $blasttaxid_file eq "-") and -r "$taxdump_dir/nodes.dmp" and -r "$taxdump_dir/names.dmp" and 
+    (-r $assembly_file or $assembly_file eq "-");
+
+die "Please specify --blasttaxid\n" unless (-r $blasttaxid_file or $blasttaxid_file eq "-");
+die "Please specify --taxdump\n" unless -r "$taxdump_dir/nodes.dmp" and -r "$taxdump_dir/names.dmp";
+die "Please specify --assembly\n" unless (-r $assembly_file or $assembly_file eq "-");
+
+#-----------------------------------------------------------------
+# Get taxon annotation info from contig-taxid file
+#-----------------------------------------------------------------
+
+my (%taxid_has_parent, %taxid_has_taxlevel, %taxid_has_name);
+print STDERR scalar localtime() . " - Loading $taxdump_dir/nodes.dmp and $taxdump_dir/names.dmp into memory ...\n";
+&load_nodes_names ("$taxdump_dir/nodes.dmp","$taxdump_dir/names.dmp");
+print STDERR scalar localtime() . " - Loading $taxdump_dir/nodes.dmp and $taxdump_dir/names.dmp into memory ... DONE\n";
+
+my $blasttaxid_fh = &read_fh($blasttaxid_file);
+my %contig_taxinfo;
+while (<$blasttaxid_fh>) {
+    die "Contig-taxid file $blasttaxid_file does not seem to have two cols with the seqid in the first col and taxid in the second col" unless 
+        /^(\S+)\t(\d+)/;
+    $contig_taxinfo{$1} = &taxonomy_report($2);
+}
+close $blasttaxid_fh;
+
+#-----------------------------------------------------------------
+# Calculate GC, len, for assembly file, get coverage from bamfiles
+#-----------------------------------------------------------------
+
+# load assembly file into memory:
+print STDERR scalar localtime() . " - Loading assembly fasta file $assembly_file into memory ...\n";
+my $fastahash = &fastafile2hash ($assembly_file);
+print STDERR scalar localtime() . " - Loading assembly fasta file $assembly_file into memory ... DONE\n";
+
+# calculate coverage for each bam file
+for my $bam_file (@bam_files) {
+    my $reads_processed = 0;
+    my ($refstart, $refend, $readid, @F); # declaring here to avoid having to declare in each instance of loop
+    open SAM, "samtools view $bam_file |" or die $!; 
+    print STDERR scalar localtime() . " - Reading $bam_file ...\n";
+    while (<SAM>) {
+        # ignore comment lines/ SAM headers
+        next if /^\s*$/ or /^@/ or /^#/;
+        chomp;
+        @F=split/\t/;
+        if (($F[1] & 4) != 4) {
+            $refstart =  $F[3];
+            $refend   =  $F[3] - 1;
+            while ($F[5] =~ /(\d+)[MDN=X]/g) { $refend += $1 } # M/D/N/=/X CIGAR operations consume reference bases (I/S/H/P do not)
+            die "ContigID $F[2] in bamfile $bam_file, but not in assembly file $assembly_file\n" if not exists $$fastahash{$F[2]};
+            $$fastahash{$F[2]}{$bam_file} += ($refend - $refstart + 1)/$$fastahash{$F[2]}{len};
+        }
+        $reads_processed++;
+        print STDERR "Processed $reads_processed reads by " . localtime() . "...\n" if $reads_processed % 1000000 == 0;
+    }
+    print STDERR scalar localtime() . " - Reading $bam_file ... DONE\n";
+}
+
+# calculate coverage for each cov file (two col file with seqid in first col, average read coverage depth in second col)
+for my $cov_file (@cov_files) {
+    open COV, $cov_file or die $!;
+    print STDERR scalar localtime() . " - Reading $cov_file ...\n";
+    while (<COV>) {
+        /^(\S+)\s+(\S+)/ and $$fastahash{$1}{$cov_file} = $2;
+    }
+    print STDERR scalar localtime() . " - Reading $cov_file ... DONE\n";
+}
+
+# print len gc cov (one for each bam, and each cov) and taxon annotation info in one large file (that can be used for plotting by R)
+print STDERR scalar localtime() . " - Making len gc cov taxon annotation data file $output_file for R ...\n";
+open  LENCOVGC, ">$output_file" or die $!;
+
+# header row:
+print LENCOVGC "seqid\tlen\tgc";
+foreach (@bam_files) {print LENCOVGC "\tcov_$_"};
+foreach (@cov_files) {print LENCOVGC "\tcov_$_"};
+foreach (@tax_list)  {print LENCOVGC "\ttaxlevel_$_"};
+print LENCOVGC "\n";
+
+# for each contig/seqid:
+my ($seqid, $length, $gccount, $nonatgc, $totalcov, $cov); # declared outside loop for efficiency
+for $seqid (keys %{$fastahash}) { 
+    $length   = $$fastahash{$seqid}{len};
+    $gccount  = $$fastahash{$seqid}{gc};
+    $nonatgc  = $$fastahash{$seqid}{nonatgc};
+    print LENCOVGC "$seqid\t$length\t". ($gccount/($length-$nonatgc));
+    for my $bam_file (@bam_files) {
+        $cov = (exists($$fastahash{$seqid}{$bam_file}) ? $$fastahash{$seqid}{$bam_file} : 0);
+        print LENCOVGC "\t" . $cov;
+    }
+    for my $cov_file (@cov_files) {
+        $cov = (exists($$fastahash{$seqid}{$cov_file}) ? $$fastahash{$seqid}{$cov_file} : 0);
+        print LENCOVGC "\t" . $cov;
+    }
+    for my $tax_level (@tax_list) {
+        print LENCOVGC "\t" . (exists(${$contig_taxinfo{$seqid}}{$tax_level}) ? ${$contig_taxinfo{$seqid}}{$tax_level} : "Not annotated"); 
+    }
+    print LENCOVGC "\n";
+}
+print STDERR scalar localtime() . " - Making len gc cov taxon annotation data file $output_file for R ... DONE\n";
+
+############################################################
+
+sub taxonomy_report {
+    my $hit_taxid = shift @_;
+    my @parents = &get_parents($hit_taxid);
+    # convert @parents to tax names:
+    my %taxinfo;
+    # my $taxonomy_report_string = "";
+    for my $parent (@parents) {
+        if (exists $taxid_has_taxlevel{$parent} and exists $tax_levels{$taxid_has_taxlevel{$parent}}) {
+            $taxinfo{$taxid_has_taxlevel{$parent}} = $taxid_has_name{$parent};
+        }
+    }
+    return \%taxinfo;
+    # for my $tax_level (keys %hit_counts) {
+        # for my $tax_name (keys %{$hit_counts{$tax_level}}) {
+            # $taxonomy_report_string .= "$tax_level\t$tax_name\t";
+        # }
+    # }
+    # return $taxonomy_report_string . "\n";
+}
+
+############################################################
+
+sub get_parents {
+    my @all = @_;
+    my $current_id = $all[0];
+    if (exists $taxid_has_parent{$current_id} and $current_id ne $taxid_has_parent{$current_id}) {
+        unshift @all, $taxid_has_parent{$current_id};
+        @all = &get_parents(@all);
+    }
+    return @all;
+}
+
+############################################################
+
+sub load_nodes_names {
+    my $fh;
+    my $nodesfile = shift @_;
+    my $namesfile = shift @_;
+    $fh = &read_fh($nodesfile);
+    while (my $line = <$fh>) {
+        # line in nodes.dmp should match the regexp below.
+        # Change the regexp if NCBI changes their file format
+        next if $line !~ /^(\d+)\s*\|\s*(\d+)\s*\|\s*(.+?)\s*\|/;
+        $taxid_has_parent{$1} = $2;
+        $taxid_has_taxlevel{$1} = $3;
+    }
+    close $fh;
+    
+    $fh = &read_fh($namesfile);
+    while (my $line = <$fh>) {
+        next unless $line =~ /^(\d+)\s*\|\s*(.+?)\s*\|.+scientific name/;
+        $taxid_has_name{$1} = $2;
+    }
+}
+
+############################################################
+
+sub fastafile2hash {
+    my $fastafile = shift @_;
+    my %sequences;
+    my $fh = &read_fh($fastafile);
+    my $seqid;
+    while (my $line = <$fh>) {
+        if ($line =~ /^>(\S+)(.*)/) {
+            $seqid = $1;
+            $sequences{$seqid}{desc} = $2;
+        }
+        else {
+            chomp $line;
+            $sequences{$seqid}{seq}     .= $line;
+            $sequences{$seqid}{len}     += length $line;
+            $sequences{$seqid}{gc}      += ($line =~ tr/gcGC/gcGC/);
+            $line =~ s/[^atgc]/N/ig;
+            $sequences{$seqid}{nonatgc} += ($line =~ tr/N/N/);
+        }
+    }
+    close $fh;
+    return \%sequences;
+}
+
+############################################################
+
+sub read_fh {
+    my $filename = shift @_;
+    my $filehandle;
+    if ($filename =~ /\.gz$/) {
+        open $filehandle, "gunzip -c $filename |" or die $!;
+    }
+    else {
+        open $filehandle, "<$filename" or die $!;
+    }
+    return $filehandle;
+}
+
+############################################################
diff --git a/makeblobplot.R b/makeblobplot.R
new file mode 100755
index 0000000..3132583
--- /dev/null
+++ b/makeblobplot.R
@@ -0,0 +1,86 @@
+#!/usr/bin/env Rscript
+
+# 2013-09-16
+# works with R 2.15.2 and ggplot 0.9.3.1
+# Check ggplot2 help forums or contact sujai.kumar at gmail.com if something doesn't run
+# because of updated programs/packages
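+#
+# Usage (a sketch, matching the commandArgs() parsing below):
+#   Rscript makeblobplot.R <blobplot.txt> <ignore_below_proportion> <taxlevel_column>
+#   e.g. Rscript makeblobplot.R blobplot.txt 0.01 taxlevel_order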
+
+# Function to relabel taxa whose frequency among annotated contigs falls below a given proportion as "Not annotated":
+
+clean.blobs<-function(d,threshold,taxlevel) {
+    annotated<-d[d[,taxlevel]!="Not annotated",]
+    total<-dim(annotated)[1]
+    levels(d[,taxlevel])[which(table(d[,taxlevel])<threshold*total)]<-"Not annotated"
+    return(d)
+}
+
+#########################################################################
+
+#Load data from file and generate plot:
+
+library(ggplot2)
+library(reshape2)
+
+subplotwidth  <- 1000
+subplotheight <- 1000
+
+args <- commandArgs(trailingOnly = TRUE)
+arg_input_file <- args[1]
+arg_ignore_below_prop <- as.numeric(args[2])
+arg_taxlevel <- args[3]
+
+orig <- read.delim(arg_input_file,header=TRUE,sep="\t")
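+# keep only contigs of at least 200 bp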
+orig <- orig[orig$len>=200,]
+
+# if there is more than one cov_ column, create a new column cov_total = sum of all cov_ columns
+if (length(grep("^cov_",colnames(orig))) >1) orig$cov_total = rowSums(orig[,grep("^cov_",colnames(orig))])
+
+cov_colnames=colnames(orig)[grep("^cov_",colnames(orig))]
+tax_colnames=colnames(orig)[grep("^taxlevel_",colnames(orig))]
+
+numcols <- length(cov_colnames)
+
+taxlevel <- arg_taxlevel
+
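+# Reshape to long format: one row per contig per read set, so that each read set's coverage gets its own facet in the plot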
+m<-melt(orig,id.vars=c("seqid","len","gc",taxlevel),measure.vars=cov_colnames, variable.name="read_set", value.name="cov")
+
+# the palette provides only 13 distinct colours, so restrict the plot to the 13 most frequent taxa
+# and lump the rest into "other" (thanks to https://github.com/hobrien for the fix):
+if (length(levels(m[,taxlevel])) > 14) {
+  levels(m[,taxlevel])[which(table(m[,taxlevel])<=sort(as.numeric(table(m[,taxlevel])), decreasing=T)[13])]<-"other"
+}
+
+mfilt<-clean.blobs(m,arg_ignore_below_prop,taxlevel)
+taxnames=names(sort(table(mfilt[,taxlevel]),decreasing=TRUE))
+taxnames=c("Not annotated", taxnames[taxnames != "Not annotated"])
+mfilt[,taxlevel] <- factor(mfilt[,taxlevel], levels = taxnames)
+
+png(paste(arg_input_file,taxlevel,"png",sep="."), (numcols * subplotwidth), (1 * subplotheight) + 300, units="px",res=100)
+
+theme_set(theme_bw())
+# The Paul Tol colour scheme is well documented at http://www.sron.nl/~pault/colourschemes.pdf - thank you Paul! #DDDDDD and #777777 were added to it
+paultol=list(c("#DDDDDD") ,c("#DDDDDD","#4477AA"), c("#DDDDDD","#4477AA","#CC6677"), c("#DDDDDD","#4477AA","#DDCC77","#CC6677"), c("#DDDDDD","#4477AA","#117733","#DDCC77","#CC6677"), c("#DDDDDD","#332288","#88CCEE","#117733","#DDCC77","#CC6677"), c("#DDDDDD","#332288","#88CCEE","#117733","#DDCC77","#CC6677","#AA4499"), c("#DDDDDD","#332288","#88CCEE","#44AA99","#117733","#DDCC77","#CC6677","#AA4499"), c("#DDDDDD","#332288","#88CCEE","#44AA99","#117733","#999933","#DDCC77","#CC6677","#AA4 [...]
+
+g<-ggplot() + scale_colour_manual(values=paultol[[length(levels(mfilt[,taxlevel]))]], name="Taxonomic\nClassification", limits=levels(mfilt[,taxlevel]) )
+
+for (t in levels(mfilt[,taxlevel])) {
+  g <- g + geom_point(data=mfilt[mfilt[,taxlevel]==t,],aes_string(x="gc", y="cov", colour=taxlevel), size=2, alpha=I(1/3))
+}
+g<-g +
+  facet_wrap(~read_set, ncol=numcols) + 
+  scale_y_log10() + scale_x_continuous(limits=c(0, 1),breaks = seq(0,1,.1)) +
+  labs(x="GC content", y="Read coverage") + 
+  guides(colour = guide_legend(nrow=3, override.aes = list(alpha = 1,size=10))) + 
+  theme (
+    strip.text.x = element_text(colour = "black", size = 25, vjust = 0.5),
+    axis.text.x  = element_text(colour = "black", size = 25, vjust = 1),
+    axis.text.y  = element_text(colour = "black", size = 25, vjust = 0.5),
+    axis.title.x = element_text(colour = "black", size = 25, vjust = 0),
+    axis.title.y = element_text(colour = "black", size = 25, hjust = 0.5, vjust = 0.5, angle=90),
+    legend.text  = element_text(colour = "black", size = 25, vjust = 0),
+    legend.title = element_text(colour = "black", size = 25, vjust = 0, hjust = 0, lineheight=1),
+    legend.justification=c(1,1), legend.position="bottom", legend.direction="horizontal"
+  )
+g
+dev.off()
diff --git a/pa61-scaffolds.fa.bowtie2.txt.taxlevel_order.01.png b/pa61-scaffolds.fa.bowtie2.txt.taxlevel_order.01.png
new file mode 100644
index 0000000..ab847e4
Binary files /dev/null and b/pa61-scaffolds.fa.bowtie2.txt.taxlevel_order.01.png differ
diff --git a/separate_reads.bash b/separate_reads.bash
new file mode 100644
index 0000000..075de48
--- /dev/null
+++ b/separate_reads.bash
@@ -0,0 +1,51 @@
+###########################################################################
+## Step 1. Create contaminant databases
+###########################################################################
+
+## Having identified the contaminants as bacterial, we can create a subset of the NCBI nt database that is limited to Bacteria only.
+## By running a more specific blast against this database, and keeping only those contigs that hit it better than they hit a nematode database,
+## we can conservatively remove bacterial contigs (and the reads that map to them).
+
+# Create Bacteria database (taxon id 2):
+
+curl "http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&sendto=on&db=nuccore&dopt=gilist&qty=20000000&filter=all&term=txid2\[Organism:exp\]" >txid2.gids
+blastdb_aliastool -dbtype nucl -gilist txid2.gids -db nt -out nt_Bacteria
+
+# Create Nematoda database (taxon id 6231):
+
+curl "http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&sendto=on&db=nuccore&dopt=gilist&qty=2000000&filter=all&term=txid6231\[Organism:exp\]" >txid6231.gids
+blastdb_aliastool -dbtype nucl -gilist txid6231.gids -db nt -out nt_Nematoda
+
+# Do the blasts:
+
+blastn -task megablast -query $ASSEMBLY -db $BLASTDB/nt_Bacteria -outfmt 6 -evalue 1e-10 -max_target_seqs 1 -out $ASSEMBLY.nt_Bacteria.1e-10
+blastn -task megablast -query $ASSEMBLY -db $BLASTDB/nt_Nematoda -outfmt 6 -evalue 1e-10 -max_target_seqs 1 -out $ASSEMBLY.nt_Nematoda.1e-10
+
+# Separate by best blast hit:
+
+blast_separate_taxa.pl -b1 $ASSEMBLY.nt_Nematoda.1e-10 -b2 $ASSEMBLY.nt_Bacteria.1e-10
+
+# Make list of contigids to be removed:
+
+# based on megablast 1e-10 hit:
+
+cut -f1 $ASSEMBLY.nt_Bacteria.1e-10.only > toremove.contigids
+
+# based on gc (col 3) and total cov (sum of cols 4 and 5):
+
+awk '$3>=.55 && ($4+$5)<=50'  blobplot.txt | cut -f1 >>toremove.contigids
+awk '$3>=.60 && ($4+$5)<=100' blobplot.txt | cut -f1 >>toremove.contigids
+awk '$3>=.65'                 blobplot.txt | cut -f1 >>toremove.contigids
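+# (these GC/coverage cutoffs were chosen by inspecting this dataset's blobplot; inspect your own plot before reusing them)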
+sort -u toremove.contigids | grep -v seqid >toremove.contigids.uniq; mv toremove.contigids.uniq toremove.contigids
+
+# get list of contigs to keep by removing contaminant contigs from total list of all contig ids:
+grep -v seqid blobplot.txt | cut -f1 | fgrep.pl - -v -f toremove.contigids >tokeep.contigids
+
+# NOTE: fgrep.pl behaves like fgrep with -f, but is often faster than the Unix tool on large ID lists despite being written in Perl
+
+# Extract reads mapping to these contigs:
+
+for LIBNAME in ERR138445 ERR138446
+do
+    bowtie2_extract_reads_mapped_to_specific_contigs.pl -s <(samtools view $ASSEMBLY.$LIBNAME.bowtie2.bam) -id tokeep.contigids -out $LIBNAME.
+done
diff --git a/shuffleSequences_fastx.pl b/shuffleSequences_fastx.pl
new file mode 100755
index 0000000..1de4780
--- /dev/null
+++ b/shuffleSequences_fastx.pl
@@ -0,0 +1,17 @@
+#!/usr/bin/perl
+
+die "Usage: shuffleSequences_fastx.pl [2|4] <forward_fasta_or_fastq_file> <reverse_fasta_or_fastq_file> > <output_file_name>\n" unless scalar @ARGV == 3;
+$lines_per_seq = shift @ARGV;
+die "Expecting first argument to be 2 or 4" unless $lines_per_seq =~ /^2|4/;
+$filename1 = shift @ARGV;
+$filename2 = shift @ARGV;
+
+open IN1, "<$filename1" or die "$filename1 not readable\n";
+open IN2, "<$filename2" or die "$filename2 not readable\n";
+
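+# Interleave records: print one full record (header plus remaining lines) from the forward file, then one from the reverse file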
+while(<IN1>) {
+	die "Does not seem to be a fasta/fastq file" unless /^[@>]/;
+	print $_;
+	for (2..$lines_per_seq) { $_ = <IN1>; print $_ };
+	for (1..$lines_per_seq) { $_ = <IN2>; print $_ };
+}

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/blobology.git


