[med-svn] [Git][med-team/nanopolish][master] 5 commits: New upstream version 0.9.0

Afif Elghraoui gitlab at salsa.debian.org
Sat Feb 17 23:17:27 UTC 2018


Afif Elghraoui pushed to branch master at Debian Med / nanopolish


Commits:
146fe3ea by Afif Elghraoui at 2018-02-17T18:03:36-05:00
New upstream version 0.9.0
- - - - -
e2f2b3d4 by Afif Elghraoui at 2018-02-17T18:03:37-05:00
Merge tag 'upstream/0.9.0'

Upstream version 0.9.0

- - - - -
dc85d227 by Afif Elghraoui at 2018-02-17T18:05:44-05:00
Refresh patches

- - - - -
551567ae by Afif Elghraoui at 2018-02-17T18:05:57-05:00
Bump upstream copyright year

- - - - -
326887ce by Afif Elghraoui at 2018-02-17T18:14:17-05:00
Update changelog (waiting for new release of fast5)

Cannot complete the upload until a new fast5 release is made.

- - - - -


23 changed files:

- Makefile
- README.md
- debian/changelog
- debian/copyright
- debian/patches/reproducible.patch
- docs/source/quickstart_call_methylation.rst
- docs/source/quickstart_consensus.rst
- scripts/nanopolish_merge.py
- src/alignment/nanopolish_alignment_db.cpp
- src/alignment/nanopolish_eventalign.cpp
- src/common/nanopolish_common.h
- + src/common/nanopolish_fast5_io.cpp
- + src/common/nanopolish_fast5_io.h
- src/hmm/nanopolish_profile_hmm.cpp
- src/main/nanopolish.cpp
- src/nanopolish_call_methylation.cpp
- src/nanopolish_call_variants.cpp
- src/nanopolish_index.cpp
- src/nanopolish_phase_reads.cpp
- src/nanopolish_read_db.cpp
- src/nanopolish_read_db.h
- src/nanopolish_squiggle_read.cpp
- src/nanopolish_squiggle_read.h


Changes:

=====================================
Makefile
=====================================
--- a/Makefile
+++ b/Makefile
@@ -52,7 +52,9 @@ ifeq ($(HTS), install)
     HTS_INCLUDE=-I./htslib
 else
     # Use system-wide htslib
-    HTS_LIB=-lhts
+    HTS_LIB=
+    HTS_INCLUDE=
+    LIBS += -lhts
 endif
 
 # Include the header-only fast5 library


=====================================
README.md
=====================================
--- a/README.md
+++ b/README.md
@@ -59,22 +59,15 @@ nanopolish eventalign: align signal-level events to k-mers of a reference genome
 Nanopolish needs access to the signal-level data measured by the nanopore sequencer. The first step of any nanopolish workflow is to prepare the input data by telling nanopolish where to find the signal files. If you ran Albacore 2.0 on your data you should run `nanopolish index` on your input reads (-d can be specified more than once if using multiple runs):
 
 ```
-# Only run this if you used Albacore 2.0 or later
-nanopolish index -d /path/to/raw_fast5s/ albacore_output.fastq
+# Index the output of the albacore basecaller
+nanopolish index -d /path/to/raw_fast5s/ -s sequencing_summary.txt albacore_output.fastq
 ```
 
-If you basecalled your reads with Albacore 1.2 or earlier, you should run `nanopolish extract` on your input reads instead:
-
-```
-# Only run this if you used Albacore 1.2 or later
-nanopolish extract --type template directory/pass/ > reads.fa
-```
-
-Note these two commands are mutually exclusive - you only need to run one of them. You need to decide what command to run depending on the version of Albacore that you used. In the following sections we assume you have preprocessed the data by following the instructions above and that your reads are in a file named `reads.fa`.
+The `-s` option tells nanopolish to read the `sequencing_summary.txt` file from Albacore to speed up indexing. Without this option `nanopolish index` is extremely slow as it needs to read every fast5 file individually. If you basecalled your run in parallel, so you have multiple `sequencing_summary.txt` files, you can use the `-f` option to pass in a file containing the paths to the sequencing summary files (one per line).
 
 ### Computing a new consensus sequence for a draft assembly
 
-The original purpose of nanopolish was to compute an improved consensus sequence for a draft genome assembly produced by a long-read assembly like [canu](https://github.com/marbl/canu). This section describes how to do this, starting with your draft assembly which should have megabase-sized contigs.
+The original purpose of nanopolish was to compute an improved consensus sequence for a draft genome assembly produced by a long-read assembly like [canu](https://github.com/marbl/canu). This section describes how to do this, starting with your draft assembly which should have megabase-sized contigs. We've also posted a tutorial including example data [here](http://nanopolish.readthedocs.io/en/latest/quickstart_consensus.html).
 
 ```
 # Index the draft genome
@@ -102,24 +95,7 @@ python nanopolish_merge.py polished.*.fa > polished_genome.fa
 
 ## Calling Methylation
 
-nanopolish can use the signal-level information measured by the sequencer to detect 5-mC as described [here](http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.4184.html). Here's how you run it:
-
-```
-# Align the basecalled reads to a reference genome
-bwa mem -x ont2d -t 8 reference.fa reads.fa | samtools sort -o reads.sorted.bam -T reads.tmp -
-samtools index reads.sorted.bam
-
-# Run the methylation caller
-nanopolish call-methylation -t 8 -r reads.fa -g reference.fa -b reads.sorted.bam > methylation.tsv
-```
-
-The output of call-methylation is a tab-separated file containing per-read log-likelihood ratios (positive values indicate more evidence for 5-mC, negative values indicate more evidence for C). Each line contains the name of the read that covered the CpG site, which allows methylation calls to be phased. We have provided a script to calculate per-site methylation frequencies using call-methylation's output:
-
-```
-python /path/to/nanopolish/scripts/calculate_methylation_frequency -c 2.5 -i methylation.tsv > frequencies.tsv
-```
-
-The output of this script is a tab-seperated file containing the genomic position of the CpG site, the number of reads that covered the site, and the percentage of those reads that were predicted to be methylated. The `-c 2.5` option requires the absolute value of the log-likelihood ratio to be at least 2.5 to make a call, otherwise the read will be ignored. This helps reduce calling errors as only sites with sufficient evidence will be included in the calculation.
+nanopolish can use the signal-level information measured by the sequencer to detect 5-mC as described [here](http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.4184.html). We've posted a tutorial on how to call methylation [here](http://nanopolish.readthedocs.io/en/latest/quickstart_call_methylation.html).
 
 ## To run using docker
 


=====================================
debian/changelog
=====================================
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,3 +1,11 @@
+nanopolish (0.9.0-1) UNRELEASED; urgency=medium
+
+  * New upstream version 0.9.0
+  * Refresh patches
+  * Bump upstream copyright year
+
+ -- Afif Elghraoui <afif at debian.org>  Sat, 17 Feb 2018 18:06:03 -0500
+
 nanopolish (0.8.5-2) unstable; urgency=low
 
   * Remove dependency on nonexistent package sse-support [i386]


=====================================
debian/copyright
=====================================
--- a/debian/copyright
+++ b/debian/copyright
@@ -4,7 +4,7 @@ Upstream-Contact: Jared Simpson <jared.simpson at oicr.on.ca>
 Source: https://github.com/jts/nanopolish
 
 Files: *
-Copyright: 2015-2017 Ontario Institute for Cancer Research
+Copyright: 2015-2018 Ontario Institute for Cancer Research
 License: MIT
 
 Files: src/test/catch.hpp


=====================================
debian/patches/reproducible.patch
=====================================
--- a/debian/patches/reproducible.patch
+++ b/debian/patches/reproducible.patch
@@ -4,7 +4,7 @@ Description: make build reproducible
 Author: Sascha Steinbiss <satta at debian.org>
 --- nanopolish.orig/Makefile
 +++ nanopolish/Makefile
-@@ -99,8 +99,8 @@
+@@ -101,8 +101,8 @@
  #
  
  # Find the source files by searching subdirectories


=====================================
docs/source/quickstart_call_methylation.rst
=====================================
--- a/docs/source/quickstart_call_methylation.rst
+++ b/docs/source/quickstart_call_methylation.rst
@@ -12,7 +12,7 @@ Oxford Nanopore sequencers are sensitive to base modifications. Here we provide 
 **Requirements**:
 
 * `nanopolish v0.8.4 <installation.html>`_
-* `samtools v1.2 <http://samtools.sourceforge.net/>`_
+* `samtools v1.2 <https://htslib.org>`_
 * `minimap2 <https://github.com/lh3/minimap2>`_
 
 Download example dataset


=====================================
docs/source/quickstart_consensus.rst
=====================================
--- a/docs/source/quickstart_consensus.rst
+++ b/docs/source/quickstart_consensus.rst
@@ -8,7 +8,7 @@ The original purpose for nanopolish was to improve the consensus assembly accura
 **Requirements**:
 
 * `nanopolish v0.8.4 <installation.html>`_
-* `samtools v1.2 <http://samtools.sourceforge.net/>`_
+* `samtools v1.2 <https://htslib.org>`_
 * `bwa v0.7.12 <https://github.com/lh3/bwa>`_
 * `MUMmer <https://github.com/mummer4/mummer>`_
 


=====================================
scripts/nanopolish_merge.py
=====================================
--- a/scripts/nanopolish_merge.py
+++ b/scripts/nanopolish_merge.py
@@ -85,7 +85,7 @@ for contig_name in sorted(segments_by_name.keys()):
 
         # Ensure the segments overlap
         if not( prev_segment is None or prev_segment + SEGMENT_LENGTH + OVERLAP_LENGTH > segment_start ):
-            sys.stderr.write("error: segment starting at %s:%d is missing\n" % (contig, prev_segment + SEGMENT_LENGTH + 40))
+            sys.stderr.write("error: segment starting at %s:%d is missing\n" % (contig_name, prev_segment + SEGMENT_LENGTH + 40))
             all_segments_found = False
 
         sequence = segments_by_name[contig_name][segment_start]
@@ -98,5 +98,5 @@ for contig_name in sorted(segments_by_name.keys()):
     if all_segments_found:
         print(">%s\n%s" % (contig_name, assembly))
     else:
-        sys.stderr.write("error: some segments are missing, could not merge contig %s\n" % (contig))
+        sys.stderr.write("error: some segments are missing, could not merge contig %s\n" % (contig_name))
         sys.exit(1)


=====================================
src/alignment/nanopolish_alignment_db.cpp
=====================================
--- a/src/alignment/nanopolish_alignment_db.cpp
+++ b/src/alignment/nanopolish_alignment_db.cpp
@@ -73,12 +73,21 @@ EventAlignmentRecord::EventAlignmentRecord(SquiggleRead* sr,
         size_t kmer_pos_ref_strand = seq_record.aligned_bases[i].read_pos;
         size_t kmer_pos_read_strand = seq_record.rc ? this->sr->flip_k_strand(kmer_pos_ref_strand, k) : kmer_pos_ref_strand;
         size_t event_idx = this->sr->get_closest_event_to(kmer_pos_read_strand, strand_idx);
-        assert(event_idx != -1);
         this->aligned_events.push_back( { seq_record.aligned_bases[i].ref_pos, (int)event_idx });
     }
     this->rc = strand_idx == 0 ? seq_record.rc : !seq_record.rc;
     this->strand = strand_idx;
-    this->stride = this->aligned_events.front().read_pos < this->aligned_events.back().read_pos ? 1 : -1;
+
+    if(!this->aligned_events.empty()) {
+        this->stride = this->aligned_events.front().read_pos < this->aligned_events.back().read_pos ? 1 : -1;
+
+        // check for a degenerate alignment and discard the events if so
+        if(this->aligned_events.front().read_pos == this->aligned_events.back().read_pos) {
+            this->aligned_events.clear();
+        }
+    } else {
+        this->stride = 1;
+    }
 }
 
 //
@@ -385,6 +394,9 @@ void AlignmentDB::load_region(const std::string& contig,
     char* ref_segment = faidx_fetch_seq(fai, m_region_contig.c_str(), m_region_start, m_region_end, &fetched_len);
     m_region_ref_sequence = ref_segment;
     
+    // ensure reference sequence is upper case
+    std::transform(m_region_ref_sequence.begin(), m_region_ref_sequence.end(), m_region_ref_sequence.begin(), ::toupper);
+
     // load base-space alignments
     m_sequence_records = _load_sequence_by_region(m_sequence_bam);
 
@@ -548,11 +560,14 @@ std::vector<EventAlignmentRecord> AlignmentDB::_load_events_by_region_from_bam(c
 std::vector<EventAlignmentRecord> AlignmentDB::_load_events_by_region_from_read(const std::vector<SequenceAlignmentRecord>& sequence_records)
 {
     std::vector<EventAlignmentRecord> records;
+
+    #pragma omp parallel for
     for(size_t i = 0; i < sequence_records.size(); ++i) {
         const SequenceAlignmentRecord& seq_record = sequence_records[i];
 
         // conditionally load the squiggle read if it hasn't been loaded already
         _load_squiggle_read(seq_record.read_name);
+
         for(size_t si = 0; si < NUM_STRANDS; ++si) {
             
             // skip complement
@@ -566,6 +581,7 @@ std::vector<EventAlignmentRecord> AlignmentDB::_load_events_by_region_from_read(
                 continue;
             }
 
+            #pragma omp critical
             records.emplace_back(sr, si, seq_record);
         }
     }
@@ -610,7 +626,11 @@ void AlignmentDB::_load_squiggle_read(const std::string& read_name)
 {
     // Do we need to load this fast5 file?
     if(m_squiggle_read_map.find(read_name) == m_squiggle_read_map.end()) {
+        
+        // Allow the load to happen in parallel but lock access to adding it into the map
         SquiggleRead* sr = new SquiggleRead(read_name, m_read_db);
+        
+        #pragma omp critical
         m_squiggle_read_map[read_name] = sr;
     }
 }


=====================================
src/alignment/nanopolish_eventalign.cpp
=====================================
--- a/src/alignment/nanopolish_eventalign.cpp
+++ b/src/alignment/nanopolish_eventalign.cpp
@@ -599,6 +599,9 @@ std::vector<EventAlignment> align_read_to_ref(const EventAlignmentParameters& pa
     std::string ref_seq = get_reference_region_ts(params.fai, ref_name.c_str(), ref_offset, 
                                                   bam_endpos(params.record), &fetched_len);
 
+    // convert to upper case
+    std::transform(ref_seq.begin(), ref_seq.end(), ref_seq.begin(), ::toupper);
+    
     // k from read pore model
     const uint32_t k = params.sr->get_model_k(params.strand_idx);
 


=====================================
src/common/nanopolish_common.h
=====================================
--- a/src/common/nanopolish_common.h
+++ b/src/common/nanopolish_common.h
@@ -18,7 +18,7 @@
 #include "logsum.h"
 
 #define PACKAGE_NAME "nanopolish"
-#define PACKAGE_VERSION "0.8.5"
+#define PACKAGE_VERSION "0.9.0"
 #define PACKAGE_BUGREPORT "https://github.com/jts/nanopolish/issues"
 
 //


=====================================
src/common/nanopolish_fast5_io.cpp
=====================================
--- /dev/null
+++ b/src/common/nanopolish_fast5_io.cpp
@@ -0,0 +1,265 @@
+//---------------------------------------------------------
+// Copyright 2018 Ontario Institute for Cancer Research
+// Written by Jared Simpson (jared.simpson at oicr.on.ca)
+//---------------------------------------------------------
+//
+// nanopolish_fast5_io -- lightweight functions
+// to read specific data from fast5 files
+//
+#include <string.h>
+#include <math.h>
+#include "nanopolish_fast5_io.h"
+
+//#define DEBUG_FAST5_IO 1
+
+#define RAW_ROOT "/Raw/Reads/"
+int verbose = 0;
+
+//
+hid_t fast5_open(const std::string& filename)
+{
+    hid_t hdf5file = H5Fopen(filename.c_str(), H5F_ACC_RDONLY, H5P_DEFAULT);
+    return hdf5file;
+}
+
+//
+void fast5_close(hid_t hdf5_file)
+{
+    H5Fclose(hdf5_file);
+}
+
+//
+std::string fast5_get_raw_read_group(hid_t hdf5_file)
+{
+    std::string read_name = fast5_get_raw_read_name(hdf5_file);
+    if(read_name != "") {
+        return std::string(RAW_ROOT) + read_name;
+    } else {
+        return "";
+    }
+}
+
+//
+std::string fast5_get_raw_read_name(hid_t hdf5_file)
+{
+    // This code is From scrappie's fast5_interface
+
+    // retrieve the size of the read name
+    ssize_t size =
+        H5Lget_name_by_idx(hdf5_file, RAW_ROOT, H5_INDEX_NAME, H5_ITER_INC, 0, NULL,
+                           0, H5P_DEFAULT);
+
+    if (size < 0) {
+        return "";
+    }
+
+    // copy the read name out of the fast5
+    char* name = (char*)calloc(1 + size, sizeof(char));
+    H5Lget_name_by_idx(hdf5_file, RAW_ROOT, H5_INDEX_NAME, H5_ITER_INC, 0, name,
+                       1 + size, H5P_DEFAULT);
+
+    // cleanup
+    std::string out(name);
+    free(name);
+    return out;
+}
+
+//
+std::string fast5_get_read_id(hid_t hdf5_file)
+{
+    int ret;
+    hid_t read_name_attribute, raw_group, attribute_type;
+    size_t storage_size = 0;
+    char* read_name_str = NULL;
+
+    std::string out = "";
+    
+    // Get the path to the raw read group
+    std::string raw_read_group = fast5_get_raw_read_group(hdf5_file);
+    if(raw_read_group == "") {
+        return out;
+    }
+
+    return fast5_get_fixed_string_attribute(hdf5_file, raw_read_group, "read_id");
+}
+
+//
+raw_table fast5_get_raw_samples(hid_t hdf5_file, fast5_raw_scaling scaling)
+{
+    float* rawptr = NULL;
+    hid_t space;
+    hsize_t nsample;
+    herr_t status;
+    float raw_unit;
+    raw_table rawtbl = { 0, 0, 0, NULL };
+
+    // mostly from scrappie
+    std::string raw_read_group = fast5_get_raw_read_group(hdf5_file);
+
+    // Create data set name
+    std::string signal_path = raw_read_group + "/Signal";
+
+    hid_t dset = H5Dopen(hdf5_file, signal_path.c_str(), H5P_DEFAULT);
+    if (dset < 0) {
+#ifdef DEBUG_FAST5_IO
+        fprintf(stderr, "Failed to open dataset '%s' to read raw signal from.\n", signal_path.c_str());
+#endif
+        goto cleanup2;
+    }
+
+    space = H5Dget_space(dset);
+    if (space < 0) {
+        fprintf(stderr, "Failed to create copy of dataspace for raw signal %s.\n", signal_path.c_str());
+        goto cleanup3;
+    }
+
+    H5Sget_simple_extent_dims(space, &nsample, NULL);
+    rawptr = (float*)calloc(nsample, sizeof(float));
+    status = H5Dread(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, rawptr);
+
+    if (status < 0) {
+        free(rawptr);
+#ifdef DEBUG_FAST5_IO
+        fprintf(stderr, "Failed to read raw data from dataset %s.\n", signal_path.c_str());
+#endif
+        goto cleanup4;
+    }
+
+    // convert to pA
+    rawtbl = (raw_table) { nsample, 0, nsample, rawptr };
+    raw_unit = scaling.range / scaling.digitisation;
+    for (size_t i = 0; i < nsample; i++) {
+        rawptr[i] = (rawptr[i] + scaling.offset) * raw_unit;
+    }
+
+ cleanup4:
+    H5Sclose(space);
+ cleanup3:
+    H5Dclose(dset);
+ cleanup2:
+    return rawtbl;
+}
+
+//
+std::string fast5_get_experiment_type(hid_t hdf5_file)
+{
+    return fast5_get_fixed_string_attribute(hdf5_file, "/UniqueGlobalKey/context_tags", "experiment_type");
+}
+
+// from scrappie
+float fast5_read_float_attribute(hid_t group, const char *attribute) {
+    float val = NAN;
+    if (group < 0) {
+#ifdef DEBUG_FAST5_IO
+        fprintf(stderr, "Invalid group passed to %s:%d.", __FILE__, __LINE__);
+#endif
+        return val;
+    }
+
+    hid_t attr = H5Aopen(group, attribute, H5P_DEFAULT);
+    if (attr < 0) {
+#ifdef DEBUG_FAST5_IO
+        fprintf(stderr, "Failed to open attribute '%s' for reading.", attribute);
+#endif
+        return val;
+    }
+
+    H5Aread(attr, H5T_NATIVE_FLOAT, &val);
+    H5Aclose(attr);
+
+    return val;
+}
+
+//
+fast5_raw_scaling fast5_get_channel_params(hid_t hdf5_file)
+{
+    // from scrappie
+    fast5_raw_scaling scaling = { NAN, NAN, NAN, NAN };
+    const char *scaling_path = "/UniqueGlobalKey/channel_id";
+
+    hid_t scaling_group = H5Gopen(hdf5_file, scaling_path, H5P_DEFAULT);
+    if (scaling_group < 0) {
+#ifdef DEBUG_FAST5_IO
+        fprintf(stderr, "Failed to group %s.", scaling_path);
+#endif
+        return scaling;
+    }
+
+    scaling.digitisation = fast5_read_float_attribute(scaling_group, "digitisation");
+    scaling.offset = fast5_read_float_attribute(scaling_group, "offset");
+    scaling.range = fast5_read_float_attribute(scaling_group, "range");
+    scaling.sample_rate = fast5_read_float_attribute(scaling_group, "sampling_rate");
+
+    H5Gclose(scaling_group);
+
+    return scaling;
+}
+
+//
+// Internal functions
+//
+
+//
+std::string fast5_get_fixed_string_attribute(hid_t hdf5_file, const std::string& group_name, const std::string& attribute_name)
+{
+    size_t storage_size;
+    char* buffer;
+    hid_t group, attribute, attribute_type;
+    int ret;
+    std::string out;
+
+    // Open the group /Raw/Reads/Read_nnn
+    group = H5Gopen(hdf5_file, group_name.c_str(), H5P_DEFAULT);
+    if(group < 0) {
+#ifdef DEBUG_FAST5_IO
+        fprintf(stderr, "could not open group %s\n", group_name.c_str());
+#endif
+        goto close_group;
+    }
+
+    // Ensure attribute exists
+    ret = H5Aexists(group, attribute_name.c_str());
+    if(ret <= 0) {
+        goto close_group;
+    }
+
+    // Open the attribute
+    attribute = H5Aopen(group, attribute_name.c_str(), H5P_DEFAULT);
+    if(attribute < 0) {
+#ifdef DEBUG_FAST5_IO
+        fprintf(stderr, "could not open attribute: %s\n", attribute_name.c_str());
+#endif
+        goto close_attr;
+    }
+
+    // Get data type and check it is a fixed-length string
+    attribute_type = H5Aget_type(attribute);
+    if(H5Tis_variable_str(attribute_type)) {
+#ifdef DEBUG_FAST5_IO
+        fprintf(stderr, "variable length string detected -- ignoring attribute\n");
+#endif
+        goto close_type;
+    }
+
+    // Get the storage size and allocate
+    storage_size = H5Aget_storage_size(attribute);
+    buffer = (char*)calloc(storage_size + 1, sizeof(char));
+
+    // finally read the attribute
+    ret = H5Aread(attribute, attribute_type, buffer);
+    if(ret >= 0) {
+        out = buffer;
+    }
+
+    // clean up
+    free(buffer);
+close_type:
+    H5Tclose(attribute_type);
+close_attr:
+    H5Aclose(attribute);
+close_group:
+    H5Gclose(group);
+
+    return out;
+}
+


=====================================
src/common/nanopolish_fast5_io.h
=====================================
--- /dev/null
+++ b/src/common/nanopolish_fast5_io.h
@@ -0,0 +1,65 @@
+//---------------------------------------------------------
+// Copyright 2018 Ontario Institute for Cancer Research
+// Written by Jared Simpson (jared.simpson at oicr.on.ca)
+//---------------------------------------------------------
+//
+// nanopolish_fast5_io -- lightweight functions
+// to read specific data from fast5 files. Some functions
+// ported from scrappie's fast5_interface.c
+//
+#ifndef NANOPOLISH_FAST5_IO_H
+#define NANOPOLISH_FAST5_IO_H
+
+#include <string>
+#include <vector>
+#include <hdf5.h>
+
+extern "C" {
+#include "event_detection.h"
+#include "scrappie_common.h"
+}
+
+// From scrappie
+typedef struct {
+    //  Information for scaling raw data from ADC values to pA
+    float digitisation;
+    float offset;
+    float range;
+    float sample_rate;
+} fast5_raw_scaling;
+
+
+//
+// External API
+//
+
+// open the file and return the hdf ID
+hid_t fast5_open(const std::string& filename);
+
+// close the file
+void fast5_close(hid_t hdf5_file);
+
+// get the raw samples from this file
+raw_table fast5_get_raw_samples(hid_t hdf5_file, fast5_raw_scaling scaling);
+
+// get the name of the raw read in the file (eg Read_1234)
+std::string fast5_get_raw_read_name(hid_t hdf5_file);
+
+// get the name of the raw read group (eg /Raw/Read/Read_1234)
+std::string fast5_get_raw_read_group(hid_t hdf5_file);
+
+// Get the identifier of a read from the hdf5 file
+std::string fast5_get_read_id(hid_t hdf5_file);
+
+// Get the experiment type attribute
+std::string fast5_get_experiment_type(hid_t hdf5_file);
+
+// Get sample rate, and ADC-to-pA scalings
+fast5_raw_scaling fast5_get_channel_params(hid_t hdf5_file);
+
+//
+// Internal utility functions
+//
+std::string fast5_get_fixed_string_attribute(hid_t hdf5_file, const std::string& group_name, const std::string& attribute_name);
+
+#endif


=====================================
src/hmm/nanopolish_profile_hmm.cpp
=====================================
--- a/src/hmm/nanopolish_profile_hmm.cpp
+++ b/src/hmm/nanopolish_profile_hmm.cpp
@@ -32,8 +32,8 @@ float profile_hmm_score(const HMMInputSequence& sequence, const HMMInputData& da
 float profile_hmm_score_set(const std::vector<HMMInputSequence>& sequences, const HMMInputData& data, const uint32_t flags)
 {
     assert(!sequences.empty());
-    assert(sequences[0].get_alphabet()->get_name() == "nucleotide");
-    assert(data.pore_model->pmalphabet->get_name() == "nucleotide");
+    assert(std::string(sequences[0].get_alphabet()->get_name()) == "nucleotide");
+    assert(std::string(data.pore_model->pmalphabet->get_name()) == "nucleotide");
     
     HMMInputData alt_data = data;
     size_t num_models = sequences.size();


=====================================
src/main/nanopolish.cpp
=====================================
--- a/src/main/nanopolish.cpp
+++ b/src/main/nanopolish.cpp
@@ -63,7 +63,7 @@ int print_version(int, char **)
 int main(int argc, char** argv)
 {
     // Turn off HDF's exception printing, which is generally unhelpful for users
-    hdf5::H5Eset_auto(0, NULL, NULL);
+    H5Eset_auto(0, NULL, NULL);
 
     int ret = 0;
     if(argc <= 1) {


=====================================
src/nanopolish_call_methylation.cpp
=====================================
--- a/src/nanopolish_call_methylation.cpp
+++ b/src/nanopolish_call_methylation.cpp
@@ -95,7 +95,7 @@ static const char *CALL_METHYLATION_USAGE_MESSAGE =
 "  -v, --verbose                        display verbose output\n"
 "      --version                        display version\n"
 "      --help                           display this help and exit\n"
-"  -r, --reads=FILE                     the 2D ONT reads are in fasta FILE\n"
+"  -r, --reads=FILE                     the ONT reads are in fasta FILE\n"
 "  -b, --bam=FILE                       the reads aligned to the genome assembly are in bam FILE\n"
 "  -g, --genome=FILE                    the genome we are computing a consensus for is in FILE\n"
 "  -t, --threads=NUM                    use NUM threads (default: 1)\n"


=====================================
src/nanopolish_call_variants.cpp
=====================================
--- a/src/nanopolish_call_variants.cpp
+++ b/src/nanopolish_call_variants.cpp
@@ -414,9 +414,14 @@ std::vector<Variant> expand_variants(const AlignmentDB& alignments,
         // deletion
         Variant v = candidate_variants[vi];
 
-        if(alignments.are_coordinates_valid(v.ref_name, v.ref_position, v.ref_position + v.ref_seq.size())) {
-            v.ref_seq = alignments.get_reference_substring(v.ref_name, v.ref_position, v.ref_position + v.ref_seq.size());
+        // Do not allow deletions to extend within opt::min_flanking_sequence of the end of the haplotype
+        int deletion_end = v.ref_position + v.ref_seq.size();
+        if(alignments.are_coordinates_valid(v.ref_name, v.ref_position, deletion_end) &&
+           alignments.get_region_end() - deletion_end > opt::min_flanking_sequence ) {
+
+            v.ref_seq = alignments.get_reference_substring(v.ref_name, v.ref_position, deletion_end);
             assert(v.ref_seq != candidate_variants[vi].ref_seq);
+            assert(v.ref_seq.length() == candidate_variants[vi].ref_seq.length() + 1);
             assert(v.ref_seq.substr(0, candidate_variants[vi].ref_seq.size()) == candidate_variants[vi].ref_seq);
             out_variants.push_back(v);
         }
@@ -783,6 +788,7 @@ Haplotype call_haplotype_from_candidates(const AlignmentDB& alignments,
         int calling_end = candidate_variants[end_variant_idx - 1].ref_position +
                           candidate_variants[end_variant_idx - 1].ref_seq.length() +
                           opt::min_flanking_sequence;
+
         int calling_size = calling_end - calling_start;
 
         if(opt::verbose > 2) {
@@ -1097,6 +1103,11 @@ void parse_call_variants_options(int argc, char** argv)
     }
 }
 
+void print_invalid_window_error(int start_base, int end_base)
+{
+    fprintf(stderr, "[error] Invalid polishing window: [%d %d] - please adjust -w parameter.\n", start_base, end_base);
+}
+
 int call_variants_main(int argc, char** argv)
 {
     parse_call_variants_options(argc, argv);
@@ -1121,9 +1132,16 @@ int call_variants_main(int argc, char** argv)
         end_base = contig_length - 1;
     }
 
+    // Verify window coordinates are correct
+    if(start_base > end_base) {
+        print_invalid_window_error(start_base, end_base);
+        fprintf(stderr, "The starting coordinate of the polishing window must be less than or equal to the end coordinate\n");
+        exit(EXIT_FAILURE);
+    }
+
     int MIN_DISTANCE_TO_END = 40;
     if(contig_length - start_base < MIN_DISTANCE_TO_END) {
-        fprintf(stderr, "Invalid polishing window: [%d %d] - please adjust -w parameter.\n", start_base, end_base);
+        print_invalid_window_error(start_base, end_base);
         fprintf(stderr, "The starting coordinate of the polishing window must be at least %dbp from the contig end\n", MIN_DISTANCE_TO_END);
         exit(EXIT_FAILURE);
     }


=====================================
src/nanopolish_index.cpp
=====================================
--- a/src/nanopolish_index.cpp
+++ b/src/nanopolish_index.cpp
@@ -14,6 +14,7 @@
 #define SUBPROGRAM "index"
 
 #include <iostream>
+#include <fstream>
 #include <sstream>
 #include <getopt.h>
 
@@ -24,6 +25,7 @@
 #include "fs_support.hpp"
 #include "logger.hpp"
 #include "profiler.h"
+#include "nanopolish_fast5_io.h"
 
 static const char *INDEX_VERSION_MESSAGE =
 SUBPROGRAM " Version " PACKAGE_VERSION "\n"
@@ -39,39 +41,59 @@ static const char *INDEX_USAGE_MESSAGE =
 "      --version                        display version\n"
 "  -v, --verbose                        display verbose output\n"
 "  -d, --directory                      path to the directory containing the raw ONT signal files. This option can be given multiple times.\n"
-"  -f, --fast5-fofn                     file containing the paths to each fast5 file for the run\n"
+"  -s, --sequencing-summary             the sequencing summary file from albacore, providing this option will make indexing much faster\n"
+"  -f, --summary-fofn                   file containing the paths to the sequencing summary files (one per line)\n"
 "\nReport bugs to " PACKAGE_BUGREPORT "\n\n";
 
 namespace opt
 {
     static unsigned int verbose = 0;
     static std::vector<std::string> raw_file_directories;
-    static std::string fast5_fofn;
     static std::string reads_file;
+    static std::vector<std::string> sequencing_summary_files;
+    static std::string sequencing_summary_fofn;
 }
 static std::ostream* os_p;
 
-void index_file(ReadDB& read_db, const std::string& fn)
+void index_file_from_map(ReadDB& read_db, const std::string& fn, const std::map<std::string, std::string>& fast5_to_read_name_map)
 {
-    PROFILE_FUNC("index_file")
-
-    fast5::File* fp = NULL;
-    try {
-        fp = new fast5::File(fn);
-        if(fp->is_open()) {
-            fast5::Raw_Samples_Params params = fp->get_raw_samples_params();
-            std::string read_id = params.read_id;
-            read_db.add_signal_path(read_id, fn);
+    PROFILE_FUNC("index_file_from_map")
+
+    // Check if this fast5 file is in the map
+    size_t last_dir_pos = fn.find_last_of("/");
+    std::string fast5_basename = last_dir_pos == std::string::npos ? fn : fn.substr(last_dir_pos + 1);
+
+    const auto& iter = fast5_to_read_name_map.find(fast5_basename);
+    if(iter != fast5_to_read_name_map.end()) {
+        if(read_db.has_read(iter->second)) {
+            read_db.add_signal_path(iter->second.c_str(), fn);
+        }
+    } else {
+        if(opt::verbose > 0) {
+            fprintf(stderr, "Could not find read %s in sequencing summary file\n", fn.c_str());
         }
-    } catch(hdf5_tools::Exception e) {
-        fprintf(stderr, "skipping invalid fast5 file: %s\n", fn.c_str());
     }
-    delete fp;
 } // process_file
 
-void index_path(ReadDB& read_db, const std::string& path)
+void index_file_from_fast5(ReadDB& read_db, const std::string& fn, const std::map<std::string, std::string>& fast5_to_read_name_map)
+{
+    PROFILE_FUNC("index_file_from_fast5")
+
+    hid_t hdf5_file = fast5_open(fn);
+    if(hdf5_file < 0) {
+        fprintf(stderr, "could not open fast5 file: %s\n", fn.c_str());
+    }
+
+    std::string read_id = fast5_get_read_id(hdf5_file);
+    if(read_id != "") {
+        read_db.add_signal_path(read_id, fn);
+    }
+    fast5_close(hdf5_file);
+} // process_file
+
+void index_path(ReadDB& read_db, const std::string& path, const std::map<std::string, std::string>& fast5_to_read_name_map)
 {
-    fprintf(stderr, "Indexing %s\n", path.c_str());
+    fprintf(stderr, "[readdb] indexing %s\n", path.c_str());
     if (is_directory(path)) {
         auto dir_list = list_directory(path);
         for (const auto& fn : dir_list) {
@@ -82,36 +104,94 @@ void index_path(ReadDB& read_db, const std::string& path)
             std::string full_fn = path + "/" + fn;
             if(is_directory(full_fn)) {
                 // recurse
-                index_path(read_db, full_fn);
-                read_db.print_stats();
+                index_path(read_db, full_fn, fast5_to_read_name_map);
             } else if (full_fn.find(".fast5") != -1) {
-                index_file(read_db, full_fn);
+                if(!fast5_to_read_name_map.empty()) {
+                    index_file_from_map(read_db, full_fn, fast5_to_read_name_map);
+                } else {
+                    index_file_from_fast5(read_db, full_fn, fast5_to_read_name_map);
+                }
             }
         }
     }
 } // process_path
 
-void process_fast5_fofn(ReadDB& read_db, const std::string& fast5_fofn)
+// read sequencing summary files from the fofn and add them to the list
+void process_summary_fofn()
 {
-    //
-    std::ifstream in_file(fast5_fofn.c_str());
+    if(opt::sequencing_summary_fofn.empty()) {
+        return;
+    }
+
+    // open
+    std::ifstream in_file(opt::sequencing_summary_fofn.c_str());
     if(in_file.bad()) {
-        fprintf(stderr, "error: could not file %s\n", fast5_fofn.c_str());
+        fprintf(stderr, "error: could not file %s\n", opt::sequencing_summary_fofn.c_str());
         exit(EXIT_FAILURE);
     }
 
     // read
-    size_t count = 0;
     std::string filename;
     while(getline(in_file, filename)) {
-        if(count++ % 1000 == 0) {
-            fprintf(stderr, "Loaded %zu from fofn\n", count);
+        opt::sequencing_summary_files.push_back(filename);
+    }
+}
+
+void exit_bad_header(const std::string& str, const std::string& filename)
+{
+    fprintf(stderr, "Could not find %s column in the header of %s\n", str.c_str(), filename.c_str());
+    exit(EXIT_FAILURE);
+}
+
+//
+void parse_sequencing_summary(const std::string& filename, std::map<std::string, std::string>& out_map)
+{
+    // open
+    std::ifstream in_file(filename.c_str());
+    if(in_file.bad()) {
+        fprintf(stderr, "error: could not file %s\n", filename.c_str());
+        exit(EXIT_FAILURE);
+    }
+
+    // read header to get the column index of the read and file name
+    std::string header;
+    getline(in_file, header);
+    std::vector<std::string> fields = split(header, '\t');
+
+    const std::string READ_NAME_STR = "read_id";
+    const std::string FILENAME_STR = "filename";
+    size_t filename_idx = -1;
+    size_t read_name_idx = -1;
+
+    for(size_t i = 0; i < fields.size(); ++i) {
+        if(fields[i] == READ_NAME_STR) {
+            read_name_idx = i;
         }
-        index_file(read_db, filename);
+
+        if(fields[i] == FILENAME_STR) {
+            filename_idx = i;
+        }
+    }
+
+    if(filename_idx == -1) {
+        exit_bad_header(FILENAME_STR, filename);
+    }
+
+    if(read_name_idx == -1) {
+        exit_bad_header(READ_NAME_STR, filename);
+    }
+
+    // read records and add to map
+    std::string line;
+    while(getline(in_file, line)) {
+        fields = split(line, '\t');
+        std::string fast5_filename = fields[filename_idx];
+        std::string read_name = fields[read_name_idx];
+        out_map[fast5_filename] = read_name;
     }
 }
 
-static const char* shortopts = "vd:f:";
+static const char* shortopts = "vd:f:s:";
 
 enum {
     OPT_HELP = 1,
@@ -120,12 +200,13 @@ enum {
 };
 
 static const struct option longopts[] = {
-    { "help",               no_argument,       NULL, OPT_HELP },
-    { "version",            no_argument,       NULL, OPT_VERSION },
-    { "log-level",          required_argument, NULL, OPT_LOG_LEVEL },
-    { "verbose",            no_argument,       NULL, 'v' },
-    { "directory",          required_argument, NULL, 'd' },
-    { "fast5-fofn",         required_argument, NULL, 'f' },
+    { "help",                      no_argument,       NULL, OPT_HELP },
+    { "version",                   no_argument,       NULL, OPT_VERSION },
+    { "log-level",                 required_argument, NULL, OPT_LOG_LEVEL },
+    { "verbose",                   no_argument,       NULL, 'v' },
+    { "directory",                 required_argument, NULL, 'd' },
+    { "sequencing-summary-file",   required_argument, NULL, 's' },
+    { "summary-fofn",              required_argument, NULL, 'f' },
     { NULL, 0, NULL, 0 }
 };
 
@@ -146,8 +227,9 @@ void parse_index_options(int argc, char** argv)
                 log_level.push_back(arg.str());
                 break;
             case 'v': opt::verbose++; break;
+            case 's': opt::sequencing_summary_files.push_back(arg.str()); break;
             case 'd': opt::raw_file_directories.push_back(arg.str()); break;
-            case 'f': arg >> opt::fast5_fofn; break;
+            case 'f': arg >> opt::sequencing_summary_fofn; break;
         }
     }
 
@@ -165,8 +247,8 @@ void parse_index_options(int argc, char** argv)
         std::cerr << SUBPROGRAM ": too many arguments\n";
         die = true;
     }
-    
-    if (die) 
+
+    if (die)
     {
         std::cout << "\n" << INDEX_USAGE_MESSAGE;
         exit(EXIT_FAILURE);
@@ -178,11 +260,21 @@ void parse_index_options(int argc, char** argv)
 int index_main(int argc, char** argv)
 {
     parse_index_options(argc, argv);
-    
+
+
+    // Read a map from fast5 files to read name from the sequencing summary files (if any)
+    process_summary_fofn();
+    std::map<std::string, std::string> fast5_to_read_name_map;
+    for(const auto& ss_filename : opt::sequencing_summary_files) {
+        if(opt::verbose > 2) {
+            fprintf(stderr, "summary: %s\n", ss_filename.c_str());
+        }
+        parse_sequencing_summary(ss_filename, fast5_to_read_name_map);
+    }
+
     // import read names, and possibly fast5 paths, from the fasta/fastq file
     ReadDB read_db;
     read_db.build(opt::reads_file);
-
     bool all_reads_have_paths = read_db.check_signal_paths();
 
     // if the input fastq did not contain a complete set of paths
@@ -190,15 +282,17 @@ int index_main(int argc, char** argv)
     if(!all_reads_have_paths) {
 
         for(const auto& dir_name : opt::raw_file_directories) {
-            index_path(read_db, dir_name);
-        }
-
-        if(!opt::fast5_fofn.empty()) {
-            process_fast5_fofn(read_db, opt::fast5_fofn);
+            index_path(read_db, dir_name, fast5_to_read_name_map);
         }
     }
 
-    read_db.print_stats();
-    read_db.save();
+    size_t num_with_path = read_db.get_num_reads_with_path();
+    if(num_with_path == 0) {
+        fprintf(stderr, "Error: no fast5 files found\n");
+        exit(EXIT_FAILURE);
+    } else {
+        read_db.print_stats();
+        read_db.save();
+    }
     return 0;
 }


=====================================
src/nanopolish_phase_reads.cpp
=====================================
--- a/src/nanopolish_phase_reads.cpp
+++ b/src/nanopolish_phase_reads.cpp
@@ -197,7 +197,7 @@ void phase_single_read(const ReadDB& read_db,
     std::string ref_name = hdr->target_name[record->core.tid];
     int alignment_start_pos = record->core.pos;
     int alignment_end_pos = bam_endpos(record);
-    
+
     // Search the variant collection for the index of the first/last variants to phase
     Variant lower_search;
     lower_search.ref_name = ref_name;
@@ -209,7 +209,9 @@ void phase_single_read(const ReadDB& read_db,
     upper_search.ref_position = alignment_end_pos;
     auto upper_iter = std::upper_bound(variants.begin(), variants.end(), upper_search, sortByPosition);
 
-    fprintf(stderr, "%s %s:%u-%u %zu\n", read_name.c_str(), ref_name.c_str(), alignment_start_pos, alignment_end_pos, upper_iter - lower_iter);
+    if(opt::verbose >= 1) {
+        fprintf(stderr, "Phasing read %s %s:%u-%u %zu\n", read_name.c_str(), ref_name.c_str(), alignment_start_pos, alignment_end_pos, upper_iter - lower_iter);
+    }
 
     // no variants to phase?
     if(lower_iter == variants.end()) {
@@ -217,11 +219,14 @@ void phase_single_read(const ReadDB& read_db,
     }
 
     int fetched_len;
-    std::string reference_seq = get_reference_region_ts(fai, 
-                                                        ref_name.c_str(), 
-                                                        alignment_start_pos, 
-                                                        alignment_end_pos, 
+    std::string reference_seq = get_reference_region_ts(fai,
+                                                        ref_name.c_str(),
+                                                        alignment_start_pos,
+                                                        alignment_end_pos,
                                                         &fetched_len);
+    
+    // convert to upper case to avoid calling c>C as variants
+    std::transform(reference_seq.begin(), reference_seq.end(), reference_seq.begin(), ::toupper);
 
     std::string read_outseq = reference_seq;
     std::string read_outqual(reference_seq.length(), MAX_Q_SCORE + BAM_Q_OFFSET);
@@ -242,7 +247,7 @@ void phase_single_read(const ReadDB& read_db,
         SequenceAlignmentRecord seq_align_record(record);
         EventAlignmentRecord event_align_record(&sr, strand_idx, seq_align_record);
 
-        // 
+        //
         for(; lower_iter < upper_iter; ++lower_iter) {
 
             const Variant& v = *lower_iter;
@@ -259,12 +264,13 @@ void phase_single_read(const ReadDB& read_db,
             data.strand = event_align_record.strand;
             data.rc = event_align_record.rc;
             data.event_stride = event_align_record.stride;
-            
+            data.pore_model = data.read->get_base_model(data.strand);
+
             int e1,e2;
-            bool bounded = AlignmentDB::_find_by_ref_bounds(event_align_record.aligned_events, 
+            bool bounded = AlignmentDB::_find_by_ref_bounds(event_align_record.aligned_events,
                                                             calling_start,
                                                             calling_end,
-                                                            e1, 
+                                                            e1,
                                                             e2);
 
             // The events of this read do not span the calling window, skip
@@ -277,7 +283,7 @@ void phase_single_read(const ReadDB& read_db,
 
             Haplotype calling_haplotype =
                 reference_haplotype.substr_by_reference(calling_start, calling_end);
-        
+
             double ref_score = profile_hmm_score(calling_haplotype.get_sequence(), data, alignment_flags);
             bool good_haplotype = calling_haplotype.apply_variant(v);
             if(good_haplotype) {
@@ -301,7 +307,10 @@ void phase_single_read(const ReadDB& read_db,
                 //fprintf(stderr, "\t%s score: %.2lf %.2lf %c p_wrong: %.3lf Q: %d QC: %c\n", v.key().c_str(), ref_score, alt_score, call, log_p_wrong, (int)q_score, q_char);
 
                 int out_position = v.ref_position - alignment_start_pos;
-                assert(read_outseq[out_position] == v.ref_seq[0]);
+                if(read_outseq[out_position] != v.ref_seq[0]) {
+                    fprintf(stderr, "warning: reference base at position %d does not match variant record (%c != %c)\n",
+                        v.ref_position, v.ref_seq[0], read_outseq[out_position]);
+                }
                 read_outseq[out_position] = call;
                 read_outqual[out_position] = q_char;
             }
@@ -309,13 +318,13 @@ void phase_single_read(const ReadDB& read_db,
 
         // Construct the output bam record
         bam1_t* out_record = bam_init1();
-        
+
         // basic stats
         out_record->core.tid = record->core.tid;
         out_record->core.pos = alignment_start_pos;
         out_record->core.qual = record->core.qual;
-        out_record->core.flag = record->core.flag & BAM_FSUPPLEMENTARY;
-        
+        out_record->core.flag = record->core.flag;
+
         // no read pairs
         out_record->core.mtid = -1;
         out_record->core.mpos = -1;
@@ -342,11 +351,14 @@ int phase_reads_main(int argc, char** argv)
 
     ReadDB read_db;
     read_db.load(opt::reads_file);
-    
+
     // load reference fai file
     faidx_t *fai = fai_load(opt::genome_file.c_str());
-  
-    std::vector<Variant> variants;  
+    if(fai == NULL) {
+        exit(EXIT_FAILURE);
+    }
+
+    std::vector<Variant> variants;
     if(!opt::region.empty()) {
         std::string contig;
         int start_base;
@@ -361,26 +373,28 @@ int phase_reads_main(int argc, char** argv)
 
     // Sort variants by reference coordinate
     std::sort(variants.begin(), variants.end(), sortByPosition);
-    
+
     // remove hom reference
     auto new_end = std::remove_if(variants.begin(), variants.end(), [](Variant v) { return v.genotype == "0/0"; });
     variants.erase( new_end, variants.end());
-    
+
     samFile* sam_out = sam_open("-", "w");
 
-    // the BamProcessor framework calls the input function with the 
+    // the BamProcessor framework calls the input function with the
     // bam record, read index, etc passed as parameters
     // bind the other parameters the worker function needs here
     auto f = std::bind(phase_single_read, std::ref(read_db), std::ref(fai), std::ref(variants), sam_out, _1, _2, _3, _4, _5);
     BamProcessor processor(opt::bam_file, opt::region, opt::num_threads);
-    
+
     // Copy the bam header to std
-    sam_hdr_write(sam_out, processor.get_bam_header());
-    
+    int ret = sam_hdr_write(sam_out, processor.get_bam_header());
+    if(ret != 0) {
+        fprintf(stderr, "[warning] sam_hdr_write returned %d\n", ret);
+    }
     processor.parallel_run(f);
-    
+
     fai_destroy(fai);
     sam_close(sam_out);
-    
+
     return EXIT_SUCCESS;
 }


=====================================
src/nanopolish_read_db.cpp
=====================================
--- a/src/nanopolish_read_db.cpp
+++ b/src/nanopolish_read_db.cpp
@@ -190,6 +190,12 @@ void ReadDB::add_signal_path(const std::string& read_id, const std::string& path
     m_data[read_id].signal_data_path = path;
 }
 
+bool ReadDB::has_read(const std::string& read_id) const
+{
+    const auto& iter = m_data.find(read_id);
+    return iter != m_data.end();
+}
+
 //
 std::string ReadDB::get_signal_path(const std::string& read_id) const
 {
@@ -238,14 +244,21 @@ void ReadDB::save() const
 
 
 //
-bool ReadDB::check_signal_paths() const
+size_t ReadDB::get_num_reads_with_path() const
 {
+    size_t num_reads_with_path = 0;
     for(const auto& iter : m_data) {
-        if(iter.second.signal_data_path == "") {
-            return false;
+        if(iter.second.signal_data_path != "") {
+            num_reads_with_path += 1;
         }
     }
-    return true;
+    return num_reads_with_path;
+}
+
+bool ReadDB::check_signal_paths() const
+{
+    size_t num_reads_with_path = get_num_reads_with_path();
+    return num_reads_with_path == m_data.size();
 }
 
 //
@@ -255,5 +268,5 @@ void ReadDB::print_stats() const
     for(const auto& iter : m_data) {
         num_reads_with_path += iter.second.signal_data_path != "";
     }
-    fprintf(stderr, "[readdb] num reads: %zu, num reads with path: %zu\n", m_data.size(), num_reads_with_path);
+    fprintf(stderr, "[readdb] num reads: %zu, num reads with path to fast5: %zu\n", m_data.size(), num_reads_with_path);
 }


=====================================
src/nanopolish_read_db.h
=====================================
--- a/src/nanopolish_read_db.h
+++ b/src/nanopolish_read_db.h
@@ -46,6 +46,9 @@ class ReadDB
         
         // returns the path to the signal data for the given read
         std::string get_signal_path(const std::string& read_id) const;
+        
+        // returns true if a read with this ID is in the DB
+        bool has_read(const std::string& read_id) const;
 
         // returns the basecalled sequence for the given read
         std::string get_read_sequence(const std::string& read_id) const;
@@ -57,6 +60,9 @@ class ReadDB
         // Summaries and sanity checks
         //
 
+        // return the number of reads with a fast5 file
+        size_t get_num_reads_with_path() const;
+
         // returns true if all reads in the database have paths to their signal-level data
         bool check_signal_paths() const;
 


=====================================
src/nanopolish_squiggle_read.cpp
=====================================
--- a/src/nanopolish_squiggle_read.cpp
+++ b/src/nanopolish_squiggle_read.cpp
@@ -13,6 +13,7 @@
 #include "nanopolish_methyltrain.h"
 #include "nanopolish_extract.h"
 #include "nanopolish_raw_loader.h"
+#include "nanopolish_fast5_io.h"
 
 extern "C" {
 #include "event_detection.h"
@@ -71,46 +72,57 @@ SquiggleRead::SquiggleRead(const std::string& name, const ReadDB& read_db, const
     pore_type(PT_UNKNOWN),
     f_p(nullptr)
 {
+
     this->events_per_base[0] = events_per_base[1] = 0.0f;
     this->base_model[0] = this->base_model[1] = NULL;
     this->fast5_path = read_db.get_signal_path(this->read_name);
+    g_total_reads += 1;
     if(this->fast5_path == "") {
+        g_bad_fast5_file += 1;
         return;
     }
 
-    #pragma omp critical(sr_load_fast5)
-    {
-        // If for some reason the fast5s become unreadable in between indexing
-        // and executing the fast5::File constructor may throw. For example
-        // see issue #273. We catch here and skip the read in such cases
-        try {
-            this->f_p = new fast5::File(fast5_path);
-            assert(this->f_p->is_open());
-
-            // Try to detect whether this read is DNA or RNA
-            this->nucleotide_type = SRNT_DNA;
-            if(this->f_p->have_context_tags_params()) {
-                fast5::Context_Tags_Params context_tags = this->f_p->get_context_tags_params();
-                std::string experiment_type = context_tags["experiment_type"];
-                if(experiment_type == "rna") {
-                    this->nucleotide_type = SRNT_RNA;
+
+    // Get the read type from the fast5 file
+    hid_t hdf5_file = fast5_open(fast5_path);
+    if(hdf5_file >= 0) {
+
+        //fprintf(stderr, "file: %s\n", fast5_path.c_str());
+        std::string experiment_type = fast5_get_experiment_type(hdf5_file);
+        //fprintf(stderr, "type: %s\n", experiment_type.c_str());
+
+        // Try to detect whether this read is DNA or RNA
+        this->nucleotide_type = experiment_type == "rna" ? SRNT_RNA : SRNT_DNA;
+
+        // Did this read come from nanopolish extract?
+        bool is_event_read = is_extract_read_name(this->read_name);
+
+        // Use the legacy loader for DNA reads from extract, otherwise use the new loader
+        if(this->nucleotide_type == SRNT_DNA && is_event_read) {
+            try {
+                #pragma omp critical(sr_load_fast5)
+                {
+                    this->f_p = new fast5::File(fast5_path);
+                    assert(this->f_p->is_open());
+                    load_from_events(flags);
                 }
+            } catch(hdf5_tools::Exception e) {
+                fprintf(stderr, "[warning] fast5 file is unreadable and will be skipped: %s\n", fast5_path.c_str());
+                g_bad_fast5_file += 1;
             }
 
-            bool is_event_read = is_extract_read_name(this->read_name);
-            if(this->nucleotide_type == SRNT_DNA && is_event_read) {
-                load_from_events(flags);
-            } else {
-                this->read_sequence = read_db.get_read_sequence(read_name);
-                load_from_raw(flags);
-            }
-        } catch(hdf5_tools::Exception e) {
-            fprintf(stderr, "[warning] fast5 file is unreadable and will be skipped: %s\n", fast5_path.c_str());
-            g_bad_fast5_file += 1;
+            delete this->f_p;
+            this->f_p = nullptr;
+        } else {
+            this->read_sequence = read_db.get_read_sequence(read_name);
+            load_from_raw(hdf5_file, flags);
         }
 
-        delete this->f_p;
-        this->f_p = nullptr;
+        fast5_close(hdf5_file);
+
+    } else {
+        fprintf(stderr, "[warning] fast5 file is unreadable and will be skipped: %s\n", fast5_path.c_str());
+        g_bad_fast5_file += 1;
     }
 
     if(!this->events[0].empty()) {
@@ -259,7 +271,7 @@ void SquiggleRead::load_from_events(const uint32_t flags)
 }
 
 //
-void SquiggleRead::load_from_raw(const uint32_t flags)
+void SquiggleRead::load_from_raw(hid_t hdf5_file, const uint32_t flags)
 {
     // File not in db, can't load
     if(this->fast5_path == "" || this->read_sequence == "") {
@@ -291,33 +303,12 @@ void SquiggleRead::load_from_raw(const uint32_t flags)
     this->base_model[strand_idx] = PoreModelSet::get_model(kit, alphabet, strand_str, k);
     assert(this->base_model[strand_idx] != NULL);
 
-    // Ensure signal file is available for read
-    assert(f_p->is_open());
-
     // Read the sample rate
-    auto channel_params = f_p->get_channel_id_params();
-    this->sample_rate = channel_params.sampling_rate;
+    auto channel_params = fast5_get_channel_params(hdf5_file);
+    this->sample_rate = channel_params.sample_rate;
 
     // Read the actual samples
-    auto& sample_read_names = f_p->get_raw_samples_read_name_list();
-    if(sample_read_names.empty()) {
-        fprintf(stderr, "Error, no raw samples found\n");
-        exit(EXIT_FAILURE);
-    }
-
-    // we assume the first raw sample read is the one we're after
-    std::string sample_read_name = sample_read_names.front();
-    std::vector<float> samples = f_p->get_raw_samples(sample_read_name);
-
-    // convert samples to scrappie's format (for event detection)
-    raw_table rt;
-    rt.n = samples.size();
-    rt.start = 0;
-    rt.end = samples.size() - 1;
-    rt.raw = (float*)malloc(sizeof(float) * samples.size());
-    for(size_t i = 0; i < samples.size(); ++i) {
-        rt.raw[i] = samples[i];
-    }
+    raw_table rt = fast5_get_raw_samples(hdf5_file, channel_params);
 
     // trim using scrappie's internal method
     // parameters taken directly from scrappie defaults
@@ -344,6 +335,14 @@ void SquiggleRead::load_from_raw(const uint32_t flags)
         start_time += length_in_seconds;
     }
 
+    if(flags & SRF_LOAD_RAW_SAMPLES) {
+        this->samples.resize(rt.n);
+        for(size_t i = 0; i < this->samples.size(); ++i) {
+            assert(rt.start + i < rt.n);
+            this->samples[i] = rt.raw[rt.start + i];
+        }
+    }
+
     // If sequencing RNA, reverse the events to be 3'->5'
     if(this->nucleotide_type == SRNT_RNA) {
         std::reverse(this->events[strand_idx].begin(), this->events[strand_idx].end());
@@ -416,7 +415,6 @@ void SquiggleRead::load_from_raw(const uint32_t flags)
         this->events_per_base[strand_idx] = 0.0f;
         g_failed_alignment_reads += 1;
     }
-    g_total_reads += 1;
 
     // Filter poor quality reads that have too many "stays"
     if(!this->events[strand_idx].empty() && this->events_per_base[strand_idx] > 5.0) {
@@ -850,7 +848,6 @@ std::vector<EventRangeForBase> SquiggleRead::read_reconstruction(const std::stri
     std::string classification = !out_event_map.empty() ? "goodread" : "badread";
     fprintf(stderr, "[final] %s %zu out of %zu kmers mismatch (%.2lf) classification: %s\n", this->fast5_path.c_str(), distinct_mismatches, n_read_kmers, mismatch_rate, classification.c_str());
 #endif
-    g_total_reads += 1;
     return out_event_map;
 }
 


=====================================
src/nanopolish_squiggle_read.h
=====================================
--- a/src/nanopolish_squiggle_read.h
+++ b/src/nanopolish_squiggle_read.h
@@ -302,7 +302,7 @@ class SquiggleRead
         void load_from_events(const uint32_t flags);
 
         // Load all read data from raw samples
-        void load_from_raw(const uint32_t flags);
+        void load_from_raw(hid_t hdf5_file, const uint32_t flags);
 
         // Version-specific intialization functions
         void _load_R7(uint32_t si);



View it on GitLab: https://salsa.debian.org/med-team/nanopolish/compare/64b18c354b34ff0af1072ed7991e65d907d008e8...326887cec4e0a418b5b9089c13910c75334d181b

---
View it on GitLab: https://salsa.debian.org/med-team/nanopolish/compare/64b18c354b34ff0af1072ed7991e65d907d008e8...326887cec4e0a418b5b9089c13910c75334d181b
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.alioth.debian.org/pipermail/debian-med-commit/attachments/20180217/33e8c93d/attachment-0001.html>


More information about the debian-med-commit mailing list