fusemblr is a pipeline wrapper designed for the assembly of complex genomes using Nanopore reads and paired-end Illumina reads

SAMtoBAM/fusemblr

fusemblr was designed for the Fusarium oxysporum assembly project (hence the name).
The pipeline requires only Nanopore reads (the longer and the higher the coverage, the better) and an estimate of genome size.
Paired-end Illumina reads and PacBio HiFi reads are optional.

Note: Providing Illumina and/or PacBio HiFi reads had very little impact on the resulting assemblies for our Fusarium oxysporum datasets, as our ONT data were recently basecalled, had high coverage and included a good subset of long reads.

Easy installation

conda install samtobam::fusemblr

Container image

docker pull ghcr.io/samtobam/fusemblr:latest

How to run

fusemblr.sh -n nanopore.fq.gz -g 70000000

Required inputs:
-n | --nanopore		Nanopore long reads used for assembly, in fastq or fasta format (*.fastq / *.fq); can be gzipped (*.gz)
-g | --genomesize	Estimate of genome size in bp, required for downsampling and assembly

Recommended inputs:
-1 | --pair1		Paired-end Illumina reads in fastq format; first pair. Used for Ratatosk polishing. Can be gzipped (*.gz)
-2 | --pair2		Paired-end Illumina reads in fastq format; second pair. Used for Ratatosk polishing. Can be gzipped (*.gz)
-h | --hifi		PacBio HiFi reads, required for assembly polishing with NextPolish2 (recommended if available)
-t | --threads		Number of threads for tools that accept this option (Default: 1)

PAQman-specific parameters:
-b | --buscodb			BUSCO database used for assembly validation (Default: Eukaryota)
-r | --telomererepeat	Single telomeric repeat used to calculate telomerality (Default: TTAGGG)

Optional parameters:
-m | --minsize		Minimum size of reads to keep during downsampling (Default: 5000)
-x | --coverage		The amount of coverage for downsampling (X), based on genome size, i.e. coverage*genomesize (Default: 100)
-v | --minovl		Minimum overlap for Flye assembly (Default: Calculated during run as N90 of reads used for assembly)
-w | --weight		The weighting used by Filtlong for selecting reads; balancing the length vs the quality (Default: 5)
-p | --prefix		Prefix for output (Default: name of nanopore reads file (-n) before the fastq suffix)
-o | --output		Name of output folder for all results (Default: fusemblr_output)
-c | --cleanup		Remove the large intermediate files produced by each tool, which can take up a lot of space. Choose 'yes' or 'no' (Default: 'yes')
-h | --help		Print this help message
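
As a rough illustration of how the options above combine, the sketch below builds a fuller command line and prints it as a dry run. The input file names, the thread count and the prefix "foxy" are placeholders chosen for this example, not values the pipeline expects.

```shell
# Placeholder input files; replace with your own data.
nanopore="nanopore.fq.gz"
pair1="illumina_R1.fq.gz"
pair2="illumina_R2.fq.gz"

# Collect the arguments first so the command can be inspected before launching.
set -- fusemblr.sh \
  -n "$nanopore" -g 70000000 \
  -1 "$pair1" -2 "$pair2" \
  -t 16 -x 100 -m 5000 \
  -p foxy -o fusemblr_output

# Dry run: print the command instead of executing it.
cmd="$*"
echo "$cmd"
```

To launch the run instead of printing it, execute "$@" in place of the final echo.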

Pipeline in 7 steps:

1. Downsampling of reads to a designated coverage using Filtlong

    -default is set to 100X (-x), which provided better assemblies than the typical 30-50X
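
The downsampling target follows from the coverage*genomesize rule given for -x. The sketch below works through the arithmetic for the example genome size of 70 Mb and prints a Filtlong command that such a target could feed into. The Filtlong call is an assumption about how the target might be passed (using Filtlong's --min_length and --target_bases options), not a copy of what fusemblr runs internally, and the file names are placeholders.

```shell
genomesize=70000000   # -g, estimated genome size in bp
coverage=100          # -x, default target coverage
minsize=5000          # -m, default minimum read length

# Number of bases to keep after downsampling: coverage * genomesize
target_bases=$((coverage * genomesize))
echo "target bases: $target_bases"

# Assumed Filtlong call, printed rather than executed; file names are placeholders.
echo "filtlong --min_length $minsize --target_bases $target_bases nanopore.fq.gz | gzip > downsampled.fq.gz"
```

With the defaults this keeps 7 Gb of reads for a 70 Mb genome.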

2. Optional: Polishing of downsampled reads with the paired-end illumina reads using Ratatosk

    -uses a baseline quality score (-Q) of 90 and therefore assumes reasonably recent ONT data (e.g. R10 or high-accuracy basecalling)

3. Genome Assembly

3.a. Assembly with Flye

    -removed the hard-coded maximum value for the minimum overlap threshold (previously 10kb)
    -by default the minimum overlap value is automatically provided as the read N90 after polishing
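
To make the read N90 concrete: it is the length L such that reads of length L or longer contain at least 90% of all bases. The helper below is a minimal sketch of that calculation using sort and awk; the function name read_n90 is made up for this example and this is not necessarily how fusemblr computes the value.

```shell
# Print the N90 of a list of read lengths (one integer per line on stdin).
read_n90() {
  sort -rn | awk '
    { len[NR] = $1; total += $1 }
    END {
      for (i = 1; i <= NR; i++) {
        cum += len[i]
        # compare cum/total against 9/10 without a fractional threshold
        if (cum * 10 >= total * 9) { print len[i]; exit }
      }
    }'
}

# Toy example: total 30 bp, 90% is 27 bp, which is first reached at the 4 bp read
printf '%s\n' 10 8 6 4 2 | read_n90
```

For a gzipped four-line FASTQ, the lengths could be produced with: gzip -dc reads.fq.gz | awk 'NR % 4 == 2 { print length($0) }' | read_n90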

3.b. Assembly with Hifiasm

    -if HiFi reads are provided: uses the --ul option, with both polished ONT and HiFi reads
    -without HiFi reads: uses the --ont option, with only the polished ONT reads
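
The choice between the two Hifiasm modes can be sketched as a simple branch. The function below only prints the command it would run; the function name, file names and output prefix are placeholders for this example, and the --ont mode assumes a Hifiasm release recent enough to support it.

```shell
# Print the Hifiasm command for the given inputs without running it.
# $1: polished ONT reads; $2: HiFi reads (empty string if none); $3: threads
hifiasm_cmd() {
  ont="$1"; hifi="$2"; threads="$3"
  if [ -n "$hifi" ]; then
    # HiFi available: HiFi reads are the main input, ONT reads are passed via --ul
    echo "hifiasm -t $threads -o hifiasm_out --ul $ont $hifi"
  else
    # ONT only: assemble directly from the polished ONT reads
    echo "hifiasm -t $threads -o hifiasm_out --ont $ont"
  fi
}

hifiasm_cmd ont.polished.fq.gz hifi.fq.gz 16
hifiasm_cmd ont.polished.fq.gz "" 16
```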

4. 'Patch' the Flye assembly (target) using the Hifiasm assembly (query) with Ragtag

    -uses a minimum unique alignment length (-f) of 25000 to be conservative during patching
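
A patching call with this threshold might look like the sketch below, which prints the command rather than running it. The assembly file names and the output directory are placeholders; only the -f value and the target/query roles come from the description above.

```shell
flye_asm="flye.fasta"        # target: assembly to be patched (placeholder name)
hifiasm_asm="hifiasm.fasta"  # query: assembly used for patching (placeholder name)
min_unique_aln=25000         # -f, minimum unique alignment length

# ragtag.py patch takes the target first, then the query
ragtag_cmd="ragtag.py patch -f $min_unique_aln -o ragtag_patch $flye_asm $hifiasm_asm"
echo "$ragtag_cmd"
```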

5. Optional: Polishing of assembly with PacBio HiFi and paired-end Illumina reads using NextPolish2

6. Filtering (minimum contig length 10kb), reordering and renaming using Seqkit and awk
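
The filter, reorder and rename logic of step 6 can be illustrated with plain awk and sort. This is a stand-in sketch that does not use Seqkit and is not the pipeline's own code; the function name and the contig naming scheme (prefix_1, prefix_2, ...) are assumptions made for this example.

```shell
# Hypothetical helper: reads FASTA on stdin, keeps contigs of at least $1 bp,
# orders them longest first and renames them <prefix>_1, <prefix>_2, ...
filter_sort_rename() {
  awk -v min="$1" '
    # print "length<TAB>sequence" for the contig collected so far, if long enough
    function emit() { if (seq != "" && length(seq) >= min + 0) print length(seq) "\t" seq }
    /^>/ { emit(); seq = ""; next }
    { seq = seq $0 }
    END { emit() }
  ' | sort -t "$(printf '\t')" -k1,1nr \
    | awk -F '\t' -v prefix="$2" '{ print ">" prefix "_" NR; print $2 }'
}

# Toy example with a 4 bp cutoff; the pipeline itself uses 10000
printf '>a\nACGT\n>b\nACGTACGTAC\nGT\n>c\nAC\n' | filter_sort_rename 4 contig
```

Each sequence is written on a single line, so the output is unwrapped FASTA.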

7. Comprehensive evaluation of all assemblies using PAQman
