PNG DNA logo
This page details the steps I followed to re-align my Whole Genome Sequencing (WGS) I ordered from Dante Labs into format better suited for processing by YFull

NOTE: Much of the relevant and useful advise for the procedures followed on this page come from:
After the Dante Labs DNA sequencing was complete, I downloaded all of the available files, but the ones of the most value and interest were these:

-r--r--r-- 1 darren darren 49967793271 Jan 25 05:45 xxx.bam
-r--r--r-- 1 darren darren     8887488 Jan 24 08:44 xxx.bam.bai

-rw-r--r-- 1 darren darren 29787997552 Mar 19 06:11 xxx_SA_L001_R1_001.fastq.gz
-rw-r--r-- 1 darren darren 29787997552 Mar 27 21:18 xxx_SA_L001_R2_001.fastq.gz

Unfortunately, as I was to determine later, my 'R2' fastq file was improperly compressed by Dante Labs, and there was a delay while I requested the file to be re-created properly.  The easy way to test this success eventually was this way:
$ gzip --test xxx_SA_L001_R2_001.fastq.gz
Now that I had FASTQ files, I opted to process them through

$ apt install fastq

$ time fastp -i xxx_SA_L001_R1_001.fastq.gz -I xxx_SA_L001_R2_001.fastq.gz -o out.R1.fq.gz -O out.R2.fq.gz
Read1 before filtering:
total reads: 374028594
total bases: 55498491846
Q20 bases: 53576268480(96.5364%)
Q30 bases: 50562803705(91.1066%)

Read1 after filtering:
total reads: 371183026
total bases: 55099763618
Q20 bases: 53370656021(96.8619%)
Q30 bases: 50412293351(91.4928%)

Read2 before filtering:
total reads: 374028594
total bases: 55498491846
Q20 bases: 53576268480(96.5364%)
Q30 bases: 50562803705(91.1066%)

Read2 aftering filtering:
total reads: 371183026
total bases: 55099763618
Q20 bases: 53370656021(96.8619%)
Q30 bases: 50412293351(91.4928%)

Filtering result:
reads passed filter: 742366052
reads failed due to low quality: 5404220
reads failed due to too many N: 33930
reads failed due to too short: 252986
reads with adapter trimmed: 11926
bases trimmed due to adapters: 920082

Duplication rate: 22.3908%

Insert size peak (evaluated by paired-end reads): 35

JSON report: fastp.json
HTML report: fastp.html

fastp -i xxx_SA_L001_R1_001.fastq.gz -I xxx_SA_L001_R2_001.fastq.gz -o out.R1.fq.gz -O out.R2.fq.gz
fastp v0.19.6, time used: 8342 seconds

real    139m1.732s
user    417m34.937s
sys    7m1.393s

The generated output HTML page can be seen here and the JSON data can be found here.

From comments online, I suspected that the BAM file created by Dante Labs was in an older format known as hg19/GRCh37.  In order to verify this, I ran the following:

$ samtools view -H xxx.bam

This produced a report with references to 'grch37':
...
@PG    ID: Hash Table Build    VN: 01.003.044.3.4.11-hv-7    CL: dragen --build-hash-table true --ht-reference /staging/human_g1k_v37_decoy.fasta --output-directory /staging/grch37/ --enable-cnv true --enable-rna true    DS: digest_type: 1 digest: 0xF2543D4A ref_digest: 0xC2311E75 ref_index_digest: 0xB4307AF0 hash_digest: 0x7BF2A3E5
@PG    ID: DRAGEN SW build    VN: 05.021.408.3.4.11    CL: /opt/edico/bin/dragen -f -r /references/grch37 --sv-reference /references/grch37.fasta -1 /ephemeral/xxx/xxx_SA_L001_R1_001.fastq.gz -2 /ephemeral/xxx/xxx_SA_L001_R2_001.fastq.gz --output-directory /ephemeral/xxx --vc-sample-name xxx...
...


In order to get maximal accuracy and value from what I eventually intended to send to YFull, I had to create a new BAM file aligned with the newest standard known as hg38/GRCh38p13.  I could either upload my large FASTQ files to various conversion services (for a small cost), or attempt to use my moderately-powerful Linux desktop to do this on my own (with free open-source tools).  I opted for the latter :)

The first step in the process of making my own alignment was to download the reference assembly and index it.  I got the file using instructions here and using the link here.  This is the command that I used:

$ time bwa index GRCh38.p13.genome.fa.gz
[bwa_index] Pack FASTA... 47.58 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=6534235976, availableWord=471772820
[BWTIncConstructFromPacked] 10 iterations done. 99999992 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 199999992 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 299999992 characters processed.
...
[BWTIncConstructFromPacked] 690 iterations done. 6374612482 characters processed.
[BWTIncConstructFromPacked] 700 iterations done. 6398716386 characters processed.
[BWTIncConstructFromPacked] 710 iterations done. 6418572210 characters processed.
[bwt_gen] Finished constructing BWT in 710 iterations.
[bwa_index] 2418.96 seconds elapse.
[bwa_index] Update BWT... 14.75 sec
[bwa_index] Pack forward-only FASTA... 19.15 sec
[bwa_index] Construct SA from BWT and Occ... 819.65 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index hg38.fa
[main] Real time: 3477.408 sec; CPU: 3299.025 sec

real    57m57.638s
user    54m46.199s
sys    0m12.827s


The next step was to install the newest version of the 'bwa' and 'samtools' utilities.  For 'bwa', using LMDE4 (Debian 10 "Buster") this was as easy as:

$ apt install bwa

For 'samtools', getting the source from GitHub and compiling it was the only option.  Some preliminaries were needed first:

$ apt install zlib1g-dev libbz2-dev liblzma-dev autoconf libncurses5-dev

Next, I had to install hstlib-develop:

- Get 'htslib-develop' from https://github.com/samtools/htslib
- Unzip it
- CD into the unzipped directory
- Then:

autoconf
./configure
make
sudo make install

Then I had to install samtools-develop:

- Get 'samtools-develop' from https://github.com/samtools/samtools
- Unzip it
- CD into the unzipped directory
- Then:

autoconf
./configure
make
sudo make install


Actually re-aligning my FASTQ files into a new BAM file was a very resource-intensive process.  I determined that the 8gb of RAM in my desktop was not sufficient, so I ordered, waited, and installed 16gb as a replacement (I later reduced my installed memory to 12gb, since there were suspicions that my machine could not properly handle this -- despite the fact that none of the memory tests that I ran revealed any problems).

During most of the actual run, over 10gb of RAM was used, 44gb of temporary BAM files were produced (352 files!), and the 4 CPU cores in my i7-4600U Dell Chromebox were running full-tilt.  Here is what 'top' looked like while I was doing this:

top - 15:41:30 up 1 day, 41 min,  0 users,  load average: 4.24, 4.20, 4.18
Tasks: 122 total,   1 running, 121 sleeping,   0 stopped,   0 zombie
%Cpu(s): 98.3 us,  1.4 sy,  0.0 ni,  0.2 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  11950.9 total,    167.8 free,   9384.9 used,   2398.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   2232.7 avail Mem
   
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                             
1289 darren 20 0 6271004 5.8g 2328 S 384.7 49.4 5831:46 bwa
1290 darren 20 0 4213524 3.2g 2720 S 13.0 27.4 66:12.34 samtools

The actual 'bwa/samtools' run was as follows:

$ time bwa mem -t 4 GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz xxx_SA_L001_R1_001.fastq.gz xxx_SA_L001_R2_001.fastq.gz |
samtools sort -@4 -o xxx_darren_enns_grch38.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 269556 sequences (40000012 bp)...
[M::process] read 269810 sequences (40000038 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 0, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 269556 reads in 167.144 CPU sec, 41.907 real sec
...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 0, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 246530 reads in 161.221 CPU sec, 40.460 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 4 GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz xxx_SA_L001_R1_001.fastq.gz xxx_SA_L001_R2_001.fastq.gz
[main] Real time: 113629.660 sec; CPU: 448658.147 sec
[bam_sort_core] merging from 352 files and 4 in-memory blocks...

real    1966m29.925s
user    7708m9.538s
sys    12m19.457s

So this took 32.8 hours on my relatively fast machine (!), and produced a new GRCh38-aligned BAM file:
-rw-r--r-- 1 darren darren 28169047131 Mar 29 23:50 xxx_darren_enns_grch38.bam
Due to failed previous attempts, I was careful to test the integrity of the newly-generated BAM file:

$ time samtools quickcheck xxx_darren_enns_grch38.bam

real    0m0.476s
user    0m0.005s
sys    0m0.000s

$ time zcat xxx_darren_enns_grch38.bam > /dev/null

real    20m38.380s
user    18m49.310s
sys    0m15.023s


I then needed to create an index file for the new BAM file:
$ time samtools index xxx_darren_enns_grch38.bam xxx_darren_enns_grch38.bam.bai

real	14m27.191s
user	12m38.902s
sys	0m14.695s
This produced an index for the new GRCh38-aligned BAM file, which I renamed so that I could also use a different utility (bamtools) to also create an index file for comparison:
-rw-r--r-- 1 darren darren     8795176 Mar 30 08:46 xxx_darren_enns_grch38.bam.bai
At this point I had a newly-aligned BAM file that could (?) have been submitted to YFull, but the suggestion was to submit a greatly-reduced BAM file containing only Y and MT chromosome data.  This new smaller BAM file was created this way:

$ time samtools view -h -b xxx_darren_enns_grch38.bam chrY chrM > xxx_darren_enns_grch38.chrY.chrM.bam

real    0m55.136s
user    0m51.998s
sys     0m0.945s

This created the following file:

-rw-r--r-- 1 darren darren   272344643 Mar 30 10:09 xxx_darren_enns_grch38.chrY.chrM.bam

This much smaller BAM file is the one that I submitted to YFull for further analysis.  Initial assessment from them confirms that my paternal haplogroup is R-S16361, a subclade of the R-U106 Y-chromosome haplogroup. For more information on my genetic ancestry, see here.

Quick and useful analysis reports of the original (hg19) and newly-aligned (hg38) BAM files from qual.iobio.io:

PNG bam file hg19
PNG bam file hg38