dare dna

This page details the steps I followed to re-align my Whole Genome Sequencing (WGS) I ordered from Dante Labs into format better suited for processing by YFull.

NOTE: Much of the relevant and useful advise for the procedures followed on this page come from:

After the Dante Labs DNA sequencing was complete, I downloaded all of the available files, but the ones of the most value and interest were these:

-r--r--r-- 1 darren darren 49967793271 Jan 25 05:45 xxx.bam-r--r--r-- 1 darren darren 8887488 Jan 24 08:44 xxx.bam.bai-rw-r--r-- 1 darren darren 29787997552 Mar 19 06:11 xxx_SA_L001_R1_001.fastq.gz -rw-r--r-- 1 darren darren 29787997552 Mar 27 21:18 xxx_SA_L001_R2_001.fastq.gz Unfortunately, as I was to determine later, my 'R2' fastq file was improperly compressed by Dante Labs, and there was a delay while I requested the file to be re-created properly. The easy way to test this success eventually was this way:$ gzip --test xxx_SA_L001_R2_001.fastq.gzNow that I had FASTQ files, I opted to process them through
$ apt install fastq $ time fastp -i xxx_SA_L001_R1_001.fastq.gz -I xxx_SA_L001_R2_001.fastq.gz -o out.R1.fq.gz -O out.R2.fq.gz Read1 before filtering: total reads: 374028594 total bases: 55498491846 Q20 bases: 53576268480(96.5364%) Q30 bases: 50562803705(91.1066%) Read1 after filtering: total reads: 371183026 total bases: 55099763618 Q20 bases: 53370656021(96.8619%) Q30 bases: 50412293351(91.4928%) Read2 before filtering: total reads: 374028594 total bases: 55498491846 Q20 bases: 53576268480(96.5364%) Q30 bases: 50562803705(91.1066%) Read2 aftering filtering: total reads: 371183026 total bases: 55099763618 Q20 bases: 53370656021(96.8619%) Q30 bases: 50412293351(91.4928%) Filtering result: reads passed filter: 742366052 reads failed due to low quality: 5404220 reads failed due to too many N: 33930 reads failed due to too short: 252986 reads with adapter trimmed: 11926 bases trimmed due to adapters: 920082 Duplication rate: 22.3908% Insert size peak (evaluated by paired-end reads): 35 JSON report: fastp.json HTML report: fastp.html fastp -i xxx_SA_L001_R1_001.fastq.gz -I xxx_SA_L001_R2_001.fastq.gz -o out.R1.fq.gz -O out.R2.fq.gz fastp v0.19.6, time used: 8342 seconds real 139m1.732s user 417m34.937s sys 7m1.393sThe generated output HTML page can be seen here and the JSON data can be found here.

From comments online, I suspected that the BAM file created by Dante Labs was in an older format known as hg19/GRCh37. In order to verify this, I ran the following:

$ samtools view -H xxx.bam

This produced a report with references to 'grch37':
...@PG ID: Hash Table Build VN: 01.003.044.3.4.11-hv-7 CL: dragen --build-hash-table true --ht-reference /staging/human_g1k_v37_decoy.fasta --output-directory /staging/grch37/ --enable-cnv true --enable-rna true DS: digest_type: 1 digest: 0xF2543D4A ref_digest: 0xC2311E75 ref_index_digest: 0xB4307AF0 hash_digest: 0x7BF2A3E5@PG ID: DRAGEN SW build VN: 05.021.408.3.4.11 CL: /opt/edico/bin/dragen -f -r /references/grch37 --sv-reference /references/grch37.fasta -1 /ephemeral/xxx/xxx_SA_L001_R1_001.fastq.gz -2 /ephemeral/xxx/xxx_SA_L001_R2_001.fastq.gz --output-directory /ephemeral/xxx --vc-sample-name xxx......

In order to get maximal accuracy and value from what I eventually intended to send to YFull, I had to create a new BAM file aligned with the newest standard known as hg38/GRCh38p13. I could either upload my large FASTQ files to various conversion services (for a small cost), or attempt to use my moderately-powerful Linux desktop to do this on my own (with free open-source tools). I opted for the latter :)

The first step in the process of making my own alignment was to download the reference assembly and index it. I got the file using instructions here and using the link here. This is the command that I used:

$ time bwa index GRCh38.p13.genome.fa.gz[bwa_index] Pack FASTA... 47.58 sec[bwa_index] Construct BWT for the packed sequence...[BWTIncCreate] textLength=6534235976, availableWord=471772820[BWTIncConstructFromPacked] 10 iterations done. 99999992 characters processed.[BWTIncConstructFromPacked] 20 iterations done. 199999992 characters processed.[BWTIncConstructFromPacked] 30 iterations done. 299999992 characters processed....[BWTIncConstructFromPacked] 690 iterations done. 6374612482 characters processed.[BWTIncConstructFromPacked] 700 iterations done. 6398716386 characters processed.[BWTIncConstructFromPacked] 710 iterations done. 6418572210 characters processed.[bwt_gen] Finished constructing BWT in 710 iterations.[bwa_index] 2418.96 seconds elapse.[bwa_index] Update BWT... 14.75 sec[bwa_index] Pack forward-only FASTA... 19.15 sec[bwa_index] Construct SA from BWT and Occ... 819.65 sec[main] Version: 0.7.17-r1188[main] CMD: bwa index hg38.fa[main] Real time: 3477.408 sec; CPU: 3299.025 secreal 57m57.638suser 54m46.199ssys 0m12.827s

The next step was to install the newest version of the 'bwa' and 'samtools' utilities. For 'bwa', using LMDE4 (Debian 10 "Buster") this was as easy as:

$ apt install bwa

For 'samtools', getting the source from GitHub and compiling it was the only option. Some preliminaries were needed first:

$ apt install zlib1g-dev libbz2-dev liblzma-dev autoconf libncurses5-dev
Next, I had to install hstlib-develop:

- Get 'htslib-develop' from https://github.com/samtools/htslib
- Unzip it
- CD into the unzipped directory
- Then:

autoconf./configuremakesudo make install
Then I had to install samtools-develop:

- Get 'samtools-develop' from https://github.com/samtools/samtools
- Unzip it
- CD into the unzipped directory
- Then:

autoconf./configuremakesudo make install

Actually re-aligning my FASTQ files into a new BAM file was a very resource-intensive process. I determined that the 8gb of RAM in my desktop was not sufficient, so I ordered, waited, and installed 16gb as a replacement (I later reduced my installed memory to 12gb, since there were suspicions that my machine could not properly handle this -- despite the fact that none of the memory tests that I ran revealed any problems).

During most of the actual run, over 10gb of RAM was used, 44gb of temporary BAM files were produced (352 files!), and the 4 CPU cores in my i7-4600U Dell Chromebox were running full-tilt. Here is what 'top' looked like while I was doing this:

top - 15:41:30 up 1 day, 41 min, 0 users, load average: 4.24, 4.20, 4.18Tasks: 122 total, 1 running, 121 sleeping, 0 stopped, 0 zombie%Cpu(s): 98.3 us, 1.4 sy, 0.0 ni, 0.2 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 stMiB Mem : 11950.9 total, 167.8 free, 9384.9 used, 2398.2 buff/cacheMiB Swap: 0.0 total, 0.0 free, 0.0 used. 2232.7 avail Mem


                
                  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                             
 1289 darren    20   0 6271004   5.8g   2328 S 384.7  49.4   5831:46 bwa                                                                                                 
 1290 darren    20   0 4213524   3.2g   2720 S  13.0  27.4  66:12.34 samtools   

                The actual 'bwa/samtools' run was
                    as follows:

                    

                    $ time bwa mem -t 4
                    GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
                    xxx_SA_L001_R1_001.fastq.gz
                    xxx_SA_L001_R2_001.fastq.gz | 

                    samtools sort -@4 -o xxx_darren_enns_grch38.bam

                    

                    [M::bwa_idx_load_from_disk] read 0 ALT contigs

                    [M::process] read 269556 sequences (40000012 bp)...

                    [M::process] read 269810 sequences (40000038 bp)...

                    [M::mem_pestat] # candidate unique pairs for (FF,
                    FR, RF, RR): (0, 0, 0, 0)

                    [M::mem_pestat] skip orientation FF as there are not
                    enough pairs

                    [M::mem_pestat] skip orientation FR as there are not
                    enough pairs

                    [M::mem_pestat] skip orientation RF as there are not
                    enough pairs

                    [M::mem_pestat] skip orientation RR as there are not
                    enough pairs

                    [M::mem_process_seqs] Processed 269556 reads in
                    167.144 CPU sec, 41.907 real sec

                    ...

                    [M::mem_pestat] # candidate unique pairs for (FF,
                    FR, RF, RR): (0, 0, 0, 0)

                    [M::mem_pestat] skip orientation FF as there are not
                    enough pairs

                    [M::mem_pestat] skip orientation FR as there are not
                    enough pairs

                    [M::mem_pestat] skip orientation RF as there are not
                    enough pairs

                    [M::mem_pestat] skip orientation RR as there are not
                    enough pairs

                    [M::mem_process_seqs] Processed 246530 reads in
                    161.221 CPU sec, 40.460 real sec

                    [main] Version: 0.7.17-r1188

                    [main] CMD: bwa mem -t 4
                    GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
                    xxx_SA_L001_R1_001.fastq.gz
                    xxx_SA_L001_R2_001.fastq.gz

                    [main] Real time: 113629.660 sec; CPU: 448658.147
                    sec

                    [bam_sort_core] merging from 352 files and 4
                    in-memory blocks...

                    

                    real    1966m29.925s

                    user    7708m9.538s

                    sys    12m19.457s

So this took 32.8 hours on my relatively fast machine (!), and produced a new GRCh38-aligned BAM file:
-rw-r--r-- 1 darren darren 28169047131 Mar 29 23:50 xxx_darren_enns_grch38.bam

Due to failed previous attempts, I was careful to test the integrity of the newly-generated BAM file:

$ time samtools quickcheck xxx_darren_enns_grch38.bamreal 0m0.476suser 0m0.005ssys 0m0.000s$ time zcat xxx_darren_enns_grch38.bam > /dev/nullreal 20m38.380suser 18m49.310ssys 0m15.023s

I then needed to create an index file for the new BAM file:


                
                $ time samtools index xxx_darren_enns_grch38.bam xxx_darren_enns_grch38.bam.bai

real	14m27.191s
user	12m38.902s
sys	0m14.695s

This produced an index for the new GRCh38-aligned BAM file, which I renamed so that I could also use a different utility (bamtools) to also create an index file for comparison:
-rw-r--r-- 1 darren darren 8795176 Mar 30 08:46 xxx_darren_enns_grch38.bam.bai

At this point I had a newly-aligned BAM file that could (?) have been submitted to YFull, but the suggestion was to submit a greatly-reduced BAM file containing only Y and MT chromosome data. This new smaller BAM file was created this way:

$ time samtools view -h -b xxx_darren_enns_grch38.bam chrY chrM > xxx_darren_enns_grch38.chrY.chrM.bam real 0m55.136s user 0m51.998s sys 0m0.945s
This created the following file:

-rw-r--r-- 1 darren darren 272344643 Mar 30 10:09 xxx_darren_enns_grch38.chrY.chrM.bamThis much smaller BAM file is the one that I submitted to YFull for further analysis. Initial assessment from them confirms that my paternal haplogroup is R-S16361, a subclade of the R-U106 Y-chromosome haplogroup. For more information on my genetic ancestry, see here.

Quick and useful analysis reports of the original (hg19) and newly-aligned (hg38) BAM files from qual.iobio.io: