I cannot help but quote this great excuse for my not doing BQSR: I do not have a definitive dbSNP for parasites.
- You should definitely not run base quality score recalibration without a dbSNP reference. BQSR works as follows:
- 1) Run through all mapped reads looking for bases that mismatch the reference genome at positions not listed in dbSNP. (GATK assumes that mismatches at dbSNP sites are real variants that are being sequenced correctly, and that mismatches at sites not in dbSNP are sequencing errors. This is a decent approximation for statistics like these.)
- 2) Compute statistics on where these mismatches occur (e.g. do they occur near the ends of reads, in certain dinucleotide pairs, or on bases with low quality scores?).
- 3) Using the statistics gathered above, rewrite the quality scores for all bases in your reads with new empirical quality scores. To give an example, take the set of all CT dinucleotides with quality 20 at read position 10. A quality score of 20 indicates a 1/100 chance of error. But let’s say these dinucleotides actually mismatch the reference at a rate of only 1/500. This would mean their ‘empirical’ score is higher than the quality score reported by the machine, and so the quality score for those bases is overwritten with a higher score (27 in this example, since -10 × log10(1/500) ≈ 27).
- Basically, this process serves to eliminate systematic biases in quality score assignment from sequencers. It’s quite helpful, but it’s absolutely dependent on having an accurate database of polymorphisms for your organism. If you don’t have that database, GATK can’t tell the difference between polymorphisms and sequencing errors, and so it’ll assume all mismatches are sequencing errors, which will cause it to assign far lower quality scores than your bases deserve. This will ruin all downstream analysis. Don’t do it!
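To make the arithmetic in the quoted steps concrete, here is a minimal Python sketch of the idea. It is not GATK's implementation: the function names, the toy data, and the `(m + 1) / (n + 2)` smoothing are all assumptions made for illustration. It tallies mismatches for a single covariate bin (CT dinucleotide, reported quality 20, read position 10), once with true polymorphic sites masked and once without, and converts each mismatch rate to a Phred score.

```python
import math
from collections import defaultdict


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Convert an error probability to a (fractional) Phred quality score."""
    return -10 * math.log10(p)


def empirical_qualities(observations, known_sites=frozenset()):
    """Return {bin: empirical Phred score} (hypothetical helper, not a GATK API).

    observations: iterable of (position, bin, is_mismatch) tuples, where bin is
    any hashable covariate key, e.g. (dinucleotide, reported_quality, cycle).
    known_sites: positions treated as real polymorphisms and skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total]
    for position, bin_key, is_mismatch in observations:
        if position in known_sites:
            continue  # mismatches here are assumed to be real variation
        counts[bin_key][0] += int(is_mismatch)
        counts[bin_key][1] += 1
    # (m + 1) / (n + 2) is simple smoothing chosen for this sketch so that a
    # bin with zero mismatches does not produce log10(0).
    return {k: round(error_to_phred((m + 1) / (n + 2)))
            for k, (m, n) in counts.items()}


# Toy data (made up): 5000 bases in one bin, 10 sequencing errors (1/500),
# plus 50 positions where the sample truly differs from the reference.
BIN = ("CT", 20, 10)
errors = set(range(0, 5000, 500))    # 10 positions
variants = set(range(7, 5000, 100))  # 50 positions, disjoint from errors
observations = [(pos, BIN, pos in errors or pos in variants)
                for pos in range(5000)]

with_db = empirical_qualities(observations, known_sites=variants)[BIN]
without_db = empirical_qualities(observations)[BIN]
print("reported:", BIN[1], "with known sites:", with_db, "without:", without_db)
```

With the known sites masked, the bin is recalibrated upward from 20 to 27, matching the quoted example. Without them, the 50 real variants are counted as errors and the same bin comes out at 19 in this toy data. In an organism with dense, uncatalogued polymorphism the drop would be larger still, which is exactly the failure the quote warns about.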