Determination of wfmash mapping identity (-p)
Explanation
According to the PGGB documentation, the mapping identity parameter should be determined from the divergence between the input sequences.
In our case, this is the divergence between chromosome sequences across haplotypes.
Using mash, one can get the maximum divergence from a set of chromosome sequences.
The mapping identity should be set lower than or equal to 100 - max_divergence * 100 (with max_divergence expressed as a fraction, e.g. 0.06), but the documentation authors also recommend setting it lower to account for possible underestimation of the divergence.
Thus, the mapping identity is rounded down to the nearest multiple of 5: for example, 95 for any value from 95 up to (but not including) 100, and 90 for a maximum divergence of 0.06 (100 - 6 = 94).
Code
Source: see the PGGB documentation
## Computing sequence divergence
mash triangle <chrInput>.fa.gz > chr<id>.mash_triangle.txt
## Getting the maximum divergence
sed 1,1d chr<id>.mash_triangle.txt | tr '\t' '\n' | grep chr -v | LC_ALL=C sort -g -k 1nr | uniq | head -n 1
## Computing the mapping identity (replace 0.06 with the maximum divergence obtained above)
python -c "print(int((100-0.06*100)/5)*5)"
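
The last two steps above can also be wrapped into two small shell helpers. This is only a sketch and is not taken from the PGGB documentation: the function names `max_divergence` and `mapping_identity` are hypothetical, and the awk scan assumes the `mash triangle` layout implied by the commands above (a first line to skip, then one tab-separated row per sequence with the name in the first column). Unlike the `grep chr -v` filter, it does not rely on sequence names containing `chr`.

```shell
# Hypothetical helper: print the largest distance found in a mash triangle file.
# Assumes line 1 is a header and column 1 of every other row is the sequence name.
max_divergence() {
    awk -F'\t' 'NR > 1 {
        for (i = 2; i <= NF; i++)
            if ($i + 0 > max) max = $i + 0
    } END { print max }' "$1"
}

# Hypothetical helper: turn a divergence given as a fraction (e.g. 0.06)
# into a mapping identity rounded down to a multiple of 5.
mapping_identity() {
    awk -v d="$1" 'BEGIN { printf "%d\n", int((100 - d * 100) / 5) * 5 }'
}

# Same example value as the python one-liner above
mapping_identity 0.06

# Possible chaining on a real file (not run here):
#   mapping_identity "$(max_divergence chr<id>.mash_triangle.txt)"
```

With a maximum divergence of 0.06, `mapping_identity` prints 90, the same value as the python one-liner.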