diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml index 828b4de38372893b62b056b337a7f89695056f93..259e6b29aa55f7b3714bc89450cf435e53498192 100644 --- a/.gitlab-ci.yml +++ b/.gitlab-ci.yml @@ -23,7 +23,7 @@ shallow_run_workflow: - cp example/*.fa.gz data/haplotypes - HOME=$(pwd) - rm config.yaml && mv example/config_CICD.yaml config.yaml - - apptainer run --bind $HOME:/root appimgs/snakebox.sif snakemake -c1 --dag > workflow.dot + - apptainer run --bind $HOME:/root appimgs/snakebox.sif snakemake --debug -c1 --dag > workflow.dot artifacts: paths: - workflow.dot diff --git a/README.md b/README.md index 04529fb976ad417e3687aeeb912b73fc1c1548c8..6915e8b3e7c6228b093cf849b071f0faab8c1e1b 100644 --- a/README.md +++ b/README.md @@ -3,50 +3,81 @@ Pan1c : a snakemake workflow for creating pangenomes at chromosomic scale. The workflow use a set of apptainer images : - PanGeTools (Odgi, VG, ...): https://forgemia.inra.fr/alexis.mergez/pangetools -- PanGraTools (PGGB): https://forgemia.inra.fr/alexis.mergez/pangratools - Pan1c-Apps (Python, Snakemake): https://forgemia.inra.fr/alexis.mergez/pan1capps > An example of input files and a config file is available in `example/`. +# Minimum image version + +- PanGeTools >= v1.10.10 +- Pan1c-env >= v1.1.1 +- Pan1c-box >= v1.1.2 +- minigraph-cactus >= v2.1.4b + # Prepare your data + This workflow can take chromosome level assemblies as well as contig level assemblies but requires a reference assembly. -**Fasta files need to be compressed** using **bgzip2** (included in [PanGeTools](https://forgemia.inra.fr/alexis.mergez/pangetools)). -Sequences names of the reference **must follow this pattern** : `<sample>#<haplotype>#<contig or chromosome name>`. +**Fasta files need to be compressed** using **bgzip** (included in [PanGeTools](https://forgemia.inra.fr/alexis.mergez/pangetools)). +Sequences names of the reference **must follow this pattern** : +`<sample>#<haplotype>#<contig or chromosome name>` For example, CHM13 chromosomes (haploïd) must be named `CHM13#1#chr..`. Only the reference needs to follow this pattern for its sequence names. Others haplotypes sequences will be renamed based on the reference and their respective fasta file names. -Fasta files **must also follow a pattern** : `<sample>.hap<haplotype>.fa.gz`. Once again with CHM13, the fasta file should be named : `CHM13.hap1.fa.gz`. +Fasta files **must also follow a pattern** : +`<sample>.hap<haplotype>.fa.gz` +Once again with CHM13, the fasta file should be named : `CHM13.hap1.fa.gz`. See [PanSN](https://github.com/pangenome/PanSN-spec) for more info on sequence naming. You should only provide chromosome-level assemblies, but, as the haplotypes are renamed using RagTag, it is possible to give scaffold or contig-level assemblies. Since RagTag scaffolds each assemblies using the "reference" haplotype, it can scaffold chromosome-level assemblies that also contains non-placed scaffold/contigs. If you don't want this behavior, prune your FASTAs from any non-chromosome-level sequences **before** providing them to Pan1c. # Download apptainer images + Before running the worflow, some apptainer images needs to be downloaded. Use the script getApps.sh to do so : ``` ./getApps.sh -a <apps directory> ``` -> Make sure to use the latest version or the workflow might return you errors ! +> Make sure to use the latest version or the workflow might return you errors. # Running the workflow + Clone this repository and create a `data/haplotypes` directory where you will place all your haplotypes. 
Update the reference name and the apptainer image directory in `config.yaml`. Then, modify the variables in `runSnakemake.sh` to match your requirements (number of threads, memory, job name, email, etc.). + +## Single machine mode + Navigate to the root directory of the repository and execute `sbatch runSnakemake.sh`! +The default script uses a single node and runs everything on it. This method only require apptainer to run but isn't the most efficient for job distribution. + +## Cluster execution + +To execute each steps as a job with SLURM, install a custom conda environement with this command : +``` +conda create -n Pan1c -c conda-forge -c bioconda snakemake=8.4.7 snakemake-executor-plugin-slurm +``` +This works by having a job that runs snakemake which will submit other jobs. To do so, configure `runSnakemakeSLURM.sh` and submit it using `sbatch`. +> If you get OOM errors, use the mem_multiplier in `config.yaml` to allocate more memory for jobs. + +# Pan1c_View + +[Pan1c_View](https://forgemia.inra.fr/philippe.bardou/pan1c_view) is an interface developped by Philippe Bardou, used to visualize different statistics generated from Pan1c pangenome graphs. +To use it, extract the Pan1c_View tarball generated by the workflow in the `project` folder of Pan1c_View, then follow [these](https://forgemia.inra.fr/philippe.bardou/pan1c_view#installation) instructions. + +# Main outputs -# Outputs The workflow generates several key files : -- Aggregated graph including every chromosome scale graphs (`output/pan1c.<panname>.gfa`) -- Chromosome scale graphs (`data/chrGraphs/chr<id>.gfa`) -- Panacus html reports for each chromosome level graph (`output/panacus.reports/chr<id>.histgrowth.html`) +- Aggregated graph including every chromosome scale graphs (`output/Pan1c.<gtool>.<pangenome_name>.gfa.gz`) +- Chromosome scale graphs (`data/chrGraphs/Pan1c.<gtool>.<pangenome_name>.<chromosome>.gfa.gz`) +- Panacus html reports for each chromosome level graph (`output/panacus.reports/Pan1c.<gtool>.<pangenome_name>.<chromosome>.histgrowth.html`) - Statistics on input sequences, graphs and resources used by the workflow (`output/stats`) - Odgi 1D visualization of chromosome level graphs (`output/chrGraphs.figs`) -- (Optional) SyRI structural variant figures (`output/asm.syri.figs`) -- (Optional) Quast results on your input haplotypes (`output/quast`) -- (Optional) Contig composition of chromosomes of your input haplotypes (`output/hap.contig`) -- (optional) PAV matrices for each chromosome graph (`output/pav.matrices/chr<id>.pav.matrix.tsv`) +- (Optional) Pan1c-View tarball (`output/<pangenome_name>.Pan1c-View.data.tar.gz`) +- (Optional) SyRI structural variant figures (`output/asm.syri.figs`, `chrInput.syri.figs`) +- (Optional) Quast results on your input haplotypes (`output/Pan1c.<pangenome_name>.quast.report.html`) +- (Optional) Contig composition of chromosomes of your input haplotypes (`output/chr.contig`) # File architecture -## Before running the workflow + ``` Pan1c/ ├── config.yaml @@ -60,60 +91,26 @@ Pan1c/ ├── getApps.sh ├── README.md ├── runSnakemake.sh +├── runSnakemakeSLURM.sh ├── scripts │  └── ... └── Snakefile ``` -## After the workflow (Arabidopsis Thaliana example) -The following tree is non-exhaustive for clarity. Temporary files are not listed, but key files are included. -The name of the pangenome is `06AT-v3`. 
-``` -Pan1c-06AT-v3 -├── chrInputs -│ -├── config.yaml -├── data -│ ├── chrGraphs -│ │ ├── chr<id> -│ │ ├── chr<id>.gfa -│ │ └── graphsList.txt -│ ├── chrInputs -│ │ └── chr<id>.fa.gz -│ ├── haplotypes -│ └── hap.ragtagged -│ ├── <sample>.hap<hid> -│ └── <sample>.hap<hid>.ragtagged.fa.gz -├── logs -│ ├── pan1c.pggb.06AT-v3.logs.tar.gz -│ └── pggb -│ ├── chr<id>.pggb.cmd.log -│ └── chr<id>.pggb.time.log -├── output -│ ├── figures -│ │ ├── chr<id>.1Dviz.png -│ │ └── chr<id>.pcov.png -│ ├── stats -│ │ ├── pan1c.pggb.06AT-v3.core.stats.tsv -│ │ ├── pan1c.pggb.06AT-v3.chrGraph.general.stats.tsv -│ │ └── pan1c.pggb.06AT-v3.chrGraph.path.stats.tsv -│ ├── pan1c.pggb.06AT-v3.gfa -│ ├── panacus.reports -│ │ └── chr<id>.histgrowth.html -│ └── chrGraphs.stats -│ └── chr<id>.stats.tsv -├── Pan1c-06AT-v3.log -├── README.md -├── runSnakemake.sh -├── scripts -│  └── ... -├── Snakefile -└── workflow.svg -``` # Example DAG (Saccharomyces cerevisiae example) + This DAG shows the worflow for a pangenome of `Saccharomyces cerevisiae` using the `R64` reference.  +# Authors and acknowledgment + +- Alexis Mergez +- Martin Racoupeau +- Christophe Klopp +- Christine Gaspin +- Fabrice Legeai + # Contact -[alexis.mergez@inrae.fr](mailto:alexis.mergez@inrae.fr) + +[pan1c@inrae.fr](mailto:pan1c@inrae.fr) diff --git a/Snakefile b/Snakefile index 71c2bb3b207ac9f8491c82da8a83fe7a1c67fd55..5b2642e4c48c088a14c500252b011d29d0a7a365 100644 --- a/Snakefile +++ b/Snakefile @@ -2,6 +2,7 @@ configfile: "config.yaml" include: "rules/tools.smk" +ruleorder: odgi_postprocessing > graph_squeeze > run_bgzip ## Modules import os @@ -35,41 +36,48 @@ nHAP = len(SAMPLES) with gzip.open("data/haplotypes/"+config['reference'], "r") as handle: CHRLIST = [line.decode().split("#")[-1].split('\n')[0] for line in handle.readlines() if line.decode()[0] == ">"] +graph_tools = (config["run_PGGB"] == "True")*["PGGB"] + (config["run_MC"] == "True")*["MC"] + # Adding optionnal output based on config.yaml, using the following function def which_analysis(): ## Default analysis analysis_inputs = [ - "output/stats/pan1c."+config['name']+".core.stats.tsv", # core stats - expand("output/panacus.reports/"+config['name']+".{chromosome}.histgrowth.html", chromosome=CHRLIST), # panacus histgrowth - expand("output/chrGraphs.figs/"+config['name']+".{chromosome}.1Dviz.png", chromosome=CHRLIST), # visualizations from odgi on chromosome graphs - "output/stats/pan1c."+config['name']+".chrGraph.general.stats.tsv" # chromosomes graph statistics + expand("output/panacus.reports/Pan1c.{gtool}."+config['name']+".{chromosome}.histgrowth.html", chromosome=CHRLIST, gtool=graph_tools), # panacus histgrowth + expand("output/chrGraphs.figs/Pan1c.{gtool}."+config['name']+".{chromosome}.1Dviz.png", chromosome=CHRLIST, gtool=graph_tools), # visualizations from odgi on chromosome graphs + expand("output/stats/Pan1c.{gtool}."+config['name']+".chrGraph.general.stats.tsv", gtool=graph_tools) # chromosomes graph statistics ] ## Optionals analysis steps - if config["get_PAV"] == "True": # Adding PAV matrix creation - analysis_inputs.append("output/pav.matrices") - if config["get_ASMs_SyRI"] == "True": # Creating SyRI for each input assembly analysis_inputs.append( - expand("output/asm.syri.figs/pan1c."+config['name']+".{haplotype}.syri.png", haplotype=SAMPLES_NOREF) + expand("output/asm.syri.figs/Pan1c."+config['name']+".{haplotype}.syri_{tool}.png", haplotype=SAMPLES_NOREF, tool=["mm2"]) ) if config["get_chrInputs_SyRI"] == "True": # Creating SyRI figures for each PGGB input 
analysis_inputs.append( - expand("output/chrInput.syri.figs/"+config['name']+".{chromosome}.asm.syri.png", chromosome=CHRLIST) + expand("output/chrInput.syri.figs/Pan1c."+config['name']+".{chromosome}.syri_mm2.png", chromosome=CHRLIST) ) if config["run_Quast"] == "True": # Running Quast on input haplotypes analysis_inputs.append( - "output/quast/"+config['name']+".quast.report.html" + "output/Pan1c."+config['name']+".quast.report.html" ) if config["get_contig_pos"] == "True": # Chromosome decomposition into its contig figure analysis_inputs.append( - expand("output/chr.contig/{haplotype}.contig.png", haplotype=CHRLIST) + expand("output/chr.contig/Pan1c."+config['name']+".{chromosome}.contig.png", chromosome=CHRLIST) ) - if config["create_report"] == "True": # Creating report (need contig) + if config["Pan1c-View_jsons"] == "True": # Creates JSONs for Pan1c-View + analysis_inputs.append("output/report_data/Pan1c."+config['name']+".assembly.json") + analysis_inputs.append("output/report_data/Pan1c."+config['name']+".graph.json") + + if config["get_VCF"] == "True": + analysis_inputs.append("output/report_data/Pan1c."+config['name']+".var.json") + + analysis_inputs.append("output/"+config['name']+".Pan1c_View.tar.gz") + + if config["get_VCF"] == "True": analysis_inputs.append( - "output/pan1c."+config['name']+".report.md" + expand("output/Pan1c.{gtool}."+config['name']+".vcf.gz", gtool=graph_tools) ) return analysis_inputs @@ -81,8 +89,8 @@ Rules ------------------------------------------------------------------------ # Main target rule rule all: input: - "output/pan1c."+config['name']+".gfa", # Final graph (main output) - "output/pan1c."+config['name']+".gfa.metadata", # Metadata for the final (also in top of gfa files as # line) + expand("output/Pan1c.{gtool}."+config['name']+".gfa.gz", gtool=graph_tools), # Final graph (main output) + "output/Pan1c."+config['name']+".gfa.metadata", # Metadata for the final (also in top of gfa files as # line) which_analysis() """ @@ -90,7 +98,27 @@ Pre-processing section : preparing pggb inputs -------------------------------- """ rule ragtag_scaffolding: - # Scaffold input haplotype against the reference to infer chromosome scale sequences + """ + Scaffold an haplotype against the reference assembly set in the config file. 
+ + Input : + - Reference assembly (.fa.gz) + - Haplotype assembly (<haplotype>.fa.gz) + - Fasta index of the reference (.fa.gz.fai and .fa.gz.gzi) + + Output : + - Scaffolded haplotype (<haplotype>.ragtagged.fa) [Temporary] + - Tarball with RagTag temporary files (<haplotype>.tar.gz) + + Threads : 8 + + Memory : n_threads * mem_multiplier * 4Gb + + Parameters : + - app.path + - ragtag_args + - ragtag_mm2_conf + """ input: ref="data/haplotypes/"+config['reference'], reffai="data/haplotypes/"+config['reference']+".fai", @@ -100,10 +128,16 @@ rule ragtag_scaffolding: fa=temp("data/hap.ragtagged/{haplotype}.ragtagged.fa"), tar="data/hap.ragtagged/{haplotype}.tar.gz" threads: 8 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 4000 retries: 1 priority: 100 + benchmark: + "statistics/ragtag.{haplotype}.txt" params: - app_path=config["app.path"] + app_path=config["app.path"], + mm2_config=config["ragtag_mm2_conf"], + rt_config=config["ragtag_args"] log: cmd="logs/ragtag/{haplotype}.ragtag.cmd.log", time="logs/ragtag/{haplotype}.ragtag.time.log" @@ -116,35 +150,63 @@ rule ragtag_scaffolding: -t {threads} \ -r {input.ref} \ -q {input.fa} \ + -c "{params.rt_config}" \ + -m "{params.mm2_config}" \ -o {output.fa} 2>&1 | \ tee {log.cmd} - #if [[ -z $(zgrep '[^[:space:]]' {output.fa}) ]] ; then + #if [[ -z $(grep '[^[:space:]]' {output.fa}) ]] ; then # echo "Error : Empty final fasta" # exit 1 #fi """ rule quast_stats: - # Run Quast on ragtagged genomes + """ + Run Quast on raw input assemblies. [Optional] + + Input : + - Reference assembly (.fa.gz) + - All haplotypes (.fa.gz) + + Output : + - Quast HTML report + + Threads : 16 + + Memory : n_threads * mem_multiplier * 4Gb + + Parameters : + - app.path + - Pangenome name + """ input: fas=expand("data/haplotypes/{haplotype}.fa.gz", haplotype=SAMPLES_NOREF), ref="data/haplotypes/"+config['reference'] output: - report="output/quast/"+config['name']+".quast.report.html" - threads: 8 + report="output/Pan1c."+config['name']+".quast.report.html" + threads: 16 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 4000 params: app_path=config["app.path"], - pan_name=config["name"] + pan_name=config["name"], + tmp_dir="output/quast" + benchmark: + "statistics/quast.txt" log: cmd="logs/quast/quast.cmd.log", time="logs/quast/quast.time.log" shell: """ + if [ ! -d {params.tmp_dir} ]; then + mkdir {params.tmp_dir} + fi + /usr/bin/time -v -o {log.time} \ apptainer run {params.app_path}/pan1c-env.sif quast.py \ -t {threads} \ - -o "$(dirname {output.report})" \ + -o {params.tmp_dir} \ -r {input.ref} \ --plots-format png \ --no-read-stats \ @@ -154,24 +216,85 @@ rule quast_stats: {input.fas} 2>&1 | \ tee {log.cmd} - # Compressing temporary files - apptainer run --app bgzip {params.app_path}/PanGeTools.sif \ - -@ {threads} $(dirname {output.report})/contigs_reports/*.fa - apptainer run --app bgzip {params.app_path}/PanGeTools.sif \ - -@ {threads} $(dirname {output.report})/contigs_reports/minimap_output/* + mv {params.tmp_dir}/report.html {output.report} - mv $(dirname {output.report})/report.html {output.report} + rm -r {params.tmp_dir} """ -rule contig_position: - # Produce figures with contig positions +rule assemblathon_stats: + """ + Run Assemblathon_stats on a haplotype. 
+ + Input : + - Haplotype (<haplotype>.fa.gz) + - Fasta index of the haplotype (<haplotype>.fa.gz.fai) + + Output : + - Tab-delimited table with assemblaton stats (<haplotype>.stats.csv) + + Threads : 1 + + Memory : n_threads * 16Gb + + Parameters : + - app.path + """ + input: + fa="data/haplotypes/{haplotype}.fa.gz", + reffai="data/haplotypes/"+config['reference']+".fai" + output: + csv="data/haplotypes/{haplotype}.stats.csv" + threads: 1 + resources: + mem_mb = lambda wildcards, threads: threads * 16000 + benchmark: + "statistics/assemblathon.{haplotype}.txt" + params: + app_path=config["app.path"] + shell: + """ + # Getting ref size + rsize=$(awk '{{sum += $2}} END {{print sum}}' {input.reffai}) + + # Running Assemblathon_stats + apptainer exec {params.app_path}/pan1c-env.sif assemblathon_stats.pl \ + --csv \ + --genome_size $rsize \ + {input.fa} + + mv data/haplotypes/{wildcards.haplotype}.csv {output.csv} + """ + +rule contig_positions: + """ + Unscaffold an assembly to get the contig delimitations of an assembly. [Optional] + + Input : + - Chromosome fasta (<chromosome>.fa.gz) + - Chromosome fasta index (<chromosome>.fa.gz.fai) + + Output : + - Figure of contig decomposition (<chromosome>.contig.png) + - Temporary directory [Temporary] + + Threads : 1 + + Memory : n_threads * mem_multiplier * 16Gb + + Parameters : + - app.path + """ input: fa="data/chrInputs/"+config["name"]+".{chromosome}.fa.gz", fai="data/chrInputs/"+config["name"]+".{chromosome}.fa.gz.fai" output: - fig="output/chr.contig/{chromosome}.contig.png", - outdir=directory("output/chr.contig/{chromosome}") + fig="output/chr.contig/Pan1c."+config['name']+".{chromosome}.contig.png", + outdir=temp(directory("output/chr.contig/{chromosome}")) threads: 1 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 16000 + benchmark: + "statistics/contig_pos.{chromosome}.txt" params: app_path=config["app.path"] shell: @@ -205,20 +328,40 @@ rule contig_position: #cat $base/{wildcards.chromosome}.tsv # Creating figure - apptainer run --app Renv {params.app_path}/pan1c-env.sif Rscript scripts/contig_position.R \ + apptainer run --app Renv {params.app_path}/pan1c-env.sif Rscript scripts/contig.pos_figs.R \ -b $base/{wildcards.chromosome}.bands \ -t $base/{wildcards.chromosome}.tsv \ -o {output.fig} """ rule chromosome_clustering: - # Read ragtagged fastas and split chromosome sequences into according FASTA files + """ + Read ragtagged fastas and split chromosome sequences into according FASTA files. + + Input : + - All ragtagged haplotype fasta (<haplotype>.ragtagged.fa.gz) + + Output : + - All chromosome fastas (<chromosome>.fa) [Temporary] + + Threads : 1 + + Memory : n_threads * mem_multiplier * 16Gb + + Parameters : + - app.path + - Pangenome name + """ input: expand('data/hap.ragtagged/{haplotype}.ragtagged.fa.gz', haplotype=SAMPLES) output: temp(expand('data/chrInputs/'+config['name']+'.{chromosome}.fa', chromosome=CHRLIST)) threads: 1 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 16000 priority: 100 + benchmark: + "statistics/chromosome_clustering.txt" params: app_path=config["app.path"], pan_name=config["name"] @@ -229,71 +372,186 @@ rule chromosome_clustering: --fasta {input} --output $(dirname {output[0]}) --panname {params.pan_name} """ -rule SyRI_on_ASM: - # Run SyRI on a single assembly. - # The assembly is mapped on the 'reference' and SyRI search for SV. 
+rule SyRI_on_ASM_mm2: + """ + Run SyRI on a ragtagged haplotype assembly using Minimap2 alignments. [Optional] + + Input : + - Reference assembly (.ragtagged.fa.gz) + - Haplotype assembly (<haplotype>.ragtagged.fa.gz) + + Output : + - SyRI figure (<haplotype>.syri_mm2.png) + - SyRI VCF (<haplotype>.syri_mm2.vcf.gz) + + Threads : 4 + + Memory : n_threads * mem_multiplier * 12Gb + + Parameters : + - app.path + - Plotsr config file (src/plotsr-base.cfg) + """ input: ref="data/hap.ragtagged/"+config['reference'][:-5]+"ragtagged.fa.gz", qry="data/hap.ragtagged/{haplotype}.ragtagged.fa.gz" output: - fig="output/asm.syri.figs/pan1c."+config['name']+".{haplotype}.syri.png", - wrkdir=directory('data/asm.syri/{haplotype}') + fig="output/asm.syri.figs/Pan1c."+config['name']+".{haplotype}.syri_mm2.png", + vcf="data/asm.syri.mm2/Pan1c."+config['name']+".{haplotype}.syri.mm2.vcf.gz" log: - cmd="logs/SyRI_ASM/{haplotype}.SyRI_ASM.cmd.log", - time="logs/SyRI_ASM/{haplotype}.SyRI_ASM.time.log" + cmd="logs/SyRI_ASM/{haplotype}.SyRI_ASM.mm2.cmd.log", + time="logs/SyRI_ASM/{haplotype}.SyRI_ASM.mm2.time.log" threads: 4 - retries: 1 resources: - mem_gb=48 + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 12000 + benchmark: + "statistics/syri_asm_mm2.{haplotype}.txt" params: - app_path=config["app.path"] + app_path=config["app.path"], + wrk_dir="data/asm.syri.mm2", + plotsr_cfg="src/plotsr-base.cfg" shell: """ - mkdir -p {output.wrkdir} + dir="{params.wrk_dir}/{wildcards.haplotype}" + + mkdir -p $dir /usr/bin/time -v -o {log.time} \ - bash scripts/getSyriFigs.sh \ + bash scripts/Syri.figs_mm2.sh \ -a {params.app_path} \ -t {threads} \ - -d {output.wrkdir} \ + -d $dir \ -o $(basename {output.fig}) \ -r {input.ref} \ + -h 10 -w 20 -s "0.9" -f 10 \ + -c {params.plotsr_cfg} \ -q "{input.qry}" 2>&1 | \ tee {log.cmd} - mv {output.wrkdir}/$(basename {output.fig}) {output.fig} + mv $dir/$(basename {output.fig}) {output.fig} + mv $dir/*.vcf {params.wrk_dir}/$(basename {output.vcf} .gz) + + apptainer run --app bgzip {params.app_path}/PanGeTools.sif \ + -@ {threads} {params.wrk_dir}/$(basename {output.vcf} .gz) + + rm -r $dir """ + +rule SyRI_on_ASM_wfm: + """ + Run SyRI on a ragtagged haplotype assembly using WFmash alignments. 
[Optional] [WIP, broken] -""" -Core section : Running PGGB -""" + Input : + - Reference assembly (.ragtagged.fa.gz) + - Haplotype assembly (<haplotype>.ragtagged.fa.gz) + + Output : + - SyRI figure (<haplotype>.syri_wfm.png) + - SyRI VCF (<haplotype>.syri_wfm.vcf.gz) + + Threads : 4 + + Memory : n_threads * mem_multiplier * 12Gb + + Parameters : + - app.path + - Plotsr config file (src/plotsr-base.cfg) + """ + input: + ref="data/hap.ragtagged/"+config['reference'][:-5]+"ragtagged.fa.gz", + qry="data/hap.ragtagged/{haplotype}.ragtagged.fa.gz" + output: + fig="output/asm.syri.figs/Pan1c."+config['name']+".{haplotype}.syri_wfm.png", + vcf="data/asm.syri.wfm/Pan1c."+config['name']+".{haplotype}.syri.wfm.vcf.gz" + log: + cmd="logs/SyRI_ASM/{haplotype}.SyRI_ASM.wfm.cmd.log", + time="logs/SyRI_ASM/{haplotype}.SyRI_ASM.wfm.time.log" + threads: 4 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 12000 + benchmark: + "statistics/syri_asm_wfmash.{haplotype}.txt" + params: + app_path=config["app.path"], + wrk_dir="data/asm.syri.wfm", + plotsr_cfg="src/plotsr-base.cfg", + segment_length=config['wfmash.segment_length'], + mapping_id=config['wfmash.mapping_id'], + wfmash_sec=config['wfmash.secondary'] + shell: + """ + dir="{params.wrk_dir}/{wildcards.haplotype}" + + mkdir -p $dir + /usr/bin/time -v -o {log.time} \ + bash scripts/Syri.figs_wfm.sh \ + -a {params.app_path} \ + -t {threads} \ + -d $dir \ + -o $(basename {output.fig}) \ + -r {input.ref} \ + -h 10 -w 20 -s "0.9" -f 10 \ + -c {params.plotsr_cfg} \ + -q "{input.qry}" 2>&1 | \ + tee {log.cmd} + + mv $dir/$(basename {output.fig}) {output.fig} + mv $dir/*.vcf {params.wrk_dir}/$(basename {output.vcf} .gz) + + apptainer run --app bgzip {params.app_path}/PanGeTools.sif \ + -@ {threads} {params.wrk_dir}/$(basename {output.vcf} .gz) + + rm -r $dir + """ rule SyRI_on_chrInput: - # WIP + """ + Run SyRI on chromosome fasta using Minimap2 alignments. [Optional] + + Input : + - Chromosome fasta (<chromosome>.fa.gz) + + Output : + - SyRI figure (<chromosome>.syri_mm2.png) + + Threads : 4 + + Memory : n_threads * mem_multiplier * 12Gb + + Parameters : + - app.path + - Plotsr config file (src/plotsr-base.cfg) + """ input: fasta='data/chrInputs/'+config['name']+'.{chromosome}.fa.gz' output: - fig="output/chrInput.syri.figs/"+config['name']+".{chromosome}.asm.syri.png", - wrkdir=directory('data/chrInputs/syri/{chromosome}') + fig="output/chrInput.syri.figs/Pan1c."+config['name']+".{chromosome}.syri_mm2.png" threads: 8 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 12000 + benchmark: + "statistics/syri_chr_mm2.{chromosome}.txt" params: app_path=config["app.path"], - ref=config['reference'] + ref=config['reference'], + wrk_dir="data/chrInput.syri", + plotsr_cfg="src/plotsr-base.cfg" shell: """ - mkdir {output.wrkdir} + dir="{params.wrk_dir}/{wildcards.chromosome}" + + mkdir -p $dir refname=$(basename {params.ref} .fa.gz | cut -d'.' 
-f1,2) ## Creating single fasta from multifasta - zcat {input.fasta} | awk -F"#" \ - '/^>/ {{OUT="{output.wrkdir}/" substr($0,2) ".fa"}}; {{print >> OUT; close(OUT)}}' + zcat {input.fasta} | awk -F"#" -v DIR=$dir \ + '/^>/ {{OUT= DIR "/" substr($0,2) ".fa"}}; {{print >> OUT; close(OUT)}}' apptainer run --app bgzip {params.app_path}/PanGeTools.sif \ - -@ {threads} {output.wrkdir}/*.fa - #zgrep '>' {output.wrkdir}/*.fa.gz + -@ {threads} $dir/*.fa ## Getting the list of sequences AllAsmList=() - for file in {output.wrkdir}/*.fa.gz; do + for file in $dir/*.fa.gz; do asm="$(basename $file .fa.gz | cut -f1,2 -d"#" | sed 's/#/\\.hap/').fa.gz" mv $file "$(dirname $file)/$asm" AllAsmList+=("$(dirname $file)/$asm") @@ -301,33 +559,150 @@ rule SyRI_on_chrInput: #echo "The ASM Array : ${{AllAsmList[@]}}" - bash scripts/getSyriFigs.sh \ + bash scripts/Syri.figs_mm2.sh \ -a {params.app_path} \ -t {threads} \ - -d {output.wrkdir} \ + -d $dir \ -o $(basename {output.fig}) \ - -r {output.wrkdir}/"${{refname}}.fa.gz" \ + -r "${{dir}}/${{refname}}.fa.gz" \ -q "${{AllAsmList[*]}}" \ - -h 9 -w 16 - mv {output.wrkdir}/$(basename {output.fig}) {output.fig} - rm {output.wrkdir}/*.fa + -c {params.plotsr_cfg} \ + -h 10 -w 20 -s "0.9" -f 10 + + mv $dir/$(basename {output.fig}) {output.fig} + rm -r $dir """ +def asm_json_inputs(wildcards): + """ + Creates the input list for asm_json rule. + """ + sections = dict() + + sections["csv"] = expand("data/haplotypes/{haplotype}.stats.csv", haplotype=SAMPLES) + sections["fai"] = expand('data/chrInputs/'+config['name']+'.{chromosome}.fa.gz.fai', chromosome=CHRLIST) + + if config["get_contig_pos"] == "True": + sections["contig_pos"] = expand( + "output/chr.contig/Pan1c."+config['name']+".{chromosome}.contig.png", + chromosome=CHRLIST + ) + + if config["get_ASMs_SyRI"] == "True": + sections["SyRI_on_ASMs_figs"] = expand( + "output/asm.syri.figs/Pan1c."+config['name']+".{haplotype}.syri_mm2.png", + haplotype=SAMPLES_NOREF + ) + + if config["get_chrInputs_SyRI"] == "True": + sections["SyRI_on_chrInputs_figs"] = expand( + "output/chrInput.syri.figs/Pan1c."+config['name']+".{chromosome}.syri_mm2.png", + chromosome=CHRLIST + ) + + return sections + +rule asm_json: + """ + Produce the assembly JSON for Pan1c-View. 
[Optional] + + Input : + - Tab-delimited table with assemblaton stats (<haplotype>.stats.csv) + - All chromosome fasta indexes (.fa.gz.fai) + - All figures of contig decomposition (.contig.png) [Optional] + - SyRI figure with MM2 (<haplotype>.syri_mm2.png) [Optional] + + Output : + - SyRI figure (<haplotype>.syri_mm2.png) + - SyRI VCF (<haplotype>.syri_mm2.vcf.gz) + + Threads : 4 + + Memory : n_threads * mem_multiplier * 12Gb + + Parameters : + - app.path + - Plotsr config file (src/plotsr-base.cfg) + """ + input: + unpack(asm_json_inputs) + output: + json="output/report_data/Pan1c."+config['name']+".assembly.json", + merged="output/report_data/Pan1c."+config['name']+".assemblathon_stats.tsv" + threads: 1 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 16000 + params: + app_path=config["app.path"], + pan_name=config["name"], + ref_name=config['reference'], + add_ASMs_SyRI=config['get_ASMs_SyRI'], + add_chrInputs_SyRI=config['get_chrInputs_SyRI'], + add_contig_pos=config['get_contig_pos'], + add_quast=config["run_Quast"] + run: + shell("cat {input.fai} > output/report_data/all.fai") + shell("awk 'FNR==1 && NR!=1 {{next}} {{print}}' {input.csv} | tr ',' '\\t' > {output.merged}") + + command = [f"--output {output.json} --asm_stats {output.merged} --fai output/report_data/all.fai --name {params.pan_name} --ref {params.ref_name.rsplit('.', maxsplit=2)[0].replace('.hap', '#')}"] + + if params.add_ASMs_SyRI == "True": command.append("--syri_asm") + if params.add_chrInputs_SyRI == "True": command.append("--syri_chr") + if params.add_contig_pos == "True": command.append("--contig_pos") + if params.add_quast == "True": command.append("--quast") + + command = " ".join(command) + shell("apptainer run {params.app_path}/pan1c-env.sif python scripts/asm.pan1c_QC.py {command}") + + shell("rm output/report_data/all.fai") + +""" +Core section : Running PGGB +""" + rule wfmash_on_chr: - # Run wfmash on a specific chromosome input + """ + Run WFmash on a chromosome fasta (First PGGB step). 
+ + Input : + - Chromosome fasta (<chromosome>.fa.gz) + - Chromosome fasta index (<chromosome>.fa.gz.fai) + + Output : + - Mapping (<chromosome>.wfmash.mapping.paf) [Temporary] + - Compressed mapping (<chromosome>.wfmash.mapping.paf.gz) + - Alignment (<chromosome>.wfmash.aln.paf) [Temporary] + - Compressed alignment (<chromosome>.wfmash.al.paf.gz) + + Threads : 16 + + Memory : n_threads * mem_multiplier * 2Gb + + Parameters : + - app.path + - wfmash.segment_length + - wfmash.mapping_id + - wfmash.secondary + """ input: fa='data/chrInputs/'+config['name']+'.{chromosome}.fa.gz', fai='data/chrInputs/'+config['name']+'.{chromosome}.fa.gz.fai' output: - mapping=temp("data/chrGraphs/{chromosome}/{chromosome}.wfmash.mapping.paf"), - aln=temp("data/chrGraphs/{chromosome}/{chromosome}.wfmash.aln.paf") + mapping=temp("data/chrGraphs/PGGB.{chromosome}/Pan1c."+config['name']+".{chromosome}.wfmash.mapping.paf"), + aln=temp("data/chrGraphs/PGGB.{chromosome}/Pan1c."+config['name']+".{chromosome}.wfmash.aln.paf"), + mapping_gz="data/chrGraphs/PGGB.{chromosome}/Pan1c."+config['name']+".{chromosome}.wfmash.mapping.paf.gz", + aln_gz="data/chrGraphs/PGGB.{chromosome}/Pan1c."+config['name']+".{chromosome}.wfmash.aln.paf.gz" threads: 16 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 2000 priority: 100 params: app_path=config['app.path'], segment_length=config['wfmash.segment_length'], mapping_id=config['wfmash.mapping_id'], wfmash_sec=config['wfmash.secondary'] + benchmark: + "statistics/wfmash.{chromosome}.txt" log: cmd_map="logs/pggb/{chromosome}.wfmash_mapping.cmd.log", time_map="logs/pggb/{chromosome}.wfmash_mapping.time.log", @@ -337,22 +712,23 @@ rule wfmash_on_chr: """ ## Mapping /usr/bin/time -v -o {log.time_map} \ - apptainer exec {params.app_path}/pggb.sif wfmash \ + apptainer exec {params.app_path}/PanGeTools.sif wfmash \ -s {params.segment_length} -l $(( {params.segment_length} * 5 )) -p {params.mapping_id} \ -n $(cat {input.fai} | wc -l) {params.wfmash_sec} -t {threads} \ --tmp-base $(dirname {output.aln}) \ {input.fa} --approx-map \ + -X \ 1> {output.mapping} \ 2> >(tee {log.cmd_map} >&2) ## Aligning /usr/bin/time -v -o {log.time_aln} \ - apptainer exec {params.app_path}/pggb.sif wfmash \ + apptainer exec {params.app_path}/PanGeTools.sif wfmash \ -s {params.segment_length} -l $(( {params.segment_length} * 5 )) -p {params.mapping_id} \ -n $(cat {input.fai} | wc -l) {params.wfmash_sec} -t {threads} \ --tmp-base $(dirname {output.aln}) \ {input.fa} -i {output.mapping} \ - --invert-filtering \ + -X \ 1> {output.aln} \ 2> >(tee {log.cmd_aln} >&2) @@ -365,14 +741,35 @@ rule wfmash_on_chr: """ rule seqwish: - # Run seqwish on alignement produced by wfmash + """ + Run seqwish on alignment file produced by WFmash for a given chromosome. 
+ + Input : + - Chromosome fasta (<chromosome>.fa.gz) + - Alignment file (<chromosome>.wfamsh.aln.paf.gz) + + Output : + - Seqwish graph (<chromosome>.seqwish.gfa.gz) [GFAv1.0] + + Threads : 8 + + Memory : n_threads * mem_multiplier * 4Gb + + Parameters : + - app.path + - seqwish.params + """ input: fa='data/chrInputs/'+config['name']+'.{chromosome}.fa.gz', - aln=rules.wfmash_on_chr.output.aln + aln=rules.wfmash_on_chr.output.aln_gz output: - temp("data/chrGraphs/{chromosome}/{chromosome}.seqwish.gfa") + gfa_gz="data/chrGraphs/PGGB.{chromosome}/Pan1c."+config['name']+".{chromosome}.seqwish.gfa.gz" threads: 8 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 4000 priority: 100 + benchmark: + "statistics/seqwish.{chromosome}.txt" params: app_path=config['app.path'], seqwish=config['seqwish.params'] @@ -381,50 +778,100 @@ rule seqwish: time="logs/pggb/{chromosome}.seqwish.time.log" shell: """ + gfa_out="$(dirname {output.gfa_gz})/$(basename {output.gfa_gz} .gz)" + /usr/bin/time -v -o {log.time} \ - apptainer exec {params.app_path}/pggb.sif seqwish \ - -s {input.fa} -p {input.aln} -g {output} \ + apptainer exec {params.app_path}/PanGeTools.sif seqwish \ + -s {input.fa} -p {input.aln} -g $gfa_out \ {params.seqwish} -t {threads} \ - --temp-dir $(dirname {output}) -P 2>&1 | \ + --temp-dir $(dirname {output.gfa_gz}) -P 2>&1 | \ tee {log.cmd} # Compressing apptainer run --app bgzip {params.app_path}/PanGeTools.sif \ - -@ {threads} -k {output} + -@ {threads} $gfa_out """ rule gfaffix_on_chr: - # Run gfaffix on seqwish graph + """ + GFAffix on Seqwish graph for a given chromosome. + + Input : + - Seqwish graph (<chromosome>.seqwish.gfa.gz) [GFAv1.0] + + Output : + - GFAffixed graph (<chromosome>.seqwish.gfaffixD.gfa.gz) [GFAv1.0] + - Transform logs (<chromosome>.seqwish.gfaffixD.transform.txt) + + Threads : 1 + + Memory : n_threads * mem_multiplier * 24Gb + + Parameters : + - app.path + """ input: - rules.seqwish.output + rules.seqwish.output.gfa_gz output: - gfa=temp("data/chrGraphs/{chromosome}/{chromosome}.seqwish.gfaffixD.gfa"), - transform="data/chrGraphs/{chromosome}/{chromosome}.seqwish.gfaffixD.transform.txt" + gfa_gz="data/chrGraphs/PGGB.{chromosome}/Pan1c."+config['name']+".{chromosome}.seqwish.gfaffixD.gfa.gz", + transform="data/chrGraphs/PGGB.{chromosome}/Pan1c."+config['name']+".{chromosome}.seqwish.gfaffixD.transform.txt" threads: 1 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 24000 priority: 100 + benchmark: + "statistics/gfaffix.{chromosome}.txt" params: app_path=config['app.path'] log: time="logs/pggb/{chromosome}.gfaffix.time.log" shell: """ + # Decompressing Seqwish graph + gzip -d -k {input} + gfa_in="$(dirname {input})/$(basename {input} .gz)" + gfa_out="$(dirname {output.gfa_gz})/$(basename {output.gfa_gz} .gz)" + /usr/bin/time -v -o {log.time} \ - apptainer exec {params.app_path}/pggb.sif gfaffix \ - {input} -o {output.gfa} -t {output.transform} \ + apptainer exec {params.app_path}/PanGeTools.sif gfaffix \ + $gfa_in -o $gfa_out -t {output.transform} \ > /dev/null # Compressing apptainer run --app bgzip {params.app_path}/PanGeTools.sif \ - -@ {threads} -k {output.gfa} + -@ {threads} $gfa_out + + rm $gfa_in """ rule odgi_postprocessing: - # Running pggb's postprocessing (mainly odgi) steps with gfaffix graph + """ + ODGI postprocessing on GFAffixed graph, for a given chromosome. 
+ + Input : + - Graph metadata (Pan1c.<pangenome_name>.gfa.metadata) + - GFAffixed graph (<chromosome>.seqwish.gfaffixD.gfa.gz) + + Output : + - Chromosome graph (<chromosome>.gfa.gz) [GFAv1.0] + + Threads : 8 + + Memory : n_threads * mem_multiplier * 4Gb + + Parameters : + - app.path + """ input: - rules.gfaffix_on_chr.output.gfa + tags="output/Pan1c."+config['name']+".gfa.metadata", + gfa_gz=rules.gfaffix_on_chr.output.gfa_gz output: - gfa='data/chrGraphs/'+config['name']+'.{chromosome}.gfa' + gfa_gz="data/chrGraphs/Pan1c.PGGB."+config['name']+".{chromosome}.gfa.gz" threads: 8 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 4000 + benchmark: + "statistics/odgi_postprocessing.{chromosome}.txt" priority: 100 params: app_path=config['app.path'] @@ -439,14 +886,21 @@ rule odgi_postprocessing: time_view="logs/pggb/{chromosome}.odgi_view.time.log" shell: """ - OGfile="$(dirname {input})/$(basename {input} .gfa)" + OGfile="$(dirname {input.gfa_gz})/$(basename {input.gfa_gz} .gfa.gz)" + + gzip -d -k {input.gfa_gz} + gfa_in="$(dirname {input.gfa_gz})/$(basename {input.gfa_gz} .gz)" + gfa_out="$(dirname {output.gfa_gz})/$(basename {output.gfa_gz} .gz)" ## Converting to odgi format and optimizing namespace /usr/bin/time -v -o {log.time_build} \ apptainer run --app odgi {params.app_path}/PanGeTools.sif \ - build -t {threads} -P -g {input} -o $OGfile.og -O 2>&1 | \ + build -t {threads} -P -g $gfa_in -o $OGfile.og -O 2>&1 | \ tee {log.cmd_build} + ## Removing input GFA + rm $gfa_in + ## Unchoping the nodes (merges unitigs into single nodes) /usr/bin/time -v -o {log.time_unchop} \ apptainer run --app odgi {params.app_path}/PanGeTools.sif \ @@ -464,71 +918,220 @@ rule odgi_postprocessing: /usr/bin/time -v -o {log.time_view} \ apptainer run --app odgi {params.app_path}/PanGeTools.sif \ view -i $OGfile.unchoped.sorted.og -g \ - 1> {output.gfa} \ + 1> $gfa_out \ 2> >(tee {log.cmd_view} >&2) ## Removing .og files for space savings - rm $(dirname {input})/*.og + rm $(dirname {input.gfa_gz})/*.og - ## Adding metadata - python scripts/getTags.py \ - --appdir {params.app_path} --config-file config.yaml \ - > "$(dirname {input})/metadata.txt" - sed -i "/^H/r $(dirname {input})/metadata.txt" {output.gfa} + ## Adding new tags + sed -i '/^H/r {input.tags}' $gfa_out - # Compressing - # apptainer run --app bgzip {params.app_path}/PanGeTools.sif \ - # -@ {threads} -k {output.gfa} + ## Compressing + apptainer run --app bgzip {params.app_path}/PanGeTools.sif \ + -@ {threads} $gfa_out + """ + +rule MC_graph: + """ + Running Cactus-Pangenome workflow for a given chromosome fasta. 
+ + Input : + - Graph metadata (Pan1c.<pangenome_name>.gfa.metadata) + - Chromosome fasta (<chromosome>.fa.gz) + + Output : + - Minigraph-Cactus graph (<chromosome>.gfa.gz) [GFAv1.0 (converted from GFAv1.1 with GFAvc)] + + Threads : 16 + + Memory : mem_multiplier * 32Gb + + Parameters : + - app.path + - reference + - MC.params + """ + input: + tags="output/Pan1c."+config['name']+".gfa.metadata", + fa='data/chrInputs/'+config['name']+'.{chromosome}.fa.gz' + output: + gfa_gz='data/chrGraphs/Pan1c.MC.'+config['name']+'.{chromosome}.gfa.gz' + threads: 16 + resources: + mem_mb = lambda wildcards, threads: config["mem_multiplier"] * 32000 + benchmark: + "statistics/MC.{chromosome}.txt" + params: + mc_dir='data/chrGraphs/MC.{chromosome}', + ref_name=config['reference'], + app_path=config['app.path'], + mc=config['MC.params'] + log: + stdout="logs/MC/{chromosome}.mc.stdout.log", + stderr="logs/MC/{chromosome}.mc.stderr.log" + shell: + """ + if [ -d {params.mc_dir} ]; then rm -r {params.mc_dir}; fi + mkdir -p {params.mc_dir} + + # Creating a fasta for each sequence + zcat {input.fa} | awk -v DIR={params.mc_dir} \ + '/^>/ {{name=substr($0, 2); gsub(/#/, ".", name); OUT= DIR "/" name ".fa"}}; {{print >> OUT; close(OUT)}}' + + # Listing fasta files + for hap in {params.mc_dir}/*.fa; do + fullname=$(basename $hap .fa) + genome=$(echo $fullname | cut -f1 -d'.') + hapid=$(echo $fullname | cut -f2 -d'.') + echo -e "${{genome}}.${{hapid}}\t${{hap}}" >> {params.mc_dir}/{wildcards.chromosome}.genomes.txt + done + + # Running MC + apptainer run {params.app_path}/minigraph-cactus_v2.7.0.sif \ + {params.mc_dir}/tmp {params.mc_dir}/{wildcards.chromosome}.genomes.txt \ + --outDir {params.mc_dir} \ + --outName $(basename {output.gfa_gz} .gfa.gz) \ + --reference "$(basename {params.ref_name} .fa.gz | cut -f1 -d'.').$(basename {params.ref_name} .fa.gz | cut -f2 -d '.' | cut -f2 -d'p')" \ + --gfa \ + {params.mc} + + # Converting to GFA 1.0 + apptainer run --app gfavc {params.app_path}/PanGeTools.sif \ + --gfa1 {params.mc_dir}/$(basename {output.gfa_gz} .gfa.gz).full.gfa.gz \ + --simplify \ + --outName "$(dirname {output.gfa_gz})/$(basename {output.gfa_gz} .gz)" + + # Cleaning and compressing + rm -r {params.mc_dir}/*.fa {params.mc_dir}/*.txt {params.mc_dir}/chrom-* + apptainer run --app bgzip {params.app_path}/PanGeTools.sif \ + -@ {threads} {params.mc_dir}/*.hal {params.mc_dir}/*.paf + + ## Adding new tags + sed -i '/^H/r {input.tags}' "$(dirname {output.gfa_gz})/$(basename {output.gfa_gz} .gz)" + + apptainer run --app bgzip {params.app_path}/PanGeTools.sif \ + -@ {threads} "$(dirname {output.gfa_gz})/$(basename {output.gfa_gz} .gz)" + """ rule generate_graph_list: - # Generate a text file containing all created graphs + """ + Generate a list of chromosome graphs from a given tool. Used to trigger construction of all graphs. 
+ + Input : + - Chromosome graphs (<gtool>.<chromosome>.tmp.gfa) [GFAv1.0] + + Output : + - List of graphs (graphList.<gtool>.txt) + + Threads : 1 + + Memory : 4Gb + """ input: - expand('data/chrGraphs/'+config['name']+'.{chromosome}.gfa', chromosome=CHRLIST) + gfas=expand('data/chrGraphs/Pan1c.{{gtool}}.'+config['name']+'.{chromosome}.tmp.gfa', chromosome=CHRLIST) output: - "data/chrGraphs/graphsList.txt" + temp("data/chrGraphs/graphsList.{gtool}.txt") threads: 1 + resources: + mem_mb = lambda wildcards, threads: 4000 priority: 100 run: with open(output[0], "w") as handle: - for file in input: + for file in input.gfas: handle.write(file+"\n") rule graph_squeeze: - # Using odgi to merge every subgraphs into a final one + """ + Concatenate chromosome graphs into a final graph, using Odgi squeez command. + It only is run for a single tool (PGGB, MC) at a time. + + Input : + - list of graphs (graphList.<gtool>.txt) + - Graph metadata (Pan1c.<pangenome_name>.gfa.metadata) + - Chromosome graphs (Pan1c.<gtool>.<pangenome_name>.<chromosome>.tmp.gfa) [GFAv1.0] + + Output : + - Final graph (Pan1c.<gtool>.<pangenome_name>.gfa.gz) [GFAv1.0] + + Threads : 16 + + Memory : n_threads * mem_multiplier * 2Gb + + Parameters : + - app.path + """ input: - glist="data/chrGraphs/graphsList.txt", - graphs=expand('data/chrGraphs/'+config['name']+'.{chromosome}.gfa', chromosome=CHRLIST) + glist="data/chrGraphs/graphsList.{gtool}.txt", + tags="output/Pan1c."+config['name']+".gfa.metadata", + graphs=expand('data/chrGraphs/Pan1c.{{gtool}}.'+config['name']+'.{chromosome}.tmp.gfa', chromosome=CHRLIST) output: - "output/pan1c."+config['name']+".gfa" + gfa_gz="output/Pan1c.{gtool}."+config['name']+".gfa.gz" log: - cmd="logs/squeeze/"+config['name']+".squeeze.cmd.log", - time="logs/squeeze/"+config['name']+".squeeze.time.log", + cmd="logs/squeeze/Pan1c.{gtool}."+config['name']+".squeeze.cmd.log", + time="logs/squeeze/Pan1c.{gtool}."+config['name']+".squeeze.time.log", threads: 16 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 2000 + benchmark: + "statistics/graph_squeeze.{gtool}.txt" priority: 100 params: app_path=config['app.path'] shell: """ + gfa_out="$(dirname {output.gfa_gz})/$(basename {output.gfa_gz} .gz)" + /usr/bin/time -v -o {log.time} \ apptainer run --app odgi {params.app_path}/PanGeTools.sif \ - squeeze -t {threads} -O -P -f {input.glist} -o {output}.og 2>&1 | \ + squeeze -t {threads} -O -P -f {input.glist} -o $gfa_out.og 2>&1 | \ tee {log.cmd} apptainer run --app odgi {params.app_path}/PanGeTools.sif \ - view -t {threads} -P -i {output}.og -g > {output} + view -t {threads} -P -i $gfa_out.og -g > $gfa_out + + rm $gfa_out.og - rm {output}.og + ## Adding new tags + sed -i '/^H/r {input.tags}' $gfa_out + + # Compressing + apptainer run --app bgzip {params.app_path}/PanGeTools.sif \ + -@ {threads} $gfa_out """ rule graph_stats: - # Using GFAstats to produce stats on every chromosome graphs + """ + Statistics computed on a chromosome graphs using a custom script. + The script can be found [here](https://forgemia.inra.fr/alexis.mergez/pangetools/-/blob/main/GFAstats.py?ref_type=heads). + See the list of metrics [here](https://forgemia.inra.fr/alexis.mergez/pangetools/-/wikis/GFAstats---Documentation). 
+ + Input : + - A single chromosome graph (<gtool>.<pangenome_name>.<chromosome>.gfa.gz) [GFAv1.0] + + Output : + - General statistics (<chromosome>.general.stats.tsv) + - Path statistics (<chromosome>.path.stats.tsv) + + Threads : 4 + + Memory : n_threads * mem_multiplier * 8Gb + + Parameters : + - app.path + - name + """ input: - graph='data/chrGraphs/'+config['name']+'.{chromosome}.gfa' + graph='data/chrGraphs/Pan1c.{gtool}.'+config['name']+'.{chromosome}.gfa.gz' output: - genstats="output/stats/chrGraphs/"+config['name']+".{chromosome}.general.stats.tsv", - pathstats="output/stats/chrGraphs/"+config['name']+".{chromosome}.path.stats.tsv" + genstats="output/stats/chrGraphs.{gtool}/Pan1c.{gtool}."+config['name']+".{chromosome}.general.stats.tsv", + pathstats="output/stats/chrGraphs.{gtool}/Pan1c.{gtool}."+config['name']+".{chromosome}.path.stats.tsv" threads: 4 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 8000 + benchmark: + "statistics/graph_stats.{gtool}.{chromosome}.txt" params: app_path=config['app.path'], pan_name=config['name'] @@ -536,18 +1139,38 @@ rule graph_stats: """ apptainer run --app gfastats {params.app_path}/PanGeTools.sif \ -g {input.graph} -P \ - -o $(dirname {output.genstats})/{params.pan_name}.{wildcards.chromosome} \ + -o $(dirname {output.genstats})/Pan1c.{wildcards.gtool}.{params.pan_name}.{wildcards.chromosome} \ -t {threads} """ rule graph_figs: - # Creating figures using odgi viz + """ + Create 1D viz of a chromosome graph using Odgi viz. + + Input : + - Chromosome graph (<gtool>.<pangenome_name>.<chromosome>.tmp.gfa) + + Output : + - 1D viz (<gtool>.<pangenome_name>.<chromosome>.1Dviz.png) + - 1D path coverage (<gtool>.<pangenome_name>.<chromosome>.pcov.png) + + Threads : 4 + + Memory : n_threads * mem_multiplier * 4Gb + + Parameters : + - app.path + - odgi.1Dviz.params + - odgi.pcov.params + """ input: - graph='data/chrGraphs/'+config['name']+'.{chromosome}.gfa' + graph='data/chrGraphs/Pan1c.{gtool}.'+config['name']+'.{chromosome}.tmp.gfa' output: - oneDviz="output/chrGraphs.figs/"+config['name']+".{chromosome}.1Dviz.png", - pcov="output/chrGraphs.figs/"+config['name']+".{chromosome}.pcov.png" + oneDviz="output/chrGraphs.figs/Pan1c.{gtool}."+config['name']+".{chromosome}.1Dviz.png", + pcov="output/chrGraphs.figs/Pan1c.{gtool}."+config['name']+".{chromosome}.pcov.png" threads: 4 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 4000 params: app_path=config['app.path'], oneDviz=config['odgi.1Dviz.params'], @@ -561,33 +1184,76 @@ rule graph_figs: ## Pcov viz apptainer run --app odgi {params.app_path}/PanGeTools.sif \ viz -i {input.graph} -o {output.pcov} {params.pcov} -t {threads} -P + """ rule aggregate_graphs_stats: - # Reading and merging all stats files from chromosome graphs into a .tsv. + """ + Aggregate graph statistics produced for each chromosome graph, given a graph tool. 
+ + Input : + - Chromosome graphs general statistics (<gtool>.<pangenome_name>.<chromosome>.general.stats.tsv) + - Chromosome graphs paths statistics (<gtool>.<pangenome_name>.<chromosome>.path.stats.tsv) + + Output : + - General statistics (<gtool>.<pangenome_name>.chrGraphs.general.stats.tsv) + - Paths statistics (<gtool>.<pangenome_name>.chrGraphs.path.stats.tsv) + + Threads : 1 + + Memory : n_threads * mem_multiplier * 2Gb + + Parameters : + - app.path + - name + """ input: - genstats=expand("output/stats/chrGraphs/"+config['name']+".{chromosome}.general.stats.tsv", chromosome=CHRLIST) + genstats=expand("output/stats/chrGraphs.{{gtool}}/Pan1c.{{gtool}}."+config['name']+".{chromosome}.general.stats.tsv", chromosome=CHRLIST) output: - genstats="output/stats/pan1c."+config['name']+".chrGraph.general.stats.tsv", - pathstats="output/stats/pan1c."+config['name']+".chrGraph.path.stats.tsv" + genstats="output/stats/Pan1c.{gtool}."+config['name']+".chrGraph.general.stats.tsv", + pathstats="output/stats/Pan1c.{gtool}."+config['name']+".chrGraph.path.stats.tsv" threads: 1 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 2000 params: app_path=config['app.path'], pan_name=config['name'] shell: """ - apptainer run {params.app_path}/pan1c-env.sif python scripts/chrStatsAggregation.py \ + echo "{input.genstats}" + + apptainer run {params.app_path}/pan1c-env.sif python scripts/chrGraphs.stats_aggregate.py \ --input $(dirname {input[0]}) --outputGeneral {output.genstats} \ --outputPaths {output.pathstats} --panname {params.pan_name} """ -rule final_graph_tagging: - # Add metadata to the final GFA +rule get_graph_tags: + """ + Generate graph metadata file using tools version, passed arguments... + + Input : + - config.yaml + + Output : + - Graph metadata (Pan1c.<pangenome_name>.gfa.metadata) + - Tags JSON for Pan1c-View (summary section) + + Threads : 1 + + Memory : n_threads * mem_multiplier * 8Gb + + Parameters : + - app.path + - name + """ input: - graph="output/pan1c."+config['name']+".gfa", + "config.yaml" output: - "output/pan1c."+config['name']+".gfa.metadata" + md="output/Pan1c."+config['name']+".gfa.metadata", + json="output/report_data/Pan1c."+config['name']+".tags.json" threads: 1 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 8000 priority: 100 params: app_path=config['app.path'], @@ -595,104 +1261,305 @@ rule final_graph_tagging: shell: """ python scripts/getTags.py \ - --appdir {params.app_path} --config-file config.yaml > {output} - sed -i '/^H/r {output}' {input.graph} + --appdir {params.app_path} --config-file config.yaml --json {output.json} > {output.md} """ -rule pggb_input_stats: - # Produces statistics on pggb input sequences - input: - flag="output/stats/pan1c."+config['name']+".chrGraph.general.stats.tsv" - output: - "output/stats/pan1c."+config['name']+".chrInput.stats.tsv" - threads: 1 - params: - app_path=config['app.path'], - pan_name=config['name'] - shell: - """ - apptainer run {params.app_path}/pan1c-env.sif python scripts/chrInputStats.py \ - -f data/chrInputs/*.fa.gz -o {output} -p {params.pan_name} - """ +rule graph_json: + """ + Generate graph JSON for Pan1c-View graph section. 
[Optional] + + Input : + - Aggregated general statistics for all graph tool (*.chrGraph.general.stats.tsv) + - Aggregated paths statistics for all graph tool (*.chrGraph.general.stats.tsv) + - Contig-Odgi figures (*.report.fig.png) + + Output : + - Graph JSON (Pan1c.<pangenome_name>.graph.json) + + Threads : 1 -rule core_statistics: - # Aggregate chrInput, chrGraph and pggb statistics into a single tsv + Memory : n_threads * mem_multiplier * 16Gb + + Parameters : + - app.path + - name + """ input: - chrInputStats = "output/stats/pan1c."+config['name']+".chrInput.stats.tsv", - chrGraphStats = "output/stats/pan1c."+config['name']+".chrGraph.general.stats.tsv" + genstats = expand("output/stats/Pan1c.{gtool}."+config['name']+".chrGraph.general.stats.tsv", gtool=graph_tools), + pathstats = expand("output/stats/Pan1c.{gtool}."+config['name']+".chrGraph.path.stats.tsv", gtool=graph_tools), + odgifigs = expand("output/report/Pan1c.{gtool}."+config['name']+".{chromosome}.report.fig.png", gtool=graph_tools, chromosome=CHRLIST) output: - tsv = "output/stats/pan1c."+config['name']+".core.stats.tsv", - dir = directory("output/pggb.usage.figs") + json="output/report_data/Pan1c."+config['name']+".graph.json" threads: 1 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 16000 + benchmark: + "statistics/graph_json.txt" params: - app_path=config['app.path'], - pan_name=config['name'] + app_path=config["app.path"], + pan_name=config["name"] shell: """ - mkdir -p {output.dir} - apptainer run {params.app_path}/pan1c-env.sif python scripts/coreStats.py \ - --pggbStats logs/pggb --chrInputStats {input.chrInputStats} \ - --chrGraphStats {input.chrGraphStats} -o {output.tsv} -f {output.dir} -p {params.pan_name} + apptainer run {params.app_path}/pan1c-env.sif python scripts/graph.pan1c_QC.py \ + --gen_stats {input.genstats} \ + --path_stats {input.pathstats} \ + --name {params.pan_name} \ + --odgi_figs {input.odgifigs} \ + --output {output.json} """ - + """ Post-processing section """ -rule get_pav: - # Create PAV matrix readable by panache for a given chromosome scale graph - input: - "data/chrGraphs/graphsList.txt" - output: - directory("output/pav.matrices") - threads: 16 - params: - app_path=config['app.path'] - run: - shell("mkdir {output}") - # Getting the list of graphs - with open(input[0]) as handle: - graphList = [graph.rstrip("\n") for graph in handle.readlines()] - # Iterating over graphs - for graph in graphList: - shell("bash scripts/getPanachePAV.sh -g {graph} -d data/chrGraphs/$(basename {graph} .gfa) -o {output}/$(basename {graph} .gfa).pav.matrix.tsv -a {params.app_path} -t {threads}") rule panacus_stats: - # Produces panacus reports for a chromosome graph + """ + Generate Panacus figure for a chromosome graph. 
[Optional] + + Input : + - Aggregated general statistics for all graph tool (*.chrGraph.general.stats.tsv) + - Aggregated paths statistics for all graph tool (*.chrGraph.general.stats.tsv) + - Contig-Odgi figures (*.report.fig.png) + + Output : + - Graph JSON (Pan1c.<pangenome_name>.graph.json) + + Threads : 1 + + Memory : n_threads * mem_multiplier * 16Gb + + Parameters : + - app.path + - name + """ input: - graph='data/chrGraphs/'+config['name']+'.{chromosome}.gfa' + graph='data/chrGraphs/Pan1c.{gtool}.'+config['name']+'.{chromosome}.tmp.gfa' output: - html='output/panacus.reports/'+config['name']+'.{chromosome}.histgrowth.html' + html='output/panacus.reports/Pan1c.{gtool}.'+config['name']+'.{chromosome}.histgrowth.html' log: - cmd="logs/panacus/{chromosome}.panacus.cmd.log", - time="logs/panacus/{chromosome}.panacus.time.log" + cmd="logs/panacus/Pan1c.{gtool}.{chromosome}.panacus.cmd.log", + time="logs/panacus/Pan1c.{gtool}.{chromosome}.panacus.time.log" params: app_path=config['app.path'], pan_name=config['name'], refname=config['reference'] threads: 8 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 2000 + benchmark: + "statistics/panacus.{gtool}.{chromosome}.txt" shell: """ /usr/bin/time -v -o {log.time} \ bash scripts/getPanacusHG.sh \ -g {input.graph} \ -r $(basename {params.refname} .fa.gz) \ - -d data/chrGraphs/{wildcards.chromosome} \ + -d data/chrGraphs/{wildcards.gtool}.{wildcards.chromosome} \ -o {output.html} \ -a {params.app_path} \ -t {threads} 2>&1 | \ tee {log.cmd} """ +rule vg_deconstruct: + """ + Use VG deconstruct to call variants from the graph. The reference haplotype set for the workflow is used as reference. [Optional] + + Input : + - Final graph in XG format (Pan1c.<gtool>.<pangenome_name>.xg) + + Output : + - VCF (Pan1c.<gtool>.<pangenome_name>.vcf) [Temporary] + + Threads : 8 + + Memory : n_threads * mem_multiplier * 32Gb + + Parameters : + - app.path + - reference + """ + input: + graph="output/Pan1c.{gtool}."+config['name']+".xg", + output: + vcf=temp("output/Pan1c.{gtool}."+config['name']+".vcf"), + threads: 8 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 32000 + benchmark: + "statistics/vg_deconstruct.{gtool}.txt" + params: + app_path=config['app.path'], + ref=config['reference'] + log: + cmd="logs/vg_deconstruct/Pan1c.{gtool}.vg_deconstruct.cmd.log", + time="logs/vg_deconstruct/Pan1c.{gtool}.vg_deconstruct.time.log" + shell: + """ + /usr/bin/time -v -o {log.time} \ + apptainer run --app vg {params.app_path}/PanGeTools.sif \ + deconstruct -a \ + -P $(echo {params.ref} | cut -f1 -d'.') \ + {input.graph} \ + -t {threads} -v \ + 1> {output.vcf} \ + 2> >(tee {log.cmd} >&2) + """ + +rule vg_vcf_2_tsv: + """ + Convert a VCF from VG deconstruct into a TSV. [Optional] + + Input : + - VCF from VG deconstruct (Pan1c.<gtool>.<pangenome_name>.vcf.gz) + + Output : + - TSV (vg_<gtool>.tsv) [Temporary] + + Threads : 1 + + Memory : n_threads * mem_multiplier * 8Gb + """ + input: + "output/Pan1c.{gtool}."+config['name']+".vcf.gz" + output: + temp("tmp/var_json/vg_{gtool}.tsv") + threads: 1 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 8000 + shell: + """ + zcat {input} | awk -f scripts/vcf_2_tsv_vg.awk > {output} + """ + +rule syri_vcf_2_tsv: + """ + Convert all VCF from SyRI into a single TSV. 
[Optional] + + Input : + - VCFs from SyRI (<haplotypes>.syri.mm2.vcf.gz) + + Output : + - TSV (syri_mm2.tsv) [Temporary] + + Threads : 1 + + Memory : n_threads * mem_multiplier * 8Gb + """ + input: + expand("data/asm.syri.mm2/Pan1c."+config['name']+".{haplotype}.syri.mm2.vcf.gz", haplotype=SAMPLES_NOREF) + output: + temp("tmp/var_json/syri_mm2.tsv") + threads: 1 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 8000 + params: + app_path=config['app.path'], + pan_name=config['name'], + refname=config['reference'] + shell: + """ + RHAP=$(basename {params.refname} .fa.gz | cut -f1 -d'.') + RHAPN=$(basename {params.refname} .fa.gz | cut -f2 -d'.' | cut -f2 -d'p') + FOLDER=$(dirname {input[0]}) + + #% SyRI VCF MM2 + ## Going through all folders + for vcf in $FOLDER/*.vcf.gz; do + THAP=$(basename $vcf .syri.mm2.vcf.gz | cut -f3 -d'.') + THAPN=$(basename $vcf .syri.mm2.vcf.gz | cut -f4 -d'.' | cut -f2 -d'p') + + # Producing intermediate TSVs + zcat $vcf | \ + awk -v THAP=$THAP -v THAPN=$THAPN -v RHAP=$RHAP -v RHAPN=$RHAPN -f scripts/vcf_2_tsv_syri.awk \ + > $FOLDER/$(basename $vcf .gz).tsv + done + + ## Merging TSVs + head -n1 $FOLDER/$(basename $vcf .gz).tsv > {output} + tail -n +2 -q $FOLDER/*.vcf.tsv >> {output} + + rm $FOLDER/*.tsv + """ + +def var_json_inputs(wildcards): + """ + Creates the input list for var_json rule. + """ + inputs = {} + inputs["vg"] = expand("tmp/var_json/vg_{gtool}.tsv", gtool=graph_tools) + inputs["syri_mm2"] = "tmp/var_json/syri_mm2.tsv" + + return inputs + +rule var_json: + """ + Produce the variant JSON for Pan1c-View (Variant section) [Optional] + + Input : + - TSVs from VG deconstruct VCF (vg_<gtool>.tsv) + - TSV from SyRI with minimap2 (syri_mm2.tsv) + + Output : + - Variant JSON (Pan1c.<pangenome_name>.var.json) + + Threads : 1 + + Memory : n_threads * mem_multiplier * 48Gb + + Parameters : + - app.path + - reference + """ + input: + unpack(var_json_inputs) + output: + json="output/report_data/Pan1c."+config['name']+".var.json" + threads: 1 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 48000 + benchmark: + "statistics/var_json.txt" + params: + app_path=config['app.path'], + refname=config['reference'] + shell: + """ + apptainer run {params.app_path}/pan1c-env.sif python scripts/var.pan1c_QC.py \ + --input {input.vg} {input.syri_mm2} \ + --output {output.json} + """ + rule create_pan1c_report_fig: - # Produces a markdown report figure of chromosomes graphs + """ + Create a figure combining contig composition of assemblies and odgi 1D figure, for a chromosome graph. 
[Optional]
+
+    Input :
+    - Chromosome graph (.<chromosome>.tmp.gfa)
+    - Contig decomposition figures (.<chromosome>.contig.png)
+
+    Output :
+    - Odgi 1D figure (.<gtool>.<chromosome>.odgi.png) [Temporary]
+    - Name figure (.<gtool>.<chromosome>.name.png)
+    - Contig-Odgi figures (<gtool>.<pangenome_name>.<chromosome>.report.fig.png)
+
+    Threads : 4
+
+    Memory : n_threads * mem_multiplier * 2Gb
+
+    Parameters :
+    - app.path
+    """
     input:
-        graph='data/chrGraphs/'+config['name']+'.{chromosome}.gfa',
-        contigfig="output/chr.contig/{chromosome}.contig.png",
+        graph='data/chrGraphs/Pan1c.{gtool}.'+config['name']+'.{chromosome}.tmp.gfa',
+        contigfig="output/chr.contig/Pan1c."+config['name']+".{chromosome}.contig.png",
     output:
-        odgifig=temp("tmp/{chromosome}.odgi.png"),
-        namefig=temp("tmp/{chromosome}.name.png"),
-        reportfig="output/report/"+config['name']+".{chromosome}.report.fig.png"
+        odgifig=temp("tmp/Pan1c.{gtool}.{chromosome}.odgi.png"),
+        namefig=temp("tmp/Pan1c.{gtool}.{chromosome}.name.png"),
+        reportfig="output/report/Pan1c.{gtool}."+config['name']+".{chromosome}.report.fig.png"
     threads: 4
+    resources:
+        mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 2000
     params:
         app_path=config['app.path']
     shell:
@@ -718,91 +1585,98 @@ rule create_pan1c_report_fig:
         composite -geometry "+790+$ctgheight" {output.odgifig} {output.reportfig} {output.reportfig}
         """

-def get_report_sections(wildcards):
+def Pan1c_view_data_inputs(wildcards):
     """
-    Return 'create_pan1c_report' optional inputs to add them to the final report.
-    For example :
-      - SyRI figures on Assemblies
+    Function to gather all files to include in the Pan1c-View tarball.
+    Inputs depend on the optional parts of the workflow.
     """
-    sections = dict()
+    inputs = list()
+
+    # Adding JSONs
+    inputs += [
+        "output/report_data/Pan1c."+config['name']+".assembly.json",
+        "output/report_data/Pan1c."+config['name']+".graph.json",
+        "output/report_data/Pan1c."+config['name']+".var.json",
+        "output/report_data/Pan1c."+config['name']+".tags.json",
+        "doc/Pan1c.documentation.json"
+    ]

-    sections["metadata"] = "output/pan1c."+config['name']+".gfa.metadata"
-    sections["odgifigs"] = expand("output/report/"+config['name']+".{chromosome}.report.fig.png", chromosome=CHRLIST)
-    sections["genstats"] = "output/stats/pan1c."+config['name']+".chrGraph.general.stats.tsv"
-    sections["pathstats"] = "output/stats/pan1c."+config['name']+".chrGraph.path.stats.tsv"
+    # Adding GFAv1
+    inputs += expand(
+        "output/Pan1c.{gtool}."+config["name"]+".gfa.gz",
+        gtool=graph_tools
+    )
+
+    # Adding optional figures
+    if config["get_contig_pos"] == "True":
+        inputs += expand(
+            "output/chr.contig/Pan1c."+config['name']+".{chromosome}.contig.png",
+            chromosome=CHRLIST
+        )

     if config["get_ASMs_SyRI"] == "True":
-        sections["SyRI_on_ASMs_figs"] = expand(
-            "output/asm.syri.figs/pan1c."+config['name']+".{haplotype}.syri.png",
+        inputs += expand(
+            "output/asm.syri.figs/Pan1c."+config['name']+".{haplotype}.syri_mm2.png",
             haplotype=SAMPLES_NOREF
-        )
+            )

-    return sections
+    if config["get_chrInputs_SyRI"] == "True":
+        inputs += expand(
+            "output/chrInput.syri.figs/Pan1c."+config['name']+".{chromosome}.syri_mm2.png",
+            chromosome=CHRLIST
+        )
+
+    if config["get_VCF"] == "True":
+        inputs += expand(
+            "output/Pan1c.{gtool}."+config["name"]+".{extension}",
+            gtool=graph_tools,
+            extension=["vcf.gz", "xg"]
+        )

-rule create_pan1c_report:
-    # Produces a markdown report of chromosomes graphs
+    # Adding 1D viz
+    inputs += expand(
+        
"output/report/Pan1c.{gtool}."+config['name']+".{chromosome}.report.fig.png", + gtool=graph_tools, chromosome=CHRLIST + ) + + return inputs + +rule Pan1c_View_data: + """ + Generate a final Tarball for Pan1c-View. [Optional] + + Input : + - Tags JSON (.tags.json) + - Assembly JSON (.assembly.json) + - Graph JSON (.graph.json) + - Variant JSON (.var.json) + - Documentation JSON (doc/Pan1c.documentation.json) + - Final graphs (*.gfa.gz) [GFAv1.0] + - Contig-Odgi figures (<gtool>.<pangenome_name>.<chromosome>.report.fig.png) + - Contig decomposition figures (.<chromosome>.contig.png) [Optional] + - SyRI figure on whole haplotypes, with minimap2 (.<haplotype>.syri_mm2.png) [Optional] + - SyRI figure on chromosomes, with minimap2 (<chromosome>.syri_mm2.png) [Optional] + - VCF(s) from VG deconstruct (*.vcf.gz) [Optional] + - XG indexed graphs (*.xg) [Optional] + + Output : + - Tarball (<pangenome_name>.Pan1c_View.tar.gz) + + Threads : 1 + + Memory : 8Gb + """ input: - unpack(get_report_sections) + Pan1c_view_data_inputs output: - report="output/pan1c."+config['name']+".report.md" - threads: 4 - params: - app_path=config['app.path'], - add_SyRI=config['get_ASMs_SyRI'] - run: - shell("touch {output.report}") - - # Adding Summary (made for importing in Joplin) - shell("echo '# Summary' >> {output.report}") - shell("echo '[TOC]' >> {output.report}") - shell("echo '' >> {output.report}") - - # Adding graph construction info - ## WIP - - # Adding metadata - shell("echo '# Graph metadata' >> {output.report}") - shell("echo '```' >> {output.report}") - shell("cat {input.metadata} >> {output.report}") - shell("echo '```' >> {output.report}") - shell("echo '' >> {output.report}") - - # Adding General stats - shell("echo '# General stats' >> {output.report}") - shell("cat {input.genstats} | csv2md -d $'\\t' >> {output.report}") - shell("echo '' >> {output.report}") - - # Adding Path stats - shell("echo '# Path stats' >> {output.report}") - shell("cat {input.pathstats} | csv2md -d $'\\t' >> {output.report}") - shell("echo '' >> {output.report}") - - # Adding chromosomes figures - fig_list = [fig for fig in input.odgifigs] - print(fig_list) - fig_list.sort() - - shell("echo '# Chromosome-scale odgi graphs' >> {output.report}") - for fig in fig_list: - basename=os.path.basename(fig) - chr_name=basename.split('.')[1] - - shell("echo '## {chr_name}' >> {output.report}") - shell("echo '' >> {output.report}") - shell("echo '' >> {output.report}") - - # Adding SyRI figs if produced - if params.add_SyRI: - - fig_list = [fig for fig in input.SyRI_on_ASMs_figs] - fig_list.sort() - - shell("echo '# SyRI on input assemblies' >> {output.report}") - for fig in fig_list: - basename = os.path.basename(fig) - hap_name = basename.split('.')[2:4] - - shell("echo '## {hap_name[0]}, {hap_name[1]}' >> {output.report}") - shell("echo '' >> {output.report}") - - shell("echo '' >> {output.report}") \ No newline at end of file + "output/"+config['name']+".Pan1c_View.tar.gz" + threads: 1 + resources: + mem_mb = 8000 + shell: + """ + tar --transform 's/output/data/' \ + --transform 's/report\//odgifigs\//' \ + --transform 's/data\/report_data\///' \ + -czvf {output} {input} + """ diff --git a/config.yaml b/config.yaml index 0a8502abe2dabedd0b05e95e023f130705635997..ebca56adae841503605225a986c3a2876611f9b8 100644 --- a/config.yaml +++ b/config.yaml @@ -1,33 +1,47 @@ -## Main parameters +#% Main parameters # Pangenome name name: '<pangenome name>' # Reference fasta (BGziped) reference: '<reference_name>' # Directory of apptainer images 
(downloaded with getApps.sh) app.path: '<path>' +# Memory multiplier (increase when OOM). Formula : job_default_mem * n_retries * mem_multiplier * threads +mem_multiplier: 1 -# Core parameters +#% RagTag parameters (see: https://github.com/malonge/RagTag/wiki/scaffold) +# Main RagTag parameters +ragtag_args: '-i 0.4' +# Minimap2 parameters piped through RagTag +ragtag_mm2_conf: '-x asm5' +## Add -f 0.02 for large genomes + +#% PGGB parameters +run_PGGB: 'True' # Wfmash alignement parameters : wfmash.segment_length: 10000 wfmash.mapping_id: 95 -wfmash.secondary: '-k 19 -H 0.001 -X' - +wfmash.secondary: '-k 19 -H 0.001' # Seqwish parameters seqwish.params: '-B 10000000 -k 19 -f 0' -# Odgi 1D and path coverage viz parameters +#% MC parameters +run_MC: 'False' +# MC arguments passed to 'cactus-pangenome' +MC.params: '--clip 0 --filter 0' + +#% Odgi 1D and path coverage viz parameters odgi.1Dviz.params: '-x 2000 -b' odgi.pcov.params: '-x 2000 -O' -## Optional parts of the workflow -# Running Quast to get statistics on input haplotypes -run_Quast: 'True' +#% Optional parts of the workflow +# Running Quast to get more statistics on input haplotypes +run_Quast: 'False' # Getting figures showing chromosome decomposition into contigs get_contig_pos: 'True' -# Computes Presence Absence Variant matrices for Panache (not recommended; very long) -get_PAV: 'False' # Computes SyRI figures for haplotypes get_ASMs_SyRI: 'False' # Haplotype vs Reference get_chrInputs_SyRI: 'False' # SyRI on chrInputs -# Creating final report -create_report: 'True' +# Producing VCF from Pangenome graph +get_VCF: 'False' +# Creating Pan1c_View tarball +Pan1c-View_jsons: 'True' diff --git a/doc/Pan1c.documentation.json b/doc/Pan1c.documentation.json new file mode 100644 index 0000000000000000000000000000000000000000..75cbc85862d6225e3e7579263adcf4ebfb91a4b2 --- /dev/null +++ b/doc/Pan1c.documentation.json @@ -0,0 +1,67 @@ +{ + "Summary": { + "General Information": "Pan1c (Pangenome at Chromosome Scale) is a Snakemake workflow designed to simplify the construction and quality assessment of pangenome graphs. The workflow involves splitting the overall graph construction into chromosome-level graph construction, followed by concatenation. Thus, only intra-chromosomal variation is represented. Inter-chromosomal variation is not embedded into the graph but can be inferred when mapping onto it. RagTag is used to scaffold input assemblies and allows for clustering inputs by chromosome sequences.", + "Files": { + "GFAv1": "Graphical Fragment Assembly v1.0: Haplotypes are represented as P-lines. See https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md for more info.", + "XG": "Indexed graph used with the VG toolkit. See https://github.com/vgteam/vg/wiki/File-Formats#xg-xg-lightweight-graph--path-index for more info.", + "VCF": "Variant Calling Format generated using 'vg deconstruct' on the final graph. The reference is the reference haplotype set in the workflow config file." + } + }, + "Assemblies": { + "Assemblathon": "Assemblathon is run on the input assemblies (before RagTag) to evaluate their quality.", + "Chromosome Length": "The length is retrieved from the FAI computed on clustered sequences (after RagTag).", + "Contig Position": "Each scaffolded haplotype (after RagTag) is split into its contigs (using N repeats as contig borders). Transitions between contigs are represented with gray color changes. 
A 'black' region indicates the presence of many small contigs.", + "SyRI": "SyRI detects structural variations between two genomes. It uses an alignment file (produced by Minimap2 here). Comparisons are pairwise between two haplotypes." + }, + "Graph": { + "General Metrics": { + "Description": "Metrics are computed using GFAstats.py (included in the PanGeTools Apptainer image).", + "Metrics": { + "Path.length": "Length of the path.", + "Path.nodes.count": "Number of unique node_ids in the path (excluding repeats).", + "Path.private.nodes.count": "Number of unique node_ids traversed only by this path (excluding repeats).", + "Path.core.nodes.count": "Number of node_ids traversed by all paths (excluding repeats).", + "Path.private.length": "Sum of the lengths of private nodes in the path (excluding repeats).", + "Path.core.length": "Sum of the lengths of core nodes in the path (excluding repeats).", + "Path.private.R.length": "Sum of the lengths of private nodes in the path (including repeats).", + "Path.core.R.length": "Sum of the lengths of core nodes in the path (including repeats).", + "Path.steps.count": "Number of nodes in the path (including repeats).", + "Path.nodes.R.size.mean": "Mean length of nodes in the path (including repeats).", + "Path.nodes.size.mean": "Mean length of nodes in the path (excluding repeats).", + "Path.nodes.R.size.median": "Median length of nodes in the path (including repeats).", + "Path.nodes.size.median": "Median length of nodes in the path (excluding repeats).", + "Path.degree.mean": "Mean degree of path nodes (excluding repeats).", + "Path.private.degree.mean": "Mean degree of private nodes in the path (excluding repeats).", + "Path.core.degree.mean": "Mean degree of core nodes in the path (excluding repeats).", + "Nodes.count": "Number of unique node_ids in the graph (excluding repeats).", + "Edges.count": "Number of edges in the graph (excluding repeats).", + "Path.count": "Number of paths in the graph.", + "Path.length.mean": "Mean path length.", + "Path.length.median": "Median path length.", + "Nodes.private.count": "Number of private nodes (excluding repeats).", + "Nodes.core.count": "Number of core nodes (excluding repeats).", + "Steps.count": "Total number of nodes required to obtain all complete paths (including repeats).", + "Total.nodes.length": "Total length of nodes (excluding repeats).", + "Total.sequence.length": "Total sequence length stored in the graph.", + "Compression.factor": "Size reduction factor between total sequence length and total node length.", + "Nodes.length.mean": "Mean length of nodes (excluding repeats).", + "Nodes.length.median": "Median length of nodes (excluding repeats).", + "Degree.mean": "Mean number of edges per node. Edges are counted twice for start and end nodes.", + "Degree.median": "Median number of edges per node. Edges are counted twice for start and end nodes." + } + }, + "Path Composition": "Using GFAstats.py, nodes are categorized into three groups at the chromosome scale: Core, Private, Other. Core corresponds to nodes traversed by all paths (haplotypes) from the chromosome. Private corresponds to nodes traversed by only one path (haplotype). Other includes the remaining nodes. For a given path, the proportion of its length by group is computed using the sequence length of the corresponding nodes, including repetition.", + "Shared Content": { + "Description": "At the chromosome scale, the list of node_ids shared between each pair of paths (haplotypes) is computed. 
Several metrics are derived from this list.", + "Metrics": { + "Shared.nodes.count": "Number of nodes shared between a given pair of paths.", + "Shared.length": "Sum of the sequence lengths of shared nodes (excluding repeats).", + "Shared.R.length": "Sum of the sequence lengths of shared nodes (including repeats).", + "Shared.prop": "Proportion of the path that is shared (excluding repeats).", + "Shared.R.prop": "Proportion of the path that is shared (including repeats)." + } + }, + "Odgi Figures": "The contig position figure from the Assembly section is superimposed with the Odgi 1D visualization for each chromosome graph. Coordinates are not strictly identical, as the Odgi figure aims to identify similar regions (indicated by corresponding blocks of color) and INDEL regions (indicated by a white line in a given haplotype), which can introduce gaps in haplotypes." + }, + "Variants": "Variants are called from graphs (PGGB or MC) using 'vg deconstruct' and from linear haplotypes using SyRI. Only variants between 50bp and 100kbp in size are kept for statistics and comparisons. SyRI represents the classic method of detecting variants. In both cases, the reference is the same: the reference haplotype set in the workflow config file." +} diff --git a/example/config_CICD.yaml b/example/config_CICD.yaml index cc0c279fee81b7a6f5c2f430f2af704756ed58bc..4ba75da839f492af1415f73ff525265b1ee08bd1 100644 --- a/example/config_CICD.yaml +++ b/example/config_CICD.yaml @@ -5,30 +5,41 @@ name: '03SC_CICD' reference: 'R64.hap1.fa.gz' # Directory of apptainer images (downloaded with getApps.sh) app.path: 'appimgs/' +# Memory multiplier (increase when OOM). Formula : job_default_mem * n_retries * mem_multiplier * threads +mem_multiplier: 1 -# Core parameters +#% RagTag parameters (see: https://github.com/malonge/RagTag/wiki/scaffold) +ragtag_args: '-i 0.4' +ragtag_mm2_conf: '-x asm5' +## Add -f 0.02 for large genomes + +#% PGGB parameters +run_PGGB: 'True' # Wfmash alignement parameters : wfmash.segment_length: 5000 -wfmash.mapping_id: 90 -wfmash.secondary: '-k 19 -H 0.001 -X' - +wfmash.mapping_id: 95 +wfmash.secondary: '-k 19 -H 0.001' # Seqwish parameters seqwish.params: '-B 10000000 -k 19 -f 0' -# Odgi 1D and path coverage viz parameters -odgi.1Dviz.params: '-x 2000 -a 25 -b' -odgi.pcov.params: '-x 2000 -a 25 -O' +#% MC parameters +run_MC: 'False' +# MC arguments passed to 'cactus-pangenome' +MC.params: '--clip 0 --filter 0' + +#% Odgi 1D and path coverage viz parameters +odgi.1Dviz.params: '-x 2000 -b' +odgi.pcov.params: '-x 2000 -O' -## Optional parts of the workflow -# Running Quast to get statistics on input haplotypes +#% Optional parts of the workflow +# Running Quast to get more statistics on input haplotypes run_Quast: 'True' # Getting figures showing chromosome decomposition into contigs get_contig_pos: 'True' -# Computes Presence Absence Variant matrices for Panache (not recommended; very long) -get_PAV: 'False' # Computes SyRI figures for haplotypes -#get_allASM_SyRI: 'False' # All vs all get_ASMs_SyRI: 'True' # Haplotype vs Reference get_chrInputs_SyRI: 'True' # SyRI on chrInputs -# Creating final report -create_report: 'True' +# Producing VCF from Pangenome graph +get_VCF: 'True' +# Creating Pan1c_View tarball +Pan1c-View_jsons: 'True' diff --git a/getApps.sh b/getApps.sh index 1ba6cc919c6864b39ab607850d42f0dbf3fd5edc..6162a6a2281375091586c2a5419ba64b46b145f2 100755 --- a/getApps.sh +++ b/getApps.sh @@ -17,5 +17,5 @@ done # Script apptainer pull $appdir/PanGeTools.sif 
oras://registry.forgemia.inra.fr/alexis.mergez/pangetools/pangetools:latest apptainer pull $appdir/pan1c-env.sif oras://registry.forgemia.inra.fr/alexis.mergez/pan1capps/pan1cenv:latest -apptainer pull $appdir/pggb.sif oras://registry.forgemia.inra.fr/alexis.mergez/pangratools/pggb:latest -apptainer pull $appdir/pan1c-box.sif oras://registry.forgemia.inra.fr/alexis.mergez/pan1capps/pan1cbox:latest +apptainer pull $appdir/pan1c-box.sif oras://registry.forgemia.inra.fr/alexis.mergez/pan1capps/pan1cbox:latest +apptainer pull $appdir/minigraph-cactus_v2.7.0.sif oras://registry.forgemia.inra.fr/alexis.mergez/pangratools/minigraph-cactus:latest diff --git a/rules/tools.smk b/rules/tools.smk index 53fe1ff12207014832f44559b92bf28efd907294..db1f56678a5fa10c68cea68b31e56ac11fb77f33 100644 --- a/rules/tools.smk +++ b/rules/tools.smk @@ -1,3 +1,9 @@ +""" +Functions --------------------------------------------------------------------------------------- +""" +def get_mem_mb(wildcards, attempt, threads, multiplier=config["mem_multiplier"]): + return attempt * multiplier * threads + """ Tool rules --------------------------------------------------------------------------------------- """ @@ -9,6 +15,9 @@ rule samtools_index: "{sample}.fa.gz.fai", "{sample}.fa.gz.gzi" threads: 1 + retries: 1 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 2000 params: app_path=config["app.path"] shell: @@ -22,10 +31,53 @@ rule run_bgzip: output: "{file}.gz" threads: 4 + retries: 1 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 8000 params: app_path=config["app.path"] shell: """ apptainer run --app bgzip {params.app_path}/PanGeTools.sif \ -@ {threads} {input} + """ + +rule decompress_graph: + # Decompressing graph if required + input: + "{file}.gfa.gz" + output: + temp("{file}.tmp.gfa") + threads: 1 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 8000 + params: + app_path=config["app.path"] + shell: + """ + gzip -d -c {input} > {output} + """ + +rule gfa_2_xg: + # Convert a GFA to XG + input: + "{graph}.tmp.gfa" + output: + "{graph}.xg" + threads: 8 + resources: + mem_mb = lambda wildcards, threads: threads * config["mem_multiplier"] * 8000 + params: + app_path=config["app.path"] + shell: + """ + # Converting to GFAv1.1 + apptainer run --app gfavc {params.app_path}/PanGeTools.sif \ + --gfa {input} \ + --outName "{output}.v11.gfa" + + apptainer run --app vg {params.app_path}/PanGeTools.sif \ + convert -g -x -t {threads} "{output}.v11.gfa" > {output} + + rm "{output}.v11.gfa" """ \ No newline at end of file diff --git a/runSnakemakeSLURM.sh b/runSnakemakeSLURM.sh new file mode 100755 index 0000000000000000000000000000000000000000..3a9616e5ff0fc979b540dcb2818302ed5eb258f0 --- /dev/null +++ b/runSnakemakeSLURM.sh @@ -0,0 +1,24 @@ +#!/bin/bash +#SBATCH --cpus-per-task=1 +#SBATCH -p unlimitq +#SBATCH -o <log path>.log +#SBATCH -J <jname> +#SBATCH --mem=24G +#SBATCH --mail-type=BEGIN,END,FAIL +#SBATCH --mail-user=<mail> + +module purge +module load containers/Apptainer/1.2.5 +module load devel/Miniconda/Miniconda3 + +source activate <path_to_pan1c_env> + +apppath=<path_to_pan1c-box> + +# Creating DAG +echo "Creating DAG ..." +snakemake -c $(nproc) --dag | dot -Tsvg > workflow.svg +# Running the workflow +echo "Running the workflow ..." 
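+# Optional sanity check (illustrative addition, standard Snakemake flags): a dry
+# run previews the jobs that would be submitted without launching anything:
+#   snakemake -n --executor slurm --jobs 10
+# '--jobs 10' caps the number of SLURM jobs submitted concurrently; raise it if
+# your scheduler allows more simultaneous submissions.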
+/usr/bin/time -v -o whole.run.time.log snakemake --executor slurm --jobs 10 + diff --git a/scripts/getSyriFigs.sh b/scripts/Syri.figs_mm2.sh similarity index 85% rename from scripts/getSyriFigs.sh rename to scripts/Syri.figs_mm2.sh index 84f93c2bf82f8bff51581690f4c6ef7b85351ffa..08e1ba7aa6aafe3c88e4dda5b64a4246eca6d0ad 100755 --- a/scripts/getSyriFigs.sh +++ b/scripts/Syri.figs_mm2.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Create a Syri figure for the given genomes +# Create a Syri figure for the given genomes using minimap2 # @author: alexis.mergez@inrae.fr # Initializing arguments @@ -11,9 +11,12 @@ wrkdir="" # Working directory (directory used by pggb to store step files output="" # Output Syri figure(s) height=16 # Figure height width=9 # Figure width +fontsize=12 +space="0.7" # Space for homologous chromosomes +config="" # Plotsr config file ## Getting arguments -while getopts "r:q:a:t:d:o:h:w:" option; do +while getopts "r:q:a:t:d:o:h:w:f:s:c:" option; do case "$option" in r) ref="$OPTARG";; q) qry="$OPTARG";; @@ -23,7 +26,10 @@ while getopts "r:q:a:t:d:o:h:w:" option; do o) output="$OPTARG";; h) height="$OPTARG";; w) width="$OPTARG";; - \?) echo "Usage: $0 [-r ref] [-q query] [-a appdir] [-t threads] [-d wrkdir] [-o output] [-h height] [-w width]" >&2 + f) fontsize="$OPTARG";; + s) space="$OPTARG";; + c) config="$OPTARG";; + \?) echo "Usage: $0 [-r ref] [-q query] [-a appdir] [-t threads] [-d wrkdir] [-o output] [-h height] [-w width] [-f fontsize] [-s space] [-c config]" >&2 exit 1;; esac done @@ -98,11 +104,11 @@ done # Each line contains 2 columns : fasta filepath and its simpler name echo -e "#files\tname" > $wrkdir/genomes.txt for asm in "${asmList[@]}"; do - echo -e "$wrkdir/$(basename ${asm} .gz)\t$(basename $asm .fa.gz | cut -d'.' -f1)" >> $wrkdir/genomes.txt + echo -e "$wrkdir/$(basename ${asm} .gz)\t$(basename $asm .fa.gz | cut -d'.' -f1)#$(basename $asm .fa.gz | cut -d'.' -f2 | cut -d'p' -f2)" >> $wrkdir/genomes.txt done # Generating the plotsr command -command="--genomes $wrkdir/genomes.txt -o $wrkdir/$output -f 12 -H $height -W $width -d 600 " +command="--genomes $wrkdir/genomes.txt -o $wrkdir/$output -f $fontsize -S $space -H $height -W $width -d 600 --cfg $config " # Adding syri files to the command as each needs to be specified using "--sr" argument for file in "${syriFileList[@]}"; do diff --git a/scripts/Syri.figs_wfm.sh b/scripts/Syri.figs_wfm.sh new file mode 100755 index 0000000000000000000000000000000000000000..0bbd825a111f22fc9112c346c796650152546b4a --- /dev/null +++ b/scripts/Syri.figs_wfm.sh @@ -0,0 +1,133 @@ +#!/bin/bash +# Create a Syri figure for the given genomes using wfmash +# @author: alexis.mergez@inrae.fr + +# Initializing arguments +ref="" # Reference fasta +qry="" # Queries fasta +appdir="" # Directory containing apptainer images +threads="" +wrkdir="" # Working directory (directory used by pggb to store step files like .paf, etc...) 
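+# Example invocation (hypothetical file names, for illustration only):
+#   ./Syri.figs_wfm.sh -r CHM13.hap1.fa.gz -q "A.hap1.fa.gz B.hap1.fa.gz" \
+#       -a appimgs/ -t 8 -d wrkdir/ -o Pan1c.example.syri_wfm.png -c plotsr.cfg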
+output=""   # Output Syri figure(s)
+height=16   # Figure height
+width=9     # Figure width
+fontsize=12
+space="0.7"    # Space for homologous chromosomes
+config=""   # Plotsr config file
+wfm_args="-p 95 -s 1000 -l 5000 -k19 -H 0.001" # wfmash arguments (override with -m)

+## Getting arguments
+while getopts "r:q:a:t:d:o:h:w:f:s:c:m:" option; do
+    case "$option" in
+        r) ref="$OPTARG";;
+        q) qry="$OPTARG";;
+        a) appdir="$OPTARG";;
+        t) threads="$OPTARG";;
+        d) wrkdir="$OPTARG";;
+        o) output="$OPTARG";;
+        h) height="$OPTARG";;
+        w) width="$OPTARG";;
+        f) fontsize="$OPTARG";;
+        s) space="$OPTARG";;
+        c) config="$OPTARG";;
+        m) wfm_args="$OPTARG";;
+        \?) echo "Usage: $0 [-r ref] [-q query] [-a appdir] [-t threads] [-d wrkdir] [-o output] [-h height] [-w width] [-f fontsize] [-s space] [-c config] [-m wfmash arguments]" >&2
+        exit 1;;
+    esac
+done

+## Main script
+# Reading the query argument and creating an array containing the paths to the query fasta files
+IFS=' ' read -r -a temp <<< "$qry"
+readarray -td '' temp_sorted < <(printf '%s\0' "${temp[@]}" | sort -z)

+#echo "Query : $qry"
+#echo "tempArr : ${temp[@]}"

+asmList=("$ref")

+# Sorting the array to put the reference first
+for item in "${temp_sorted[@]}"; do
+    if [[ $item != "$ref" ]]; then
+        asmList+=($item)
+    fi
+done

+#echo "The array : ${asmList[@]}"

+# Array to store the created syri files
+syriFileList=()

+# Iterating two by two, with overlap, over the array of fasta files
+for ((i = 0; i < ${#asmList[@]} - 1; i++)); do

+    # Setting filepaths for later
+    bamFile="$(basename ${asmList[i]} .fa.gz)_$(basename ${asmList[i + 1]} .fa.gz).bam"
+    syriFile="$(basename ${asmList[i]} .fa.gz)_$(basename ${asmList[i + 1]} .fa.gz).syri.out"
+    hapID=$(basename ${asmList[i + 1]})
+    refID=$(basename ${asmList[i]})

+    # Debug
+    echo -e "\n[Debug::SyRI_script::$hapID] hapID: $hapID\trefID: $refID\n"

+    # Adding the output syri file to the array
+    syriFileList+=($syriFile)

+    # Pruning fasta files based on common chromosomes
+    apptainer run $appdir/pan1c-env.sif python scripts/syri_pruning.py \
+        -r ${asmList[i]} -f ${asmList[i + 1]} -o $wrkdir -d

+    #echo "\n[Debug::SyRI_script::$hapID] REF"
+    #grep '>' $wrkdir/$(basename ${asmList[i]} .gz)
+    #echo "\n[Debug::SyRI_script::$hapID] QRY"
+    #grep '>' $wrkdir/$(basename ${asmList[i + 1]} .gz)

+    # Renaming chromosomes with same ids as the reference (Not working)
+    #sed -i "s/${hapID}#chr/${refID}#chr/g" $wrkdir/$(basename ${asmList[i + 1]} .gz)

+    # Compressing and indexing
+    apptainer run $appdir/PanGeTools.sif bgzip \
+        -@ $threads -k $wrkdir/$(basename ${asmList[i]} .gz)
+    apptainer run $appdir/PanGeTools.sif bgzip \
+        -@ $threads -k $wrkdir/$(basename ${asmList[i+1]} .gz)
+    apptainer run --app samtools $appdir/PanGeTools.sif \
+        faidx $wrkdir/$(basename ${asmList[i]})
+    apptainer run --app samtools $appdir/PanGeTools.sif \
+        faidx $wrkdir/$(basename ${asmList[i+1]})

+    # Wfmash genome vs genome alignment, using the arguments from $wfm_args
+    apptainer run --app wfmash $appdir/PanGeTools.sif \
+        $wrkdir/$(basename ${asmList[i]}) $wrkdir/$(basename ${asmList[i + 1]}) \
+        $wfm_args \
+        -n $(grep '>' $wrkdir/$(basename ${asmList[i]} .gz) | wc -l) \
+        -t $threads -a | sed "s/_[0-9]//g" > $wrkdir/$bamFile.sam

+    apptainer run --app samtools $appdir/PanGeTools.sif sort -O BAM -@ $threads $wrkdir/$bamFile.sam > $wrkdir/$bamFile
+    rm $wrkdir/$bamFile.sam
+    #rm $wrkdir/*.fa

+    # Syri on the previous alignment
+    apptainer run $appdir/pan1c-env.sif \
+        syri -c $wrkdir/$bamFile -r $wrkdir/$(basename ${asmList[i]} .gz) -q
$wrkdir/$(basename ${asmList[i + 1]} .gz) -k -F B \ + --nc $threads \ + --dir $wrkdir --prefix "$(basename ${asmList[i]} .fa.gz)_$(basename ${asmList[i + 1]} .fa.gz)." +done + +# Creating genomes.txt for plotsr. It is used to give simple names to fasta files in the final figure +# Each line contains 2 columns : fasta filepath and its simpler name +echo -e "#files\tname" > $wrkdir/genomes.txt +for asm in "${asmList[@]}"; do + echo -e "$wrkdir/$(basename ${asm} .gz)\t$(basename $asm .fa.gz | cut -d'.' -f1)#$(basename $asm .fa.gz | cut -d'.' -f2 | cut -d'p' -f2)" >> $wrkdir/genomes.txt +done + +# Generating the plotsr command +command="--genomes $wrkdir/genomes.txt -o $wrkdir/$output -f $fontsize -S $space -H $height -W $width -d 600 --cfg $config " + +# Adding syri files to the command as each needs to be specified using "--sr" argument +for file in "${syriFileList[@]}"; do + command+="--sr $wrkdir/$file " +done + +# Running plotsr +apptainer run $appdir/pan1c-env.sif plotsr \ + $command diff --git a/scripts/VCF.stats_figs.py b/scripts/VCF.stats_figs.py new file mode 100644 index 0000000000000000000000000000000000000000..adad390ad7c1260fdfe266a05bfe30d9d3dd8b17 --- /dev/null +++ b/scripts/VCF.stats_figs.py @@ -0,0 +1,332 @@ +""" +VCF statistics figure script for Pan1c workflow + +Use aggregated TSV produced with vcf_2_tsv_syri.awk and vcf_2_tsv_vg.awk + +@author: alexis.mergez@inrae.fr +@version: 1.0 +""" + +#% Librairies +import os +import pandas as pd +import matplotlib.pyplot as plt +from matplotlib.lines import Line2D +import seaborn as sns +import argparse + +#% Parsing arguments +arg_parser = argparse.ArgumentParser(description='VCF statistic figures for Pan1c workflow') +arg_parser.add_argument( + "--vg", + dest = "vg", + required = True, + help = "Input TSV containing vg variants" + ) +arg_parser.add_argument( + "--syri", + dest = "syri", + required = True, + help = "Input TSV containing syri detected variants" + ) +arg_parser.add_argument( + "--output_dir", + dest = "dir", + required = True, + help = "Output dir for figures" + ) +arg_parser.add_argument( + "--panname", + "-p", + dest = "panname", + required = True, + help = "Pangenome name" + ) +args = arg_parser.parse_args() + +#% Loading and preparing data +## VG dataset +vg = pd.read_csv(args.vg, sep="\t") +vg.rename(columns={"CHROM": "NAME"}, inplace=True) +vg[["QUERY", "CHROM"]] = vg["NAME"].str.rsplit("#", n=1, expand=True) +vg[["Genome", "Haplotype"]] = vg["HAP"].str.rsplit("#", n=1, expand=True) +vg.set_index(["CHROM", "QUERY", "HAP"], inplace=True) +vg.drop("NAME", axis=1, inplace=True) +vg.loc[:,"LEN"] = -vg["LEN"] + +## Syri dataset +sr = pd.read_csv(args.syri, sep="\t") +sr.rename(columns={"CHROM": "NAME"}, inplace=True) +sr[["QUERY", "CHROM"]] = sr["NAME"].str.rsplit("#", n=1, expand=True) +sr[["Genome", "Haplotype"]] = sr["HAP"].str.rsplit("#", n=1, expand=True) +sr.set_index(["CHROM", "QUERY", "HAP"], inplace=True) +sr.drop("NAME", axis=1, inplace=True) +sr.loc[:,"LEN"] = -sr["LEN"] + +## Getting ref name +REF = vg.index.unique("QUERY")[0].split("#")[0] + +#% Figure settings +## CMAP & Genome order +hue_order = sorted(vg.index.unique("HAP")) + +# Creating color palette based on genome instead of haplotype +genomes = sorted(vg["Genome"].unique()) +col_values = sns.color_palette(n_colors=len(genomes)) +colors = dict(zip(genomes, col_values)) +cmap = { + f"{name}#{hap}": color + for name, color in colors.items() + for hap in range(1, int(vg["Haplotype"].max())+1) +} + +#% Figures funtion +# General histogram +def 
hist_general(data, title, savedir, hue_order=hue_order, colors=colors, cmap=cmap, genomes=genomes):
+    INS = data.query("LEN > 50 & LEN < 100000").rename(columns={"LEN": "INS"})
+    DEL = data.query("LEN < -50 & LEN > -100000").rename(columns={"LEN": "DEL"})
+    DEL.loc[:,"DEL"] = -DEL["DEL"]
+
+    LS = ['solid', 'dashed', 'dashdot', 'dotted']
+
+    fig, (delax, insax) = plt.subplots(1, 2, gridspec_kw={'wspace': 0}, figsize=(18, 9), dpi=300, sharey=True)
+
+    # Plotting deletions
+    sns.histplot(
+        data=DEL,
+        x="DEL",
+        hue="HAP",
+        hue_order=hue_order,
+        palette = cmap,
+        binwidth=0.05,
+        log_scale=True,
+        element='poly',
+        fill=False,
+        ax=delax,
+        legend=False
+    )
+
+    # Reversing xaxis on DEL
+    delax.set(xlim=delax.get_xlim()[::-1])
+    sns.despine(ax=delax)
+    delax.grid()
+
+    # Plotting insertions
+    sns.histplot(
+        data=INS,
+        x="INS",
+        hue="HAP",
+        hue_order=hue_order,
+        palette = cmap,
+        binwidth=0.05,
+        log_scale=True,
+        element='poly',
+        fill=False,
+        ax=insax,
+        legend=False
+    )
+
+    insax.grid()
+
+    # Changing linestyle based on haplotype id
+    for ax in [insax, delax]:
+        for line, hap in zip(ax.lines, hue_order):
+            # Apply the linestyle according to the haplotype ID
+            line.set_linestyle(LS[int(hap.split('#')[-1])-1])
+
+    # Custom legend
+    genome_legend_elements = [
+        Line2D([0],[0], color=colors[Genome], linestyle="solid", label=Genome)
+        for Genome in genomes
+    ]
+    hap_legend_elements = [
+        Line2D([0],[0], color="black", linestyle=LS[i-1], label=i)
+        for i in range(1, int(data["Haplotype"].max())+1)
+    ]
+    leg_handles = [Line2D([],[], color="w", label="\n$\\bf{Haplotypes}$")]+genome_legend_elements+[Line2D([],[], color="w", label="\n$\\bf{Hap ID}$")]+hap_legend_elements
+    insax.legend(handles = leg_handles, ncol=2, frameon=False)
+
+    # Y axis to the right
+    insax.yaxis.tick_right()
+
+    # Changing labels of xTicks
+    insax.set_xticks([100, 1000, 10000, 100000])
+    insax.set_xticklabels(["100bp", "1kbp", "10kbp", "100kbp"])
+    delax.set_xticks([100, 1000, 10000, 100000])
+    delax.set_xticklabels(["-100bp", "-1kbp", "-10kbp", "-100kbp"])
+
+    sns.despine(ax=insax, right=False)
+
+    # Moving legend outside the figure
+    sns.move_legend(insax, "upper left", bbox_to_anchor=(1, 1))
+
+    fig.suptitle(title, fontsize=15 , fontweight="bold")
+
+    plt.tight_layout()
+
+    plt.savefig(savedir)
+    plt.close()
+
+# Genome-specific histograms
+def hist_genome(data, title, savedir):
+    INS = data.query("LEN > 50 & LEN < 100000").rename(columns={"LEN": "INS"})
+    DEL = data.query("LEN < -50 & LEN > -100000").rename(columns={"LEN": "DEL"})
+    DEL.loc[:,"DEL"] = -DEL["DEL"]
+
+    hue_order = sorted(data.index.unique("HAP"))
+    genomes = sorted(data["Genome"].unique())
+    genomes = [(tool, name) for name in genomes for tool in data.index.unique("TOOL")]
+    col_values = sns.color_palette(n_colors=len(genomes)*2)
+    colors = dict(zip(genomes, col_values))
+    cmap = {
+        cur_tool: {
+            f"{name}#{hap}": color
+            for (tool, name), color in colors.items()
+            for hap in range(1, int(data["Haplotype"].max())+1)
+            if tool == cur_tool
+        }
+        for cur_tool in data.index.unique("TOOL")
+    }
+
+    LS = ['solid', 'dashed', 'dashdot', 'dotted']
+
+    fig, (delax, insax) = plt.subplots(1, 2, gridspec_kw={'wspace': 0}, figsize=(18, 9), dpi=300, sharey=True)
+
+    # Plotting deletions
+    sns.histplot(
+        data=DEL.query("TOOL == 'VG'"),
+        x="DEL",
+        hue="HAP",
+        hue_order=hue_order,
+        palette = cmap['VG'],
+        binwidth=0.05,
+        log_scale=True,
+        element='poly',
+        fill=False,
+        ax=delax,
+        legend=False
+    )
+    sns.histplot(
+        data=DEL.query("TOOL == 'Syri'"),
+        x="DEL",
+        hue="HAP",
+        
hue_order=hue_order, + palette = cmap['Syri'], + binwidth=0.05, + log_scale=True, + element='poly', + fill=False, + ax=delax, + legend=False + ) + + # Reversing xaxis on DEL + delax.set(xlim=delax.get_xlim()[::-1]) + sns.despine(ax=delax) + delax.grid() + + # Plotting insertions + sns.histplot( + data=INS.query("TOOL == 'VG'"), + x="INS", + hue="HAP", + #hue_order=hue_order, + palette = cmap['VG'], + binwidth=0.05, + log_scale=True, + element='poly', + fill=False, + ax=insax, + legend=False + ) + sns.histplot( + data=INS.query("TOOL == 'Syri'"), + x="INS", + hue="HAP", + #hue_order=hue_order, + palette = cmap['Syri'], + binwidth=0.05, + log_scale=True, + element='poly', + fill=False, + ax=insax, + legend=False + ) + + insax.grid() + + # Changing linestyle based on haplotype id + for ax in [insax, delax]: + for line, hap in zip(ax.lines, 2*hue_order): + # Applying style according to haplotype ID + line.set_linestyle(LS[int(hap.split('#')[-1])-1]) + + # Custom legend + tool_legend_elements = [ + Line2D([0],[0], color=colors[(tool, Genome)], linestyle="solid", label=tool) + for tool, Genome in genomes + ] + + hap_legend_elements = [ + Line2D([0],[0], color="black", linestyle=LS[i-1], label=i) + for i in range(1, int(data["Haplotype"].max())+1) + ] + leg_handles = [Line2D([],[], color="w", label="\n$\\bf{Tool}$")]+tool_legend_elements+[Line2D([],[], color="w", label="\n$\\bf{Hap ID}$")]+hap_legend_elements + + insax.legend(handles = leg_handles, frameon=False) + + # Y axis to the right + insax.yaxis.tick_right() + + # Changing labels of xTicks + insax.set_xticks([100, 1000, 10000, 100000]) + insax.set_xticklabels(["100bp", "1kbp", "10kbp", "100kbp"]) + delax.set_xticks([100, 1000, 10000, 100000]) + delax.set_xticklabels(["-100bp", "-1kbp", "-10kbp", "-100kbp"]) + + sns.despine(ax=insax, right=False) + + # Moving legend outside the figure + sns.move_legend(insax, "upper left", bbox_to_anchor=(1, 1)) + + fig.suptitle(title, fontsize=15 , fontweight="bold") + + plt.tight_layout() + + plt.savefig(savedir) + plt.close() + +#% Generating figures +# General histograms +hist_general( + vg, + f"Pan1c - {args.panname} - {REF} - VG", + os.path.join(args.dir, f"pan1c.{args.panname}.General.vcf.vg.png") +) +hist_general( + sr, + f"Pan1c - {args.panname} - {REF} - Syri", + os.path.join(args.dir, f"pan1c.{args.panname}.General.vcf.syri.png") +) + +# Genome specific histograms +# Iterating over haplotypes +for genome in sorted(vg["Genome"].unique()): + + # Concatening data from both datasets + data = pd.concat( + { + "VG": vg.query("Genome == @genome"), + "Syri": sr.query("Genome == @genome") + }, + names=["TOOL"] + ) + + print(data) + + # Creating the figure + hist_genome( + data, + f"Pan1c - {args.panname} - {REF} - {genome}", + os.path.join(args.dir, f"pan1c.{args.panname}.{genome}.vcf.both.png") + ) \ No newline at end of file diff --git a/scripts/asm.pan1c_QC.py b/scripts/asm.pan1c_QC.py new file mode 100644 index 0000000000000000000000000000000000000000..8ba7cecf4072485471ff26df0104829cab8489de --- /dev/null +++ b/scripts/asm.pan1c_QC.py @@ -0,0 +1,159 @@ +""" +Assembly JSON creator for Pan1c-QC + +@author: alexis.mergez@inrae.fr +@version: 1.2 +""" + +import os +import argparse +import pandas as pd +import json + +## Arguments +arg_parser = argparse.ArgumentParser(description='Assembly JSON for Pan1c-QC') +arg_parser.add_argument( + "--asm_stats", + dest = "asmstats", + required = True, + help = "Combined assemblathon stats" + ) +arg_parser.add_argument( + "--fai", + dest = "fai", + required = True, + 
help = "concatenated chrInputs fasta index" + ) +arg_parser.add_argument( + "--name", + dest = "name", + required = True, + help = "Pangenome name" + ) +arg_parser.add_argument( + "--ref", + dest = "ref", + required = True, + help = "Workflow reference assembly name" + ) +arg_parser.add_argument( + '--syri_asm', + action="store_true", + dest = "syri_asm", + help = "Add path to Syri figures for haplotypes" +) +arg_parser.add_argument( + '--syri_chr', + action="store_true", + dest = "syri_chr", + help = "Add path to Syri figures for chrInputs" +) +arg_parser.add_argument( + '--contig_pos', + action="store_true", + dest = "contig_pos", + help = "Add path to contig pos figures" +) +arg_parser.add_argument( + '--quast', + action="store_true", + dest = "quast", + help = "Add path to quast report" +) +arg_parser.add_argument( + "--output", + dest = "output", + required = True, + help = "Output path" + ) +args = arg_parser.parse_args() + +## Preparing chromosome length table ------------------------------------------------------------------------ +cdf = pd.read_csv(args.fai, sep="\t", header = None, usecols=[0,1], names=["Full_hap", "Length"]) + +# Splitting full haplotype name and indeexing based on chromosome and haplotype id +cdf[["Hap", "Chr"]] = cdf["Full_hap"].str.rsplit("#", n=1, expand=True) +cdf = cdf.set_index(["Chr", "Hap"]).drop(columns = "Full_hap") + +# Computing deviation from mean length and reference length +cdf["Deviation from mean length (%)"] = cdf.groupby(level=0)["Length"].transform( + lambda x: round((x - x.mean()) * 100 / x.mean(), 2) +) + +_dev_from_ref_ = [] +for index, row in cdf.iterrows(): + ref_value = cdf.loc[(index[0], args.ref), "Length"] + _dev_from_ref_.append( + round((row["Length"] - ref_value)*100/ref_value, 2) + ) + +cdf["Deviation from ref length (%)"] = _dev_from_ref_ + +## Preparing Assemblathon stats ----------------------------------------------------------------------------- +adf = pd.read_csv(args.asmstats, sep="\t", index_col=0) + +# Renaming haplotypes according to PanSN +rename_dict = { name: name.replace(".hap", "#").replace(".fa.gz", "").split("/")[-1] for name in adf.index } +adf.rename(index = rename_dict, inplace = True) + +# Subsetting the whole dataframe +column_ids = [ + ["Number of scaffolds", "Total size of scaffolds", "N50 scaffold length", "L50 scaffold count", "scaffold %N", "scaffold %C", "scaffold %G"], + ["Number of contigs", "Total size of contigs", "N50 contig length", "L50 contig count", "contig %N", "contig %C", "contig %G"] +] +renamed_col = ["# contigs", "Total length", "N50", "L50", "# N's per 100kbp", "GC (%)"] + +ctgs = adf.loc[:, column_ids[1]] +scfs = adf.loc[:, column_ids[0]] + +# Adding type, renaming columns and merging back +ctgs["Type"] = ["Contig"]*len(ctgs) +scfs["Type"] = ["Scaffold"]*len(scfs) + +ctgs["GC (%)"] = ctgs["contig %C"]+ctgs["contig %G"] +scfs["GC (%)"] = scfs["scaffold %C"]+scfs["scaffold %G"] + +scfs = scfs.drop(columns=["scaffold %C", "scaffold %G"]).rename(columns = {column_ids[0][i]: renamed_col[i] for i in range(len(column_ids[0])-1)}) +ctgs = ctgs.drop(columns=["contig %C", "contig %G"]).rename(columns = {column_ids[1][i]: renamed_col[i] for i in range(len(column_ids[1])-1)}) +adf = pd.concat([scfs, ctgs], axis=0).reset_index() + +ref_len = cdf.xs(args.ref, level='Hap')["Length"].sum() + +# Computing deviation from mean length and reference length +mean_length = adf["Total length"].mean() +adf["Deviation from mean length (%)"] = round((adf["Total length"] - mean_length)*100 / mean_length, 2) 
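+# Illustrative example of the formula above: a 105 Mb assembly against a
+# 100 Mb mean gives round((105 - 100) * 100 / 100, 2) = 5.0, i.e. 5% above the mean.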
+adf["Deviation from ref length (%)"] = round((adf["Total length"] - ref_len)*100 / ref_len, 2) + +adf = adf.sort_values(["Type", "Assembly"]).set_index(["Type", "Assembly"]) + +## Creating Assembly JSON ----------------------------------------------------------------------------------- +assembly = { + "ASM_stats": { + "Scaffold": adf.loc["Scaffold",:].to_dict(orient="index"), + "Contig": adf.loc["Contig",:].to_dict(orient="index") + }, + "Chrom_length": { + chrid: cdf.loc[chrid].to_dict(orient='index') for chrid in cdf.index.get_level_values(0).unique() + } +} + +if args.quast: + assembly["ASM_stats"]["Quast_path"] = f"data/{args.name}.quast.report.html" + +if args.contig_pos: + assembly["Contig_pos"] = { + chrid: f"data/chr.contig/Pan1c.{args.name}.{chrid}.contig.png" for chrid in cdf.index.get_level_values(0).unique() + } + +if args.syri_asm: + assembly["Syri_hap"] = { + hap: f"data/asm.syri.figs/Pan1c.{args.name}.{hap.replace('#', '.hap')}.syri_mm2.png" for hap in cdf.index.get_level_values(1).unique() if hap != args.ref + } + +if args.syri_chr: + assembly["Syri_chr"] = { + chrid: f"data/chrInput.syri.figs/Pan1c.{args.name}.{chrid}.syri_mm2.png" for chrid in cdf.index.get_level_values(0).unique() + } + +with open(args.output, "w") as handle: + json.dump(assembly, handle, indent=6) diff --git a/scripts/chrStatsAggregation.py b/scripts/chrGraphs.stats_aggregate.py similarity index 100% rename from scripts/chrStatsAggregation.py rename to scripts/chrGraphs.stats_aggregate.py diff --git a/scripts/chrGraphs.stats_figs.py b/scripts/chrGraphs.stats_figs.py new file mode 100644 index 0000000000000000000000000000000000000000..ac459d0aefd53692916ca1078852be1b1b529e50 --- /dev/null +++ b/scripts/chrGraphs.stats_figs.py @@ -0,0 +1,359 @@ +""" +Chromosomes statistics figure script for Pan1c workflow + +Use aggregated TSV produced with chrGraphs.stats_aggregate.py + +@author: alexis.mergez@inrae.fr +@version: 1.1 +""" + +#% Librairies +import os +import numpy as np +import matplotlib.pyplot as plt +import matplotlib.patches as mpatches +import seaborn as sns +import pandas as pd +import argparse +from sklearn.cluster import DBSCAN +from adjustText import adjust_text +from plottable import Table, ColDef + +#% Parsing arguments +arg_parser = argparse.ArgumentParser(description='Statistic figures for Pan1c workflow') +arg_parser.add_argument( + "--input", + "-i", + dest = "input", + required = True, + help = "Input TSV containing chromosome path stats" + ) +arg_parser.add_argument( + "--output_dir", + dest = "dir", + required = True, + help = "Output dir for figures" + ) +arg_parser.add_argument( + "--panname", + "-p", + dest = "panname", + required = True, + help = "Pangenome name" + ) +arg_parser.add_argument( + "--grapher", + "-g", + dest = "grapher", + required = True, + help = "Graph creation tool" + ) +arg_parser.add_argument( + "--reference", + "-r", + dest = "ref", + required = True, + help = "Reference name" + ) +args = arg_parser.parse_args() + +#% Loading and preparing data +def get_path_name(pathname): + return pathname.rsplit("#", 1)[0] + +# Reading data +gData = pd.read_csv(args.input, sep='\t') +gData["Path.name"] = gData["Path.name"].apply(get_path_name) +gData.set_index(["Pangenome.name","Chr.id", "Path.name"], inplace=True) + +# Splitting shared content and other stats +shared_content = gData.loc[:, ["Path.length", "Shared.content"]].to_dict() +gData.drop("Shared.content", axis=1, inplace=True) + +# Summing Private and core for the barplot figure +gData["Path.PrivCore.R.Length"] = 
gData["Path.private.R.length"] + gData["Path.core.R.length"] + +# Switching to Mbp instead of Bp +gData[["Path.length", "Path.PrivCore.R.Length", "Path.core.R.length"]] = gData[["Path.length", "Path.PrivCore.R.Length", "Path.core.R.length"]].astype(float) +gData.loc[:, ["Path.length", "Path.PrivCore.R.Length", "Path.core.R.length"]] = gData.loc[:, ["Path.length", "Path.PrivCore.R.Length", "Path.core.R.length"]] / 1000000 + +# Adding percentages relative to path length +gData["Path.private.R.per"] = (gData["Path.private.R.length"]*100)/(gData["Path.length"]*1000000) +gData["Path.core.R.per"] = (gData["Path.core.R.length"]*100)/gData["Path.length"] + +# Shared content +shared_dict = {} +A = 0 +for key, value in shared_content["Shared.content"].items(): + for elem in value.split(';'): + target, stats = elem.split(":") + target = target.rsplit("#", 1)[0] + + shared_dict[A] = list(key)+[target]+[shared_content["Path.length"][key]]+[int(val) for val in stats.split(',')] + A+=1 +sData = pd.DataFrame.from_dict(shared_dict, orient='index', columns = ["Pangenome.name", "Chr.id", "Query.name", "Target.name", "Path.length", "Shared.nodes.count", "Shared.length", "Shared.R.length"]) +sData.set_index(["Pangenome.name","Chr.id", "Query.name", "Target.name"], inplace=True) +sData.loc[:, "Shared.prop"] = sData["Shared.length"]*100/sData["Path.length"] +sData.loc[:, "Shared.R.prop"] = sData["Shared.R.length"]*100/sData["Path.length"] +sData.loc[:, "Shared.length.mb"] = sData["Shared.length"]/1000000 + +#% Figures functions +# Bar plots with path decomposition into Core, Private and Other proportion +def get_group_decomp_fig(data, title, savedir): + sns.set_style("ticks") + fig, (mbax, perax) = plt.subplots(1, 2, gridspec_kw={'wspace': 0.45, 'hspace':0}, figsize=(16, 10), dpi=300) + + # Adding bars for Mbax (in megabase) + sns.barplot(data, y="Path.name", x="Path.length", color='#FCDC94', estimator=np.mean, errorbar=None, orient="h", ax=mbax) + sns.barplot(data, y="Path.name", x="Path.PrivCore.R.Length", color='#78ABA8', estimator=np.mean, errorbar=None, orient="h", ax=mbax) + sns.barplot(data, y="Path.name", x="Path.core.R.length", color='#EF9C66', estimator=np.mean, errorbar=None, orient="h", ax=mbax) + + # Adding correspong size on Mbax + for path_name in data.index.unique("Path.name"): + _len = data.loc[path_name, "Path.length"] + + columns = ["Path.length", "Path.PrivCore.R.Length", "Path.core.R.length"] + for i in range(len(columns)): + try : + _var = data.loc[path_name, columns[i:i+2]] + x = _var.mean() + value = (_var.iloc[0] - _var.iloc[1]) + except : + _var = data.loc[path_name, columns[i]] + x = _var/2 + value = (_var - 0) + mbax.text(x, path_name, f"{round(value, 1)}Mb", horizontalalignment='center', verticalalignment='center', fontsize=10) + + # Computing the relative percentages + data["Path.core.R.length.per"] = data["Path.core.R.length"] * 100 / data["Path.length"] + data["Path.PrivCore.R.Length.per"] = data["Path.PrivCore.R.Length"] * 100 / data["Path.length"] + data["Path.length.per"] = [100] * len(data) + + # Adding bars for Perax (in percentage) + #sns.barplot(data, y="Path.name", x="Path.length.per", color='#FCDC94', estimator=np.mean, errorbar=None, orient="h", ax=perax) + sns.barplot(data, y="Path.name", x="Path.PrivCore.R.Length.per", color='#78ABA8', estimator=np.mean, errorbar=None, orient="h", ax=perax) + sns.barplot(data, y="Path.name", x="Path.core.R.length.per", color='#EF9C66', estimator=np.mean, errorbar=None, orient="h", ax=perax) + + # Adding correspong size on Perax + for 
path_name in data.index.unique("Path.name"): + + columns = ["Path.PrivCore.R.Length.per", "Path.core.R.length.per"] + for i in range(len(columns)): + try : + _var = data.loc[path_name, columns[i:i+2]] + x = _var.mean() + value = (_var.iloc[0] - _var.iloc[1]) + except : + _var = data.loc[path_name, columns[i]] + x = _var/2 + value = (_var - 0) + perax.text(x, path_name, f"{round(value, 1)}%", horizontalalignment='center', verticalalignment='center', fontsize=10) + + # Adding legend + top_bar = mpatches.Patch(color='#FCDC94', label='Other') + middle_bar = mpatches.Patch(color='#78ABA8', label='Private') + bottom_bar = mpatches.Patch(color='#EF9C66', label='Core') + perax.legend(handles=[top_bar, middle_bar, bottom_bar], frameon=False) + sns.move_legend(perax, 'best', bbox_to_anchor=(1, 0.5)) + + # Adding/Modifying titles + fig.suptitle(title, fontsize=15 , fontweight="bold") + mbax.set_xlabel('Path length (Mbp)', loc='right', fontweight="bold") + perax.set_xlabel('Path relative proportion (%)', loc='right', fontweight="bold") + mbax.set_ylabel(None) + perax.set_ylabel(None) + + sns.despine() + plt.savefig(savedir, bbox_inches='tight') + plt.close() + +# 2D scatter Core vs Private +def get_group_2d_fig(data, title, savedir): + sns.set_style("ticks") + fig, (ax, ax_table) = plt.subplots(1, 2, gridspec_kw={'width_ratios': [2.5, 2], 'wspace': 0.05, 'left':0.05, 'right':0.95}, figsize=(20, 9), dpi=300) + + # Clustering with DBSCAN + X = data.loc[:, ["Path.private.R.per", "Path.core.R.per"]].to_numpy() + data["Clusters"] = DBSCAN(eps=2, min_samples = 1).fit_predict(X).tolist() + + # Plotting scatter + sns.scatterplot(data, x = "Path.private.R.per", y = "Path.core.R.per", hue="Clusters", palette="tab10", ax=ax) + + # Adding labels + TEXTS = [] + for genome in data.index.unique("Path.name"): + x, y = data.loc[genome, ["Path.private.R.per", "Path.core.R.per"]] + TEXTS.append(ax.text(x, y, genome, color="#7F7F7F", fontsize=9)) + adjust_text( + TEXTS, + expand=(2,2), + arrowprops=dict( + arrowstyle="->", + color="#b3b3b3", + lw=0.5, + shrinkA=0.2 + ), + ax=fig.axes[0] + ) + + # Adding table + table_data = { + f"Cluster {cluster_id}" : data[data["Clusters"] == cluster_id].index.to_list() + for cluster_id in sorted(data["Clusters"].unique()) + } + max_row = max((len(genomes) for genomes in table_data.values())) + for cluster_id in table_data.keys(): + table_data[cluster_id] += (max_row - len(table_data[cluster_id])) * [''] + table_df = pd.DataFrame.from_dict(table_data) + + ax_table = Table( + table_df, + row_dividers=False, + column_definitions = [ + ColDef("index", width=0, title="", textprops={"fontsize": 0}) + ], + textprops={"ha": "left"} + ) + + # Adding/Modifying titles + ax.set_title(title, fontsize=15 , fontweight="bold") + ax.set_xlabel('Relative Proportion of Private sequence (%)', loc='right', fontweight="bold", fontsize=11) + ax.set_ylabel('Relative Proportion of Core sequence (%)', loc='top', fontweight="bold", fontsize=11) + + sns.despine() + plt.savefig(savedir, bbox_inches='tight') + plt.close() + +# Shared content heatmap +def get_hm_shared_fig(data, title, savedir): + sns.set_style("ticks") + fig, (Rax, NRax) = plt.subplots(1, 2, gridspec_kw={'wspace': 0.3, 'hspace': 0, 'top':1.1, 'bottom':0.01, "left":0.1, "right":1}, figsize=(20, 10), dpi=300) + + # Pivoting the dataframe for the heatmap + data["Shared.length"] = data["Shared.length"]/1000000 + srTable = data.reset_index().pivot(values=["Shared.R.prop"], index=["Query.name"], columns=["Target.name"]) + snrTable = 
data.reset_index().pivot(values=["Shared.length"], index=["Query.name"], columns=["Target.name"]) + + # Removing multi-index from columns + srTable.columns = srTable.columns.droplevel() + snrTable.columns = snrTable.columns.droplevel() + + # Heatmap + sns.heatmap( + srTable, + cmap="Spectral_r", + square = True, + ax=Rax, + cbar_kws=dict(shrink=0.6, label="Proportion relative to Query, with repeats (%)") + ) + Rax.figure.axes[-1].yaxis.label.set(fontweight="bold", fontsize=13) + #fig.colorbar(Rax.collections[0], cax=Rax_bar, shrink=0.6) + sns.heatmap( + snrTable, + cmap="Spectral_r", + square = True, + ax=NRax, + mask=np.triu(snrTable), + cbar_kws=dict(shrink=0.6, label="Size, without repeats (Mbp)") + ) + NRax.figure.axes[-1].yaxis.label.set(fontweight="bold", fontsize=13) + #fig.colorbar(NRax.collections[0], cax=NRax_bar, shrink=0.6) + + # Adding/Modifying titles + fig.suptitle(title, fontsize=18 , fontweight="bold") + Rax.set_title("Shared Sequence Percentages", fontsize=13 , fontweight="bold") + NRax.set_title("Shared Sequence Size", fontsize=13 , fontweight="bold") + Rax.set_xlabel(None) + NRax.set_xlabel(None) + Rax.set_ylabel('Query', fontweight="bold") + NRax.set_ylabel(None) + + sns.despine() + plt.savefig(savedir, bbox_inches='tight') + plt.close() + +# Shared content difference between chromosomes heatmap +def get_hm_diff_fig(data, title, savedir): + sns.set_style("ticks") + fig, ax = plt.subplots(1, 1, figsize=(10, 9), gridspec_kw={'wspace': 0.2, 'hspace': 0, 'top':0.98, 'left':0.05, 'right':1, 'bottom':0.01}, dpi=300) + + # Heatmap + sns.heatmap(data, cmap="hot_r", square = True, ax=ax, mask=np.triu(data), cbar_kws=dict(shrink=0.8, label='Euclidean distance')) + ax.figure.axes[-1].yaxis.label.set(fontweight="bold", fontsize=13) + + # Adding/Modifying titles + fig.suptitle(title, fontsize=18 , fontweight="bold") + ax.set_xlabel(None) + ax.set_ylabel(None) + + sns.despine() + plt.savefig(savedir, bbox_inches='tight') + plt.close() + +#% Figure generation +# Bar plots +for pangenome in gData.index.unique("Pangenome.name"): + ## For each chromosome + for chrid in gData.index.unique("Chr.id"): + get_group_decomp_fig( + data = gData.loc[pangenome, chrid,:].copy(), + title = f"Path composition by groups - {args.grapher}.{pangenome} - {chrid}", + savedir = os.path.join(args.dir, f"{args.grapher}.{pangenome}.path.decomp.{chrid}.png") + ) + ## For the mean across chromosomes + get_group_decomp_fig( + data = gData.groupby("Path.name").sum().copy(), + title = f"Path composition mean accross chromosomes - {args.grapher}.{pangenome}", + savedir = os.path.join(args.dir, f"{args.grapher}.{pangenome}.path.decomp.mean.png") + ) + +# 2D Scatter Core vs Private +for pangenome in gData.index.unique("Pangenome.name"): + for chrid in gData.index.unique("Chr.id"): + get_group_2d_fig( + data = gData.loc[pangenome, chrid,:].copy(), + title = f"Private vs. Core Sequence for Each Graph Path - {args.grapher}.{pangenome} - {chrid}", + savedir = os.path.join(args.dir, f"{args.grapher}.{pangenome}.2D.scatter.{chrid}.png") + ) + get_group_2d_fig( + data = gData.groupby("Path.name").mean().copy(), + title = f"Mean Private vs. 
Core Sequence for Each Graph Path - {args.grapher}.{pangenome}", + savedir = os.path.join(args.dir, f"{args.grapher}.{pangenome}.2D.scatter.mean.png") + ) + +# Shared content heatmap +for pangenome in sData.index.unique("Pangenome.name"): + for chrid in sData.index.unique("Chr.id"): + get_hm_shared_fig( + data = sData.loc[pangenome, chrid,:].copy(), + title = f"Pairwise Path Comparison - {args.grapher}.{pangenome} - {chrid}", + savedir = os.path.join(args.dir, f"{args.grapher}.{pangenome}.sharred.content.{chrid}.png") + ) + +# Shared content difference between chromosomes heatmap +dData = {"Q":[], "T":[], "Diff":[]} +for pangenome in sData.index.unique("Pangenome.name"): + # Iterating over chromosomes twice to make pairs + for Q in sData.index.unique("Chr.id"): + for T in sData.index.unique("Chr.id"): + Qtable = sData.loc[pangenome, Q, :].copy().reset_index().pivot(values=["Shared.length"], index=["Query.name"], columns=["Target.name"]).fillna(0) + Qtable = Qtable / sData.loc[(pangenome, Q, args.ref),:].iloc[0]["Path.length"] + Ttable = sData.loc[pangenome, T, :].copy().reset_index().pivot(values=["Shared.length"], index=["Query.name"], columns=["Target.name"]).fillna(0) + Ttable = Ttable / sData.loc[(pangenome, T, args.ref),:].iloc[0]["Path.length"] + + # Computing Euclid distance using Frobenious norm + dData["Q"].append(Q) + dData["T"].append(T) + try : # Catching non similar shapes in case some path are not available in both matrices + dData["Diff"].append(np.linalg.norm(Qtable.values-Ttable.values, ord = 'fro')) + except : + dData["Diff"].append(np.nan) + + dData = pd.DataFrame.from_dict(dData).pivot(values=["Diff"], index=["Q"], columns=["T"]) + dData.columns = dData.columns.droplevel() + + get_hm_diff_fig( + dData, + title = f"Pairwise Euclidean distance between path comparison matrices - {args.grapher}.{pangenome}", + savedir = os.path.join(args.dir, f"{args.grapher}.{pangenome}.shared.content.diff.png") + ) \ No newline at end of file diff --git a/scripts/chrInputStats.py b/scripts/chrInput.stats_compute.py similarity index 100% rename from scripts/chrInputStats.py rename to scripts/chrInput.stats_compute.py diff --git a/scripts/contig_position.R b/scripts/contig.pos_figs.R similarity index 86% rename from scripts/contig_position.R rename to scripts/contig.pos_figs.R index b0eeb0a10bd783f1160aaac4bbfa5a739a79abf7..eb3fe198c7f0d34a9e99fcc63d020fa2b8bd2ed9 100644 --- a/scripts/contig_position.R +++ b/scripts/contig.pos_figs.R @@ -1,3 +1,10 @@ +# Contig position figure script for Pan1c workflow +# +# Use contig position to produce a figure showing their position into respective chromosome +# +# @author: alexis.mergez@inrae.fr, adapted from Cedric Cabau +# @version: 1.0 + library("karyoploteR") library("optparse") @@ -21,6 +28,8 @@ opt = parse_args(opt_parser); #print("Reading tsv file ...") x = read.table(opt$tsv, sep = "\t", comment.char="^") colnames(x) = x[1,] +haplotypes = sort(unique(colnames(x[,1])), method="shell") +print(haplotypes) x = x[-1,] my.genome <- toGRanges(x) nChr = nrow(x) @@ -70,8 +79,6 @@ if (nChr >= 4){ } - - -kp <- plotKaryotype(genome=my.genome, cytobands=my.cytobands, plot.params=pp, chromosomes="all") +kp <- plotKaryotype(genome=my.genome, cytobands=my.cytobands, plot.params=pp, chromosomes=haplotypes) kp <- kpAddBaseNumbers(kp, cex=0.6, tick.dist=5000000) dev.off() \ No newline at end of file diff --git a/scripts/coreStats.py b/scripts/core.stats_compute.py similarity index 100% rename from scripts/coreStats.py rename to scripts/core.stats_compute.py diff 
--git a/scripts/getPanacusHG.sh b/scripts/getPanacusHG.sh index d26dda39bc02a53159ec9b6a3c79203e3838b1e9..b25fc6d2721e4da354d0e7fa031387b88c93dbb6 100755 --- a/scripts/getPanacusHG.sh +++ b/scripts/getPanacusHG.sh @@ -25,7 +25,7 @@ while getopts "g:r:a:t:d:o:" option; do done # Getting chromosome name -chrname=$(basename ${gfa} .gfa | cut -d'.' -f2) +chrname=$(basename ${gfa} .gfa | cut -d'.' -f3) ref=$(echo $refname | sed 's/.hap/#/') # Getting paths in chromosome graph diff --git a/scripts/getTags.py b/scripts/getTags.py index 2e9c010d1945fec21f392da0fc7a31da106621c7..22a844ced5d87c2df0a031ccbdfd1cea8994399c 100644 --- a/scripts/getTags.py +++ b/scripts/getTags.py @@ -28,6 +28,12 @@ arg_parser.add_argument( required = True, help = "Pan1c config file" ) +arg_parser.add_argument( + "--json", + dest = "json", + required = True, + help = "JSON output" + ) args = arg_parser.parse_args() ## Main script @@ -39,7 +45,10 @@ Tags dictionnary : tags = {} ### Pan1c-workflow section -tags["Pan1c"] = {} +tags["Pangenome"] = {} +tags["Parameters"] = {} +tags["Tools"] = {} +tags["Apptainer"] = {} # Using git to get the version of the Pan1c workflow _output = subprocess.run( @@ -49,8 +58,10 @@ _output = subprocess.run( ).stdout[:-1] # Adding tags -tags["Pan1c"]["pan1c.version"] = _output -tags["Pan1c"]["pan1c.home"] = "https://forgemia.inra.fr/alexis.mergez/pan1c" +tags["Pan1c"] = { + "version": _output, + "pan1c.home": "https://forgemia.inra.fr/alexis.mergez/pan1c" +} # Getting the parameters used in the workflow from the config file with open(args.config, 'r') as handle: @@ -58,11 +69,14 @@ with open(args.config, 'r') as handle: line = line[:-1] if len(line) and line[0] != "#": # parameter _split = line.split(": ") - tags["Pan1c"][_split[0]] = _split[-1] -### PanGeTools section -tags["pangetools"] = {} + if _split[0] not in ["name", "reference"]: + tags["Parameters"][_split[0]] = _split[-1].replace("'", "") + + else : + tags["Pangenome"][_split[0].capitalize()] = _split[-1].replace("'", "") +### PanGeTools section # Reading the apps versions from the apptainer tags _output = subprocess.run( ["apptainer", "inspect", "-j", f"{args.appdir}/PanGeTools.sif"], @@ -71,36 +85,19 @@ _output = subprocess.run( ).stdout _output = json.loads(_output) labels = _output['data']['attributes']['labels'] -tags["pangetools"]["image.version"] = labels['Version'] -tags["pangetools"]["image.home"] = labels['about.home'] -# Adding app versions to the tag dictionnary -for key in labels.keys(): - if ".Version" in key: - tags["pangetools"][key.lower()] = labels[key] - -### PGGB image section -tags["pggb"] = {} - -# Reading the apps versions from the apptainer tags -_output = subprocess.run( - ["apptainer", "inspect", "-j", f"{args.appdir}/pggb.sif"], - capture_output=True, - text=True -).stdout -_output = json.loads(_output) -labels = _output['data']['attributes']['labels'] -tags["pggb"]["image.version"] = labels['Version'] -tags["pggb"]["image.home"] = labels['about.home'] +# Populating Apptainer section +tags["Apptainer"]["pangetools"] = { + "version": labels['Version'], + "home": labels['about.home'] +} -# Adding app versions to the tag dictionnary +# Adding app versions to the Tool section for key in labels.keys(): if ".Version" in key: - tags["pggb"][key.lower()] = labels[key] + tags["Tools"][key.lower().split(".")[0]] = labels[key] ### Pan1c-Env section -tags["pan1c-env"] = {} - # Reading the apps versions from the apptainer tags _output = subprocess.run( ["apptainer", "inspect", "-j", f"{args.appdir}/pan1c-env.sif"], @@ 
-109,17 +106,19 @@ _output = subprocess.run( ).stdout _output = json.loads(_output) labels = _output['data']['attributes']['labels'] -tags["pan1c-env"]["image.version"] = labels['Version'] -tags["pan1c-env"]["image.home"] = labels['about.home'] + +# Populating Apptainer section +tags["Apptainer"]["pan1c-env"] = { + "version": labels['Version'], + "home": labels['about.home'] +} # Adding app versions to the tag dictionnary for key in labels.keys(): if ".Version" in key: - tags["pan1c-env"][key.lower()] = labels[key] + tags["Tools"][key.lower().split(".")[0]] = labels[key] ## Pan1c-Box section -tags["pan1c-box"] = {} - # Reading the apps versions from the apptainer tags _output = subprocess.run( ["apptainer", "inspect", "-j", f"{args.appdir}/pan1c-box.sif"], @@ -128,20 +127,55 @@ _output = subprocess.run( ).stdout _output = json.loads(_output) labels = _output['data']['attributes']['labels'] -tags["pan1c-box"]["image.version"] = labels['Version'] -tags["pan1c-box"]["image.home"] = labels['about.home'] + +# Populating Apptainer section +tags["Apptainer"]["pan1c-box"] = { + "version": labels['Version'], + "home": labels['about.home'] +} # Adding app versions to the tag dictionnary for key in labels.keys(): if ".Version" in key: - tags["pan1c-box"][key.lower()] = labels[key] + tags["Tools"][key.lower().split(".")[0]] = labels[key] ## Exporting tags to stdout print("#\tThis graph have been created using the Pan1c workflow (https://forgemia.inra.fr/alexis.mergez/pan1c)\n#") print("#\tTool versions and commands\n#") -for first_elem in tags.keys(): - print(f'#\t-- {first_elem} --') - for label in tags[first_elem].keys(): - print(f"#\t{first_elem}\t{label}: {tags[first_elem][label]}") - print('#') - +for section, svalues in tags.items(): + print(f'#\t-- {section} --') + + if section == "Apptainer": + for image_name, image_info in svalues.items(): + for key, value in image_info.items(): + print(f"#\t{image_name}\t{key}: {value}") + print('#') + + else : + for key, value in svalues.items(): + print(f"#\t{key}: {value}") + print('#') + +# Adding path to generated files +gtools = (tags["Parameters"]["run_PGGB"] == "True")*["PGGB"] + (tags["Parameters"]["run_MC"] == "True")*["MC"] + +tags["Files"] = { + "GFAv1": { + tool: f"data/Pan1c.{tool}.{tags['Pangenome']['Name']}.gfa.gz" + for tool in gtools + } +} + +if tags["Parameters"]["get_VCF"] == "True": + tags["Files"]["XG"] = { + tool: f"data/Pan1c.{tool}.{tags['Pangenome']['Name']}.xg" + for tool in gtools + } + tags["Files"]["VCF"] = { + tool: f"data/Pan1c.{tool}.{tags['Pangenome']['Name']}.vcf.gz" + for tool in gtools + } + +# Exporting tags to JSON +with open(args.json, "w") as handle: + json.dump(tags, handle, indent=6) \ No newline at end of file diff --git a/scripts/graph.pan1c_QC.py b/scripts/graph.pan1c_QC.py new file mode 100644 index 0000000000000000000000000000000000000000..51fba1327e06792d3519bae545dbb4a027e05ce3 --- /dev/null +++ b/scripts/graph.pan1c_QC.py @@ -0,0 +1,159 @@ +""" +Graph JSON creator for Pan1c-QC + +@author: alexis.mergez@inrae.fr +@version: 1.1 +""" + +import os +import argparse +import pandas as pd +import numpy as np +import json + +## Arguments +arg_parser = argparse.ArgumentParser(description='Graph JSON for Pan1c-QC') +arg_parser.add_argument( + "--gen_stats", + dest = "general", + nargs="+", + required = True, + help = "General stats. Multiple allowed (MC, PGGB, etc...)" + ) +arg_parser.add_argument( + "--path_stats", + dest = "path", + nargs="+", + required = True, + help = "Path stats. 
Multiple allowed (MC, PGGB, etc...)" + ) +arg_parser.add_argument( + "--name", + dest = "name", + required = True, + help = "Pangenome name" + ) +arg_parser.add_argument( + '--odgi_figs', + dest = "odgifigs", + nargs="+", + help = "Path to odgi figures" + ) +arg_parser.add_argument( + "--output", + dest = "output", + required = True, + help = "Output path" + ) +args = arg_parser.parse_args() + + +## General statistics +gen_stats = {} + +for tsv in args.general: + gtool = os.path.basename(tsv).split(".")[1] + + gen_stats[gtool] = pd.read_csv(tsv, sep="\t").drop(columns="Pangenome.name").set_index("Chr.id").to_dict(orient="index") + +## Path statistics and shared content +path_stats = {} +shared_table = {} +diff_table = {} + +for tsv in args.path: + gtool = os.path.basename(tsv).split(".")[1] + df = pd.read_csv(tsv, sep="\t").drop(columns="Pangenome.name") + df["Path.name"] = df["Path.name"].str.rsplit("#", n=1).str[0] + + # Path stats + path_stats[gtool] = {} + + for chrid in df["Chr.id"].unique(): + path_stats[gtool][chrid] = df[df["Chr.id"] == chrid].drop(columns=["Chr.id", "Shared.content"]).set_index("Path.name").to_dict(orient="index") + + # Shared content stats + shared_table[gtool] = {} + shared_content = df.set_index(["Chr.id", "Path.name"]).loc[:, ["Path.length", "Shared.content"]].to_dict() + + ## Creating new dataframe based on concatenated chain in "Shared.content" column + shared_dict = {} + A = 0 + for key, value in shared_content["Shared.content"].items(): + for elem in value.split(';'): + target, stats = elem.split(":") + target = target.rsplit("#", 1)[0] + + shared_dict[A] = list(key)+[target]+[shared_content["Path.length"][key]]+[int(val) for val in stats.split(',')] + A+=1 + + ## Computing stats on shared content + sdf = pd.DataFrame.from_dict(shared_dict, orient='index', columns = ["Chr.id", "Query.name", "Target.name", "Path.length", "Shared.nodes.count", "Shared.length", "Shared.R.length"]) + sdf.set_index(["Chr.id", "Query.name", "Target.name"], inplace=True) + sdf.loc[:, "Shared.prop"] = sdf["Shared.length"]*100/sdf["Path.length"] + sdf.loc[:, "Shared.R.prop"] = sdf["Shared.R.length"]*100/sdf["Path.length"] + sdf.reset_index(inplace=True) + + for chrid in sdf["Chr.id"].unique(): + chrdf = sdf[sdf["Chr.id"] == chrid].drop(columns="Chr.id") + shared_table[gtool][chrid] = {} + + for query in chrdf["Query.name"].unique(): + shared_table[gtool][chrid][query] = chrdf[chrdf["Query.name"] == query].drop(columns="Query.name").set_index("Target.name").to_dict(orient="index") + + ## Computing difference between shared content heatmaps + dData = {"Qchr":[], "Tchr":[], "Diff":[]} + + # Iterating over chromosomes twice to make pairs + for Qchr in shared_table[gtool].keys(): + # Creating Query shared length matrice + Qtable = np.array([ + [ + shared_table[gtool][Qchr][Qasm][Tasm]["Shared.prop"] + for Tasm in shared_table[gtool][Qchr][Qasm].keys() + ] + for Qasm in shared_table[gtool][Qchr].keys() + ]) + + for Tchr in shared_table[gtool].keys(): + Ttable = np.array([ + [ + shared_table[gtool][Tchr][Qasm][Tasm]["Shared.prop"] + for Tasm in shared_table[gtool][Tchr][Qasm].keys() + ] + for Qasm in shared_table[gtool][Tchr].keys() + ]) + + # Computing Euclid distance using Frobenious norm + dData["Qchr"].append(Qchr) + dData["Tchr"].append(Tchr) + try : # Catching non similar shapes in case some path are not available in both matrices + dData["Diff"].append(np.linalg.norm(Qtable-Ttable, ord = 'fro')) + except : + dData["Diff"].append(np.nan) + + dData = 
pd.DataFrame.from_dict(dData).pivot(values=["Diff"], index=["Qchr"], columns=["Tchr"]) + dData.columns = dData.columns.droplevel() + diff_table[gtool] = dData.to_dict(orient="index") + +## Assembling output JSON +Graph_JSON = { + "General_stats": gen_stats, + "Paths_stats": path_stats, + "Shared_content": shared_table, + "Diff_shared_content": diff_table +} + +avail_gtool = list(set([os.path.basename(figs).split('.')[1] for figs in args.odgifigs])) + +Graph_JSON["odgi_figs"] = { + gtool: { + chrid : f"data/odgifigs/Pan1c.{gtool}.{args.name}.{chrid}.report.fig.png" + for chrid in gen_stats[gtool].keys() + } + for gtool in avail_gtool +} + +## Outputing to JSON +with open(args.output, "w") as handle: + json.dump(Graph_JSON, handle, indent=6) \ No newline at end of file diff --git a/scripts/quast.tsv_2_json.py b/scripts/quast.tsv_2_json.py new file mode 100644 index 0000000000000000000000000000000000000000..d975ebf4a5b21e3a43d54ec293e1b0ad681233f8 --- /dev/null +++ b/scripts/quast.tsv_2_json.py @@ -0,0 +1,31 @@ +""" +Converter from Quast assembly stats TSV to json + +@author: alexis.mergez@inrae.fr +@version: 1.0 +""" + +import os +import argparse +import pandas as pd + +## Arguments +arg_parser = argparse.ArgumentParser(description='Quast ASM stats to JSON') +arg_parser.add_argument( + "--input", + "-i", + dest = "input", + required = True, + help = "Quast transposed_report TSV" + ) +arg_parser.add_argument( + "--output", + dest = "output", + required = True, + help = "Output path" + ) +args = arg_parser.parse_args() + +data=pd.read_csv(args.input, sep="\t", index_col = 0) + +data.to_json(args.output, orient="index") \ No newline at end of file diff --git a/scripts/ragtagChromInfer.sh b/scripts/ragtagChromInfer.sh index b8dc8eab24e535d82c956c025b540d620c8c2c29..52c99240523ed7f4e15902008b68e21227f302ed 100755 --- a/scripts/ragtagChromInfer.sh +++ b/scripts/ragtagChromInfer.sh @@ -8,9 +8,11 @@ threads="" # Threads count inputref="" # Reference fasta inputquery="" # Query fasta output="" # Output fasta +mm2params="" # Minimap2 parameters +rtcommand="" # RagTag commands ## Getting arguments -while getopts "d:a:t:r:q:c:o:" option; do +while getopts "d:a:t:r:q:c:o:m:" option; do case "$option" in d) tmpdir="$OPTARG";; a) appdir="$OPTARG";; @@ -19,7 +21,8 @@ while getopts "d:a:t:r:q:c:o:" option; do q) inputquery="$OPTARG";; c) rtcommand="$OPTARG";; o) output="$OPTARG";; - \?) echo "Usage: $0 [-d tmpdir] [-a apptainer dir] [-t threads] [-r inputref] [-q inputquery] [-o output fasta] [-n pangenome name]" >&2 + m) mm2params="$OPTARG";; + \?) 
echo "Usage: $0 [-d tmpdir] [-a apptainer dir] [-t threads] [-r inputref] [-q inputquery] [-o output fasta] [-n pangenome name] [-m mm2-params]" >&2 exit 1;; esac done @@ -33,21 +36,21 @@ ref=$(echo $fullref | sed 's/.hap/#/') mkdir -p $tmpdir # Running ragtag scaffold +echo -e "\nRunning RagTag\n" apptainer run $appdir/pan1c-env.sif ragtag.py scaffold \ - -t $threads -o $tmpdir $inputref $inputquery + --mm2-params "$mm2params -t $threads" $rtcommand -o $tmpdir $inputref $inputquery 2>&1 # Renaming sequence according to naming scheme -grep ">${ref}#chr*" -A1 $tmpdir/ragtag.scaffold.fasta | \ - sed "s/${ref}#chr\([^_]*\)_RagTag/${hapID}#chr\1/g" > $tmpdir/${sample}.ragtagged.fa - -# Compressing fasta -# apptainer run --app bgzip $appdir/PanGeTools.sif \ -# -@ $threads $tmpdir/${sample}.ragtagged.fa +echo -e "\nRenaming sequences\n" +grep ">${ref}" -A1 $tmpdir/ragtag.scaffold.fasta | \ + sed "s/${ref}#\(.*\)_RagTag/${hapID}#\1/g" > $tmpdir/${sample}.ragtagged.fa # Moving fa.gz to output dir +echo -e "\nMoving final file\n" mv $tmpdir/${sample}.ragtagged.fa $output # Compressing temporary files +echo -e "\nCompressing tmp dir\n" tar --remove-files -cf $tmpdir.tar $tmpdir apptainer run --app bgzip $appdir/PanGeTools.sif \ -@ $threads $tmpdir.tar diff --git a/scripts/sr_mapping2graph.sh b/scripts/sr_mapping2graph.sh deleted file mode 100755 index e203385b0731a6f5582372150cc1add48ab4787f..0000000000000000000000000000000000000000 --- a/scripts/sr_mapping2graph.sh +++ /dev/null @@ -1,32 +0,0 @@ -#!/bin/bash -# Map given shrot reads to the pangenome graph using vg giraffe. See wiki for detailled explanation. - -# Initializing arguments -shortreads="" # Fastq file to map against the graph -appdir="" # Directory containing apptainer images -threads="" # Threads count -graph="" # Pangenome graph -output="" # Output gam - -## Getting arguments -while getopts "a:t:g:f:o:" option; do - case "$option" in - a) appdir="$OPTARG";; - t) threads="$OPTARG";; - g) graph="$OPTARG";; - f) shortreads="$OPTARG";; - o) output="$OPTARG";; - \?) echo "Usage: $0 [-a apptainer dir] [-t threads] [-g graph] [-f fastq] [-o output gam]" >&2 - exit 1;; - esac -done - -## Main script -apptainer run --app vg $appdir/PanGeTools.sif \ - autoindex -t $threads -g $graph --workflow giraffe -p $(basename $graph) - -apptainer run --app vg $appdir/PanGeTools.sif \ - giraffe -Z ${graph}.giraffe.gbz -m ${graph}.min -d ${graph}.dist -f $shortreads -t $threads -p > $output - -apptainer run --app vg $appdir/PanGeTools.sif \ - stats -a $output diff --git a/scripts/var.pan1c_QC.py b/scripts/var.pan1c_QC.py new file mode 100644 index 0000000000000000000000000000000000000000..420e61948cee7a0f79b1f47b19fc9fc96dd73759 --- /dev/null +++ b/scripts/var.pan1c_QC.py @@ -0,0 +1,143 @@ +""" +Variant JSON creator for Pan1c-QC + +@author: alexis.mergez@inrae.fr +@version: 1.0 +""" + +import os +import argparse +import pandas as pd +import numpy as np +import json + +## Arguments +arg_parser = argparse.ArgumentParser(description='Variant JSON for Pan1c-QC') +arg_parser.add_argument( + "--input", + "-i", + dest = "inputs", + nargs="+", + required = True, + help = "TSV(s) of variants. 
Filename should be the tool used (for example: df_pan1c, syri_mm2, ...)" + ) +arg_parser.add_argument( + "--output", + dest = "output", + required = True, + help = "Output path" + ) +args = arg_parser.parse_args() + +## Parsing functions +def log_transform(x): + if x > 0: + return np.log10(x) # Logarithme pour les valeurs positives + elif x < 0: + return -np.log10(-x) # -Logarithme pour les valeurs négatives + else: + return np.nan + +def log_untransform(x): + if x >= 0: + return 10**(x) # Pour les valeurs non négatives + else: + return -10**(-x) + +def parse_tsv(file): + # Parsing TSV into dataframe + df = pd.read_csv(file, sep="\t") + df.rename(columns={"CHROM": "NAME"}, inplace=True) + df[["QUERY", "CHROM"]] = df["NAME"].str.rsplit("#", n=1, expand=True) + df[["Genome", "Haplotype"]] = df["HAP"].str.rsplit("#", n=1, expand=True) + df.drop("NAME", axis=1, inplace=True) + df.loc[:,"LEN"] = -df["LEN"] + + # Keeping variants between 50 <= |<val>| <= 100000 + df = df.query('(-100000 <= LEN <= -50) or (50 <= LEN <= 100000)') + + # Converting the length to Log + df.loc[:, "LOGLEN"] = df.loc[:, 'LEN'].apply(log_transform) + + # Binning LOGLEN + bins = pd.interval_range(start=-5, freq=0.05, end=5) + df.loc[:, 'BIN'] = pd.cut(df.loc[:, 'LOGLEN'], bins=bins, precision=2) + + # Counting variants for each bin, grouped by Query, Chromosome and Haplotype + fdata = {} + for query in df["QUERY"].unique(): + fdata[query] = {} + + for hapid in df["HAP"].unique(): + fdata[query][hapid] = {"Sums":{}, "Counts":{}} + + tmp_data = {} + fdata[query][hapid]["Sums"]["All_Total"] = 0 + fdata[query][hapid]["Sums"]["All_INS"] = 0 + fdata[query][hapid]["Sums"]["All_DEL"] = 0 + + for chromid in df["CHROM"].unique(): + + sub_df = df.query("QUERY == @query and CHROM == @chromid and HAP == @hapid") + bin_counts = sub_df['BIN'].value_counts(sort=False) + bin_counts = bin_counts.reindex(bins, fill_value=0) + + # Filtering bins for values between -50 and 50 + bin_counts = bin_counts[np.array([(interval.left > np.log10(50)) or (interval.right < -np.log10(50)) for interval in bin_counts.index])] + + # Saving bin counts for summing later + tmp_data[chromid] = bin_counts.values + + # Creating a concatenated list + tmp_counts = [str(k) for k in list(bin_counts.values)] + tmp_counts = tmp_counts[:66] + ["null"] + tmp_counts[66:] # Adding null for -50;50bp variants + fdata[query][hapid]["Counts"][chromid] = ';'.join(tmp_counts) + + # Computing Total number of variants, number of Deletions, number of Insertions + fdata[query][hapid]["Sums"][f"{chromid}_Total"] = int(bin_counts.values.sum()) + fdata[query][hapid]["Sums"]["All_Total"] += fdata[query][hapid]["Sums"][f"{chromid}_Total"] + fdata[query][hapid]["Sums"][f"{chromid}_DEL"] = int(bin_counts.values[:66].sum()) + fdata[query][hapid]["Sums"]["All_DEL"] += fdata[query][hapid]["Sums"][f"{chromid}_DEL"] + fdata[query][hapid]["Sums"][f"{chromid}_INS"] = int(bin_counts.values[66:].sum()) + fdata[query][hapid]["Sums"]["All_INS"] += fdata[query][hapid]["Sums"][f"{chromid}_INS"] + + bins_string = [f"({int(round(log_untransform(interval.left), 0))}, {int(round(log_untransform(interval.right), 0))}]" for interval in bin_counts.index] + + if "index" in fdata: + assert (np.array(fdata["index"])==np.array(bins_string)).all() + else: + fdata["index"] = bins_string + + all_chrom = pd.DataFrame.from_dict(tmp_data, orient='columns') + all_chrom.index = fdata["index"] + + tmp_counts = [str(k) for k in list(all_chrom.sum(axis=1).values)] + tmp_counts = tmp_counts[:66] + ["null"] + tmp_counts[66:] 
# Adding null for -50;50bp variants + fdata[query][hapid]["Counts"]["All"] = ';'.join(tmp_counts) + + fdata["index"] = ";".join(fdata["index"][:66]+["(-50, 50]"]+fdata["index"][66:]) + return fdata + +## Parsing all TSV and aggregating into a final dictionnary +data = {"x": {}, "y": {}} +for file in args.inputs: + tool_name = os.path.basename(file).rsplit('.', 1)[0] + print(tool_name) + data["y"][tool_name] = parse_tsv(file) + + if "index" in data: + assert (data["x"]["bin"]==data["y"][tool_name]["index"]) + else: + data["x"]["bin"] = data["y"][tool_name]["index"] + + del data["y"][tool_name]["index"] + +data["x"]["logx"] = ';'.join( + [ + str(np.array(interval[1:-1].split(", ")).astype('int').mean()) # Mean coord between min max of each bin + for interval in data["x"]["bin"].split(";") + ] +) + +with open(args.output, "w") as handle: + json.dump(data, handle, indent=6) diff --git a/scripts/vcf_2_tsv_syri.awk b/scripts/vcf_2_tsv_syri.awk new file mode 100644 index 0000000000000000000000000000000000000000..598653dae5335e06d0d71a4527c529433438662f --- /dev/null +++ b/scripts/vcf_2_tsv_syri.awk @@ -0,0 +1,21 @@ +# Header +/^#CHROM/ { print "CHROM\tPOS\tID\tHAP\tLEN" } + + +# INS/DEL counter (Counting SyRI typed INDEL) +#!/^#/ { +# TYPE=substr($3, 1, 3) +# if (TYPE=="DEL" || TYPE=="INS") { +# ALTLEN=length($4)-length($5) +# printf("%s#%s#%s\t%s\t%s\t%s#%s\t%s\n", RHAP, RHAPN, $1, $2, $3, THAP, THAPN, ALTLEN) +# } +#} + +# INS/DEL counter (Dumb style counting) +!/^#/ { + TYPE=substr($3, 1, 1) + if (TYPE!="<") { + ALTLEN=length($4)-length($5) + printf("%s#%s#%s\t%s\t%s\t%s#%s\t%s\n", RHAP, RHAPN, $1, $2, $3, THAP, THAPN, ALTLEN) + } +} \ No newline at end of file diff --git a/scripts/vcf_2_tsv_vg.awk b/scripts/vcf_2_tsv_vg.awk new file mode 100644 index 0000000000000000000000000000000000000000..75563c2b29055a3f0ce3ffba450bbafe94bc3b86 --- /dev/null +++ b/scripts/vcf_2_tsv_vg.awk @@ -0,0 +1,31 @@ +# Header +/^#CHROM/ { + # Getting the name of haplotypes + for (i=10;i<=NF;i++) {HAP[i]=$i}; print "CHROM\tPOS\tID\tHAP\tLEN" +} + +# INS/DEL counter +!/^#/ { + ## Get allele length compared to the ref + # Adding the reference which is 0 + ALTLEN[0]=0 + # Split the comma separated list of alternative alleles + split($5, arr, ",") + # Compute the length of each alternative allele + for (i in arr) { + ALTLEN[i]=(length($4) - length(arr[i])) + } + # For each sample + for (i=10; i<=NF; i++) { + # Splitting haplotypes + split($i, ALTs, "|") + hap=0 + #Iterating over haplotypes + for (a in ALTs) { + hap+=1 + if (ALTs[a] != "." 
&& ALTLEN[ALTs[a]] != 0) { + printf("%s\t%s\t%s|%s\t%s#%s\t%s\n", $1, $2, $3, ALTs[a], HAP[i], hap, ALTLEN[ALTs[a]]) + } + } + } +} \ No newline at end of file diff --git a/src/README.md b/src/README.md new file mode 100644 index 0000000000000000000000000000000000000000..0164645b236dc4645a97143a6130d1f721b31ab0 --- /dev/null +++ b/src/README.md @@ -0,0 +1,2 @@ +# Github flavoured markdown CSS +From https://gist.github.com/fergiemcdowall/9ecbea41f67465b5cfcd3560508eb100#file-github-markdown-css \ No newline at end of file diff --git a/src/github-markdown.css b/src/github-markdown.css new file mode 100644 index 0000000000000000000000000000000000000000..1c5d981b96b86ea02232f01c65c993699f604484 --- /dev/null +++ b/src/github-markdown.css @@ -0,0 +1,938 @@ +@font-face { + font-family: octicons-link; + src: url(data:font/woff;charset=utf-8;base64,d09GRgABAAAAAAZwABAAAAAACFQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABEU0lHAAAGaAAAAAgAAAAIAAAAAUdTVUIAAAZcAAAACgAAAAoAAQAAT1MvMgAAAyQAAABJAAAAYFYEU3RjbWFwAAADcAAAAEUAAACAAJThvmN2dCAAAATkAAAABAAAAAQAAAAAZnBnbQAAA7gAAACyAAABCUM+8IhnYXNwAAAGTAAAABAAAAAQABoAI2dseWYAAAFsAAABPAAAAZwcEq9taGVhZAAAAsgAAAA0AAAANgh4a91oaGVhAAADCAAAABoAAAAkCA8DRGhtdHgAAAL8AAAADAAAAAwGAACfbG9jYQAAAsAAAAAIAAAACABiATBtYXhwAAACqAAAABgAAAAgAA8ASm5hbWUAAAToAAABQgAAAlXu73sOcG9zdAAABiwAAAAeAAAAME3QpOBwcmVwAAAEbAAAAHYAAAB/aFGpk3jaTY6xa8JAGMW/O62BDi0tJLYQincXEypYIiGJjSgHniQ6umTsUEyLm5BV6NDBP8Tpts6F0v+k/0an2i+itHDw3v2+9+DBKTzsJNnWJNTgHEy4BgG3EMI9DCEDOGEXzDADU5hBKMIgNPZqoD3SilVaXZCER3/I7AtxEJLtzzuZfI+VVkprxTlXShWKb3TBecG11rwoNlmmn1P2WYcJczl32etSpKnziC7lQyWe1smVPy/Lt7Kc+0vWY/gAgIIEqAN9we0pwKXreiMasxvabDQMM4riO+qxM2ogwDGOZTXxwxDiycQIcoYFBLj5K3EIaSctAq2kTYiw+ymhce7vwM9jSqO8JyVd5RH9gyTt2+J/yUmYlIR0s04n6+7Vm1ozezUeLEaUjhaDSuXHwVRgvLJn1tQ7xiuVv/ocTRF42mNgZGBgYGbwZOBiAAFGJBIMAAizAFoAAABiAGIAznjaY2BkYGAA4in8zwXi+W2+MjCzMIDApSwvXzC97Z4Ig8N/BxYGZgcgl52BCSQKAA3jCV8CAABfAAAAAAQAAEB42mNgZGBg4f3vACQZQABIMjKgAmYAKEgBXgAAeNpjYGY6wTiBgZWBg2kmUxoDA4MPhGZMYzBi1AHygVLYQUCaawqDA4PChxhmh/8ODDEsvAwHgMKMIDnGL0x7gJQCAwMAJd4MFwAAAHjaY2BgYGaA4DAGRgYQkAHyGMF8NgYrIM3JIAGVYYDT+AEjAwuDFpBmA9KMDEwMCh9i/v8H8sH0/4dQc1iAmAkALaUKLgAAAHjaTY9LDsIgEIbtgqHUPpDi3gPoBVyRTmTddOmqTXThEXqrob2gQ1FjwpDvfwCBdmdXC5AVKFu3e5MfNFJ29KTQT48Ob9/lqYwOGZxeUelN2U2R6+cArgtCJpauW7UQBqnFkUsjAY/kOU1cP+DAgvxwn1chZDwUbd6CFimGXwzwF6tPbFIcjEl+vvmM/byA48e6tWrKArm4ZJlCbdsrxksL1AwWn/yBSJKpYbq8AXaaTb8AAHja28jAwOC00ZrBeQNDQOWO//sdBBgYGRiYWYAEELEwMTE4uzo5Zzo5b2BxdnFOcALxNjA6b2ByTswC8jYwg0VlNuoCTWAMqNzMzsoK1rEhNqByEyerg5PMJlYuVueETKcd/89uBpnpvIEVomeHLoMsAAe1Id4AAAAAAAB42oWQT07CQBTGv0JBhagk7HQzKxca2sJCE1hDt4QF+9JOS0nbaaYDCQfwCJ7Au3AHj+LO13FMmm6cl7785vven0kBjHCBhfpYuNa5Ph1c0e2Xu3jEvWG7UdPDLZ4N92nOm+EBXuAbHmIMSRMs+4aUEd4Nd3CHD8NdvOLTsA2GL8M9PODbcL+hD7C1xoaHeLJSEao0FEW14ckxC+TU8TxvsY6X0eLPmRhry2WVioLpkrbp84LLQPGI7c6sOiUzpWIWS5GzlSgUzzLBSikOPFTOXqly7rqx0Z1Q5BAIoZBSFihQYQOOBEdkCOgXTOHA07HAGjGWiIjaPZNW13/+lm6S9FT7rLHFJ6fQbkATOG1j2OFMucKJJsxIVfQORl+9Jyda6Sl1dUYhSCm1dyClfoeDve4qMYdLEbfqHf3O/AdDumsjAAB42mNgYoAAZQYjBmyAGYQZmdhL8zLdDEydARfoAqIAAAABAAMABwAKABMAB///AA8AAQAAAAAAAAAAAAAAAAABAAAAAA==) format('woff'); +} + +.octicon { + display: inline-block; + fill: currentColor; + vertical-align: text-bottom; +} + +.anchor { + float: left; + line-height: 1; + margin-left: -20px; + padding-right: 4px; +} + +.anchor:focus { + outline: none; +} + +h1 .octicon-link, +h2 .octicon-link, +h3 .octicon-link, +h4 .octicon-link, +h5 .octicon-link, +h6 .octicon-link { + color: #1b1f23; + vertical-align: middle; + visibility: hidden; +} + +h1:hover .anchor, +h2:hover .anchor, +h3:hover .anchor, 
+h4:hover .anchor, +h5:hover .anchor, +h6:hover .anchor { + text-decoration: none; +} + +h1:hover .anchor .octicon-link, +h2:hover .anchor .octicon-link, +h3:hover .anchor .octicon-link, +h4:hover .anchor .octicon-link, +h5:hover .anchor .octicon-link, +h6:hover .anchor .octicon-link { + visibility: visible; +} + +body { + -ms-text-size-adjust: 100%; + -webkit-text-size-adjust: 100%; + color: #24292e; + line-height: 1.5; + font-family: -apple-system,BlinkMacSystemFont,Segoe UI,Helvetica,Arial,sans-serif,Apple Color Emoji,Segoe UI Emoji,Segoe UI Symbol; + font-size: 16px; + line-height: 1.5; + word-wrap: break-word; +} + +.pl-c { + color: #6a737d; +} + +.pl-c1, +.pl-s .pl-v { + color: #005cc5; +} + +.pl-e, +.pl-en { + color: #6f42c1; +} + +.pl-s .pl-s1, +.pl-smi { + color: #24292e; +} + +.pl-ent { + color: #22863a; +} + +.pl-k { + color: #d73a49; +} + +.pl-pds, +.pl-s, +.pl-s .pl-pse .pl-s1, +.pl-sr, +.pl-sr .pl-cce, +.pl-sr .pl-sra, +.pl-sr .pl-sre { + color: #032f62; +} + +.pl-smw, +.pl-v { + color: #e36209; +} + +.pl-bu { + color: #b31d28; +} + +.pl-ii { + background-color: #b31d28; + color: #fafbfc; +} + +.pl-c2 { + background-color: #d73a49; + color: #fafbfc; +} + +.pl-c2:before { + content: "^M"; +} + +.pl-sr .pl-cce { + color: #22863a; + font-weight: 700; +} + +.pl-ml { + color: #735c0f; +} + +.pl-mh, +.pl-mh .pl-en, +.pl-ms { + color: #005cc5; + font-weight: 700; +} + +.pl-mi { + color: #24292e; + font-style: italic; +} + +.pl-mb { + color: #24292e; + font-weight: 700; +} + +.pl-md { + background-color: #ffeef0; + color: #b31d28; +} + +.pl-mi1 { + background-color: #f0fff4; + color: #22863a; +} + +.pl-mc { + background-color: #ffebda; + color: #e36209; +} + +.pl-mi2 { + background-color: #005cc5; + color: #f6f8fa; +} + +.pl-mdr { + color: #6f42c1; + font-weight: 700; +} + +.pl-ba { + color: #586069; +} + +.pl-sg { + color: #959da5; +} + +.pl-corl { + color: #032f62; + text-decoration: underline; +} + +details { + display: block; +} + +summary { + display: list-item; +} + +a { + background-color: transparent; +} + +a:active, +a:hover { + outline-width: 0; +} + +strong { + font-weight: inherit; + font-weight: bolder; +} + +h1 { + font-size: 2em; + margin: .67em 0; +} + +img { + border-style: none; +} + +code, +kbd, +pre { + font-family: monospace,monospace; + font-size: 1em; +} + +hr { + box-sizing: content-box; + height: 0; + overflow: visible; +} + +input { + font: inherit; + margin: 0; +} + +input { + overflow: visible; +} + +[type=checkbox] { + box-sizing: border-box; + padding: 0; +} + +* { + box-sizing: border-box; +} + +input { + font-family: inherit; + font-size: inherit; + line-height: inherit; +} + +a { + color: #0366d6; + text-decoration: none; +} + +a:hover { + text-decoration: underline; +} + +strong { + font-weight: 600; +} + +hr { + background: transparent; + border: 0; + border-bottom: 1px solid #dfe2e5; + height: 0; + margin: 15px 0; + overflow: hidden; +} + +hr:before { + content: ""; + display: table; +} + +hr:after { + clear: both; + content: ""; + display: table; +} + +table { + border-collapse: collapse; + border-spacing: 0; +} + +td, +th { + padding: 0; +} + +details summary { + cursor: pointer; +} + +h1, +h2, +h3, +h4, +h5, +h6 { + margin-bottom: 0; + margin-top: 0; +} + +h1 { + font-size: 32px; +} + +h1, +h2 { + font-weight: 600; +} + +h2 { + font-size: 24px; +} + +h3 { + font-size: 20px; +} + +h3, +h4 { + font-weight: 600; +} + +h4 { + font-size: 16px; +} + +h5 { + font-size: 14px; +} + +h5, +h6 { + font-weight: 600; +} + +h6 { + font-size: 12px; +} + +p { + 
margin-bottom: 10px; + margin-top: 0; +} + +blockquote { + margin: 0; +} + +ol, +ul { + margin-bottom: 0; + margin-top: 0; + padding-left: 0; +} + +ol ol, +ul ol { + list-style-type: lower-roman; +} + +ol ol ol, +ol ul ol, +ul ol ol, +ul ul ol { + list-style-type: lower-alpha; +} + +dd { + margin-left: 0; +} + +code, +pre { + font-family: SFMono-Regular,Consolas,Liberation Mono,Menlo,Courier,monospace; + font-size: 12px; +} + +pre { + margin-bottom: 0; + margin-top: 0; +} + +input::-webkit-inner-spin-button, +input::-webkit-outer-spin-button { + -webkit-appearance: none; + appearance: none; + margin: 0; +} + +.border { + border: 1px solid #e1e4e8!important; +} + +.border-0 { + border: 0!important; +} + +.border-bottom { + border-bottom: 1px solid #e1e4e8!important; +} + +.rounded-1 { + border-radius: 3px!important; +} + +.bg-white { + background-color: #fff!important; +} + +.bg-gray-light { + background-color: #fafbfc!important; +} + +.text-gray-light { + color: #6a737d!important; +} + +.mb-0 { + margin-bottom: 0!important; +} + +.my-2 { + margin-bottom: 8px!important; + margin-top: 8px!important; +} + +.pl-0 { + padding-left: 0!important; +} + +.py-0 { + padding-bottom: 0!important; + padding-top: 0!important; +} + +.pl-1 { + padding-left: 4px!important; +} + +.pl-2 { + padding-left: 8px!important; +} + +.py-2 { + padding-bottom: 8px!important; + padding-top: 8px!important; +} + +.pl-3, +.px-3 { + padding-left: 16px!important; +} + +.px-3 { + padding-right: 16px!important; +} + +.pl-4 { + padding-left: 24px!important; +} + +.pl-5 { + padding-left: 32px!important; +} + +.pl-6 { + padding-left: 40px!important; +} + +.f6 { + font-size: 12px!important; +} + +.lh-condensed { + line-height: 1.25!important; +} + +.text-bold { + font-weight: 600!important; +} + +a:not([href]) { + color: inherit; + text-decoration: none; +} + +blockquote, +dl, +ol, +p, +pre, +table, +ul { + margin-bottom: 16px; + margin-top: 0; +} + +hr { + background-color: #e1e4e8; + border: 0; + height: .25em; + margin: 24px 0; + padding: 0; +} + +blockquote { + border-left: .25em solid #dfe2e5; + color: #6a737d; + padding: 0 1em; +} + +blockquote>:first-child { + margin-top: 0; +} + +blockquote>:last-child { + margin-bottom: 0; +} + +kbd { + background-color: #fafbfc; + border: 1px solid #c6cbd1; + border-bottom-color: #959da5; + border-radius: 3px; + box-shadow: inset 0 -1px 0 #959da5; + color: #444d56; + display: inline-block; + font-size: 11px; + line-height: 10px; + padding: 3px 5px; + vertical-align: middle; +} + +h1, +h2, +h3, +h4, +h5, +h6 { + font-weight: 600; + line-height: 1.25; + margin-bottom: 16px; + margin-top: 24px; +} + +h1 { + font-size: 2em; +} + +h1, +h2 { + border-bottom: 1px solid #eaecef; + padding-bottom: .3em; +} + +h2 { + font-size: 1.5em; +} + +h3 { + font-size: 1.25em; +} + +h4 { + font-size: 1em; +} + +h5 { + font-size: .875em; +} + +h6 { + color: #6a737d; + font-size: .85em; +} + +ol, +ul { + padding-left: 2em; +} + +ol ol, +ol ul, +ul ol, +ul ul { + margin-bottom: 0; + margin-top: 0; +} + +li { + word-wrap: break-all; +} + +li>p { + margin-top: 16px; +} + +li+li { + margin-top: .25em; +} + +dl { + padding: 0; +} + +dl dt { + font-size: 1em; + font-style: italic; + font-weight: 600; + margin-top: 16px; + padding: 0; +} + +dl dd { + margin-bottom: 16px; + padding: 0 16px; +} + +table { + display: block; + overflow: auto; + width: 100%; +} + +table th { + font-weight: 600; +} + +table td, +table th { + border: 1px solid #dfe2e5; + padding: 6px 13px; +} + +table tr { + background-color: #fff; + 
border-top: 1px solid #c6cbd1; +} + +table tr:nth-child(2n) { + background-color: #f6f8fa; +} + +img { + background-color: #fff; + box-sizing: content-box; + max-width: 100%; +} + +img[align=right] { + padding-left: 20px; +} + +img[align=left] { + padding-right: 20px; +} + +code { + background-color: rgba(27,31,35,.05); + border-radius: 3px; + font-size: 85%; + margin: 0; + padding: .2em .4em; +} + +pre { + word-wrap: normal; +} + +pre>code { + background: transparent; + border: 0; + font-size: 100%; + margin: 0; + padding: 0; + white-space: pre; + word-break: normal; +} + +.highlight { + margin-bottom: 16px; +} + +.highlight pre { + margin-bottom: 0; + word-break: normal; +} + +.highlight pre, +pre { + background-color: #f6f8fa; + border-radius: 3px; + font-size: 85%; + line-height: 1.45; + overflow: auto; + padding: 16px; +} + +pre code { + background-color: transparent; + border: 0; + display: inline; + line-height: inherit; + margin: 0; + max-width: auto; + overflow: visible; + padding: 0; + word-wrap: normal; +} + +.commit-tease-sha { + color: #444d56; + display: inline-block; + font-family: SFMono-Regular,Consolas,Liberation Mono,Menlo,Courier,monospace; + font-size: 90%; +} + +.blob-wrapper { + border-bottom-left-radius: 3px; + border-bottom-right-radius: 3px; + overflow-x: auto; + overflow-y: hidden; +} + +.blob-wrapper-embedded { + max-height: 240px; + overflow-y: auto; +} + +.blob-num { + -moz-user-select: none; + -ms-user-select: none; + -webkit-user-select: none; + color: rgba(27,31,35,.3); + cursor: pointer; + font-family: SFMono-Regular,Consolas,Liberation Mono,Menlo,Courier,monospace; + font-size: 12px; + line-height: 20px; + min-width: 50px; + padding-left: 10px; + padding-right: 10px; + text-align: right; + user-select: none; + vertical-align: top; + white-space: nowrap; + width: 1%; +} + +.blob-num:hover { + color: rgba(27,31,35,.6); +} + +.blob-num:before { + content: attr(data-line-number); +} + +.blob-code { + line-height: 20px; + padding-left: 10px; + padding-right: 10px; + position: relative; + vertical-align: top; +} + +.blob-code-inner { + color: #24292e; + font-family: SFMono-Regular,Consolas,Liberation Mono,Menlo,Courier,monospace; + font-size: 12px; + overflow: visible; + white-space: pre; + word-wrap: normal; +} + +.pl-token.active, +.pl-token:hover { + background: #ffea7f; + cursor: pointer; +} + +kbd { + background-color: #fafbfc; + border: 1px solid #d1d5da; + border-bottom-color: #c6cbd1; + border-radius: 3px; + box-shadow: inset 0 -1px 0 #c6cbd1; + color: #444d56; + display: inline-block; + font: 11px SFMono-Regular,Consolas,Liberation Mono,Menlo,Courier,monospace; + line-height: 10px; + padding: 3px 5px; + vertical-align: middle; +} + +:checked+.radio-label { + border-color: #0366d6; + position: relative; + z-index: 1; +} + +.tab-size[data-tab-size="1"] { + -moz-tab-size: 1; + tab-size: 1; +} + +.tab-size[data-tab-size="2"] { + -moz-tab-size: 2; + tab-size: 2; +} + +.tab-size[data-tab-size="3"] { + -moz-tab-size: 3; + tab-size: 3; +} + +.tab-size[data-tab-size="4"] { + -moz-tab-size: 4; + tab-size: 4; +} + +.tab-size[data-tab-size="5"] { + -moz-tab-size: 5; + tab-size: 5; +} + +.tab-size[data-tab-size="6"] { + -moz-tab-size: 6; + tab-size: 6; +} + +.tab-size[data-tab-size="7"] { + -moz-tab-size: 7; + tab-size: 7; +} + +.tab-size[data-tab-size="8"] { + -moz-tab-size: 8; + tab-size: 8; +} + +.tab-size[data-tab-size="9"] { + -moz-tab-size: 9; + tab-size: 9; +} + +.tab-size[data-tab-size="10"] { + -moz-tab-size: 10; + tab-size: 10; +} + 
+.tab-size[data-tab-size="11"] { + -moz-tab-size: 11; + tab-size: 11; +} + +.tab-size[data-tab-size="12"] { + -moz-tab-size: 12; + tab-size: 12; +} + +.task-list-item { + list-style-type: none; +} + +.task-list-item+.task-list-item { + margin-top: 3px; +} + +.task-list-item input { + margin: 0 .2em .25em -1.6em; + vertical-align: middle; +} + +hr { + border-bottom-color: #eee; +} + +.pl-0 { + padding-left: 0!important; +} + +.pl-1 { + padding-left: 4px!important; +} + +.pl-2 { + padding-left: 8px!important; +} + +.pl-3 { + padding-left: 16px!important; +} + +.pl-4 { + padding-left: 24px!important; +} + +.pl-5 { + padding-left: 32px!important; +} + +.pl-6 { + padding-left: 40px!important; +} + +.pl-7 { + padding-left: 48px!important; +} + +.pl-8 { + padding-left: 64px!important; +} + +.pl-9 { + padding-left: 80px!important; +} + +.pl-10 { + padding-left: 96px!important; +} + +.pl-11 { + padding-left: 112px!important; +} + +.pl-12 { + padding-left: 128px!important; +} diff --git a/src/plotsr-base.cfg b/src/plotsr-base.cfg new file mode 100644 index 0000000000000000000000000000000000000000..394e11c640f41addccda2aadc44da1154aefc133 --- /dev/null +++ b/src/plotsr-base.cfg @@ -0,0 +1,18 @@ +## COLOURS and transparency for alignments (syntenic, inverted, translocated, and duplicated) +syncol:#CCCCCC +invcol:#FFA500 +tracol:#9ACD32 +dupcol:#00BBFF +alpha:0.8 + +## Margins and dimensions: +chrmar:0.1 ## Adjusts the gap between chromosomes and tracks. Higher values leads to more gap +exmar:0.1 ## Extra margin at the top and bottom of plot area +marginchr:0.1 ## Margin between adjacent chromosomes when using --itx + +## Legend +legend:T ## To plot legend use T, use F to not plot legend +genlegcol:5 ## Number of columns for genome legend, set -1 for automatic setup +bbox:0,1.01,0.5,0.3 ## [Left edge, bottom edge, width, height] +bbox_v:0,1.1,0.5,0.3 ## For vertical chromosomes (using -v option) +bboxmar:0.8 ## Margin between genome and annotation legends \ No newline at end of file