README.md 9.35 KiB
Nicolas Lapalu's avatar
Nicolas Lapalu committed
 [![coverage report](https://forgemia.inra.fr/nicolas.lapalu/ingenannot/badges/refactoring/coverage.svg)](https://forgemia.inra.fr/nicolas.lapalu/ingenannot/-/commits/refactoring)


# INGENANNOT: INspection of GENe ANNOTation

INGENANNOT is a set of utilities to inspect and generate
statistics for one or several sets of gene annotations. It allows
structure comparison and can help you prioritize your
efforts in manual curation. Among other things, INGENANNOT uses
the Sequence Ontology gene-splicing classification
([1](),[2]()), which classifies alternative transcripts into seven
categories, and the Annotation Edit Distance (AED), proposed as a metric for
evidence support.

As several approaches and tools exist to annotate genes in newly assembled genomes, it can be useful to compare their predictions and extract the gene models best supported by evidence.

**Table of Contents:**

[[_TOC_]]

The documentation below describes each tool separately. A complete use case of annotation update is available [here](doc/analysis.md).

## Common options

INGENANNOT can handle multiple GFF files from different sources. When several annotations are compared, gene boundaries often diverge
(especially if you tried to predict UTR regions), so genes must be
clustered into new loci that share a list of transcripts.
We call these new loci 'meta-genes' and propose several options
to cluster them. The following table summarizes the pros and cons of each clustering feature type.

||pros|cons|
|:--:|--|--|
|`--clu-type gene`|detects problems of mis-sense (opposite-strand) predictions|overlapping UTRs merge different genes|
|`--clu-type cds`|detects problems of mis-sense (opposite-strand) predictions||
|`--clu-type gene` `--clu-stranded`|resolves conflicts between genes and possible non-coding RNAs on the opposite strand|will not detect severe problems due to divergent predictions on the opposite strand; overlapping UTRs merge different genes|
|`--clu-type cds` `--clu-stranded`|||
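
The effect of stranded versus unstranded clustering can be illustrated with a minimal sketch. It is not INGENANNOT's implementation: the `Gene` record, the `cluster_genes` function and the overlap rule (two genes join the same meta-gene as soon as their coordinates overlap) are assumptions made for this example. Using CDS coordinates instead of gene coordinates would correspond to the `cds` feature type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Minimal gene record; field names are illustrative only."""
    gene_id: str
    seqid: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def cluster_genes(genes, stranded=False):
    """Group overlapping genes into 'meta-gene' clusters.

    With stranded=True, genes on opposite strands never share a cluster.
    """
    def key(gene):
        return (gene.seqid, gene.strand if stranded else "", gene.start, gene.end)

    clusters = []
    current, current_group, current_end = [], None, -1
    for gene in sorted(genes, key=key):
        group = key(gene)[:2]
        if current and group == current_group and gene.start <= current_end:
            # Overlaps the running cluster: extend it.
            current.append(gene)
            current_end = max(current_end, gene.end)
        else:
            if current:
                clusters.append(current)
            current, current_group, current_end = [gene], group, gene.end
    if current:
        clusters.append(current)
    return clusters


genes = [
    Gene("g1", "chr1", 100, 500, "+"),
    Gene("g2", "chr1", 450, 900, "-"),
    Gene("g3", "chr1", 1200, 1500, "+"),
]
unstranded = cluster_genes(genes)
stranded = cluster_genes(genes, stranded=True)
```

In this toy data set, `g1` and `g2` overlap on opposite strands: the unstranded run groups them into one meta-gene, whereas the stranded run keeps them apart.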




## Validate: check your GFF/GTF file format

Despite efforts to define a common format with rules and constraints,
the GFF/GTF formats are permissive, and their contents can be misinterpreted
during file parsing.

We provide a simple validator to ensure that INGENANNOT supports your annotation file format. In addition, the validate command offers a `statistics` option to display several metrics on your gene sets.

```
ingenannot -v 2 validate myfile.gff -s
```

If validation fails, check your file and fix the missing or required data. The outputs of several tools supported by ingenannot, and how they differ from the standard formats, are listed [here](#annotation-files-supported-formats).
If you need help or require a parser for your specific format,
feel free to contact us or open an issue.


## Filter: remove undesired annotations

```
ingenannot filter
```

## Classify: classification based on Sequence Ontology (SO)

|Class|definition|example|
|--|--|--|
|N:0:0|No transcript-pairs share any exon sequence|![N:0:0](doc/images/N_0_0.png)|
|N:N:0|Some transcript-pairs share sequence, but none have common exon boundaries|![N:N:0](doc/images/N_N_0.png)|
|N:0:N|Some transcript-pairs share no sequence, others have common exon boundaries|![N:0:N](doc/images/N_0_N.png)|
|N:N:N|Some transcript-pairs share no sequence, others have common sequence and exon boundaries|![N:N:N](doc/images/N_N_N.png)|
|0:N:0|All transcript-pairs share sequence in common, but none share exon boundaries|![0:N:0](doc/images/0_N_0.png)|
|0:N:N|All transcript-pairs share sequence in common and some share exon boundaries|![0:N:N](doc/images/0_N_N.png)|
|0:0:N|All transcript-pairs share some exons in common|![0:0:N](doc/images/0_0_N.png)|



As described above, the SO classification was originally based on exon boundaries,
which can be highly problematic for de novo annotations with poorly
defined UTRs. To avoid this problem, you can perform
the same classification based on CDS coordinates, which
gives less biased results. The following table summarizes
the pros and cons of each classification feature type.

||pros|cons|
|:--:|--|--|
|`--clatype gene`|complete gene structure analysis|too sensitive when the annotation sets diverge (e.g. UTRs vs. no UTRs)|
|`--clatype cds`|limited to the coding sequence, which avoids background noise due to UTRs; useful in case of poorly predicted UTRs|structure inspection limited to the CDS|
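
To make the seven classes more concrete, here is a minimal sketch of how a locus could be assigned a class from its transcript pairs. It is only an illustration under stated assumptions, not INGENANNOT's code: the function names are hypothetical, and "sharing exon boundaries" is simplified to "at least one feature with identical start and end". Passing exon coordinates or CDS coordinates mimics the two feature types compared in the table above.

```python
from itertools import combinations


def pair_category(tr_a, tr_b):
    """Return 0, 1 or 2 for a pair of transcripts.

    Each transcript is a list of (start, end) features (exons or CDS parts),
    1-based inclusive.
    0: no shared sequence
    1: shared sequence but no feature with identical boundaries
    2: at least one feature with identical boundaries
    """
    if set(tr_a) & set(tr_b):
        return 2
    overlap = any(
        a_start <= b_end and b_start <= a_end
        for a_start, a_end in tr_a
        for b_start, b_end in tr_b
    )
    return 1 if overlap else 0


def so_class(transcripts):
    """Build the 'x:y:z' code of one locus from all its transcript pairs."""
    seen = {pair_category(a, b) for a, b in combinations(transcripts, 2)}
    return ":".join("N" if category in seen else "0" for category in (0, 1, 2))


t1 = [(100, 200), (300, 400)]
t2 = [(100, 200), (300, 450)]   # shares the first exon with t1
t3 = [(150, 250)]               # overlaps t1 and t2 without a common exon
t4 = [(1000, 1100)]             # no overlap with t1

class_shared_exon = so_class([t1, t2])
class_overlap_only = so_class([t1, t3])
class_mixed = so_class([t1, t2, t3])
class_disjoint = so_class([t1, t4])
```

The sketch needs at least two transcripts per locus; with a single transcript there are no pairs to classify.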




```
ingenannot soclassify
```

## Compare: compare predictions, common/different structures

```
ingenannot compare
```

## Select: perform a selection of best structures, evidence-driven

```
ingenannot select
```

Annotation Edit Distance (AED) was proposed as a metric for gene annotation prediction (ref) and was implemented in Maker to filter out predicted models based on their AED. Here we propose some options that modify the computation of this distance and take into account the different sources of evidence.

Not all gene prediction tools are able to predict UTRs yet, despite the availability of RNA-Seq data and long-read based transcripts. To avoid penalizing gene models limited to their CDS, we implemented an overflow penalty parameter that maximizes the score of the model fitting best with the transcript evidence despite missing UTRs.

In addition, we compute the AED separately for transcript and protein evidence. Some genes are supported only by transcript evidence (new/specific genes), only by protein evidence (genes not expressed in our data), or by both types of evidence. To select the best model, we rank the gene models according to their AED for transcript (tr) and protein (pr) evidence separately. If the same gene model ranks first in both rankings, we select it. If not, we compute the two distances between models according to their ranking, and select the most divergent.

|case|AED tr|AED pr|rank tr|rank pr|
|--|--|--|--|--|


How should the different parameters be used, and what is their impact on the computed AED? Below, we simulated different cases with different parameters and show the impact on the computed AED:

***AED with proteins:***

Only the CDS part of the gene model is used, so UTRs are discarded.

![AED with proteins](doc/images/AED_protein.png)
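
As a rough illustration of the distance itself, the sketch below computes a base-level AED following the commonly published definition, AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered by the gene model and SP is the fraction of gene-model bases covered by the evidence. The function name and the interval representation are assumptions made for this example; it does not reproduce INGENANNOT's implementation or its penalty options.

```python
def aed(model, evidence):
    """Base-level Annotation Edit Distance between two sets of intervals.

    model, evidence: lists of (start, end) tuples, 1-based inclusive,
    e.g. the CDS parts of a gene model and the aligned blocks of a protein.
    Returns a value in [0, 1]: 0 is perfect agreement, 1 is no overlap.
    """
    model_bases = {p for start, end in model for p in range(start, end + 1)}
    evidence_bases = {p for start, end in evidence for p in range(start, end + 1)}
    if not model_bases or not evidence_bases:
        return 1.0
    shared = len(model_bases & evidence_bases)
    sensitivity = shared / len(evidence_bases)  # evidence covered by the model
    specificity = shared / len(model_bases)     # model covered by the evidence
    return 1.0 - (sensitivity + specificity) / 2.0


cds = [(101, 200), (301, 400)]       # 200 coding bases
protein = [(101, 200), (301, 350)]   # evidence covers 150 of them

partial = aed(cds, protein)          # SN = 1.0, SP = 0.75
identical = aed(cds, cds)
disjoint = aed(cds, [(1000, 1100)])
```

Building sets of positions keeps the sketch short; for whole-genome data an interval-based intersection would be more memory efficient.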

***AED with transcripts assembled from RNA-Seq data:***

Only the CDS part of the gene model is used, to avoid bias when comparing gene models with or without UTRs, depending on the gene predictor. Moreover, UTRs inferred from RNA-Seq transcripts may be wrong because of the data (low/high coverage) and the assembly software. Here, the AED is therefore the distance between the CDS of the gene model and the transcript evidence. If you absolutely want to exclude or penalize gene models whose CDS parts do not fit the splicing sites of the transcript, you can add a penalty weight with the `--penalty_overflow` option, set to 0.0 (no penalty) by default.

![AED with RNA-Seq](doc/images/AED_rnaseq.png)

***AED with transcripts recovered with Long-reads (Iso-Seq, Nanopore):***

The whole gene model is used (exons with CDS and UTRs). Long-read transcripts are considered very reliable evidence for both their CDS and UTR parts, so we expect a very good fit between the gene model and the evidence. For this reason, when splicing sites diverge, a penalty weight is applied with the `--longreads_penalty_overflow` parameter, set to 0.25 by default. This penalty is applied only if a difference in splicing sites is observed in the CDS parts of the gene model. We allow divergences in UTRs, which can be corrected later with the `utr_refine` command.

![AED with Longreads](doc/images/AED_longreads.png)

## Reduce: reduce the set of genes with AED thresholds 

```
ingenannot reduce
```

## UTR_Refine: add/correct utrs of gene annotation with evidence

Annotating UTRs is a non-trivial task when using transcripts assembled from RNA-Seq or short reads. With long-read technology, we now have access to full-length isoforms, including their UTRs. `utr_refine` corrects or erases the UTR annotation of an annotated transcript using the most appropriate evidence. No mix of evidence is allowed. If you want to combine 3' and 5' UTRs coming from different evidence to maximize the size of the UTRs, you have to preprocess your data and submit the corrected transcripts to `utr_refine`. It is well known that some isoforms contain introns in their UTRs, so mixing UTR parts from different isoforms will not reflect a true isoform.
When several pieces of evidence can be used to redefine the UTR coordinates of an annotated transcript, you can choose to select the shortest isoform, the longest isoform, or all isoforms (see figure below).
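
The selection rule described above can be sketched as follows. This is a hypothetical helper, not the `utr_refine` code: the function name, the `mode` values and the usability test (an evidence transcript is usable when it spans the whole CDS) are assumptions made for illustration, and a real implementation would also have to check that the intron structure is compatible. Each selected evidence provides both UTR boundaries, so 5' and 3' UTRs are never mixed between evidence.

```python
def select_utr_evidence(cds_start, cds_end, evidence, mode="longest"):
    """Pick evidence transcripts to redefine the UTRs of one annotated transcript.

    evidence: list of (start, end) spans of candidate transcripts,
    1-based inclusive. Only spans covering the whole CDS are kept.
    mode: "longest", "shortest" or "all".
    """
    def length(span):
        return span[1] - span[0] + 1

    usable = [(s, e) for s, e in evidence if s <= cds_start and e >= cds_end]
    if not usable or mode == "all":
        return usable
    if mode == "longest":
        return [max(usable, key=length)]
    if mode == "shortest":
        return [min(usable, key=length)]
    raise ValueError(f"unknown mode: {mode}")


candidates = [(50, 900), (80, 1000), (300, 700)]  # the last one misses part of the CDS
longest = select_utr_evidence(100, 800, candidates, mode="longest")
shortest = select_utr_evidence(100, 800, candidates, mode="shortest")
everything = select_utr_evidence(100, 800, candidates, mode="all")
```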

```
ingenannot utr_refine
```

## Add_isoforms: add potential isoforms to your gene models

We recommend using this tool with full-length coding data (ONT, PacBio transcripts). With assembled transcripts, the combination of splicing sites is not always validated. This tool implements a simple algorithm that strongly assumes the provided gene model is correct, which implies that the original gene model cannot be corrected. Moreover, transcript evidence must span only one CDS to avoid possible read-through. The default behaviour retains only isoforms that impact one pre-existing CDS.


## Example: Analysis of divergent annotations of a fungal genome, Zymoseptoria tritici

To illustrate the use of ingenannot, we 

We limit the analysis to the chromosome Y of Z

### Validation and statistics of genes sets

### Filtering of gene sets

### SOClassification

## Miscellaneous information

### Annotation files supported formats

## References