2 New File Format
To facilitate high-throughput data manipulation and reduce storage usage, several file formats have been proposed and generally accepted as standards. Thanks to these efforts (e.g. SAM/BAM and VCF), data analysis and tool development have become easier and more efficient. For bisulfite sequencing data, however, the currently available tools each use their own tool-specific data format. As a consequence, integrating results from several tools requires extra effort to unify data formats and develop customized tools, which is time-consuming and error-prone.
As one of the features of CGmapTools, we defined the ATCGmap and CGmap file formats to simplify downstream DNA methylation analysis, in the hope of standardizing the storage format of bisulfite sequencing data.
2.1 ATCGmap Format
After sequencing reads are aligned to the reference genome, all the detailed information about read coverage and the methylation level of each cytosine site is stored in BAM/SAM files, but it requires further interpretation. A well-defined file format called pileup summarizes the mapped reads covering each nucleotide along the reference genome. However, the pileup format was not designed for bisulfite sequencing data and lacks DNA methylation estimates for cytosines.
Here, we define the ATCGmap file format to integrate, in a single file, the mapping and coverage information for both non-cytosine and cytosine sites together with the estimated DNA methylation level.
| Col | Field | Type | Regexp/Range | Brief description |
|---|---|---|---|---|
| 1 | CHR | String | [!-?A-~]{1,118} | Chromosome name |
| 2 | NUC | Char | [ATCGN-] | The nucleotide on reference genome |
| 3 | POS | Int | [0, 2^32-1] | 1-based leftmost mapping position |
| 4 | CONT | String | {"--", "CG", "CHG", "CHH"} | Context |
| 5 | DINUC | String | {"--", "CA", "CT", "CC", "CG"} | Dinucleotide context |
| 6 | WA | Int | [0, 2^14-1] | Count of reads on the Watson strand supporting Adenine |
| 7 | WT | Int | [0, 2^14-1] | Count of reads on the Watson strand supporting Thymine |
| 8 | WC | Int | [0, 2^14-1] | Count of reads on the Watson strand supporting Cytosine |
| 9 | WG | Int | [0, 2^14-1] | Count of reads on the Watson strand supporting Guanine |
| 10 | WN | Int | [0, 2^6-1] | Count of reads on the Watson strand supporting None |
| 11 | CA | Int | [0, 2^14-1] | Count of reads on the Crick strand supporting Adenine |
| 12 | CT | Int | [0, 2^14-1] | Count of reads on the Crick strand supporting Thymine |
| 13 | CC | Int | [0, 2^14-1] | Count of reads on the Crick strand supporting Cytosine |
| 14 | CG | Int | [0, 2^14-1] | Count of reads on the Crick strand supporting Guanine |
| 15 | CN | Int | [0, 2^6-1] | Count of reads on the Crick strand supporting None |
| 16 | METH | Float | [0,1] or "na" | Methylation level or "Not Available" |
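The 16 columns above can be read into a small record type. The following is a minimal sketch, not part of CGmapTools itself: it assumes the columns are tab-separated (the table does not state the delimiter), and the names `ATCGmapRecord` and `parse_atcgmap_line` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Column names 6-15 of the ATCGmap table, in file order.
ATCGMAP_COUNT_FIELDS = ("WA", "WT", "WC", "WG", "WN",
                        "CA", "CT", "CC", "CG", "CN")


@dataclass
class ATCGmapRecord:
    """One ATCGmap line (hypothetical helper type for illustration)."""
    chrom: str
    nuc: str
    pos: int
    context: str
    dinuc: str
    counts: Dict[str, int]   # per-strand base counts, keyed by field name
    meth: Optional[float]    # None when the file says "na"


def parse_atcgmap_line(line: str) -> ATCGmapRecord:
    # Assumption: fields are separated by tabs.
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 16:
        raise ValueError(f"expected 16 columns, got {len(fields)}")
    counts = {name: int(value)
              for name, value in zip(ATCGMAP_COUNT_FIELDS, fields[5:15])}
    # METH is either a float in [0, 1] or the literal string "na".
    meth = None if fields[15] == "na" else float(fields[15])
    return ATCGmapRecord(fields[0], fields[1], int(fields[2]),
                         fields[3], fields[4], counts, meth)
```

Mapping "na" to `None` rather than to 0.0 keeps sites without a methylation estimate distinguishable from unmethylated sites in downstream code.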
2.2 CGmap Format
When only DNA methylation at cytosines needs to be retained, to save storage, we define another file format called CGmap, which provides the sequence context and estimated DNA methylation level of every covered cytosine on the reference genome.
| Col | Field | Type | Regexp/Range | Brief description |
|---|---|---|---|---|
| 1 | CHR | String | [!-?A-~]{1,118} | Chromosome name |
| 2 | NUC | Char | [ATCGN-] | The nucleotide on reference genome |
| 3 | POS | Int | [0, 2^32-1] | 1-based leftmost mapping position |
| 4 | CONT | String | {"--", "CG", "CHG", "CHH"} | Context |
| 5 | DINUC | String | {"--", "CA", "CT", "CC", "CG"} | Dinucleotide context |
| 6 | METH | Float | [0,1] or "na" | Methylation level or "Not Available" |
| 7 | MC | Int | [0, 2^12-1] | Count of reads supporting methylated Cytosine |
| 8 | NC | Int | [0, 2^12-1] | Count of reads supporting all Cytosine |
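A CGmap line can be parsed in a similar way. This is a rough sketch under two assumptions that the table does not state: that the columns are tab-separated, and that the methylation level corresponds to the ratio MC/NC. The names `CGmapRecord`, `parse_cgmap_line`, and `methylation_ratio` are hypothetical and are not part of CGmapTools.

```python
from typing import NamedTuple, Optional


class CGmapRecord(NamedTuple):
    """One CGmap line (hypothetical helper type for illustration)."""
    chrom: str
    nuc: str
    pos: int
    context: str
    dinuc: str
    meth: Optional[float]  # None when the file says "na"
    mc: int                # reads supporting methylated cytosine
    nc: int                # reads supporting all cytosine


def parse_cgmap_line(line: str) -> CGmapRecord:
    # Assumption: fields are separated by tabs.
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    meth = None if fields[5] == "na" else float(fields[5])
    return CGmapRecord(fields[0], fields[1], int(fields[2]), fields[3],
                       fields[4], meth, int(fields[6]), int(fields[7]))


def methylation_ratio(rec: CGmapRecord) -> Optional[float]:
    # Assumption: the level is MC / NC; undefined when no read covers the site.
    return rec.mc / rec.nc if rec.nc > 0 else None
```

Recomputing the ratio from MC and NC, instead of trusting the stored METH value, can be useful when records from several files are merged and the counts are summed.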