2 New File Format

To facilitate high-throughput data manipulation and reduce storage usage, several file formats have been proposed and generally accepted as standards. Thanks to these efforts (e.g. SAM/BAM and VCF), data analysis and tool development have become easier and more efficient. For bisulfite sequencing data, however, each of the currently available tools uses its own tool-specific data format. As a consequence, integrating results from several tools requires extra effort to unify data formats and develop customized tools, which is time-consuming and error-prone.

As one of the features of CGmapTools, we defined the ATCGmap and CGmap file formats to simplify downstream DNA methylation analysis, in the hope of standardizing the storage format of bisulfite sequencing data.

2.1 ATCGmap Format

After sequencing reads are aligned to the reference genome, all the detailed information about read coverage and the methylation level of each cytosine site is stored in BAM/SAM files, but it requires further interpretation. A well-defined file format called pileup summarizes the mapped reads covering each nucleotide along the reference genome. However, the pileup format was not designed for bisulfite sequencing data and lacks DNA methylation estimates for cytosines.

Here, we define the ATCGmap file format, which integrates the mapping and coverage of both non-cytosine and cytosine sites with the estimated DNA methylation level in a single file.

| Col | Field | Type | Regexp/Range | Brief description |
|---|---|---|---|---|
| 1 | CHR | String | [!-?A-~]{1,118} | Chromosome (reference sequence) name |
| 2 | NUC | Char | [ATCGN-] | Nucleotide on the reference genome |
| 3 | POS | Int | [0, 2^32-1] | 1-based leftmost mapping position |
| 4 | CONT | String | {"--", "CG", "CHG", "CHH"} | Context |
| 5 | DINUC | String | {"--", "CA", "CT", "CC", "CG"} | Dinucleotide context |
| 6 | WA | Int | [0, 2^14-1] | Number of Watson-strand reads supporting adenine |
| 7 | WT | Int | [0, 2^14-1] | Number of Watson-strand reads supporting thymine |
| 8 | WC | Int | [0, 2^14-1] | Number of Watson-strand reads supporting cytosine |
| 9 | WG | Int | [0, 2^14-1] | Number of Watson-strand reads supporting guanine |
| 10 | WN | Int | [0, 2^6-1] | Number of Watson-strand reads supporting N (none of A, T, C, G) |
| 11 | CA | Int | [0, 2^14-1] | Number of Crick-strand reads supporting adenine |
| 12 | CT | Int | [0, 2^14-1] | Number of Crick-strand reads supporting thymine |
| 13 | CC | Int | [0, 2^14-1] | Number of Crick-strand reads supporting cytosine |
| 14 | CG | Int | [0, 2^14-1] | Number of Crick-strand reads supporting guanine |
| 15 | CN | Int | [0, 2^6-1] | Number of Crick-strand reads supporting N (none of A, T, C, G) |
| 16 | METH | Float | [0, 1] or "na" | Methylation level, or "na" (not available) |
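As an illustration of how the 16 columns map onto a record, here is a minimal Python sketch of a line parser. It assumes the file is tab-delimited and that the context placeholder appears as two hyphens, neither of which is stated in the table above. The names `ATCGmapRecord` and `parse_atcgmap_line` are hypothetical and not part of CGmapTools, and the two example lines are made up for illustration.

```python
from typing import Dict, NamedTuple, Optional

# Order of the per-strand count columns (6-10 Watson, 11-15 Crick).
BASES = ("A", "T", "C", "G", "N")


class ATCGmapRecord(NamedTuple):
    chrom: str
    nuc: str
    pos: int
    context: str
    dinuc: str
    watson: Dict[str, int]   # WA, WT, WC, WG, WN keyed by base
    crick: Dict[str, int]    # CA, CT, CC, CG, CN keyed by base
    meth: Optional[float]    # None when the file says "na"


def parse_atcgmap_line(line: str) -> ATCGmapRecord:
    """Parse one ATCGmap line (assumed tab-delimited, 16 columns)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 16:
        raise ValueError(f"expected 16 columns, got {len(fields)}")
    counts = [int(x) for x in fields[5:15]]
    meth = None if fields[15] == "na" else float(fields[15])
    return ATCGmapRecord(
        chrom=fields[0],
        nuc=fields[1],
        pos=int(fields[2]),
        context=fields[3],
        dinuc=fields[4],
        watson=dict(zip(BASES, counts[:5])),
        crick=dict(zip(BASES, counts[5:])),
        meth=meth,
    )


# Hypothetical example lines, not taken from real data.
cytosine = parse_atcgmap_line(
    "chr1\tC\t3000851\tCHH\tCC\t0\t9\t1\t0\t0\t0\t0\t0\t0\t0\t0.1"
)
other = parse_atcgmap_line(
    "chr1\tT\t3000852\t--\t--\t0\t10\t0\t0\t0\t0\t0\t0\t0\t0\tna"
)
print(cytosine.watson["T"], cytosine.meth)
print(other.context, other.meth)
```

Storing the counts in per-strand dictionaries keeps the Watson and Crick columns separate, which is the distinction the format is designed to preserve.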

2.2 CGmap Format

When only the DNA methylation of cytosines needs to be retained, for example to save storage, the CGmap file format can be used instead. It provides the sequence context and estimated DNA methylation level of every covered cytosine on the reference genome.

| Col | Field | Type | Regexp/Range | Brief description |
|---|---|---|---|---|
| 1 | CHR | String | [!-?A-~]{1,118} | Chromosome (reference sequence) name |
| 2 | NUC | Char | [ATCGN-] | Nucleotide on the reference genome |
| 3 | POS | Int | [0, 2^32-1] | 1-based leftmost mapping position |
| 4 | CONT | String | {"--", "CG", "CHG", "CHH"} | Context |
| 5 | DINUC | String | {"--", "CA", "CT", "CC", "CG"} | Dinucleotide context |
| 6 | METH | Float | [0, 1] or "na" | Methylation level, or "na" (not available) |
| 7 | MC | Int | [0, 2^12-1] | Number of reads supporting methylated cytosine |
| 8 | NC | Int | [0, 2^12-1] | Number of reads supporting any cytosine (methylated or unmethylated) |
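The eight CGmap columns can be read in the same way. The sketch below, again in Python and again assuming tab-delimited input, parses CGmap lines and averages METH per sequence context. The helper names (`CGmapRecord`, `parse_cgmap_line`, `count_ratio`, `mean_methylation_by_context`) are hypothetical, the example lines are made up, and `count_ratio` rests on the assumption that METH corresponds to MC divided by NC, which the column descriptions suggest but do not state.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class CGmapRecord(NamedTuple):
    chrom: str
    nuc: str
    pos: int
    context: str
    dinuc: str
    meth: Optional[float]  # None when the file says "na"
    mc: int                # reads supporting methylated cytosine
    nc: int                # reads supporting any cytosine


def parse_cgmap_line(line: str) -> CGmapRecord:
    """Parse one CGmap line (assumed tab-delimited, 8 columns)."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 8:
        raise ValueError(f"expected 8 columns, got {len(f)}")
    meth = None if f[5] == "na" else float(f[5])
    return CGmapRecord(f[0], f[1], int(f[2]), f[3], f[4], meth,
                       int(f[6]), int(f[7]))


def count_ratio(rec: CGmapRecord) -> Optional[float]:
    """MC / NC, assuming this is how METH is derived (an assumption)."""
    return rec.mc / rec.nc if rec.nc > 0 else None


def mean_methylation_by_context(lines: Iterable[str]) -> Dict[str, float]:
    """Average METH over all sites of each context, skipping "na"."""
    sums: Dict[str, float] = {}
    sites: Dict[str, int] = {}
    for line in lines:
        rec = parse_cgmap_line(line)
        if rec.meth is None:
            continue
        sums[rec.context] = sums.get(rec.context, 0.0) + rec.meth
        sites[rec.context] = sites.get(rec.context, 0) + 1
    return {ctx: sums[ctx] / sites[ctx] for ctx in sums}


# Hypothetical example lines, not taken from real data.
example_lines = [
    "chr1\tG\t3000851\tCHH\tCC\t0.1\t1\t10",
    "chr1\tC\t3001624\tCHG\tCA\t0.0\t0\t9",
    "chr1\tC\t3001631\tCG\tCG\t1.0\t5\t5",
    "chr1\tG\t3001632\tCG\tCG\t0.5\t2\t4",
]
print(mean_methylation_by_context(example_lines))
```

This averages the per-site levels without weighting by coverage. A coverage-weighted alternative would sum MC and NC per context and divide the totals.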