2 New File Format

To facilitate high-throughput data manipulation and reduce storage usage, several file formats have been proposed and generally accepted as standards. Thanks to these efforts (e.g. SAM/BAM and VCF), data analysis and tool development have become easier and more efficient. For bisulfite sequencing data, however, currently available tools each use their own tool-specific data format. Consequently, integrating results from several tools requires extra effort to unify data formats and develop customized tools, which is time-consuming and error-prone.

As one of the features of CGmapTools, we defined the ATCGmap and CGmap file formats to simplify downstream DNA methylation analysis, in the hope of standardizing the storage format of bisulfite sequencing data.

2.1 ATCGmap Format

After sequencing reads are aligned to the reference genome, all the detailed information about read coverage and the methylation level of each cytosine site is stored in BAM/SAM files, but it requires further interpretation. A well-defined file format called pileup summarizes the mapped reads covering each nucleotide along the reference genome. However, the pileup format was not designed for bisulfite sequencing data and lacks DNA methylation estimates for cytosines.

Here, we defined the ATCGmap file format to integrate the mapping and coverage information of both non-cytosine and cytosine sites with the estimated DNA methylation level in a single file.

Col	Field	Type	Regexp/Range	Brief description
1	CHR	String	[!-?A-~]{1,118}	Chromosome name
2	NUC	Char	[ATCGN-]	The nucleotide on reference genome
3	POS	Int	[0,2³²-1]	1-based leftmost mapping position
4	CONT	String	{“--”, “CG”, “CHG”, “CHH”}	Context
5	DINUC	String	{“--”, “CA”, “CT”, “CC”, “CG”}	Dinucleotide context
6	WA	Int	[0,2¹⁴-1]	Count of reads on the Watson strand supporting adenine
7	WT	Int	[0,2¹⁴-1]	Count of reads on the Watson strand supporting thymine
8	WC	Int	[0,2¹⁴-1]	Count of reads on the Watson strand supporting cytosine
9	WG	Int	[0,2¹⁴-1]	Count of reads on the Watson strand supporting guanine
10	WN	Int	[0,2⁶-1]	Count of reads on the Watson strand supporting None
11	CA	Int	[0,2¹⁴-1]	Count of reads on the Crick strand supporting adenine
12	CT	Int	[0,2¹⁴-1]	Count of reads on the Crick strand supporting thymine
13	CC	Int	[0,2¹⁴-1]	Count of reads on the Crick strand supporting cytosine
14	CG	Int	[0,2¹⁴-1]	Count of reads on the Crick strand supporting guanine
15	CN	Int	[0,2⁶-1]	Count of reads on the Crick strand supporting None
16	METH	Float	[0,1] or “na”	Methylation level or “Not Available”
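
The 16-column layout above can be read with a few lines of code. The following Python sketch is illustrative only and is not part of CGmapTools: it assumes the fields are tab-delimited and appear in the column order of the table, and the names ATCGmapRecord and parse_atcgmap_line are hypothetical.

```python
from typing import Dict, NamedTuple, Optional

# Order of the per-strand count columns (WA..WN and CA..CN in the table).
BASES = ("A", "T", "C", "G", "N")


class ATCGmapRecord(NamedTuple):
    """One ATCGmap line (hypothetical container, not a CGmapTools type)."""
    chrom: str
    nuc: str
    pos: int
    context: str
    dinuc: str
    watson: Dict[str, int]   # read counts on the Watson strand, keyed by base
    crick: Dict[str, int]    # read counts on the Crick strand, keyed by base
    meth: Optional[float]    # None when the METH column is "na"


def parse_atcgmap_line(line: str) -> ATCGmapRecord:
    """Parse one line, assuming tab-delimited fields in table order."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 16:
        raise ValueError(f"expected 16 columns, got {len(fields)}")
    chrom, nuc, pos, context, dinuc = fields[:5]
    watson = dict(zip(BASES, (int(x) for x in fields[5:10])))
    crick = dict(zip(BASES, (int(x) for x in fields[10:15])))
    meth = None if fields[15] == "na" else float(fields[15])
    return ATCGmapRecord(chrom, nuc, int(pos), context, dinuc,
                         watson, crick, meth)


# Made-up example line, for illustration only.
example = "chr1\tC\t3009411\tCHH\tCC\t0\t10\t0\t0\t0\t0\t0\t0\t0\t0\t0.0"
record = parse_atcgmap_line(example)
print(record.chrom, record.pos, record.watson["T"], record.meth)
```

Keeping the Watson and Crick counts in separate dictionaries mirrors the strand-specific columns of the format, and mapping “na” to None lets downstream code tell a missing estimate apart from a methylation level of zero.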

2.2 CGmap Format

For cases where only the DNA methylation of cytosines needs to be retained, we defined another file format called CGmap to save storage. It provides the sequence context and estimated DNA methylation level of every covered cytosine on the reference genome.

Col	Field	Type	Regexp/Range	Brief description
1	CHR	String	[!-?A-~]{1,118}	Chromosome name
2	NUC	Char	[ATCGN-]	The nucleotide on reference genome
3	POS	Int	[0,2³²-1]	1-based leftmost mapping position
4	CONT	String	{“--”, “CG”, “CHG”, “CHH”}	Context
5	DINUC	String	{“--”, “CA”, “CT”, “CC”, “CG”}	Dinucleotide context
6	METH	Float	[0,1] or “na”	Methylation level or “Not Available”
7	MC	Int	[0,2¹²-1]	Count of reads supporting methylated cytosine
8	NC	Int	[0,2¹²-1]	Count of all reads supporting cytosine (methylated and unmethylated)
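
As a rough illustration of how the eight CGmap columns fit together, the Python sketch below parses one line and recomputes the methylation level from the two read counts. It is not CGmapTools code: the tab-delimited layout, the relation METH = MC / NC, and the names CGmapRecord, parse_cgmap_line and meth_from_counts are assumptions made for this example.

```python
from typing import NamedTuple, Optional


class CGmapRecord(NamedTuple):
    """One CGmap line (hypothetical container, not a CGmapTools type)."""
    chrom: str
    nuc: str
    pos: int
    context: str
    dinuc: str
    meth: Optional[float]  # None when the METH column is "na"
    mc: int                # reads supporting methylated cytosine
    nc: int                # all reads supporting cytosine


def parse_cgmap_line(line: str) -> CGmapRecord:
    """Parse one line, assuming tab-delimited fields in table order."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    chrom, nuc, pos, context, dinuc, meth, mc, nc = fields
    meth_value = None if meth == "na" else float(meth)
    return CGmapRecord(chrom, nuc, int(pos), context, dinuc,
                       meth_value, int(mc), int(nc))


def meth_from_counts(record: CGmapRecord) -> Optional[float]:
    """Recompute the level as MC / NC (assumed relation); None if uncovered."""
    if record.nc == 0:
        return None
    return record.mc / record.nc


# Made-up example line, for illustration only.
example = "chr1\tG\t3000851\tCHH\tCC\t0.1\t1\t10"
record = parse_cgmap_line(example)
print(record.meth, meth_from_counts(record))
```

Recomputing the level from MC and NC is a cheap sanity check when merging files produced by different tools, provided the assumed relation holds for the data at hand.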