Defective folate metabolism causes germline epigenetic instability and distinguishes Hira as a phenotype inheritance biomarker

Georgina E.T. Blake^1,2,†, Xiaohui Zhao^1,2, Hong wa Yung^1,2, Graham J. Burton^1,2, Anne C.Ferguson-Smith^2,3, Russell S. Hamilton^2,3 and Erica D. Watson^1,2*
¹ Department of Physiology, Development and Neuroscience, University of Cambridge, Cambridge, UK
² Centre for Trophoblast Research, University of Cambridge, Cambridge UK
³ Department of Genetics, University of Cambridge, Cambridge, UK
^† Current address: College of Medicine and Health, University of Exeter Medical School, Exeter, UK
^* corresponding author: [email protected]
Nature Communications volume 12, Article number: 3714 (2021)
https://www.nature.com/articles/s41467-021-24036-5

Publication

Code Release to accompany paper published in Nature Communication:
Github Code DOI:

Whole Genome Sequencing Analysis (WGS)

Samples: C57BI/6J (Control-1, Control-2), Mtrr^{gt/gt} (800-1, 800-2, 800-3, 800-4, 800-5 and 800-6)
SNPs (single nucleotide polymorphisms) and SVs (structure variants) calling for C57BI/6J control embryos and Mtrr^{gt/gt} embryos

Sequencing Data Quality Control (FastQC, v0.11.5), Adapter trimming (Trim_galore, v0.6.4) and report summary (MultiQC, v.1.4);

Sequencing Aligned to the C57Bl/6J reference genome(GRCm38, mm10) using Bowtie2(v2.3.4) with default parameters;

Alignments processed using ClusterFlow [GitHub] [DOI]

Example command line as used in clusterflow:

READ1="160326_X169_FCHMVV5CCXX_L4_8521603000353_1_val_1.fq.gz"
READ2="160326_X169_FCHMVV5CCXX_L4_8521603000353_2_val_2.fq.gz"

bowtie2 -p 4 -t --phred33-quals \
        -x Genomes/Mus_musculus/GRCm38/GRCm38 \
        -1 ${READ1} -2 ${READ2} | \
        samtools view -bS - > ${READ1/.fq.gz/}_bowtie2.bam

Duplicates were marked using Picard tools (v2.24.0);

READ1="160326_X169_FCHMVV5CCXX_L4_8521603000353_1_val_1.fq.gz"
READ2="160326_X169_FCHMVV5CCXX_L4_8521603000353_2_val_2.fq.gz"

java -Xmx2g -jar picard-tools-2.2.4/picard.jar MarkDuplicates \
     INPUT=${READ1/.fq.gz/}_bowtie2.bam \
     OUTPUT=${READ1/.fq.gz/}_bowtie2.bam_MarkDups.bam \
     ASSUME_SORTED=false \
     REMOVE_DUPLICATES=false \
     METRICS_FILE=1${READ1/.fq.gz/}_bowtie2.bam_picardDupMetrics.txt \
    VALIDATION_STRINGENCY=LENIENT

Mtrr region masking;

Make a bed file for the 20Mb masked region

cat "13     58060780        80060780" > chr13-mtrr-mask-region_20Mb.bed

Use samtools to mask region

PROJECT="CTR_edw23_0001"
BEDFILE="chr13-mtrr-mask-region_20Mb.bed"
BAMSTEM="srtd_mrgd_srtd_dedup_chr13-mtrr-20Mb-masked"

declare -a samples=("Control-1" "Control-2" "800-1" "800-2" "800-3" "800-4" "800-5" "800-6")

for i in "${samples[@]}"
    do
      echo "sample = $i  "
      cd ${i}
      samtools view ${PROJECT}_${i}.srtd_mrgd_srtd_dedup.bam \
               -b -h \
               -o ${PROJECT}_${i}.${BAMSTEM}.bam \
               -U ${PROJECT}_${i}.${BAMSTEM}.bam \
               -L ${BEDFILE}
    done

Run Manta Structural Variant Caller

Manta (v0.29.6) [GitHub] [DOI]

Set up some variables for running Manta

PROJECT="CTR_edw23_0001"
GENOME="Genomes/Mus_musculus/GRCm38/Mus_musculus.GRCm38.dna.chromosome.all.fasta"
BAMSTEM="srtd_mrgd_srtd_dedup_chr13-mtrr-20Mb-masked"
MANTAWORKDIR="MantaWorkflow_20Mb_Masked"

declare -a samples=("Control-1" "Control-2" "800-1" "800-2" "800-3" "800-4" "800-5" "800-6")

Run Manta Prepare

for i in "${samples[@]}"
  do
    echo "sample = $i  "
    mkdir -p ${i}/${MANTAWORKDIR}
    configManta.py --bam ${PROJECT}_${SAMPLE}.${BAMSTEM}.bam --referenceFasta ${GENOME} --runDir ${i}/${MANTAWORKDIR}
 done

Run Manta workflow

for i in "${samples[@]}"
  do
    echo "sample = $i  "
    cd ${i}/${MANTAWORKDIR}
    runWorkflow.py --mode=local --jobs=10 --memGb=25
  done

Remapped to the mm10 reference mouse genome using BWA (v480 0.7.15-r1144- dirty);

Alignments processed using ClusterFlow [GitHub] [DOI]

Example command line as used in clusterflow:

READ1="160326_X169_FCHMVV5CCXX_L4_8521603000353_1_val_1.fq.gz"
READ2="160326_X169_FCHMVV5CCXX_L4_8521603000353_2_val_2.fq.gz"

bwa mem -t 12 \
        Genomes/Mus_musculus/GRCm38/Mus_musculus.GRCm38.dna.chromosome.all.fa.gz \
        ${READ1} ${READ2} |\
        samtools view -bS - > ${READ1/.fq.gz/}_GRCm38_bwa.bam

Duplicates were marked using Picard tools (v2.24.0)

java -Xmx2g -jar picard-tools-2.2.4/picard.jar MarkDuplicates \
     INPUT=${READ1/.fq.gz/}_GRCm38_bwa.bam \
     OUTPUT=${READ1/.fq.gz/}_GRCm38_bwa.bam_MarkDups.bam \
     ASSUME_SORTED=false REMOVE_DUPLICATES=false \
     METRICS_FILE=1${READ1/.fq.gz/}_GRCm38_bwa.bam_GRCm38_bwa_srtd.bam_picardDupMetrics.txt \
     VALIDATION_STRINGENCY=LENIENT

Reads were locally realigned and SNPs and short indels identified using GenomeAnalysisTK (GATK, v3.7)

Set up some variables

REFERENCE_FASTA="Genomes/Mus_musculus/GRCm38/Mus_musculus.GRCm38.dna.chromosome.all.fasta"
KNOWNSITES_1="KNOWNSNPs/dbSNP_vcf_chr_ALL.vcf.gz"
KNOWNSITES_2="KNOWNSNPs/129P2_OlaHsd.mgp.v5.snps.dbSNP142.vcf.gz"
GATKLOCN="GenomeAnalysisTK-3.8-0"
PICARDLOCN="picard-tools-2.2.4/"
TMPDIR=$(pwd)

Recalibrate Base Quality Scores

Example BAM file

INPUTBAM="Control-1.GRCm38_bwa_srtd.bam_MarkDups.RG.bam"
INPUTBAMPRERG=${INPUTBAM/MarkDups.RG.bam/MarkDups.bam}

SAMPLE=${INPUTBAM/.GRCm38_bwa_srtd.bam_MarkDups.bam/}
ID="ID:"${SAMPLE}
SM="SM:"${SAMPLE}
LB="LB:"${SAMPLE}

AddOrReplaceReadGroups

java -Djava.io.tmpdir=${TMPDIR} -Xmx5g -jar ${PICARDLOCN}/picard.jar AddOrReplaceReadGroups \
     I=${INPUTBAMPRERG} \
     O=${INPUTBAM} \
     RGID=${SAMPLE} \
     RGLB=${SAMPLE} \
     RGPL=illumina \
     RGPU=${SAMPLE} \
     RGSM=${SAMPLE}

Index BAM file

java -jar ${PICARDLOCN}/picard.jar BuildBamIndex \
     I=${INPUTBAM} \
     TMPDIR=${TMPDIR}

Analyze patterns of covariation in the sequence dataset

java -Djava.io.tmpdir=${TMPDIR} -jar ${GATKLOCN}/GenomeAnalysisTK.jar \
     -T BaseRecalibrator \
     -nct 8 \
     -R ${REFERENCE_FASTA} \
     -I ${INPUTBAM} \
     -knownSites ${KNOWNSITES_1} \
     -knownSites ${KNOWNSITES_2} \
     -o ${INPUTBAM/MarkDups.RG.bam/MarkDups.RG.recal_data.table}

Do a second pass to analyze covariation remaining after recalibration

java -Djava.io.tmpdir=${TMPDIR} -jar ${GATKLOCN}/GenomeAnalysisTK.jar \
     -T BaseRecalibrator \
     -nct 8 \
     -R ${REFERENCE_FASTA} \
     -I ${INPUTBAM} \
     -knownSites ${KNOWNSITES_1} \
     -knownSites ${KNOWNSITES_2} \
     -BQSR ${INPUTBAM/MarkDups.RG.bam/MarkDups.RG.recal_data.table} \
     -o ${INPUTBAM/MarkDups.RG.bam/MarkDups.RG.post_recal_data.table}

Generate before/after plots

java -Djava.io.tmpdir=${TMPDIR} -jar ${GATKLOCN}/GenomeAnalysisTK.jar \
     -T AnalyzeCovariates \
     -l DEBUG \
     -R ${REFERENCE_FASTA} \
     -before ${INPUTBAM/MarkDups.RG.bam/MarkDups.RG.recal_data.table} \
     -after ${INPUTBAM/MarkDups.RG.bam/MarkDups.RG.post_recal_data.table} \
     -plots ${INPUTBAM/MarkDups.RG.bam/MarkDups.RG.recalibration_plots.pdf}

Apply the recalibration to your sequence data

java -Djava.io.tmpdir=${TMPDIR} -jar ${GATKLOCN}/GenomeAnalysisTK.jar \
     -T PrintReads \
     -nct 8 \
     -R ${REFERENCE_FASTA} \
     -I ${INPUTBAM} \
     -BQSR ${INPUTBAM/MarkDups.RG.bam/MarkDups.RG.recal_data.table} \
     -o ${INPUTBAM/MarkDups.RG.bam/MarkDups.RG.recal.bam}

Call Variants

INPUTBAM="Control-1.GRCm38_bwa_srtd.bam_MarkDups.RG.recal.bam"

java -Djava.io.tmpdir=${TMPDIR} -jar ${GATKLOCN}/GenomeAnalysisTK.jar \
     -T HaplotypeCaller \
     -R ${REFERENCE_FASTA} \
     -I ${INPUTBAM} \
     --variant_index_type LINEAR \
     --variant_index_parameter 128000 \
     -ERC GVCF \
     -o ${INPUTBAM/recal.bam/recal.raw_variants.vcf}

Joint Genotyping across all samples

java -Djava.io.tmpdir=${TMPDIR} -jar ${GATKLOCN}/GenomeAnalysisTK.jar \
     -T GenotypeGVCFs \
     -R ${REFERENCE_FASTA} \
     --variant 800-1/800-1.GRCm38_bwa_srtd.bam_MarkDups.RG.recal.raw_variants.vcf \
     --variant 800-2/800-2.GRCm38_bwa_srtd.bam_MarkDups.RG.recal.raw_variants.vcf \
     --variant 800-3/800-3.GRCm38_bwa_srtd.bam_MarkDups.RG.recal.raw_variants.vcf \
     --variant 800-4/800-4.GRCm38_bwa_srtd.bam_MarkDups.RG.recal.raw_variants.vcf \
     --variant 800-5/800-5.GRCm38_bwa_srtd.bam_MarkDups.RG.recal.raw_variants.vcf \
     --variant 800-6/800-6.GRCm38_bwa_srtd.bam_MarkDups.RG.recal.raw_variants.vcf \
     --variant Control-1/Control-1.GRCm38_bwa_srtd.bam_MarkDups.RG.recal.raw_variants.vcf \
     --variant Control-2/Control-2.GRCm38_bwa_srtd.bam_MarkDups.RG.recal.raw_variants.vcf \
     -o joint_genotyping_output.vcf

SNPs filter using vcftools (vcf-annotate), Merge_Filter_Snp_Annotation.sh

   - [x] 1-StrandBias [0.0001] ---Min P-value for strand bias
   - [x] 2-BaseQualBias [0.0002] ---Min P-value for BaseQ bias
   - [x] 3-MapQualBias [0.00001] ---Min P-value for mapQ bias
   - [x] 4-EndDistBias [0.0001] ---Min P-value for end distance bias
   - [x] a-MinAB [6] ---Min number of alternate bases
   - [x] d-MinDP [14] ---Min read depth
   - [x] D-MaxDP [100] ---Max read depth
   - [x] q-MinMQ [25] ---Min RMS(Root Mean Square) mapping quality for SNPs
   - [x] Q-Qual [40] ---Min value of the QUAL field
   - [x] W-GapWin [3] ---Window size for filtering adjacent gaps
   - [x] w-SnpGap [5] ---SNP within 5bp around a gap to be filtered
   - [x] H-HWE [0.0001] ---Min P-value for HWE(Hardy–Weinberg equilibrium) and (F\<0)
   - [x] v-VDB [0] ---Min Variant Distance bias

Extra filtering the repeatitive and defind Heterozygous SNPs. R script Exclued_SNPs_Homo_Dinu_SegDup_Het.R.

  Reference: Oey, H., Isbel, L., Hickey, P., Ebaid, B. & Whitelaw, E. Genetic and epigenetic variation among inbred mouse littermates: identification of inter-individual differentially methylated regions. Epigenetics Chromatin 8, 54 (2015). <br>
   - [x] Simple repeats periodicity < 9bp;
   - [x] Homopolymer repeats >8 bp;
   - [x] dinucleotide repeats >14 bp;
   - [x] low mapping quality (<40);
   - [x] Heterozygous variant call counts > 3 within 10kb window, overlap with segment duplication or repeats;

Corresponding Figures (Figure 1(A), Supplementary Figure 3 (C), (D)) generation Code is

R script SV_SNPs_Fig1A_SFig3C_3D.R

MeDIP Data Analysis

Make genome interval files

make_genome_intervals.pl creates intervals of set sizes for a genome of interest. Requires a fai genome index file for the genome created with e.g. samtools faidx genome.fasta. Output is a BED format file

perl make_genome_intervals.pl 5000 \
Mus_musculus.GRCm38.dna.chromosome.all.fa.fai  | \
sort -k 1,1n -k2,2n > output.bed

Make Bedtools Coverage Histogram

makeBedtoolsCoverageHist.sh takes the sample bam (ending in .srtd.bam) files in a directory and runs bedtools coverage for the defined genomic interval files (hard coded, edit file to change)

./makeBedtoolsCoverageHist.sh

The resulting bed file is the output from bedtools coverage with the following columns"

The number of features in the bam file that overlapped (by at least one base pair) the genome intervals.
The number of bases in genome intervals that had non-zero coverage from features in bam file.
The length of the entry in genome intervals.
The fraction of bases in genome intervals that had non-zero coverage from features in the bam file.

Combine Interval Tables

combineTables.pl and combineTables_command.sh take all the individual sample coverage bed files e.g. C57_36.chr.srtd.25Kb.bed in a directory and makes a single combined table. The interval size in Kb is given on the command line.

 perl combineTables.pl 25 > combineTables_command.sh

The above command produces a bash script of the normalised coverage per genomic window doe each of the samples

 ./combineTables_command.sh

Running the shell script produces the final table for analysis in R

Output is a list of commands to be run on the command line. This step is only required if the samples to be analysed changes. i.e. samples removed or added.

A pre-made version of the commands available in combineTables_command.sh

 ./combineTables_command.sh

Output is a table CTR_edw23_0003_MeDIPS_Region_Coverage_Combo_${WindowSize}Kb.table

Create PCA Plots

CTR_edw23_0003_MeDIPS.R script for calculating PCA plots for genomic interval coverage

Template Script

GB_Template.R Template R script for performing MEDIPs analysis

Convert BAM Chr labels

convertBAM_chr.sh a simple script to convert chromosome nomenclature in BAM files from 1 to chr1. Utilises samtools view and reheader

ChIP-Seq and ATAC-Seq Analysis

Step1: Data availability(Supplementary files 2), relating published reference:

Mouse: Epiblast (Epi) and extraembryonic ectoderm (ExE) from E6.5, ChIPSeq (GSE84236, mm9, THSS)

Zachary D. smith, Jiantao shi, Hongcang Gu, Julie Donaghey, Kendell Clement, Davide Cacchiarelli, Andreas Gnirke, Franziska michor & Alexander meissner [[DOI](doi: 10.1038/nature23891)]

Mouse: Sperm ChIPSeq (GSE79230, mm9)

Jung YH, Sauria MEG, Lyu X, Cheema MS, Ausio J, Taylor J, Corces VG.[DOI](doi: 10.1016/j.celrep.2017.01.034.)]

Mouse: trophoblast stem cells (TSCs) and embryonic stem cells (ESCs) ChIPSeq (Supplementary Files 2, mm10)

Stefan Schoenfelder, Borbala Mifsud, Claire E. Senner, Christopher D. Todd,Stephanie Chrysanthou, Elodie Darbo, Myriam Hemberger & Miguel R. Branco[[DOI](DOI: 10.1038/s41467-018-06666-4)]

Step2: Remove 20Mb region surrounding the Mtrr gene (Chr13:58060780-80060780)

DMRs total number reduced from 893 to 459, (Mtrr^{+/gt}, Mtrr^{+/+} and Mtrr^{gt/gt})

Table: Number of summary DMRs

Genotype	fileName	no. of DMRs
Mtrr^{+/gt}	merge_C57_Vs_+gt_0.01_tab1_dmr_xy_29418_allcol.bed	203
Mtrr^{+/+}	merge_C57_Vs_++_0.01_xy_6518_tab1_dmr_allcol.bed	91
Mtrr^{gt/gt}	merge_C57_Vs_gtgt_0.01_xy_allcol.bed	599
Mtrr.remove	AllDMRs_maskMtrr20Mb_N459.bed	459

Step 3: hgLiftOver the mm9 ChIP/ATAC bw file to mm10, and clip the extra bps for MACS output, then compute profiles.

Script	fileName
bash	BedClip_MACS_overlap.sh
bash	HgLiftOver_RandomDMRs_ProfilePlot.sh

Profiles Plots (see Supplementary Fig 7)

Step 4: Selected candidates genome viewer using IGV(v2.5.2).

Gene	PNG
Hira	Hira_refined_IGV_snapshot.png
Cwc27	Cwc27_refined_IGV_snapshot.png
Tshz3	Tshz3_refined_IGV_snapshot.png

Data Array Express Link

DataType	Accession	Link
Mtrr MeDIP Data	E-MTAB-8533	E-MTAB-8533
Mtrr WGS Data	E-MTAB-8513	E-MTAB-8513

Contact

Contact Xiaohui Zhao (xz289 -at- cam.ac.uk) and Russell S. Hamilton (rsh46 -at- cam.ac.uk)

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
DMR_Data		DMR_Data
Figures		Figures
MeDIPs		MeDIPs
Scripts		Scripts
UCSC_Data		UCSC_Data
LICENSE		LICENSE
README.md		README.md

License

CTR-BFX/Blake_Watson

Folders and files

Latest commit

History

Repository files navigation