
Best Practices for Data Storage

Dan Spiegelman edited this page Nov 14, 2023 · 15 revisions

Primary Data: Raw

This should be exclusively the primary "big data" generated by sequencers, genotypers, light-sheet, EM, etc.
In other words, this data cannot be regenerated without repeating the initial experiment.
 
This data, and ONLY this data, will be backed up in duplicate copies following our backup policy.
 
NOTE: this platform is not intended to support the very large amounts of fundamental imaging (human MRI/PET) that Neuro PIs currently store on the BIC server.
 
If you wish this data to be backed up, its parent directory must be named "raw.data" (lowercase) and must be located within your group's main data folder, i.e. the one named [PIname]Lab, with a folder-and-arrow icon.
 
Within this "raw.data" folder, your data may have any structure you like. However, it is good practice to make the directory structure as informative as possible. For example, this might include:

  • Having sub-folders by data type, e.g. genotyping, dnaseq, bulk RNAseq, single-cell RNAseq, ATAC-seq, light-sheet, EM, microscopy, etc.
  • Having sub-folders by project name  

Common raw data file types include fastq/bam/cram (for sequence data) and jpg/idat/txt (for genotyping data).
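
As an illustration of the layout described above, here is a minimal Python sketch. It assumes your group folder is available as a local path (for example through a synced copy); the folder name "ExampleLab", the project names, and the helper functions `make_raw_layout` and `is_in_raw_data` are hypothetical and not part of the platform.

```python
from pathlib import Path

# The backup policy looks for this exact, lowercase folder name.
RAW_DIR_NAME = "raw.data"


def make_raw_layout(lab_dir, data_types, projects):
    """Create raw.data/<data_type>/<project> folders under the lab folder."""
    raw_root = Path(lab_dir) / RAW_DIR_NAME
    created = []
    for data_type in data_types:
        for project in projects:
            target = raw_root / data_type / project
            target.mkdir(parents=True, exist_ok=True)
            created.append(target)
    return created


def is_in_raw_data(path, lab_dir):
    """Return True if path sits inside <lab_dir>/raw.data (case-sensitive)."""
    try:
        relative = Path(path).relative_to(lab_dir)
    except ValueError:
        # path is not inside the lab folder at all
        return False
    return len(relative.parts) > 1 and relative.parts[0] == RAW_DIR_NAME
```

For example, `make_raw_layout("ExampleLab", ["dnaseq", "EM"], ["projectA"])` would create `ExampleLab/raw.data/dnaseq/projectA` and `ExampleLab/raw.data/EM/projectA`. The check function is only a local convenience for spotting files that fall outside "raw.data"; it does not query the backup system.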

Secondary and Tertiary Data: Analyzed

These are the data generated from computationally-intensive analyses of your primary data.
Users should apply discretion when deciding whether or not to store non-primary data on the Neuro_Bioinfo_Core storage platform.
 
Reasons to store non-primary data include:

  • It is the product of very large amounts of computation and would be very costly/time-consuming to replace (e.g. a joint-called vcf for a large cohort)
  • It is a valuable project deliverable in itself
  • It is a common input for many downstream analyses (e.g. RNAseq gene count matrices)
  • You are required to keep it by university policy/publication requirement/grant mandate etc.

We strongly discourage users from simply stockpiling all the files generated by their analyses with no curation.
Practices to avoid include:

  • Preserving intermediary files, i.e. those used only to generate final outputs, e.g.:
    • Alignment bam files produced by post-processing steps in dnaseq/rnaseq analyses
    • Single-sample dnaseq variant vcf files
    • Genotype files in various stages of filtration
  • Preserving files which are easy/quick/inexpensive to regenerate
  • Preserving extremely verbose analysis logs, core dumps, or files of unknown use/provenance
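
One way to start curating an analysis folder is to list likely clean-up candidates before deciding what to keep. The sketch below is only a starting point under stated assumptions: the function name `find_cleanup_candidates`, the "core" / "core.*" naming convention for core dumps, the ".log" extension, and the 100 MB default threshold are all illustrative choices, not platform rules.

```python
from pathlib import Path


def find_cleanup_candidates(root, min_log_bytes=100 * 1024 * 1024):
    """List core dumps and large .log files under root (nothing is deleted)."""
    candidates = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        # Assumed convention: core dumps are named "core" or "core.<pid>".
        is_core_dump = path.name == "core" or path.name.startswith("core.")
        # Assumed convention: verbose logs end in ".log" and exceed the threshold.
        is_big_log = path.suffix == ".log" and path.stat().st_size >= min_log_bytes
        if is_core_dump or is_big_log:
            candidates.append(path)
    return sorted(candidates)
```

The function only reports paths; whether a file is truly intermediary or cheap to regenerate remains a judgment call for the person who knows the analysis.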

Personal or Non-Data files

  • These should ideally be stored outside of your group's main data folder, i.e. not in the one named [PIname]Lab, with a folder-and-arrow icon.
  • You can create any number of private folders in Nextcloud. You can name and organize them any way you like.

Common Considerations

  • Data compression should be used whenever possible to save space. In particular, large plain-text formats such as fastq should be compressed with a utility like gzip.
  • Large collections of small files should be packaged whenever possible into archive formats such as tar/dar or zip.
  • Data stored on this platform are not intended to be accessed frequently. Consider using a resource like the Digital Research Alliance of Canada (DRAC) server /project space for files that your group will re-use often.
  • Sharing files with external users via this platform is possible, but not optimal for large datasets. If you wish to share large amounts of data with a user outside this platform, please contact [email protected] for assistance.
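
The compression and packaging suggestions in the list above can be sketched with the Python standard library alone. This is a minimal example, assuming the files are reachable as local paths; the helper names `gzip_file` and `tar_directory` are hypothetical, and command-line gzip and tar achieve the same result.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def gzip_file(src):
    """Write a gzip-compressed copy of src next to it, e.g. x.fastq -> x.fastq.gz."""
    src = Path(src)
    dst = src.with_name(src.name + ".gz")
    # Stream in chunks so large files are never loaded fully into memory.
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def tar_directory(src_dir):
    """Package a directory of many small files into a single <name>.tar.gz archive."""
    src_dir = Path(src_dir)
    dst = src_dir.with_name(src_dir.name + ".tar.gz")
    with tarfile.open(dst, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the directory name.
        tar.add(src_dir, arcname=src_dir.name)
    return dst
```

Neither helper removes the original files, so verify the compressed copy or archive before deleting anything.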

Additional Resources

Digital Research Alliance of Canada - Data Management Best Practices
Digital Research Alliance of Canada - Building Documentation and Metadata