
export_pseudobulk never generates pseudobulk files #117

Open
SITRR opened this issue Mar 21, 2024 · 14 comments


SITRR commented Mar 21, 2024

Hi!

I'm having an issue with the export_pseudobulk function, which has been reported previously (#41). The function runs but doesn't output any files, and the text output never states "Done!" after "Creating pseudobulk for 75G" (75G being the name of one of my samples).

I've generated a cell_data data frame with three columns: 'barcode', which uses the same barcode format as the fragment files (e.g. 'TTTGTGTTCTTGTCGC-1'; I got these from the CellRanger output singlecell.csv file, let me know if that's not advisable); a second column, 'sample_id', with the name of the sample and fragments_dict key (e.g. 75G, repeated for the length of the 'barcode' column); and a third column, 'donor', identical to the 'sample_id' column. I use the 'donor' column as the input for 'variable' in the export_pseudobulk function (see below).

Should I be extracting the barcodes from the fragments file directly? I'm new to this and don't know how to do so in Python.
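For reference, barcodes can be read straight from a fragments file with pandas. Below is a minimal sketch against a synthetic file; the chrom/start/end/barcode/count layout matches CellRanger's fragments format, and `comment='#'` skips the header lines CellRanger writes at the top of the file. The file contents here are made up for illustration.

```python
import gzip
import os
import tempfile

import pandas as pd

# Synthetic stand-in for a CellRanger fragments.tsv.gz file:
# '#'-prefixed header lines, then chrom/start/end/barcode/count rows.
tmp_dir = tempfile.mkdtemp()
frag_path = os.path.join(tmp_dir, "fragments.tsv.gz")
with gzip.open(frag_path, "wt") as f:
    f.write("# id=Sample_75G_ATAC\n")
    f.write("chr1\t100\t250\tTTTGTGTTCTTGTCGC-1\t2\n")
    f.write("chr1\t300\t500\tAAACGAAAGCGCAATG-1\t1\n")
    f.write("chr2\t50\t90\tTTTGTGTTCTTGTCGC-1\t3\n")

# comment='#' makes pandas skip the header lines; without it the
# parser fails as soon as it hits the first 5-column data row.
fragments = pd.read_table(
    frag_path,
    comment="#",
    header=None,
    names=["Chromosome", "Start", "End", "Barcode", "Count"],
)
barcodes = fragments["Barcode"].unique()
print(sorted(barcodes))
```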

This is the cell_data dataframe:

[Screenshot: the cell_data dataframe]

This is the fragment file unzipped:

[Screenshot: the unzipped fragment file]

This is the code I'm running:

fragments_dict = {
    '75G': '/sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_75G_ATAC/outs/fragments.tsv.gz',
    '8561G': '/sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_8561G_ATAC/outs/fragments.tsv.gz'}
path_to_regions = {
    '75G': '/sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_75G_ATAC/outs/filtered_peak_bc_matrix/peaks.bed',
    '8561G': '/sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_8561G_ATAC/outs/filtered_peak_bc_matrix/peaks.bed'}
path_to_blacklist= '/sc/arion/projects/DTR-EFGR/GBMATAC_SIR/Analysis/CopyscAT/hg38-blacklist.v2.bed'

import pandas as pd
df_75G = pd.read_csv(work_dir + 'scATAC_consensus_peakset/quality_control/barcodes/barcodes_75G.csv', header = 0, sep = ',', index_col = 0)
df_8561G = pd.read_csv(work_dir + 'scATAC_consensus_peakset/quality_control/barcodes/barcodes_8561G.csv', header = 0, sep = ',', index_col = 0)
print(df_75G.head())
print(df_8561G.head())

frames = [df_75G, df_8561G]
cell_data = pd.concat(frames) 
print(cell_data.head())
print(cell_data.tail())

import pyranges as pr
import requests
import pandas as pd
target_url='http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes'
chromsizes=pd.read_csv(target_url, sep='\t', header=None)
chromsizes.columns=['Chromosome', 'End']
chromsizes['Start']=[0]*chromsizes.shape[0]
chromsizes=chromsizes.loc[:,['Chromosome', 'Start', 'End']]
chromsizes['Chromosome'] = [chromsizes['Chromosome'][x].replace('v', '.') for x in range(len(chromsizes['Chromosome']))]
chromsizes['Chromosome'] = [chromsizes['Chromosome'][x].split('_')[1] if len(chromsizes['Chromosome'][x].split('_')) > 1 else chromsizes['Chromosome'][x] for x in range(len(chromsizes['Chromosome']))]
chromsizes=pr.PyRanges(chromsizes)

from pycisTopic.pseudobulk_peak_calling import export_pseudobulk
import os
import ray
ray.shutdown()
#sys.stderr = open(os.devnull, "w")  # silence stderr
bw_paths, bed_paths = export_pseudobulk(
    input_data = cell_data,
    variable = 'donor',            # variable by which to generate pseudobulk profiles; here one pseudobulk per donor
    sample_id_col = 'sample_id',
    chromsizes = chromsizes,
    bed_path = os.path.join(work_dir, 'scATAC_consensus_peakset/consensus_peak_calling/pseudobulk_bed_files/'),   # where pseudobulk bed files should be stored
    bigwig_path = os.path.join(work_dir, 'scATAC_consensus_peakset/consensus_peak_calling/pseudobulk_bw_files/'), # where pseudobulk bigwig files should be stored
    path_to_fragments = fragments_dict,   # location of fragment files
    n_cpu = 8,                            # number of cores; ray is used for multiprocessing
    normalize_bigwig = True,
    remove_duplicates = True,
    _temp_dir = os.path.join(temp_dir, 'ray_spill'))
    # split_pattern = '_'
#sys.stderr = sys.__stderr__  # unsilence stderr

import pickle
pickle.dump(bed_paths,
            open(os.path.join(work_dir, 'scATAC_consensus_peakset/consensus_peak_calling/pseudobulk_bed_files/bed_paths.pkl'), 'wb'))
pickle.dump(bw_paths,
           open(os.path.join(work_dir, 'scATAC_consensus_peakset/consensus_peak_calling/pseudobulk_bed_files/bw_paths.pkl'), 'wb'))
@SeppeDeWinter
Collaborator

Hi @SITRR

What you are doing looks all right.

Could you run the following piece of code and send me the output?

sample = "75G"
variable = "75G"
_sample_cell_data = cell_data.loc[cell_data[sample_id_col] == sample]
_cell_type_to_cell_barcodes = _sample_cell_data \
    .groupby(variable, group_keys=False)["barcode"] \
    .apply(list) \
    .to_dict()

print(_cell_type_to_cell_barcodes)

Best,

Seppe


SITRR commented Mar 25, 2024

Hi @SeppeDeWinter,

Disclaimer: I'm an absolute beginner with python.

The line _sample_cell_data = cell_data.loc[cell_data[sample_id_col] == sample] gives the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name '_sample_cell_data' is not defined

I thought to change sample_id_col to 'sample_id' (please refer to the disclaimer for my reasoning) but then _cell_type_to_cell_barcodes gives the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/core/frame.py", line 8392, in groupby
    return DataFrameGroupBy(
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 959, in __init__
    grouper, exclusions, obj = get_grouper(
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 889, in get_grouper
    raise KeyError(gpr)
KeyError: '75G'

Best,
Susana

@SeppeDeWinter
Collaborator

Hi @SITRR

Can you show what your cell_data looks like?

print(cell_data)

All the best,

Seppe


SITRR commented Mar 26, 2024

Hi @SeppeDeWinter,

[Screenshot: printed cell_data dataframe]

Thanks (seriously, I will be forever indebted),
Susana

@SeppeDeWinter
Collaborator

Hi Susana

My bad, the code should be like this

sample = "75G"
variable = "donor"
sample_id_col = "sample_id"
_sample_cell_data = cell_data.loc[cell_data[sample_id_col] == sample]
_cell_type_to_cell_barcodes = _sample_cell_data \
    .groupby(variable, group_keys=False)["barcode"] \
    .apply(list) \
    .to_dict()

print(_cell_type_to_cell_barcodes)


SITRR commented Apr 1, 2024

Hi, @SeppeDeWinter!

It printed a huge list of barcodes:

[Screenshot: printed dictionary of barcodes]

Is that to be expected? Should the output of len(_cell_type_to_cell_barcodes) be 1? I'm presuming that it shouldn't.

Best,
Susana

@SeppeDeWinter
Collaborator

Hi @SITRR

That looks all right; I don't immediately see anything that is off.
Does the function produce no output at all, not even for the other clusters?

Are you running it in a Jupyter notebook or via the command line?
Could it be that your job crashed?

All the best,

Seppe


SITRR commented Apr 3, 2024

Hi, @SeppeDeWinter!

I'm running it via the command line, as an LSF job. It runs successfully but only outputs the bed_paths.pkl and bw_paths.pkl files, nothing else. The output also doesn't say "Done!" as shown in the tutorial.

[Screenshot: job output log]

I was really hoping to get this to work, since there is a huge batch effect when I don't use a common peak set.

[Image: harmonized UMAP showing the batch effect]

I wouldn't want to go forward with this cistopic object.

Best,
Susana


SITRR commented Apr 3, 2024

I tried following this user's pipeline to verify each input (#59), but I get an error when trying to read the fragments file (unless I'm misunderstanding their code or their input files have a different format).

fragments = pd.read_table("/sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_75G_ATAC/outs/fragments.tsv.gz", header = None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/util/_decorators.py", line 317, in wrapper
    return func(*args, **kwargs)
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1289, in read_table
    return _read(filepath_or_buffer, kwds)
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 611, in _read
    return parser.read(nrows)
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1772, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 243, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 808, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 866, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1973, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 52, saw 5

Does this give you any hints?

Cheers,
Susana
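As a side note, "Expected 1 fields in line 52, saw 5" is exactly what pandas raises when the '#'-prefixed header lines of a CellRanger fragments file make the C parser infer a single column before it reaches the 5-column data rows; passing `comment='#'` avoids it. A sketch against a synthetic file (the header text here is hypothetical):

```python
import gzip
import os
import tempfile

import pandas as pd

# Synthetic fragments file: header comment lines, then one data row.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "fragments.tsv.gz")
with gzip.open(path, "wt") as f:
    f.write("# hypothetical header line\n" * 3)
    f.write("chr1\t10\t20\tAAACGAAAGCGCAATG-1\t1\n")

# Without comment='#' the parser infers 1 field from the header
# lines and then chokes on the 5-field data row.
try:
    pd.read_table(path, header=None)
    raised = False
except pd.errors.ParserError:
    raised = True

# With comment='#' the header lines are skipped.
ok = pd.read_table(path, header=None, comment="#")
print(raised, ok.shape)  # True (1, 5)
```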

@SeppeDeWinter
Collaborator

Hi @SITRR

Could you try running the command-line version of this code?
Please see the documentation on how to do this: https://aertslab.github.io/scatac_fragment_tools/split.html.

This will involve:

  • Generating a tsv file <PATH_TO_SAMPLE_TO_FRAGMENT_DEFINITION> containing a column with sample ids and a column with fragment file paths (this can be created from your fragments_dict).
  • Generating a tsv file <PATH_TO_CELL_TYPE_TO_CELL_BARCODE_DEFINITION> containing a sample, cell_type and cell_barcode column (this can be created from your cell_data dataframe).
  • Downloading <CHROM_SIZES_FILENAME> (wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes).
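The two definition files can be derived from the fragments_dict and cell_data objects shown earlier in this thread. A sketch with hypothetical paths and toy data; check the split.html documentation for whether a header row is expected (here the files are written without one):

```python
import os
import tempfile

import pandas as pd

out_dir = tempfile.mkdtemp()

# Hypothetical stand-ins for the objects defined earlier in the thread.
fragments_dict = {
    "75G": "/hypothetical/Sample_75G_ATAC/outs/fragments.tsv.gz",
    "8561G": "/hypothetical/Sample_8561G_ATAC/outs/fragments.tsv.gz",
}
cell_data = pd.DataFrame({
    "barcode": ["TTTGTGTTCTTGTCGC-1", "AAACGAAAGCGCAATG-1"],
    "sample_id": ["75G", "8561G"],
    "donor": ["75G", "8561G"],
})

# Sample-to-fragment definition: one row per sample.
sample_to_fragment = pd.DataFrame({
    "sample": list(fragments_dict),
    "fragment_file": list(fragments_dict.values()),
})
sample_to_fragment.to_csv(
    os.path.join(out_dir, "sample_to_fragment.tsv"),
    sep="\t", index=False, header=False,
)

# Cell-type-to-barcode definition: sample, cell_type, cell_barcode.
cell_type_to_barcode = cell_data.rename(columns={
    "sample_id": "sample", "donor": "cell_type", "barcode": "cell_barcode",
})[["sample", "cell_type", "cell_barcode"]]
cell_type_to_barcode.to_csv(
    os.path.join(out_dir, "cell_type_to_barcode.tsv"),
    sep="\t", index=False, header=False,
)
```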

After this you can run

scatac_fragment_tools split \
    -f <PATH_TO_SAMPLE_TO_FRAGMENT_DEFINITION> \
    -b <PATH_TO_CELL_TYPE_TO_CELL_BARCODE_DEFINITION> \
    -c <CHROM_SIZES_FILENAME> \
    -o <PATH_TO_OUTPUT_FOLDER>

This will do the same splitting that the export_pseudobulk function is trying to do.

All the best,

Seppe


SITRR commented Apr 15, 2024

Thank you, @SeppeDeWinter!

How do I proceed after running this command? I'd like to extract the list of barcodes from the fragments file it generated, to compare it to the list I've been using to run export_pseudobulk (as follows), but I get a large 5.2 GB file, which doesn't seem right to me.

df = pd.read_table(work_dir + "Testing/75G.fragments.tsv.gz", header=None)
barcodes = df[df.columns[3]]
barcodes.to_csv(work_dir + 'Testing/barcodes.tsv', sep = "\t")

I tried using the new fragments file (output from scatac_fragment_tools split) but I'm getting the same results.
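One likely reason for the 5.2 GB file is that the code above writes out every fragment's barcode rather than only the unique ones; streaming the barcode column in chunks and deduplicating keeps both memory use and output small. A sketch against a synthetic toy file (the file contents are made up):

```python
import gzip
import os
import tempfile

import pandas as pd

# Tiny synthetic stand-in for 75G.fragments.tsv.gz.
tmp_dir = tempfile.mkdtemp()
frag_path = os.path.join(tmp_dir, "75G.fragments.tsv.gz")
with gzip.open(frag_path, "wt") as f:
    f.write("chr1\t1\t2\tAAAC-1\t1\n")
    f.write("chr1\t3\t4\tAAAC-1\t2\n")
    f.write("chr1\t5\t6\tTTTG-1\t1\n")

# Read only the barcode column (index 3) in chunks and deduplicate,
# instead of materialising every fragment row.
unique_barcodes = set()
for chunk in pd.read_table(
    frag_path, header=None, comment="#", usecols=[3], chunksize=1_000_000
):
    unique_barcodes.update(chunk[3])

print(sorted(unique_barcodes))  # ['AAAC-1', 'TTTG-1']
```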

Best,
Susana

@SeppeDeWinter
Collaborator

Hi @SITRR

I'm not sure whether I'm completely understanding your question...
If this command ran successfully, you can just continue with your analysis.

To get the bed_paths variable (the one that is used in the tutorial), run:

import os

from scatac_fragment_tools.library.split.split_fragments_by_cell_type import (
    _santize_string_for_filename
)

bed_path = <PATH_TO_OUTPUT_FOLDER>
variable = "donor" #variable used to split the fragments

bed_paths = {}
for cell_type in cell_data[variable].unique():
    _bed_fname = os.path.join(
        bed_path,
        f"{_santize_string_for_filename(cell_type)}.fragments.tsv.gz")
    if os.path.exists(_bed_fname):
        bed_paths[cell_type] = _bed_fname
    else:
        print(f"Missing fragments for {cell_type}!")

Best,

S


SITRR commented Apr 26, 2024

Hi @SeppeDeWinter!

I finally got the export_pseudobulk function to run by switching to Python v3.11 and the dev branch of SCENIC+. Oddly though, it only runs with two samples. The function immediately outputs the following message:

cisTopic INFO Reading fragments from /sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_75G_ATAC/outs/fragments.tsv.gz

However, when I add two more samples to the fragment directory, the function outputs the following instead and gets stuck on this step for several hours:

INFO Splitting fragments by cell type.

Any idea what might be happening?

Best,
Susana


SITRR commented Apr 29, 2024

Crisis averted, @SeppeDeWinter!

Patience was all that was necessary. It's running smoothly now, but I'd like to keep the issue open until I get to the end of the pipeline as I might encounter other issues and would appreciate your guidance. If you'd like me to open a separate issue instead, I can do so.

Thank you for everything so far!
