
export_pseudobulk never generates pseudobulk files #117

Open
SITRR opened this issue Mar 21, 2024 · 14 comments


SITRR commented Mar 21, 2024

Hi!

I'm having an issue with the export_pseudobulk function, which has been reported previously (#41). The function runs but doesn't output any files, and the text output never states "Done!" after "Creating pseudobulk for 75G" (75G being the name of one of my samples).

I've generated a cell_data data frame with three columns: 'barcode', which uses the same barcode format as the fragment files (e.g. 'TTTGTGTTCTTGTCGC-1'; I got these from the CellRanger output singlecell.csv file, let me know if that's not advisable); a second column, 'sample_id', with the name of the sample and fragments_dict key (e.g. 75G, repeated for the length of the 'barcode' column); and a third column, 'donor', identical to the 'sample_id' column. I use the 'donor' column as the input for 'variable' in the export_pseudobulk function (see below).

Should I be extracting the barcodes from the fragments file directly? I'm new to this and don't know how to do so in Python.
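For reference, barcodes can be read straight from a fragments file with pandas. Below is a minimal sketch against a synthetic file; the chrom/start/end/barcode/count layout matches CellRanger's fragments format, and `comment='#'` skips the header lines CellRanger writes at the top of the file. The file contents here are made up for illustration.

```python
import gzip
import os
import tempfile

import pandas as pd

# Synthetic stand-in for a CellRanger fragments.tsv.gz file:
# '#'-prefixed header lines, then chrom/start/end/barcode/count rows.
tmp_dir = tempfile.mkdtemp()
frag_path = os.path.join(tmp_dir, "fragments.tsv.gz")
with gzip.open(frag_path, "wt") as f:
    f.write("# id=Sample_75G_ATAC\n")
    f.write("chr1\t100\t250\tTTTGTGTTCTTGTCGC-1\t2\n")
    f.write("chr1\t300\t500\tAAACGAAAGCGCAATG-1\t1\n")
    f.write("chr2\t50\t90\tTTTGTGTTCTTGTCGC-1\t3\n")

# comment='#' makes pandas skip the header lines; without it the
# parser fails as soon as it hits the first 5-column data row.
fragments = pd.read_table(
    frag_path,
    comment="#",
    header=None,
    names=["Chromosome", "Start", "End", "Barcode", "Count"],
)
barcodes = fragments["Barcode"].unique()
print(sorted(barcodes))
```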

This is the cell_data dataframe:

[Screenshot: the cell_data dataframe]

This is the fragment file unzipped:

[Screenshot: the unzipped fragment file]

This is the code I'm running:

fragments_dict = {
    '75G': '/sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_75G_ATAC/outs/fragments.tsv.gz',
    '8561G': '/sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_8561G_ATAC/outs/fragments.tsv.gz'}
path_to_regions = {
    '75G': '/sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_75G_ATAC/outs/filtered_peak_bc_matrix/peaks.bed',
    '8561G': '/sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_8561G_ATAC/outs/filtered_peak_bc_matrix/peaks.bed'}
path_to_blacklist= '/sc/arion/projects/DTR-EFGR/GBMATAC_SIR/Analysis/CopyscAT/hg38-blacklist.v2.bed'

import pandas as pd
df_75G = pd.read_csv(work_dir + 'scATAC_consensus_peakset/quality_control/barcodes/barcodes_75G.csv', header = 0, sep = ',', index_col = 0)
df_8561G = pd.read_csv(work_dir + 'scATAC_consensus_peakset/quality_control/barcodes/barcodes_8561G.csv', header = 0, sep = ',', index_col = 0)
print(df_75G.head())
print(df_8561G.head())

frames = [df_75G, df_8561G]
cell_data = pd.concat(frames) 
print(cell_data.head())
print(cell_data.tail())

import pyranges as pr
import requests
import pandas as pd
target_url='http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes'
chromsizes=pd.read_csv(target_url, sep='\t', header=None)
chromsizes.columns=['Chromosome', 'End']
chromsizes['Start']=[0]*chromsizes.shape[0]
chromsizes=chromsizes.loc[:,['Chromosome', 'Start', 'End']]
chromsizes['Chromosome'] = [chromsizes['Chromosome'][x].replace('v', '.') for x in range(len(chromsizes['Chromosome']))]
chromsizes['Chromosome'] = [chromsizes['Chromosome'][x].split('_')[1] if len(chromsizes['Chromosome'][x].split('_')) > 1 else chromsizes['Chromosome'][x] for x in range(len(chromsizes['Chromosome']))]
chromsizes=pr.PyRanges(chromsizes)

from pycisTopic.pseudobulk_peak_calling import export_pseudobulk
import os
import ray
ray.shutdown()
#sys.stderr = open(os.devnull, "w")  # silence stderr
bw_paths, bed_paths = export_pseudobulk(
    input_data = cell_data,
    variable = 'donor',            # variable by which to generate pseudobulk profiles; here one pseudobulk per donor
    sample_id_col = 'sample_id',
    chromsizes = chromsizes,
    bed_path = os.path.join(work_dir, 'scATAC_consensus_peakset/consensus_peak_calling/pseudobulk_bed_files/'),   # where pseudobulk bed files should be stored
    bigwig_path = os.path.join(work_dir, 'scATAC_consensus_peakset/consensus_peak_calling/pseudobulk_bw_files/'), # where pseudobulk bigwig files should be stored
    path_to_fragments = fragments_dict,   # location of fragment files
    n_cpu = 8,                            # number of cores; ray is used for multiprocessing
    normalize_bigwig = True,
    remove_duplicates = True,
    _temp_dir = os.path.join(temp_dir, 'ray_spill'))
    # split_pattern = '_'
#sys.stderr = sys.__stderr__  # unsilence stderr

import pickle
pickle.dump(bed_paths,
            open(os.path.join(work_dir, 'scATAC_consensus_peakset/consensus_peak_calling/pseudobulk_bed_files/bed_paths.pkl'), 'wb'))
pickle.dump(bw_paths,
           open(os.path.join(work_dir, 'scATAC_consensus_peakset/consensus_peak_calling/pseudobulk_bed_files/bw_paths.pkl'), 'wb'))
@SeppeDeWinter
Collaborator

Hi @SITRR

What you are doing looks all right.

Could you run the following piece of code and send me the output?

sample = "75G"
variable = "75G"
_sample_cell_data = cell_data.loc[cell_data[sample_id_col] == sample]
_cell_type_to_cell_barcodes = _sample_cell_data \
    .groupby(variable, group_keys=False)["barcode"] \
    .apply(list) \
    .to_dict()

print(_cell_type_to_cell_barcodes)

Best,

Seppe


SITRR commented Mar 25, 2024

Hi @SeppeDeWinter,

Disclaimer: I'm an absolute beginner with python.

The line _sample_cell_data = cell_data.loc[cell_data[sample_id_col] == sample] gives the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name '_sample_cell_data' is not defined

I thought to change sample_id_col to 'sample_id' (please refer to the disclaimer for my reasoning) but then _cell_type_to_cell_barcodes gives the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/core/frame.py", line 8392, in groupby
    return DataFrameGroupBy(
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 959, in __init__
    grouper, exclusions, obj = get_grouper(
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 889, in get_grouper
    raise KeyError(gpr)
KeyError: '75G'

Best,
Susana

@SeppeDeWinter
Collaborator

Hi @SITRR

Can you show what your cell_data looks like?

print(cell_data)

All the best,

Seppe


SITRR commented Mar 26, 2024

Hi @SeppeDeWinter,

[Screenshot: printed cell_data dataframe]

Thanks (seriously, I will be forever indebted),
Susana

@SeppeDeWinter
Collaborator

Hi Susana

My bad, the code should be like this

sample = "75G"
variable = "donor"
sample_id_col = "sample_id"
_sample_cell_data = cell_data.loc[cell_data[sample_id_col] == sample]
_cell_type_to_cell_barcodes = _sample_cell_data \
    .groupby(variable, group_keys=False)["barcode"] \
    .apply(list) \
    .to_dict()

print(_cell_type_to_cell_barcodes)


SITRR commented Apr 1, 2024

Hi, @SeppeDeWinter!

It printed a huge list of barcodes:

[Screenshot: printed dictionary of barcodes]

Is that to be expected? Should the output of len(_cell_type_to_cell_barcodes) be 1? I'm presuming that it shouldn't.

Best,
Susana

@SeppeDeWinter
Collaborator

Hi @SITRR

That looks all right; I don't immediately see anything that is off.
Does the function produce no output at all, not even for the other clusters?

Are you running it in a Jupyter notebook or via the command line?
Could it be that your job crashed?

All the best,

Seppe


SITRR commented Apr 3, 2024

Hi, @SeppeDeWinter!

I'm running it via the command line, as an LSF job. It runs successfully but only outputs the bed_paths.pkl and bw_paths.pkl files, nothing else. The output also doesn't say "Done!" as shown in the tutorial.

[Screenshot: job output log]

I was really hoping to get this to work, since there is a huge batch effect when I don't use a common peak set.

[Image: harmonized UMAP showing the batch effect]

I wouldn't want to go forward with this cistopic object.

Best,
Susana


SITRR commented Apr 3, 2024

I tried following this user's pipeline to verify each input (#59), but I get an error when trying to read the fragments file (unless I'm misunderstanding their code or their input files have a different format).

fragments = pd.read_table("/sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_75G_ATAC/outs/fragments.tsv.gz", header = None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/util/_decorators.py", line 317, in wrapper
    return func(*args, **kwargs)
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1289, in read_table
    return _read(filepath_or_buffer, kwds)
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 611, in _read
    return parser.read(nrows)
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1772, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 243, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 808, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 866, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1973, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 52, saw 5

Does this give you any hints?

Cheers,
Susana
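As a side note, "Expected 1 fields in line 52, saw 5" is exactly what pandas raises when the '#'-prefixed header lines of a CellRanger fragments file make the C parser infer a single column before it reaches the 5-column data rows; passing `comment='#'` avoids it. A sketch against a synthetic file (the header text here is hypothetical):

```python
import gzip
import os
import tempfile

import pandas as pd

# Synthetic fragments file: header comment lines, then one data row.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "fragments.tsv.gz")
with gzip.open(path, "wt") as f:
    f.write("# hypothetical header line\n" * 3)
    f.write("chr1\t10\t20\tAAACGAAAGCGCAATG-1\t1\n")

# Without comment='#' the parser infers 1 field from the header
# lines and then chokes on the 5-field data row.
try:
    pd.read_table(path, header=None)
    raised = False
except pd.errors.ParserError:
    raised = True

# With comment='#' the header lines are skipped.
ok = pd.read_table(path, header=None, comment="#")
print(raised, ok.shape)  # True (1, 5)
```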

@SeppeDeWinter
Collaborator

Hi @SITRR

Could you try running the command-line version of this code?
Please see the documentation on how to do this: https://aertslab.github.io/scatac_fragment_tools/split.html.

This will involve:

  • Generating a tsv file <PATH_TO_SAMPLE_TO_FRAGMENT_DEFINITION> containing a column with sample ids and a column with fragment file paths (this can be created from your fragments_dict).
  • Generating a tsv file <PATH_TO_CELL_TYPE_TO_CELL_BARCODE_DEFINITION> containing a sample, cell_type and cell_barcode column (this can be created from your cell_data dataframe).
  • Downloading <CHROM_SIZES_FILENAME> (wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes).
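The two definition files can be derived from the fragments_dict and cell_data objects shown earlier in this thread. A sketch with hypothetical paths and toy data; check the split.html documentation for whether a header row is expected (here the files are written without one):

```python
import os
import tempfile

import pandas as pd

out_dir = tempfile.mkdtemp()

# Hypothetical stand-ins for the objects defined earlier in the thread.
fragments_dict = {
    "75G": "/hypothetical/Sample_75G_ATAC/outs/fragments.tsv.gz",
    "8561G": "/hypothetical/Sample_8561G_ATAC/outs/fragments.tsv.gz",
}
cell_data = pd.DataFrame({
    "barcode": ["TTTGTGTTCTTGTCGC-1", "AAACGAAAGCGCAATG-1"],
    "sample_id": ["75G", "8561G"],
    "donor": ["75G", "8561G"],
})

# Sample-to-fragment definition: one row per sample.
sample_to_fragment = pd.DataFrame({
    "sample": list(fragments_dict),
    "fragment_file": list(fragments_dict.values()),
})
sample_to_fragment.to_csv(
    os.path.join(out_dir, "sample_to_fragment.tsv"),
    sep="\t", index=False, header=False,
)

# Cell-type-to-barcode definition: sample, cell_type, cell_barcode.
cell_type_to_barcode = cell_data.rename(columns={
    "sample_id": "sample", "donor": "cell_type", "barcode": "cell_barcode",
})[["sample", "cell_type", "cell_barcode"]]
cell_type_to_barcode.to_csv(
    os.path.join(out_dir, "cell_type_to_barcode.tsv"),
    sep="\t", index=False, header=False,
)
```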

After this you can run

scatac_fragment_tools split \
    -f <PATH_TO_SAMPLE_TO_FRAGMENT_DEFINITION> \
    -b <PATH_TO_CELL_TYPE_TO_CELL_BARCODE_DEFINITION> \
    -c <CHROM_SIZES_FILENAME> \
    -o <PATH_TO_OUTPUT_FOLDER>

This will do the same splitting that the export_pseudobulk function is trying to do.

All the best,

Seppe


SITRR commented Apr 15, 2024

Thank you, @SeppeDeWinter!

How do I proceed after running this command? I'd like to extract the list of barcodes from the fragments file it generated, to compare it to the list I've been using to run export_pseudobulk (as follows), but I get a large 5.2 GB file, which doesn't seem right to me.

df = pd.read_table(work_dir + "Testing/75G.fragments.tsv.gz", header=None)
barcodes = df[df.columns[3]]
barcodes.to_csv(work_dir + 'Testing/barcodes.tsv', sep = "\t")

I tried using the new fragments file (output from scatac_fragment_tools split) but I'm getting the same results.
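One likely reason for the 5.2 GB file is that the code above writes out every fragment's barcode rather than only the unique ones; streaming the barcode column in chunks and deduplicating keeps both memory use and output small. A sketch against a synthetic toy file (the file contents are made up):

```python
import gzip
import os
import tempfile

import pandas as pd

# Tiny synthetic stand-in for 75G.fragments.tsv.gz.
tmp_dir = tempfile.mkdtemp()
frag_path = os.path.join(tmp_dir, "75G.fragments.tsv.gz")
with gzip.open(frag_path, "wt") as f:
    f.write("chr1\t1\t2\tAAAC-1\t1\n")
    f.write("chr1\t3\t4\tAAAC-1\t2\n")
    f.write("chr1\t5\t6\tTTTG-1\t1\n")

# Read only the barcode column (index 3) in chunks and deduplicate,
# instead of materialising every fragment row.
unique_barcodes = set()
for chunk in pd.read_table(
    frag_path, header=None, comment="#", usecols=[3], chunksize=1_000_000
):
    unique_barcodes.update(chunk[3])

print(sorted(unique_barcodes))  # ['AAAC-1', 'TTTG-1']
```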

Best,
Susana

@SeppeDeWinter
Collaborator

Hi @SITRR

I'm not sure whether I'm completely understanding your question...
If this command ran successfully, you can just continue with your analysis.

To get the bed_paths variable (the one that is used in the tutorial), run:

import os

from scatac_fragment_tools.library.split.split_fragments_by_cell_type import (
    _santize_string_for_filename
)

bed_path = <PATH_TO_OUTPUT_FOLDER>
variable = "donor" #variable used to split the fragments

bed_paths = {}
for cell_type in cell_data[variable].unique():
    _bed_fname = os.path.join(
        bed_path,
        f"{_santize_string_for_filename(cell_type)}.fragments.tsv.gz")
    if os.path.exists(_bed_fname):
        bed_paths[cell_type] = _bed_fname
    else:
        print(f"Missing fragments for {cell_type}!")

Best,

S


SITRR commented Apr 26, 2024

Hi @SeppeDeWinter!

I finally got the export_pseudobulk function to run by switching to Python v3.11 and the dev branch of SCENIC+. Oddly though, it only runs with two samples. The function immediately outputs the following message:

cisTopic INFO Reading fragments from /sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_75G_ATAC/outs/fragments.tsv.gz

However, when I add two more samples to the fragment directory, the function outputs the following instead and gets stuck on this step for several hours:

INFO Splitting fragments by cell type.

Any idea what might be happening?

Best,
Susana


SITRR commented Apr 29, 2024

Crisis averted, @SeppeDeWinter!

Patience was all that was necessary. It's running smoothly now, but I'd like to keep the issue open until I get to the end of the pipeline as I might encounter other issues and would appreciate your guidance. If you'd like me to open a separate issue instead, I can do so.

Thank you for everything so far!
