
Add support to download 10X Genomics data #144

@FelixKrueger

Description of the bug

In its current form, fetchngs does not download the files required for re-processing single-cell experiments from the 10X Genomics platform.

As discussed on the Slack channel, 10X data currently gets downloaded as a single FastQ file. However, 10X data typically stores the cell barcode and UMI in Read 1 (~28 bp), while Read 2 is the RNA insert (~91 bp) and Read 3 tends to be the Illumina multiplexing index (mostly irrelevant, as all reads should belong to a single sample anyway). Read 1 is flagged as technical, so it currently does not get included when using fasterq-dump, which turns the single-cell experiment into one single big bulk RNA-seq dataset.

Note:

It is also worth noting that the ENA does not serve technical reads at all, so 10X raw data can only be obtained via the SRA (prefetch, or fasterq-dump + accession).

Here is a description of the bug:

This is the command run by fetchngs with a 10X sample accession SRR9320616:

fasterq-dump --threads 6 SRR9320616 --outfile SRR9320616.fastq

It gives the following output:

SRR9320616.fastq

This output is arguably useless for single-cell (re-)analysis.

Proposal:

This is the command required for 10X data. It uses both --split-files and --include-technical:

fasterq-dump --threads 6 --split-files --include-technical SRR9320616 --outfile SRR9320616.fastq --progress

It gives the following output:

SRR9320616_1.fastq
SRR9320616_2.fastq
SRR9320616_3.fastq

Read 1 is the cell barcode + UMI:

@SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=26
NCACCTTCTGCTGTCGCCGATGTTGT
+SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=26
#AAFFJJJJJJJJJJJJJJJJJJJJJ

Read 2 is the RNA insert read:

@SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=98
NGTTACGCTAGTAATCCCTCTACCTTTAGCCACTCACTTGGCCCTAGGTAACTAAGACCCTGACATCACTTTGCCTCTTAGGGCACAAGGAGGAACTA
+SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=98
#A<FAFAAJFF-<FAJFF<--FFAJ-7F-7<--7-<--7-777-7<77-7F<AJJ7J-----A7-A-FFF7<-7--7F<JF---AAAJ7<J---7--F

Read 3 is the multiplexing index read (not strictly required, but it doesn't hurt and can always be deleted afterwards if desired):

@SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=8
NTTGAGAA
+SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=8
#AA-FFJF

Adding these options to the pipeline, either via a config file or directly within the fasterq-dump process, works fine:

process {
    withName: 'SRATOOLS_FASTERQDUMP' {
        ext.args = '--split-files --include-technical'
    }
}

Download, extraction into 3 files, as well as the pigz compression all appear to have worked well:

2023-04-24 10:47:08          0
2023-04-24 10:50:35          6 .command.begin
2023-04-24 11:33:02         90 .command.err
2023-04-24 11:35:08         90 .command.log
2023-04-24 11:33:01          0 .command.out
2023-04-24 10:47:08      13370 .command.run
2023-04-24 10:47:08        527 .command.sh
2023-04-24 11:33:02        261 .command.trace
2023-04-24 11:35:06          1 .exitcode
2023-04-24 11:33:03 3133859956 SRX6088086_SRR9320616_1.fastq.gz
2023-04-24 11:33:03 8441509889 SRX6088086_SRR9320616_2.fastq.gz
2023-04-24 11:33:03 1496357946 SRX6088086_SRR9320616_3.fastq.gz
2023-04-24 11:33:03        124 versions.yml

I have changed the file pattern recognition to:

fastq = meta.single_end ? '*.fastq.gz' : '*_{1,2,3,4}.fastq.gz'

However, the files then never get published, and I suspect it has to do with how the read names are extracted afterwards:

https://github.com/FelixKrueger/fetchngs/blob/62b2bc840b14465a0ff551f614d613a15fdef582/workflows/sra.nf#L120-L132


SRA_FASTQ_FTP
    .out
    .fastq
    .mix(FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS.out.reads)
    .map { meta, fastq ->
        def reads = meta.single_end ? [ fastq ] : fastq
        def meta_clone = meta.clone()
        meta_clone.fastq_1 = reads[0] ? "${params.outdir}/fastq/${reads[0].getName()}" : ''
        meta_clone.fastq_2 = reads[1] && !meta.single_end ? "${params.outdir}/fastq/${reads[1].getName()}" : ''
        return meta_clone
    }
    .set { ch_sra_metadata }

This is the error message that brings the whole process down:

Unknown method invocation `getName` on ArrayList type
-- Check script '.nextflow/assets/FelixKrueger/fetchngs/./workflows/sra.nf' at line: 128 or see 'nf-62eTOEybyloWFq.log' file for more details
WARN: Failed to publish file: s3://altos-lab-nextflow/scratch/5c32VUHOyVZskM/aa/b062914e17b4b9d68ae187ffb920a7/SRX6088086_SRR9320616_2.fastq.gz; to: s3://testbucket/results/fastq/SRX6088086_SRR9320616_2.fastq.gz [copy] -- See log file for details

It is probably trivial to get the getName() method to work with the new data structure, but I am currently at a loss as to how to fix it.
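One possible fix (a sketch only, not tested against the pipeline) would be to stop keying off meta.single_end and instead coerce fastq to a list before indexing, so that reads[0] is always a file object rather than an ArrayList, and fastq_2 is only set when a second read file actually exists:

```groovy
// Hypothetical rework of the map closure from workflows/sra.nf.
// `fastq` can arrive either as a single file or as a list of files
// (e.g. the three split 10X reads), so normalise it to a list first;
// getName() then resolves on the individual file objects.
.map { meta, fastq ->
    def reads = fastq instanceof List ? fastq : [ fastq ]
    def meta_clone = meta.clone()
    meta_clone.fastq_1 = reads[0] ? "${params.outdir}/fastq/${reads[0].getName()}" : ''
    meta_clone.fastq_2 = reads.size() > 1 ? "${params.outdir}/fastq/${reads[1].getName()}" : ''
    return meta_clone
}
```

Note that the metadata schema would presumably still need a decision on how (or whether) to record a third read file such as the multiplexing index.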

Many thanks for your kind attention!

Command used and terminal output

No response

Relevant files

No response

System information

No response

Labels: enhancement (Improvement for existing functionality)