Description of the bug
In its current form, fetchngs does not download the files required for re-processing single-cell experiments from the 10X Genomics platform.
As discussed on the Slack channel, 10X data currently gets downloaded as a single FastQ file only. However, 10X data typically contains the cell ID and UMI in Read 1 (~28 bp), while Read 2 is the RNA insert (~91 bp). Read 3 tends to be the Illumina multiplexing index (mostly irrelevant, as the reads should all belong to a single sample anyway). Read 1 is flagged as technical, so it is not included when fasterq-dump is run with the current options, which collapses the single-cell experiment into one big bulk RNA-seq dataset.
Note:
It is also worth noting that the ENA does not serve out technical reads at all, so 10X raw data can only be obtained via the SRA (prefetch, or fasterq-dump + accession).
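For reference, the two-step SRA route mentioned in this note could be sketched roughly as below. This is only an illustration: `fetch_10x_run` and the `DRY_RUN` switch are hypothetical names made up for this sketch, and the two fasterq-dump options are the ones from the proposal later in this report. It assumes `prefetch` and `fasterq-dump` from sra-tools are on the `PATH`.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of fetchngs): download an SRA run and
# extract every read, technical reads included.
# Set DRY_RUN=1 to print the commands instead of running them.
fetch_10x_run() {
    local accession="$1" threads="${2:-6}"
    local run=""
    [ "${DRY_RUN:-0}" = "1" ] && run="echo"

    # Step 1: fetch the .sra archive so extraction works from a local copy.
    $run prefetch "$accession" || return 1

    # Step 2: --split-files writes one file per read, and
    # --include-technical keeps the barcode/UMI read.
    $run fasterq-dump --threads "$threads" --split-files --include-technical "$accession"
}
```

Calling `DRY_RUN=1 fetch_10x_run SRR9320616` only prints the two commands, which is a cheap way to check the options before starting a long download.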
Here is a description of the bug:
This is the command run by fetchngs with a 10X sample accession SRR9320616:
fasterq-dump --threads 6 SRR9320616 --outfile SRR9320616.fastq
It gives the following output:
SRR9320616.fastq
This output is arguably useless for single-cell (re-)analysis.
Proposal:
This is the command required for 10X data. It uses both --split-files and --include-technical:
fasterq-dump --threads 6 --split-files --include-technical SRR9320616 --outfile SRR9320616.fastq --progress
It gives the following output:
SRR9320616_1.fastq
SRR9320616_2.fastq
SRR9320616_3.fastq
Read 1 is the cell barcode + UMI:
@SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=26
NCACCTTCTGCTGTCGCCGATGTTGT
+SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=26
#AAFFJJJJJJJJJJJJJJJJJJJJJ
Read 2 is the RNA insert read:
@SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=98
NGTTACGCTAGTAATCCCTCTACCTTTAGCCACTCACTTGGCCCTAGGTAACTAAGACCCTGACATCACTTTGCCTCTTAGGGCACAAGGAGGAACTA
+SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=98
#A<FAFAAJFF-<FAJFF<--FFAJ-7F-7<--7-<--7-777-7<77-7F<AJJ7J-----A7-A-FFF7<-7--7F<JF---AAAJ7<J---7--F
Read 3 is the multiplexing index read (not strictly required, but it doesn't hurt and can always be deleted afterwards if desired):
@SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=8
NTTGAGAA
+SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=8
#AA-FFJF
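To check which split file holds which read without opening each one by hand, a small helper along these lines could summarise the read lengths. It is a sketch only: `fastq_read_lengths` is a hypothetical name, it assumes standard 4-line FASTQ records, and it samples only the first N records (1000 by default) so that it stays fast on large files.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of fetchngs or sra-tools): report the
# minimum and maximum read length among the first N records of a FASTQ
# file. Works on plain or gzip-compressed input.
fastq_read_lengths() {
    local file="$1" n="${2:-1000}" reader="cat"
    case "$file" in
        *.gz) reader="gzip -dc" ;;
    esac
    # Each record is 4 lines; the sequence is the 2nd line of each record.
    $reader "$file" | head -n "$((n * 4))" | awk -v name="$file" '
        NR % 4 == 2 {
            len = length($0)
            if (count == 0 || len < min) min = len
            if (len > max) max = len
            count++
        }
        END { printf "%s\treads=%d\tmin=%d\tmax=%d\n", name, count, min, max }'
}
```

Run as `fastq_read_lengths SRR9320616_1.fastq` (and likewise for `_2` and `_3`), the shortest file should be the index read and the longest the RNA insert, going by the record lengths shown above.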
Adding these options to the pipeline, either via a config file or directly within the fasterq-dump process, works fine.
process {
    withName: 'SRATOOLS_FASTERQDUMP' {
        ext.args = '--split-files --include-technical'
    }
}
Download, extraction into 3 files as well as the pigz compression appear to have worked well:
2023-04-24 10:47:08 0
2023-04-24 10:50:35 6 .command.begin
2023-04-24 11:33:02 90 .command.err
2023-04-24 11:35:08 90 .command.log
2023-04-24 11:33:01 0 .command.out
2023-04-24 10:47:08 13370 .command.run
2023-04-24 10:47:08 527 .command.sh
2023-04-24 11:33:02 261 .command.trace
2023-04-24 11:35:06 1 .exitcode
2023-04-24 11:33:03 3133859956 SRX6088086_SRR9320616_1.fastq.gz
2023-04-24 11:33:03 8441509889 SRX6088086_SRR9320616_2.fastq.gz
2023-04-24 11:33:03 1496357946 SRX6088086_SRR9320616_3.fastq.gz
2023-04-24 11:33:03 124 versions.yml
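Beyond looking at the file sizes, the compressed output could be checked with something like the following. It is a sketch under assumptions: `check_split_fastq` is a hypothetical helper, it assumes standard 4-line FASTQ records, and it decompresses each file in full, so it will take a while on files of this size.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of fetchngs): verify that a set of gzipped
# FASTQ files are intact and all hold the same number of records.
check_split_fastq() {
    local expected="" file lines
    for file in "$@"; do
        # gzip -t exits non-zero if the archive is truncated or corrupt.
        gzip -t "$file" || { echo "corrupt: $file" >&2; return 1; }

        lines=$(gzip -dc "$file" | wc -l)
        lines=$((lines))  # normalise any padding added by wc
        echo "$file: $((lines / 4)) records"

        # Every file from the same run should have the same line count.
        if [ -z "$expected" ]; then
            expected="$lines"
        elif [ "$lines" -ne "$expected" ]; then
            echo "record count mismatch: $file" >&2
            return 1
        fi
    done
}
```

A call such as `check_split_fastq SRX6088086_SRR9320616_{1,2,3}.fastq.gz` returns non-zero as soon as one file is corrupt or has a different record count from the first.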
I have changed the file pattern recognition to:
fastq = meta.single_end ? '*.fastq.gz' : '*_{1,2,3,4}.fastq.gz'
However, the files then never get published, and I suspect it has to do with how the read names are extracted afterwards:
SRA_FASTQ_FTP
    .out
    .fastq
    .mix(FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS.out.reads)
    .map {
        meta, fastq ->
            def reads = meta.single_end ? [ fastq ] : fastq
            def meta_clone = meta.clone()
            meta_clone.fastq_1 = reads[0] ? "${params.outdir}/fastq/${reads[0].getName()}" : ''
            meta_clone.fastq_2 = reads[1] && !meta.single_end ? "${params.outdir}/fastq/${reads[1].getName()}" : ''
            return meta_clone
    }
    .set { ch_sra_metadata }
This is the error message that brings the whole process down:
Unknown method invocation `getName` on ArrayList type
-- Check script '.nextflow/assets/FelixKrueger/fetchngs/./workflows/sra.nf' at line: 128 or see 'nf-62eTOEybyloWFq.log' file for more details
WARN: Failed to publish file: s3://altos-lab-nextflow/scratch/5c32VUHOyVZskM/aa/b062914e17b4b9d68ae187ffb920a7/SRX6088086_SRR9320616_2.fastq.gz; to: s3://testbucket/results/fastq/SRX6088086_SRR9320616_2.fastq.gz [copy] -- See log file for details
It may well be trivial to get the getName() method to work with the new data structure, but I am currently at a loss as to how to fix it.
Many thanks for your kind attention!
Command used and terminal output
No response
Relevant files
No response
System information
No response