
Add support to download 10X Genomics data #144

@FelixKrueger

Description of the bug

In its current form, fetchngs does not download the files required for re-processing single-cell experiments from the 10X Genomics platform.

As discussed on the Slack channel, 10X data currently gets downloaded as a single FastQ file. However, 10X data typically stores the cell barcode and UMI in Read 1 (~28 bp), while Read 2 is the RNA insert (~91 bp) and Read 3 tends to be the Illumina multiplexing index (mostly irrelevant, as all reads should belong to a single sample anyway). Read 1 is flagged as technical, so it currently does not get included when using fasterq-dump, which turns the single-cell experiment into one single big bulk RNA-seq dataset.

Note:

It is also worth noting that the ENA does not serve technical reads at all, so 10X raw data can only be obtained via the SRA (prefetch, or fasterq-dump + accession).

Here is a description of the bug:

This is the command run by fetchngs with a 10X sample accession SRR9320616:

fasterq-dump --threads 6 SRR9320616 --outfile SRR9320616.fastq

It gives the following output:

SRR9320616.fastq

This output is arguably useless for single-cell (re-)analysis.

Proposal:

This is the command required for 10X data. It uses both --split-files and --include-technical:

fasterq-dump --threads 6 --split-files --include-technical SRR9320616 --outfile SRR9320616.fastq --progress

It gives the following output:

SRR9320616_1.fastq
SRR9320616_2.fastq
SRR9320616_3.fastq

Read 1 is the cell barcode + UMI:

@SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=26
NCACCTTCTGCTGTCGCCGATGTTGT
+SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=26
#AAFFJJJJJJJJJJJJJJJJJJJJJ

Read 2 is the RNA insert read:

@SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=98
NGTTACGCTAGTAATCCCTCTACCTTTAGCCACTCACTTGGCCCTAGGTAACTAAGACCCTGACATCACTTTGCCTCTTAGGGCACAAGGAGGAACTA
+SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=98
#A<FAFAAJFF-<FAJFF<--FFAJ-7F-7<--7-<--7-777-7<77-7F<AJJ7J-----A7-A-FFF7<-7--7F<JF---AAAJ7<J---7--F

Read 3 is the multiplexing index read (not strictly required, but it doesn't hurt and can always be deleted afterwards if desired):

@SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=8
NTTGAGAA
+SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=8
#AA-FFJF

Adding these options to the pipeline, either via a config file or directly within the fasterq-dump process, works fine:

process {
    withName: 'SRATOOLS_FASTERQDUMP' {
        ext.args = '--split-files --include-technical'
    }
}

Download, extraction into 3 files, as well as the pigz compression all appear to have worked well:

2023-04-24 10:47:08          0
2023-04-24 10:50:35          6 .command.begin
2023-04-24 11:33:02         90 .command.err
2023-04-24 11:35:08         90 .command.log
2023-04-24 11:33:01          0 .command.out
2023-04-24 10:47:08      13370 .command.run
2023-04-24 10:47:08        527 .command.sh
2023-04-24 11:33:02        261 .command.trace
2023-04-24 11:35:06          1 .exitcode
2023-04-24 11:33:03 3133859956 SRX6088086_SRR9320616_1.fastq.gz
2023-04-24 11:33:03 8441509889 SRX6088086_SRR9320616_2.fastq.gz
2023-04-24 11:33:03 1496357946 SRX6088086_SRR9320616_3.fastq.gz
2023-04-24 11:33:03        124 versions.yml

I have changed the file pattern recognition to:

fastq = meta.single_end ? '*.fastq.gz' : '*_{1,2,3,4}.fastq.gz'

However, the files then never get published, and I suspect it has to do with how the read names are extracted afterwards:

https://github.com/FelixKrueger/fetchngs/blob/62b2bc840b14465a0ff551f614d613a15fdef582/workflows/sra.nf#L120-L132


SRA_FASTQ_FTP
    .out
    .fastq
    .mix(FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS.out.reads)
    .map { meta, fastq ->
        def reads = meta.single_end ? [ fastq ] : fastq
        def meta_clone = meta.clone()
        meta_clone.fastq_1 = reads[0] ? "${params.outdir}/fastq/${reads[0].getName()}" : ''
        meta_clone.fastq_2 = reads[1] && !meta.single_end ? "${params.outdir}/fastq/${reads[1].getName()}" : ''
        return meta_clone
    }
    .set { ch_sra_metadata }

This is the error message that brings the whole process down:

Unknown method invocation `getName` on ArrayList type
-- Check script '.nextflow/assets/FelixKrueger/fetchngs/./workflows/sra.nf' at line: 128 or see 'nf-62eTOEybyloWFq.log' file for more details
WARN: Failed to publish file: s3://altos-lab-nextflow/scratch/5c32VUHOyVZskM/aa/b062914e17b4b9d68ae187ffb920a7/SRX6088086_SRR9320616_2.fastq.gz; to: s3://testbucket/results/fastq/SRX6088086_SRR9320616_2.fastq.gz [copy] -- See log file for details

It is probably trivial to get the getName() method to work with the new data structure, but I am currently at a loss as to how to fix it.
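One possible fix (a sketch only, not tested against the pipeline) would be to stop keying off meta.single_end and instead coerce fastq to a list before indexing, so that reads[0] is always a file object rather than an ArrayList, and fastq_2 is only set when a second read file actually exists:

```groovy
// Hypothetical rework of the map closure from workflows/sra.nf.
// `fastq` can arrive either as a single file or as a list of files
// (e.g. the three split 10X reads), so normalise it to a list first;
// getName() then resolves on the individual file objects.
.map { meta, fastq ->
    def reads = fastq instanceof List ? fastq : [ fastq ]
    def meta_clone = meta.clone()
    meta_clone.fastq_1 = reads[0] ? "${params.outdir}/fastq/${reads[0].getName()}" : ''
    meta_clone.fastq_2 = reads.size() > 1 ? "${params.outdir}/fastq/${reads[1].getName()}" : ''
    return meta_clone
}
```

Note that the metadata schema would presumably still need a decision on how (or whether) to record a third read file such as the multiplexing index.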

Many thanks for your kind attention!

Command used and terminal output

No response

Relevant files

No response

System information

No response

Labels: enhancement (Improvement for existing functionality)