HTGenomeAnalysisUnit/h5ad_clean_annotate

h5ad clean and annotate

A small tool that helps clean, subset, and annotate h5ad files efficiently.

The tool relies on the AlphaSC package from Bioturing for efficient processing, but no GPU is required.

Usage

usage: clean_and_annotate.py [-h] --h5ad H5AD --out OUT --config CONFIG

Clean, annotate, subset and other useful operations on h5ad files.

options:
  -h, --help       show this help message and exit
  --h5ad H5AD      h5ad input file
  --out OUT        Output h5ad file
  --config CONFIG  JSON file defining configuration for cleaning and processing
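
For batch runs, the command line above can be assembled from Python. This is a minimal sketch, assuming the script sits in the working directory and is run with the current interpreter; build_command and run are hypothetical helpers, not part of the tool.

```python
import subprocess
import sys


def build_command(h5ad, out, config, script="clean_and_annotate.py"):
    """Return the argument list for one run, using the three documented flags."""
    # Assumption: the script is invoked through the current Python interpreter.
    return [
        sys.executable, script,
        "--h5ad", str(h5ad),
        "--out", str(out),
        "--config", str(config),
    ]


def run(h5ad, out, config):
    """Run the tool once; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_command(h5ad, out, config), check=True)
```

Calling run("input.h5ad", "output.h5ad", "config.json") would then execute a single cleaning pass; the file names here are placeholders.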

JSON config file

The operations to perform are defined in a JSON file. A template structure is shown below and is also available in the config_template.json file.

{
	// Template to construct new cell IDs from obs columns. 
	// Any obs column can be accessed using curly braces, e.g. {mycolumn}.
	// Use {index} to access the current obs index value
	"new_cell_id": "{tranche.id}--{tranche.name}--{index}", 
	
	// If true, the index will be cleaned by removing any --[0-9]+$ suffix.
	"clean_index": true,
	
	// A list of obs columns to retain
	"select_obs_columns": [
		"col1",
		"col2"
	],
	
	// A list of obs columns to remove
	"exclude_obs_columns": [
		"col1",
		"col2"
	],

	// If true, the obs column names will be sanitized by removing any special characters and spaces.
	// >= and <= will be replaced with the strings greaterthan and lessthan.
	"sanitize_obs_column_names": true,
	
	// Path to a text file containing barcodes to include in the output h5ad file.
	"subset_bc": "subset_barcodes.txt",

	// Subset the data to keep only cells where obs[column] is in values list
	"subset_on_obs": [
		{
			"column": "tranche.id",
			"values": ["T1","T2","T3"],
			"values_file": "subset_values.txt" // Optional file with one value per line to use as values list
		}
	],

	// List of paths to TSV file(s) with 2 or more columns: cell_id and annotations.
	// Annotation columns will be added to obs as new columns using the cell_id as the key.
	"annot_bc": ["cell_annotations_1.tsv", "cell_annotations_2.tsv"],
	
	// A layer to be used as the default X in the output h5ad file.
	"X_layer": "layer_name",
	"keys": [
		"uns",
		"obsm",
		"raw/X",
		"layers/layer1"
	],

	// A dictionary of "old_name": "new_name" pairs to rename obs columns.
	"rename_columns": {
		"old_name1": "new_name1",
		"old_name2": "new_name2"
	},

	// A list of dictionaries defining TSV files that can be used to annotate samples.
	// The tool will annotate obs with the table_annotation_columns from the TSV file.
	// Merging keys in the input table and obs are defined by table_key_column and obs_key_column, respectively.
	"sample_annotations": [
		{
			"filename": "sample_annotations.tsv",
			"table_key_column": "sample_id",
			"obs_key_column": "sample_id",
			"table_annotation_columns": [
				"tissue",
				"treatment"
			],
			"annotation_name": "myanno1"
		},
		{
			"filename": "another_annotation.tsv",
			"table_key_column": "sample_id",
			"obs_key_column": "sample_id",
			"table_annotation_columns": [
				"ancestry"
			],
			"annotation_name": "myanno2"
		}
	]
}
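
The // comments in the template are explanatory; Python's standard json module rejects them, and the source does not say whether the tool strips them before parsing. A comment-free config can be generated programmatically, as in the sketch below. It assumes that keys which are not needed can simply be omitted, which the template does not state explicitly; render_config is a hypothetical helper.

```python
import json

# A reduced configuration using only keys that appear in the template above.
# Assumption: operations whose keys are absent are skipped by the tool.
config = {
    "new_cell_id": "{tranche.id}--{tranche.name}--{index}",
    "clean_index": True,
    "exclude_obs_columns": ["col1", "col2"],
    "subset_on_obs": [
        {"column": "tranche.id", "values": ["T1", "T2", "T3"]}
    ],
    "rename_columns": {"old_name1": "new_name1"},
}


def render_config(cfg):
    """Serialize the configuration as plain JSON text (no comments)."""
    return json.dumps(cfg, indent=2)
```

Writing the result to disk, for example with pathlib.Path("config.json").write_text(render_config(config)), gives a file that can be passed to --config.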

Order of operations

  1. Clean index (clean_index)
  2. Make new cell ID (new_cell_id)
  3. Add barcode-based annotations (annot_bc)
  4. Rename columns using the rename map provided (rename_columns)
  5. Create new columns based on existing ones as defined in make_columns
  6. Add column-based annotations (sample_annotations)
  7. Filter obs columns based on the select_obs_columns and exclude_obs_columns lists
  8. Sanitize obs column names (sanitize_obs_column_names)
  9. Filter barcodes based on the subset_bc list
  10. Filter based on obs column values from subset_on_obs
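
Because the index is cleaned before the new cell ID is built, {index} in the new_cell_id template refers to the already cleaned value. The first two steps can be sketched in plain Python as follows. This illustrates the documented behaviour only and is not the tool's actual implementation; clean_index and make_cell_id are hypothetical function names, and the row is represented as a simple dict of obs values.

```python
import re

SUFFIX = re.compile(r"--[0-9]+$")        # suffix removed when clean_index is true
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")  # matches {column} placeholders


def clean_index(index):
    """Step 1: drop a trailing --<digits> suffix, if present."""
    return SUFFIX.sub("", index)


def make_cell_id(template, index, row):
    """Step 2: fill {column} placeholders from row and {index} from index.

    str.format is avoided because a name such as 'tranche.id' would be
    parsed as attribute access on 'tranche' rather than as a column name.
    """
    def lookup(match):
        name = match.group(1)
        return index if name == "index" else str(row[name])

    return PLACEHOLDER.sub(lookup, template)


row = {"tranche.id": "T1", "tranche.name": "pilot"}
cleaned = clean_index("AAAC-1--2")
cell_id = make_cell_id("{tranche.id}--{tranche.name}--{index}", cleaned, row)
print(cell_id)  # T1--pilot--AAAC-1
```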
