Welcome to CANAL’s documentation!
Continually adapting pre-trained language model to universal annotation of single-cell RNA-seq data
This is the PyTorch implementation of CANAL, a universal cell-type annotation tool. CANAL starts from a language model pre-trained on a large amount of unlabeled scRNA-seq data and continually fine-tunes it as new well-labeled data become available.

The source code of CANAL is available at https://github.com/aster-ww/CANAL-torch/tree/main/CANAL
Dependencies
CANAL requires the following:
python (3.7 recommended)
torch (1.10.1)
anndata (0.8.0)
scanpy (1.9.1)
pandas (1.3.5)
scikit-learn (1.0.2)
local-attention (1.6.0)
einops (0.6.0)
numpy (1.21.6)
h5py (3.6.0)
CUDA Toolkit and NVIDIA cuDNN (only if using a GPU)
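To confirm that an environment matches the versions listed above, a small check along the following lines may help. This is a minimal sketch and not part of CANAL: the helper name check_versions and the PINNED dictionary are introduced here for illustration, and the keys are assumed to be the PyPI distribution names of the packages listed.

```python
"""Report installed packages whose versions differ from the pins listed above."""
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:
    # Python 3.7 has no importlib.metadata; this needs the importlib_metadata backport.
    from importlib_metadata import PackageNotFoundError, version

# Versions copied from the dependency list; keys are assumed PyPI distribution names.
PINNED = {
    "torch": "1.10.1",
    "anndata": "0.8.0",
    "scanpy": "1.9.1",
    "pandas": "1.3.5",
    "scikit-learn": "1.0.2",
    "local-attention": "1.6.0",
    "einops": "0.6.0",
    "numpy": "1.21.6",
    "h5py": "3.6.0",
}


def check_versions(pins=PINNED, lookup=version):
    """Return {package: installed_version_or_None} for every pin that is not met."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            # Drop local build tags such as "+cu113" that some torch wheels carry.
            found = lookup(name).split("+")[0]
        except PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = found
    return mismatches


if __name__ == "__main__":
    for name, found in check_versions().items():
        print(f"{name}: expected {PINNED[name]}, found {found or 'not installed'}")
```

An empty result means every pinned package is installed at the listed version. The lookup argument exists only so the function can be exercised without the packages installed.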
To apply the CANAL model:
1. Prepare the preprocessed scRNA-seq data: gene_align and normalize in the preprocess module are required to obtain the AnnData objects used as network inputs.
2. Run CANAL at the initial stage: use CANAL_model.train in the model module with current_stage=1. The model is initialized from the checkpoint pre-trained on the Panglao dataset.
3. Run CANAL at the incremental stage: use CANAL_model.train in the model module with current_stage>1. The model is initialized from the model trained at the previous stage.
4. Predict cell types of the test data: use CANAL_model.predict in the model module to obtain the predicted cell types of the test data.
5. Evaluate model performance: if the true cell types of the test data are available, use CANAL_model.evaluation in the model module to evaluate the performance of the current fine-tuned model.
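The initialization rule in the steps above can be illustrated with a short sketch, reading current_stage=1 as the initial stage and any later stage as incremental. It does not call CANAL: the function initial_weights, its arguments, and the checkpoint file names are hypothetical and serve only to show which weights each stage would start from. The actual arguments of CANAL_model.train should be taken from the source code and tutorials.

```python
def initial_weights(current_stage, pretrained_ckpt, stage_ckpt_pattern):
    """Pick the checkpoint that should initialize training at a given stage.

    The function and its path arguments are hypothetical; only the idea of a
    `current_stage` counter comes from the CANAL documentation.
    """
    if current_stage < 1:
        raise ValueError("current_stage starts at 1")
    if current_stage == 1:
        # Initial stage: start from the checkpoint pre-trained on Panglao.
        return pretrained_ckpt
    # Incremental stage: start from the model saved at the previous stage.
    return stage_ckpt_pattern.format(stage=current_stage - 1)


# Placeholder file names, chosen only for this example.
for stage in (1, 2, 3):
    print(stage, initial_weights(stage, "pretrained.pth", "canal_stage{stage}.pth"))
```

Keeping one saved model per stage, as assumed here, makes it possible to resume the data stream from any earlier stage.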
There are four examples in the Tutorial to run CANAL:
Tutorial 1: Preprocess the raw scRNA-seq datasets
Tutorial 2: Run CANAL with data stream from various batches
Tutorial 3: Run CANAL with data stream from different tissues
Tutorial 4: Apply CANAL on test data with novel cells
Hyper-parameters
lambda: default 0.1, the strength of representation distillation loss
L: default 1000, the size of example bank
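As a rough illustration of what these two hyper-parameters control, the sketch below assumes that the distillation term is added to the annotation loss with weight lambda, and that the example bank is a buffer capped at L items. Both functions are hypothetical. In particular, the uniform sampling in trim_bank is only a placeholder, since the rule CANAL uses to select examples is not described here. Because lambda is a reserved word in Python, the sketch uses the name lam; the argument name CANAL actually expects should be checked in the source code.

```python
import random


def total_loss(annotation_loss, distillation_loss, lam=0.1):
    """Weighted sum: annotation loss plus lam times the distillation loss (assumed form)."""
    return annotation_loss + lam * distillation_loss


def trim_bank(examples, L=1000, seed=0):
    """Keep at most L examples; uniform sampling is a placeholder policy."""
    if len(examples) <= L:
        return list(examples)
    return random.Random(seed).sample(list(examples), L)


# Larger lam puts more weight on staying close to earlier representations.
print(total_loss(1.0, 2.0, lam=0.1))
# A bank fed 5000 candidates is cut down to the capacity L.
print(len(trim_bank(list(range(5000)), L=1000)))
```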
Data Availability
| Link | Description |
|---|---|
| https://drive.google.com/drive/folders/1BMf-N-k-3aCEY7CJvUcK9nZZ2UD7p3C0?usp=sharing | Datasets of the pancreas experiments |
| https://drive.google.com/drive/folders/1CaBySV_EFAPPrlpSevEewFds5cjJxC_T?usp=sharing | Datasets of the cross-tissue experiments |
| https://drive.google.com/drive/folders/1OGMWxR7qTWd_p21d57EyNWv5X48BNN0M?usp=sharing | Datasets of the human immune experiments |
Details and download links for the pre-trained model checkpoint, the gene2vec embedding, and the Panglao dataset used for pre-training are available at: https://github.com/TencentAILabHealthcare/scBERT
If you have any questions, please contact: wanhui1997@pku.edu.cn