Source code for Weakly-Supervised Methods for Suicide Risk Assessment: Role of Related Domains (ACL 2021).
Due to ethical concerns, we can release neither the data nor the checkpoints. Please follow the guidelines of the UMD Suicidality Dataset to obtain the required approvals and access to the data.
Please cite our paper if you find it helpful.
@inproceedings{yang2021weakly,
  title={Weakly-Supervised Methods for Suicide Risk Assessment: Role of Related Domains},
  author={Chenghao Yang and Yudong Zhang and Smaranda Muresan},
  booktitle={Proceedings of ACL},
  year={2021}
}
We recommend using Anaconda to set up the environment:
conda create --name <env> --file requirements.txt
After installing the required dependencies, you also need to download the necessary data files for the nltk library:
import nltk
nltk.download("popular")
- After you have obtained the UMD Suicidality data, extract it and move it into this project directory. The resulting directory structure should look like this:
WM-SRA/
    umd_reddit_suicidewatch_dataset_v2/
        umd_reddit_suicidewatch_dataset_v2/
            crowd/
            expert/
            scripts/
    other_files_in_this_project
- Move `data_generator.py` under `umd_reddit_suicidewatch_dataset_v2/umd_reddit_suicidewatch_dataset_v2/`, then run the following line to extract the necessary information from the large CSV files.
python data_generator.py
- Take a look at `config.py`. If you do not want to use pseudo-labelling, set `self.use_PL=False`. Otherwise, take a look at `data_generator.py` to see how we create the pkl files used for training, then prepare your pseudo-labelling data in the same format (a sketch follows the training command below). (Due to ethical concerns, we cannot release our pseudo-labelling data either.)
- Run the following line to start training and evaluation. Our code runs evaluation after every epoch and only saves the checkpoint with the best macro-F1 on the validation set.
python main.py --task A
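For reference, here is a minimal sketch of preparing a pseudo-labelling pkl file. The schema and file name are assumptions for illustration only; mirror the exact format produced by `data_generator.py`.

import pickle

# Hypothetical schema and file name: mirror whatever data_generator.py
# actually produces before training with self.use_PL=True.
pl_examples = [
    ("example post text ...", "a"),  # assumed (text, risk-label) pairs
    ("another post text ...", "c"),
]

with open("pl_data.pkl", "wb") as f:
    pickle.dump(pl_examples, f)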
- We use the pre-processing code from hate-speech-and-offensive-language.
- We use bert-extractive-summarizer to perform extractive summarization over the data, which is used in multi-view learning ("K-Sum" in our paper); basic usage is sketched below.
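For reference, basic usage of bert-extractive-summarizer looks roughly like this (the input text and the `ratio` value are placeholders, not settings from our paper):

from summarizer import Summarizer

model = Summarizer()
post = "A long user post to be condensed into its most salient sentences ..."
# ratio controls how much of the post is kept; the value here is a placeholder.
summary = model(post, ratio=0.3)
print(summary)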
- We tried doing pseudo-labelling (PL) on the fly, using either the model's predictions or random sampling to decide the labels. We even designed a mechanism that combines the two strategies (i.e., with some probability we use the first, and otherwise the second; see the sketch below). Unfortunately, none of these efforts worked.
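A minimal sketch of the mixed strategy, assuming a single example's logits and a list of candidate label ids (all names and the mixing probability are illustrative):

import random
import torch

def assign_pseudo_label(logits, label_ids, p_model=0.5):
    """Mixed pseudo-labelling sketch: with probability p_model trust the
    model's argmax prediction, otherwise sample a label uniformly at random."""
    if random.random() < p_model:
        return int(torch.argmax(logits).item())
    return random.choice(label_ids)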
- We tried contrastive learning, but it did not work either; the general idea is sketched below.
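Here is a sketch of a supervised contrastive loss (a standard formulation, not our exact implementation): embeddings that share a label are pulled together, all others are pushed apart.

import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss sketch over a batch of (N, d) embeddings."""
    z = F.normalize(embeddings, dim=1)               # unit-norm embeddings
    sim = z @ z.t() / temperature                    # pairwise similarities
    mask = labels.unsqueeze(0) == labels.unsqueeze(1)
    mask.fill_diagonal_(False)                       # positives exclude self-pairs
    logits = sim - torch.eye(len(z), device=z.device) * 1e9  # mask the diagonal
    log_prob = F.log_softmax(logits, dim=1)
    # average log-probability of positives per anchor; skip anchors with no positive
    pos_counts = mask.sum(1)
    valid = pos_counts > 0
    loss = -(log_prob * mask)[valid].sum(1) / pos_counts[valid]
    return loss.mean()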
- To encourage more exploration, we even implemented an annealed softmax in early versions of our code, but it did not work (sketched below).
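By annealed softmax we mean dividing the logits by a temperature that decays over training. A sketch with an illustrative linear schedule (all hyperparameters are placeholders):

import torch.nn.functional as F

def annealed_softmax(logits, step, t_start=5.0, t_end=1.0, total_steps=10000):
    """Start with a high temperature (flatter distribution, more exploration)
    and decay it linearly toward 1 over training."""
    frac = min(step / total_steps, 1.0)
    temperature = t_start + (t_end - t_start) * frac
    return F.softmax(logits / temperature, dim=-1)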
- We tried various model architectures, including adding extra attention layers and more complicated designs such as a Transformer-RNN (sketched below), but none of them worked.
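As a rough idea of the Transformer-RNN variant (a sketch under assumed hyperparameters, not our exact architecture): feed BERT's token representations into a recurrent layer before classification.

import torch
import torch.nn as nn
from transformers import BertModel

class TransformerRNN(nn.Module):
    """Sketch: BERT token states -> bidirectional GRU -> linear head."""
    def __init__(self, num_labels, hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.rnn = nn.GRU(self.bert.config.hidden_size, hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        states = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        _, h = self.rnn(states)                   # h: (2, batch, hidden)
        pooled = torch.cat([h[0], h[1]], dim=-1)  # concat final states of both directions
        return self.head(pooled)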
- Pre-processing is important.
- Because this dataset is relatively small, even a single-point gain can be significant, so tune the proportion of added pseudo-labelled data very carefully (see the sketch below).
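Illustratively, mixing in pseudo-labelled data at a tuned fraction might look like this (all names and values are placeholders, not our settings):

import random

# gold_examples / pl_examples stand in for your real data.
gold_examples = [("gold post ...", "b")]
pl_examples = [("pseudo-labelled post ...", "a")] * 10

pl_ratio = 0.3  # the fraction to tune; small changes can matter on this dataset
pl_subset = random.sample(pl_examples, k=int(len(pl_examples) * pl_ratio))
train_data = gold_examples + pl_subset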