This repo is a work in progress. The project currently builds on the transformer implementation from HuggingFace and the NTM implementation from @YongfeiYan. We plan to release the checkpoints in the near future and are still actively working on this repo. Thank you!
This repository is the artifact associated with our paper Topic-Aware Abstractive Text Summarization. In the paper, we propose a topic-aware abstractive summarization (TAAS) framework that leverages the underlying semantic structure of documents, represented by their latent topics.
Note: our work is built on top of HuggingFace transformers. There are two key components in our paper: a topic modeling component and a summarization component (an illustrative sketch of how the two might be combined follows the list below).
- For the summarization component, our code is built on top of the summarization pipeline from HuggingFace. Their example code is available at https://github.com/huggingface/transformers/tree/master/examples/seq2seq.
- For the topic modeling component, we modified the code from https://github.com/YongfeiYan/Neural-Document-Modeling.
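Purely as an illustration of the general idea (not the repository's actual implementation), the sketch below shows one way an NTM-style document-topic vector could be projected and added to the encoder states of a seq2seq summarizer. The module names, shapes, and the additive fusion are all assumptions made for this example.

```python
# Illustrative sketch only -- NOT the exact fusion used in the paper.
import torch
import torch.nn as nn

class TopicEncoder(nn.Module):
    """Maps a bag-of-words document vector to a latent topic distribution."""
    def __init__(self, vocab_size: int, num_topics: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vocab_size, hidden), nn.ReLU(),
            nn.Linear(hidden, num_topics),
        )

    def forward(self, bow: torch.Tensor) -> torch.Tensor:
        # bow: (batch, vocab_size) -> (batch, num_topics)
        return torch.softmax(self.mlp(bow), dim=-1)

class TopicFusion(nn.Module):
    """Projects the topic vector and adds it to each encoder hidden state."""
    def __init__(self, num_topics: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(num_topics, d_model)

    def forward(self, encoder_states: torch.Tensor, topics: torch.Tensor) -> torch.Tensor:
        # encoder_states: (batch, seq_len, d_model); topics: (batch, num_topics)
        return encoder_states + self.proj(topics).unsqueeze(1)

# Toy usage with random tensors (shapes are hypothetical).
topics = TopicEncoder(vocab_size=30522, num_topics=50)(torch.rand(2, 30522))
fused = TopicFusion(num_topics=50, d_model=1024)(torch.randn(2, 512, 1024), topics)
```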
We use the CNN/Daily Mail dataset in our paper. You can download it using the link provided by HuggingFace:
cd data
wget https://cdn-datasets.huggingface.co/summarization/cnn_dm_v2.tgz
tar -xzvf cnn_dm_v2.tgz
mv cnn_cln cnndm
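After extracting and renaming, the split files should live under data/cnndm. Below is a quick, optional sanity check; the train/val/test .source/.target naming follows the HuggingFace seq2seq convention and is an assumption about the archive layout.

```python
# Optional sanity check that the extracted split files are in place.
from pathlib import Path

data_dir = Path("data/cnndm")
for split in ("train", "val", "test"):
    for side in ("source", "target"):
        path = data_dir / f"{split}.{side}"
        if path.exists():
            n_lines = sum(1 for _ in path.open(encoding="utf-8"))
            print(f"{path}: OK ({n_lines} lines)")
        else:
            print(f"{path}: MISSING")
```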
Tested with Python 3.7 in a virtual environment. Clone the repo, go to the repo folder, set up the virtual environment, and install the required packages:

$ python3.7 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

To train TAAS, execute the following command:
DATA_DIR=./data/cnndm/
SUFFIX=taas
OUTPUT_DIR=./log/$SUFFIX
./finetune.sh \
--data_dir $DATA_DIR \
--train_batch_size 32 \
--eval_batch_size 32 \
--output_dir $OUTPUT_DIR \
--num_train_epochs 100 \
--model_name_or_path sshleifer/distilbart-cnn-12-6 \
--save_top_k 5 \
--early_stopping_patience 50 \
--warmup_steps 10

You can use the trained checkpoint to generate summaries on the test set.
export DATA_DIR=./data/cnndm
SUFFIX=taas
OUTPUT_FILE=./output/cnndm/$SUFFIX/output-$SUFFIX.txt
OUTPUT_DIR=./log/$SUFFIX
python run_eval.py sshleifer/distilbart-cnn-12-6 $DATA_DIR/test.source $OUTPUT_FILE \
--reference_path $DATA_DIR/test.target \
--task summarization \
--device cuda \
--fp16 \
--bs 32 \
--finetune_flag 1 \
--checkpoint_path $OUTPUT_DIR
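run_eval.py generates one summary per line of $DATA_DIR/test.source and writes them to $OUTPUT_FILE. Purely for illustration, the sketch below reproduces that loop with the plain transformers API and the base distilbart checkpoint; it does not load the fine-tuned TAAS checkpoint or the topic component, so use run_eval.py for the actual evaluation run.

```python
# Minimal sketch of line-by-line generation with the base distilbart model.
import torch
from pathlib import Path
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "sshleifer/distilbart-cnn-12-6"  # base model, not the TAAS checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).eval()

out_path = Path("output/cnndm/taas/output-taas.txt")
out_path.parent.mkdir(parents=True, exist_ok=True)

with open("data/cnndm/test.source", encoding="utf-8") as src, \
     out_path.open("w", encoding="utf-8") as out:
    for line in src:
        batch = tokenizer(line.strip(), truncation=True, max_length=1024,
                          return_tensors="pt")
        with torch.no_grad():
            ids = model.generate(**batch, num_beams=4, max_length=142)
        out.write(tokenizer.decode(ids[0], skip_special_tokens=True) + "\n")
```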
We use the ROUGE score as the evaluation metric in our paper.

SUFFIX=taas
OUTPUT_FILE=./output/cnndm/$SUFFIX/output-$SUFFIX.txt
SCORE_FILE=./output/cnndm/$SUFFIX/score-$SUFFIX.txt
python evaluate.py --generated $OUTPUT_FILE --golden ./data/cnndm/org_data/test.target > $SCORE_FILE
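evaluate.py produces the aggregate scores reported below. As an independent, optional sanity check of individual outputs, per-example ROUGE can also be computed with the rouge_score package; this is an assumption for illustration and not necessarily the scorer evaluate.py uses, so numbers may differ slightly.

```python
# Per-example ROUGE sanity check with the rouge_score package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

with open("output/cnndm/taas/output-taas.txt", encoding="utf-8") as gen, \
     open("data/cnndm/org_data/test.target", encoding="utf-8") as ref:
    for prediction, reference in zip(gen, ref):
        scores = scorer.score(reference.strip(), prediction.strip())
        print({k: round(v.fmeasure, 4) for k, v in scores.items()})
        break  # remove to score the whole test set
```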
The best performance we have achieved so far is:

1 ROUGE-1 Average_R: 0.48810 (95%-conf.int. 0.48545 - 0.49062)
1 ROUGE-1 Average_P: 0.42147 (95%-conf.int. 0.41874 - 0.42408)
1 ROUGE-1 Average_F: 0.44058 (95%-conf.int. 0.43839 - 0.44281)
---------------------------------------------
1 ROUGE-2 Average_R: 0.23262 (95%-conf.int. 0.22988 - 0.23522)
1 ROUGE-2 Average_P: 0.20166 (95%-conf.int. 0.19915 - 0.20415)
1 ROUGE-2 Average_F: 0.21017 (95%-conf.int. 0.20769 - 0.21257)
---------------------------------------------
1 ROUGE-L Average_R: 0.45306 (95%-conf.int. 0.45039 - 0.45556)
1 ROUGE-L Average_P: 0.39123 (95%-conf.int. 0.38858 - 0.39384)
1 ROUGE-L Average_F: 0.40899 (95%-conf.int. 0.40667 - 0.41120)
We present summaries of the same CNN/DM article generated by TAAS and other baseline models in the example folder.