This is the repo for the Granular AMR Parsing Evaluation Suite (GrAPES). Our paper "AMR Parsing is Far from Solved: GrAPES, the Granular AMR Parsing Evaluation Suite" was published in the EMNLP 2023 proceedings.
GrAPES provides specialised evaluation metrics and additional data. Throughout the documentation, we distinguish between the AMR 3.0 testset (which you probably already have) and the GrAPES testset, which is our additional data, housed in the corpus/subcorpora folder.
GrAPES requires the Python packages penman, prettytable, statsmodels, smatch, and cryptography>=3.1:
pip install prettytable penman statsmodels smatch "cryptography>=3.1"
GrAPES has been tested with Python 3.8.10 and 3.10.13.
GrAPES relies on three sources of data: A) our original data, B) the AMR testset, and C) original data based on external licensed corpora (A and C form the GrAPES testset). GrAPES evaluation can be run on all of them together to obtain scores for all categories, or on each separately to obtain scores for only the corresponding categories. (A) requires no additional setup, but (B) and (C) do; see below.
For the evaluation stage (if you want to include the AMR-testset-based categories of GrAPES), GrAPES needs the AMR 3.0 testset concatenated into a single file (specifically, with the files concatenated in alphabetical order). You can obtain such a concatenation with this script:
python concatenate_amr_files.py path/to/original/AMR/testset concatenated/testset/file/name
The file concatenated/testset/file/name will be created by this script, and in all the documentation below, concatenated/testset/file/name refers to that file.
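If you cannot run the script or just want to see what it amounts to, the concatenation is roughly the following sketch (a minimal approximation, not the script itself; it assumes your testset directory contains the per-genre .txt files, so adjust the pattern to your copy of AMR 3.0):

```python
# Rough sketch of the concatenation, standard library only.
# Assumption: the input directory contains the AMR 3.0 test .txt files.
import glob
import sys

test_dir, out_path = sys.argv[1], sys.argv[2]

with open(out_path, "w", encoding="utf8") as out:
    for path in sorted(glob.glob(f"{test_dir}/*.txt")):  # alphabetical order
        with open(path, encoding="utf8") as f:
            out.write(f.read().strip() + "\n\n")  # blank line between files
```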
For licensing reasons, two of the GrAPES categories (Unbounded Dependencies and Word Ambiguities (handcrafted)) are only available if you also have the necessary licenses. You can use GrAPES without that data and skip this setup step, but two categories will be missing. To obtain the full GrAPES corpus, use the following instructions:
The Unbounded Dependencies category is built from Penn Treebank sentences. If you have access to the Penn Treebank, the following script will add them to the existing GrAPES corpus.txt file, where <ptb_pos_path> refers to the location of all the POS-tagged files in the PTB (in version 2 of the PTB, this is the tagged subfolder; in version 3 it is tagged/pos).
python complete_the_corpus.py -ptb <ptb_pos_path>
Troubleshooting: The PTB sentences are stored encrypted in corpus/copyrighted_data. A version of the TSV file without the sentences is also provided in corpus/unbounded_dependencies_stripped.tsv; if you have the PTB but the decryption fails, the annotations can probably be reconstructed from it, since the IDs include the PTB filenames.
Twelve of the sentences in the Word Ambiguities (handcrafted) category are AMR 3.0 test set sentences. To add them to the GrAPES corpus.txt file, run the following script, where <amr_test_path> refers to the concatenated AMR 3.0 test set from step B:
python complete_the_corpus.py -amr <amr_test_path>
Troubleshooting: The file without the copyrighted sentences is corpus/word_ambiguities_clean.txt. To create the right files manually, replace each (removed -- see documentation) placeholder with the corresponding sentence (the IDs are given in the file), save the result as corpus/subcorpora/word_disambiguation.txt, and add the entries to corpus/corpus.txt.
The evaluation scripts use two corpus files: the AMR 3.0 testset and the GrAPES testset provided here (possibly extended in step C above). To use GrAPES, you need to generate parser output on both of those datasets. For each dataset, generate one file of AMRs as you would for computing Smatch, i.e. with the AMRs in the same order as the input corpus and separated by blank lines (the standard AMR corpus format, readable by the penman package; we only need the graphs, no metadata such as IDs is required).
For the GrAPES testset, simply run your parser on corpus/corpus.txt (this file was possibly extended from the version in this repo in setup step C).
For the AMR 3.0 test set, you may already have such an output file. If not, run your parser on the concatenated/testset/file/name file created during setup step (B).
If you want to evaluate only on a single category, running your parser on one of the files in corpus/subcorpora may be sufficient.
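As an optional sanity check before evaluating, you can confirm that a prediction file parses and is aligned with its gold corpus using the penman package. The paths below are placeholders:

```python
# Optional sanity check: the prediction file should parse and contain
# exactly one graph per gold graph, in the same order. Paths are placeholders.
import penman

predictions = penman.load("path/to/parser/output/GrAPES/corpus.txt")
gold = penman.load("corpus/corpus.txt")

print(f"{len(predictions)} predicted graphs, {len(gold)} gold graphs")
assert len(predictions) == len(gold), "prediction file is not aligned with the gold corpus"
```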
To run the full evaluation suite, run the following:
python evaluate_all_categories.py -gt path/to/AMR/testset -pt path/to/parser/output/AMR/testset -pg path/to/your/parser/output/GrAPES/corpus.txt
Where the arguments are:
- -gt: path to your copy of the AMR testset
- -pt: path to your parser output for the AMR testset
- -pg: path to your parser output on the GrAPES corpus in corpus/corpus.txt. The script will automatically detect whether you've added the PTB and AMR testset sentences in setup step C.
- If your GrAPES gold file corpus.txt is not at corpus/corpus.txt, add the argument -gg path/to/your/gold/corpus.txt
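For example, if your (possibly extended) gold corpus lives somewhere else, the call might look like this (all paths are placeholders):

python evaluate_all_categories.py -gt path/to/AMR/testset -pt path/to/parser/output/AMR/testset -pg path/to/parser/output/GrAPES/corpus.txt -gg path/to/your/gold/corpus.txt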
You can also evaluate on only the AMR testset, or only the GrAPES testset, simply by leaving out the other parameters.
AMR 3.0 testset only:
python evaluate_all_categories.py -gt path/to/AMR/testset -pt path/to/parser/output/AMR/testset
GrAPES testset only:
python evaluate_all_categories.py -pg path/to/your/parser/output/GrAPES/corpus.txt
Additional options include:
- --parser_name: name your parser for more specific output file naming
- --smatch: run Smatch on all subcategories (slow)
- --error_analysis: write the graph IDs of successes and failures to pickled dictionaries
- --all_metrics: also print Smatch results on the Structural Generalisation categories and unlabelled edge recall on appropriate categories (not included in the GrAPES paper)
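For example, a full run that names the parser and stores error-analysis pickles might look like this (the parser name and all paths are placeholders):

python evaluate_all_categories.py -gt path/to/AMR/testset -pt path/to/parser/output/AMR/testset -pg path/to/parser/output/GrAPES/corpus.txt --parser_name my_parser --error_analysis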
If you don't have AMR 3.0:
- Use only the GrAPES corpus
- The evaluation script will automatically leave out the Word Disambiguation category (which contains some AMR testset sentences)
If you don't have PTB:
- The evaluation script will automatically leave out the Unbounded Dependencies category
To evaluate on just one of the 36 categories, use the evaluate_single_category.py script, giving the name of the category to evaluate (-c) and the path to the relevant prediction file (-p).
Category names are listed below. The "relevant" predictions file is your parser's output on the corresponding corpus: the AMR testset, the GrAPES corpus.txt file, or, if you prefer, the relevant GrAPES subcorpus file, such as adjectives.txt.
If your gold corpus files are not in corpus/corpus.txt and corpus/subcorpora, include the path to the gold file with the option -g.
For example, to evaluate on the category Multiple Adjectives, which is a GrAPES-only category, either of the following will work:
python evaluate_single_category.py -c multiple_adjectives -p path/to/parser/full/grapes/output/file
python evaluate_single_category.py -c multiple_adjectives -p path/to/parser/output/for/adjectives.txt
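If your gold files are not in the default locations (see the -g option above), you can point to the gold file explicitly; for example, with placeholder paths:

python evaluate_single_category.py -c multiple_adjectives -p path/to/parser/output/for/adjectives.txt -g path/to/your/gold/adjectives.txt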
To evaluate an AMR testset category, e.g. the Rare Senses category, run the following:
python evaluate_single_category.py -c rare_predicate_senses_excl_01 -p path/to/parser/AMR/testset/output
These are also listed if you use the --help option.
-------------------------------
AMR 3.0 testset category names:
-------------------------------
pragmatic_coreference_testset
syntactic_gap_reentrancies
unambiguous_coreference
rare_node_labels
unseen_node_labels
rare_predicate_senses_excl_01
rare_edge_labels_ARG2plus
seen_names
unseen_names
seen_dates
unseen_dates
other_seen_entities
other_unseen_entities
types_of_seen_named_entities
types_of_unseen_named_entities
seen_andor_easy_wiki_links
hard_unseen_wiki_links
frequent_predicate_senses_incl_01
passives
unaccusatives
ellipsis
multinode_word_meanings
imperatives
----------------------
GrAPES category names:
----------------------
pragmatic_coreference_winograd
nested_control_and_coordination
nested_control_and_coordination_sanity_check
multiple_adjectives
multiple_adjectives_sanity_check
centre_embedding
centre_embedding_sanity_check
cp_recursion
cp_recursion_sanity_check
cp_recursion_plus_coreference
cp_recursion_plus_coreference_sanity_check
cp_recursion_plus_rc
cp_recursion_plus_rc_sanity_check
cp_recursion_plus_rc_plus_coreference
cp_recursion_plus_rc_plus_coreference_sanity_check
long_lists
long_lists_sanity_check
unseen_predicate_senses_excl_01
unseen_edge_labels_ARG2plus
word_ambiguities_handcrafted
word_ambiguities_karidi_et_al_2021
pp_attachment
unbounded_dependencies
The evaluation classes used by each category are in evaluation/full_evaluation/category_evaluation/.
You can use evaluation/full_evaluation/run_full_evaluation.py if you set yourself up as follows:
- In data/raw/gold, place a copy of your concatenated AMR 3.0 testset and call it test.txt
- For each parser:
  - Choose a name, e.g. "my_parser"
  - Create a directory in data/processed/parser_outputs called my_parser-outputs
  - Place all output files here:
    - the output on the full GrAPES corpus as full_corpus.txt
    - any single-category output files
    - the output on the AMR 3.0 testset as testset.txt
- For Python path reasons, running this as a script can be hard. You have (at least) two choices:
  - Edit the parser_names variable at the top of the file to be your parser names, and just run the file from within your IDE
  - Run it as a script from its folder, with the Python path set to two directories up (../..). For each parser you want to include, add a command line argument. For example:
PYTHONPATH=../../ python run_full_evaluation.py amparser amrbart
Similarly, with that setup you can go all the way from parser names to full results stored in pickles and CSV files, Vulcan-readable pickles for error analysis, and LaTeX (compiled to PDF) tables you might want in your paper, using the bash script scripts/parser_outs2latex_and_vulcan.sh. (If the global run_all_smatch in run_full_evaluation.py is set to True, this will take a few minutes, because it runs Smatch on every subcategory.)
bash parser_outs2latex_and_vulcan.sh parser1_name parser2_name parser3_name
The appendix of the paper (also in documents/grapes.pdf) provides extensive details for each of the 36 categories.
scripts/latex/csv2latex.py converts CSV outputs from run_full_evaluation or evaluate_all_categories to a LaTeX table for as many parsers as you want. We used this for the full tables in the paper. You'll need in your preamble:
\usepackage{longtable}
\usepackage{xcolor, colortbl}
\definecolor{lightlightlightgray}{gray}{0.95}
\newcommand{\successScore}[4]{#1 \scriptsize\textcolor{gray}{#4[#2,#3]}}
You probably want to print the whole table with \small in front. Results column names are taken from the CSV filenames. There is a sandbox.tex file in data/processed/latex/ (the script's default output location) that you can use to check the outputs.
You may find Vulcan helpful for looking at your parser output and comparing it to the gold graph, when available. Git clone the Vulcan repository and install its dependencies.
From your GrAPES main folder, create pickles of the data. This works for any pair of files with predicted and gold graphs in the same order.
See the help instructions for create_vulcan_pickle.py for more details. To get Vulcan-readable pickles of gold and predicted graphs for a category, use:
python create_vulcan_pickle.py -p path/to/prediction/file -g path/to/gold/file -o path/to/output.pickle -c category_name
If you want them further split by whether they were correct or not, make sure you've run the evaluation with the -e flag, which stores the graph IDs of correct and incorrect outputs according to each evaluation criterion. Then you can run create_vulcan_pickle.py with the -e flag, and it will create separate pickles for correct and incorrect graphs for each criterion. ((in)correct_id are for the basic criteria.)
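For example, a run with error analysis followed by pickle creation might look like the following (the paths are illustrative, and the category name is one of those listed above):

python evaluate_single_category.py -c multiple_adjectives -p path/to/parser/output/for/adjectives.txt -e

python create_vulcan_pickle.py -p path/to/parser/output/for/adjectives.txt -g corpus/subcorpora/adjectives.txt -o error_analysis/multiple_adjectives.pickle -c multiple_adjectives -e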
You can then view the graphs and sentences side-by-side with Vulcan from your Vulcan folder (not from your GrAPES folder!):
python launch_vulcan.py path/to/pickle
- All provided corpus files are in corpus/, including the main file corpus.txt.
- All required Python scripts are at the root level.
- The evaluation modules are in evaluation/.
- Code that was used for the paper (but that you don't need to use) is also included.
- You may find that running scripts that are not at the root level gives you PYTHONPATH trouble. On Mac and Linux, try prepending PYTHONPATH=./ to the command, as in the example below; on Windows, try adding the parent directory to the Python Path environment variable.
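For example, to run the multi-parser evaluation script from the repository root on Mac or Linux (my_parser is a placeholder; whether a given non-root script needs further arguments depends on the script):

PYTHONPATH=./ python evaluation/full_evaluation/run_full_evaluation.py my_parser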
GrAPES
├── evaluate_all_categories.py # main script
├── evaluate_single_category.py # main script for 1 category
├── concatenate_amr_files.py # for setup
├── complete_the_corpus.py # for setup
├── create_vulcan_pickle.py # for visualising predicted/gold pairs
├── corpus # all GrAPES corpus files, including TSV files used for evaluation
│ ├── subcorpora # all GrAPES AMR files (AMR test set not included)
│ └── corpus.txt # the full concatenated GrAPES corpus (AMR test set not included)
├── LICENSE
├── README.md
├── docker-compose # Docker compose files for AM parser and AMRBART
├── error_analysis # a good place for Vulcan pickles
│ └── README.md
├── documents
│ └── grapes.pdf # the paper, including detailed appendix re categories
├── evaluation # all evaluation modules
│ ├── full_evaluation # full evaluation modules
│ │ ├── category_evaluation # category evaluation modules
│ │ │ ├── subcategory_info.py # defines dataclass to store info about each subcategory for evaluation
│ │ │ ├── category_metadata.py # subcategory info by category
│ │ │ ├── category_evaluation.py # abstract class
│ │ │ ├── edge_recall.py # edge recall evaluation class
│ │ │ ├── pp_attachment.py # PP attachment evaluation class
│ │ │ └── etc... # more classes for specific evaluations
│ │ ├── corpus_statistics.py
│ │ ├── run_full_evaluation.py # runs evaluations on multiple parsers (used for paper)
│ │ └── wilson_score_interval.py
│ └── corpus_metrics.py
├── grammars # Alto grammars for structural generalisation
├── scripts
│ ├── parser_outs2latex_and_vulcan.sh # from parser names to full displayed and saved results (see documentation)
│ ├── file_manipulations # various scripts for changing files
│ ├── latex # converts CSV outputs from evaluate_all_categories and run_full_evaluation to a LaTeX table
│ └── preprocessing # preprocessing scripts for AM parser and AMRBART
├── data
│ ├── raw # a good place for a copy of the gold AMR testset (e.g. data/raw/gold/test.txt)
│ ├── processed # a good place for parser outputs
│ │ └── results # evaluation scripts save outputs here
└── amrbank_analysis # various scripts and modules used in the creation of GrAPES
Authors: Jonas Groschwitz, Shay B. Cohen, Lucia Donatelli, & Meaghan Fowlie
This work builds on (and contains parts of) the Winograd Schema Challenge, which is published under the CC BY 4.0 license.
This work also builds on the Putting Words into BERT's Mouth corpus.