
multiLing ST Annotations

  • multiLing_STannotations.txt file →

    Download

The multiLing_STannotations.txt file is a tab-delimited table containing token-level annotations for the multiLing source texts in the CRITT TPR-DB. These annotations were created by Haruka Ogawa, Devin Gilbert, and Samar Almazroei for the following publication: "redBird: Rendering Entropy Data and ST-Based Information into a Rich Discourse on Translation: Investigating relationships between MT output and human translation," a chapter in the book "Explorations in Empirical Translation Process Research," edited by Michael Carl and published in the Springer book series "Machine Translation: Technologies and Applications" (series editor Andy Way).

The "Text" and "Id" columns contain integers: the number of the text each token comes from and the token's index (1–n) within that text. These values are identical to what you will find in any TPR-DB .st table. The "Text_Id" column contains strings that concatenate the values of the first two columns.
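Because "Text" and "Id" match the values in TPR-DB .st tables, the pair can serve as a lookup key when combining the annotations with other tables. The following is a minimal sketch of that idea using only the Python standard library. The sample rows are invented for illustration, only a subset of columns is shown, and the underscore inside the "Text_Id" values is an assumption about the concatenation format, not something stated here.

```python
import csv
import io

# Invented sample rows; the real file has more columns and many more tokens.
# The underscore in Text_Id is an ASSUMPTION about the concatenation format.
SAMPLE = (
    "Text\tId\tText_Id\tSToken\n"
    "1\t1\t1_1\tKilling\n"
    "1\t2\t1_2\tnurse\n"
    "2\t1\t2_1\tFamilies\n"
)


def load_tokens(handle):
    """Read tab-delimited rows and convert Text and Id to integers."""
    # QUOTE_NONE keeps quotation-mark tokens from being treated as CSV quoting.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        row["Text"] = int(row["Text"])
        row["Id"] = int(row["Id"])
        rows.append(row)
    return rows


tokens = load_tokens(io.StringIO(SAMPLE))

# Key each token by (Text, Id), the pair shared with TPR-DB .st tables.
by_key = {(t["Text"], t["Id"]): t for t in tokens}
print(by_key[(1, 2)]["SToken"])
```

To read the actual file, the `io.StringIO(SAMPLE)` argument could be replaced with a handle from `open("multiLing_STannotations.txt", encoding="utf-8", newline="")`; the encoding is likewise an assumption.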

All remaining columns contain strings. "SToken" is the source token and is identical to what you would find in a TPR-DB .st table. "wordClass" is a more general part-of-speech category based on the "PoS" column. "new_PoS" is the corrected part-of-speech tag (see the above-mentioned study for more details). "old_PoS" is the former PoS tag that was automatically generated by the TPR-DB NLP chain; it matches "new_PoS" except where the researchers corrected a tag.
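Since "old_PoS" and "new_PoS" differ only where a tag was corrected, comparing the two columns row by row should recover the corrected tokens. A minimal sketch of that comparison follows; the sample rows and their tags are invented for illustration and do not come from the actual file.

```python
import csv
import io

# Invented sample rows with hypothetical tags; only a subset of columns is shown.
SAMPLE = (
    "Text\tId\tSToken\tnew_PoS\told_PoS\n"
    "1\t1\tkilling\tVBG\tNN\n"
    "1\t2\tnurse\tNN\tNN\n"
    "1\t3\treceives\tVBZ\tVBZ\n"
)


def corrected_rows(handle):
    """Return the rows whose automatic tag (old_PoS) was changed (new_PoS)."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row["new_PoS"] != row["old_PoS"]]


changed = corrected_rows(io.StringIO(SAMPLE))
for row in changed:
    print(row["Text"], row["Id"], row["SToken"], row["old_PoS"], "->", row["new_PoS"])
```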

Most of the remaining columns ("Figurative", "Passive", "Anaphora") contain annotations that are described in the above-mentioned study. These annotations are binary: each cell contains either "Other" or a keyword indicating that the column's annotation category applies to the token in that row. The exception is the "Figurative" column, which contains "Metaphorical" for metaphorical expressions and "Fixed" for fixed (i.e., idiomatic) expressions. The only annotation column not discussed in the study is "adjNouns", which marks any token that is an adjectival noun with "AdjN". In all of these annotation columns, "Other" means that the annotation category does not apply to that row.
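Because "Other" always means that a category does not apply, annotated tokens can be found by keeping every value that is not "Other". The sketch below counts such values per annotation column. The sample rows are invented, and only the "Figurative" and "adjNouns" columns are included, since their keywords ("Metaphorical", "Fixed", "AdjN") are the ones named above.

```python
import csv
import io
from collections import Counter

# Invented sample rows; only two of the annotation columns are shown.
SAMPLE = (
    "Text\tId\tSToken\tFigurative\tadjNouns\n"
    "1\t1\tkick\tFixed\tOther\n"
    "1\t2\tbucket\tFixed\tOther\n"
    "1\t3\thospital\tOther\tAdjN\n"
    "1\t4\tstaff\tOther\tOther\n"
)


def annotation_counts(handle, columns):
    """Count every non-"Other" value in each of the given annotation columns."""
    counts = {col: Counter() for col in columns}
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for col in columns:
            if row[col] != "Other":
                counts[col][row[col]] += 1
    return counts


counts = annotation_counts(io.StringIO(SAMPLE), ["Figurative", "adjNouns"])
for col, counter in counts.items():
    print(col, dict(counter))
```

The same function should work for "Passive" and "Anaphora" without knowing their keywords in advance, because it only tests for values other than "Other".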