<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 11 (filtered)">
<title>Dataset for Related Work Summarization Task (RWSData)</title>
<style>
<!--
/* Font Definitions */
@font-face
{font-family:Wingdings;
panose-1:5 0 0 0 0 0 0 0 0 0;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman";}
p.MsoCaption, li.MsoCaption, div.MsoCaption
{margin:0in;
margin-bottom:.0001pt;
font-size:10.0pt;
font-family:"Times New Roman";
font-weight:bold;}
a:link, span.MsoHyperlink
{color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{color:purple;
text-decoration:underline;}
@page Section1
{size:8.5in 11.0in;
margin:1.0in 1.25in 1.0in 1.25in;}
div.Section1
{page:Section1;}
/* List Definitions */
ol
{margin-bottom:0in;}
ul
{margin-bottom:0in;}
-->
</style>
</head>
<body lang=EN-US link=blue vlink=purple>
<div class=Section1>
<p class=MsoNormal align=center style='text-align:center'><b><span
style='font-size:20.0pt'>Data</span></b><span style='font-size:20.0pt'>set for <b>R</b>elated
<b>W</b>ork <b>S</b>ummarization Task </span></p>
<p class=MsoNormal align=center style='text-align:center'><span
style='font-size:20.0pt'>(<b>RWSData</b>)</span></p>
<p class=MsoNormal> </p>
<p class=MsoNormal><b><span style='font-size:14.0pt'>Introduction</span></b></p>
<p class=MsoNormal style='text-align:justify;text-autospace:none'><span
style='font-size:13.0pt'>In scientific research, a scholar needs to show an
understanding of the context of his problem and relate his work to prior
community knowledge. A related work section is often the vehicle for this
purpose; it contextualizes the scholar’s contribution and helps the reader
understand the critical aspects of the previous works that the current work addresses.
Creating such a related work summary requires the author to understand the nuances
of his own work, and to manipulate the contextual research to support the advantages
of his method.</span></p>
<p class=MsoNormal style='text-align:justify;text-autospace:none'><span
style='font-size:13.0pt'> </span></p>
<p class=MsoNormal style='text-align:justify;text-autospace:none'><span
style='font-size:13.0pt'>We now envision an NLP application that assists the
scholar in creating his related work summary. We propose the <b>related work
summarization task</b> as a challenge to the automatic summarization community.
Given multiple articles (e.g. conference or journal papers) as input, together
with a set of keywords that describe the topics of interest and are arranged in
a hierarchical fashion, the goal is to create a related work section that
identifies the relevant prior work and describes it in context, in relation to
the given topics.</span></p>
<p class=MsoNormal style='text-align:justify;text-autospace:none'><span
style='font-size:4.0pt'> </span></p>
<p class=MsoNormal align=center style='text-align:center;page-break-after:avoid'><img
width=582 height=319 src="RWSData_files/image002.jpg"></p>
<p class=MsoCaption align=center style='text-align:center'><span
style='font-size:11.0pt'>Figure </span><span
style='font-size:11.0pt'>1</span><span style='font-size:11.0pt'>: Overview of
Related Work Summarization Task</span></p>
<p class=MsoNormal><span style='font-size:11.0pt'> </span></p>
<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt'>We
now release our dataset (namely <b>RWSData</b>) used in our initial experiments
for the task (see our papers in Publications for further details). You are
welcome to use our dataset in any future research.</span></p>
<p class=MsoNormal><span style='font-size:11.0pt'> </span></p>
<p class=MsoNormal><b><span style='font-size:14.0pt'>Dataset</span></b></p>
<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt'>This
dataset contains a collection of <b>20</b> article sets. Each set has an ID
name (used as its folder name) and includes:</span></p>
<p class=MsoNormal style='margin-left:.5in;text-align:justify;text-indent:-.25in'><span
style='font-size:13.0pt'>A.<span style='font:7.0pt "Times New Roman"'>
</span></span><b><span style='font-size:13.0pt'>ref (folder)</span></b><span
style='font-size:13.0pt'>: includes two sub-folders:</span></p>
<p class=MsoNormal style='margin-left:1.0in;text-align:justify;text-indent:
-.25in'><span style='font-size:13.0pt;font-family:Symbol'><img width=15
height=15 src="RWSData_files/image001.gif" alt="*"><span style='font:7.0pt "Times New Roman"'>
</span></span><span style='font-size:13.0pt'>pdf: original PDF files of
referenced articles</span></p>
<p class=MsoNormal style='margin-left:1.0in;text-align:justify;text-indent:
-.25in'><span style='font-size:13.0pt;font-family:Symbol'><img width=15
height=15 src="RWSData_files/image001.gif" alt="*"><span style='font:7.0pt "Times New Roman"'>
</span></span><span style='font-size:13.0pt'>txt: text files extracted from the
PDF files (pre-processed, cleaned, and error-fixed)</span></p>
<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt'> </span></p>
<p class=MsoNormal style='margin-left:.5in;text-align:justify;text-indent:-.25in'><span
style='font-size:13.0pt'>B.<span style='font:7.0pt "Times New Roman"'>
</span></span><b><span style='font-size:13.0pt'>&lt;ID name&gt;.pdf</span></b><span
style='font-size:13.0pt'>: original PDF file of the article which contains the
related work section to be extracted.</span></p>
<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt'> </span></p>
<p class=MsoNormal style='margin-left:.5in;text-align:justify;text-indent:-.25in'><span
style='font-size:13.0pt'>C.<span style='font:7.0pt "Times New Roman"'>
</span></span><b><span style='font-size:13.0pt'>&lt;ID name&gt;.rws</span></b><span
style='font-size:13.0pt'>: raw text of the related work section extracted from <b>&lt;ID
name&gt;.pdf</b></span></p>
<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt'> </span></p>
<p class=MsoNormal style='margin-left:.5in;text-align:justify;text-indent:-.25in'><span
style='font-size:13.0pt'>D.<span style='font:7.0pt "Times New Roman"'>
</span></span><b><span style='font-size:13.0pt'>&lt;ID name&gt;.rws.seg</span></b><span
style='font-size:13.0pt'>: the text of <b>&lt;ID
name&gt;.rws</b>, segmented sentence by sentence. Each sentence is additionally annotated with topical and reference
information in one of the following formats:</span></p>
<p class=MsoNormal><span style='font-size:13.0pt;background:#FFCC99'>a)<span
style='color:red'> &lt;topic index&gt;</span>#<span style='color:fuchsia'>&lt;list
of names of references separated by semi-colons ‘;’&gt;</span>#<span
style='color:blue'>&lt;sentence&gt;</span></span></p>
<p class=MsoNormal><span style='font-size:13.0pt;background:#FFCC99'>b)<span
style='color:red'> claim</span>#<span style='color:blue'>&lt;sentence&gt;</span></span></p>
<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt'> </span></p>
<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt'>In
format a), each reference in <span style='color:blue'>&lt;sentence&gt; </span>is
replaced by a placeholder ($1, $2, and so on) whose number gives the left-to-right position of that reference in
<span style='color:fuchsia'>&lt;list of names of references separated by
semi-colons ‘;’&gt;</span>. The <span style='color:red'>&lt;topic index&gt;</span>
refers to a topic defined in <b>&lt;ID name&gt;.topic</b>. Format b) is
simpler: it only marks the sentence as the authors’ <b>claim</b> about
their own work in relation to previous work. Such sentences may not be used in our
reported related work summarization process.</span></p>
<p class=MsoNormal align=center style='text-align:center'><span
style='font-size:13.0pt'> </span></p>
<p class=MsoNormal style='text-align:justify'><i><u><span style='font-size:
13.0pt'>For example</span></u></i><span style='font-size:13.0pt'>:</span></p>
<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt;
color:red'>0.1</span><span style='font-size:13.0pt'>#<span style='color:fuchsia'>(barzilay
and mckeown 2001)</span>#<span style='color:blue'>for example , $1 evaluated
their paraphrases by asking judges whether paraphrases were "
approximately conceptually equivalent . "</span></span></p>
<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt;
color:red'>0</span><span style='font-size:13.0pt'>#<span style='color:fuchsia'>(griffiths
et al., 2005);(wallach, 2006);(purver et al., 2006);(gruber et al., 2007)</span>#
<span style='color:blue'>more recent work has attempted to adapt the concepts
of topic modeling to more sophisticated representations than a bag of words ;
they use these representations to impose stronger constraints on topic
assignments $1 $2 $3 $4 .</span></span></p>
<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt;
color:red'>(claim)</span><span style='font-size:13.0pt'>#<span
style='color:blue'>from the methodological side , that body of prior work is
largely driven by local pairwise constraints , while we aim to encode global
constraints .</span></span></p>
<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt;
color:blue'> </span></p>
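<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt'>The
annotation format above can be read with a few lines of Python. The following is
only a minimal sketch, not code shipped with the dataset: the names SegLine,
parse_seg_line and resolve_placeholders are hypothetical, and it assumes that
the fields of a line are separated by the first two ‘#’ characters and that
claim lines may be marked either as claim or as (claim), as in the example
above.</span></p>

```python
import re
from typing import List, NamedTuple, Optional


class SegLine(NamedTuple):
    """One annotated sentence from a .rws.seg file (hypothetical container)."""
    topic: Optional[str]   # topic index such as "0.1"; None for claim lines
    references: List[str]  # reference names, in left-to-right order
    sentence: str          # sentence text with $n placeholders left in place
    is_claim: bool


def parse_seg_line(line):
    """Split one line on '#' according to format a) or format b)."""
    head, _, rest = line.strip().partition("#")
    head = head.strip()
    # The format description writes "claim" and the example writes "(claim)",
    # so both spellings are accepted here (an assumption of this sketch).
    if head.strip("()").lower() == "claim":
        return SegLine(None, [], rest.strip(), True)
    refs_field, _, sentence = rest.partition("#")
    references = [name.strip() for name in refs_field.split(";") if name.strip()]
    return SegLine(head, references, sentence.strip(), False)


def resolve_placeholders(seg):
    """Put the reference names back in place of $1, $2, ..."""
    def swap(match):
        position = int(match.group(1))
        if position in range(1, len(seg.references) + 1):
            return seg.references[position - 1]
        return match.group(0)  # leave unknown placeholders untouched
    return re.sub(r"\$(\d+)", swap, seg.sentence)


example = parse_seg_line(
    "0.1#(barzilay and mckeown 2001)#for example , $1 evaluated their paraphrases"
)
print(example.topic, example.references)
print(resolve_placeholders(example))
```

<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt'>Keeping
the placeholders in the stored sentence and resolving them only on demand
leaves the original annotation untouched, which may be convenient when claim
sentences are filtered out before summarization.</span></p>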
<p class=MsoNormal style='margin-left:.5in;text-align:justify;text-indent:-.25in'><span
style='font-size:13.0pt'>E.<span style='font:7.0pt "Times New Roman"'>
</span></span><b><span style='font-size:13.0pt'>&lt;ID name&gt;.topic</span></b><span
style='font-size:13.0pt'>: the details of the topic hierarchy tree (manually
annotated). It has an XML-like format with the following tags: </span></p>
<p class=MsoNormal style='margin-left:1.0in;text-align:justify;text-indent:
-.25in'><span style='font-size:13.0pt;font-family:Symbol'><img width=15
height=15 src="RWSData_files/image001.gif" alt="*"><span style='font:7.0pt "Times New Roman"'>
</span></span><span style='font-size:13.0pt'>&lt;topic index&gt; &lt;/topic
index&gt;: indicates the position of the current topic in the topic tree</span></p>
<p class=MsoNormal style='margin-left:1.0in;text-align:justify;text-indent:
-.25in'><span style='font-size:13.0pt;font-family:Symbol'><img width=15
height=15 src="RWSData_files/image001.gif" alt="*"><span style='font:7.0pt "Times New Roman"'>
</span></span><span style='font-size:13.0pt'>&lt;title&gt; &lt;/title&gt;:
gives the title of the current topic</span></p>
<p class=MsoNormal style='margin-left:1.0in;text-align:justify;text-indent:
-.25in'><span style='font-family:Symbol'><img width=13 height=13
src="RWSData_files/image001.gif" alt="*"><span style='font:7.0pt "Times New Roman"'>
</span></span><span style='font-size:13.0pt'>&lt;keywords&gt; &lt;/keywords&gt;:
a semicolon-separated list of keywords and key phrases describing the current topic</span> </p>
<p class=MsoNormal style='text-align:justify'><i><u><span style='font-size:
11.0pt'><span style='text-decoration:none'> </span></span></u></i></p>
<p class=MsoNormal style='text-align:justify'><i><u><span style='font-size:
13.0pt'>For example</span></u></i><span style='font-size:13.0pt'>:</span></p>
<p class=MsoNormal style='text-align:justify'><span style='font-size:4.0pt'> </span></p>
<table class=MsoTableGrid border=1 cellspacing=0 cellpadding=0 width=915
style='width:686.5pt;border-collapse:collapse;border:none'>
<tr style='height:17.35pt'>
<td width=454 valign=top style='width:340.3pt;border:solid windowtext 1.0pt;
padding:0in 5.4pt 0in 5.4pt;height:17.35pt'>
<p class=MsoNormal align=center style='text-align:center'><b><span
style='font-size:13.0pt;background:silver'>XML-like format of topic tree</span></b></p>
</td>
<td width=462 valign=top style='width:346.2pt;border:solid windowtext 1.0pt;
border-left:none;padding:0in 5.4pt 0in 5.4pt;height:17.35pt'>
<p class=MsoNormal align=center style='text-align:center'><b><span
style='font-size:13.0pt;background:silver'>Visual format of topic tree</span></b></p>
</td>
</tr>
<tr style='height:228.8pt'>
<td width=454 valign=top style='width:340.3pt;border:solid windowtext 1.0pt;
border-top:none;padding:0in 5.4pt 0in 5.4pt;height:228.8pt'>
<p class=MsoNormal><span style='font-size:11.0pt'>&lt;topic index&gt; 0
&lt;/topic index&gt;</span></p>
<p class=MsoNormal><span style='font-size:11.0pt'>&lt;title&gt; paraphrase
evaluation methods &lt;/title&gt;</span></p>
<p class=MsoNormal><span style='font-size:11.0pt'>&lt;keywords&gt;
paraphrase;evaluation;quality; &lt;/keywords&gt;</span></p>
<p class=MsoNormal><span style='font-size:11.0pt'> </span></p>
<p class=MsoNormal><span style='font-size:11.0pt'>&lt;topic index&gt; 0.1
&lt;/topic index&gt;</span></p>
<p class=MsoNormal><span style='font-size:11.0pt'>&lt;title&gt; subjective
manual evaluation &lt;/title&gt;</span></p>
<p class=MsoNormal><span style='font-size:11.0pt'>&lt;keywords&gt; judge;judgment;judgement;assess;human;subjective;manual;
&lt;/keywords&gt;</span></p>
<p class=MsoNormal><span style='font-size:11.0pt'> </span></p>
<p class=MsoNormal><span style='font-size:11.0pt'>&lt;topic index&gt; 0.2
&lt;/topic index&gt;</span></p>
<p class=MsoNormal><span style='font-size:11.0pt'>&lt;title&gt; improving
performance on particular tasks &lt;/title&gt;</span></p>
<p class=MsoNormal><span style='font-size:11.0pt'>&lt;keywords&gt; improve;gain;quality;performance;translation;answering;SMT;statistical
machine translation;machine translation;QA;question answering;
&lt;/keywords&gt;</span></p>
</td>
<td width=462 valign=top style='width:346.2pt;border-top:none;border-left:
none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;
padding:0in 5.4pt 0in 5.4pt;height:228.8pt'>
<p class=MsoNormal align=center style='text-align:center'><img width=650
height=187 src="RWSData_files/image003.jpg"></p>
</td>
</tr>
</table>
<p class=MsoNormal style='text-align:justify'><span style='font-size:11.0pt'> </span></p>
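<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt'>The
topic tree in the example above can be rebuilt from the topic indices alone.
The sketch below is an illustration under two assumptions that the example
suggests but the format does not state explicitly: a dotted index such as 0.1
names a child of topic 0, and keywords are separated by ‘;’ with a trailing
separator. The function names parent_index, split_keywords and build_children
are hypothetical.</span></p>

```python
def parent_index(topic_index):
    """Return the index of the parent topic, or None for a root such as "0"."""
    head, dot, _ = topic_index.strip().rpartition(".")
    return head if dot else None


def split_keywords(field):
    """Split a keywords field on ';' and drop the empty trailing entry."""
    return [keyword.strip() for keyword in field.split(";") if keyword.strip()]


def build_children(indices):
    """Map every topic index to the list of its direct children."""
    children = {index: [] for index in indices}
    for index in indices:
        parent = parent_index(index)
        if parent in children:  # roots have parent None and are skipped
            children[parent].append(index)
    return children


tree = build_children(["0", "0.1", "0.2"])
for index in tree:
    # Indent each topic by its depth, taken as the number of dots in its index.
    print("  " * index.count(".") + index)
print(split_keywords("paraphrase;evaluation;quality;"))
```

<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt'>Deriving
the hierarchy from the index string avoids storing explicit parent links, at
the cost of relying on the dotted naming convention assumed here.</span></p>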
<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt'>Detailed
statistics of the RWSData dataset are provided in the <b>DataStatistics.xls</b>
file (Excel file). A log file (<b>Logs.txt</b>) that
describes how the RWSData dataset was created is also
attached.</span></p>
<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt'> </span></p>
<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt'>You
can download the dataset from the following URL: <a
href="http://www.comp.nus.edu.sg/~hcdvu/RWSData/RWSData_1.10.zip">Dataset for Related
Work Summarization (Zip file, ~73MB)</a>. </span></p>
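<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt'>Once
the archive is unpacked, the per-set layout described in items A to E of the
Dataset section can be checked with a short script. This is a sketch under
assumptions: it takes the folder name as the ID name, as stated above, but the
location of the set folders inside the archive is not specified here, and the
function names expected_entries and missing_entries are hypothetical.</span></p>

```python
from pathlib import Path


def expected_entries(set_dir):
    """List the paths that one article set should contain (items A to E)."""
    set_dir = Path(set_dir)
    name = set_dir.name  # the ID name is assumed to equal the folder name
    return [
        set_dir / "ref" / "pdf",          # A: PDFs of the referenced articles
        set_dir / "ref" / "txt",          # A: extracted text of those PDFs
        set_dir / (name + ".pdf"),        # B: article with the related work section
        set_dir / (name + ".rws"),        # C: raw related work text
        set_dir / (name + ".rws.seg"),    # D: annotated sentences
        set_dir / (name + ".topic"),      # E: topic hierarchy tree
    ]


def missing_entries(set_dir):
    """Return the expected paths, relative to the set folder, that do not exist."""
    set_dir = Path(set_dir)
    return [
        path.relative_to(set_dir).as_posix()
        for path in expected_entries(set_dir)
        if not path.exists()
    ]
```

<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt'>Calling
missing_entries on each set folder and printing any non-empty result gives a
quick check that an unpacked copy is complete.</span></p>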
<p class=MsoNormal> </p>
<p class=MsoNormal><b><span style='font-size:14.0pt'>Publications</span></b></p>
<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt'>Cong
Duy Vu Hoang (2010). Towards Automated Related Work Summarization. MSc Thesis.
Department of Computer Science, School of Computing, National University of Singapore.</span></p>
<p class=MsoNormal style='text-align:justify'><span style='font-size:13.0pt'>Cong
Duy Vu Hoang &amp; Min-Yen Kan (2010). Towards Automated Related Work
Summarization. In <i>Proceedings of the 23rd International Conference on
Computational Linguistics (COLING 2010)</i>. Beijing, China.</span></p>
<p class=MsoNormal> </p>
<p class=MsoNormal><b><span style='font-size:14.0pt'>Group Members</span></b></p>
<p class=MsoNormal><span style='font-size:13.0pt'><a
href="http://www.comp.nus.edu.sg/~hcdvu">Cong Duy Vu Hoang</a> (NUS) – <a
href="mailto:[email protected]">[email protected]</a> or <a
href="mailto:[email protected]">[email protected]</a> </span></p>
<p class=MsoNormal><span style='font-size:13.0pt'><a
href="http://www.comp.nus.edu.sg/~kanmy">Min-Yen Kan</a> (NUS) – <a
href="mailto:[email protected]">[email protected]</a> </span></p>
</div>
</body>
</html>