
This paper is a review of the works presented at the 10th International Workshop on
Semantic Evaluation 2016, specifically on the task of performing sentiment analysis
on Twitter data (Nakov, Ritter, Rosenthal, Sebastiani, & Stoyanov, 2016). SemEval
2016 offered 14 tasks in five tracks that teams from around the world were invited to
participate in and solve. This review focuses on the fourth task on the list and discusses
the five subtasks under Task 4. The coordinators and organizers of SemEval 2016
set evaluation metrics for all subtasks, which will be discussed in section 2. The subtasks
are based on classifying and quantifying the sentiment in tweets collected over a period
of a few years. Along with general classification, we will review the performance of
different methods on topic-wise tweet sentiment analysis.
Subtasks A, B, and C involve classification of tweet sentiment across a set of classes,
whereas subtasks D and E involve estimating a distribution over them. Furthermore,
subtask A has three target classes, whereas subtasks B and D have binary ones, and
subtasks C and E have an ordinal set of classes into which the tweets need to be grouped.
Twitter data, like other huge datasets, present an opportunity to model a collective
structure out of sparsely connected and widespread individual data points. Sentiment
analysis on Twitter data, and especially quantification of sentiment, can be used
to study political polarization in the masses (Borge-Holthoefer, Magdy, Darwish, &
Weber, 2015), stock market prediction from tweets (Bollen, Mao, & Zeng, 2011), marketing
(Burton & Soboleva, 2011), and global happiness indication (Dodds, Harris,
Kloumann, Bliss, & Danforth, 2011), amongst many other applications. We will discuss
these applications in more detail in section 4. Similar tasks from previous runs
of SemEval on tweet sentiment analysis have drawn the most submissions amongst all
tasks. This clearly shows that researchers and analysts consider human Twitter
interactions a source of data that can be used to study a global thought process or
mindset, or even to predict events in the near future.
In this review paper, we will discuss how the coordinators of SemEval 2016 presented
the tasks and set the evaluation metrics, and what data was presented to the participants
to work on. We will also review and compare the methods used to collect and
preprocess data and to achieve the best scores for each subtask. We will discuss the
general consensus in choosing the type of classification/quantification method, and go
into a bit more detail about the approach taken by the winners of each subtask.
2 Data, Tasks, and Evaluation Metrics
In this section, we will talk about the data used and presented to the participants for
this task, as well as the subtask definitions and evaluation metrics. We will go into the
details of each subtask, discussing the methods for measuring accuracy/errors and the
possible changes the coordinators could have made.
2.1 Data Used
In this sub-section, we will discuss the methods of data collection and validation, and the
distribution of the data over topics and sentiments. The dataset presented to the participants
was called Tweet2016. All the tweets in the dataset were manually classified for training
and for testing the performance of each subtask submission. The participants were
provided with datasets from previous runs of similar SemEval tasks (Rosenthal et al.,
2015; Rosenthal, Nakov, Ritter, & Stoyanov, 2014; Nakov et al., 2013) for training.
An extra training dataset was extracted from tweets in the third quarter of 2015, and
the test dataset was drawn from tweets in the fourth quarter of 2015. The tweets
were based on 200 topics that were extracted using an entity recognition system developed
by Ritter et al. in 2011 (Ritter, Clark, Mausam, & Etzioni, 2011). Separate
topics for testing and training were manually chosen so that they were not ambiguous or
equivocal.
For annotation of the data, crowdsourcing services, namely Amazon’s Mechanical Turk (AMT) and
CrowdFlower, were used for the training and test sets respectively. This is a reasonable process
for validating the classes into which tweets are organized for training/testing; however, in
our opinion, the organizers should have cross-validated the AMT annotations
to remove user responses that could be randomly inconsistent. This can be done simply
by singling out annotators who have consistently placed tweets in the wrong
classes, as averaged against a large number of responses, and then invalidating their
responses overall. The authors do make a point of taking similar measures to weed out
suspected wrong responses on the test set from CrowdFlower. For all tweets, the
authors converted the five sentiment classes (Highly positive, Positive, Neutral,
Negative, Highly negative) into integers in the range [-2, 2]. This helps in assigning a
class to a tweet by averaging the score in cases where there is no clear majority among
the annotations.
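To make this aggregation concrete, the sketch below is our own illustration, not the organizers' actual pipeline; the label spellings and the majority threshold are assumptions. It maps annotator labels to the integer scale and falls back on the averaged score when no clear majority exists.

```python
from collections import Counter

# Assumed mapping of the five sentiment labels to the -2..+2 integer scale.
LABEL_TO_INT = {
    "highly negative": -2, "negative": -1, "neutral": 0,
    "positive": 1, "highly positive": 2,
}
INT_TO_LABEL = {v: k for k, v in LABEL_TO_INT.items()}

def aggregate_annotations(annotations):
    """Assign a single class to a tweet from several annotator labels.

    Uses the majority label when one exists; otherwise averages the
    integer scores and rounds to the nearest class on the ordinal scale.
    """
    counts = Counter(annotations)
    label, freq = counts.most_common(1)[0]
    if freq > len(annotations) / 2:  # clear majority
        return label
    mean_score = sum(LABEL_TO_INT[a] for a in annotations) / len(annotations)
    return INT_TO_LABEL[round(mean_score)]

# Example: three annotators disagree, so the averaged score decides.
print(aggregate_annotations(["positive", "highly positive", "neutral"]))  # -> "positive"
```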
2.2 Tasks and Evaluation Metrics
Task 4 of SemEval 2016 (Nakov, Ritter, Rosenthal, Sebastiani, & Stoyanov, 2016)
is divided into five subtasks, each having either a different goal or a different granularity.
Subtask A was a repetition of tasks from previous runs of the semantic evaluations,
a simple polarity task for classifying tweets regardless of topic. Subtasks
D and E are for quantification of sentiment, whereas B and C are for its classification.
To add diversity and interest to these tasks, one subtask of each type (quantification and
classification) has binary target classes and the other has ordinal classes.
2.2.1 Subtask A
The first subtask asked participants to classify a tweet into the positive, negative or neutral
class. This is a single-label multi-class (SLMC) classification, in which the sentiment
of each data point (tweet) is not converted to an ordinal value. To evaluate this subtask,
the authors consider the F score over the predicted sentiment classes. For each of the
positive, negative and neutral classes, separate precision and recall scores are calculated,
which are then used to compute the F score. Thus the measure $F_1^{PN}$ is:

$$F_1^{PN} = \frac{F_1^{P} + F_1^{N}}{2}$$

Here, $F_1^{P}$ is the F score for the positive class:

$$F_1^{P} = \frac{2\,\pi^{P}\rho^{P}}{\pi^{P} + \rho^{P}}$$

where $\pi^{P}$ is the precision for the positive class and $\rho^{P}$ is the recall for the same.
The precision and recall for the negative class are calculated similarly and combined into
$F_1^{N}$; note that the neutral class, although part of the classification, does not enter
the final score. In our opinion, the authors should have given weight to the neutral class
as well when calculating the final score $F_1^{PN}$. Also, the authors could have used the
recall over all classes as the score, since it gives an unbiased measure of accuracy
whether or not the classes are balanced (Esuli & Sebastiani, 2015). This matters for the
baseline classifier: on an imbalanced training set, the F score can exceed one-third,
which biases the prediction score towards one of the positive, negative or neutral classes.
However, all teams that submitted a solution for this subtask performed considerably better
than the baseline and than the previous years' scores on the same task, so the F score
handled these inconsistencies well enough.
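To make the metric concrete, here is a minimal sketch of $F_1^{PN}$ in Python. This is our own illustration, not the official scorer, and it assumes gold and predicted labels are given as the strings "positive", "negative" and "neutral".

```python
def f1_for_class(gold, pred, cls):
    """Precision, recall and F1 of a single class, treated one-vs-rest."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def f1_pn(gold, pred):
    """Subtask A score: mean of the positive and negative F1 scores (neutral excluded)."""
    return (f1_for_class(gold, pred, "positive") +
            f1_for_class(gold, pred, "negative")) / 2

gold = ["positive", "neutral", "negative", "positive"]
pred = ["positive", "positive", "negative", "neutral"]
print(f1_pn(gold, pred))  # (0.5 + 1.0) / 2 = 0.75
```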
2.2.2 Subtask B
The second subtask asks for classification of tweets on a given topic into positive
and negative binary target classes; there is no neutral class here. For each of the
200 topics about which the Twitter data had been extracted, a score for the prediction of
the correct sentiment class is calculated separately, and the mean score is taken as the final
indication of the submitter's performance. The authors have rightly used macro-averaged
recall for this subtask, as recommended by (Esuli & Sebastiani, 2015). The score
is the mean of the recall for each of the positive and negative classes (remember,
there is no neutral class here). This results in a higher score for a better classifier
and a score of exactly 0.5 for a classifier that predicts only one class. Ideally, all
acceptable classifiers should score better than this baseline classifier. The recall score
for this task is given by

$$\rho^{PN} = \frac{\rho^{P} + \rho^{N}}{2}$$

where $\rho^{P}$ and $\rho^{N}$ are the recall values for the positive and negative classes respectively.
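A minimal sketch of this evaluation in Python follows. It is our own illustration, not the official scorer, and assumes per-topic lists of gold and predicted "positive"/"negative" labels.

```python
def recall_for_class(gold, pred, cls):
    """Recall of one class: fraction of its gold instances predicted correctly."""
    relevant = [p for g, p in zip(gold, pred) if g == cls]
    return sum(1 for p in relevant if p == cls) / len(relevant) if relevant else 0.0

def macro_recall_pn(gold, pred):
    """Per-topic Subtask B score: mean recall of the positive and negative classes."""
    return (recall_for_class(gold, pred, "positive") +
            recall_for_class(gold, pred, "negative")) / 2

def subtask_b_score(topics):
    """Final score: the per-topic macro-averaged recall, averaged over all topics.

    `topics` maps a topic name to a (gold, pred) pair of label lists.
    """
    return sum(macro_recall_pn(g, p) for g, p in topics.values()) / len(topics)

topics = {"some_topic": (["positive", "negative"], ["positive", "positive"])}
print(subtask_b_score(topics))  # (1.0 + 0.0) / 2 = 0.5
```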
2.2.3 Subtask C
Subtask C is a task for classifying tweets on a specific topic into a five-class
ordinal scale. The target sentiment classes here are {Highly positive, Positive, Neutral,
Negative, Highly negative}, which are mapped to the integers {+2, +1, 0, -1, -2} respectively.
This subtask is evaluated on the basis of the margin of error in classification,
which means that a misclassification between the extreme classes is a bigger error
than one between adjacent classes (Gaudette & Japkowicz, 2009). In simple terms, if
a +2 class is misclassified as a -2, the error is four times as large as when a +2 gets
misclassified as a +1. This is because the authors use a Mean Absolute Error (MAE)
over all tweet classifications for each topic. The MAE is calculated per class and
the mean over classes is taken as the score for a team's submission on this subtask. Since
this error measures the distance between the predicted class and the actual class of a tweet,
a lower value of MAE implies a better classification job (Baccianella, Esuli, &
Sebastiani, 2009). The formula for MAE in this case is defined as:

$$MAE = \frac{1}{|C|} \sum_{j=1}^{|C|} \frac{1}{|T_j|} \sum_{x_i \in T_j} |h(x_i) - y_i|$$

where $C$ is the set of target classes, $T_j$ is the set of tweets whose true class is the
$j$-th class, and $|h(x_i) - y_i|$ is the error of each tweet's classification. Again, as with
the previous subtask, this error is averaged across topics.
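The per-topic macro-averaged MAE can be sketched as follows. This is our own illustration on the integer scale described above, not the official scorer.

```python
from collections import defaultdict

def macro_mae(gold, pred):
    """Macro-averaged MAE over the ordinal classes present in the gold labels.

    `gold` and `pred` are lists of integer classes on the -2..+2 scale.
    """
    by_class = defaultdict(list)
    for g, p in zip(gold, pred):
        by_class[g].append(abs(p - g))  # per-tweet absolute error, grouped by gold class
    # Average within each gold class, then average over classes.
    return sum(sum(errs) / len(errs) for errs in by_class.values()) / len(by_class)

gold = [2, 2, 0, -1]
pred = [1, 2, 0, -2]
print(macro_mae(gold, pred))  # classes 2, 0, -1 -> (0.5 + 0.0 + 1.0) / 3 = 0.5
```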
2.2.4 Subtask D
Subtask D (and E) differs from the previous three because it asks the participants
to quantify the prevalence of each class within the tweets instead of classifying individual
tweets into one of the classes. For this subtask, binary quantification must be performed over
the positive and negative (only) classes for all tweets within a single topic. Note that
good quantification does not imply good classification over the same dataset, because
Type I and Type II errors for the same class can cancel each other out in the aggregate
counts (Esuli & Sebastiani, 2015) (Forman, 2008). As the task description puts it,
“Quantification is thus a more lenient task since a perfect classifier is also a perfect
quantifier, but a perfect quantifier is not necessarily a perfect classifier” (Nakov et
al., 2016). The authors thus employ the Kullback-Leibler Divergence (KLD) (Forman,
2005) to evaluate the error made in estimating the distribution over sentiment classes
against the actual distribution, as follows:
$$KLD(\hat{p}, p, C) = \sum_{c_j \in C} p(c_j)\,\log_e \frac{p(c_j)}{\hat{p}(c_j)}$$

where $\hat{p}(c_j)$ is the estimated prevalence of the $j$-th sentiment class and $p(c_j)$
is its actual prevalence. Note that, since the denominator contains the predicted distribution
of a class, the measure becomes undefined if no tweets in the test set are assigned to that
class. This is solved by additively smoothing the estimated distribution in the denominator
and applying the same smoothing to the true distribution in the numerator. The perfect result
in this case is a KLD of zero, where the prevalence of every class is estimated correctly;
larger positive values indicate progressively worse quantification performance.
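A sketch of the smoothed KLD computation is given below. It is our own illustration, not the official scorer; in particular, the smoothing constant, taken here as 1/(2 x number of tweets), should be treated as an assumption.

```python
import math

def smoothed_kld(true_prev, est_prev, n_tweets):
    """KLD between smoothed true and estimated class prevalences.

    `true_prev` and `est_prev` map class labels to prevalences summing to 1.
    Both distributions are additively smoothed so the estimate is never zero;
    eps = 1 / (2 * n_tweets) is one common choice (an assumption here).
    """
    eps = 1.0 / (2 * n_tweets)
    classes = true_prev.keys()
    denom = eps * len(classes) + 1.0
    kld = 0.0
    for c in classes:
        p_s = (true_prev[c] + eps) / denom            # smoothed true prevalence
        q_s = (est_prev.get(c, 0.0) + eps) / denom    # smoothed estimated prevalence
        kld += p_s * math.log(p_s / q_s)
    return kld

true_prev = {"positive": 0.7, "negative": 0.3}
est_prev = {"positive": 0.6, "negative": 0.4}
print(smoothed_kld(true_prev, est_prev, n_tweets=100))  # small positive value (~0.02)
```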