People tweet more than 500 million times daily, yielding a
noisy, informal, but sometimes informative corpus of 140-character messages
that mirrors the zeitgeist in an unprecedented
manner. In the domain of surveillance systems, one is interested in
extracting specific information about "who" is doing "what", "where"
and "when", as conveyed in these
short messages. This task is commonly called named entity recognition and
aims to automatically extract from text the mentions of rigid designators
that belong to named-entity types
such as persons, organizations, events, locations and timexes.
In this project, we propose to improve the robustness of named entity
recognition systems for Twitter, starting from an approach similar to the
recent work of Saha et al. (2015a) on
biomedical texts. Indeed, most existing systems learn classifiers from
a small amount of labeled data. Although strong results can be obtained
through cross-validation, these
systems may not scale up in real-world environments such as social media
due to the high variety of unseen text content. One solution to this problem
is to apply ensemble learning
techniques to take advantage of different learning paradigms, such as
conditional random fields or support vector machines (Joachims, 1998), whose
outputs may be combined into an optimal
consensus. By doing so, we expect that robust classifiers can be built and
used reliably in a real-world environment such as Twitter.
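As a minimal sketch of the ensemble idea, the following Python code combines the per-token labels of several taggers by (optionally weighted) voting. This is an illustration under stated assumptions, not the consensus scheme of the project: the function name majority_vote, the BIO label set, and the three hand-written tagger outputs are all hypothetical, and plain voting stands in for whatever optimal combination is eventually learned. Token-level voting may also yield label sequences that are not valid BIO, which a real system would have to repair.

```python
from collections import Counter


def majority_vote(predictions, weights=None):
    """Combine per-token label sequences from several taggers by weighted vote.

    predictions: list of label sequences, one per tagger, all the same length.
    weights: optional list of per-tagger weights (defaults to 1.0 each).
    Ties are broken in favour of the earliest tagger in the list.
    """
    if not predictions:
        raise ValueError("need at least one tagger output")
    length = len(predictions[0])
    if any(len(labels) != length for labels in predictions):
        raise ValueError("all taggers must label the same tokens")
    if weights is None:
        weights = [1.0] * len(predictions)

    combined = []
    for position in range(length):
        # Accumulate the weight given to each candidate label at this token.
        scores = Counter()
        for labels, weight in zip(predictions, weights):
            scores[labels[position]] += weight
        best = max(scores.values())
        # Pick the top-scoring label proposed by the earliest tagger.
        for labels in predictions:
            if scores[labels[position]] == best:
                combined.append(labels[position])
                break
    return combined


# Hypothetical outputs of three taggers for: "Obama visits Paris today"
crf_tags = ["B-PER", "O", "B-LOC", "O"]
svm_tags = ["B-PER", "O", "O", "O"]
rule_tags = ["O", "O", "B-LOC", "B-TIME"]

consensus = majority_vote([crf_tags, svm_tags, rule_tags])
print(consensus)
```

Passing weights, for example held-out accuracies of each tagger, lets a more reliable classifier outvote the others; with equal weights the scheme reduces to a simple majority.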
To bridge the gap between unstructured text and structured machine-readable
knowledge bases, entity linking is performed; it consists of mapping each
entity mention in a tweet
to a unique entity, i.e. an entry ID of a knowledge base such as Wikipedia
or YAGO (Suchanek et al., 2007). As such, each tweet is not an isolated
segment of text but instead links
to a knowledge base, which allows multilingual reasoning. Many
studies have tackled entity linking (Ferragina and Scaiella, 2013) and,
more recently, entity linking
for social media texts (Liu et al., 2013). Tweets pose special challenges
to entity linking. First, a tweet is often too concise and too noisy to
provide enough information for similarity
computation. Second, tweets contain rich variations of named entities, many
of which fall outside the scope of the knowledge bases. Within the scope of
this project, we propose
to tune the robust entity linking strategy of Hoffart et al.
(2011) for social media texts. We will study the introduction of a named
entity continuous space (Lin et al., 2015)
into the disambiguation process. Another promising possibility is the
integration of recent work by one of the team members (Brazdil et al.,
2015) in the domain of affinity
mining, especially to discover and resolve apparently unrelated entities
mentioned in social networks.
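To make the entity linking task and its two challenges concrete, the following Python sketch maps a mention to a knowledge base entry ID by generating candidates from surface names and ranking them by word overlap with the tweet. It is only an illustration under stated assumptions: the toy knowledge base, its entry IDs and descriptions, the function names, and the overlap score are all hypothetical, and the overlap score is a deliberately simple stand-in, not the strategy of Hoffart et al. (2011). A mention with no candidate, or with too little context to decide, is left unlinked.

```python
import re

# Hypothetical toy knowledge base: entry ID -> surface names and a description.
TOY_KB = {
    "Paris_(France)": {
        "names": {"paris"},
        "context": "capital city france seine eiffel tower",
    },
    "Paris_Hilton": {
        "names": {"paris", "paris hilton"},
        "context": "american celebrity hotel heiress television",
    },
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def link_mention(mention, tweet, kb=TOY_KB, min_overlap=1):
    """Return the entry ID that best matches the mention, or None if unlinked."""
    # Candidate generation: entries that list the mention as a surface name.
    candidates = [eid for eid, entry in kb.items()
                  if mention.lower() in entry["names"]]
    if not candidates:
        return None  # the entity falls outside the knowledge base

    # Disambiguation: count tweet words shared with each entry description.
    context = tokenize(tweet) - tokenize(mention)
    best_id, best_score = None, 0
    for eid in candidates:
        score = len(context & tokenize(kb[eid]["context"]))
        if score > best_score:
            best_id, best_score = eid, score
    # Too little evidence (e.g. a very short tweet) leaves the mention unlinked.
    return best_id if best_score >= min_overlap else None


print(link_mention("Paris", "Climbing the Eiffel tower in Paris tonight"))
print(link_mention("Paris", "Paris on television again with her hotel empire"))
print(link_mention("Paris", "Paris lol"))
```

The third call shows the first challenge named above: a tweet that is too concise provides no context words, so no candidate can be preferred and the mention stays unlinked.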