Mobs, riots, and crowds affect societies in numerous ways. In the medium term, the collective behavior of
crowds can be an agent of social change as well as an affirmation of existing social mores and structures. In the short
term, it may have dramatic consequences such as killings, massacres, or material damage, which reflect
how fractured our societies can be.
Demonstrations are reported daily in various parts of the world. Recent examples include anti-government
protests in the US, Venezuela, and some European countries. Between 2012 and 2015, hundreds of thousands
of people protested in the streets of Portugal (goo.gl/WeH7qR) against the troika's public policies, sometimes resulting
in violent clashes. In France, a recent protest against police brutality turned violent, with masked youths and police
engaging in running street battles, after the death of a young environmental activist. Seven police officers were injured
after around 2,000 people gathered to protest, with some throwing Molotov cocktails (goo.gl/FQQCdi). In India,
the Gujarat violence, which followed the caste-based protest led by the Patel community (goo.gl/tymXhc),
recently attracted attention. It left 9 people dead and 18 injured. Army personnel had
to be deployed, schools were closed, and trains bound for the city were canceled.
Within this project, we propose to develop a multilingual surveillance system capable of detecting emerging crowds
by identifying rising events that foster high focus, high energy and high emotion on social media. Our fundamental
hypothesis is that virtual crowds exhibit characteristics similar to those of real crowds, which may allow them to be modeled
as complex computer systems by relying on advanced natural language processing and machine learning
techniques. The project lies at the intersection of important scientific research topics, namely urban
informatics, natural language processing for social media, predictive analytics over big social data and image
It will begin by presenting the R&D activities to be developed, followed by the organisation and
logic of the work plan, the justification of the research strategy adopted and its adequacy
to the objectives of the project. The project management structure and decision-making mechanisms will subsequently
be described. The activities to be developed and their temporal distribution in the project are as follows:
People tweet more than 500 million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented
manner. In the domain of surveillance systems, one is interested in extracting the specific information about "who" is doing "what", "where" and "when" that is conveyed in these
short messages. This task is commonly called named entity recognition and aims to automatically extract from text mentions of rigid designators belonging to named-entity types
such as persons, organizations, events, locations and timexes.
In this project, we propose to improve the robustness of named entity recognition systems for Twitter, following an approach similar to the recent work of (Saha et al., 2015a) on
biomedical texts. Indeed, most existing systems learn classifiers from small amounts of labeled data. Although strong results can be obtained through cross-validation, these
systems may not scale up in real-world environments such as social media because of the high variety of unseen text content. One solution to this problem is to apply ensemble learning
techniques that take advantage of different learning paradigms, such as conditional random fields or support vector machines (Joachims, 1998), and combine their outputs into an optimal
consensus. By doing so, we expect that robust classifiers can be built and used reliably in a real-world environment such as Twitter.
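As an illustration of the consensus idea, the following minimal sketch combines the per-token labels of several taggers by weighted voting. The tagger names, weights and label sequences are hypothetical; the actual combination scheme to be used in the project is not fixed here.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_vote(predictions: Dict[str, List[str]],
                  weights: Dict[str, float]) -> List[str]:
    """Combine per-token label sequences from several taggers by weighted voting."""
    names = list(predictions)
    n_tokens = len(predictions[names[0]])
    if any(len(predictions[name]) != n_tokens for name in names):
        raise ValueError("all taggers must label the same number of tokens")
    combined = []
    for i in range(n_tokens):
        scores = defaultdict(float)
        for name in names:
            # Taggers without an explicit weight count as 1.0 (assumption).
            scores[predictions[name][i]] += weights.get(name, 1.0)
        # Ties are broken alphabetically so that the output is deterministic.
        combined.append(max(sorted(scores), key=lambda label: scores[label]))
    return combined


# Hypothetical outputs of three taggers for: "Protest in Lisbon today"
predictions = {
    "crf": ["O", "O", "B-LOC", "O"],
    "svm": ["O", "O", "B-ORG", "O"],
    "gazetteer": ["B-EVENT", "O", "B-LOC", "O"],
}
weights = {"crf": 1.0, "svm": 1.0, "gazetteer": 0.5}
consensus = weighted_vote(predictions, weights)
```

Token-level voting can produce label sequences that are not valid under a BIO scheme, so a real system would probably need a repair step or a sequence-level combination.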
To bridge the gap between unstructured text and structured machine-readable knowledge bases, entity linking is performed: each entity mention in a tweet is mapped
to a unique entity, i.e. an entry ID of a knowledge base such as Wikipedia or YAGO (Suchanek et al., 2007). As such, each tweet is not an isolated segment of text but instead links
to a knowledge base, which allows multilingual reasoning. Many studies have tackled entity linking (Ferragina and Scaiella, 2013) and, more recently, entity linking
for social media texts (Liu et al., 2013). Tweets pose special challenges to entity linking. First, a tweet is often too concise and too noisy to provide enough information for similarity
computation. Second, tweets contain rich variations of named entities, many of which fall outside the scope of the knowledge bases. Within the scope of this project, we propose
to adapt the robust entity linking strategy of (Hoffart et al., 2011) to social media texts. We will study the introduction of continuous-space representations of named entities (Lin et al., 2015)
into the disambiguation process. Another promising possibility is to build on the recent work of one of the team members (Brazdil et al., 2015) in the domain of affinity
mining, especially to discover and resolve apparently unrelated entities mentioned in social networks.
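The mapping from a mention to a knowledge base entry can be sketched as candidate lookup followed by context-based disambiguation. The example below is a toy illustration only: the alias table, the entry IDs and the description keywords are invented, and Jaccard overlap stands in for whatever similarity measure is eventually chosen. It is not the method of (Hoffart et al., 2011).

```python
import re
from typing import Dict, List, Optional, Set

# Hypothetical alias table: surface form -> candidate knowledge base entries.
ALIASES: Dict[str, List[str]] = {
    "paris": ["KB:Paris_(France)", "KB:Paris_Hilton"],
}
# Hypothetical keyword descriptions of each knowledge base entry.
DESCRIPTIONS: Dict[str, Set[str]] = {
    "KB:Paris_(France)": {"capital", "city", "france", "seine", "protest", "police"},
    "KB:Paris_Hilton": {"american", "celebrity", "television", "hotel", "heiress"},
}


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def link(mention: str, tweet: str, min_score: float = 0.0) -> Optional[str]:
    """Return the best-matching entry ID, or None when nothing scores above min_score."""
    candidates = ALIASES.get(mention.lower(), [])
    context = tokenize(tweet) - tokenize(mention)
    best_id, best_score = None, min_score
    for candidate in candidates:
        description = DESCRIPTIONS[candidate]
        union = context | description
        # Jaccard overlap between the tweet context and the entry description.
        score = len(context & description) / len(union) if union else 0.0
        if score > best_score:
            best_id, best_score = candidate, score
    return best_id


linked = link("Paris", "Police clash with protest in Paris city centre")
```

Returning `None` covers both challenges mentioned above: a context too short to discriminate between candidates, and mentions that have no entry in the knowledge base.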
Social media have emerged as powerful means of communication for people looking to share and exchange information on a wide variety of real-world events. Short messages
posted on Twitter can typically reflect these events as they happen. For this reason, the content of such social media sites is particularly useful for real-time identification of real-world
events and their associated user-contributed messages. As such, a crowd can be viewed as a community sharing a common focus about some specific event.
Event detection has been intensively studied in the last decade, mainly due to the advent of social media (Farzindar and Khreich, 2015). Two different approaches have been
proposed: document-pivot and feature-pivot (Section X). The techniques studied so far are interesting but do not cover the overall picture. Once an event is detected, it is crucial to
track it, i.e. to follow its evolution. Within this scope, topic models have shown successful results for text clustering tasks (Section X). However, they rely only on a term-document
matrix to compute similarity, which may be insufficient for social media texts, which are short and lack contextual information. This is confirmed by the recent work of (Vikre and
Wold, 2015), who show that locality-sensitive hashing combined with named entity recognition detects news better than the topic
modeling approach. Moreover, topics are represented as sets of words that may not all be bursty, and may therefore capture general topics rather than specific ones.
Therefore, we propose a new strategy based on the recent findings of (Moreno et al., 2014), who proposed the Dual C-means clustering algorithm, which combines document-pivot
and feature-pivot techniques in a single model. Dual C-means was shown to perform comparably to topic models for word sense induction (Acharaya et al., 2016). The advantage of the
Dual C-means algorithm is that different similarity measures can be implemented, including knowledge-based metrics, which may lead to improved results in the line of (Vikre and
Wold, 2015), as tweets are linked to YAGO by their entities. Moreover, the clustering process can be driven by bursty keywords or named entities, and thus adapt better to the
dynamic environment of Twitter, instead of relying on all possible terms present in the time window. It will also allow the integration of richer sources of knowledge in clustering.
One possibility is to explore entailment dependencies according to recent work from the team members (Pais et al., 2014).
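To make the notion of bursty keywords concrete, the sketch below flags terms whose count in the current time window is unusually high compared with previous windows. The smoothed z-score, the thresholds and the toy counts are assumptions made for illustration; this is not the Dual C-means algorithm, only one possible way of selecting the terms that could drive the clustering.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import List, Tuple


def bursty_terms(history: List[Counter], current: Counter,
                 z_threshold: float = 2.0,
                 min_count: int = 3) -> List[Tuple[str, float]]:
    """Return (term, score) pairs for terms that burst in the current window."""
    if not history:
        raise ValueError("at least one past time window is required")
    bursty = []
    for term, count in current.items():
        if count < min_count:
            continue  # ignore terms that are too rare to be meaningful
        past = [window.get(term, 0) for window in history]
        # Adding 1.0 to the deviation avoids division by zero for stable terms.
        z = (count - mean(past)) / (pstdev(past) + 1.0)
        if z >= z_threshold:
            bursty.append((term, z))
    # Strongest bursts first; alphabetical order breaks ties.
    return sorted(bursty, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical term counts for three past windows and the current one.
history = [
    Counter({"strike": 1, "weather": 5}),
    Counter({"strike": 0, "weather": 6}),
    Counter({"strike": 2, "weather": 4}),
]
current = Counter({"strike": 12, "weather": 5})
burst_words = [term for term, _ in bursty_terms(history, current)]
```

The same function could be applied to linked entity IDs instead of raw terms, which would make the burst signal independent of the language of the tweet.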
WORK PACKAGE 3 : M6 to M36
EXTREMISM AND COLLECTIVE RADICALIZATION UNDERSTANDING
Each cluster of tweet messages focusing on a bursty topic may constitute a potential threat. However, the overwhelming majority of clusters are harmless and represent casual,
conventional or expressive crowds as well as noisy data (Becker et al., 2011). To identify acting or protest crowds, we propose to understand the typical language usage present
in each cluster as well as its network activity. Indeed, ultimately, a crowd is characterized by its dominant emotion, its level of interaction and shared focus.
(Krumm, 2015) showed that specific radicalized language is used within acting and protest crowds. Therefore, we propose that each tweet inside a cluster is classified as radical
or non-radical in terms of language use, so that the collective radicalization of a cluster can be measured. As far as we know, there exists no previous work on modeling radicalized
language. Radicalization is a process by which an individual or group comes to adopt increasingly extreme political, social, or religious ideals and aspirations (goo.gl/Jz08cD). As
such, we hypothesize that radicalized language mainly expresses negative emotions (such as anger, fear, or anxiety) with high intensity, following the classification of Plutchik's
wheel of emotion (Plutchik, 1980).
In this project we propose to learn weak classifiers based on dictionaries of emotional words (Qadir and Riloff, 2014) to retrieve many roughly classified emotional tweets, in a
manner similar to (Tang et al., 2014). We will build word embeddings that consider emotional and intensity contexts together with syntactic context. For that purpose,
weak classifiers for intensity detection will be built based on the recent work of (Sharma et al., 2015) on language intensity.
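The following sketch illustrates how a dictionary-based weak classifier could feed a cluster-level radicalization measure. The lexicon entries, emotion labels, intensity values and threshold are all hypothetical, and the rule (a tweet is flagged when it contains a negative emotion word of high intensity) is only a direct reading of the hypothesis stated above, not a validated model.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical emotion lexicon: word -> (emotion, intensity in [0, 1]).
LEXICON: Dict[str, Tuple[str, float]] = {
    "furious": ("anger", 0.9),
    "hate": ("anger", 0.8),
    "terrified": ("fear", 0.9),
    "worried": ("fear", 0.4),
    "annoyed": ("anger", 0.3),
    "happy": ("joy", 0.6),
}
NEGATIVE_EMOTIONS = {"anger", "fear"}


def is_radical(tweet: str, threshold: float = 0.7) -> bool:
    """Weak rule: flag a tweet containing a high-intensity negative emotion word."""
    for token in re.findall(r"[a-z']+", tweet.lower()):
        emotion, intensity = LEXICON.get(token, ("none", 0.0))
        if emotion in NEGATIVE_EMOTIONS and intensity >= threshold:
            return True
    return False


def cluster_radicalization(tweets: List[str]) -> float:
    """Fraction of tweets in a cluster that the weak rule flags as radical."""
    if not tweets:
        return 0.0
    return sum(is_radical(tweet) for tweet in tweets) / len(tweets)


cluster = [
    "We are furious about this decision",
    "I hate what they did to us",
    "Slightly annoyed by the traffic",
    "Happy to see so many people here",
]
score = cluster_radicalization(cluster)
```

Labels produced this way are noisy by design; their role would be to provide large amounts of roughly labeled data for training the embeddings, not to serve as the final classifier.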
Finally, we will study the introduction of demographically-driven word embeddings. Recently, (Bamman et al., 2014) proposed to develop word embeddings that take into account the
location of the author of the message. These findings open up many opportunities for improvement, as tweets can be geolocated and also include information such as the age
and gender of the author.
Each event cluster is supposed to focus on a specific event (i.e. high focus), which may involve radicalized talk (high emotion) among its members. Another characteristic of a crowd is its
level of internal relationship. A community is formed by individuals such that those within a group interact with each other more frequently than with those outside the group. This
problem has been intensively investigated in recent years (Pizzuti, 2008). In this part of the project, we propose to develop a new algorithm capable of optimizing several
fitness functions in parallel to identify densely connected groups of nodes with sparse connections between groups, following the multi-objective paradigm (Coello, 1999).
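As a sketch of what evaluating a candidate partition under several objectives might look like, the code below computes two fitness values (Newman's modularity and the mean internal edge density of the communities) and compares partitions by Pareto dominance. The choice of these two objectives and the toy graph are assumptions for illustration; the fitness functions of the proposed algorithm are not specified here. The graph is assumed to be simple and undirected.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]


def modularity(edges: List[Edge], community: Dict[Hashable, int]) -> float:
    """Modularity: sum over communities of L_c/m - (d_c / 2m)^2."""
    m = len(edges)
    if m == 0:
        return 0.0
    intra: Dict[int, int] = {}
    degree_sum: Dict[int, int] = {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


def internal_density(edges: List[Edge], community: Dict[Hashable, int]) -> float:
    """Mean ratio of existing to possible edges inside communities of size > 1."""
    sizes: Dict[int, int] = {}
    intra: Dict[int, int] = {}
    for c in community.values():
        sizes[c] = sizes.get(c, 0) + 1
    for u, v in edges:
        if community[u] == community[v]:
            intra[community[u]] = intra.get(community[u], 0) + 1
    densities = [intra.get(c, 0) / (n * (n - 1) / 2)
                 for c, n in sizes.items() if n > 1]
    return sum(densities) / len(densities) if densities else 0.0


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Pareto dominance when every objective is to be maximized."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_groups = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_group = {node: 0 for node in range(6)}

objectives_two = (modularity(edges, two_groups), internal_density(edges, two_groups))
objectives_one = (modularity(edges, one_group), internal_density(edges, one_group))
```

In a multi-objective search, partitions that are not dominated by any other candidate form the Pareto front, from which a final solution is selected.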
Many social media posts are accompanied by images or are solely composed of them. Hence, to take full advantage of the information in each tweet, we wish to process not
only the text but also the accompanying image content.
Extracting sentiment information from images is a hard task, since the same image can be interpreted by different people as conveying different sentiments; even for
the same person, a single image might be interpreted differently depending on the occasion on which it is viewed.
Nonetheless, several approaches to Image Sentiment Analysis (ISA) have been proposed (Jindal et al., 2015; You et al., 2015; Yuan et al., 2015) and there are already
commercial services that implement ISA, such as Microsoft Cognitive Services (https://azure.microsoft.com/en-us/services/cognitive-services) (although in this case, only face
images are processed for extracting sentiment).
In this WP, we propose to use a three-step approach for ISA. First, we intend to use deep learning approaches (Goodfellow et al., 2016), such as convolutional neural networks or
residual nets, to model sentiment in images from pre-labeled databases, such as (Borth et al., 2013). Second, we will use methods for auto-labeling of images, such as (Vinyals
et al., 2015; Karpathy et al., 2017), and infer the sentiment from the produced labels. Third, we will take advantage of the sentiment of the text that accompanies the images
to automatically label a large dataset of tweets and then train the deep learning models on this dataset. We will also combine these approaches to improve accuracy and
carry out comparative evaluations to understand their relative advantages and to infer under which conditions each should be used.
We expect this WP to make a strong contribution to complement and enrich the text analysis made on the rest of the project.
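The third step of the approach, labeling images from the sentiment of their accompanying text, can be sketched as follows. The polarity lexicon, the margin and the example tweets are hypothetical; in practice the text sentiment would come from the classifiers developed in the other work packages rather than from a word list.

```python
import re
from typing import Dict, List, NamedTuple, Optional


class Tweet(NamedTuple):
    text: str
    image_id: Optional[str]  # None when the tweet has no image


# Hypothetical polarity lexicon with scores in [-1, 1].
POLARITY: Dict[str, float] = {
    "love": 1.0, "great": 0.8, "peaceful": 0.6,
    "angry": -0.8, "violent": -0.9, "terrible": -1.0,
}


def text_polarity(text: str) -> float:
    """Average polarity of the lexicon words found in the text (0.0 if none)."""
    scores = [POLARITY[token]
              for token in re.findall(r"[a-z]+", text.lower())
              if token in POLARITY]
    return sum(scores) / len(scores) if scores else 0.0


def weak_image_labels(tweets: List[Tweet], margin: float = 0.5) -> Dict[str, str]:
    """Label an image only when the accompanying text is clearly polarized."""
    labels: Dict[str, str] = {}
    for tweet in tweets:
        if tweet.image_id is None:
            continue
        polarity = text_polarity(tweet.text)
        if polarity >= margin:
            labels[tweet.image_id] = "positive"
        elif polarity <= -margin:
            labels[tweet.image_id] = "negative"
        # Texts inside the margin are left unlabeled to limit label noise.
    return labels


tweets = [
    Tweet("Great and peaceful march today", "img1"),
    Tweet("Violent clashes, terrible scenes", "img2"),
    Tweet("Crowd gathering downtown", "img3"),
    Tweet("I love this", None),
]
labels = weak_image_labels(tweets)
```

Discarding weakly polarized texts trades dataset size for label quality, which matters because text and image sentiment do not always agree.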
The structure of the work plan of the MOVES project is described by the activities (Work Packages) and schedule presented
above, and is directly oriented to the accomplishment of the objectives that are described in these activities.
During the research and experimentation carried out in all activities, we plan to launch dissemination
elements (Work Package 5: Dissemination and Exploitation of Results) that will inform about the project,
specifically about the research, experimentation, evaluation and results that come from each activity (according
to the strategy of intellectual property protection that was defined).
University, Department and Associated Laboratories