Project

Mining Conversational Content for Topic Modelling and Author Identification (ChatMiner)

Applicant: Crestani Fabio
Number: 130208
Funding scheme: Project funding (Div. I-III)
Research institution: Istituto del Software (SI), Facoltà di scienze informatiche
Institution of higher education: Università della Svizzera italiana - USI
Main discipline: Information Technology
Start/End: 01.04.2010 - 31.03.2013
Approved amount: 156'540.00

Keywords (6)

topic modelling; text author identification; text mining; author identification; topic modelling; information retrieval

Lay Summary (translated from Italian)

Lead
The ChatMiner project aims to analyse documents originating from online conversations in order to find the topics of discussion (what is being talked about?) and to characterise the identity of the people involved (who is talking?).
Lay summary
As Internet use has grown, the way people communicate with one another has also changed, and today it relies more and more on computer-based tools. These "online conversations" are found in various Internet services, from instant messaging (IRC, Internet Relay Chat) to blogs and micro-blogs (such as Twitter) and social networks (such as Facebook). All these services generate a large volume of textual data that offers enormous potential for analysis, especially for retrieving conversations on the basis of the topics discussed and the profiles of the people involved.
Our study focused on a comparative analysis between the new type of documents originating from online conversations (IRC chat, forums, newsgroups, Twitter) and the typical documents used in the literature for text analysis or document retrieval. We verified that these types of text can indeed be considered halfway between speech and writing, given their low formal correctness and weak adherence to grammatical and syntactic rules. Having identified the categories into which such documents can be divided and their properties, we conducted experiments on three case studies: SocialTV, microblog analysis, and the identification of paedophiles in online conversations. After creating a dedicated dataset containing a sample of online conversations (IRC), we used it to develop new models for identifying authors in online conversations. The same dataset will be used to evaluate the influence of discussion topics on identifying a particular author or retrieving a specific conversation. The final results of the project (though preliminary with respect to the work as a whole) are very encouraging; they have been published at international conferences and have received considerable attention from the scientific community.
Last update: 08.05.2013

Lay Summary (English)

Lead
The ChatMiner project aimed to analyse the messages exchanged (conversations) between users of online services, principally Internet Relay Chat (IRC), in order to discover the topics of the conversations and the profiles of the authors involved.
Lay summary
With the ever-increasing use of the Internet, computer-mediated communication via textual messaging has become popular. This type of electronic discourse is observed in point-to-point or multicast, text-based online messaging services such as chat servers, discussion forums, email and messaging services, newsgroups, IRC (Internet Relay Chat), blogging and "micro-blogging" platforms (such as Twitter and Facebook). These services generate large amounts of textual data, providing interesting research opportunities for mining such data. According to recent studies, conversational content is neither writing nor speech, but rather written speech or spoken writing, or something unique. Due to its mostly informal nature, conversational content has major syntactic differences from standard texts (e.g., word frequencies, use of punctuation marks, word orderings, intentional typos). The informal nature of conversational content makes the information obtained more realistic and reflects the author's personality more accurately. Thus, the analysis of conversational content may provide clues about both the attributes of the author of a discourse and the attributes of the discourse itself.
In this project we will extend the latest models of statistical content analysis, which are proving successful in the areas of text mining and information retrieval, to the mining of conversational content for topic identification (what is the conversation about?) and author identification (who are the people involved in the conversation?). In particular, we will develop new topic models specifically tailored to conversational content, capable not only of identifying the topical content of a segment of a conversation, but also of characterising the language of the actors involved in the conversation.
In addition, the technology can be used for more advanced analysis of the topics being discussed, for example tracking the evolution of a topic over time and across conversations (temporal topic modelling). Statistical content analysis can also be used to carry out more advanced analysis of the author's conversational style, useful for building authors' profiles (for age, gender, personality, etc.) that can be used for a variety of purposes ranging from security to advertising.
The models developed will be evaluated with real data and real users, in the context of a real set of tasks. In particular, thanks to collaborations established in previous projects, we plan to apply the models to the mining of chat room logs for the identification of anomalous and malicious conversational behaviour, and to evaluate them on that task.
Last update: 08.05.2013

Publications

Publication
Overview of the International Sexual Predator Identification Competition
Inches G., Crestani F. (2012), Overview of the International Sexual Predator Identification Competition, in CLEF Online Working Notes/Labs/Workshop, Rome, Italy.
University of Lugano at TREC 2011 microblog track
Inches G. (2011), University of Lugano at TREC 2011 microblog track, in NIST Special Publication, 1.
Investigating the Statistical Properties of User-Generated Documents
Inches Giacomo, Carman Mark James (2011), Investigating the Statistical Properties of User-Generated Documents, in Crestani, Fabio, Springer Berlin Heidelberg, 7022.
On the generation of rich content metadata from social media
Inches Giacomo, Basso Andrea (2011), On the generation of rich content metadata from social media, in Crestani, Fabio, ACM Press.
Online conversation mining for author characterization and topic identification
Inches Giacomo, Crestani Fabio (2011), Online conversation mining for author characterization and topic identification, in Proceedings of the 4th Workshop for Ph.D. students in information & knowledge management, ACM Press.
University of Lugano at TREC 2010
Keikha M., Mahdabi P., Gerani S., Inches G., Parapary J., Carman M., Crestani F. (2010), University of Lugano at TREC 2010, in NIST Special Publication, 1.

Collaboration

Group / person, Country
Types of collaboration
Monash University, Australia (Oceania)
- Publication
University of Maryland, United States of America (North America)
- in-depth/constructive exchanges on approaches, methods or results
AT&T Research Labs, United States of America (North America)
- in-depth/constructive exchanges on approaches, methods or results
- Publication
- Exchange of personnel

Scientific events

Active participation

Title | Date | Place
CLEF 2012 Conference on Multilingual and Multimodal Information Access Evaluation | 17.09.2012 | Rome, Italy
Workshop on Information and Knowledge Management (PIKM) | 28.10.2011 | Glasgow, UK
Workshop on Search and Mining User-generated Contents (SMUC) | 28.10.2011 | Glasgow, UK
Conference on Flexible Query-Answering Systems (FQAS) | 26.10.2011 | Ghent, Belgium
Conference on Information Retrieval (SIGIR) | 19.07.2010 | Geneva, Switzerland


Knowledge transfer events

Active participation

Title | Date | Place
Visit to Universidad Politécnica de Valencia, Spain | 19.02.2012 | Valencia, Spain


Abstract

In this project we will use the latest models of statistical content analysis, which are proving successful in the areas of text mining and information retrieval, for the mining of conversational content (e.g. Twitter, Facebook, etc.) for topic identification (what is the conversation about?) and author identification (who are the people involved in the conversation?). The proposed work has four measurable objectives: (1) Develop a proper evaluation framework for mining conversational content; (2) Develop a number of models for topic modelling and authorship profiling for conversational content; (3) Develop an integrated model for topic and author identification/profiling of conversational content; (4) Implement and evaluate a demonstration system of the above integrated model in a realistic application scenario. These objectives will be achieved by applying to the mining of conversational content our experience in text mining, language and topic modelling, and user/author profiling, acquired in a number of past and current research projects. In addition, the project will take advantage of and strengthen existing collaborations between the applicants and some very strong research groups in language and topic modelling and author identification.