Project

Mining Conversational Content for Topic Modelling and Author Identification (ChatMiner)

Applicant Crestani Fabio
Number 130208
Funding instrument Project funding (Div. I-III)
Research institution Facoltà di scienze informatiche Università della Svizzera italiana
Institution of higher education Università della Svizzera italiana – USI
Main discipline Computer science
Start/End 01.04.2010 - 31.03.2013
Approved amount 156'540.00

Keywords (7)

parallel; topic modelling; text author identification; text mining; author identification; topic modelling; information retrieval

Lay Summary (Italian)

Lead
The ChatMiner project aims to analyse documents originating from online conversations in order to find the topics of discussion (what is being talked about?) and to characterise the identity of the people involved (who is talking?).
Lay summary
As Internet use has grown, the way people communicate with one another has also changed, and today it relies more and more on digital tools. These "online conversations" are found in a range of Internet services, from instant messaging (IRC, Internet Relay Chat) to blogs and micro-blogs (such as Twitter) and social networks (such as Facebook). All these services generate a large volume of textual data that offers enormous potential for analysis, especially for retrieving conversations on the basis of the topics of discussion and the profile of the people involved.
Our study focused on a comparative analysis between the new type of documents originating from online conversations (IRC chat, forums, newsgroups, Twitter) and the typical documents used in the literature for text analysis or document retrieval. We verified that these types of text can indeed be regarded as halfway between speech and writing, given their low formal correctness and loose adherence to grammatical and syntactic rules. Having identified the categories into which such documents can be divided and their properties, we conducted experiments for three case studies: SocialTv, microblog analysis, and the identification of paedophiles in online conversations. After creating a dedicated dataset containing a sample of online conversations (IRC), we used it to develop new models for identifying authors in online conversations. The same dataset will be used to assess the influence of discussion topics on identifying a particular author or retrieving a specific conversation. The final results of the project (though preliminary with respect to the work as a whole) are very encouraging; they have been published at international conferences and have received considerable attention from the scientific community.
Last updated: 08.05.2013

Lay Summary (English)

Lead
The ChatMiner project aimed at analyzing message exchanges (conversations) between users of online services, principally Internet Relay Chat (IRC), to discover the topics of the conversations and the profiles of the authors involved.
Lay summary
With the ever-increasing use of the Internet, computer-mediated communication via textual messaging has become popular. This type of electronic discourse is observed in point-to-point or multicast, text-based online messaging services such as chat servers, discussion forums, email and messaging services, newsgroups, IRC (Internet Relay Chat), and blogging and "micro-blogging" platforms (such as Twitter and Facebook). These services generate large amounts of textual data, providing interesting research opportunities for mining such data. According to recent studies, conversational content is neither writing nor speech, but rather written speech or spoken writing, or something unique. Due to its mostly informal nature, conversational content has major syntactic differences from standard texts (e.g., word frequencies, use of punctuation marks, word orderings, intentional typos). The informal nature of conversational content makes the information obtained more realistic and reflects the author's personality more accurately. Thus, the analysis of conversational content may provide clues about both the attributes of the author of a discourse and the attributes of the discourse itself.
In this project we will extend the latest models of statistical content analysis, which are proving successful in the areas of text mining and information retrieval, to the mining of conversational content for topic identification (what is the conversation about?) and author identification (who are the people involved in the conversation?). In particular, we will develop new topic models specifically tailored to conversational content, capable not only of identifying the topical content of a segment of a conversation, but also of characterising the language of the actors involved in the conversation.
In addition, the technology can be used for more advanced analysis of the topics being discussed, for example tracking the evolution of a topic over time and across conversations (temporal topic modelling). Statistical content analysis can also be used to carry out more advanced analysis of the author's conversational style, useful for building author profiles (age, gender, personality, etc.) that can serve a variety of purposes ranging from security to advertising.
The models developed will be evaluated with real data and real users, in the context of a real set of tasks. In particular, thanks to collaborations established in previous projects, we plan to apply the models to the mining of chat room logs for the identification of anomalous and malicious conversational behaviour, and to evaluate them on that task.
Last updated: 08.05.2013

Publications

Publication
Overview of the International Sexual Predator Identification Competition
Inches G., Crestani F. (2012), Overview of the International Sexual Predator Identification Competition, in CLEF Online Working Notes/Labs/Workshop, Rome, Italy.
University of Lugano at TREC 2011 microblog track
Inches G. (2011), University of Lugano at TREC 2011 microblog track, in NIST Special Publication, 1.
Investigating the Statistical Properties of User-Generated Documents
Inches Giacomo, Carman Mark James (2011), Investigating the Statistical Properties of User-Generated Documents, in Crestani, Fabio, 7022, Springer Berlin Heidelberg, 7022.
On the generation of rich content metadata from social media
Inches Giacomo, Basso Andrea (2011), On the generation of rich content metadata from social media, in Crestani, Fabio, ACM Press.
Online conversation mining for author characterization and topic identification
Inches Giacomo, Crestani Fabio (2011), Online conversation mining for author characterization and topic identification, in Proceedings of the 4th Workshop for Ph.D. students in information & knowledge management, ACM Press.
University of Lugano at TREC 2010
Keikha M., Mahdabi P., Gerani S., Inches G., Parapary J., Carman M., Crestani F. (2010), University of Lugano at TREC 2010, in NIST Special Publication, 1.

Collaboration

Group / Person Country
Types of collaboration
Monash University Australia (Oceania)
- Publication
University of Maryland United States of America (North America)
- In-depth/constructive exchanges on approaches, methods or results
AT&T Research Labs United States of America (North America)
- In-depth/constructive exchanges on approaches, methods or results
- Publication
- Exchange of personnel

Scientific events

Active participation

Title Type of contribution Title of article or contribution Date Place Persons involved
CLEF 2012 Conference on Multilingual and Multimodal Information Access Evaluation 17.09.2012 Rome, Italy
Workshop on Information and Knowledge Management (PIKM) 28.10.2011 Glasgow, UK
Workshop on Search and Mining User-generated Contents (SMUC) 28.10.2011 Glasgow, UK
Conference on Flexible Query-Answering Systems (FQAS) 26.10.2011 Ghent, Belgium
Conference on Information Retrieval (SIGIR) 19.07.2010 Geneva, Switzerland


Knowledge transfer events

Active participation

Title Type of contribution Title of article or contribution Date Place Persons involved
Visit to Universidad Politécnica de Valencia, Spain 19.02.2012 Valencia, Spain


Abstract

In this project we will use the latest models of statistical content analysis that are proving successful in the areas of text mining and information retrieval for the mining of conversational content (e.g. Twitter, Facebook, etc.) for topic identification (what is the conversation about?) and author identification (who are the people involved in the conversation?). Thus, the work proposed has four measurable objectives: (1) Develop a proper evaluation framework for mining conversational content; (2) Develop a number of models for topic modelling and authorship profiling for conversational content; (3) Develop an integrated model for topic and author identification/profiling of conversational content; (4) Implement and evaluate a demonstration system of the above integrated model in a realistic application scenario. These objectives will be achieved by applying to the mining of conversational content our past experience in text mining, language and topic modelling, and user/author profiling acquired in a number of past and current research projects. In addition, the project will take advantage of and strengthen existing collaborations between the applicants and some very strong research groups in language and topic modelling and author identification.