Smart Probabilistic Modeling of Heterogeneous Information and Text Networks
University: University of Athens
We propose TAHINI, an intelligent and scalable probabilistic framework for mining Text Augmented Heterogeneous Information Networks (TA-HINets), which are composed of interconnected entities characterized by free-text attributes (e.g., papers, web pages) or other Bag of Words (BoW) representations (e.g., user actions), together with related side information (e.g., labels, metadata, tags, images).
First, we propose an innovative workflow for transforming different kinds of data and modalities into multiple interrelated BoW vectors that form a star around one central entity, capturing all information spaces.
Then, building upon the well-established Latent Dirichlet Allocation (LDA), we propose MIX-LDA, a new multi-modal probabilistic topic model for interrelated count data that infers both single-modal (private) and multi-modal (shared) topics, adapting to the extent of correlation between the different modalities and leveraging statistical strength across them. Finally, we present a scalable Gibbs sampling technique for inference and demonstrate the efficiency of the proposed framework in several real-world experiments, inferring interesting patterns, groups, similarities, and latent interrelationships within and across different data types and modalities.
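To illustrate the kind of inference involved, the following minimal sketch implements collapsed Gibbs sampling for standard, single-modality LDA. It is not MIX-LDA and not the scalable sampler proposed here. The function name `lda_gibbs`, the hyperparameter values, and the toy corpus are assumptions made purely for illustration.

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampler for plain LDA (hypothetical helper, illustration only).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (theta, phi): document-topic and topic-word distributions.
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))   # tokens in doc d assigned to topic k
    n_kw = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic k
    n_k = np.zeros(n_topics)                 # total tokens assigned to topic k

    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z_d = rng.integers(n_topics, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = z[d][i]
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # Full conditional p(z = k | everything else), up to a constant.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                # Add the token back under its newly sampled topic.
                z[d][i] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1

    # Posterior mean estimates from the final counts.
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + n_topics * alpha)
    phi = (n_kw + beta) / (n_k[:, None] + vocab_size * beta)
    return theta, phi

# Toy corpus (assumed): two documents over words 0-2, two over words 3-5.
docs = [[0, 1, 2, 0, 1], [1, 2, 0, 2], [3, 4, 5, 3], [4, 5, 3, 5, 4]]
theta, phi = lda_gibbs(docs, n_topics=2, vocab_size=6)
```

Each row of `theta` is a document's topic mixture and each row of `phi` is a topic's word distribution. A multi-modal model such as the one described above would maintain separate count tables per modality, which this sketch does not attempt.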