Algorithms for the Partition and Labelling of Natural Language Document Sets

  • Jean-François Puissant

Student thesis: Master in Computer Science


When users look for information online, search engines can return thousands of
documents for a single query. Current solutions display results as a list,
and struggle to show more than a few items without drowning the user in
information. Humans tend to process only a few chunks of information at once,
but can work faster if the data is organised in a small number of groups. Fortunately,
a combination of natural language processing algorithms and graphical
representations has the potential to give users a bird's eye view of
the results, and help them navigate down to the content they are looking for.
We faced two main challenges in building such a system.
First, how to group similar documents together based on their textual content?
We propose a new scalable solution for document clustering based on
Google's Doc2Vec model and the Louvain algorithm for community detection.
This solution shows better performance than competing algorithms such as Latent
Dirichlet Allocation and K-Means. We also discovered techniques to remove
the reliance on meta-parameters with minimal performance impact. Using the
t-SNE dimensionality reduction algorithm, clusters and documents can be
visualised on a flat screen.
Second, how to label document clusters to help users understand their content
easily? We propose two models applicable to cluster and document labelling
with promising results.
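
The clustering step described above (document vectors, then community
detection on a graph of documents) can be illustrated with a small sketch.
It is not the thesis implementation. It makes several assumptions: the
vectors are hand-made stand-ins for Doc2Vec embeddings, the graph is built
by thresholding cosine similarity (the abstract does not say how the graph
is constructed), and only the local-moving phase of Louvain is shown, with
modularity recomputed naively and no graph aggregation. The t-SNE
projection is left out. All function names and the threshold value are
hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity_graph(vectors, threshold=0.8):
    """Weighted edges {(i, j): weight} for pairs above the threshold.

    The threshold is an assumption of this sketch, not a thesis value.
    """
    edges = {}
    for i, j in combinations(range(len(vectors)), 2):
        w = cosine(vectors[i], vectors[j])
        if w >= threshold:
            edges[(i, j)] = w
    return edges


def modularity(edges, community):
    """Modularity Q = sum over communities of L_c/m - (d_c/(2m))^2."""
    total = sum(edges.values())  # m: total edge weight
    if total == 0:
        return 0.0
    inside = {}  # L_c: weight of edges inside each community
    volume = {}  # d_c: sum of weighted degrees in each community
    for (i, j), w in edges.items():
        if community[i] == community[j]:
            inside[community[i]] = inside.get(community[i], 0.0) + w
        volume[community[i]] = volume.get(community[i], 0.0) + w
        volume[community[j]] = volume.get(community[j], 0.0) + w
    return sum(inside.get(c, 0.0) / total - (vol / (2 * total)) ** 2
               for c, vol in volume.items())


def local_moving(edges, n_nodes):
    """Louvain-style local moving: greedily move nodes while Q increases."""
    community = list(range(n_nodes))  # start with one community per node
    neighbours = {i: set() for i in range(n_nodes)}
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    improved = True
    while improved:
        improved = False
        for node in range(n_nodes):
            best_c = community[node]
            best_q = modularity(edges, community)
            # Only communities of direct neighbours are candidate targets.
            for c in sorted({community[n] for n in neighbours[node]}):
                trial = community[:]
                trial[node] = c
                q = modularity(edges, trial)
                if q > best_q + 1e-12:
                    best_c, best_q = c, q
            if best_c != community[node]:
                community[node] = best_c
                improved = True
    return community


# Toy stand-ins for document embeddings: two visibly separate groups.
vectors = [(1.0, 0.1), (0.9, 0.2), (1.0, 0.0),
           (0.1, 1.0), (0.2, 0.9), (0.0, 1.0)]
edges = similarity_graph(vectors)
labels = local_moving(edges, len(vectors))
print(labels, modularity(edges, labels))
```

On real data the vectors would come from a trained embedding model, and a
full Louvain implementation would also merge each community into a single
node and repeat the local-moving phase on the reduced graph.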
Date of award: 21 June 2017
Original language: English
Awarding institution
  • Universite de Namur
Supervisor: Anthony Cleve (President) & Benoit Frenay (Promoter)
