Tutorial: “Deep and Machine Learning methods for document clustering and classification”


On Wednesday, 17 April 2019 at 3:00 PM, the tutorial “Deep and Machine Learning methods for document clustering and classification” will be held in the Conference Hall of the Laboratory of Information Technologies, JINR, as part of the AYSS-2019 Conference. The tutorial is organized by the HybriLIT heterogeneous computing team, using the ecosystem it has developed for ML/DL tasks.

The tutorial will be led by Priv.-Doz. Dr Alexej I. Streltsov.

Registration and instructions are available via the link.

IMPORTANT: Please bring your own laptops!


In this tutorial, we walk through the complete workflow of a typical Data Science project dealing with text documents. We define a problem, generate and analyze data, and explore relevant features: we discuss several ways to extract and describe semantic information and show how to augment it with additional non-semantic information, which might help to improve the results. Next, we construct and apply several standard Machine Learning (ML) models to describe our data, casting the task as both a classification and a regression problem. Then we analyze the efficiency of the ML methods as well as the role, impact and relevance of our semantic and non-semantic features. Next, we show how to apply Deep Learning (DL) methods to the same problem, considering simple DNN (Deep Neural Network) and CNN (Convolutional Neural Network) models. Finally, we contrast our ML and DL results and discuss their pros and cons: efficiency, required computational resources, and possible ways to improve them.
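
To give a flavour of the feature-engineering and classification steps described above, here is a minimal sketch in plain Python. It is not the tutorial's actual code: the toy corpus, the labels, the TF-IDF weighting as the "semantic" features, the scaled document length as the "non-semantic" feature, and the nearest-centroid classifier are all assumptions chosen only to keep the example dependency-free. In practice one would more likely use a library such as scikit-learn.

```python
import math
from collections import Counter

# Hypothetical toy corpus (text, label), invented purely for illustration.
TRAIN = [
    ("the detector measured particle collisions at high energy", "physics"),
    ("quark and gluon interactions inside the particle detector", "physics"),
    ("the neural network learns features from training data", "ml"),
    ("gradient descent trains the network on labelled data", "ml"),
]


def tokenize(text):
    """Lower-case whitespace tokenizer (a deliberately naive assumption)."""
    return text.lower().split()


def fit_idf(token_lists):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}


def featurize(text, idf):
    """Semantic part: L2-normalised TF-IDF. Non-semantic part: scaled length."""
    tokens = tokenize(text)
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    vec = {t: v / norm for t, v in vec.items()}
    # Extra non-semantic feature (an assumption): document length capped at 1.
    vec["__length__"] = min(len(tokens) / 100.0, 1.0)
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fit_centroids(train, idf):
    """Average the feature vectors of each class (nearest-centroid model)."""
    sums, counts = {}, Counter()
    for text, label in train:
        acc = sums.setdefault(label, Counter())
        for k, v in featurize(text, idf).items():
            acc[k] += v
        counts[label] += 1
    return {
        label: {k: v / counts[label] for k, v in acc.items()}
        for label, acc in sums.items()
    }


def predict(text, idf, centroids):
    """Assign the label whose centroid is most similar to the document."""
    vec = featurize(text, idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


idf = fit_idf([tokenize(text) for text, _ in TRAIN])
centroids = fit_centroids(TRAIN, idf)
prediction = predict("training a network with gradient descent", idf, centroids)
print(prediction)
```

The same feature dictionary could be fed to any other model. Removing the `__length__` entry and comparing the results is one simple way to probe how much a non-semantic feature contributes.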

The tutorial supports both active and passive participation. I will use a live Jupyter Notebook presentation to describe, discuss and execute each and every block of the Python code required for the above workflow. The corresponding blocks will be shared on a dedicated Slack channel (HybriLIT subscription required: https://web-stc.jinr.ru). If you have a valid account on the HybriLIT cluster, you will be able to copy and paste them from the Slack channel and re-execute them online in your own Notebook via the GitLab (https://jhub.jinr.ru/) service. No extra work is needed on your side to install, tune or maintain the required Python packages: JHub has already done it for you.