Tutorial: «Deep and Machine Learning methods for document clustering and classification»


On Wednesday, 17 April 2019, at 15:00, the hands-on tutorial «Deep and Machine Learning methods for document clustering and classification» will be held in the conference hall of the Laboratory of Information Technologies, JINR, as part of the AYSS-2019 (ОМУС-2019) conference. The tutorial is organized by the HybriLIT team and is based on the ecosystem for deep and machine learning that the team is developing.

The tutorial will be given by Dr. Alexey Streltsov.

Registration for the tutorial and the necessary instructions are available via the link.

Please note that you will need your own laptop to take part in the tutorial.


In this tutorial, we consider the complete workflow of a typical Data Science project dealing with text documents. We define a problem, generate and analyze data, and explore relevant features: we discuss several ways to extract and describe semantic information, and show how to augment it with additional non-semantic information, which might help to improve the results. Next, we construct and apply several standard Machine Learning (ML) models to describe our data, casting the task as both a classification and a regression problem. Then we analyze the efficiency of the ML methods, as well as the role, impact and relevance of our semantic and non-semantic features. After that, we show how to apply Deep Learning (DL) methods to the same problem, considering simple DNN (Deep Neural Network) and CNN (Convolutional Neural Network) models. Finally, we contrast our ML and DL results and discuss their pluses and minuses: efficiency, required computational resources, and possible ways to improve them.
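As a rough illustration of the first part of this workflow (not the tutorial's actual code), the sketch below extracts TF-IDF weights as semantic features, appends document length as one non-semantic feature, and classifies documents with a nearest-centroid rule. It uses only the Python standard library. The toy corpus, the labels, all function names, the 0.1 scaling of the length feature and the choice of classifier are assumptions made for this example; the tutorial itself may rely on different data, features and libraries.

```python
import math
from collections import Counter

# Toy corpus invented for illustration; the labels are hypothetical topics.
TRAIN = [
    ("the detector measures particle tracks in the collider", "physics"),
    ("particle beams collide and the detector records energy", "physics"),
    ("the neural network learns weights from training data", "ml"),
    ("training data and labels are used to fit the network", "ml"),
]


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every training token."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def featurize(text, idf):
    """L2-normalised TF-IDF weights (semantic) plus a length feature (non-semantic)."""
    tokens = tokenize(text)
    tf = Counter(tok for tok in tokens if tok in idf)  # unknown tokens are dropped
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    vec = {tok: v / norm for tok, v in vec.items()}
    # Assumed scaling: keep the non-semantic feature small next to TF-IDF.
    vec["__log_length__"] = 0.1 * math.log(1 + len(tokens))
    return vec


def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = {}
    for vec in vectors:
        for key, value in vec.items():
            total[key] = total.get(key, 0.0) + value
    return {key: value / len(vectors) for key, value in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Fit IDF weights and compute one centroid per class label."""
    idf = fit_idf([text for text, _ in samples])
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(featurize(text, idf))
    centroids = {label: centroid(vecs) for label, vecs in by_label.items()}
    return idf, centroids


def predict(text, idf, centroids):
    """Return the label whose centroid is most similar to the document."""
    vec = featurize(text, idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


if __name__ == "__main__":
    idf, centroids = train(TRAIN)
    print(predict("the detector records particle energy", idf, centroids))
```

In a real project the hand-written feature extraction and classifier would normally be replaced by library implementations, and the contribution of the non-semantic feature would be checked by comparing results with and without it.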

The tutorial supports both active and passive participation. I will use a live Jupyter Notebook presentation to describe, discuss and execute each and every block of the Python code required for the workflow above. The corresponding blocks will be shared on a dedicated Slack channel (HybriLIT subscription required: https://web-stc.jinr.ru). If you have a valid account on the HybriLIT cluster, you will be able to copy the blocks from the Slack channel and re-execute them online in your own Notebook via the GitLab (https://jhub.jinr.ru/) service. No extra work is needed on your side to install, tune or maintain the required Python packages: JHub has already done this for you.