How one Kaggler took top marks across multiple Covid-related challenges.
Today we interview Daniel, whose notebooks earned him top marks in Kaggle’s CORD-19 challenges. Kaggle hosted multiple challenges that worked with the Kaggle CORD-19 dataset, and Daniel won 1st place three times, including by a huge margin in the TREC-COVID challenge. (He had a score of 0.9, 2nd place overall had a score of 0.75, and 2nd place on Kaggle had a score of 0.6.)
Daniel: I’m Daniel Wolffram, a graduate student in mathematics and a data science student assistant at Karlsruhe Institute of Technology (KIT), in Germany. My research interests include probabilistic forecasting, causal inference and machine learning.
As part of the Kaggle CORD-19 challenge I developed discovid.ai, a search engine for COVID-19 literature. Right now, I’m working on the German COVID-19 forecast hub and writing my master’s thesis about building and evaluating forecast ensembles for COVID-19 death counts.
Well, it’s no surprise you took top marks in the CORD-19 Challenge! That’s quite relevant!
Daniel: Indeed. As a student assistant, I’ve worked on several data science projects over the last three years and had the opportunity to work with real-world data from different companies in highly diverse domains: from predicting the waste in a sawmill to analyzing flaws in the process of surface galvanization and testing the efficiency of a marketing campaign.
During my time as a student assistant, we also consulted for a company that works with a lot of text data. That’s where I gained my first experience in NLP and came across the idea of finding similar documents with the help of a topic model. At that time, our client wanted to stick with another approach, so I never really got to try out the LDA approach, but it always stayed in the back of my mind.
How did you get started competing on Kaggle?
During my undergraduate studies I joined a university group where we taught ourselves the basics of data science — mostly by working on Kaggle projects such as the Titanic or Instacart challenge. That’s also how I got my job as a student assistant, because I met one of my now-colleagues there.
What made you decide to enter this particular competition?
A friend of mine showed me this competition and I was excited right away. I remembered the LDA approach and just wanted to try it out.
Moreover, when the competition was launched, Covid cases were climbing in Germany, where I live. The first protective measures to flatten the curve were taken here: all restaurants, shops (except supermarkets and drugstores) and leisure facilities were closed. My university was closed and all exams were cancelled. More shocking were the numbers from Italy and elsewhere. It was a very intimidating and uncertain atmosphere, so this challenge was a way to regain some control by facing the crisis head-on and putting my skills to good use. I was aware that it might not have the biggest impact, but what kept me going was the thought that if even one medical researcher used my model and stumbled upon something useful, my efforts would already have been worth it.
What preprocessing and feature engineering did you do?
To normalize the documents, I removed stop words and performed tokenization and lemmatization. This last step was critical here, since the CORD-19 dataset contains highly technical papers whose scientific language can’t be processed successfully by standard packages. It was important to use scispacy, a package that specializes in processing biomedical, scientific or clinical text and thus can also normalize technical terms (such as chemical elements, drug names, etc.).
For the topic model to work properly, it was also necessary to perform language detection and remove non-English documents.
All the details can be found in my preprocessing notebook: https://www.kaggle.com/danielwolffram/cord-19-create-dataframe.
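The language-filtering step can be sketched roughly as follows. This is not the code from the notebook: it swaps in a simple stop-word-overlap heuristic for a real language-detection package, and the word lists as well as the names `guess_language` and `keep_english` are hypothetical, chosen only for illustration.

```python
import re

# Tiny illustrative stop-word lists. Assumption: a real pipeline would rely on
# a dedicated language-detection library instead of hand-picked words.
STOP_WORDS = {
    "en": {"the", "and", "of", "in", "to", "is", "with", "for", "that", "was"},
    "de": {"der", "die", "und", "bei", "mit", "von", "eine", "ist", "werden", "zu"},
    "fr": {"les", "des", "une", "est", "dans", "du", "par", "sont", "pour", "avec"},
    "es": {"el", "los", "que", "se", "con", "las", "por", "es", "para", "como"},
}


def guess_language(text):
    """Return the language whose stop words occur most often, or None."""
    # Split into lowercase words made of letters only (accents included).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOP_WORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def keep_english(documents):
    """Keep only the documents that the heuristic labels as English."""
    return [doc for doc in documents if guess_language(doc) == "en"]
```

A heuristic like this is only reliable on longer texts; for short abstracts or mixed-language papers a proper detector would be the safer choice.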
To further augment the data, I also searched each article for clinical trial ids to link the document to the WHO International Clinical Trials Registry Platform (ICTRP), which required hand crafting several regular expressions — the details can be found in https://www.kaggle.com/danielwolffram/cord-19-match-clinical-trials.
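As a rough illustration of what such expressions can look like, here is a minimal sketch covering three common registry formats. The choice of registries and the name `find_trial_ids` are assumptions made for this example; the expressions Daniel actually used are in the linked notebook.

```python
import re

# Identifier formats of three registries (an illustrative subset, not the
# full set of registries linked through the ICTRP).
TRIAL_ID_PATTERNS = {
    # ClinicalTrials.gov: "NCT" followed by eight digits.
    "ClinicalTrials.gov": re.compile(r"\bNCT\d{8}\b"),
    # ISRCTN registry: "ISRCTN" followed by eight digits (written without a space).
    "ISRCTN": re.compile(r"\bISRCTN\d{8}\b"),
    # EudraCT numbers have the shape YYYY-NNNNNN-CC.
    "EudraCT": re.compile(r"\b\d{4}-\d{6}-\d{2}\b"),
}


def find_trial_ids(text):
    """Return a dict mapping registry name to the sorted, unique ids found."""
    found = {}
    for registry, pattern in TRIAL_ID_PATTERNS.items():
        ids = sorted(set(pattern.findall(text)))
        if ids:
            found[registry] = ids
    return found
```

Real papers also contain variants such as identifiers split by spaces or line breaks, which is presumably why several hand-crafted expressions were needed.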
What machine learning methods did you use?
I used Latent Dirichlet Allocation (LDA), an unsupervised topic model that learns hidden semantic relationships within the corpus. Initially, this was used to find relevant articles for each task of the CORD-19 challenge. But as we moved the approach to our website, we implemented a more conventional search engine with Whoosh, which allows for classical keyword searches or more complex boolean queries.
On discovid.ai the topic model is now used to find related articles — the idea is that each article is composed of a set of underlying topics and if we find articles with a similar topic mixture or an overlap in topics, they might be interesting for the reader and could spark new insights.
Our model found 50 topics within the corpus. Each topic is a distribution over words, and each document can then be seen as a mixture of these topics.
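To make the idea of "similar topic mixtures" concrete, here is a minimal sketch that ranks articles by how close their topic distributions are. It assumes the topic mixtures have already been inferred by a topic model, and it uses the Jensen-Shannon divergence as the similarity measure; the interview does not say which measure discovid.ai uses, so that choice, like the function names, is an assumption.

```python
import math


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two topic mixtures.

    Both inputs are probability vectors of equal length. The result is 0 for
    identical mixtures and 1 for mixtures that share no topic at all.
    """
    midpoint = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, midpoint) + 0.5 * kl(q, midpoint)


def related_articles(query_id, topic_mixtures, top_n=5):
    """Return the ids of the top_n articles closest to query_id."""
    query = topic_mixtures[query_id]
    scored = [
        (jensen_shannon(query, mixture), doc_id)
        for doc_id, mixture in topic_mixtures.items()
        if doc_id != query_id
    ]
    # Smallest divergence first, i.e. the most similar mixtures.
    return [doc_id for _, doc_id in sorted(scored)[:top_n]]
```

With 50 topics, each mixture would simply be a vector of 50 probabilities that sum to one.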
What was your most important insight into the data?
Before removing the non-English articles from the corpus, interestingly, the following topics had been discovered by our topic model:
- Topic #46: der die und bei mit von eine ist werden zu für sind oder einer des den nicht das als nach zur auf durch auch ein
- Topic #40: de les des en une est dans du par un ou sont pour plus au que avec chez sur d’une qui cas être pas ces
- Topic #32: de en el los que se con las por un es para pacientes como más virus son tratamiento su infección puede ha casos enfermedad entre
- Topic #7: un che con sono nel alla più ha tra gli degli come rischio ed pazienti nella nei osteonecrosis ad essere stato studio salute anche have
As you can see, there was one topic each for German, French, Spanish and Italian. To me this was very encouraging, because it demonstrates how powerful LDA is at learning hidden structures and that it actually learns something meaningful.
Were you surprised by any of your findings?
When people first tried out our search engine, it became clear that they only search for a few keywords, unlike the tasks on Kaggle, which were composed of much more text. This was quite a problem, because the queries were simply too short to infer topics in a useful manner. That’s when I decided to implement a more conventional search engine with Whoosh as an initial search (https://www.kaggle.com/danielwolffram/whoosh-search). The topic model is now only used to find related articles that are composed of similar topics, which enables users to easily browse the corpus and discover new insights.
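The principle behind such a keyword search with boolean operators can be illustrated with a small inverted index in plain Python. This is deliberately not Whoosh's API, and it leaves out everything a real search library adds (ranking, stemming, query parsing); the function names are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index


def search_all(index, *keywords):
    """Boolean AND: documents that contain every keyword."""
    hits = [index.get(keyword.lower(), set()) for keyword in keywords]
    return set.intersection(*hits) if hits else set()


def search_any(index, *keywords):
    """Boolean OR: documents that contain at least one keyword."""
    return set().union(*(index.get(keyword.lower(), set()) for keyword in keywords))
```

Short queries work well here because every keyword is matched exactly, whereas a topic model needs more text to infer a meaningful topic mixture.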
How did you spend your time on this competition?
As is so often the case, most of my effort went into data preparation and cleaning. Especially in the beginning, there were many changes in the data structure, which required a lot of adjustments. I also read a lot in the forum and talked to some people with a medical background to identify the needs of the community. That’s why we also extract methodological keywords as a first quality indicator and add cross-references to clinical trials that are mentioned in the papers. I also spent a good amount of time learning and figuring out new things, such as language detection or building a custom search engine with Whoosh, which I had never done before.
What was the run time for both training and prediction of your winning solution?
Transforming the documents and training the topic model takes roughly a day.
How did your team form?
I started out on my own and built some widgets in a Kaggle notebook to easily explore the CORD-19 dataset. But with the good feedback and increasing interest in my approach, I wanted to make it more user-friendly, so it could also be used without a technical background. That’s when I got in touch with one of my colleagues, who didn’t hesitate to assist me and who assembled a small team to build our website discovid.ai.
How did your team work together?
Two of my colleagues were working on the backend and frontend, another one got it up and running on the server and my girlfriend came up with the great design and also animated our introduction video.
How did competing on a team help you succeed?
It definitely helped me to build a more well-rounded solution that is user-friendly and accessible to anyone.
What is your dream job?
I’m really drawn to data science in the medical field, because I wish to use my analytical skills in a meaningful project that helps others. I think that’s also what kept me going throughout the CORD-19 challenge: it was never about winning, but about putting my strengths to good use and doing my part in this global crisis.
What have you taken away from this competition?
It was a very meaningful project to me and along the way I got to know many interesting and inspiring people from all over the world. It was great to see how researchers from all around the globe came together to search for answers to this global pandemic, which affects each one of us in different ways and paradoxically unites us all.
Do you have any advice for those just getting started in data science?
Just get started! I think it’s important to get practical experience and learn how to handle different kinds of data, so you can easily transform it to a format you can work with. But as a math student, I also have to say that you shouldn’t neglect the fundamentals such as probability theory and statistics, because after all data science is a science, so it’s important to get an intuition about uncertainty and the limitations of different approaches.
Also, I think it’s always important to first get a clear understanding of the problem you are trying to solve, before throwing the most complex machine learning models at it.
You can find Daniel’s winning submission for CORD-19 here: https://www.kaggle.com/danielwolffram/discovid-ai-a-search-and-recommendation-engine