Let’s learn about David!
David Mezzetti is the founder of NeuML, a data analytics and machine learning company that develops innovative products backed by machine learning. He previously co-founded Data Works and built it into a well-respected software services company of 50+ people. In August 2019, Data Works was acquired and Dave worked to ensure a successful transition.
David, what can you tell us about your background?
David: My technical background is in ETL, data extraction, data engineering and data analytics. I spent over a decade of my career developing large-scale data pipelines to transform both structured and unstructured data into formats that can be utilized in downstream systems. I also have experience in building large-scale distributed text search and Natural Language Processing (NLP) systems.
Do you have any prior experience or domain knowledge that helped you succeed in this competition?
I’ve worked in the data analytics space for 15+ years but did not have prior knowledge of medical documents or the medical industry.
How did you get started competing on Kaggle?
I’ve participated in a couple of March Madness competitions. I was looking forward to the 2020 tournament and had a model I was very excited about. The way the season went was perfect for the strengths of the model, but we’ll never know how it would have performed.
What made you decide to enter this competition?
When the 2020 March Madness competition was cancelled and COVID-19 was really starting to hit hard, I wanted to find a way to get involved and help. NeuML was working on a real-time sports event tracking application, neuspo, but sports, along with everything else, were being shut down and there were no sports to track.
With sports and life on a hiatus, I saw the Kaggle CORD-19 challenge and felt I had the background to be able to contribute. On top of everything going on, my Mom passed away in early March. She was a high school biology teacher and would have been happy to know I was involved. This effort was also a good distraction from everything going on and a way to feel like I could do my part to help beat COVID-19.
Tell us about the overall architecture or approach to the problem.
The solution consisted of two main parts: a sentence-embeddings-based search index and a custom BERT QA model that extracts column-based answers, known as summary tables, for a specific list of questions.
For each query, an embeddings query identifies the list of best matching documents. Common fields including date, title, authors and the reference url are stored as search result columns.
A custom BERT QA model was developed to add columns to the list of search results. For example, given a search of the CORD-19 dataset for “hypertension”, the answer to the question “What is the risk factor of developing severe symptoms for patients with hypertension?” is added as a separate column.
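To make the search step concrete, here is a minimal sketch of ranking documents by embedding similarity and returning the common fields as result columns. It is an illustration only, not code from the actual solution: the toy three-dimensional vectors, the `INDEX` rows and the `search` helper are all hypothetical, and a real index would hold sentence embeddings produced by a trained model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vector, index, limit=2):
    """Return the top `limit` rows ranked by similarity to the query vector."""
    scored = [(cosine(query_vector, row["vector"]), row) for row in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"score": round(score, 3), "date": row["date"], "title": row["title"]}
        for score, row in scored[:limit]
    ]

# Hypothetical toy index: titles, dates and vectors are made up for illustration.
INDEX = [
    {"title": "Hypertension and severe outcomes", "date": "2020-03-20", "vector": [0.9, 0.1, 0.0]},
    {"title": "Transmission modeling", "date": "2020-02-11", "vector": [0.0, 0.2, 0.9]},
    {"title": "Comorbidities in hospitalized patients", "date": "2020-04-02", "vector": [0.7, 0.6, 0.1]},
]

# Hypothetical embedding of the query "hypertension".
results = search([1.0, 0.2, 0.0], INDEX)
for row in results:
    print(row)
```

In a summary table, each returned row would then gain one extra column per question, filled with the answer a QA model extracts from that document.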
Did any past research or previous competitions inform your approach?
Much of the search logic was based on a prior project, codequestion (https://github.com/neuml/codequestion). codequestion builds a sentence embeddings index over coding questions to match developers’ questions with previously asked questions and answers. Given that I already had that code base, I took that approach when starting with the CORD-19 dataset, and much of the code is still derived from codequestion today.
What preprocessing and feature engineering did you do?
The CORD-19 dataset has a metadata CSV file with the full list of documents, along with the full text stored in separate JSON files. An ETL process was built to take the CSV, find the corresponding text articles and load the data into a SQLite database. The text is then broken down into sentences per document, and those sentences are mapped to sentence embeddings using a BM25 + fastText method described in this Medium article.
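One way to read "BM25 + fastText" is a sentence vector computed as a BM25-weighted average of word vectors, so that rare, informative tokens count for more than common ones. The sketch below works under that assumption and is not taken from the project's code: the tiny corpus, the two-dimensional `VECTORS` standing in for fastText word vectors, and the helper names are all hypothetical.

```python
import math
from collections import Counter

def build_idf(corpus):
    """Compute a BM25-style inverse document frequency for every token."""
    n_docs = len(corpus)
    doc_freq = Counter(token for tokens in corpus for token in set(tokens))
    return {
        token: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for token, df in doc_freq.items()
    }

def bm25_token_weights(tokens, idf, avg_len, k1=1.2, b=0.75):
    """BM25 weight for each unique token in one sentence."""
    freqs = Counter(tokens)
    norm = k1 * (1 - b + b * len(tokens) / avg_len)
    return {t: idf.get(t, 0.0) * f * (k1 + 1) / (f + norm) for t, f in freqs.items()}

def embed(tokens, vectors, idf, avg_len):
    """Weighted average of word vectors; tokens without a vector are skipped."""
    weights = bm25_token_weights(tokens, idf, avg_len)
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    weight_sum = 0.0
    for token, weight in weights.items():
        if token in vectors:
            total = [acc + weight * v for acc, v in zip(total, vectors[token])]
            weight_sum += weight
    return [v / weight_sum for v in total] if weight_sum else total

# Hypothetical tokenized sentences and toy word vectors (stand-ins for fastText).
CORPUS = [
    ["the", "patients", "had", "hypertension"],
    ["the", "model", "forecasts", "the", "spread"],
    ["the", "patients", "recovered"],
]
VECTORS = {
    "the": [0.0, 1.0],
    "patients": [0.5, 0.5],
    "had": [0.2, 0.8],
    "hypertension": [1.0, 0.0],
}

IDF = build_idf(CORPUS)
AVG_LEN = sum(len(tokens) for tokens in CORPUS) / len(CORPUS)
sentence_vector = embed(CORPUS[0], VECTORS, IDF, AVG_LEN)
print(sentence_vector)
```

Because "hypertension" appears in only one sentence while "the" appears in all three, the weighted vector is pulled toward "hypertension" more strongly than a plain average would be.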
What supervised learning methods did you use?
All search and question-answering was unsupervised, using fastText+BM25 and a BERT-based model for QA.
An important concept discovered early on was the importance of study design. Not all articles are considered equal, and the medical community puts more weight behind certain study types. For example, studies with a larger sample size (i.e. more patients) or systematic reviews are held in higher regard than mathematical modeling/forecasting articles. A Random Forest classifier was built to analyze articles and determine the study design based on the word tokens and named entities within an article.
What was your most important insight into the data?
The CORD-19 dataset is dynamic and growing. I saw that almost everyone, including myself, took the first approach of building a search index that finds documents based on matching tokens. Summarization was also seen as a way to add value, the thought being to show researchers all data on a particular term or concept. Building on the previous point about study design, not all documents are of equal value. Labeling documents with a study design proved greatly beneficial in helping researchers choose which documents to review, compared with just showing documents with matching tokens.
Additionally, where tokens show up in a document is important. Some articles reference a concept in the introduction or discussion sections, but the article doesn’t cover that concept. Most medical articles have methods & results sections, and matches in those sections are more important.
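The point about section placement can be sketched as a simple rescoring step that scales each match by the section it was found in. This is only an illustration of the idea: the weight values, the `section_weight` and `rank_matches` helpers, and the sample matches are hypothetical and do not come from the solution.

```python
# Hypothetical weights: methods/results matches count fully, others are discounted.
SECTION_WEIGHTS = {
    "methods": 1.0,
    "results": 1.0,
    "introduction": 0.5,
    "discussion": 0.5,
}
DEFAULT_WEIGHT = 0.75  # assumed weight for sections not listed above

def section_weight(name):
    """Look up a weight by checking whether a known section name occurs in the heading."""
    key = name.strip().lower()
    for section, weight in SECTION_WEIGHTS.items():
        if section in key:
            return weight
    return DEFAULT_WEIGHT

def rank_matches(matches):
    """Rescale each match score by its section weight and sort best first."""
    rescored = [(m["score"] * section_weight(m["section"]), m) for m in matches]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [dict(m, score=round(score, 3)) for score, m in rescored]

# Hypothetical matches for one query, with raw similarity scores.
matches = [
    {"section": "Introduction", "score": 0.9},
    {"section": "Results", "score": 0.8},
    {"section": "Materials and Methods", "score": 0.6},
]

ranked = rank_matches(matches)
for match in ranked:
    print(match)
```

Here the introduction match starts with the highest raw score but ends up last, since a mention in the introduction says less about what the article actually studied.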
Were you surprised by any of your findings?
I had little to no expectations entering this competition, so I wouldn’t say I was surprised by anything. It was great to see so many smart and capable people all working together to try to help in whatever way they could.
Which tools did you use?
All of the work is driven by the Kaggle platform. The list of notebooks covers all the submissions for Round 1 and Round 2 of the CORD-19 challenge. All of the notebooks are in Python.
There is also a separate Python project on GitHub, cord19q. cord19q has the logic for ETL, building the embeddings index and running the custom BERT QA model.
How did you spend your time on this competition?
The early days of the effort were spent on EDA and exchanging ideas with other members of the community. Before models could be built, we needed to understand the data, the strengths and weaknesses of the dataset, and what researchers were looking for from it. I was fortunate enough to find like-minded data scientists who were willing to roll up their sleeves and write code to help discover what we wanted from the data. It wasn’t until 1–2 months into the effort that machine learning models and feature engineering were even considered. Most of Round 1 was focused on data extraction, parsing, requirements analysis and building a system to search for documents.
The work of Round 1 led to the discovery that building summary tables with extracted answers to a series of questions would be most beneficial to the medical community. Fortunately, a team of medical experts manually curated a dataset that could be used to help build machine learning models. In Round 2, a BERT-based QA model was developed to extract answers from medical documents. This required building a custom question-answer dataset to teach a model how to answer medical questions. In Round 2, the majority of time was spent on building this model.
What does your hardware setup look like?
All of the submissions were built on the Kaggle platform as CPU Notebooks. Development was done on a quad-core laptop with an 8GB GPU and 32GB of RAM. The fastText embeddings, study design models, and custom BERT QA models were built offline using this laptop.
What was the run time for both training and prediction of your winning solution?
Given that the data is continually updated, there is a recurring job that runs on each update (using kernelpipes). It takes about 6 hours to fully ETL, build the models and run all the solution notebooks on Kaggle.
What have you taken away from this competition?
This challenge was unique for a number of reasons. First, there was no known answer. This was a real-world problem like you would encounter in industry, where someone has a large dataset and isn’t sure what to do with it. This kind of problem requires an iterative process of exploring the data, sharing feedback with experts and building a workflow to solve the problem. The data scientists involved in this effort were extremely fortunate to be guided by Savanna Reid, an epidemiologist volunteering her time. We were also fortunate that Kaggle was heavily involved, with Paul Mooney and Anthony Goldbloom helping guide the effort. I was fortunate to be able to bounce ideas off other data scientists working on the effort, specifically Ken Miller and Andy White.
It was an honor to volunteer, and while I’ll never know the true impact these contributions made, I like to think they did a small part to help.
Looking back, what would you do differently now?
Entering the competition, my first instinct was to use sentence embeddings since I had an existing similar project. If starting over, I would have explored different methods to search the documents to see if any other methods performed better.
Do you have any advice for those just getting started in data science?
Much of your time will be spent on data preparation and feature engineering. The best way to learn data science is to solve a problem you’re interested in. Sports analytics is how I got started in data science. It was an engaging way for me to stay focused not only on the algorithms but also on the data itself.