The title of this article is a mouthful, but it does not exaggerate. Solutions for all the above problems are actively researched by data scientists and artificial intelligence enthusiasts on Kaggle, Google’s AirBnB for data scientists. There is a lot to cover in this article, so let us get started!
Given below is a short summary of what I will share in this article:
- I will explain who Kaggle caters to and what it can give to data scientists as well as entrepreneurs.
- I will share my personal experience with Kaggle challenges and what I learned.
- I will describe some interesting Kaggle challenges that may get you inspired.
- I will share and describe my solution for the Titanic challenge, in which I participated. (*This article has 3 parts, and links to the other parts are given at the end of this part.)
AirBnB for data scientists
Kaggle is where data scientists spend their nights and weekends. It’s where artificial intelligence enthusiasts get practical experience solving real-life problems. At a high level, it allows companies or individuals to start and fund Kaggle Contests. Contests range from small to large in impact and funding, with new ones being added every week. Currently there are 7 contests with cash rewards, ranging from $6,000 to $100,000. Cash rewards are interesting, but they are not all that Kaggle is about. Some challenges are added by academics and offer nothing more than swag (Stuff We All Get, basically mugs and t-shirts) or kudos, which is simply a token of appreciation. Whether you go for the Heritage Health Prize of $500,000 or for Kaggle swag, you can have fun reversing the famous Game of Life and learn a lot in the process. Each contest has a small community feature in the form of a leaderboard and a notebooks section. Notebooks is a platform where participants share their ideas and even complete solutions, including full analysis and well-described code.
What can we learn in real-life challenges?
While it may not be as cool as predicting volcanic eruptions, my choice of challenge was Cornell Birdcall Identification. The challenge was to recognize the species of a bird recorded in its natural habitat (often not so quiet forests, meadows etc.). The goal was to help enhance automated systems for tracking bird populations and migration. The challenge might look like a standard classification problem, but it is not! There are over 10,000 bird species in the world. The training and test data for our classifier consisted of 23 GB of audio files of varying length and quality. The audio content was grouped into 234 directories, each holding the recordings of a different species. So we are not talking about a simple two-layer cat and dog classifier. We are dealing with a complex, possibly multi-input, multi-model network architecture with a sophisticated input data processing pipeline. Here are the key things I learned in approaching this problem:
- Any neural network problem that works on audio streams is really an image recognition problem. Making a prediction based on a list of numeric values, which is essentially what an audio stream is, is not that easy. If, however, you convert the audio stream into an image using one of the many existing libraries, you will be able to use convolutional neural networks, which are extremely good at finding patterns in images.
I ended up building a little program that went through all the available files and generated 10 different visual representations of sound for every one of them. The challenge then was to pick the one representation, or the combination of them, that yielded the best results in training.
- Data pre-processing is the key. It really does not matter how good your classifier model is if you train it on garbage: noisy, inconsistent audio files. One of the things I learned here was noise reduction, which in Python you can achieve with a combination of the Librosa and Noisereduce libraries.
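To make the two points above concrete, here is a minimal sketch, not the program used in the contest. It relies on NumPy only as a simplified stand-in for Librosa and Noisereduce, runs on a synthetic one-second tone instead of a real bird recording, and the function names (stft, spectral_gate, to_image) and all parameter values are assumptions chosen for illustration. It estimates a noise floor from a noise-only clip, silences everything below that floor, and turns what is left into a grayscale image array of the kind a convolutional network can consume.

```python
import numpy as np


def stft(signal, frame_len=512, hop=256):
    """Short-time Fourier transform: rows are frequency bins, columns are frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T  # shape: (frame_len // 2 + 1, n_frames)


def spectral_gate(spec, noise_spec, n_std=1.5):
    """Zero out time-frequency cells that do not rise above the noise floor."""
    noise_mag = np.abs(noise_spec)
    # Per-frequency threshold estimated from the noise-only clip (assumed heuristic).
    threshold = noise_mag.mean(axis=1) + n_std * noise_mag.std(axis=1)
    mask = np.abs(spec) > threshold[:, None]
    return spec * mask


def to_image(spec, floor_db=-80.0):
    """Convert a complex spectrogram into an 8-bit grayscale image array."""
    mag = np.abs(spec)
    # Decibels relative to the loudest cell, clipped to [floor_db, 0].
    db = 20.0 * np.log10(np.maximum(mag, 1e-10) / max(mag.max(), 1e-10))
    db = np.clip(db, floor_db, 0.0)
    return ((db - floor_db) / -floor_db * 255).astype(np.uint8)


# Synthetic stand-in for a field recording: a rising tone buried in noise.
rng = np.random.default_rng(0)
sr = 22050                                   # samples per second
t = np.arange(sr) / sr                       # one second of audio
chirp = 0.5 * np.sin(2 * np.pi * (2000 + 1500 * t) * t)  # sweeps 2 kHz to 5 kHz
recording = chirp + 0.05 * rng.standard_normal(sr)
noise_only = 0.05 * rng.standard_normal(sr)  # clip containing background noise only

spec = stft(recording)
clean = spectral_gate(spec, stft(noise_only))
image = to_image(clean)                      # 2-D array, ready to feed to a CNN
```

With these settings the image is a 257 by 85 array of 8-bit pixels, with frequency on one axis and time on the other. On real recordings, the window length, the gating threshold and the choice of visual representation would all need tuning, which is exactly where trying several representations pays off.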
Get inspired by Kaggle Contests
Below are some current and historical challenges to get you motivated and perhaps inspired. For the historical ones, check the notebooks and discussion sections, which are vast knowledge hubs.
With a prize of $30,000, Lyft asks you to help with the motion prediction of common traffic participants such as cars, pedestrians and cyclists. Your efforts can help bring autonomous vehicles to our streets in the not so distant future. As a bonus, you will get access to the largest collection of traffic agent motion data, consisting of 1000+ hours of traffic agent movement.
If you have ever been stuck in a never-ending screening queue at a large airport, you will definitely understand that there is plenty of room for improvement in screening procedures. The US Department of Homeland Security (DHS) asked the Kaggle community to help improve its Apex Screening at Speed Program. The issue of long queues is directly related to a high number of false alarms, which force airport staff to perform manual security screening. Even if you do not care about airports and people wasting time in endless queues, you might still get behind it for the record $1,500,000 the DHS allotted for the winners!
Airbus needed the Kaggle community’s help to speed up and improve the quality of ship recognition in satellite images. The test data consisted of 40,000 image chips, which apparently is a typical size of a single satellite image… The results achieved by the Kaggle community helped improve open sea monitoring services that work to prevent drug trafficking, illegal cargo movement and tragic open sea catastrophes.
Pulmonary embolism causes 60,000 to 100,000 deaths in the United States every year. The Radiological Society of North America (RSNA®) needs your help in analysing the overwhelmingly large sets of images (hundreds) that are created in every single patient scan. Radiologists’ time is not the only issue: because of the volume and complexity, pulmonary embolism cases are extremely prone to over-diagnosis, which puts a strain on the healthcare system itself. A machine learning model capable of image analysis and accurate diagnosis would be of great value, not only for RSNA but for society in general.
What about the Titanic challenge?
Well, it appears I have overdone the content a bit, so the practical case study of a real Kaggle contest solution and submission has to be published as two separate articles:
- Part 1 will introduce us to the challenge context and available data
- Part 2 will go over the data processing, which is the key thing in this type of challenge
As for this high-level overview of Kaggle, please leave your thoughts in the comments box below, as always.
The author of this blog is Patryk Borowa, Aspire Systems.