My first Kaggle competition is about the Titanic, one of the most famous shipwrecks in history. Most of us have seen the film, so let's cut straight to the problem. The Kaggle competition asks us to predict whether each passenger survived the sinking. We are given two data sets (Train and Test), each of which includes predictor variables such as Age, Passenger class, Sex, etc. With these two data sets we will do the following.
1. Create a model that predicts whether a passenger survived, using only the Train data set.
2. Predict whether the passengers in the Test data set survived, based on the model we created.
First we set up the working directory in RStudio using the setwd() function. Then we read in the Test and Train data sets. Next we write code to compare the numbers of survivors and deceased between male and female passengers. The screenshot looks like this.
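A minimal sketch of this step, not the exact code behind the screenshot. The working directory path and the file names train.csv and test.csv are assumptions, so those lines are left commented out, and a small made-up data frame stands in for the Train data so the snippet runs on its own. The column names Survived and Sex follow the Kaggle data.

```r
# In the real workflow (path and file names are assumptions):
# setwd("path/to/titanic")
# train <- read.csv("train.csv", stringsAsFactors = FALSE)
# test  <- read.csv("test.csv",  stringsAsFactors = FALSE)

# Tiny made-up stand-in for the Train data (not real passengers)
train <- data.frame(
  Survived = c(0, 1, 1, 0, 0, 1),
  Sex      = c("male", "female", "female", "male", "female", "male"),
  stringsAsFactors = FALSE
)

# Counts of deceased (0) and survived (1) for each sex
sex_counts <- table(train$Sex, train$Survived)
print(sex_counts)

# Row-wise proportions: share of each sex that died or survived
sex_props <- prop.table(sex_counts, margin = 1)
print(sex_props)
```

With the real Train data loaded, the same two calls give the male/female survival breakdown.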
Then we write code to compare survivors and deceased between male and female passengers within each passenger class. The screenshot looks like this.
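One way to get this breakdown is a three-way table, sketched here on made-up rows. The column name Pclass is the passenger class column in the Kaggle data. The use of ftable() and aggregate() is my choice for illustration and may differ from the code behind the screenshot.

```r
# Tiny made-up stand-in for the Train data (not real passengers)
train <- data.frame(
  Survived = c(0, 1, 1, 0, 1, 0, 1, 0),
  Sex      = c("male", "female", "female", "male",
               "female", "male", "male", "female"),
  Pclass   = c(3, 1, 2, 3, 1, 2, 1, 3),
  stringsAsFactors = FALSE
)

# Three-way count: passenger class x sex x survived (0 = deceased, 1 = survived)
class_counts <- table(train$Pclass, train$Sex, train$Survived,
                      dnn = c("Pclass", "Sex", "Survived"))
print(ftable(class_counts))  # flattened view that is easier to read

# Survival rate for every class/sex combination present in the data
rates <- aggregate(Survived ~ Pclass + Sex, data = train, FUN = mean)
print(rates)
```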
Next we clean the Train data: we remove the variables not used in the model, replace the gender variable (male/female) with a dummy variable (0/1), and infer the missing age values. Then we create new variables to strengthen the model, such as child, family and mother, using a for loop. Now we have to clean the Test data as well.
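The cleaning steps above can be sketched as follows on a few made-up rows. Several details are assumptions for illustration, since the post does not spell them out: female is coded as 1 and male as 0, missing ages are filled with the median age, a child is anyone under 18, family is SibSp + Parch + 1, and a mother is a woman over 18 with Parch greater than 0.

```r
# Tiny made-up stand-in for the Train data (not real passengers)
train <- data.frame(
  PassengerId = 1:5,
  Survived    = c(0, 1, 1, 0, 1),
  Pclass      = c(3, 1, 3, 2, 1),
  Name        = c("A", "B", "C", "D", "E"),
  Sex         = c("male", "female", "female", "male", "female"),
  Age         = c(22, 38, NA, 4, 30),
  SibSp       = c(1, 1, 0, 3, 0),
  Parch       = c(0, 0, 0, 1, 2),
  Ticket      = c("t1", "t2", "t3", "t4", "t5"),
  stringsAsFactors = FALSE
)

# Keep only the variables used later (this selection is an assumption)
train <- train[, c("PassengerId", "Survived", "Pclass", "Sex",
                   "Age", "SibSp", "Parch")]

# Dummy variable for gender: female = 1, male = 0 (coding is an assumption)
train$Sex <- ifelse(train$Sex == "female", 1, 0)

# Fill missing ages with the median age (one simple choice among many)
train$Age[is.na(train$Age)] <- median(train$Age, na.rm = TRUE)

# New variables built with a for loop
train$Child  <- 0
train$Family <- 0
train$Mother <- 0
for (i in seq_len(nrow(train))) {
  if (train$Age[i] < 18) train$Child[i] <- 1
  train$Family[i] <- train$SibSp[i] + train$Parch[i] + 1
  if (train$Sex[i] == 1 && train$Age[i] > 18 && train$Parch[i] > 0) {
    train$Mother[i] <- 1
  }
}
print(train)
```

The Test data needs the same columns and the same coding, otherwise the model cannot be applied to it.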
After coding the model we fit the logistic regression model on the Train data. The fitted model then calculates survival predictions for the observations in the Test data set.
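A rough sketch of fitting and predicting with glm(). The training rows here are simulated, not the Titanic data, and the formula, the choice of predictors and the 0.5 cutoff are assumptions made for the example.

```r
set.seed(42)

# Simulated stand-in for the cleaned Train data (made up, not real passengers)
n <- 200
train <- data.frame(
  Sex    = rbinom(n, 1, 0.4),              # 1 = female, 0 = male (assumed coding)
  Pclass = sample(1:3, n, replace = TRUE),
  Age    = round(runif(n, 1, 70))
)
# Made-up outcome that loosely favors women and higher classes
p <- plogis(-0.5 + 2 * train$Sex - 0.5 * (train$Pclass - 2))
train$Survived <- rbinom(n, 1, p)

# Made-up stand-in for the cleaned Test data (no Survived column)
test <- data.frame(
  PassengerId = 892:896,
  Sex         = c(0, 1, 1, 0, 1),
  Pclass      = c(3, 1, 2, 1, 3),
  Age         = c(30, 25, 40, 50, 8)
)

# Fit the logistic regression model on the Train data only
model <- glm(Survived ~ Sex + Pclass + Age,
             data = train, family = binomial(link = "logit"))
print(summary(model))

# Predicted survival probabilities for the Test rows,
# turned into 0/1 with a 0.5 cutoff (the cutoff is an assumption)
prob <- predict(model, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)
print(data.frame(PassengerId = test$PassengerId, Survived = pred))
```

Note that type = "response" is what makes predict() return probabilities; without it the values are on the log-odds scale.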
The prediction output has two columns (passenger, survived). The passenger column holds the passenger number and the survived column holds 0 or 1: 0 for deceased, 1 for survived.
We now write the predictions to a CSV file that can be submitted to Kaggle.
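Writing the file could look like this. The predictions below are made-up values and the file name submission.csv is an assumption. The column names PassengerId and Survived are the ones Kaggle uses for this competition.

```r
# Made-up predictions standing in for the model output
submission <- data.frame(
  PassengerId = 892:896,
  Survived    = c(0, 1, 1, 0, 1)   # 0 = deceased, 1 = survived
)

# row.names = FALSE keeps R from adding an extra index column
write.csv(submission, file = "submission.csv", row.names = FALSE)

# Read the file back to check what was written
check <- read.csv("submission.csv")
print(check)
```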
After the submission, Kaggle gives a rank and a score. This is the screenshot of my first submission.