Wednesday, 26 August 2015

STATISTICAL ANOMALY

ANOMALY :

  Something that deviates from what is normal, standard or expected.

A statistical anomaly occurs when something falls outside the normal range for a group, but not as a result of belonging to that group.
In web analytics, anomaly detection plays a crucial role for site owners. For example, here is a picture of a page views chart generated by t.onthe.io. At 15:15 something happened, and the site recorded no page views. That is an anomaly, so the developers should find the problem and fix it.
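A check like this can be automated. Here is a minimal sketch in Python. It is not the method t.onthe.io uses: it simply flags any minute whose page view count is far below the median of the series. The function name find_drops, the threshold and the sample counts are all made up for illustration.

```python
from statistics import median


def find_drops(counts, fraction=0.1):
    """Return the positions whose count is below `fraction` of the median count."""
    typical = median(counts)
    return [i for i, c in enumerate(counts) if c < fraction * typical]


# Hypothetical page views per minute; the zero stands for the outage.
page_views = [120, 115, 130, 0, 125, 118]
print(find_drops(page_views))
```

A real monitor would also have to allow for quiet hours, when low traffic is normal and not an anomaly.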
The best example of a statistical anomaly is a person who survives a plane crash. Suppose a plane crashes somewhere in the jungle. According to the number of tickets sold, there were 30 passengers aboard, and none of the passengers survived. During the investigation of the crash, it turns out that one of the passengers arrived late for departure and never boarded that flight.

So this person falls outside the normal range for the group of passengers, but not as a result of being a member of that group.

This lucky person is an example of a statistical anomaly. A statistical anomaly is sometimes compared to a guardian angel, as it saves people's lives.

Tuesday, 18 August 2015

SELLING A PEN

In the movie "The Wolf of Wall Street", Jordan Belfort (Leonardo DiCaprio) demonstrates how to sell a pen. We were all impressed by it (right?).
How?
Sell the need first, then sell the pen.
What if the person doesn't have the need, or pretends not to have it?
If we come across the request "Sell me this pen" in an interview,
this is one of several ways we can move forward.


For easy flow of lines assume I am being interviewed.
.......................................................................................

Interviewer: Do me a favor, sell me this pen. (reaches across to hand me the pen)

Me: (I slowly roll the pen between my thumb and index finger.) When was the last time you used a pen?

Interviewer: This morning.

Me: Do you remember what kind of pen that was?

Interviewer: No.

Me: Do you remember why you were using it to write?

Interviewer: Yes. Signing a few new customer contracts.

Me: Well I’d say that’s the best use for a pen (we have a subtle laugh).

Wouldn’t you say signing those new customer contracts is an important event for the business? (nods head) Then shouldn’t it be treated like one? What I mean by that is, here you are signing new customer contracts, an important and memorable event, all while using a very unmemorable pen.

We grew up using cheap BIC pens our entire lives, because they get the job done for grocery lists and directions. But we never gave much thought to what’s best for more important events.

This is the pen for more important events. This is the tool you use to get deals done. Think of it as a symbol for taking your company to the next level. Because when you begin using the right tool, you are in a more productive state of mind, and you begin to sign more new customer contracts.

Actually. You know what? Just this week I shipped ten new boxes of these pens to Elon Musk’s office.

Unfortunately, this is my last pen today (I reach across to hand the pen back to the interviewer). So, I suggest you get this one. Try it out. If you’re not happy with it, I will personally come back next week to pick it up. And it won’t cost you a dime.
What do you say?

Interviewer: Yes.

...............................................................................................................................

Now assume that the interviewer is not as generous as above.
How do we proceed?

.................................................................................................................................


ME: I was just curious, which pen do you currently use?

Interviewer: Why do you care?


ME: I work for a company that produces pens optimized for better handwriting and faster writing. If you are not already using our pen, I would like to show it to you.

Interviewer: I use "X brand", I am happy with it, I don't need your pen, thanks.


ME: Sure, X brand is good too, but did you know your handwriting could improve twice as much simply by switching to our pen? Here is my pen and here is some paper. Why don't you try writing something and see the magic yourself?

Interviewer: No really, I don't need another pen.


ME: That's no problem, you don't need to buy. If you like it, maybe you will tell your friends, or gift it to a child you know who has bad handwriting. Bad handwriting affects grades in school, you know, so you would be doing someone a favor just by recommending it.

ME: Just one sentence. This is magic, just see the improvement in your own handwriting.

Interviewer (writes): Ok, not bad, my writing does look cleaner


ME: Yes, and you know what, the more you write with this, the better your handwriting will be

Interviewer: Cool, let me take one of these, I can gift it to my nephew.


ME: Great idea. Since this is my first meeting with you, let me give you a discount if you buy for your friends and family too: buy 5 or more and you get a straight 10% off, buy 10 and you get 20% off, just this time.

Interviewer: Ah, I don't know. Ok, I'll buy 5, but give me a 15% discount.


ME: Wish I could, but that's beyond my authority. Fine, if you can buy 7 you get 15%. Please say yes and seal the deal ;-)

Interviewer: Haha, alright, deal.


ME: Ah, by the way, this pen comes with special ink too, if you buy one bottle...




If we can sell a pen, we can sell anything (I guess) :)






Tuesday, 11 August 2015

CORRELATION AND CAUSATION

Correlation can also be called association. When a correlation is found between two variables, it is often presumed that one causes the other. But that is not always the case. In statistics there are many more variables that should be taken into account.
For example, take the following study:
   Children who eat breakfast have higher grades, so eating breakfast causes higher grades.
There may be a third factor causing both higher grades and eating breakfast, such as more responsible and loving parents who help their children with homework and also make sure they eat breakfast. Causation can be inferred reliably only from a randomised experiment.
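The breakfast example can be imitated with a small simulation. This is only a sketch with invented numbers: a hidden "attentive parents" factor raises both the chance of eating breakfast and the grades, while breakfast itself has no effect on grades at all. The function names, probabilities and grade levels are assumptions made for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=5000, seed=0):
    """Return (attentive, breakfast, grades) lists for n imaginary children."""
    rng = random.Random(seed)
    attentive, breakfast, grades = [], [], []
    for _ in range(n):
        a = rng.random() < 0.5  # the hidden third factor
        # Attentive parents make breakfast more likely ...
        b = 1 if rng.random() < (0.9 if a else 0.3) else 0
        # ... and raise grades. Breakfast is not used here on purpose.
        g = rng.gauss(75 if a else 60, 8)
        attentive.append(a)
        breakfast.append(b)
        grades.append(g)
    return attentive, breakfast, grades


attentive, breakfast, grades = simulate()
print("all children:", round(pearson(breakfast, grades), 2))

# Looking only at children with attentive parents removes the third factor.
b_sub = [b for a, b in zip(attentive, breakfast) if a]
g_sub = [g for a, g in zip(attentive, grades) if a]
print("attentive parents only:", round(pearson(b_sub, g_sub), 2))
```

Across all children, breakfast and grades are clearly correlated. Within one group of parents the correlation should be close to zero, because in this simulation breakfast never caused anything.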
The video below gives a complete explanation of correlation and causation, and of the other things that should be taken into consideration before jumping to conclusions.

Monday, 10 August 2015

Osama Bin Laden

We have all played cricket. Who is the world's greatest batsman? Sachin Tendulkar, Don Bradman... we can name a few, right? We have all played hide and seek. Who is the greatest hide and seek player the world has ever seen? OSAMA BIN LADEN. Born into one of the richest families in Saudi Arabia and having a lot of political connections, he is said to have completed a major part of his higher education in the U.S. Putting aside all the bad things he did to this world, let us focus on his capabilities.
*  He reportedly worked with the C.I.A. early in his career and understood how the C.I.A., and in fact any intelligence agency, operates, so he evaded the manhunt for years - "Analytical skills".
*  He reportedly taught himself nuclear science, one of the most difficult subjects on the planet - "Intelligence and Concentration".
*  He hid in Afghanistan while his men worked and died for him all over the globe - "Leadership and Motivational skills".
Had he used his skills the other way around, the world would have been a better place to live.

P.S. - This was the topic given to me for an extempore speech, and the above is the way I presented it.


DATA STORYTELLING - PILL INSIDE PEANUT BUTTER

Hello all, I am Yashwanth. In my childhood I had a puppy, Tommy. When he was two years old he fell sick, so I took him to the vet. The vet examined him, gave me pills and told me to give one each day for three days. I went home, and it was time to give Tommy his medicine. I called, "Tommy, come... open your mouth", placed the pill on his tongue and asked him to swallow. You know what Tommy did? He spat it out. I called again: "Tommy, this time I'm serious. Don't spit, swallow." Again he spat it out. Not knowing what to do, I called my friend: "Dude, my dog is not taking the pill, what should I do?" He asked me, "Do you have peanut butter?" "Yes", I said.
"Hide the pill in the peanut butter", he said. So I took a jar of peanut butter and hid the pill inside it. I called, "Tommy, peanut butter!" He came running, and you know what happens when dogs see peanut butter. It was the same with my dog; there was no need to clean the jar. The point is that the pill went down: my puppy ate the pill. In this metaphor, the pill is the good medicine that we have to communicate to our audience, to our customers, to our employees, to our stakeholders. That's our idea, that's our concept, that's our information. But would you agree with me that sometimes the pill is a little hard for our audience to swallow? Sometimes it is boring, sometimes it is technical and sometimes it is literally bad news. Yet somehow we need to communicate this information in a way that they are able to swallow. It is not easy. So what is the peanut butter in this metaphor? "STORY". And this, my friends, is data storytelling.

Sunday, 9 August 2015

AIRLINE ROUTE HISTOGRAM

Let's step into the real world and solve a problem with an airline data set. For this we need two data sets, airport data and route data, which can be downloaded from the OpenFlights Data Page. Both files are in .dat format. In addition, we need the geo_distance() function, which will be used later and can be downloaded from here. Now we have the airport data, the route data and the geo_distance Python programme in the working directory. This should get us started.
Look at the airport data in a text editor to get an overview. First we import the data into the programme and print the name of every airport. As we observed in the data, the second column of every row is the airport name, so we use index 1 (index 0 is the first field).
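The original code was shown as a screenshot, so the following is only a sketch of what this step could look like. It assumes the airport file is comma-separated with quoted text fields and that the name sits at index 1. The function name airport_names and the sample rows are made up, with rounded coordinates.

```python
import csv


def airport_names(lines):
    """Yield the airport name (index 1) from each row of airport data."""
    for row in csv.reader(lines):
        yield row[1]


# Hand-typed rows in the assumed layout: id, name, city, country, codes, lat, lon.
sample = [
    '1,"Goroka Airport","Goroka","Papua New Guinea","GKA","AYGA",-6.08,145.39',
    '2,"Sydney Airport","Sydney","Australia","SYD","YSSY",-33.95,151.18',
]
for name in airport_names(sample):
    print(name)

# With the real file, pass the open file instead of the sample list:
# with open("airports.dat", encoding="utf-8", newline="") as f:
#     for name in airport_names(f):
#         print(name)
```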
We then decided to print airports only for certain countries, such as Australia and Russia. Here an if condition checks whether index 3 of each row is Australia or Russia and, if so, prints index 1 of that row, which is the airport name.
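A sketch of that filter, under the same assumptions about the file layout (name at index 1, country at index 3). The function name airports_in and the sample rows are invented for illustration.

```python
import csv


def airports_in(lines, countries):
    """Return airport names (index 1) for rows whose country (index 3) matches."""
    return [row[1] for row in csv.reader(lines) if row[3] in countries]


# Hand-typed rows in the assumed layout: id, name, city, country, ...
sample = [
    '1,"Sydney Airport","Sydney","Australia","SYD","YSSY",-33.95,151.18',
    '2,"Sheremetyevo Airport","Moscow","Russia","SVO","UUEE",55.97,37.41',
    '3,"Orly Airport","Paris","France","ORY","LFPO",48.72,2.38',
]
for name in airports_in(sample, {"Australia", "Russia"}):
    print(name)
```

Using a set of country names keeps the if condition short when more countries are added later.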

We now calculate how far each route travels and draw a histogram showing the distribution of distances flown. First we store the latitudes and longitudes in dictionaries.
Now we import the geo_distance function we downloaded into our programme and compute the distance of each route, collecting the distances in a list.
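The downloaded geo_distance code is not shown in this post, so the sketch below uses its own stand-in, distance_km, based on the haversine formula. It also assumes that latitude and longitude sit at indices 6 and 7 of the airport data and that the source and destination airport ids sit at indices 3 and 5 of the route data. All names and sample rows are made up for illustration.

```python
import csv
from math import asin, cos, radians, sin, sqrt


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (haversine); a stand-in for geo_distance."""
    phi1, phi2 = radians(lat1), radians(lat2)
    a = (sin((phi2 - phi1) / 2) ** 2
         + cos(phi1) * cos(phi2) * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(min(1.0, sqrt(a)))


def route_distances(airport_lines, route_lines):
    """Return a list with the distance of every route whose airports are known."""
    latitudes, longitudes = {}, {}
    for row in csv.reader(airport_lines):
        latitudes[row[0]] = float(row[6])
        longitudes[row[0]] = float(row[7])
    distances = []
    for row in csv.reader(route_lines):
        src, dst = row[3], row[5]
        # Skip routes that refer to an airport id missing from the airport data.
        if src in latitudes and dst in latitudes:
            distances.append(distance_km(latitudes[src], longitudes[src],
                                         latitudes[dst], longitudes[dst]))
    return distances


# Two imaginary airports on the equator, a quarter of the globe apart.
airports = [
    '1,"Airport A","City A","Country A","AAA","AAAA",0.0,0.0',
    '2,"Airport B","City B","Country B","BBB","BBBB",0.0,90.0',
]
routes = [
    'XX,1,AAA,1,BBB,2,,0,XXX',
    r'XX,1,AAA,1,CCC,\N,,0,XXX',  # unknown destination, skipped
]
distances = route_distances(airports, routes)
print(distances)
```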
Finally, we create a histogram displaying the frequency of flights by distance.
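The post does not show how the chart was drawn, so here is a plain-text stand-in for this step. It groups distances into bins of a fixed width and prints one row of # marks per bin. The function name, the bin width and the sample distances are made up.

```python
from collections import Counter


def histogram(distances, bin_width=1000):
    """Count how many distances fall into each bin of `bin_width` km."""
    bins = Counter(int(d // bin_width) * bin_width for d in distances)
    return dict(sorted(bins.items()))


# Made-up route distances in km.
distances = [350.0, 820.5, 1500.0, 1999.9, 7200.0]
bin_width = 1000
for start, count in histogram(distances, bin_width).items():
    print(f"{start:>6} - {start + bin_width:<6} {'#' * count}")
```

With matplotlib installed, the same list of distances could be drawn as a real chart with its hist function.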



PYTHON PROGRAMMING - 2 (RADISH SURVEY CHARTS)

Data visualisation is very helpful for understanding data clearly. For example, we create a chart for the values (3,2,5,0,1).
We already know the winner, that is, which radish variety got the most votes. Now we depict that in a chart. We use matplotlib to generate a bar graph displaying the vote counts from the radish variety program.
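A sketch of such a bar graph for the example values (3,2,5,0,1). The labels A to E and both function names are placeholders. matplotlib is imported inside plot_votes, so the rest of the code runs even where matplotlib is not installed.

```python
def bar_data(counts):
    """Split a {label: votes} dict into parallel lists of labels and heights."""
    return list(counts), list(counts.values())


def plot_votes(counts):
    """Draw a bar chart of the vote counts; needs matplotlib to be installed."""
    import matplotlib.pyplot as plt

    labels, heights = bar_data(counts)
    positions = list(range(len(labels)))
    plt.bar(positions, heights)
    plt.xticks(positions, labels)
    plt.ylabel("Votes")
    plt.show()


# The example values from the text, with placeholder labels.
counts = dict(zip("ABCDE", (3, 2, 5, 0, 1)))
print(bar_data(counts))

# Uncomment to open the chart window:
# plot_votes(counts)
```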

Saturday, 8 August 2015

PYTHON PROGRAMMING - 1 (RADISH SURVEY)

We have 300 lines of survey data in the file radishsurvey.txt. Each line consists of a name, a hyphen and then a radish variety. For example, Evie Pulsford is a name and April Cross is a radish variety.
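A line in this format can be split into its two parts as sketched below. The sketch assumes the separator is a hyphen with one space on each side. The function name parse_line is made up.

```python
def parse_line(line):
    """Split 'Name - Variety' into a (name, variety) pair."""
    name, variety = line.strip().split(" - ")
    return name, variety


print(parse_line("Evie Pulsford - April Cross\n"))
```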


Now here comes the business perspective of analysing the data. We decided to find out:
* Did anyone vote twice?
* What is the least popular radish variety?
* What is the most popular radish variety?
We do it in Python, using the Spyder IDE (Integrated Development Environment).

We save the file radishsurvey.txt in the Documents -> Python Scripts folder.
We open a new file in Spyder and name it radishsurvey.py (.py is the extension for Python files).
We will write code to find who voted for which radish variety, using a for loop.
The output appears in the right-hand corner. It is just the same data rewritten in a more readable form. Next we find out who voted for "White Icicle" radishes.
It gives lines like "Amy Clunie likes White Icicle". The next step is counting votes.
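The screenshots of this code are not available, so this is a sketch of what the for loop with the "White Icicle" check could look like. The function name fans_of and the sample lines are made up, apart from the Amy Clunie and Evie Pulsford examples already mentioned above.

```python
def fans_of(lines, wanted):
    """Return the names of everyone who voted for the variety `wanted`."""
    fans = []
    for line in lines:
        name, variety = line.strip().split(" - ")
        if variety == wanted:
            fans.append(name)
    return fans


sample = [
    "Amy Clunie - White Icicle",
    "Evie Pulsford - April Cross",
    "Jo Example - White Icicle",  # invented voter
]
for name in fans_of(sample, "White Icicle"):
    print(name, "likes White Icicle")

# With the real file, pass the open file instead of the sample list:
# with open("radishsurvey.txt") as f:
#     print(fans_of(f, "White Icicle"))
```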
We counted the votes for White Icicle: 59 votes. To count votes for other varieties, there is no need to write the code again each time. A generic function can be defined and called wherever necessary. We defined the count_votes function and used it to find the votes for 2 other varieties.
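The body of count_votes is not shown in the post. A possible version, assuming it takes the survey lines and a variety name, could look like this. The sample lines are invented.

```python
def count_votes(lines, radish):
    """Count how many lines vote for the variety `radish`."""
    count = 0
    for line in lines:
        name, variety = line.strip().split(" - ")
        if variety == radish:
            count += 1
    return count


sample = [
    "Amy Clunie - White Icicle",
    "Evie Pulsford - April Cross",
    "Jo Example - White Icicle",  # invented voter
]
print("White Icicle:", count_votes(sample, "White Icicle"))
print("April Cross:", count_votes(sample, "April Cross"))
```

When the lines come from a file, it is easiest to read them into a list first, because a file object can only be looped over once.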
Next, we count all the votes.
Yes, we have counted all the votes, but the output looks clumsy. So we make it easy for people to read. Programmers call this "pretty printing". Instead of print(counts) in the above screenshot, we use a for loop. It gives the name of each variety and the number of votes it gained.
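A sketch of counting every variety in one pass with a dictionary and then printing one variety per line. The function name count_all and the sample lines are made up.

```python
def count_all(lines):
    """Return a {variety: votes} dictionary for all survey lines."""
    counts = {}
    for line in lines:
        name, variety = line.strip().split(" - ")
        counts[variety] = counts.get(variety, 0) + 1
    return counts


sample = [
    "Amy Clunie - White Icicle",
    "Evie Pulsford - April Cross",
    "Jo Example - White Icicle",  # invented voter
]
counts = count_all(sample)

# print(counts) would show the raw dictionary; the loop is easier to read.
for variety, votes in counts.items():
    print(f"{variety}: {votes}")
```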
Now we notice some weird stuff in our vote count. There is "red king" and there is another "Red king". To a computer, "red king" and "Red king" look different because of the different capitalisation. We need to clean up (sometimes called "munge") the data so it all looks the same.
There are also some double spaces between first and last names, so we clean up the data again. Then we check whether anyone voted twice.
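A sketch of the clean-up and the duplicate check. It collapses repeated spaces, trims the ends and makes the capitalisation uniform, then ignores the second vote of anyone who appears twice. Ignoring the repeat vote is a choice made here for illustration, and the function names and sample lines are invented.

```python
def clean(text):
    """Collapse repeated spaces, trim the ends and unify capitalisation."""
    return " ".join(text.split()).capitalize()


def tally(lines):
    """Return ({variety: votes}, [names that voted more than once])."""
    counts, seen, repeat_voters = {}, set(), []
    for line in lines:
        name, variety = (clean(part) for part in line.split(" - ", 1))
        if name in seen:
            repeat_voters.append(name)  # later votes by this person are ignored
            continue
        seen.add(name)
        counts[variety] = counts.get(variety, 0) + 1
    return counts, repeat_voters


sample = [
    "Amy Clunie - White Icicle",
    "amy  clunie - white icicle",  # same person, messy spacing and case
    "Evie Pulsford - red king",
    "Jo Example - Red King",       # invented voter
]
counts, repeat_voters = tally(sample)
print(counts)
print("Voted twice:", repeat_voters)
```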

We make the code easier to understand by breaking it down into functions and adding comments. For big programs it is essential to refactor, so that the code is easier to understand and reuse.
Now we come to the finale: finding the winner.
The winner is Champion (the name of a radish variety).
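Given a dictionary of vote counts, the winner is the key with the largest value. The counts below are made up for illustration and are not the real survey totals.

```python
def winner(counts):
    """Return the variety with the most votes."""
    return max(counts, key=counts.get)


def least_popular(counts):
    """Return the variety with the fewest votes."""
    return min(counts, key=counts.get)


# Invented counts, only to show the calls.
counts = {"Champion": 5, "White Icicle": 3, "April Cross": 2}
print("Winner:", winner(counts))
print("Least popular:", least_popular(counts))
```

If two varieties are tied, max and min return the one that appears first in the dictionary.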

Tuesday, 4 August 2015

FIRST KAGGLE COMPETITION

My first Kaggle competition is about the Titanic, one of the most famous shipwrecks in history. All of us have seen the film, so we cut to the problem directly. The Kaggle competition asks us to predict whether a passenger survived the sinking. We are given two data sets (Train and Test), each of which includes predictor variables such as Age, Passenger class, Sex, etc. With these two data sets we will do the following.
1. Create a model which predicts whether a passenger survived, using only the Train data set.
2. Predict whether the passengers in the Test data set survived, based on the model we created.

First we set the working directory in RStudio using the setwd() function. Then we read in the Test and Train data sets. We write code to find the "survival and deceased between male and female". The screenshot looks like this.

Then we write code to find the "survival and deceased between male and female according to cabin classes". The screenshot looks like this.
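The original analysis was done in R and its code is only visible in the missing screenshots. As a rough Python sketch of the same two tables, the function below counts passengers by group and outcome. It assumes the Train data has columns named Sex, Pclass and Survived. The function name and the sample rows are invented.

```python
from collections import Counter


def survival_table(rows, keys):
    """Count rows per combination of the `keys` columns plus the Survived column."""
    table = Counter()
    for row in rows:
        group = tuple(row[k] for k in keys)
        table[group + (row["Survived"],)] += 1
    return table


# Invented passengers; with the real file, csv.DictReader yields rows like these.
rows = [
    {"Pclass": "1", "Sex": "female", "Survived": "1"},
    {"Pclass": "3", "Sex": "male", "Survived": "0"},
    {"Pclass": "3", "Sex": "male", "Survived": "0"},
    {"Pclass": "3", "Sex": "female", "Survived": "1"},
    {"Pclass": "1", "Sex": "male", "Survived": "1"},
]
by_sex = survival_table(rows, ["Sex"])
by_class_and_sex = survival_table(rows, ["Pclass", "Sex"])
for group, count in sorted(by_sex.items()):
    print(group, count)
for group, count in sorted(by_class_and_sex.items()):
    print(group, count)
```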
Next we remove the variables not used for the model, replace the gender variable (male/female) with a dummy variable (0/1), and make inferences on missing age values. Then we write code, using a for loop, to create new variables that strengthen our model, such as child, family and mother. Now we have to clean the Test data as well.
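A Python sketch of this kind of preparation, standing in for the R code that is not shown. Several choices here are assumptions: male is coded 0 and female 1, a missing age is filled with the median of the known ages, and a child is anyone under 18. The column names PassengerId, Sex and Age, the function name and the sample rows are also assumed.

```python
from statistics import median


def prepare(rows):
    """Keep the columns used here, encode Sex as 0/1, fill ages, add Child."""
    known_ages = [float(r["Age"]) for r in rows if r["Age"] != ""]
    fill_age = median(known_ages)  # assumption: median fills the gaps
    prepared = []
    for r in rows:
        age = float(r["Age"]) if r["Age"] != "" else fill_age
        prepared.append({
            "PassengerId": r["PassengerId"],
            "Sex": 1 if r["Sex"] == "female" else 0,  # assumption: female = 1
            "Age": age,
            "Child": 1 if age < 18 else 0,  # assumption: child means under 18
        })
    return prepared


# Invented passengers; the second one has no age recorded.
rows = [
    {"PassengerId": "1", "Sex": "male", "Age": "22"},
    {"PassengerId": "2", "Sex": "female", "Age": ""},
    {"PassengerId": "3", "Sex": "female", "Age": "38"},
    {"PassengerId": "4", "Sex": "male", "Age": "4"},
]
prepared = prepare(rows)
for row in prepared:
    print(row)
```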
After coding the model we "fit the logistic regression model". Our model then calculates survival predictions for the observations in the Test data set.
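To show what fitting a logistic regression means, here is a small from-scratch sketch in Python using plain gradient descent. It is not the R code used for the submission, and a real project would use a statistics library instead. The function names, the learning rate and the tiny data set are invented.

```python
from math import exp


def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1 / (1 + exp(-z))
    e = exp(z)
    return e / (1 + e)


def probability(weights, x):
    """Predicted survival probability; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit the weights by gradient descent on the average log loss."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, target in zip(X, y):
            error = probability(weights, x) - target
            grad[0] += error
            for j, v in enumerate(x):
                grad[j + 1] += error * v
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad)]
    return weights


def predict(weights, x):
    """Return 1 (survived) if the predicted probability is at least 0.5."""
    return 1 if probability(weights, x) >= 0.5 else 0


# Invented training data with one predictor: the sex dummy (1 = female).
X_train = [[1], [1], [1], [0], [0], [0]]
y_train = [1, 1, 0, 0, 0, 1]
weights = fit_logistic(X_train, y_train)
print(predict(weights, [1]), predict(weights, [0]))
```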
The prediction output has two columns (passenger, survived). The passenger column holds the passenger number and the survived column holds 0 or 1: 0 - deceased, 1 - survived.

We now output the data to a csv file which can be submitted to Kaggle.
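Writing the two columns to a csv file could be sketched like this in Python. The header names PassengerId and Survived are the ones I understand Kaggle expects for this competition, but treat them as an assumption and check the sample submission file. The function name and the passenger numbers are made up.

```python
import csv
import io


def write_submission(predictions, out):
    """Write (passenger number, 0 or 1) pairs as csv to the open file `out`."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, survived in predictions:
        writer.writerow([passenger_id, survived])


# Invented predictions, written to memory so the result can be inspected.
buffer = io.StringIO()
write_submission([(892, 0), (893, 1)], buffer)
print(buffer.getvalue())

# To create a real file instead:
# with open("submission.csv", "w", newline="") as f:
#     write_submission(predictions, f)
```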
After the submission, Kaggle gives a rank and a score. This is the screenshot of my first submission.