Saturday 19 December 2015

Hadoop with Hive - Baseball Problem

What is Hive?

Hive is a component of the Hortonworks Data Platform (HDP). It provides a SQL-like interface to data stored in HDP, effectively a database query layer on top of Apache Hadoop. People often ask why Pig and Hive both exist when they seem to do much the same thing. Because of its SQL-like query language, Hive is often used as the interface to an Apache Hadoop based data warehouse, and it feels friendlier and more familiar to users who are used to querying data with SQL. Pig fits in through its data-flow strengths, taking on the tasks of bringing data into Apache Hadoop and working it into the form needed for querying. A good overview of how this works is Alan Gates's post on the Yahoo Developer blog titled "Pig and Hive at Yahoo!". From a technical point of view both Pig and Hive are feature complete, so you can do any task in either tool; in practice, though, different groups using Apache Hadoop tend to prefer one tool or the other. The good part is that they have a choice, and the two tools work together.

Task

We are going to find the player with the highest runs for each year. This file has all the statistics from 1871 to 2011 and contains more than 90,000 rows. Once we have the highest runs we will extend the script to translate a player id field into the first and last names of the players.
Downloading and uploading the files has already been done (refer to the previous program for the process).

After starting the Hadoop environment we can go to http://127.0.0.1:8000/, which displays a web page as follows. (Click on the images to enlarge.)


Then we can go to the file browser by clicking the icon in the header bar to see the files uploaded.

We can click on the second icon in the header bar to go to the Hive UI. We are directed to the query editor, where we can write, edit, and execute code. (Note: in the Hive query editor we should execute one query at a time; for the next query, erase the previous one and write the new query in its place, not below it.)

STEP 1: 

Create a table called temp_batting, which holds each line of the CSV file as a single string:
create table temp_batting (col_value STRING);


After writing the query we execute it by clicking Execute in the footer. To check whether the table has been created, click on Tables in the second header bar.
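Alternatively, the same check can be done from the query editor itself with two standard Hive statements (run one at a time, as noted above):

SHOW TABLES;
DESCRIBE temp_batting;

DESCRIBE should list a single column, col_value, of type string.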


To view the table, select it and click on the view icon; as it is an empty table we will not find any values yet.



STEP 2:

We load the data into the Hive environment with a LOAD DATA INPATH query:
LOAD DATA INPATH '/user/admin/Batting.csv' OVERWRITE INTO TABLE temp_batting;
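Note that LOAD DATA INPATH moves the file (rather than copying it) from its HDFS location into Hive's warehouse directory, so Batting.csv will no longer appear under /user/admin afterwards. A quick sanity check on the load, given the row count mentioned above, is:

SELECT COUNT(*) FROM temp_batting;

which should report a little over 90,000 rows, one per line of the file.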


STEP 3:

To check whether the data has loaded properly, we can use a SELECT command:
SELECT * FROM temp_batting LIMIT 100;


It gives the following results; each row appears as a single comma-separated string, since the table has only the one col_value column.

STEP 4:


We create another table, batting, with the columns player_id, year, and runs:
create table batting (player_id STRING, year INT, runs INT);




We can view the newly created table by clicking View, as we did previously.





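The query that populates batting from temp_batting is not shown above. A sketch of the standard approach from the Hortonworks tutorial this walkthrough follows, using regexp_extract to pull the first, second, and ninth comma-separated fields (player_id, year, and runs in this version of Batting.csv) out of each raw line:

insert overwrite table batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*),?){9}', 1) run
FROM temp_batting;

Because batting declares year and runs as INT, Hive casts the extracted strings on insert; the header row, whose fields are not numeric, ends up as NULLs in those columns.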
Running this query takes some time.





STEP 5:

Now we select the year and the maximum runs from the batting table, grouping by year:
SELECT year, max(runs) FROM batting GROUP BY year;


It gives the following results.


We can scroll down through the results and move to the next page at the end.

STEP 6:

GROUP BY alone cannot also return the player_id column (it is neither grouped nor aggregated), so to find which player scored the highest runs in each year we join batting back to the per-year maximums:
SELECT a.year, a.player_id, a.runs FROM batting a
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
ON (a.year = b.year AND a.runs = b.runs);
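As an aside, on Hive 0.11 and later the same answer can be obtained with a window function; this is an alternative formulation, not the query the tutorial uses:

SELECT year, player_id, runs
FROM (
  SELECT year, player_id, runs,
         MAX(runs) OVER (PARTITION BY year) AS mx
  FROM batting
) t
WHERE runs = mx;

Either way, a year with tied top scorers produces one row per tied player.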


The results of the join query will be


This marks the end of the program, with the desired results (pretty simple and interesting, I guess :) )


Saturday 5 December 2015

Hadoop with Pig script - Program

What is Pig?


Pig is a high-level scripting language that is used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete in that you can do all the required data manipulations in Apache Hadoop with Pig. In addition, through its User Defined Functions (UDF) facility, Pig can invoke code in many languages such as JRuby, Jython, and Java; conversely, you can execute Pig scripts from other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.

Pig scripts are translated into a series of MapReduce jobs that run on the Apache Hadoop cluster. As part of the translation, the Pig interpreter performs optimizations to speed up execution. We are going to write a Pig script that will do our data analysis task.


Task

We are going to read in a baseball statistics file and compute the highest runs by a player for each year. This file has all the statistics from 1871 to 2011 and contains over 90,000 rows. Once we have the highest runs we will extend the script to translate a player id field into the first and last names of the players.


We will download the file from the following link


Once you have the file, unzip it into a directory. We will be uploading just the Master.csv and Batting.csv files.

(Click on the images to enlarge.)



After we start the virtual machine we go to http://127.0.0.1:8000/, where we see a GUI like the image below.


As marked above, we click on the file browser at the top left and upload the CSV files using the Upload button in the right corner.


After uploading, we can see the files listed.

Now we move to the Pig script console by clicking the Pig icon in the title bar, which leads us to a console where we can write, edit, save, and execute Pig commands.



Script

1. We need to load the data first; for that we use a LOAD statement.
batting = load 'Batting.csv' using PigStorage(',');
2. To filter out the first row (the header) we add a FILTER statement; only data rows have a year value greater than 0 in field $1.
raw_runs = FILTER batting BY $1>0;
3. Now we name the fields. We use a FOREACH statement to iterate through the raw_runs data object, and GENERATE pulls out the selected fields and assigns them names (the Pig helper can provide a template if required). The new data object we are creating is named runs.
runs = FOREACH raw_runs GENERATE $0 as playerID, $1 as year, $8 as runs;
4. We use a GROUP statement to group the elements in runs by the year field.
grp_data = GROUP runs by (year);
5. We use another FOREACH statement to find the maximum runs for each year.
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
6. With the maximum runs in hand, we join this back to the runs data object so we can pick up the player ID.
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
7. The result will be a dataset with year, playerID, and runs.
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
8. At the end we DUMP the data to the output; the complete script, assembled from these steps, is shown below.
DUMP join_data;
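Putting the steps together, the complete script is:

batting = load 'Batting.csv' using PigStorage(',');
raw_runs = FILTER batting BY $1>0;
runs = FOREACH raw_runs GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
DUMP join_data;

As an aside, replacing DUMP with EXPLAIN join_data; prints the logical, physical, and MapReduce plans Pig translates the script into, illustrating the translation described in the introduction.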



Then we will save the program and execute it.
                                                        



We can check the progress of the initiated job in the job browser.

                                                     
                                   
After completion of the job we can see that it succeeded.
                                                   
                                     

In the Pig console's query history we can see the results and the log.
                                                 
                                   
We can see the results with the columns we named in the code: year, playerID, and runs.

                                     
Scroll the results to view all years.


We should always check the log to see if the script was executed correctly.




We can download the results as a text file for future reference.