What is Pig?
Pig is a high-level scripting language used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete in the sense that you can do all the required data manipulations in Apache Hadoop with Pig. In addition, through Pig's User Defined Functions (UDF) facility, you can have Pig invoke code written in languages such as JRuby, Jython and Java. Conversely, you can execute Pig scripts from other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.
Pig scripts are translated into a series of MapReduce jobs that run on the Apache Hadoop cluster. As part of the translation, the Pig interpreter performs optimizations to speed up execution on Apache Hadoop. We are going to write a Pig script that performs our data analysis task.
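To make the idea of a MapReduce translation more concrete, here is a minimal plain-Python sketch of how a "maximum runs per year" computation can be split into a map phase and a reduce phase. It is a conceptual illustration only: the function names map_phase, shuffle and reduce_phase and the sample records are made up for this example, and the jobs Pig actually generates are more elaborate.

```python
from collections import defaultdict

# Made-up sample records: (playerID, year, runs). Not real statistics.
records = [
    ("playerA", 1871, 30),
    ("playerB", 1871, 66),
    ("playerC", 1872, 94),
    ("playerD", 1872, 51),
]


def map_phase(rows):
    """Emit one (key, value) pair per input row: year -> runs."""
    for _player, year, runs in rows:
        yield year, runs


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's values to a single result, here the maximum."""
    return {year: max(values) for year, values in grouped.items()}


max_runs_by_year = reduce_phase(shuffle(map_phase(records)))
print(max_runs_by_year)
```

In Pig, the grouping and the MAX aggregation are written as two short statements, and the interpreter decides how to distribute the work across map and reduce tasks.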
Task
We are going to read in a baseball statistics file and compute the highest number of runs scored by a player in each year. The file has all the statistics from 1871–2011 and contains over 90,000 rows. Once we have the highest runs, we will extend the script to translate the player ID field into the first and last names of the players.
We will download the file from the following link
Once you have the file, unzip it into a directory. We will be uploading just the master.csv and batting.csv files. Make sure the uploaded file name matches the name used in the load statement, 'Batting.csv', since file names in HDFS are case-sensitive.
After we start the VirtualBox virtual machine, we open http://127.0.0.1:8000/ and see a GUI like the one in the image below.
As marked in the image, we click on the file browser in the top left and upload the CSV files using the upload button in the right corner.
After uploading, we can see the uploaded files.
Now we move to the Pig script console by clicking the Pig icon on the title bar. This leads to the Pig console, where we can write, edit, save and execute Pig commands.
Task
1. We need to load the data first. For that we use the LOAD statement.
batting = load 'Batting.csv' using PigStorage(',');
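2. To filter out the first row of the data, which holds the column names, we add a FILTER statement. The comparison $1>0 does not hold for the header row because its second field is not a number, so that row is dropped.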
raw_runs = FILTER batting BY $1>0;
3. Now we name the fields. We use a FOREACH statement to iterate over the filtered batting data; the Pig helper can provide a template if required. The FOREACH statement iterates through the raw_runs data object, and GENERATE pulls out selected fields and assigns them names. The new data object we are creating is named runs.
runs = FOREACH raw_runs GENERATE $0 as playerID, $1 as year, $8 as runs;
4. We use the GROUP statement to group the elements in runs by the year field.
grp_data = GROUP runs by (year);
5. We use a FOREACH statement to find the maximum runs for each year.
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
6. Now that we have the maximum runs for each year, we join this with the runs data object so we can pick up the player.
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
7. The result will be a dataset with "Year, PlayerID and Max Run".
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
8. At the end we DUMP the data to the output.
DUMP join_data;
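To check the logic of the eight steps on a small sample without a cluster, the same data flow can be sketched in plain Python. This is an analogue for illustration, not Pig code: the sample rows are made up, the helper names to_int and top_scorers are hypothetical, and only the column positions (0 for the player ID, 1 for the year, 8 for the runs) are taken from the script.

```python
import csv
import io
from collections import defaultdict

# Made-up sample, not real statistics. Only the column positions matter:
# column 0 = playerID, column 1 = year, column 8 = runs.
SAMPLE = """playerID,yearID,c2,c3,c4,c5,c6,c7,R
playerA,1871,x,x,x,x,x,x,30
playerB,1871,x,x,x,x,x,x,66
playerC,1872,x,x,x,x,x,x,94
playerD,1872,x,x,x,x,x,x,94
"""


def to_int(text):
    """Return int(text), or None when text is not a whole number."""
    try:
        return int(text)
    except ValueError:
        return None


def top_scorers(csv_text):
    # Step 1 (LOAD): split each line on commas; skip rows that are too short.
    batting = [row for row in csv.reader(io.StringIO(csv_text)) if len(row) > 8]

    # Step 2 (FILTER): the header's year field is not a number, so it is dropped.
    raw_runs = [row for row in batting if (to_int(row[1]) or 0) > 0]

    # Step 3 (FOREACH ... GENERATE): keep and name three fields.
    runs = [(row[0], to_int(row[1]), to_int(row[8])) for row in raw_runs]

    # Steps 4 and 5 (GROUP, MAX): highest runs per year, ignoring missing values.
    by_year = defaultdict(list)
    for _player, year, scored in runs:
        if scored is not None:
            by_year[year].append(scored)
    max_runs = {year: max(values) for year, values in by_year.items()}

    # Steps 6 and 7 (JOIN, FOREACH): rows whose (year, runs) equals the yearly maximum.
    return sorted(
        (year, player, scored)
        for player, year, scored in runs
        if scored is not None and max_runs[year] == scored
    )


join_data = top_scorers(SAMPLE)
for row in join_data:  # Step 8 (DUMP)
    print(row)
```

Because the join matches on both year and runs, a year in which two players share the maximum yields two rows, as the two 1872 rows in the sample show.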
Then we save the program and execute it.
We can check the progress of the job in the job browser.
After the job completes, we can see that it succeeded.
We can see the results in the order specified in the code: "Year", "Player_id" and "Max_run".
Scroll the results to view all years.
We should always check the log to see if the script was executed correctly.
We can download the results as a txt file for further reference.
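The task description mentions translating the player ID into first and last names, a step the script above does not include. As a rough plain-Python sketch of that idea, the lookup amounts to an inner join on the player ID. Everything here is an assumption for illustration: the sample rows are made up, the function name add_names is hypothetical, and the master rows are assumed to be already reduced to (playerID, first name, last name), since the column positions in the master file are not given in the text.

```python
# Hypothetical rows in the (year, playerID, runs) shape of join_data.
join_data = [(1871, "playerB", 66), (1872, "playerC", 94), (1873, "playerZ", 80)]

# Hypothetical master rows already reduced to (playerID, first name, last name).
master = [
    ("playerB", "Bea", "Example"),
    ("playerC", "Cal", "Sample"),
]


def add_names(results, master_rows):
    """Inner join on playerID: rows without a matching master entry are dropped."""
    names = {player_id: (first, last) for player_id, first, last in master_rows}
    return [
        (year, *names[player_id], runs)
        for year, player_id, runs in results
        if player_id in names
    ]


named = add_names(join_data, master)
for row in named:
    print(row)
```

In a Pig script the same step would typically be a second load of the master file followed by a JOIN on the player ID field.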