Pig Recipes
Recipe 005
Goal
Create a list of attribute names from the second column of the input file.
Each name must appear only once in the list and must not be wrapped in double
quotes. Sort the output in ascending order.
Source File: /home/hduser/data/KeyValuePair.txt
Target File: /home/hduser/data/Recipe005.out/
Parameter Substitution - None
After execution, the output directory should look like this:
hduser> ls -la
total 12
drwxrwxr-x. 2 hduser hduser 85 Apr 23 23:57 .
drwxrwxr-x. 3 hduser hduser 59 Apr 23 23:57 ..
-rw-r--r--. 1 hduser hduser 53 Apr 23 23:57 part-r-00000
-rw-rw-r--. 1 hduser hduser 12 Apr 23 23:57 .part-r-00000.crc
-rw-r--r--. 1 hduser hduser  0 Apr 23 23:57 _SUCCESS
-rw-rw-r--. 1 hduser hduser  8 Apr 23 23:57 ._SUCCESS.crc
After execution, the output will contain:
hduser> cat *0
DVD_ReleaseDate
Released
Status
Studio
UPC
The following is the script file.
/*****************************************************************************/
/* Recipe005.pig                                                             */
/*                                                                           */
/* Purpose:                                                                  */
/* Create a list of unique attribute names from the 2nd column in the source */
/* file. The list of attribute names must not be contained in double         */
/* quotes.                                                                   */
/*                                                                           */
/* Parameter Substitution - None                                             */
/*                                                                           */
/* Pig Execution Mode: local                                                 */
/* Pig Batch Execution:                                                      */
/*   pig -x local Recipe005.pig                                              */
/*                                                                           */
/* The target directory must not exist prior to executing this script. Use   */
/* this command to safely delete the target directory:                       */
/*   rm -rf /home/hduser/data/Recipe005.out                                  */
/*                                                                           */
/*****************************************************************************/
/* Date     Initials Description                                             */
/* -------- -------- ------------------------------------------------------- */
/* 20160521 Reo      Initial.                                                */
/*****************************************************************************/

/*****************************************************************************/
/* The source file contains fields where all of the values are enclosed in   */
/* double quotes. In some cases, there are commas (,) within the double      */
/* quotes. If the Load is used with PigStorage (,), the data will be parsed  */
/* incorrectly. Therefore, the CSVExcelStorage method will be used to        */
/* ensure good parsing. CSVExcelStorage is in the PiggyBank, so it must be   */
/* REGISTERED.                                                               */
/*****************************************************************************/
REGISTER '/usr/local/pig/contrib/piggybank/java/piggybank.jar';

/*****************************************************************************/
/* Set up an alias to the java package.                                      */
/*****************************************************************************/
DEFINE CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage();

/*****************************************************************************/
/* Read in the data using a comma (,) as the delimiter.                      */
/*****************************************************************************/
DVDData = LOAD '/home/hduser/data/KeyValuePair.txt'
          USING CSVExcelStorage(',')
          AS (
              DVDName:chararray,
              AttributeName:chararray,
              AttributeValue:chararray
          );

B = ORDER (DISTINCT (FOREACH DVDData GENERATE AttributeName))
    BY AttributeName ASC;

/*****************************************************************************/
/* STORE the sorted list of unique attribute names.                          */
/*****************************************************************************/
STORE B INTO '/home/hduser/data/Recipe005.out';
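
The statement that builds B nests three operators in one line, which can make it awkward to inspect intermediate results. As a rough sketch, and not the recipe's own script, the same projection, de-duplication, and sort could be written one step per alias. The alias names Names, UniqueNames, and SortedNames are made up for illustration, and the loader is referenced by its full class name here rather than through the DEFINE alias.

```pig
-- Sketch only: the nested statement from Recipe005.pig, unrolled into steps.
-- Alias names below (Names, UniqueNames, SortedNames) are illustrative.
REGISTER '/usr/local/pig/contrib/piggybank/java/piggybank.jar';

DVDData = LOAD '/home/hduser/data/KeyValuePair.txt'
          USING org.apache.pig.piggybank.storage.CSVExcelStorage(',')
          AS (DVDName:chararray, AttributeName:chararray, AttributeValue:chararray);

-- Step 1: keep only the second column.
Names = FOREACH DVDData GENERATE AttributeName;

-- Step 2: remove duplicate attribute names.
UniqueNames = DISTINCT Names;

-- Step 3: sort the unique names in ascending order.
SortedNames = ORDER UniqueNames BY AttributeName ASC;

-- The target directory must not already exist, as in the original recipe.
STORE SortedNames INTO '/home/hduser/data/Recipe005.out';
```

With the steps separated, a DESCRIBE or DUMP of any intermediate alias (for example, DUMP UniqueNames;) can be added while debugging and removed again before the batch run.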