reosoftproductions.com
RODNEY AND ARLYN'S WEB SITE
Pig

Pig Recipes

Recipe 005

Goal

Create a list of attribute names from the second column of the input file. The list must contain each name only once, and the names must not be enclosed in double quotes. Sort the output in ascending order.
Source File: /home/hduser/data/KeyValuePair.txt
Target File: /home/hduser/data/Recipe005.out/
Parameter Substitution - None

After execution, the output directory should look like this:

hduser> ls -la
total 12
drwxrwxr-x. 2 hduser hduser 85 Apr 23 23:57 .
drwxrwxr-x. 3 hduser hduser 59 Apr 23 23:57 ..
-rw-r--r--. 1 hduser hduser 53 Apr 23 23:57 part-r-00000
-rw-rw-r--. 1 hduser hduser 12 Apr 23 23:57 .part-r-00000.crc
-rw-r--r--. 1 hduser hduser  0 Apr 23 23:57 _SUCCESS
-rw-rw-r--. 1 hduser hduser  8 Apr 23 23:57 ._SUCCESS.crc

After execution, the output will contain:

hduser> cat *0
DVD_ReleaseDate
Released
Status
Studio
UPC

The following is the script file.

/*****************************************************************************/
/* Recipe005.pig                                                             */
/*                                                                           */
/* Purpose:                                                                  */
/* Create a list of unique attribute names from the 2nd column in the source */
/* file.  The list of attributes names must not be contained in double       */
/* quotes.                                                                   */
/*                                                                           */
/* Parameter Substitution - None                                             */
/*                                                                           */
/* Pig Execution Mode:  local                                                */
/* Pig Batch Execution:                                                      */
/*   pig -x local Recipe005.pig                                              */
/*                                                                           */
/* The target directory must not exist prior to executing this script.  Use  */
/* this command to safely delete the target directory:                       */
/*   rm -rf /home/hduser/data/Recipe005.out                                  */
/*                                                                           */
/*****************************************************************************/
/* Date     Initials Description                                             */
/* -------- -------- ------------------------------------------------------- */
/* 20160521 Reo      Initial.                                                */
/*****************************************************************************/

/*****************************************************************************/
/* Every value in the source file is enclosed in double quotes, and some     */
/* values contain commas (,) inside the quotes.  Loading with                */
/* PigStorage(',') would split those values incorrectly.  Therefore, the     */
/* CSVExcelStorage load function is used to parse the file.  It ships in     */
/* piggybank.jar, so that jar must be REGISTERED.                            */
/*****************************************************************************/
REGISTER '/usr/local/pig/contrib/piggybank/java/piggybank.jar';

/*****************************************************************************/
/* Set up an alias for the fully qualified Java class name.                  */
/*****************************************************************************/
DEFINE CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage();

/*****************************************************************************/
/* Read in the data using a comma (,) as the delimiter.                      */
/*****************************************************************************/
DVDData = LOAD '/home/hduser/data/KeyValuePair.txt'
  USING CSVExcelStorage(',')
  AS
  (
    DVDName:chararray,
    AttributeName:chararray,
    AttributeValue:chararray
  );

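/*****************************************************************************/
/* Project the AttributeName column, remove duplicates, and sort the result  */
/* in ascending order.  CSVExcelStorage has already stripped the enclosing   */
/* double quotes during the LOAD.                                            */
/*****************************************************************************/
B = ORDER (DISTINCT (FOREACH DVDData GENERATE AttributeName))
       BY AttributeName ASC;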

/*****************************************************************************/
/* STORE the sorted, unique list of attribute names.                         */
/*****************************************************************************/
STORE B INTO '/home/hduser/data/Recipe005.out';
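
The nested ORDER / DISTINCT / FOREACH statement in the script can be hard to read at first. The following is a minimal sketch of the same pipeline written as one statement per step, assuming the same file paths and piggybank.jar location as the recipe. The alias names Names, UniqueNames, and SortedNames are illustrative only and do not appear in the original script. The leading rmf line is an optional addition, not part of the recipe: it removes the target directory if it exists and does not fail if it is missing, so the script can be rerun without a manual rm -rf.

```pig
/* Sketch only: a step-by-step variant of Recipe005.pig. */
/* Assumption: paths and jar location match the recipe.  */

/* Optional: clear the target directory before the run. */
rmf /home/hduser/data/Recipe005.out;

REGISTER '/usr/local/pig/contrib/piggybank/java/piggybank.jar';
DEFINE CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage();

DVDData = LOAD '/home/hduser/data/KeyValuePair.txt'
  USING CSVExcelStorage(',')
  AS
  (
    DVDName:chararray,
    AttributeName:chararray,
    AttributeValue:chararray
  );

/* Step 1: keep only the second column. */
Names = FOREACH DVDData GENERATE AttributeName;

/* Step 2: drop duplicate rows. */
UniqueNames = DISTINCT Names;

/* Step 3: sort the unique names in ascending order. */
SortedNames = ORDER UniqueNames BY AttributeName ASC;

STORE SortedNames INTO '/home/hduser/data/Recipe005.out';
```

Run it the same way as the recipe (pig -x local). Splitting the steps into separate aliases also makes it easy to add a DESCRIBE or DUMP on an intermediate alias while debugging.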