Pig Recipes
Goal
The input file contains attribute names in its second column. Use the Pig FILTER statement to restrict the data to a single attribute name supplied as a parameter. Build a sorted list of the attribute values associated with that attribute name, and output the attribute name together with its unique values.
Source File: /home/hduser/data/KeyValuePair.txt
Target Directory: /home/hduser/data/Recipe007.out/
Parameter Substitution
- argAttrName (Not required, no default value defined) - This optional parameter filters the output to a single attribute name. Only one attribute name can be specified. When not provided, the process will output all records with no filtering.
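For example, the parameter is supplied on the pig command line (the same invocations appear in the script header under Solution):

pig -x local Recipe007.pig
pig -x local -p argAttrName=Studio Recipe007.pig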
After execution, the output directory should look like this:
hduser> ls -la
total 212
drwxrwxr-x. 2 hduser hduser     84 Apr 27 10:18 .
drwxrwxr-x. 3 hduser hduser     49 Apr 27 10:18 ..
-rw-r--r--. 1 hduser hduser 206224 Apr 27 10:18 part-r-00000
-rw-rw-r--. 1 hduser hduser   1620 Apr 27 10:18 .part-r-00000.crc
-rw-r--r--. 1 hduser hduser      0 Apr 27 10:18 _SUCCESS
-rw-rw-r--. 1 hduser hduser      8 Apr 27 10:18 ._SUCCESS.crc
Discussion
The filter statement allows you to select which records will be retained in your data pipeline. A filter contains a predicate. If that predicate evaluates to true for a given record, that record will be passed down the pipeline. Otherwise, it will not.
Predicates can contain the equality operators you expect, including == to test equality, and !=, >, >=, <, and <=. These comparators can be used on any scalar data type. == and != can be applied to maps and tuples. To use these with two tuples, both tuples must have either the same schema or no schema. None of the equality operators can be applied to bags.
Pig Latin follows the operator precedence that is standard in most programming languages, where arithmetic operators have precedence over equality operators. So, x + y == a + b is equivalent to (x + y) == (a + b).
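As a small illustrative sketch (reusing the NYSE_dividends schema from the examples that follow, so the file name and field names are borrowed rather than new), an equality test on a chararray and an arithmetic expression on the left-hand side of a comparison both work without extra parentheses:

-- filter_compare.pig (illustrative sketch)
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
-- equality on a chararray field
cm   = filter divs by symbol == 'CM';
-- arithmetic binds tighter than the comparison: read as (dividends * 4) >= (0.25 + 0.75)
big  = filter divs by dividends * 4 >= 0.25 + 0.75;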
For chararrays, users can test to see whether the chararray matches a regular expression:
-- filter_matches.pig
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
startswithcm = filter divs by symbol matches 'CM.*';
You can find chararrays that do not match a regular expression by preceding the test with not:
-- filter_not_matches.pig
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
notstartswithcm = filter divs by not symbol matches 'CM.*';
You can combine multiple predicates into one by using the Boolean operators and and or, and you can reverse the outcome of any predicate by using the Boolean not operator. As is standard, the precedence of Boolean operators, from highest to lowest, is not, and, or. Thus a and b or not c is equivalent to (a and b) or (not c).
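Here is how those precedence rules read in practice (a sketch using the same NYSE_dividends schema as the examples above):

-- filter_bool.pig (illustrative sketch)
divs   = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
-- parsed as ((symbol == 'CM') and (dividends > 0.1)) or (not (exchange == 'NASDAQ'))
picked = filter divs by symbol == 'CM' and dividends > 0.1 or not (exchange == 'NASDAQ');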
Pig will short-circuit Boolean operations when possible. If the first (left) predicate of an and is false, the second (right) will not be evaluated. So in 1 == 2 and udf(x), the UDF will never be invoked. Similarly, if the first predicate of an or is true, the second predicate will not be evaluated: 1 == 1 or udf(x) will never invoke the UDF.
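So the order in which predicates are written can matter for performance. A sketch, assuming a hypothetical filter UDF named IsGoodRecord that has already been registered: putting the cheap comparison on the left means the UDF is only invoked for rows that survive it.

-- filter_shortcircuit.pig (sketch; IsGoodRecord is a hypothetical, pre-registered UDF)
divs  = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
cheap = filter divs by dividends > 0.0 and IsGoodRecord(symbol);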
For Boolean operators, nulls follow the SQL trinary logic. Thus x == null results in a value of null, not true (even when x is null also) or false. Filters pass through only those values that are true. So for a field that had three values 2, null, and 4, if you applied a filter x == 2 to it, only the first record, where the value is 2, would be passed through the filter. Likewise, x != 2 would return only the last record, where the value is 4. The way to look for null values is to use the is null operator, which returns true whenever the value is null. To find values that are not null, use is not null.
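In filter terms (an illustrative sketch with the same schema as above):

-- filter_null.pig (illustrative sketch)
divs     = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
-- rows where the dividends field is missing
nodivs   = filter divs by dividends is null;
-- rows that actually carry a dividends value
withdivs = filter divs by dividends is not null;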
Likewise, null neither matches nor fails to match any regular expression value.
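That means a matches filter silently drops rows whose chararray is null; to keep them you have to say so explicitly (a sketch, again on NYSE_dividends):

-- keep rows whose symbol is null as well as those matching the pattern
divs     = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
cmornull = filter divs by (symbol is null) or (symbol matches 'CM.*');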
Just as there are UDFs to be used in evaluation expressions, there are UDFs specifically for filtering records, called filter funcs. These are eval funcs that return a Boolean value and can be invoked in the filter statement. Filter funcs cannot be used in foreach statements.
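For example, Pig ships with the built-in filter func IsEmpty, which is handy after a cogroup for keeping only groups that have records on both sides (a sketch; the NYSE_daily file and its schema are assumed here purely for illustration):

-- filter_isempty.pig (illustrative sketch; NYSE_daily is an assumed second input)
divs  = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray);
grpd  = cogroup divs by symbol, daily by symbol;
-- keep only symbols that have records in both inputs
both  = filter grpd by not IsEmpty(divs) and not IsEmpty(daily);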
Solution
The following is the script file.
/*****************************************************************************/
/* Recipe007.pig */
/* */
/* Purpose: */
/*   The input file contains attribute names in the second column. Use the */
/*   Pig FILTER statement to allow for a single attribute name that will be */
/*   used as a filter. Build a sorted list of attribute values that are */
/*   associated with that attribute name. Output the attribute name and */
/*   unique values. */
/* */
/* Parameter Substitution - */
/*   argAttrName (Not required, no default value defined) */
/*   This optional parameter filters the output to a single attribute name. */
/*   Only one attribute name can be specified. When not provided, the */
/*   process will output all records with no filtering. */
/* */
/* Pig Execution Mode: local */
/* Pig Batch Execution: */
/*   pig -x local Recipe007.pig */
/*   pig -x local -p argAttrName=Studio Recipe007.pig */
/*   pig -x local -p argAttrName=Studio -dryrun Recipe007.pig */
/* */
/* The target directory must not exist prior to executing this script. Use */
/* this command to safely delete the target directory: */
/*   rm -rf /home/hduser/data/Recipe007.out */
/* */
/*****************************************************************************/
/* Date     Initials Description */
/* -------- -------- ------------------------------------------------------- */
/* 20170527 Reo      Initial. */
/*****************************************************************************/

/*****************************************************************************/
/* The source file contains fields where all of the values are enclosed in */
/* double quotes. In some cases, there are commas (,) within the double */
/* quotes. If the Load is used with PigStorage (,), the data will be parsed */
/* incorrectly. Therefore, the CSVExcelStorage method will be used to */
/* ensure good parsing. CSVExcelStorage is in the PiggyBank, so it must be */
/* REGISTERED. */
/*****************************************************************************/
REGISTER '/usr/local/pig/contrib/piggybank/java/piggybank.jar';

/*****************************************************************************/
/* Set up an alias to the java package. */
/*****************************************************************************/
DEFINE CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage();

/*****************************************************************************/
/* Define the input parameter and initialize it to NULL. */
/*****************************************************************************/
%DEFAULT argAttrName NULL

/*****************************************************************************/
/* Read in the data using a comma (,) as the delimiter. */
/*****************************************************************************/
DVDData = LOAD '/home/hduser/data/KeyValuePair.txt'
          USING CSVExcelStorage(',')
          AS
            (
            DVDName:chararray,
            AttributeName:chararray,
            AttributeValue:chararray
            );

/*****************************************************************************/
/* Let's build a Relation that contains the fields that we are really */
/* interested in. */
/*****************************************************************************/
A = FOREACH DVDData GENERATE AttributeName, AttributeValue;

/*****************************************************************************/
/* Filter out everything except what the input argument specifies. */
/*****************************************************************************/
B = FILTER A BY AttributeName == '$argAttrName';

/*****************************************************************************/
/* Remove the duplicate records. Since DISTINCT works only on entire */
/* records, there is no need to specify individual fields. */
/*****************************************************************************/
C = DISTINCT B;

/*****************************************************************************/
/* Time to STORE the data that was just read in. */
/*****************************************************************************/
STORE C INTO '/home/hduser/data/Recipe007.out';
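One caveat: with the %DEFAULT of NULL, the FILTER above matches only rows whose AttributeName is literally the string NULL when the parameter is omitted, rather than passing everything through as the header describes. A minimal sketch of an alternative predicate that honors that optional behavior (same aliases, purely illustrative, not part of the script above):

-- Sketch only: pass every row through when argAttrName is left at its NULL default;
-- otherwise keep only the requested attribute name.
B = FILTER A BY ('$argAttrName' == 'NULL') OR (AttributeName == '$argAttrName');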