Pig Recipes
Goal
The input file contains attribute names in its second column. Use the Pig FILTER statement to restrict the data to a single attribute name supplied as a parameter. Build a sorted list of the attribute values associated with that attribute name, and output the attribute name together with its unique values.
Source File: /home/hduser/data/KeyValuePair.txt
Target Directory: /home/hduser/data/Recipe007.out/
Parameter Substitution
- argAttrName (Not required, no default value defined) - This optional parameter filters the output to a single attribute name. Only one attribute name can be specified. When not provided, the process will output all records with no filtering.
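For example, the parameter is supplied on the pig command line (the same invocations appear in the script header under Solution):

pig -x local Recipe007.pig
pig -x local -p argAttrName=Studio Recipe007.pig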
After execution, the output directory should look like this:
hduser> ls -la
total 212
drwxrwxr-x. 2 hduser hduser     84 Apr 27 10:18 .
drwxrwxr-x. 3 hduser hduser     49 Apr 27 10:18 ..
-rw-r--r--. 1 hduser hduser 206224 Apr 27 10:18 part-r-00000
-rw-rw-r--. 1 hduser hduser   1620 Apr 27 10:18 .part-r-00000.crc
-rw-r--r--. 1 hduser hduser      0 Apr 27 10:18 _SUCCESS
-rw-rw-r--. 1 hduser hduser      8 Apr 27 10:18 ._SUCCESS.crc
Discussion
The filter statement allows you to select which records will be retained in your data pipeline. A filter contains a predicate. If that predicate evaluates to true for a given record, that record will be passed down the pipeline. Otherwise, it will not.
Predicates can contain the equality operators you expect, including == to test equality, and !=, >, >=, <, and <=. These comparators can be used on any scalar data type. == and != can be applied to maps and tuples. To use these with two tuples, both tuples must have either the same schema or no schema. None of the equality operators can be applied to bags.
Pig Latin follows the operator precedence that is standard in most programming languages, where arithmetic operators have precedence over equality operators. So, x + y == a + b is equivalent to (x + y) == (a + b).
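As a small illustrative sketch (reusing the NYSE_dividends schema from the examples that follow, so the file name and field names are borrowed rather than new), an equality test on a chararray and an arithmetic expression on the left-hand side of a comparison both work without extra parentheses:

-- filter_compare.pig (illustrative sketch)
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
-- equality on a chararray field
cm   = filter divs by symbol == 'CM';
-- arithmetic binds tighter than the comparison: read as (dividends * 4) >= (0.25 + 0.75)
big  = filter divs by dividends * 4 >= 0.25 + 0.75;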
For chararrays, users can test to see whether the chararray matches a regular expression:
-- filter_matches.pig
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
startswithcm = filter divs by symbol matches 'CM.*';
You can find chararrays that do not match a regular expression by preceding the test with not:
-- filter_not_matches.pig
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
notstartswithcm = filter divs by not symbol matches 'CM.*';
You can combine multiple predicates into one by using the Boolean operators and and or, and you can reverse the outcome of any predicate by using the Boolean not operator. As is standard, the precedence of Boolean operators, from highest to lowest, is not, and, or. Thus a and b or not c is equivalent to (a and b) or (not c).
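Here is how those precedence rules read in practice (a sketch using the same NYSE_dividends schema as the examples above):

-- filter_bool.pig (illustrative sketch)
divs   = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
-- parsed as ((symbol == 'CM') and (dividends > 0.1)) or (not (exchange == 'NASDAQ'))
picked = filter divs by symbol == 'CM' and dividends > 0.1 or not (exchange == 'NASDAQ');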
Pig will short-circuit Boolean operations when possible. If the first (left) predicate of an and is false, the second (right) will not be evaluated. So in 1 == 2 and udf(x), the UDF will never be invoked. Similarly, if the first predicate of an or is true, the second predicate will not be evaluated: 1 == 1 or udf(x) will never invoke the UDF.
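So the order in which predicates are written can matter for performance. A sketch, assuming a hypothetical filter UDF named IsGoodRecord that has already been registered: putting the cheap comparison on the left means the UDF is only invoked for rows that survive it.

-- filter_shortcircuit.pig (sketch; IsGoodRecord is a hypothetical, pre-registered UDF)
divs  = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
cheap = filter divs by dividends > 0.0 and IsGoodRecord(symbol);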
For Boolean operators, nulls follow the SQL trinary logic. Thus x == null results in a value of null, not true (even when x is null also) or false. Filters pass through only those values that are true. So for a field that had three values 2, null, and 4, if you applied a filter x == 2 to it, only the first record, where the value is 2, would be passed through the filter. Likewise, x != 2 would return only the last record, where the value is 4. The way to look for null values is to use the is null operator, which returns true whenever the value is null. To find values that are not null, use is not null.
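In filter terms (an illustrative sketch with the same schema as above):

-- filter_null.pig (illustrative sketch)
divs     = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
-- rows where the dividends field is missing
nodivs   = filter divs by dividends is null;
-- rows that actually carry a dividends value
withdivs = filter divs by dividends is not null;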
Likewise, null neither matches nor fails to match any regular expression value.
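That means a matches filter silently drops rows whose chararray is null; to keep them you have to say so explicitly (a sketch, again on NYSE_dividends):

-- keep rows whose symbol is null as well as those matching the pattern
divs     = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
cmornull = filter divs by (symbol is null) or (symbol matches 'CM.*');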
Just as there are UDFs to be used in evaluation expressions, there are UDFs specifically for filtering records, called filter funcs. These are eval funcs that return a Boolean value and can be invoked in the filter statement. Filter funcs cannot be used in foreach statements.
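For example, Pig ships with the built-in filter func IsEmpty, which is handy after a cogroup for keeping only groups that have records on both sides (a sketch; the NYSE_daily file and its schema are assumed here purely for illustration):

-- filter_isempty.pig (illustrative sketch; NYSE_daily is an assumed second input)
divs  = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray);
grpd  = cogroup divs by symbol, daily by symbol;
-- keep only symbols that have records in both inputs
both  = filter grpd by not IsEmpty(divs) and not IsEmpty(daily);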
Solution
The following is the script file.
/*****************************************************************************/
/* Recipe007.pig */
/* */
/* Purpose: */
/*   The input file contains attribute names in the second column. Use the */
/*   Pig FILTER statement to allow for a single attribute name that will be */
/*   used as a filter. Build a sorted list of attribute values that are */
/*   associated with that attribute name. Output the attribute name and */
/*   unique values. */
/* */
/* Parameter Substitution - */
/*   argAttrName (Not required, no default value defined) */
/*   This optional parameter filters the output to a single attribute name. */
/*   Only one attribute name can be specified. When not provided, the */
/*   process will output all records with no filtering. */
/* */
/* Pig Execution Mode: local */
/* Pig Batch Execution: */
/*   pig -x local Recipe007.pig */
/*   pig -x local -p argAttrName=Studio Recipe007.pig */
/*   pig -x local -p argAttrName=Studio -dryrun Recipe007.pig */
/* */
/* The target directory must not exist prior to executing this script. Use */
/* this command to safely delete the target directory: */
/*   rm -rf /home/hduser/data/Recipe007.out */
/* */
/*****************************************************************************/
/* Date     Initials Description */
/* -------- -------- ------------------------------------------------------- */
/* 20170527 Reo      Initial. */
/*****************************************************************************/

/*****************************************************************************/
/* The source file contains fields where all of the values are enclosed in */
/* double quotes. In some cases, there are commas (,) within the double */
/* quotes. If the Load is used with PigStorage (,), the data will be parsed */
/* incorrectly. Therefore, the CSVExcelStorage method will be used to */
/* ensure good parsing. CSVExcelStorage is in the PiggyBank, so it must be */
/* REGISTERED. */
/*****************************************************************************/
REGISTER '/usr/local/pig/contrib/piggybank/java/piggybank.jar';

/*****************************************************************************/
/* Set up an alias to the java package. */
/*****************************************************************************/
DEFINE CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage();

/*****************************************************************************/
/* Define the input parameter and initialize it to NULL. */
/*****************************************************************************/
%DEFAULT argAttrName NULL

/*****************************************************************************/
/* Read in the data using a comma (,) as the delimiter. */
/*****************************************************************************/
DVDData = LOAD '/home/hduser/data/KeyValuePair.txt'
          USING CSVExcelStorage(',')
          AS
            (
            DVDName:chararray,
            AttributeName:chararray,
            AttributeValue:chararray
            );

/*****************************************************************************/
/* Let's build a Relation that contains the fields that we are really */
/* interested in. */
/*****************************************************************************/
A = FOREACH DVDData GENERATE AttributeName, AttributeValue;

/*****************************************************************************/
/* Filter out everything except what the input argument specifies. */
/*****************************************************************************/
B = FILTER A BY AttributeName == '$argAttrName';

/*****************************************************************************/
/* Remove the duplicate records. Since DISTINCT works only on entire */
/* records, there is no need to specify individual fields. */
/*****************************************************************************/
C = DISTINCT B;

/*****************************************************************************/
/* Time to STORE the data that was just read in. */
/*****************************************************************************/
STORE C INTO '/home/hduser/data/Recipe007.out';
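One caveat: with the %DEFAULT of NULL, the FILTER above matches only rows whose AttributeName is literally the string NULL when the parameter is omitted, rather than passing everything through as the header describes. A minimal sketch of an alternative predicate that honors that optional behavior (same aliases, purely illustrative, not part of the script above):

-- Sketch only: pass every row through when argAttrName is left at its NULL default;
-- otherwise keep only the requested attribute name.
B = FILTER A BY ('$argAttrName' == 'NULL') OR (AttributeName == '$argAttrName');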