Parameter Substitution

Pig Latin scripts that are used frequently often have elements that need to change based on when or where they are run. A script that is run every day is likely to have a date component in its input files or filters. Rather than edit and change the script every day, you want to pass in the date as a parameter. Parameter substitution provides this capability with a basic string-replacement functionality. Parameters must start with a letter or an underscore and can then have any amount of letters, numbers, or underscores. Values for the parameters can be passed in on the command line or from a parameter file:

--daily.pig
daily     = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,
                date:chararray, open:float, high:float, low:float, close:float,
                volume:int, adj_close:float);
yesterday = filter daily by date == '$DATE';
grpd      = group yesterday all;
minmax    = foreach grpd generate MAX(yesterday.high), MIN(yesterday.low);

When you run daily.pig, you must provide a definition for the parameter DATE; otherwise, you will get an error telling you that you have undefined parameters:

pig -p DATE=2009-12-17 daily.pig

You can repeat the -p command-line switch as many times as needed. Parameters can also be placed in a file, which is convenient if you have more than a few of them. The format of the file is parameter=value, one per line. Comments in the file should be preceded by a #. You then indicate the file to be used with -m or -param_file:

pig -param_file daily.params daily.pig

Parameters passed on the command line take precedence over parameters provided in files. This way, you can provide all your standard parameters in a file and override a few as needed on the command line.

Parameters can contain other parameters. So, for example, you could have the following parameter file:

#Param file
YEAR=2009-
MONTH=12-
DAY=17
DATE=$YEAR$MONTH$DAY

A parameter must be defined before it is referenced. The parameter file here would produce an error if the DAY line came after the DATE line. The other caveat is that there is no special character to delimit the end of a parameter. Any alphanumeric or underscore character will be interpreted as part of the parameter, and any other character will be interpreted as itself. So, if you had a script that ran at the first of every month, you could not do the following:

wlogs = load 'clicks/$YEAR$MONTH01' as (url, pageid, timestamp);

This would try to resolve a parameter MONTH01 when you meant MONTH.

When using parameter substitution, all parameters in your script must be resolved after the preprocessor is finished. If not, Pig will issue an error message and not continue. You can see the results of your parameter substitution by using the -dryrun flag on the Pig command line. Pig will write out a version of your Pig Latin script with the parameter substitution done, but it will not execute the script.

You can also define parameters inside your Pig Latin script using %declare and %default. %declare allows you to define a parameter in the script itself. %default is useful to provide a common default value that can be overridden when needed. Consider a case where most of the time your script is run on one Hadoop cluster, but occasionally it is run on a different cluster with different hardware:

%default parallel_factor 10;
wlogs = load 'clicks' as (url, pageid, timestamp);
grp   = group wlogs by pageid parallel $parallel_factor;
cntd  = foreach grp generate group, COUNT(wlogs);

When running your script in the usual configuration, there is no need to set the parameter parallel_factor. On the occasions it is run in a different setup, the parallel factor can be changed by passing a value on the command line.