reosoftproductions.com
RODNEY AND ARLYN'S WEB SITE
Pig

Pig Recipes

Pig Scripts

Use Pig scripts to place Pig Latin statements and Pig commands in a single file. While not required, it is good practice to identify the file using the .pig extension.

You can run Pig scripts from the command line and from within the Grunt shell. Within the Grunt shell, use either the exec command or the run command to execute a script. Both commands are useful for debugging because you can modify a Pig script in an editor and then rerun it without leaving the shell. Both commands also promote Pig script modularity, since they allow you to reuse existing components.

run

The run command executes a script in interactive mode, so the script and the Grunt shell can interact. The script has access to aliases defined in the shell, and the aliases defined in the script can be referenced in the shell after the run command has completed. The statements from the script are added to the command history. With the run command, every store triggers execution. Issuing a run command at the grunt prompt has essentially the same effect as typing the script's statements manually.

hduser> pig
grunt> cat myscript.pig
b = ORDER a BY name;
c = LIMIT b 10;

grunt> a = LOAD 'student' AS (name, age, gpa);

grunt> run myscript.pig

grunt> d = LIMIT c 3;

grunt> DUMP d;
(alice,20,2.47)
(alice,27,1.95)
(alice,36,2.27)

exec

In contrast, use the exec command to run a Pig script in batch mode, with no interaction between the script and the Grunt shell. Aliases defined in the script are not available to the shell; however, the files that the script produces and stores on the system are visible after the script has run. Aliases defined in the shell are not available to the script. With the exec command, store statements do not trigger execution; rather, the entire script is parsed before execution starts. Unlike the run command, exec does not change the command history or remember the handles used inside the script. Within a script, exec without any parameters forces execution up to the point where the exec occurs.

hduser> pig
grunt> cat myscript.pig
a = LOAD 'student' AS (name, age, gpa);
b = LIMIT a 3;
DUMP b;

grunt> exec myscript.pig
(alice,20,2.47)
(luke,18,4.00)
(holly,24,3.27)
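
The last point, calling exec with no parameters inside a script, can be sketched as follows. This is only an illustration: the output path 'top_students', the filter threshold, and the aliases are assumptions, not part of the examples above.

```pig
-- Sketch only: paths, aliases, and the gpa threshold are hypothetical.
a = LOAD 'student' AS (name:chararray, age:int, gpa:float);
b = FILTER a BY gpa >= 3.0;
STORE b INTO 'top_students';

-- Force execution of everything above before continuing,
-- so that 'top_students' exists when it is loaded below.
exec;

c = LOAD 'top_students' AS (name:chararray, age:int, gpa:float);
d = ORDER c BY gpa DESC;
DUMP d;
```

Without the exec, the whole script would be parsed before anything runs, so the second LOAD would depend on output that has not yet been written.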

Comments

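You can include comments in Pig scripts. Use /* ... */ for multi-line comments and -- for single-line comments. For example, a script header: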

/*****************************************************************************/
/* scriptname.pig                                                            */
/* Description of the script.                                                */
/*****************************************************************************/
/* Date     Initials Description                                             */
/* -------- -------- ------------------------------------------------------- */
/* 20160419 Reo      Initial.                                                */
/*****************************************************************************/

Pig scripts allow you to pass values to parameters using parameter substitution.
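
As a rough sketch of both comment styles and parameter substitution together, the script below declares two parameters with fallback values. The script name report.pig, the parameter names input and n, and their values are assumptions chosen for illustration.

```pig
/* report.pig (hypothetical script name) */
-- Single-line comments start with two hyphens.

-- Fallback values, used when a parameter is not supplied.
%default input student
%default n 3

a = LOAD '$input' AS (name:chararray, age:int, gpa:float);
b = LIMIT a $n;
DUMP b;

-- Possible invocations (values are illustrative):
--   hduser> pig -param input=student -param n=5 report.pig
--   grunt>  exec -param input=student -param n=5 report.pig
```

Values passed with -param override the %default declarations, so the same script can be reused against different inputs.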

PigStorage

PigStorage is a built-in function of Pig, and one of the most common functions used to load and store data in Pig scripts. PigStorage can be used to parse text data with an arbitrary delimiter, or to output data in a delimited format.

If no argument is provided, PigStorage assumes tab-delimited format. If a delimiter argument is provided, it must be a single-byte character; any literal (e.g. 'a', '|') or known escape character (e.g. '\t', '\r') is a valid delimiter. For example, to load a space-separated file:

data = LOAD 's3n://input-bucket/input-folder' USING PigStorage(' ')
            AS (field0:chararray, field1:int);
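
PigStorage works the same way for output. The following sketch, with hypothetical input and output paths, loads a tab-delimited file using the default delimiter and writes it back out pipe-delimited:

```pig
-- Sketch only: both paths are hypothetical.
-- With no argument, PigStorage expects tab-separated fields.
data = LOAD 'input/students.tsv' USING PigStorage()
           AS (name:chararray, age:int, gpa:float);

-- Write the same relation back out as pipe-delimited text.
STORE data INTO 'output/students_pipe' USING PigStorage('|');
```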

Limitations

PigStorage is an extremely simple loader that does not handle special cases such as embedded delimiters or escaped control characters; it will split on every instance of the delimiter regardless of context. For this reason, when loading a CSV file it is recommended to use CSVExcelStorage rather than PigStorage with a comma delimiter.
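
A minimal sketch of the CSVExcelStorage alternative follows. It assumes the Piggybank jar, which provides CSVExcelStorage, is available in the current directory; the jar location, the input path, and the schema are illustrative and will differ per installation.

```pig
-- Assumption: piggybank.jar is in the current directory; adjust the path.
REGISTER piggybank.jar;

-- Hypothetical input whose quoted fields may contain commas,
-- for example: "Smith, Alice",20,2.47
data = LOAD 'input/students.csv'
       USING org.apache.pig.piggybank.storage.CSVExcelStorage()
       AS (name:chararray, age:int, gpa:float);
DUMP data;
```

With PigStorage(',') the quoted name in that sample line would be split into two fields at the embedded comma.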

Execution Mode

When you start Pig with the pig command, you can choose the execution mode with the -x option on the command line. If -x is omitted, Pig starts in MapReduce mode by default.

Local Mode

Local mode requires only a single machine. Pig is installed and run using your local host and file system, so all files used by the Pig script must be on the local host and local file system. To specify local mode, use the -x flag with the value local. This mode is generally used for testing purposes.

hduser> pig -x local

To execute in local execution mode with a batch file, use:

hduser> pig -x local myscript.pig

MapReduce Mode

To run Pig in mapreduce mode, Pig must have access to a Hadoop cluster and HDFS installation. To specify mapreduce mode, use the -x flag with the value mapreduce. The -x flag can be omitted since this is the default setting.

hduser> pig -x mapreduce

or

hduser> pig

To execute in mapreduce execution mode with a batch file, use:

hduser> pig -x mapreduce myscript.pig

or

hduser> pig myscript.pig