Pig Recipes
- 001: Load the KeyValuePair.txt file in Local Mode
- 002: Load the KeyValuePair.txt file in MapReduce Mode
- 003: Using Parameters
- 004: Create sorted, distinct list of attributes
- 005: Create sorted, distinct list of attributes, nested commands
- 006: Create a Pig bag for each of the Attribute Values (GROUP)
- 007: Create a sorted list of attribute values for a single attribute name, which can be supplied by an optional input parameter (FILTER)
Pig Scripts
Use Pig scripts to place Pig Latin statements and Pig commands in a single file. While not required, it is good practice to identify the file using the .pig extension.
You can run Pig scripts from within the Grunt shell and from the command line. Within the Grunt shell, use either the exec command or the run command to execute the script. Both commands are useful for debugging because you can modify a Pig script in an editor and then rerun it in the Grunt shell without leaving the shell. Both commands also promote Pig script modularity because they let you reuse existing components.
run
The run command executes a script in interactive mode, so the script can interact with the Grunt shell. The script has access to aliases defined externally via the Grunt shell, and the Grunt shell has access to aliases defined within the script. With the run command, every store triggers execution. The statements from the script are put into the command history, and all the aliases defined in the script can be referenced in subsequent statements after the run command has completed. Issuing a run command on the Grunt command line has basically the same effect as typing the statements manually.
hduser> pig
grunt> cat myscript.pig
b = ORDER a BY name;
c = LIMIT b 10;
grunt> a = LOAD 'student' AS (name, age, gpa);
grunt> run myscript.pig
grunt> d = LIMIT c 3;
grunt> DUMP d;
(alice,20,2.47)
(alice,27,1.95)
(alice,36,2.27)
exec
In contrast, the exec command runs a Pig script with no interaction between the script and the Grunt shell (batch mode). Aliases defined in the script are not available to the shell; however, the files produced as the output of the script and stored on the system are visible after the script is run. Aliases defined via the shell are not available to the script. With the exec command, store statements do not trigger execution; rather, the entire script is parsed before execution starts. Unlike the run command, exec does not change the command history or remember the handles (aliases) used inside the script. exec without any parameters can be used in scripts to force execution up to the point in the script where the exec occurs.
hduser> pig
grunt> cat myscript.pig
a = LOAD 'student' AS (name, age, gpa);
b = LIMIT a 3;
DUMP b;
grunt> exec myscript.pig
(alice,20,2.47)
(luke,18,4.00)
(holly,24,3.27)
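The last point about a bare exec can be sketched as follows. This is only an illustration: the script name, the output paths (tmp_honors, honors_count), and the gpa threshold are hypothetical, and it assumes the script is run in batch mode, where Pig would otherwise parse the whole script before executing anything.

```pig
-- Hypothetical script: forcestep.pig
a = LOAD 'student' AS (name:chararray, age:int, gpa:double);
b = FILTER a BY gpa >= 3.0;
STORE b INTO 'tmp_honors';            -- intermediate result (assumed path)

c = LOAD 'tmp_honors' AS (name:chararray, age:int, gpa:double);
d = GROUP c ALL;
e = FOREACH d GENERATE COUNT(c);
STORE e INTO 'honors_count';          -- final result (assumed path)

-- Force execution of everything above before the cleanup below runs.
exec;

-- The intermediate directory is no longer needed at this point.
-- rmf does not fail if the path is already missing.
rmf tmp_honors
```

Without the exec, the cleanup command could take effect before the statements above it had actually run, because batch mode defers execution.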
Comments
You can include comments in Pig scripts:
- For multi-line comments use /* ... */
- For single-line comments use --
/*****************************************************************************/
/* scriptname.pig                                                            */
/* Description of the pig.                                                   */
/*****************************************************************************/
/* Date     Initials Description                                             */
/* -------- -------- ------------------------------------------------------- */
/* 20160419 Reo      Initial.                                                */
/*****************************************************************************/
Pig scripts allow you to pass values to parameters using parameter substitution.
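A minimal sketch of parameter substitution, for illustration only. The script name params.pig and the parameter names input and n are hypothetical. The $name placeholders, the %default statement, and the -param command-line option are standard Pig features, but their exact behavior is worth checking against your Pig version.

```pig
-- Hypothetical script: params.pig
-- Fallback values, used only when no -param is supplied for that name.
%default input 'student';
%default n '3';

-- $input and $n are replaced textually before the script is parsed.
a = LOAD '$input' AS (name, age, gpa);
b = LIMIT a $n;
DUMP b;
```

Assuming the script above, the defaults could be overridden from the command line like this:

hduser> pig -x local -param input=student -param n=5 params.pig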
PigStorage
PigStorage is a built-in function of Pig, and one of the most common functions used to load and store data in Pig scripts. PigStorage can be used to parse text data with an arbitrary delimiter, or to output data in a delimited format.
If no argument is provided, PigStorage assumes tab-delimited format. If a delimiter argument is provided, it must be a single-byte character; any literal (e.g. 'a', '|') or known escape character (e.g. '\t', '\r') is a valid delimiter. For example, to load a space-separated file:
data = LOAD 's3n://input-bucket/input-folder' USING PigStorage(' ') AS (field0:chararray, field1:int);
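The default delimiter and the output side can be sketched in the same way. The relation name and the input and output paths below are hypothetical.

```pig
-- No argument: fields are expected to be tab-delimited.
data = LOAD 'input-folder' USING PigStorage() AS (field0:chararray, field1:int);

-- Write the same relation back out with '|' between fields.
STORE data INTO 'output-folder' USING PigStorage('|');
```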
Limitations
PigStorage is an extremely simple loader that does not handle special cases such as embedded delimiters or escaped control characters; it will split on every instance of the delimiter regardless of context. For this reason, when loading a CSV file it is recommended to use CSVExcelStorage rather than PigStorage with a comma delimiter.
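A hedged sketch of that alternative follows. CSVExcelStorage ships in Piggybank rather than in core Pig, so the jar has to be registered first. The jar path, the input file, and the schema are all assumptions made for illustration.

```pig
-- Path to the Piggybank jar is an assumption; adjust it for your installation.
REGISTER /path/to/piggybank.jar;

-- Hypothetical input file and schema.
people = LOAD 'people.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',')
    AS (name:chararray, city:chararray, age:int);
DUMP people;
```

With this loader, a quoted value such as "Portland, OR" should stay in a single field, whereas PigStorage(',') would split it in two at the embedded comma.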
Execution Mode
When Pig is started via the pig command, it can be started in a particular Execution Mode. The Execution Mode is specified by an argument on the Pig command line: the -x option designates the Execution Mode. If -x is omitted, Pig starts in MapReduce mode by default.
Local Mode
Local mode requires only a single machine. All files are installed and run using your local host and file system, which means that all files used by the Pig script are on the local host and local file system. To specify local mode, use the -x flag with the value local. This mode is generally used for testing purposes.
hduser> pig -x local
To execute in local execution mode with a batch file, use:
hduser> pig -x local myscript.pig
MapReduce Mode
To run Pig in MapReduce mode, Pig must have access to a Hadoop cluster and an HDFS installation. To specify MapReduce mode, use the -x flag with the value mapreduce. The -x flag can be omitted because this is the default setting.
hduser> pig -x mapreduce
or
hduser> pig
To execute in mapreduce execution mode with a batch file, use:
hduser> pig -x mapreduce myscript.pig
or
hduser> pig myscript.pig