Pig Recipes
Goal
Copy a file from one location to another on the local file system. Before copying the file, attempt to delete the target directory. If the deletion fails (for example, because the directory does not exist yet), the script must not fail.
When writing to the output directory, suppress the writing of the _SUCCESS file.
- Source File: /home/hduser/data/KeyValuePair.txt
- Target Directory: /home/hduser/data/Recipe001.out/
- Parameter Substitution: None
After execution, the output directory should look like this:
hduser> ls -la
total 88176
drwxrwxr-x. 2 hduser hduser     4096 Apr 27 11:06 .
drwxrwxr-x. 3 hduser hduser       49 Apr 27 11:06 ..
-rw-r--r--. 1 hduser hduser 30481805 Apr 27 11:06 part-m-00000
-rw-rw-r--. 1 hduser hduser   238148 Apr 27 11:06 .part-m-00000.crc
-rw-r--r--. 1 hduser hduser 30153719 Apr 27 11:06 part-m-00001
-rw-rw-r--. 1 hduser hduser   235584 Apr 27 11:06 .part-m-00001.crc
-rw-r--r--. 1 hduser hduser 28941223 Apr 27 11:06 part-m-00002
-rw-rw-r--. 1 hduser hduser   226112 Apr 27 11:06 .part-m-00002.crc
Note: Each .crc file is a cyclic redundancy check (CRC) checksum file, used to verify the data integrity of the part file it is named after. CRC is a popular technique for checking data integrity because it has excellent error detection abilities, uses few resources, and is easy to use. For more information, see the Wiki page on cyclic redundancy check.
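To make the note concrete, here is a minimal Python sketch of chunked CRC-32 checking. It is an illustration only, not the code that writes the .crc files, and it does not reproduce their on-disk format. The 512-byte chunk size and the 8-byte header are assumptions; they were chosen because they reproduce the .crc file sizes in the listing above. All function and constant names are hypothetical.

```python
import zlib

CHUNK_SIZE = 512   # assumption: bytes of data covered by one checksum
HEADER_SIZE = 8    # assumption: fixed header at the start of a .crc file
CRC_SIZE = 4       # a CRC-32 value occupies 4 bytes


def chunk_checksums(data, chunk_size=CHUNK_SIZE):
    """Return one CRC-32 value for each chunk_size-byte chunk of data."""
    return [
        zlib.crc32(data[start:start + chunk_size])
        for start in range(0, len(data), chunk_size)
    ]


def first_bad_chunk(data, expected, chunk_size=CHUNK_SIZE):
    """Return the index of the first chunk that fails its check, or None."""
    actual = chunk_checksums(data, chunk_size)
    for index, (got, want) in enumerate(zip(actual, expected)):
        if got != want:
            return index
    if len(actual) != len(expected):
        # The data grew or shrank: the first unmatched chunk is the bad one.
        return min(len(actual), len(expected))
    return None


def expected_crc_file_size(data_size, chunk_size=CHUNK_SIZE):
    """Estimate the checksum file size for a data file of data_size bytes."""
    chunks = -(-data_size // chunk_size)  # ceiling division
    return HEADER_SIZE + CRC_SIZE * chunks


if __name__ == "__main__":
    # Sizes taken from the directory listing: part file -> .crc file.
    for part_size in (30481805, 30153719, 28941223):
        print(part_size, expected_crc_file_size(part_size))
```

Storing one checksum per small chunk, rather than one per file, lets a reader locate the damaged region instead of only learning that the file as a whole is corrupt.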
The following is the script file.
/*****************************************************************************/
/* Recipe001.pig                                                             */
/*                                                                           */
/* Purpose:                                                                  */
/* Copy a file from one location to another on the local file system.        */
/* Prior to copying the file, attempt to delete the target directory. The    */
/* deletion of the target directory should not cause a failure that would    */
/* result in the script failing.                                             */
/*                                                                           */
/* Parameter Substitution: None                                              */
/*                                                                           */
/* Pig Execution Mode: local                                                 */
/* Pig Batch Execution:                                                      */
/*   pig -x local Recipe001.pig                                              */
/*                                                                           */
/* When writing to a filesystem, the output will be in a directory with      */
/* part files rather than a single file. But how many part files will be     */
/* created? That depends on the parallelism of the last job before the       */
/* store. If it has reduces, it will be determined by the parallel level     */
/* set for that job. In testing this recipe, the output file sizes are       */
/* about 31 MB per file.                                                     */
/*****************************************************************************/
/* Date     Initials Description                                             */
/* -------- -------- ------------------------------------------------------- */
/* 20160419 Reo      Initial.                                                */
/*****************************************************************************/

/*****************************************************************************/
/* The following SET command will suppress the creation of the _SUCCESS file */
/* in the output directory.                                                  */
/*****************************************************************************/
SET mapreduce.fileoutputcommitter.marksuccessfuljobs false;

/*****************************************************************************/
/* The source file contains fields where all of the values are enclosed in   */
/* double quotes. In some cases, there are commas (,) within the double      */
/* quotes. If the Load is used with PigStorage (,), the data will be parsed  */
/* incorrectly. Therefore, the CSVExcelStorage loader will be used to        */
/* ensure good parsing. CSVExcelStorage is in the PiggyBank, so it must be   */
/* REGISTERED.                                                               */
/*****************************************************************************/
REGISTER '/usr/local/pig/contrib/piggybank/java/piggybank.jar';

/*****************************************************************************/
/* Set up an alias to the Java class.                                        */
/*****************************************************************************/
DEFINE CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage();

/*****************************************************************************/
/* The rmf command is equivalent to the Unix command rm -f. It does not      */
/* error out if the indicated file/directory is not present to remove.       */
/*****************************************************************************/
rmf /home/hduser/data/Recipe001.out

/*****************************************************************************/
/* Read in the data using a comma (,) as the delimiter.                      */
/*****************************************************************************/
DVDData = LOAD '/home/hduser/data/KeyValuePair.txt'
          USING CSVExcelStorage(',')
          AS (
              DVDName:chararray,
              AttributeName:chararray,
              AttributeValue:chararray
          );

/*****************************************************************************/
/* Time to STORE the data that was just read in.                             */
/*****************************************************************************/
STORE DVDData INTO '/home/hduser/data/Recipe001.out';
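
The script's comments say that a plain comma split misparses fields that contain commas inside double quotes. The Python sketch below illustrates that difference outside of Pig. It does not use PigStorage or CSVExcelStorage; it only contrasts a naive split with a quote-aware parser. The sample line is hypothetical (the contents of KeyValuePair.txt are not shown in this recipe), and the function names are made up for the example.

```python
import csv
import io

# Hypothetical input line: every value is double-quoted, and two of the
# three values contain a comma inside the quotes.
SAMPLE_LINE = '"Example Title, The","Genre","Action, Adventure"'


def naive_split(line):
    """Split on every comma and ignore quoting entirely."""
    return line.split(",")


def quote_aware_split(line):
    """Parse one line with Python's csv module, which honors double quotes."""
    return next(csv.reader(io.StringIO(line)))


if __name__ == "__main__":
    print(len(naive_split(SAMPLE_LINE)), naive_split(SAMPLE_LINE))
    print(len(quote_aware_split(SAMPLE_LINE)), quote_aware_split(SAMPLE_LINE))
```

The naive split yields five fragments for a three-field record, so the values would no longer line up with the three-column schema (DVDName, AttributeName, AttributeValue). The quote-aware parse yields exactly three values.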