reosoftproductions.com
RODNEY AND ARLYN'S WEB SITE
Pig

Pig Recipes

Recipe 001

Goal

Copy a file from one location to another on the local file system. Prior to copying the file, attempt to delete the target directory. If the target directory does not exist, the deletion attempt must not cause the script to fail.

When writing to the output directory, suppress the writing of the _SUCCESS file.

After execution, the output directory should look like this:

hduser> ls -la
total 88176
drwxrwxr-x. 2 hduser hduser     4096 Apr 27 11:06 .
drwxrwxr-x. 3 hduser hduser       49 Apr 27 11:06 ..
-rw-r--r--. 1 hduser hduser 30481805 Apr 27 11:06 part-m-00000
-rw-rw-r--. 1 hduser hduser   238148 Apr 27 11:06 .part-m-00000.crc
-rw-r--r--. 1 hduser hduser 30153719 Apr 27 11:06 part-m-00001
-rw-rw-r--. 1 hduser hduser   235584 Apr 27 11:06 .part-m-00001.crc
-rw-r--r--. 1 hduser hduser 28941223 Apr 27 11:06 part-m-00002
-rw-rw-r--. 1 hduser hduser   226112 Apr 27 11:06 .part-m-00002.crc

Note: Each .crc file is a cyclic redundancy check (CRC) checksum file used to verify the data integrity of the part file it accompanies. CRC is a popular technique for checking data integrity because it detects errors well, uses few resources, and is easy to use. For more information, see the Wiki page on cyclic redundancy check.
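The sizes in the listing above are consistent with one particular checksum-file layout, which is an assumption here and is not stated in the recipe: an 8-byte header followed by one 4-byte CRC for every 512 bytes of data. The minimal Python sketch below checks that arithmetic against the listed files and shows that a CRC changes when a single bit of the data changes. The function name and constants are illustrative, and zlib.crc32 only stands in for whichever CRC variant Hadoop actually uses.

```python
import zlib

# Assumed layout of a .crc sidecar file (not stated in the recipe):
# an 8-byte header, then one 4-byte CRC per 512-byte chunk of data.
HEADER_BYTES = 8
BYTES_PER_CHECKSUM = 512
CRC_BYTES = 4


def expected_crc_file_size(data_size: int) -> int:
    """Return the sidecar size implied by the assumed layout."""
    chunks = -(-data_size // BYTES_PER_CHECKSUM)  # ceiling division
    return HEADER_BYTES + chunks * CRC_BYTES


# Part-file sizes mapped to .crc sizes, copied from the directory listing.
listing = {
    30481805: 238148,
    30153719: 235584,
    28941223: 226112,
}
for data_size, crc_size in listing.items():
    print(data_size, expected_crc_file_size(data_size), crc_size)

# Flip one bit in a hypothetical record and compare the checksums.
original = b"hypothetical sample record\n"
corrupted = bytes([original[0] ^ 0x01]) + original[1:]
print(zlib.crc32(original) != zlib.crc32(corrupted))
```

Under the assumed layout, the computed size matches the listed .crc size for all three part files.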

The following is the script file.

/*****************************************************************************/
/* Recipe001.pig                                                             */
/*                                                                           */
/* Purpose:                                                                  */
/* Copy a file from one location to another on the local file system.        */
/* Prior to copying the file, attempt to delete the target directory.  The   */
/* deletion of the target directory should not cause a failure that would    */
/* result in the script failing.                                             */
/*                                                                           */
/* Parameter Substitution:  None                                             */
/*                                                                           */
/* Pig Execution Mode:  local                                                */
/* Pig Batch Execution:                                                      */
/*   pig -x local Recipe001.pig                                              */
/*                                                                           */
/* When writing to a filesystem, the output will be in a directory with      */
/* part files rather than a single file.  How many part files are created    */
/* depends on the parallelism of the last job before the STORE.  If that     */
/* job has reducers, the count is set by its parallel level.  If it is       */
/* map-only, as in this recipe (note the part-m- prefix), the count equals   */
/* the number of map tasks.  In testing this recipe, each output file was    */
/* about 30 MB.                                                              */
/*****************************************************************************/
/* Date     Initials Description                                             */
/* -------- -------- ------------------------------------------------------- */
/* 20160419 Reo      Initial.                                                */
/*****************************************************************************/

/*****************************************************************************/
/* The following SET command will suppress the creation of the _SUCCESS file */
/* in the output directory.                                                  */
/*****************************************************************************/
SET mapreduce.fileoutputcommitter.marksuccessfuljobs false;

/*****************************************************************************/
/* The source file contains fields where all of the values are enclosed in   */
/* double quotes.  In some cases, there are commas (,) within the double     */
/* quotes.  If LOAD is used with PigStorage(','), the data will be parsed    */
/* incorrectly.  Therefore, the CSVExcelStorage load function is used to     */
/* ensure correct parsing.  CSVExcelStorage is part of PiggyBank, so the     */
/* piggybank.jar file must be loaded with REGISTER.                          */
/*****************************************************************************/
REGISTER '/usr/local/pig/contrib/piggybank/java/piggybank.jar';

/*****************************************************************************/
/* Define a short alias for the fully qualified Java class name.             */
/*****************************************************************************/
DEFINE CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage();

/*****************************************************************************/
/* The rmf command behaves like the Unix command rm -rf.  It does not        */
/* error out if the indicated file/directory is not present to remove.       */
/*****************************************************************************/
rmf /home/hduser/data/Recipe001.out

/*****************************************************************************/
/* Read in the data using a comma (,) as the delimiter.                      */
/*****************************************************************************/
DVDData = LOAD '/home/hduser/data/KeyValuePair.txt' 
  USING CSVExcelStorage(',')
  AS
  (
    DVDName:chararray,
    AttributeName:chararray,
    AttributeValue:chararray
  );
/*****************************************************************************/
/* Time to STORE the data that was just read in.                             */
/*****************************************************************************/
STORE DVDData INTO '/home/hduser/data/Recipe001.out';
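
The script's comment on CSVExcelStorage notes that splitting on every comma misparses quoted values that contain commas. The short Python sketch below illustrates the difference. The sample line is made up, since the recipe does not show the contents of KeyValuePair.txt, and Python's csv module is only a stand-in for a quote-aware parser, not a model of how CSVExcelStorage is implemented.

```python
import csv
import io

# Hypothetical line in the style the recipe describes: every value is
# double-quoted, and one value contains a comma.
line = '"Some Movie, The","Genre","Drama"'

# Naive split on every comma, similar in spirit to a plain delimiter split.
naive_fields = line.split(",")

# Quote-aware parsing keeps the embedded comma inside its field.
quoted_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), naive_fields)
print(len(quoted_fields), quoted_fields)
```

The naive split produces four pieces, while the quote-aware parse produces three fields, which lines up with the three-column schema in the LOAD statement.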