Pig
Navigation
- Apache Pig Website
- Wiki
- Cheat Sheets
- Pig's Data Model
- My Pig Installation
- My Pig Logging
- My Pig SET Keys
- My Pig Recipes
- My Pig UDF
- Piggybank!!
- Pig's Parameter Substitution
- Hadoop and Pig
- Programming Pig (O'Reilly)
Pig is an ETL library for Hadoop. Like Hive, it generates MapReduce jobs under the covers, so it's a higher-level abstraction over MapReduce. ETL stands for Extract, Transform and Load, and it's a different kind of process than a generic query. Pig scripts are written in the Pig Latin language.
Pig vs. Hive
When should you use Pig versus Hive or MapReduce? Pig should be used when you have processes that are ETL-like. If you come out of the relational world, you might have used ETL tools for SQL Server, such as SQL Server Integration Services, which let you create ETL data transformation workflows. Other implementations of this kind of tool in the relational world are things like Informatica, and Oracle has its own set of ETL tools. When you have ETL-like jobs or tasks, use Pig as an abstraction language. Pig was specifically written to make those kinds of processes easier. You will see in our example that the implementation of word count is actually more elegant and simpler in Pig than in Hive.
To Clean Data
It's really common in big data scenarios that you're bringing in behavioral data, and it might have junk or data that is incorrect, improper or that you don't care about. For example, you might be collecting data from sensors, and maybe there was a blip or an outage, or there's some incorrect data. So you wanna look for data that is incomplete, and then you wanna discard it.
To Process Data
In big data scenarios, the volumes of data can be exponentially more than you're used to dealing with.
Using the Pig language against HDFS and the Hadoop platform really makes a lot of sense when you've got a tremendous volume of data and maybe you just want to aggregate it up. An example might be that you get location data from all of your customers worldwide as they go through your stores or through a certain section of the offering that you have. Maybe you only want the data from a certain time period, from Time A to Time B, and the rest of the data you don't need. So it's a great use of resources to dump all that data into a Hadoop cluster, and then use Pig to filter and process it.
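Here's a minimal sketch of that kind of cleanup and time-window filtering. The file name, field names and timestamp values are placeholders, not part of the course example.

-- load comma-delimited location events (hypothetical schema)
events = LOAD 'store_locations.csv' USING PigStorage(',')
         AS (customer_id:chararray, store_id:chararray, event_time:long);
-- discard incomplete records
valid  = FILTER events BY customer_id IS NOT NULL AND event_time IS NOT NULL;
-- keep only the window from Time A to Time B (Unix timestamps here)
window = FILTER valid BY event_time >= 1370000000L AND event_time < 1380000000L;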
So how does Pig work? Generally, there is an ETL process and flow that looks something like this.
All-capitalized words are keywords in the Pig Latin language. So you're going to load some file or files, and again, just like with Hive, you can load from a standard file system, a cloud-based file system, or HDFS.
And then you're gonna perform some operations. And you're gonna find, as you work with Pig, that there's a rich set of operators and functions associated with the Pig Latin language, ones that you're used to and will find very useful, like FILTER, JOIN, GROUP BY and FOREACH. I wanna call your attention to GENERATE, which is used with FOREACH, because it's really commonly used in ETL Pig workflows. FOREACH ... GENERATE lets you run some process on the intermediate data and then generate new values from it.
So it's a really common programming paradigm in Pig ETL process flows. And then you'll usually take the output and either dump it to the screen for testing, so you can look at it and see if your process is correct, or, if you're satisfied with it, you'll store it to a new file location. And, as with other processes on Hadoop, you can choose to store your output as new files into the HDFS file system, or you can store it into some other file system which can be local or cloud-based.
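Here's a minimal sketch of that generic LOAD, transform, then DUMP or STORE flow, assuming a hypothetical tab-delimited log file; the paths and field names are placeholders.

raw      = LOAD 'hdfs:///data/incoming/*.log' USING PigStorage('\t')
           AS (user:chararray, page:chararray, bytes:int);
filtered = FILTER raw BY bytes > 0;
by_page  = GROUP filtered BY page;
summary  = FOREACH by_page GENERATE group AS page, SUM(filtered.bytes) AS total_bytes;
DUMP summary;                                              -- inspect on screen while testing
-- STORE summary INTO 'hdfs:///data/output/page_totals';   -- or persist once you're satisfied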
Another consideration, if you're working with HDFS, is to remember that HDFS files are immutable. So if you are transforming and creating new results, or really making any change whatsoever, you're gonna have to create new HDFS files. So sometimes, as part of the process, you'll move the old HDFS data back into less-redundant file storage as part of a workflow. For example, on Amazon I'll sometimes move the original data out of HDFS and back into S3 because the storage is cheaper.
You might remember that HDFS by default is triple-redundant, so once you've already processed those files, it's really common that you don't need to keep the originals in HDFS anymore. Maybe the results will stay there, or maybe, in my case, they'll go back into S3 as well.
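A hedged sketch of that archive step from the Hadoop shell, using DistCp to copy the processed originals to S3 and then removing them from HDFS; the bucket and paths are placeholders.

# copy the already-processed input out of HDFS into cheaper S3 storage
hadoop distcp hdfs:///data/incoming/2014-06 s3n://my-archive-bucket/incoming/2014-06
# then free the triple-replicated HDFS space
hadoop fs -rm -r /data/incoming/2014-06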
Data Representation
How does Pig represent data? It's pretty simple. It has a field, which is a single piece of data; a tuple, which is an ordered set of fields, like a key and a value, for example; and a bag, which is a collection of tuples, so multiple sets of keys and values. You can see how this aligns nicely with the MapReduce paradigm because, as you'll remember from the sections on MapReduce, the output from the mappers and the reducers is a single tuple and then a set of tuples, which is a bag.
Pig also has the concept of a relation, which is a complete dataset, an outer bag of tuples. It's different than the idea of a relation in the relational database world, where the term gets associated with the relationships between the data in tables. In the Pig world, a relation is the whole dataset an alias refers to, so I wanted to call it out here.
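As an illustration of those pieces, DESCRIBE prints the schema of a relation, and you can see fields, tuples and bags inside it. The relation names and input path here are just placeholders.

lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;   -- grouped is a relation: an outer bag of tuples
DESCRIBE grouped;
-- grouped: {group: chararray, words: {(word: chararray)}}
--          each tuple holds a simple field (group) and a bag of tuples (words)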
Filters
Another important concept in the Pig world, because I really think of it as a dump truck for processing data, is the filter, so I wanted to take a minute to talk about filtering. It's a really, really common technique in working with Pig scripts. The syntax is that you FILTER some set BY some value. It's like a WHERE clause out of the relational world, but you use the word FILTER. It's really very similar, and for those of us that come from the relational world, it's simple and natural to use. It supports all of the logical operators we're used to, and they work the same way, as well as the relational operators, so I just called them out here. Understanding that there's very good compatibility with ANSI SQL here helps you get very productive very quickly with the Pig language.
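A small sketch of FILTER with the familiar relational and logical operators; the relation, fields and values are placeholders.

orders   = LOAD 'orders.csv' USING PigStorage(',')
           AS (id:int, country:chararray, amount:double, status:chararray);
filtered = FILTER orders BY (country == 'US' OR country == 'CA')
                         AND amount >= 100.0
                         AND NOT (status MATCHES 'CANCEL.*');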
Functions
Another aspect of the Pig language is a very rich function library. When I was talking about Hive earlier, I cautioned you to look at which aspects of HQL were supported as it relates to the superset of ANSI SQL. With Pig, what I find is that it actually has more functions than what you find in ANSI SQL. It's kind of the other way around: it has pretty much everything I've seen there, plus more, so I called some of them out here. Again, depending on the type of data manipulation and the type of ETL processes you're doing, you're gonna wanna dig into these function libraries. Rather than killing yourself trying to write something manually, you might be surprised that the function already exists.
I put the top-level categories here just so you can get an understanding of what function libraries are available. There are the general or standard functions, where you have average, min and max. I did call out TOKENIZE, a function that I'm gonna show you in a demo coming up. What it does is split text into words. It's a really useful function when you're analyzing string-based or text data, and it's much simpler, I think, than the Hive implementation I showed you in an earlier movie. These functions are available and really commonly used.
There's a set of relational functions. I showed you FILTER. I also wanna call out that, because this is a superset, there's a MAPREDUCE operator that lets you actually call a MapReduce job from within a Pig script. It's kind of interesting because you might think it would go the other way, that you'd call a Pig job from within MapReduce, but nope. You have a Pig script, and then you have a MapReduce job, usually written in Python or Java, that you can call, which is a super powerful aspect of a Pig script. Again, it's a different way of thinking about an ETL process, kinda stringing these things together, but it's very, very commonly done in production situations when you're processing these huge amounts of behavioral data and getting them down to something useful.
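A hedged sketch of that native MAPREDUCE operator, following the standard Pig syntax; the jar, class name and directories are placeholders rather than a real job.

A = LOAD 'WordcountInput.txt';
B = MAPREDUCE 'wordcount.jar'
      STORE A INTO 'inputDir'
      LOAD 'outputDir' AS (word:chararray, count:int)
      `org.myorg.WordCount inputDir outputDir`;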
You also have, of course, the string functions that you'd expect: things like upper case, lower case, substring. That's pretty standard stuff. You have your math functions: absolute value, logarithmic functions, rounding. Then you have a unique set of functions around loading and storage. I called out JsonLoader here because there's more and more JSON data coming from the web. There are also some functions around compression that can really make your Pig workflows run more quickly. Again, as a tip, I really advise that if you're working with production-level ETL Pig scripts, you take the time to investigate the function library.
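A few illustrative calls, assuming a hypothetical JSON file; the file name and fields are placeholders. JsonLoader ships with newer Pig releases, so no extra REGISTER should be needed for it.

people = LOAD 'people.json' USING JsonLoader('name:chararray, city:chararray, score:double');
shaped = FOREACH people GENERATE UPPER(name) AS name,
                                 SUBSTRING(city, 0, 3) AS city_code,
                                 ROUND(score) AS rounded_score;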
If the functions Pig presents are either not sufficient or not performant enough, Pig also includes the idea that you can write your own functions. This is a more advanced topic; however, I have used this functionality in working with production customers, so I wanted to mention it here. The idea is that if you do not get adequate results or performance out of a built-in Pig function, you can write your own. If you're doing that, you're gonna need to write the function and register it, and I highly recommend you test the function under load. These user-defined functions can be written in Java or Python, and then they can be combined as part of your Pig workflow.
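Here's a hedged sketch of wiring a user-defined function into a Pig flow. The file 'myudfs.py' and the strip_punct() function are hypothetical; the UDF body itself lives in that separate Jython (Python) file.

-- register the hypothetical Jython UDF file under a namespace
REGISTER 'myudfs.py' USING jython AS myfuncs;
lines   = LOAD 'input.txt' AS (line:chararray);
cleaned = FOREACH lines GENERATE myfuncs.strip_punct(line) AS line;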
Run Pig Scripts
There are several different ways to run a Pig script, so let's go over them here. The first way is Script or Batch mode: you just run your script from the Hadoop shell. The second way is, and this is just a goofy name, Grunt or Interactive mode: from the Hadoop shell, you type the word pig, which starts the Pig shell. The third way is Embedded mode, and that's within Java. Now you might be wondering, why do we have these ways? When do you use them? Script and Grunt modes are used during the testing phase, and Embedded mode is very often used in production, because it's very common that you combine scripts and chain them together into a process that is repeatable and quite complex when you're doing processing of real-world data.
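A hedged sketch of the first two modes from the shell; the script and file names are placeholders, and Embedded mode (the Java PigServer API) isn't shown here.

# Script/batch mode: run a saved script
pig wordcount.pig

# Grunt/interactive mode: start the Pig shell and enter statements at the prompt
pig
grunt> lines = LOAD 'input.txt' AS (line:chararray);

# Local mode is handy for testing against the local file system instead of HDFS
pig -x local wordcount.pig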
Let's take a look at WordCount using Pig. So I'm going to just go through this simplified example so that we can compare and contrast this to WordCount in some of the other languages and libraries.
So first, we're using the concept of aliases. We have variables, namely lines, words, grouped and wordcount; those are aliases. And then we're setting them equal to values that are generated by Pig. So lines is set by calling the LOAD keyword. We're just loading some text, and we're loading it as a chararray, which is the character array data type in Pig. We need to load it as that type to be able to use it with the functions so that we can process it.
For the words variable, we're using the FOREACH keyword against the lines input, and then we're calling several functions. This, I think, is really quite elegant, and I want to contrast it with what I showed earlier in Hive, which, to me, was kind of unreadable. So we're going to GENERATE a new output: we TOKENIZE the line, which does a basic split of the text into words and puts them in a bag, and then we FLATTEN that bag, which I think is kind of beautiful. I'm a little bit of a nerd. Now, the TOKENIZE function separates on spaces, and that might not be sophisticated enough for your particular implementation, so you might have to write your own TOKENIZE, but for a lot of implementations of basic text processing, it's good enough. And then you get a word.
Next, we have the grouped variable, which is going to group the words by word. And then for wordcount, for each group in grouped, we're going to generate the group itself.
And we're going to count the words, and then we're going to dump out the results. Now are you starting to see the patterns? Are you starting to be able to see where the map would occur? And where the reduce would occur? For me, it's easier to think backwards. If I look at the wordcount variable, it's clear to me that's where the aggregate is going to be generated. So we're going to have these key value lists that are generated, and we're going to group them together, so that's going to be the reducer. So the words variable, and the grouped variable are where the mappers are going to be generated.
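For reference, here's a minimal sketch of the WordCount script being walked through; the input path is a placeholder.

lines     = LOAD 'input/some_text.txt' AS (line:chararray);
words     = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped   = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;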
Again, it's really critical, as you're working with Hadoop data, that you think in terms of MapReduce. That's why we took the time in the earlier modules to go down to that low level and write the MapReduce code. Even though you may not use it in production, and you may use Pig or Hive instead, as more and more customers do because they're simpler and more flexible, when you get into real production situations you need to be able to translate back down to mappers and reducers, so you can find out where bottlenecks occur and debug and fix them.