Pig's Data Model
Understanding Pig's data model means understanding its data types, how it handles concepts such as missing data, and how you can describe your data to Pig.
Types
Pig's data types can be divided into two categories: scalar types, which contain a single value, and complex types, which contain other types.
Scalar Types
Pig's scalar types are simple types that appear in most programming languages. With the exception of bytearray, they are all represented in Pig interfaces by java.lang classes, making them easy to work with in UDFs:
int
An integer. Ints are represented in interfaces by java.lang.Integer. They store a four-byte signed integer. Constant integers are expressed as integer numbers, for example, 24.
long
A long integer. Longs are represented in interfaces by java.lang.Long. They store an eight-byte signed integer. Constant longs are expressed as integer numbers with an L appended, for example, 5000000000L.
float
A floating-point number. Floats are represented in interfaces by java.lang.Float and use four bytes to store their value. Note that because this is a floating-point number, in some calculations it will lose precision. For calculations that require no loss of precision, you should use an int or long instead. Constant floats are expressed as a floating-point number with an f appended. Floating-point numbers can be expressed in simple format, 3.14f, or in exponent format, 6.022e23f.
double
A double-precision floating-point number. Doubles are represented in interfaces by java.lang.Double and use eight bytes to store their value. Note that because this is a floating-point number, in some calculations it will lose precision. For calculations that require no loss of precision, you should use an int or long instead. Constant doubles are expressed as a floating-point number in either simple format, 2.71828, or in exponent format, 6.626e-34.
chararray
A string or character array. Chararrays are represented in interfaces by java.lang.String. Constant chararrays are expressed as string literals with single quotes, for example, 'fred'. In addition to standard alphanumeric and symbolic characters, you can express certain characters in chararrays by using backslash codes, such as \t for Tab and \n for Return. Unicode characters can be expressed as \u followed by their four-digit hexadecimal Unicode value. For example, the value for Ctrl-A is expressed as \u0001.
bytearray
A blob or array of bytes. Bytearrays are represented in interfaces by a Java class DataByteArray that wraps a Java byte[]. There is no way to specify a constant bytearray.
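As a quick reference, the scalar constants above can all appear in a single Pig Latin expression. A minimal sketch; the input file and field name here are made up for illustration:

```pig
-- hypothetical input file and field name, for illustration only
a = load 'some_input' as (name:chararray);
b = foreach a generate
    24,             -- int constant
    5000000000L,    -- long constant
    3.14f,          -- float constant, simple format
    6.022e23f,      -- float constant, exponent format
    2.71828,        -- double constant
    'fred\t\u0001'; -- chararray with a Tab and a Ctrl-A
```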
Complex Types
Pig has three complex data types: maps, tuples, and bags. All of these types can contain data of any type, including other complex types. So it is possible to have a map where the value field is a bag, which contains a tuple where one of the fields is a map.
Map
A map in Pig is a chararray to data element mapping, where that element can be any Pig type, including a complex type. The chararray is called a key and is used as an index to find the element, referred to as the value.
Because Pig does not know the type of the value, it will assume it is a bytearray. However, the actual value might be something different. If you know what the actual type is (or what you want it to be), you can cast it; see "Casts". If you do not cast the value, Pig will make a best guess based on how you use the value in your script. If the value is of a type other than bytearray, Pig will figure that out at runtime and handle it. See "Schemas" for more information on how Pig handles unknown types.
By default there is no requirement that all values in a map must be of the same type. It is legitimate to have a map with two keys, name and age, where the value for name is a chararray and the value for age is an int. Beginning in Pig 0.9, a map can declare its values to all be of the same type. This is useful if you know all values in the map will be of the same type, as it allows you to avoid the casting, and Pig can avoid the runtime type-massaging referenced in the previous paragraph.
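As a sketch of this Pig 0.9 feature, the baseball data used later in this chapter could declare its bat map as having all-int values; declaring map[int] here assumes the loader really does produce ints:

```pig
-- sketch: a map whose values are declared to all be ints (Pig 0.9 and later)
player = load 'baseball' as (name:chararray, bat:map[int]);
-- no casts needed; the values are already known to be ints
walks = foreach player generate bat#'base_on_balls' - bat#'ibbs';
```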
Map constants are formed using brackets to delimit the map, a hash between keys and values, and a comma between key-value pairs. For example, ['name'#'bob', 'age'#55] will create a map with two keys, "name" and "age". The first value is a chararray, and the second is an integer.
Tuple
A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided into fields, with each field containing one data element. These elements can be of any type; they do not all need to be the same type. A tuple is analogous to a row in SQL, with the fields being SQL columns. Because tuples are ordered, it is possible to refer to the fields by position; see "Expressions in foreach" for details. A tuple can, but is not required to, have a schema associated with it that describes each field's type and provides a name for each field. This allows Pig to check that the data in the tuple is what the user expects, and it allows the user to reference the fields of the tuple by name.
Tuple constants use parentheses to indicate the tuple and commas to delimit fields in the tuple. For example, ('bob', 55) describes a tuple constant with two fields.
Bag
A bag is an unordered collection of tuples. Because it has no order, it is not possible to reference tuples in a bag by position. Like tuples, a bag can, but is not required to, have a schema associated with it. In the case of a bag, the schema describes all tuples within the bag.
Bag constants are constructed using braces, with tuples in the bag separated by commas. For example, {('bob', 55), ('sally', 52), ('john', 25)} constructs a bag with three tuples, each with two fields.
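All three complex constants can be generated in one statement. A minimal sketch, again with a made-up input:

```pig
-- hypothetical input, for illustration only
a = load 'some_input' as (x:int);
b = foreach a generate
    ['name'#'bob', 'age'#55],                   -- map constant
    ('bob', 55),                                -- tuple constant
    {('bob', 55), ('sally', 52), ('john', 25)}; -- bag constant
```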
Pig users often notice that Pig does not provide a list or set type that can store items of any type. It is possible to mimic a set type using the bag, by wrapping the desired type in a tuple of one field. For instance, if you want to store a set of integers, you can create a bag with a tuple with one field, which is an int. This is a bit cumbersome, but it works.
Bag is the one type in Pig that is not required to fit into memory. As you will see later, because bags are used to store collections when grouping, bags can become quite large. Pig has the ability to spill bags to disk when necessary, keeping only partial sections of the bag in memory. The size of the bag is limited to the amount of local disk available for spilling the bag.
Nulls
Pig includes the concept of a data element being null. Data of any type can be null. It is important to understand that in Pig the concept of null is the same as in SQL, which is completely different from the concept of null in C, Java, Python, etc. In Pig a null data element means the value is unknown. This might be because the data is missing, an error occurred in processing it, etc. In most procedural languages, a data value is said to be null when it is unset or does not point to a valid address or object. This difference in the concept of null is important and affects the way Pig treats null data, especially when operating on it. See "foreach", "Group", and "Join" for details of how nulls are handled in expressions and relations in Pig.
Unlike SQL, Pig does not have a notion of constraints on the data. In the context of nulls, this means that any data element can always be null. As you write Pig Latin scripts and UDFs, you will need to keep this in mind.
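For example, because any field can be null, a script that must distinguish missing values from real ones should test for null explicitly with is null / is not null. A sketch using the NYSE_dividends schema that appears later in this chapter:

```pig
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
    date:chararray, dividend:float);
-- a comparison against null is neither true nor false, so test explicitly
missing = filter divs by dividend is null;
present = filter divs by dividend is not null;
```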
Memory Requirements
In the previous sections I often referenced the size of the value stored for each type (four bytes for integer, eight bytes for long, etc.). This tells you how large (or small) a value those types can hold. However, this does not tell you how much memory is actually used by objects of those types. Because Pig uses Java objects to represent these values internally, there is an additional overhead. This overhead depends on your JVM, but it is usually eight bytes per object. It is even worse for chararrays because Java's String uses two bytes per character rather than one.
So, if you are trying to figure out how much memory you need in Pig to hold all of your data (e.g., if you are going to do a join that needs to hold a hash table in memory), do not count the bytes on disk and assume that is how much memory you need. The multiplication factor between disk and memory is dependent on your data, whether your data is compressed on disk, your disk storage format, etc. As a rule of thumb, it takes about four times as much memory as it does disk to represent the uncompressed data.
Schemas
Pig has a very lax attitude when it comes to schemas. This is a consequence of Pig's philosophy of eating anything. If a schema for the data is available, Pig will make use of it, both for up-front error checking and for optimization. But if no schema is available, Pig will still process the data, making the best guesses it can based on how the script treats the data. First, we will look at ways that you can communicate the schema to Pig; then, we will examine how Pig handles the case where you do not provide it with the schema.
The easiest way to communicate the schema of your data to Pig is to explicitly tell Pig what it is when you load the data:
dividends = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividend:float);
Pig now expects your data to have four fields. If it has more, it will truncate the extra ones. If it has less, it will pad the end of the record with nulls.
It is also possible to specify the schema without giving explicit data types. In this case, the data type is assumed to be bytearray:
dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
Also, when you declare a schema, you do not have to declare the schema of complex types, but you can if you want to. For example, if your data has a tuple in it, you can declare that field to be a tuple without specifying the fields it contains. You can also declare that field to be a tuple that has three columns, all of which are integers. Table 4-1 gives the details of how to specify each data type inside a schema declaration.
Data Type | Syntax | Example |
---|---|---|
int | int | as (a:int) |
long | long | as (a:long) |
float | float | as (a:float) |
double | double | as (a:double) |
chararray | chararray | as (a:chararray) |
bytearray | bytearray | as (a:bytearray) |
map | map[] or map[type], where type is any valid type. This declares all values in the map to be of this type. | as (a:map[], b:map[int]) |
tuple | tuple() or tuple(list_of_fields), where list_of_fields is a comma-separated list of field declarations. | as (a:tuple(), b:tuple(x:int, y:int)) |
bag | bag{} or bag{t:(list_of_fields)}, where list_of_fields is a comma-separated list of field declarations. Note that, oddly enough, the tuple inside the bag must have a name, here specified as t, even though you will never be able to access that tuple t directly. | as (a:bag{}, b:bag{t:(x:int, y:int)}) |
Table 4-1. Schema syntax
The runtime declaration of schemas is very nice. It makes it easy for users to operate on data without having to first load it into a metadata system. It also means that if you are interested in only the first few fields, you only have to declare those fields.
But for production systems that run over the same data every hour or every day, it has a couple of significant drawbacks. One, whenever your data changes, you have to change your Pig Latin. Two, although this works fine on data with 5 columns, it is painful when your data has 100 columns. To address these issues, there is another way to load schemas in Pig.
If the load function you are using already knows the schema of the data, the function can communicate that to Pig. (Load functions are how Pig reads data; see "Load" for details.) Load functions might already know the schema because it is stored in a metadata repository such as HCatalog, or it might be stored in the data itself (if, for example, the data is stored in JSON format). In this case, you do not have to declare the schema as part of the load statement. And you can still refer to fields by name because Pig will fetch the schema from the load function before doing error checking on your script:
mdata = load 'mydata' using HCatLoader();
cleansed = filter mdata by name is not null;
...
What if you specify a schema and the loader returns one? If they are identical, all is well. If they are not identical, Pig will determine whether it can adapt the one returned by the loader to match the one you gave. For example, if you specified a field as a long and the loader said it was an int, Pig can and will do that cast. However, if it cannot determine a way to make the loader's schema fit the one you gave, it will give an error. See "Casts" for a list of casts Pig can and cannot insert to make the schemas work together.
Now let's look at the case where neither you nor the load function tell Pig what the data's schema is. In addition to being referenced by name, fields can be referenced by position, starting from zero. The syntax is a dollar sign, then the position: $0 refers to the first field. So it is easy to tell Pig which field you want to work with. But how does Pig know the data type? It does not, so it starts by assuming everything is a bytearray. Then it looks at how you use those fields in your script, drawing conclusions about what you think those fields are and how you want to use them. Consider the following:
--no_schema.pig
daily = load 'NYSE_daily';
calcs = foreach daily generate $7 / 1000, $3 * 100.0, SUBSTRING($0, 0, 1), $6 - $3;
In the expression $7 / 1000, 1000 is an integer, so it is a safe guess that the eighth field of NYSE_daily is an integer or something that can be cast to an integer. In the same way, $3 * 100.0 indicates $3 is a double, and the use of $0 in a function that takes a chararray as an argument indicates the type of $0. But what about the last expression, $6 - $3? The - operator is used only with numeric types in Pig, so Pig can safely guess that $3 and $6 are numeric. But should it treat them as integers or floating-point numbers? Here Pig plays it safe and guesses that they are floating points, casting them to doubles. This is the safer bet because if they actually are integers, those can be represented as floating-point numbers, but the reverse is not true. However, because floating-point arithmetic is much slower and subject to loss of precision, if these values really are integers, you should cast them so that Pig uses integer types in this case.
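If you know these fields really are integers, the cast can be written explicitly. A sketch of a modified version of no_schema.pig (the variant filename is made up):

```pig
--no_schema_int_cast.pig (hypothetical variant)
daily = load 'NYSE_daily';
-- force integer rather than double arithmetic on $3 and $6
calcs = foreach daily generate (int)$6 - (int)$3;
```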
There are also cases where Pig cannot make any intelligent guess:
--no_schema_filter
daily = load 'NYSE_daily';
fltrd = filter daily by $6 > $3;
> is a valid operator on numeric, chararray, and bytearray types in Pig Latin. So, Pig has no way to make a guess. In this case, it treats these fields as if they were bytearrays, which means it will do a byte-to-byte comparison of the data in these fields.
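If a numeric comparison is what you intended, an explicit cast removes the ambiguity. A sketch (the variant filename is made up):

```pig
--no_schema_filter_cast.pig (hypothetical variant)
daily = load 'NYSE_daily';
-- compare as numbers instead of byte by byte
fltrd = filter daily by (double)$6 > (double)$3;
```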
Pig also has to handle the case where it guesses wrong and must adapt on the fly. Consider the following:
--unintended_walks.pig
player = load 'baseball' as (name:chararray, team:chararray,
    pos:bag{t:(p:chararray)}, bat:map[]);
unintended = foreach player generate bat#'base_on_balls' - bat#'ibbs';
Because the values in maps can be of any type, Pig has no idea what type bat#'base_on_balls' and bat#'ibbs' are. By the rules laid out previously, Pig will assume they are doubles. But let's say they actually turn out to be represented internally as integers. (That is not the case in the example data; for that to be the case, you would need to use a loader that loaded the bat map with these values as integers.) In that case, Pig will need to adapt at runtime and convert what it thought was a cast from bytearray to double into a cast from int to double. Note that it will still produce a double output and not an int output. This might seem nonintuitive; see "How Strongly Typed Is Pig?" for details on why this is so. It should be noted that in Pig 0.8 and earlier, much of this runtime adaption code was shaky and often failed. In 0.9, much of this has been fixed. But if you are using an older version of Pig, you might need to cast the data explicitly to get the right results.
Finally, Pig's knowledge of the schema can change at different points in the Pig Latin script. In all of the previous examples where we loaded data without a schema and then passed it to a foreach statement, the data started out without a schema. But after the foreach, the schema is known. Similarly, Pig can start out knowing the schema, but if the data is mingled with other data without a schema, the schema can be lost. That is, lack of schema is contagious:
--no_schema_join.pig
divs = load 'NYSE_dividends' as (exchange, stock_symbol, date, dividends);
daily = load 'NYSE_daily';
jnd = join divs by stock_symbol, daily by $1;
In this example, because Pig does not know the schema of daily, it cannot know the schema of the join of divs and daily.
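One way to avoid losing the schema is to declare one for daily as well. A sketch that borrows the NYSE_daily column names from the total_trade_estimate.pig example later in this chapter (the variant filename is made up):

```pig
--schema_join.pig (hypothetical variant)
divs = load 'NYSE_dividends' as (exchange, stock_symbol, date, dividends);
daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low,
    close, volume, adj_close);
-- both inputs now have schemas, so the join result has one too
jnd = join divs by stock_symbol, daily by symbol;
```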
Casts
The previous sections have referenced casts in Pig without bothering to define how casts work. The syntax for casts in Pig is the same as in Java: the type name in parentheses before the value:
--unintended_walks_cast.pig
player = load 'baseball' as (name:chararray, team:chararray,
    pos:bag{t:(p:chararray)}, bat:map[]);
unintended = foreach player generate (int)bat#'base_on_balls' - (int)bat#'ibbs';
The syntax for specifying types in casts is exactly the same as specifying them in schemas, as shown previously in Table 4-1.
Not all conceivable casts are allowed in Pig. Table 4-2 describes which casts are allowed between scalar types. Casts to bytearrays are never allowed because Pig does not know how to represent the various data types in binary format. Casts from bytearrays to any type are allowed. Casts to and from complex types currently are not allowed, except from bytearray, although conceptually in some cases they could be.
 | To int | To long | To float | To double | To chararray |
---|---|---|---|---|---|
From int | | Yes. | Yes. | Yes. | Yes. |
Table 4-2. Supported casts
One type of casting that requires special treatment is casting from bytearray to other types. Because bytearray indicates a string of bytes, Pig does not know how to convert its contents to any other type. Continuing the previous example, both bat#'base_on_balls' and bat#'ibbs' were loaded as bytearrays. The casts in the script indicate that you want them treated as ints.
Pig does not know whether integer values in baseball are stored as ASCII strings, Java serialized values, binary-coded decimal, or some other format. So it asks the load function, because it is that function's responsibility to cast bytearrays to other types. In general this works nicely, but it does lead to a few corner cases where Pig does not know how to cast a bytearray. In particular, if a UDF returns a bytearray, Pig will not know how to perform casts on it because that bytearray was not generated by a load function.
Before leaving the topic of casts, we need to consider cases where Pig inserts casts for the user. These casts are implicit, compared to explicit casts where the user indicates the cast. Consider the following:
--total_trade_estimate.pig
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,
    date:chararray, open:float, high:float, low:float, close:float,
    volume:int, adj_close:float);
rough = foreach daily generate volume * close;
In this case, Pig will change the second line to (float)volume * close to do the operation without losing precision. In general, Pig will always widen types to fit when it needs to insert these implicit casts. So, int and long together will result in a long; int or long and float will result in a float; and int, long, or float and double will result in a double. There are no implicit casts between numeric types and chararrays or other types.
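Writing the cast yourself is equivalent and makes the widening visible in the script. A sketch:

```pig
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,
    date:chararray, open:float, high:float, low:float, close:float,
    volume:int, adj_close:float);
-- the same widening cast Pig would insert implicitly
rough = foreach daily generate (float)volume * close;
```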
How Strongly Typed Is Pig?
In a strongly typed computer language (e.g., Java), the user must declare up front the type for all variables. In weakly typed languages (e.g., Perl), variables can take on values of different type and adapt as the occasion demands. So which is Pig? For the most part it is strongly typed. If you describe the schema of your data, Pig expects your data to be what you said. But when Pig does not know the schema, it will adapt to the actual types at runtime. (Perhaps we should say Pig is "gently typed." It is strong but willing to work with data that does not live up to its expectations.) To see the differences between these two cases, look again at this example:
--unintended_walks.pig
player = load 'baseball' as (name:chararray, team:chararray,
    pos:bag{t:(p:chararray)}, bat:map[]);
unintended = foreach player generate bat#'base_on_balls' - bat#'ibbs';
In this example, remember we are pretending that the values for base_on_balls and ibbs turn out to be represented as integers internally (that is, the load function constructed them as integers). If Pig were weakly typed, the output of unintended would be records with one field typed as an integer. As it is, Pig will output records with one field typed as a double. Pig will make a guess and then do its best to massage the data into the types it guessed.
The downside here is that users coming from weakly typed languages are surprised, and perhaps frustrated, when their data comes out as a type they did not anticipate. However, on the upside, by looking at a Pig Latin script it is possible to know what the output data type will be in these cases without knowing the input data.