Package org.apache.hadoop.hbase.constraint
Restrict the domain of a data attribute, often to fulfill business rules or requirements.
Note that all exceptions that you expect to be thrown must be caught and then rethrown as a ConstraintException.
Once we have added the IntegerConstraint, constraints will be enabled on the table (once it is
created) and we will always check to make sure that the value is a String-encoded integer.
However, suppose we also write our own constraint, MyConstraint.java.
At this point we have added both the IntegerConstraint and MyConstraint to the table; the
IntegerConstraint will be run first, followed by MyConstraint.
Suppose we realize that the Configuration for MyConstraint was actually wrong when it was added to the table.
This will overwrite the previous configuration for MyConstraint, but will not change the order
of the constraint or whether it is enabled or disabled.
Note that the same constraint class can be added multiple times to a table without repercussion.
A use case for this is the same constraint working differently based on its configuration.
Suppose then we want to disable just MyConstraint. It's as easy as:
This just turns off MyConstraint, but retains the position and the configuration associated with
MyConstraint. Now, if we want to re-enable the constraint, it's just another one-liner:
Similarly, constraints on the entire table are disabled via:
Or enabled via:
Lastly, suppose you want to remove MyConstraint from the table, including the position at which
it should be run and its configuration. This is similarly simple:
Also, removing all constraints from a table is similarly simple:
This will remove all constraints (and associated information) from the table
and turn off the constraint processing.
NOTE
It is important to note the use above of new Configuration(false) rather than new Configuration().
Overview
Constraints are used to enforce business rules in a database. By checking all Puts on a given
table, you can enforce very specific data policies. For instance, you can ensure that a certain
column family-column qualifier pair always has a value between 1 and 10. Otherwise, the Put is
rejected and the data integrity is maintained.
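As a rough illustration of the kind of check described above, the following sketch validates that a value is a String-encoded integer between 1 and 10. It is deliberately independent of HBase: the class name RangeCheck and the method inRange are hypothetical, and in a real constraint this logic would sit inside the check method and signal failure by throwing a ConstraintException rather than returning false.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HBase: shows only the validation logic
// that a range-checking constraint might apply to each cell value.
public class RangeCheck {
    // Returns true if the bytes decode to an integer in [min, max].
    public static boolean inRange(byte[] value, int min, int max) {
        try {
            int parsed = Integer.parseInt(new String(value, StandardCharsets.UTF_8));
            return parsed >= min && parsed <= max;
        } catch (NumberFormatException e) {
            // Not a String-encoded integer at all, so the value is rejected.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] accepted = "7".getBytes(StandardCharsets.UTF_8);
        byte[] rejected = "11".getBytes(StandardCharsets.UTF_8);
        System.out.println(inRange(accepted, 1, 10));
        System.out.println(inRange(rejected, 1, 10));
    }
}
```

Keeping the check a pure function of the value, with no shared state, also fits the advice to rely only on local variables inside a constraint.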
Constraints are designed to be configurable, so a constraint can be used across different tables,
but implement different behavior depending on the specific configuration given to that
constraint.
By adding a constraint to a table (see Example Usage), constraints will automatically be enabled.
You then also have the option to disable (just 'turn off') or remove (delete all associated
information) all constraints on a table. If you remove all constraints (see
Constraints.remove(org.apache.hadoop.hbase.client.TableDescriptorBuilder)), you must re-add any
Constraint you want on that table. However, if they are just disabled (see
Constraints.disable(org.apache.hadoop.hbase.client.TableDescriptorBuilder)), all you need to do
is enable constraints again, and everything will be turned back on as it was configured.
Individual constraints can also be enabled, disabled or removed without affecting other
constraints.
By default, constraints are disabled on a table. This means you will not see any slowdown
on a table if constraints are not enabled.
Concurrency and Atomicity
Currently, no attempt is made to enforce correctness in a multi-threaded scenario when modifying
a constraint, via Constraints, on the TableDescriptorBuilder. This is particularly important when
adding one or more constraints to the TableDescriptorBuilder, as it first retrieves the next
priority from a custom value set in the descriptor, adds each constraint (with increasing
priority) to the descriptor, and then stores the next available priority back in the
TableDescriptorBuilder.
Locking is recommended around each of the Constraints add methods:
Constraints.add(org.apache.hadoop.hbase.client.TableDescriptorBuilder, Class...),
Constraints.add(org.apache.hadoop.hbase.client.TableDescriptorBuilder, org.apache.hadoop.hbase.util.Pair...),
and
Constraints.add(org.apache.hadoop.hbase.client.TableDescriptorBuilder, Class, org.apache.hadoop.conf.Configuration).
Any changes on a single TableDescriptor should be serialized, either within a single
thread or via external mechanisms.
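The read, add, write-back sequence described above can be sketched with a plain map standing in for the descriptor. This is only an illustration under stated assumptions: the class SerializedAdds, the key name next_priority, and the use of constraint names as map keys are hypothetical and are not how HBase stores this state. The point is just that the whole sequence sits inside one lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a table descriptor: a plain map of values.
// It mimics the read-priority / add / write-back sequence to show why
// concurrent callers need external locking.
public class SerializedAdds {
    private static final String NEXT_PRIORITY_KEY = "next_priority"; // illustrative key name
    private final Map<String, String> descriptorValues = new HashMap<>();
    private final ReentrantLock lock = new ReentrantLock();

    // Adds each constraint name with an increasing priority, under a lock.
    public void add(String... constraintNames) {
        lock.lock();
        try {
            long priority = Long.parseLong(descriptorValues.getOrDefault(NEXT_PRIORITY_KEY, "0"));
            for (String name : constraintNames) {
                descriptorValues.put(name, Long.toString(priority++));
            }
            // Store the next available priority back for the following caller.
            descriptorValues.put(NEXT_PRIORITY_KEY, Long.toString(priority));
        } finally {
            lock.unlock();
        }
    }

    // Reads the next available priority, also under the lock.
    public long nextPriority() {
        lock.lock();
        try {
            return Long.parseLong(descriptorValues.getOrDefault(NEXT_PRIORITY_KEY, "0"));
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SerializedAdds descriptor = new SerializedAdds();
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            final int id = i;
            workers[i] = new Thread(() -> descriptor.add("ConstraintA" + id, "ConstraintB" + id));
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        // With the lock held across the whole sequence, no priority is handed out twice.
        System.out.println(descriptor.nextPriority());
    }
}
```

Without the lock, two threads could read the same next priority before either writes it back, and two constraints would end up sharing a priority.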
Note that having a higher priority value means that a constraint will run later; e.g. a
constraint with priority 1 will run before a constraint with priority 2.
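To make the ordering rule concrete, here is a minimal sketch that sorts constraints by ascending priority value. The PriorityOrder and Entry classes are hypothetical and only model the rule stated above; they are not part of the HBase API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of the ordering rule: lower priority value runs first.
public class PriorityOrder {
    // Illustrative pairing of a constraint name with its priority value.
    public static final class Entry {
        final String name;
        final long priority;

        public Entry(String name, long priority) {
            this.name = name;
            this.priority = priority;
        }
    }

    // Returns constraint names in the order they would run.
    public static List<String> runOrder(List<Entry> entries) {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingLong(e -> e.priority));
        List<String> names = new ArrayList<>();
        for (Entry entry : sorted) {
            names.add(entry.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
            new Entry("MyConstraint", 2),
            new Entry("IntegerConstraint", 1));
        // IntegerConstraint has the lower priority value, so it is listed first.
        System.out.println(runOrder(entries));
    }
}
```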
Since Constraints currently are designed to just implement simple checks (e.g. is the value in
the right range), there will be no atomicity conflicts. Even if one of the puts finishes the
constraint first, the single row will not be corrupted and the 'fastest' write will win; the
underlying region takes care of breaking the tie and ensuring that writes get serialized to the
table. So yes, this doesn't ensure that we are going to get specific ordering or even a fully
consistent view of the underlying data.
Each constraint should only use local/instance variables, unless doing more advanced usage.
Static variables could cause difficulties when checking concurrent writes to the same region,
leading to either heavily locked situations (decreasing throughput) or a higher probability of
errors. However, as long as each constraint just uses local variables, each thread interacting
with the constraint will execute correctly and efficiently.
Caveats
In traditional (SQL) databases, Constraints are often used to enforce referential integrity.
However, in HBase, this will likely cause significant overhead and dramatically decrease the
number of Puts/second possible on a table. This is because to check the referential integrity
when making a Put, one must block on a scan for the 'remote' table, checking for the valid
reference. For millions of Puts a second, this will break down very quickly. There are several
options around the blocking behavior including, but not limited to:
- Create a 'pre-join' table where the keys are already denormalized
- Designing for 'incorrect' references
- Using an external enforcement mechanism
There are also several general considerations to take into account when using Constraints:
- All changes made via Constraints will make modifications to the TableDescriptor for a given table. As such, the usual re-enabling of tables should be used for propagating changes to the table. When at all possible, Constraints should be added to the table before the table is created.
- Constraints are run in the order that they are added to a table. This has implications for what order constraints should be added to a table.
- Whenever new Constraint jars are added to a region server, those region servers need to go through a rolling restart to make sure that they pick up the new jars and can enable the new constraints.
- There are certain keys that are reserved for the Configuration namespace:
  - _ENABLED - used server-side to determine if a constraint should be run
  - _PRIORITY - used server-side to determine what order a constraint should be run
  These keys are taken care of by default when adding constraints to a TableDescriptorBuilder via the usual method.
Constraints are implemented as a coprocessor (see ConstraintProcessor if you are interested).
Example usage
First, you must define a Constraint. The best way to do this is to extend BaseConstraint, which
takes care of some of the more mundane details of using a Constraint.
Let's look at one possible implementation of a constraint - an IntegerConstraint (there are also
several simple examples in the tests). The IntegerConstraint checks to make sure that the value
is a String-encoded int. It is really simple to implement this kind of constraint; the only
method that needs to be implemented is Constraint.check(org.apache.hadoop.hbase.client.Put):
import java.util.List;
import java.util.Map;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class IntegerConstraint extends BaseConstraint {
  @Override
  public void check(Put p) throws ConstraintException {
    Map<byte[], List<Cell>> familyMap = p.getFamilyCellMap();
    for (List<Cell> cells : familyMap.values()) {
      for (Cell cell : cells) {
        // Just make sure that we can actually pull out an int.
        // Integer.parseInt throws a NumberFormatException if we try to
        // store something that isn't a String-encoded integer.
        try {
          Integer.parseInt(Bytes.toString(CellUtil.cloneValue(cell)));
        } catch (NumberFormatException e) {
          throw new ConstraintException("Value in Put (" + p
              + ") was not a String-encoded integer", e);
        }
      }
    }
  }
}
Rethrowing expected failures as a ConstraintException means you can be sure that a Put fails for
an expected reason, rather than for any reason. For example, an OutOfMemoryError is probably
indicative of an inherent problem in the Constraint, rather than a failed Put.
If an unexpected exception is thrown (for example, any kind of uncaught RuntimeException),
constraint-checking will be 'unloaded' from the regionserver where that error occurred. This
means no further Constraints will be checked on that server until it is reloaded. This is done
to ensure the system remains as available as possible. Therefore, be careful when writing your
own Constraint.
So now that we have a Constraint, we want to add it to a table. It's as easy as:
TableDescriptorBuilder builder = TableDescriptorBuilder.newBuilder(TABLE_NAME);
...
Constraints.add(builder, IntegerConstraint.class);
For a custom constraint such as MyConstraint.java, you first need to make sure that its class
files are on the classpath (in a jar) on the regionserver where that constraint will be run
(this could require a rolling restart of the region server - see Caveats above).
Suppose that MyConstraint also uses a Configuration (see Configurable.getConf()). Then adding
MyConstraint looks like this:
TableDescriptorBuilder builder = TableDescriptorBuilder.newBuilder(TABLE_NAME);
Configuration conf = new Configuration(false);
... (add values to the conf) (modify the table descriptor) ...
Constraints.add(builder, new Pair<>(MyConstraint.class, conf));
When a Configuration is added to the table, it is not added by reference, but is instead copied
into the TableDescriptor. Thus, to change the Configuration we are using for MyConstraint (for
example, because it was wrong when it was first added), we need to do this:
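(add/modify the conf) ...
Constraints.setConfiguration(desc, MyConstraint.class, conf);
// Disable just MyConstraint, keeping its position and configuration
Constraints.disable(desc, MyConstraint.class);
// Re-enable MyConstraint
Constraints.enable(desc, MyConstraint.class);
// Disable all constraints on the table
Constraints.disable(desc);
// Enable all constraints on the table
Constraints.enable(desc);
// Remove MyConstraint, including its position and configuration
Constraints.remove(desc, MyConstraint.class);
// Remove all constraints and turn off constraint processing
Constraints.remove(desc);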
Configuration conf = new Configuration(false);
If you just use new Configuration(), then the Configuration will be loaded with the default
properties. While in the simple case this is not going to be an issue, it will cause pain down
the road. First, these extra properties are going to cause serious bloat in your
TableDescriptor, meaning you are keeping around a ton of redundant information. Second, it is
going to make examining your table in the shell, via describe 'table', a huge pain, as you will
have to dig through a ton of irrelevant config values to find the ones you set. In short, just
do it the right way.
Class - Description
- BaseConstraint - Base class to use when actually implementing a Constraint.
- Constraint - Apply a Constraint (in traditional database terminology) to a Table.
- ConstraintException - Exception that a user defined constraint throws on failure of a Put.
- ConstraintProcessor - Processes multiple Constraints on a given table.
- Constraints - Utilities for adding/removing constraints from a table.