Class Scan

All Implemented Interfaces:
Attributes
Direct Known Subclasses:
ImmutableScan, InternalScan

@Public public class Scan extends Query
Used to perform Scan operations.

All operations are identical to Get with the exception of instantiation. Rather than specifying a single row, an optional startRow and stopRow may be defined. If rows are not specified, the Scanner will iterate over all rows.

To get all columns from all rows of a Table, create an instance with no constraints; use the Scan() constructor. To constrain the scan to specific column families, call addFamily for each family to retrieve on your Scan instance.

To get specific columns, call addColumn for each column to retrieve.

To only retrieve columns within a specific range of version timestamps, call setTimeRange.

To only retrieve columns with a specific timestamp, call setTimestamp.

To limit the number of versions of each column to be returned, call readVersions(int).

To limit the maximum number of values returned for each call to next(), call setBatch.

To add a filter, call setFilter.

Small scans are deprecated as of 2.0.0. Instead, use setLimit(int) to tell the RegionServer (RS) how many rows you want; once the number of returned rows reaches the limit, the RS closes the RegionScanner automatically. The new implementation also fetches data when the scanner is opened, so a scan operation can finish in a single RPC call. In addition, setReadType(ReadType) lets you tell the RS explicitly to use pread.

Expert: To explicitly disable server-side block caching for this scan, execute setCacheBlocks(boolean).

Note: Usage alters Scan instances. Internally, attributes are updated as the Scan runs and, if enabled, metrics accumulate in the Scan instance. Keep this in mind when you clone or reuse a Scan instance; it is safer to create a new Scan instance per use.

  • Field Details

  • Constructor Details

    • Scan

      public Scan()
      Create a Scan operation across all rows.
    • Scan

      public Scan(Scan scan) throws IOException
      Creates a new instance of this class while copying all values.
      Parameters:
      scan - The scan instance to copy from.
      Throws:
      IOException - When copying the values fails.
    • Scan

      public Scan(Get get)
      Builds a scan object with the same specs as get.
      Parameters:
      get - get to model scan after
  • Method Details

    • isGetScan

      public boolean isGetScan()
    • addFamily

      public Scan addFamily(byte[] family)
      Get all columns from the specified family.

      Overrides previous calls to addColumn for this family.

      Parameters:
      family - family name
    • addColumn

      public Scan addColumn(byte[] family, byte[] qualifier)
      Get the column from the specified family with the specified qualifier.

      Overrides previous calls to addFamily for this family.

      Parameters:
      family - family name
      qualifier - column qualifier
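The way addFamily and addColumn override each other can be pictured with a small model of the family map. The sketch below is an illustration only, not the HBase implementation: the class `FamilyMapSketch`, its `selectedColumns` helper, and the use of a null qualifier set to mean "all columns" are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model of a family -> qualifiers map; not the real Scan class.
final class FamilyMapSketch {
    // byte[] uses identity equality, so sorted collections need an explicit comparator.
    private static final Comparator<byte[]> BYTES = Arrays::compareUnsigned;

    private final Map<byte[], NavigableSet<byte[]>> familyMap = new TreeMap<>(BYTES);

    // Select the whole family; qualifiers chosen earlier for it are discarded.
    FamilyMapSketch addFamily(byte[] family) {
        familyMap.put(family, null); // assumption: null stands for "all columns"
        return this;
    }

    // Select one column; a previous whole-family selection is replaced by an explicit set.
    FamilyMapSketch addColumn(byte[] family, byte[] qualifier) {
        NavigableSet<byte[]> qualifiers = familyMap.get(family);
        if (qualifiers == null) {
            qualifiers = new TreeSet<>(BYTES);
            familyMap.put(family, qualifiers);
        }
        qualifiers.add(qualifier);
        return this;
    }

    // Returns 0 if the family is not selected, -1 if the whole family is selected,
    // otherwise the number of explicitly selected qualifiers.
    int selectedColumns(byte[] family) {
        if (!familyMap.containsKey(family)) {
            return 0;
        }
        NavigableSet<byte[]> qualifiers = familyMap.get(family);
        return qualifiers == null ? -1 : qualifiers.size();
    }

    public static void main(String[] args) {
        byte[] cf = {'c', 'f'};
        FamilyMapSketch sketch = new FamilyMapSketch();
        sketch.addColumn(cf, new byte[] {'a'}).addColumn(cf, new byte[] {'b'});
        System.out.println(sketch.selectedColumns(cf));
        sketch.addFamily(cf); // overrides the two addColumn calls
        System.out.println(sketch.selectedColumns(cf));
    }
}
```

In this model, whichever of the two methods is called last for a family decides whether the whole family or only specific columns are selected.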
    • setTimeRange

      public Scan setTimeRange(long minStamp, long maxStamp) throws IOException
      Get versions of columns only within the specified timestamp range, [minStamp, maxStamp). Note, default maximum versions to return is 1. If your time range spans more than one version and you want all versions returned, up the number of versions beyond the default.
      Parameters:
      minStamp - minimum timestamp value, inclusive
      maxStamp - maximum timestamp value, exclusive
      Throws:
      IOException
    • setTimestamp

      public Scan setTimestamp(long timestamp)
      Get versions of columns with the specified timestamp. Note, default maximum versions to return is 1. If your time range spans more than one version and you want all versions returned, up the number of versions beyond the default.
      Parameters:
      timestamp - version timestamp
    • setColumnFamilyTimeRange

      public Scan setColumnFamilyTimeRange(byte[] cf, long minStamp, long maxStamp)
      Description copied from class: Query
      Get versions of columns only within the specified timestamp range, [minStamp, maxStamp), on a per-CF basis. Note, default maximum versions to return is 1. If your time range spans more than one version and you want all versions returned, up the number of versions beyond the default. Column Family time ranges take precedence over the global time range.
      Overrides:
      setColumnFamilyTimeRange in class Query
      Parameters:
      cf - the column family for which you want to restrict
      minStamp - minimum timestamp value, inclusive
      maxStamp - maximum timestamp value, exclusive
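The interaction between the half-open range [minStamp, maxStamp) and the maximum number of versions can be illustrated with a small model. This is a sketch under stated assumptions, not HBase code: `TimeRangeSketch` and `visibleVersions` are hypothetical names, and the list of timestamps stands in for the stored versions of a single column, returned newest first.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical model of time-range and max-versions selection for one column.
final class TimeRangeSketch {
    // True when ts lies in the half-open interval [minStamp, maxStamp).
    static boolean inRange(long minStamp, long maxStamp, long ts) {
        return ts >= minStamp && ts < maxStamp;
    }

    // Keeps the versions inside the range, newest first, capped at maxVersions.
    static List<Long> visibleVersions(List<Long> timestamps, long minStamp,
                                      long maxStamp, int maxVersions) {
        return timestamps.stream()
            .filter(ts -> inRange(minStamp, maxStamp, ts))
            .sorted(Comparator.reverseOrder())
            .limit(maxVersions)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Long> stored = List.of(10L, 20L, 30L);
        // With the default of one version, only the newest match is kept.
        System.out.println(visibleVersions(stored, 10L, 30L, 1));
        // Raising the version count returns every match; 30 is excluded by the range.
        System.out.println(visibleVersions(stored, 10L, 30L, 5));
    }
}
```

Note that the upper bound is exclusive, so a cell whose timestamp equals maxStamp is not selected.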
    • withStartRow

      public Scan withStartRow(byte[] startRow)
      Set the start row of the scan.

      If the specified row does not exist, the Scanner will start from the next closest row after the specified row.

      Note: Do NOT use this in combination with setRowPrefixFilter(byte[]) or setStartStopRowForPrefixScan(byte[]). Doing so will make the scan result unexpected or even undefined.

      Parameters:
      startRow - row to start scanner at or after
      Throws:
      IllegalArgumentException - if startRow does not meet criteria for a row key (when length exceeds HConstants.MAX_ROW_LENGTH)
    • withStartRow

      public Scan withStartRow(byte[] startRow, boolean inclusive)
      Set the start row of the scan.

      If the specified row does not exist, or inclusive is false, the Scanner will start from the next closest row after the specified row.

      Note: Do NOT use this in combination with setRowPrefixFilter(byte[]) or setStartStopRowForPrefixScan(byte[]). Doing so will make the scan result unexpected or even undefined.

      Parameters:
      startRow - row to start scanner at or after
      inclusive - whether to include the start row in the scan
      Throws:
      IllegalArgumentException - if startRow does not meet criteria for a row key (when length exceeds HConstants.MAX_ROW_LENGTH)
    • withStopRow

      public Scan withStopRow(byte[] stopRow)
      Set the stop row of the scan.

      The scan will include rows that are lexicographically less than the provided stopRow.

      Note: Do NOT use this in combination with setRowPrefixFilter(byte[]) or setStartStopRowForPrefixScan(byte[]). Doing so will make the scan result unexpected or even undefined.

      Parameters:
      stopRow - row to end at (exclusive)
      Throws:
      IllegalArgumentException - if stopRow does not meet criteria for a row key (when length exceeds HConstants.MAX_ROW_LENGTH)
    • withStopRow

      public Scan withStopRow(byte[] stopRow, boolean inclusive)
      Set the stop row of the scan.

      The scan will include rows that are lexicographically less than (or equal to if inclusive is true) the provided stopRow.

      Note: Do NOT use this in combination with setRowPrefixFilter(byte[]) or setStartStopRowForPrefixScan(byte[]). Doing so will make the scan result unexpected or even undefined.

      Parameters:
      stopRow - row to end at
      inclusive - whether to include the stop row in the scan
      Throws:
      IllegalArgumentException - if stopRow does not meet criteria for a row key (when length exceeds HConstants.MAX_ROW_LENGTH)
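The start and stop row rules above amount to a lexicographic range check on row keys. The following is a minimal sketch for a forward scan, assuming row keys compare as unsigned bytes and that an empty array means "no bound"; `RowRangeSketch` and `inRange` are hypothetical names, not part of the API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical model of the row-range test for a forward scan.
final class RowRangeSketch {
    static boolean inRange(byte[] row, byte[] startRow, boolean includeStart,
                           byte[] stopRow, boolean includeStop) {
        if (startRow.length > 0) { // assumption: empty start row means unbounded
            int cmp = Arrays.compareUnsigned(row, startRow);
            if (cmp < 0 || (cmp == 0 && !includeStart)) {
                return false;
            }
        }
        if (stopRow.length > 0) { // assumption: empty stop row means unbounded
            int cmp = Arrays.compareUnsigned(row, stopRow);
            if (cmp > 0 || (cmp == 0 && !includeStop)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] start = "row1".getBytes(StandardCharsets.UTF_8);
        byte[] stop = "row3".getBytes(StandardCharsets.UTF_8);
        // Inclusive start, exclusive stop: the stop row itself is not in range.
        System.out.println(inRange(stop, start, true, stop, false));
        // With an inclusive stop, the stop row is in range.
        System.out.println(inRange(stop, start, true, stop, true));
    }
}
```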
    • setRowPrefixFilter

      @Deprecated public Scan setRowPrefixFilter(byte[] rowPrefix)
      Deprecated.
      Since 2.5.0, will be removed in 4.0.0. The name of this method is considered confusing because it does not use a Filter but sets the startRow and stopRow instead. Use setStartStopRowForPrefixScan(byte[]) instead.

      Set a filter (using stopRow and startRow) so the result set only contains rows where the rowKey starts with the specified prefix.

      This is a utility method that converts the desired rowPrefix into the appropriate values for the startRow and stopRow to achieve the desired result.

      This can safely be used in combination with setFilter.

      This CANNOT be used in combination with withStartRow and/or withStopRow. Such a combination will yield unexpected and even undefined results.

      Parameters:
      rowPrefix - the prefix all rows must start with. (Set null to remove the filter.)
    • setStartStopRowForPrefixScan

      public Scan setStartStopRowForPrefixScan(byte[] rowPrefix)

      Set a filter (using stopRow and startRow) so the result set only contains rows where the rowKey starts with the specified prefix.

      This is a utility method that converts the desired rowPrefix into the appropriate values for the startRow and stopRow to achieve the desired result.

      This can safely be used in combination with setFilter.

      This CANNOT be used in combination with withStartRow and/or withStopRow. Such a combination will yield unexpected and even undefined results.

      Parameters:
      rowPrefix - the prefix all rows must start with. (Set null to remove the filter.)
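One common way to turn a row prefix into a start and stop row is to use the prefix itself as the inclusive start row and the "next" key after all keys with that prefix as the exclusive stop row. The sketch below illustrates that idea only; `PrefixStopRowSketch` and `stopRowForPrefix` are hypothetical names, and returning an empty array to mean "scan to the end of the table" is an assumption of the example.

```java
import java.util.Arrays;

// Hypothetical helper that derives an exclusive stop row from a row prefix.
final class PrefixStopRowSketch {
    static byte[] stopRowForPrefix(byte[] prefix) {
        // Skip trailing 0xFF bytes: they cannot be incremented without overflow.
        int length = prefix.length;
        while (length > 0 && prefix[length - 1] == (byte) 0xFF) {
            length--;
        }
        if (length == 0) {
            // Empty or all-0xFF prefix: no finite upper bound exists.
            return new byte[0];
        }
        // Truncate after the last incrementable byte and add one to it.
        byte[] stopRow = Arrays.copyOf(prefix, length);
        stopRow[length - 1]++;
        return stopRow;
    }

    public static void main(String[] args) {
        // Rows starting with "ab" are exactly those in ["ab", "ac").
        System.out.println(Arrays.toString(stopRowForPrefix(new byte[] {'a', 'b'})));
        // A trailing 0xFF is dropped and the preceding byte is incremented.
        System.out.println(Arrays.toString(stopRowForPrefix(new byte[] {1, (byte) 0xFF})));
    }
}
```

Because the derived bounds occupy the same startRow and stopRow settings that withStartRow and withStopRow control, mixing the two approaches leaves the effective range dependent on call order.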
    • readAllVersions

      Get all available versions.
    • readVersions

      public Scan readVersions(int versions)
      Get up to the specified number of versions of each column.
      Parameters:
      versions - specified number of versions for each column
    • setBatch

      public Scan setBatch(int batch)
      Set the maximum number of cells to return for each call to next(). Callers should be aware that this is not equivalent to calling setAllowPartialResults(boolean). If you don't allow partial results, the number of cells in each Result must equal your batch setting unless it is the last Result for the current row. This makes the method helpful for paging queries. If you just want to prevent OOM at the client, using setAllowPartialResults(true) is better.
      Parameters:
      batch - the maximum number of values
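The batching rule described above (every Result for a row holds exactly `batch` cells, except possibly the last one) can be sketched as a simple chunking of one row's cells. This is an illustrative model, not HBase code: `BatchSketch` and `splitRow` are hypothetical names, and the generic list elements stand in for cells.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of how one wide row is delivered in batches.
final class BatchSketch {
    // Splits the cells of a single row into chunks of at most `batch` cells.
    static <T> List<List<T>> splitRow(List<T> cells, int batch) {
        if (batch <= 0) {
            throw new IllegalArgumentException("batch must be positive");
        }
        List<List<T>> results = new ArrayList<>();
        for (int from = 0; from < cells.size(); from += batch) {
            int to = Math.min(from + batch, cells.size());
            results.add(new ArrayList<>(cells.subList(from, to)));
        }
        return results;
    }

    public static void main(String[] args) {
        // Five cells with a batch of two: two full chunks and a final chunk of one.
        System.out.println(splitRow(List.of("c1", "c2", "c3", "c4", "c5"), 2));
    }
}
```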
    • setMaxResultsPerColumnFamily

      public Scan setMaxResultsPerColumnFamily(int limit)
      Set the maximum number of values to return per row per Column Family
      Parameters:
      limit - the maximum number of values returned / row / CF
    • setRowOffsetPerColumnFamily

      public Scan setRowOffsetPerColumnFamily(int offset)
      Set offset for the row per Column Family.
      Parameters:
      offset - is the number of kvs that will be skipped.
    • setCaching

      public Scan setCaching(int caching)
      Set the number of rows for caching that will be passed to scanners. If not set, the Configuration setting HConstants.HBASE_CLIENT_SCANNER_CACHING will apply. Higher caching values will enable faster scanners but will use more memory.
      Parameters:
      caching - the number of rows for caching
    • getMaxResultSize

      public long getMaxResultSize()
      Returns the maximum result size in bytes. See setMaxResultSize(long)
    • setMaxResultSize

      public Scan setMaxResultSize(long maxResultSize)
      Set the maximum result size. The default is -1; this means that no specific maximum result size will be set for this scan, and the global configured value will be used instead. (Defaults to unlimited).
      Parameters:
      maxResultSize - The maximum result size in bytes.
    • setFilter

      public Scan setFilter(Filter filter)
      Description copied from class: Query
      Apply the specified server-side filter when performing the Query. Only Filter.filterCell(org.apache.hadoop.hbase.Cell) is called AFTER all tests for ttl, column match, deletes and column family's max versions have been run.
      Overrides:
      setFilter in class Query
      Parameters:
      filter - filter to run on the server
      Returns:
      this for invocation chaining
    • setFamilyMap

      public Scan setFamilyMap(Map<byte[],NavigableSet<byte[]>> familyMap)
      Sets the familyMap
      Parameters:
      familyMap - map of family to qualifier
    • getFamilyMap

      public Map<byte[],NavigableSet<byte[]>> getFamilyMap()
      Returns the familyMap
    • numFamilies

      public int numFamilies()
      Returns the number of families in familyMap
    • hasFamilies

      public boolean hasFamilies()
      Returns true if familyMap is non empty, false otherwise
    • getFamilies

      public byte[][] getFamilies()
      Returns the keys of the familyMap
    • getStartRow

      public byte[] getStartRow()
      Returns the startrow
    • includeStartRow

      public boolean includeStartRow()
      Returns whether the start row is included in the scan
    • getStopRow

      public byte[] getStopRow()
      Returns the stoprow
    • includeStopRow

      public boolean includeStopRow()
      Returns whether the stop row is included in the scan
    • getMaxVersions

      public int getMaxVersions()
      Returns the max number of versions to fetch
    • getBatch

      public int getBatch()
      Returns maximum number of values to return for a single call to next()
    • getMaxResultsPerColumnFamily

      Returns maximum number of values to return per row per CF
    • getRowOffsetPerColumnFamily

      Method for retrieving the scan's offset per row per column family (#kvs to be skipped)
      Returns:
      row offset
    • getCaching

      public int getCaching()
      Returns the caching value, i.e. the number of rows fetched when calling next on a scanner
    • getTimeRange

      Returns TimeRange
    • getFilter

      public Filter getFilter()
      Returns RowFilter
      Overrides:
      getFilter in class Query
    • hasFilter

      public boolean hasFilter()
      Returns true if a filter has been specified, false if not
    • setCacheBlocks

      public Scan setCacheBlocks(boolean cacheBlocks)
      Set whether blocks should be cached for this Scan.

      This is true by default. When true, default settings of the table and family are used (this will never override caching blocks if the block cache is disabled for that family or entirely).

      Parameters:
      cacheBlocks - if false, default settings are overridden and blocks will not be cached
    • getCacheBlocks

      public boolean getCacheBlocks()
      Get whether blocks should be cached for this Scan.
      Returns:
      true if default caching should be used, false if blocks should not be cached
    • setReversed

      public Scan setReversed(boolean reversed)
      Set whether this scan is a reversed one.

      This is false by default, which means a forward (normal) scan.

      Parameters:
      reversed - if true, the scan will run in backward order
    • isReversed

      public boolean isReversed()
      Get whether this scan is a reversed one.
      Returns:
      true if backward scan, false if forward (default) scan
    • setAllowPartialResults

      public Scan setAllowPartialResults(boolean allowPartialResults)
      Set whether the caller wants to see partial results when the server returns fewer cells than expected. This is helpful when scanning a huge row, to prevent OOM at the client. By default this value is false and the complete results will be assembled client side before being delivered to the caller.
    • getAllowPartialResults

      public boolean getAllowPartialResults()
      Returns true when the constructor of this scan understands that the results they will see may only represent a partial portion of a row. The entire row would be retrieved by subsequent calls to ResultScanner.next()
    • setLoadColumnFamiliesOnDemand

      public Scan setLoadColumnFamiliesOnDemand(boolean value)
      Description copied from class: Query
      Set the value indicating whether loading CFs on demand should be allowed (cluster default is false). On-demand CF loading doesn't load column families until necessary, e.g. if you filter on one column, the other column family data will be loaded only for the rows that are included in result, not all rows like in normal case.

      With column-specific filters, like SingleColumnValueFilter w/filterIfMissing == true, this can deliver huge perf gains when there's a cf with lots of data; however, it can also lead to some inconsistent results, as follows:

      - if someone does a concurrent update to both column families in question you may get a row that never existed, e.g. for { rowKey = 5, { cat_videos => 1 }, { video => "my cat" } } someone puts rowKey 5 with { cat_videos => 0 }, { video => "my dog" }, concurrent scan filtering on "cat_videos == 1" can get { rowKey = 5, { cat_videos => 1 }, { video => "my dog" } }.
      - if there's a concurrent split and you have more than 2 column families, some rows may be missing some column families.
      Overrides:
      setLoadColumnFamiliesOnDemand in class Query
    • getFingerprint

      Compile the table and column family (i.e. schema) information into a Map. Useful for parsing and aggregation by debugging, logging, and administration tools.
      Specified by:
      getFingerprint in class Operation
      Returns:
      a map containing fingerprint information (i.e. column families)
    • toMap

      public Map<String,Object> toMap(int maxCols)
      Compile the details beyond the scope of getFingerprint (row, columns, timestamps, etc.) into a Map along with the fingerprinted information. Useful for debugging, logging, and administration tools.
      Specified by:
      toMap in class Operation
      Parameters:
      maxCols - a limit on the number of columns output prior to truncation
      Returns:
      a map containing parameters of a query (i.e. rows, columns...)
    • setRaw

      public Scan setRaw(boolean raw)
      Enable/disable "raw" mode for this scan. If "raw" is enabled, the scan will return all delete markers and deleted rows that have not yet been collected. This is mostly useful for Scan on column families that have KEEP_DELETED_ROWS enabled. It is an error to specify any column when "raw" is set.
      Parameters:
      raw - True/False to enable/disable "raw" mode.
    • isRaw

      public boolean isRaw()
      Returns True if this Scan is in "raw" mode.
    • setAttribute

      public Scan setAttribute(String name, byte[] value)
      Description copied from interface: Attributes
      Sets an attribute. In case value = null attribute is removed from the attributes map. Attribute names starting with _ indicate system attributes.
      Specified by:
      setAttribute in interface Attributes
      Overrides:
      setAttribute in class OperationWithAttributes
      Parameters:
      name - attribute name
      value - attribute value
    • setId

      public Scan setId(String id)
      Description copied from class: OperationWithAttributes
      This method allows you to set an identifier on an operation. The original motivation for this was to allow the identifier to be used in slow query logging, but this could obviously be useful in other places. One use of this could be to put a class.method identifier in here to see where the slow query is coming from.
      Parameters:
      id - id to set for the scan
      Overrides:
      setId in class OperationWithAttributes
    • setAuthorizations

      public Scan setAuthorizations(Authorizations authorizations)
      Description copied from class: Query
      Sets the authorizations to be used by this Query
      Overrides:
      setAuthorizations in class Query
    • setACL

      public Scan setACL(Map<String,Permission> perms)
      Description copied from class: Query
      Set the ACL for the operation.
      Overrides:
      setACL in class Query
      Parameters:
      perms - A map of permissions for a user or users
    • setACL

      public Scan setACL(String user, Permission perms)
      Description copied from class: Query
      Set the ACL for the operation.
      Overrides:
      setACL in class Query
      Parameters:
      user - User short name
      perms - Permissions for the user
    • setConsistency

      public Scan setConsistency(Consistency consistency)
      Description copied from class: Query
      Sets the consistency level for this operation
      Overrides:
      setConsistency in class Query
      Parameters:
      consistency - the consistency level
    • setReplicaId

      public Scan setReplicaId(int Id)
      Description copied from class: Query
      Specify region replica id where Query will fetch data from. Use this together with Query.setConsistency(Consistency) passing Consistency.TIMELINE to read data from a specific replicaId.
      Expert: This is an advanced API exposed. Only use it if you know what you are doing
      Overrides:
      setReplicaId in class Query
    • setIsolationLevel

      Description copied from class: Query
      Set the isolation level for this query. If the isolation level is set to READ_UNCOMMITTED, then this query will return data from committed and uncommitted transactions. If the isolation level is set to READ_COMMITTED, then this query will return data from committed transactions only. If an isolation level is not explicitly set on a Query, then it is assumed to be READ_COMMITTED.
      Overrides:
      setIsolationLevel in class Query
      Parameters:
      level - IsolationLevel for this query
    • setPriority

      public Scan setPriority(int priority)
      Overrides:
      setPriority in class OperationWithAttributes
    • setScanMetricsEnabled

      public Scan setScanMetricsEnabled(boolean enabled)
      Enable collection of ScanMetrics. For advanced users.
      Parameters:
      enabled - Set to true to enable accumulating scan metrics
    • isScanMetricsEnabled

      public boolean isScanMetricsEnabled()
      Returns True if collection of scan metrics is enabled. For advanced users.
    • isAsyncPrefetch

    • setAsyncPrefetch

      @Deprecated public Scan setAsyncPrefetch(boolean asyncPrefetch)
      Deprecated.
      Since 3.0.0, will be removed in 4.0.0. After building sync client upon async client, the implementation is always 'async prefetch', so this flag is useless now.
    • getLimit

      public int getLimit()
      Returns the limit of rows for this scan
    • setLimit

      public Scan setLimit(int limit)
      Set the limit of rows for this scan. We will terminate the scan if the number of returned rows reaches this value.

      This condition is tested last, after all other conditions such as stopRow, filter, etc.

      Parameters:
      limit - the limit of rows for this scan
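The ordering matters: because the limit is checked after the stop row and the filter, it counts only rows that survive those conditions. The sketch below models that ordering with a Java stream over sorted row keys; it is an illustration under stated assumptions, not the server-side implementation, and `LimitOrderSketch` and its `scan` method are hypothetical names.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical model of the order in which scan conditions are applied.
final class LimitOrderSketch {
    static List<String> scan(List<String> sortedRows, String stopRow,
                             Predicate<String> filter, int limit) {
        return sortedRows.stream()
            .takeWhile(row -> row.compareTo(stopRow) < 0) // exclusive stop row first
            .filter(filter)                               // then the filter
            .limit(limit)                                 // the row limit is applied last
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> rows = List.of("a1", "a2", "b1", "b2", "c1");
        // The limit counts rows that passed the filter, not rows examined.
        System.out.println(scan(rows, "c", row -> row.endsWith("2"), 1));
    }
}
```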
    • setOneRowLimit

      public Scan setOneRowLimit()
      Call this when you only want to get one row. It will set limit to 1, and also set readType to Scan.ReadType.PREAD.
    • getReadType

      Returns the read type for this scan
    • setReadType

      public Scan setReadType(Scan.ReadType readType)
      Set the read type for this scan.

      Notice that we may choose to use pread even if you specify Scan.ReadType.STREAM here. For example, we will always use pread if this is a get scan.

    • getMvccReadPoint

      Get the mvcc read point used to open a scanner.
    • setMvccReadPoint

      Scan setMvccReadPoint(long mvccReadPoint)
      Set the mvcc read point used to open a scanner.
    • resetMvccReadPoint

      Set the mvcc read point to -1 which means do not use it.
    • setNeedCursorResult

      public Scan setNeedCursorResult(boolean needCursorResult)
      When the server is slow, when the table contains a lot of deleted data, or when a sparse filter is in use, the server responds with heartbeats to prevent a timeout. However, the scanner returns a Result only when it has one to return, so if there are many heartbeats, ResultScanner#next() may block for a very long time, which is unfriendly to online services. Set this to true to receive a special Result whose #isCursor() returns true and which contains no real data; it only tells you where the server has scanned to. You can call next to continue scanning, or open a new scanner with this row key as the start row whenever you want. A cursor is returned if and only if there is a response from the server but no Result can be returned to the user, for example when the response is a heartbeat, or when there are partial cells but the user does not allow partial results. Currently the cursor works at row level, which means the special Result contains only a row key.
      See Also:
      Result.isCursor(), Result.getCursor(), Cursor
    • isNeedCursorResult

      public boolean isNeedCursorResult()
    • createScanFromCursor

      public static Scan createScanFromCursor(Cursor cursor)
      Create a new Scan with a cursor. It only sets the position information, such as the start row key. The other settings (such as column families, stop row, and limit) must still be filled in by the user.
      See Also:
      Result.isCursor(), Result.getCursor(), Cursor