How to get filtered data from Bigtable using Python?

Posted by 房东的猫 on 2019-12-10 10:33:57

Question


I am using the Bigtable emulator and have successfully added a table to it. Now I need to read filtered data from that table.

The table is as follows:

arc_record_id | record_id | batch_id
1             |624        |86
2             |625        |86
3             |626        |86 

and so on, up to arc_record_id 10.

I tried the following Python code:

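from google.cloud.bigtable.row_filters import (
    ColumnQualifierRegexFilter, RowFilterChain, ValueRangeFilter)

visit_dt_filter = ValueRangeFilter(start_value="1".encode('utf-8'),
                                   end_value="2".encode('utf-8'))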

col1_filter = ColumnQualifierRegexFilter(b'arc_record_id')

chain1 = RowFilterChain(filters=[col1_filter, visit_dt_filter])

partial_rows = testTable.read_rows(filter_=chain1)

for row in partial_rows:
    cell = row.cells[columnFamilyid1]["arc_record_id".encode('utf-8')][0]
    print(cell.value.decode('utf-8'))

The row key is built like this:

prim_key = row_value[0]  # which is arc_record_id
row_key = "RecordArchive{}".format(prim_key).encode('utf-8')

The output I get is

1
10
2
3

I expect the output to be

arc_record_id | record_id | batch_id
1             |624        |86
2             |625        |86

Answer 1:


There are several issues with your code; fixing them will get you the result you want:

  1. Bigtable uses lexicographic sort over arbitrary bytes, so the sort order is 1, 10, 2, 3 and so on. This is why 10 is included in your result set. You could fix this by left padding your numbers so they are stored as 000000001, 000000002. (You can reduce the inefficiency of this by storing in hex or even binary).

  2. Because you only print row.cells[columnFamilyid1]["arc_record_id".encode('utf-8')] you are only outputting arc_record_id.

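  3. Because the column you want to filter on is the row key, it is both easier and more efficient to tell read_rows directly which key range to read: read_rows(start_key="RecordArchive1".encode('utf-8'), end_key="RecordArchive3".encode('utf-8')). The end key is exclusive by default, and with unpadded keys this range would still include RecordArchive10, so combine it with the fix from point 1.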
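The ordering problem from points 1 and 3 can be reproduced without Bigtable, since it is ordinary byte-string comparison. The following is a minimal sketch: `key_range` is a hypothetical helper that only imitates a key-range scan (start inclusive, end exclusive), and the 9-digit padding width is an arbitrary choice for illustration.

```python
# Row keys as the question builds them: prefix + decimal id, no padding.
ids = range(1, 11)
unpadded = sorted("RecordArchive{}".format(i).encode("utf-8") for i in ids)

# The same keys with the id left-padded to a fixed width of 9 digits.
padded = sorted("RecordArchive{:09d}".format(i).encode("utf-8") for i in ids)


def key_range(keys, start, end):
    """Imitate a key-range scan: start key inclusive, end key exclusive."""
    return [k for k in keys if start <= k < end]


# Unpadded keys: "RecordArchive10" sorts between "...1" and "...2".
unpadded_hits = key_range(unpadded, b"RecordArchive1", b"RecordArchive3")

# Padded keys: byte order now matches numeric order.
padded_hits = key_range(
    padded, b"RecordArchive000000001", b"RecordArchive000000003")

print(unpadded_hits)
print(padded_hits)
```

With unpadded keys the range picks up the row for id 10 as well; with fixed-width keys it contains only ids 1 and 2.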

All in all, try code like:

import struct

KEY_PREFIX = "RecordArchive".encode('utf-8')
ARC_RECORD_ID_COL = "arc_record_id".encode('utf-8')
RECORD_ID_COL = "record_id".encode('utf-8')
BATCH_ID_COL = "batch_id".encode('utf-8')

# Functions used to store/retrieve integer values as 4-byte big-endian
# signed integers. Supports IDs up to 2**31 - 1.
def pack_int(i):
    return struct.pack('>l', i)

def unpack_int(b):
    return struct.unpack('>l', b)[0]

# Row key of the record with the given arc_record_id
def rowkey(arc_record_id):
    return KEY_PREFIX + pack_int(arc_record_id)

# Read the rows with arc_record_id 1 through 2; end_inclusive=True keeps the end key.
# This assumes the rows were written with rowkey() keys and pack_int() values.
results = table.read_rows(start_key=rowkey(1), end_key=rowkey(2), end_inclusive=True)
print("arc_record_id,record_id,batch_id")
for row in results:
    print("{},{},{}".format(
        unpack_int(row.cells[columnFamilyid1][ARC_RECORD_ID_COL][0].value),
        unpack_int(row.cells[columnFamilyid1][RECORD_ID_COL][0].value),
        unpack_int(row.cells[columnFamilyid1][BATCH_ID_COL][0].value)))
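As a quick sanity check on the packed row keys, the sketch below confirms that sorting the keys as raw bytes gives numeric order. It redefines the helpers so it runs on its own, and the sample IDs are made up for illustration. This holds for non-negative IDs only: a negative value packed with '>l' has its high bit set and would sort after the positive ones.

```python
import struct

KEY_PREFIX = b"RecordArchive"


def pack_int(i):
    # 4-byte big-endian signed integer
    return struct.pack(">l", i)


def unpack_int(b):
    return struct.unpack(">l", b)[0]


def rowkey(arc_record_id):
    return KEY_PREFIX + pack_int(arc_record_id)


# Sample IDs in arbitrary order (illustrative values, not from the table).
ids = [10, 2, 1, 3, 256, 9]

# Sort the keys as raw bytes, which is how Bigtable orders row keys.
keys_in_byte_order = sorted(rowkey(i) for i in ids)

# Strip the prefix and decode the IDs again to inspect the resulting order.
decoded = [unpack_int(k[len(KEY_PREFIX):]) for k in keys_in_byte_order]
print(decoded)
```

Because the byte order matches the numeric order, a range such as rowkey(1) to rowkey(2) contains only the records with those IDs.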


Source: https://stackoverflow.com/questions/55792924/how-to-get-filtered-data-from-bigtable-using-python
