ReductSelect Extension
The ReductSelect extension is a powerful tool that enables you to manipulate columnar data on the storage side. It enables you to select specific columns, filter data and perform various operations before returning it to the client.
The extension is only available under a commercial licence. If you wish to use it for non-commercial purposes, please contact us.
Description
The extension addresses the need for efficient data manipulation in columnar formats. It does this without the overhead of transferring large datasets to the client for processing, and enables users to perform operations such as selecting specific columns, filtering data and applying transformations directly in storage.
Data Format
Currently, the extension only supports single-line CSV records without headers. Data is expected to be in columnar format, with each record represented as a single line in the CSV file, separated by commas.
Query Syntax
The extension expects a JSON object with the following structure:
{
"columns": [
{
# Column index to select
"index": "number",
# Select a range of columns starting from this index until the end or `to` index
"from": "number",
# Select a range of columns starting from this index until the end or `from` index
"to": "number",
# Name of the label to use in the query condition
"as_label": "string"
}
]
}
Column Selection
The columns
array contains objects that specify the columns to select from the CSV data.
Parameter | Type | Mandatory | Description |
---|---|---|---|
index | number | No* | The index of the column to select. It can be used with as_label to map the selected column to a label. |
from | number | No* | The starting index of the range of columns to select. It can be used with to to specify the end of the range. |
to | number | No* | The ending index of the range of columns to select. It can be used with from to specify the start of the range. |
as_label | string | No | The name of the label to use in the query condition. It can be used with index to map the selected column to a label. |
*Each column object must have an index or a range of columns to select.
Example
The following example shows how to use the ReductSelect extension to select specific columns from a CSV file and filter the data based on these columns. Although the example is written in Python, you can use any of the official SDKs to run it.
- Python
from time import time_ns
from reduct import Client
async def main():
async with Client("http://localhost:8383", api_token="my-token") as client:
bucket = await client.create_bucket(
"my-bucket",
exist_ok=True,
)
# Write some CSV data with timestamps
now = time_ns() // 1000
await bucket.write("csv", "1,2,3,4,5", timestamp=now, content_type="text/csv")
await bucket.write("csv", "6,7,8,9,10", timestamp=now+1, content_type="text/csv")
await bucket.write("csv", "11,12,13,14,15", timestamp=now+2, content_type="text/csv")
# Prepare the query with the 'select' extension
ext = {
"select": { # name of the extension to use
"columns": [
# Select the first column and label it as 'col1'
{"index": 0, "as_label": "col1"},
# Select columns from 2nd to 4th column without labeling
{"from": 2, "to": 4},
]
}
}
# Query the data with the 'select' extension
# and filter out records where the first column is less than 10
async for record in bucket.query("csv", start=now, when={"@col1": {"$lt": 10}}, ext=ext):
print(f"Record timestamp: {record.timestamp}")
print(f"Record labels: {record.labels}")
print(await record.read_all())
# 5. Run the main function
if __name__ == "__main__":
import asyncio
asyncio.run(main())
The expected output of the above code is as follows:
Record timestamp: 1747389597415517
Record labels: {'@col1': '1'}
b'1,3,4\n'
Record timestamp: 1747389597415518
Record labels: {'@col1': '6'}
b'6,8,9\n'