Anthony Cavin
Data Scientist - ML/AI, Python, TypeScript

A data scientist specializing in machine learning, AI, Python, and TypeScript, with a strong interest in applying these technologies to data-driven projects and innovative AI solutions.


Kafka Integration Tutorial for Blob Data

Anthony Cavin · 12 min read

Sensor data processed and labeled by AI, stored in ReductStore, with metadata relayed to Kafka

In this tutorial, we will walk through a simple and practical setup for integrating Kafka with ReductStore to handle unstructured data streams from edge devices. We'll cover the basics of setting up Kafka and ReductStore using Docker, creating Kafka topics in Python, and managing blob data and metadata.

If you are new to Kafka and ReductStore, here's a quick summary of the technology:

  • Apache Kafka is a distributed streaming platform for sharing data between applications and services in real time.
  • ReductStore is a time-series database for blob data, optimized for edge computing. It complements Kafka by storing files larger than 1 MB, Kafka's default maximum message size.

In our example, we will deploy a simple architecture with a single instance of Kafka and ReductStore running on a local machine. We will demonstrate how to create Kafka topics, write data to ReductStore, and forward metadata to Kafka.
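The first of those steps, creating a topic from Python, can look roughly like the sketch below. It assumes the confluent-kafka package and a broker on localhost:9092; the topic name blob-metadata is just a placeholder.

```python
from confluent_kafka.admin import AdminClient, NewTopic

# Connect to the local broker's admin API
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# One partition and a replication factor of 1 are enough for a single-broker demo
futures = admin.create_topics(
    [NewTopic("blob-metadata", num_partitions=1, replication_factor=1)]
)

for topic, future in futures.items():
    future.result()  # raises if the topic could not be created
    print(f"Topic '{topic}' created")
```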

For an easy start, you can follow along by cloning the reduct-kafka-example repository containing all the code snippets and Docker Compose files used in this tutorial.
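To give a flavor of the write-and-forward step before you clone anything, here is a minimal sketch: store a blob with its labels in ReductStore, then publish only the lightweight metadata to Kafka. It assumes the reduct-py and confluent-kafka client libraries and local instances on the default ports; the bucket, entry, and topic names are placeholders rather than the repository's exact code.

```python
import asyncio
import json
import time

from confluent_kafka import Producer
from reduct import Client


async def main():
    # 1. Write the blob and its labels to ReductStore
    async with Client("http://localhost:8383") as reduct_client:
        bucket = await reduct_client.create_bucket("sensor-blobs", exist_ok=True)

        blob = b"<binary sensor payload>"     # e.g. an image or a point cloud
        ts = int(time.time() * 1_000_000)     # microsecond timestamp
        labels = {"device": "edge-01", "status": "anomaly"}

        await bucket.write("camera", blob, timestamp=ts, labels=labels)

    # 2. Forward only the lightweight metadata to Kafka
    producer = Producer({"bootstrap.servers": "localhost:9092"})
    metadata = {"entry": "camera", "timestamp": ts, **labels}
    producer.produce("blob-metadata", value=json.dumps(metadata).encode())
    producer.flush()


asyncio.run(main())
```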


Implementing Data Streaming in PyTorch from Remote DB

Anthony Cavin · 9 min read

PyTorch training loop with data streaming from a remote device

When training a model, we aim to process data in batches, shuffle the data at each epoch to avoid overfitting, and leverage Python's multiprocessing to fetch data through multiple workers.

We want multiple workers because GPUs can process large amounts of data concurrently, while the bottleneck often lies in the time-consuming task of loading that data into the system.
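In PyTorch, those three requirements map directly onto DataLoader arguments. A minimal sketch, with a dummy TensorDataset standing in for real data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset: 10,000 samples with 32 features each and a binary label
dataset = TensorDataset(torch.randn(10_000, 32), torch.randint(0, 2, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=64,    # process data in batches
    shuffle=True,     # reshuffle at every epoch to reduce overfitting
    num_workers=4,    # fetch batches in parallel worker processes
)

if __name__ == "__main__":  # guard needed when workers are spawned as subprocesses
    for features, labels in loader:
        ...  # forward/backward pass goes here
```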

The challenge becomes even trickier when the dataset is too large to store on disk and we need to stream data from a remote database such as ReductStore.
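Conceptually, streaming means the dataset yields samples on demand instead of indexing files on disk. A rough sketch using an IterableDataset is shown below; fetch_records is a hypothetical placeholder for whatever query your database client exposes (for ReductStore, a query over a time range), not a real library call.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


def fetch_records(start: int, stop: int):
    """Hypothetical stand-in for a remote query, e.g. a ReductStore time range."""
    for ts in range(start, stop):
        yield f"{ts:016d}".encode(), ts % 2  # fixed-size dummy payload and label


class RemoteStreamDataset(IterableDataset):
    def __init__(self, start: int, stop: int):
        self.start, self.stop = start, stop

    def __iter__(self):
        # Give each DataLoader worker a disjoint slice of the time range
        info = get_worker_info()
        start, stop = self.start, self.stop
        if info is not None:
            per_worker = (stop - start) // info.num_workers
            start = self.start + info.id * per_worker
            stop = self.stop if info.id == info.num_workers - 1 else start + per_worker
        for blob, label in fetch_records(start, stop):
            yield torch.frombuffer(bytearray(blob), dtype=torch.uint8), label


if __name__ == "__main__":
    loader = DataLoader(RemoteStreamDataset(0, 1_000), batch_size=32, num_workers=2)
    for payloads, labels in loader:
        ...  # decode payloads and run the training step here
```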

In this blog post, we will go through a full example and set up a data stream to PyTorch from a playground dataset on a remote database.

Let's dig in!


Open-Source Alternatives to Landing AI

Anthony Cavin · 7 min read

Photo by Luke Southern on Unsplash

In the thriving world of IoT, integrating MLOps for Edge AI is important for creating intelligent, autonomous devices that are not only efficient but also trustworthy and manageable.

MLOps—or Machine Learning Operations—is a multidisciplinary field that mixes machine learning, data engineering, and DevOps to streamline the lifecycle of AI models.

In this field, important factors to consider are:

  • explainability, ensuring that decisions made by AI are interpretable by humans;

  • orchestration, which involves managing the various components of machine learning in production at scale; and

  • reproducibility, guaranteeing consistent results across different environments or experiments (a minimal seed-pinning sketch follows below).
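As a concrete illustration of that last point, one common reproducibility practice is pinning random seeds across every library involved. A minimal sketch with Python, NumPy, and PyTorch:

```python
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch RNGs so experiments are repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade a bit of speed for deterministic cuDNN kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(42)
```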