Azure Data Explorer (ADX) is a fully managed data analytics service for real-time analysis on large volumes of data streaming from applications, websites and IoT devices.
The primary use of ADX is the ingestion of structured, semi-structured and unstructured data for big data analytics, with speeds of up to 200 Megabytes/sec per node (up to 1000 nodes) returning results in less than a second across billions of records.
More businesses are opening their network to a wide variety of IoT devices and applications, it becomes increasingly vital for network and security teams to proactively react to these threatening events in a timely and cost effective manner.
I recently employed ADX with a government client to migrate an existing Kafka workload which ingests and transforms Fortinet, Paloalto, and Bluecoat web security logs. During Covid-19, their workload increased 10-fold, with an associated 5-fold increase in costs. The migration of this workload resulted in a 60% cost reduction, a simplified solution and an improvement in data reliability.
How can data be ingested into Azure Data Explorer?
Automated Pipelines — Ingestion Methods
Event Grid Blob Created — When a ‘blob’ is created on the Azure storage account it results in the firing of an event that triggers Data Explorer ingestion pipeline.
Event Hub
IoT Hub
Azure Data Factory
Light Ingest — Command line tool for historical loads to minimise cost.
Supported Formats
Uncompressed Formats — ApacheAvro, AvroCSV, JSON, MultiJSON, ORC, Parquet, PSV, RAW, SCsv, SOHsv, TSV, TSVE, TXT, W3CLOGFILE
When the source data has a schema provided e.g. avro, parquet, w3clogfile it can be directly inserted into the final destination table with the expected data types, column names etc.
Compressed Formats — GZip, Zip
Transformations
Data is transformed in ADX by using the native language KQL — Kusto Query Language. This is a simple, yet powerful language to query structured, semi-structured and unstructured data. It assumes a relational data model of tables and columns, with a minimal set of data types. The language is very expressive, easy to read and understand the query intent.
Ingesting Fortinet Logs from Azure Storage to ADX
In this article I will demonstrate how to create an Ingestion Pipeline to ingest and transform Fortinet Web Security log files uploaded hourly to an Azure Storage Account and which accumulate to a daily total of 400GB (when uncompressed).
The file is a compressed .gz file split into three different formats:
Space delimited values
Pipe delimited values
Pipe delimited Key-Value Pairs
Solution
The solution used follows the high-level steps below:
Fortinet Log Files are uploaded/created on Azure Storage(ADLS Gen2) This action in turn triggers the ingestion process using an Event Grid-created subscriber.
The file is ingested into an ADX staging table.
An ADX user-defined Update Policy reads the newly uploaded data in the staging table and transforms the data into the destination table as required.
Ingestion Pipeline
Prerequisites
Install Kusto explorer and connect to the ADX cluster. Alternatively, the Web UI can be utilised.
Microsoft recommends each file must be 1GB uncompressed for optimal ingestion and no larger than 4GB.
Register Event Grid with the Azure Subscription.
To create the ingestion pipeline the following steps must be completed
Create a container on Azure Storage — ADLS Gen2.
Create an ADX Staging Table.
Set a Retention Policy on the ADX Staging table.
Create an ADX Query Function to read and transform the data landing in the staging table.
Create an ADX Destination Table for the curated data.
Create ADX Update policy. The Update Policy instructs ADX to automatically append data to the target table whenever new data is inserted into the staging table, based on the transformation function created in step 3.
Create an Event Grid Ingestion Method. The chosen ingestion method is ingesting data into data explorer via Event Grid from ADLS.
Test :)
Steps
1. Create a container on Azure Storage — ADLS Gen2.
2. Create an ADX staging table with one column of data type string.
3. Set a Retention Policy on the ADX Staging table to only keep 14 days of data.
4. Create an ADX Function.
The function reads and transforms the data from the staging table to the desired output. Only a subset of source columns are required in the output.
5. Create an ADX Destination Table for the curated data. The ingestion function can be used to create the schema for the destination table using the following script:
NOTE: Ensure the DateTime and numeric columns are typed correctly as ADX stores metadata and statistics for each column. ADX will also store the maximum and minimum values of the extent of the data. This will ensure that when the user requests the data from the store, with certain conditions, it will be compared and only relevant extents are scanned and returned as results.
6. Create ADX Update Policy The Update Policy instructs ADX to automatically append data to the target table whenever new data is inserted into the staging table, based on the transformation function created above.
7. Create an Event Grid Ingestion Method.
The chosen ingestion method is ingesting data into data explorer via Event Grid from ADLS.
- Log in to the Azure Portal.
- Navigate to the ADX Cluster ➜ Databases (Select appropriate database) ➜ Data connections.
- Add Data Connection — see below.
- Click ‘Next: Review + create >’ to the next tab Ingest Properties. NOTE: Txt files do not have mappings. Mappings are only used for CSV, JSON, AVRO, and W3CLOGFILE files.
8. Test :)
Upload a file to the Azure storage container. If the ingestion has failed run the query below to check why.
Conclusion
If you would like a copy of my code, please drop me a message on LinkedIn.
I hope you have found this helpful and will save your company money and time getting started with Azure Data Explorer.
Please share your thoughts, questions, corrections and suggestions. All feedback and comments are very welcome.
Comments