Build Serverless CSV Cleaning Pipeline with Azure Functions

Cloud Platform Engineer with 5+ years of experience in cloud automation, CI/CD, and infrastructure optimisation, I specialise in streamlining deployments, enhancing system resilience, and reducing operational costs.

In this project, I will be creating a serverless data pipeline on Azure. When a CSV file is uploaded to Azure Storage, an Azure Function triggers automatically, cleans the data by removing extra spaces, empty rows, and invalid values, and then saves the cleaned CSV back to Storage. Secrets such as the storage connection string are kept in Azure Key Vault, so nothing sensitive is stored in the code. By the end of this article, you'll have a fully working, automated pipeline that turns messy CSV files into clean, usable data without any manual cleanup.

Full source code available on GitHub: https://github.com/palakbhawsar98/csv-processor

Prerequisites:

  • Azure account and basic knowledge of working with Azure services

  • VS Code installed with the Azure Functions extension

  • Python 3.11 installed

  • Azure Functions Core Tools v4 installed

  • Basic understanding of CSV files and Python

Step 1: Create Resource Group

Create a Resource Group in Azure to hold all the resources for this project. This keeps everything organized and easy to manage.

Step 2: Create Storage Account

Create a Storage Account with two containers: one for raw CSV uploads and one for cleaned files. We use two containers because if the function writes the cleaned file back to the same container that triggers it, it would trigger again and create an infinite loop. Separating raw and processed files prevents this and keeps the pipeline stable.
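As an extra safeguard on top of the two-container layout, the function can also refuse to handle anything that did not arrive in the raw container. The sketch below is only an illustration of that idea: the container names come from this article, while the helper name plan_output is hypothetical and not taken from the repository.

```python
from typing import Optional, Tuple

RAW_CONTAINER = "raw-uploads"              # container that triggers the function
PROCESSED_CONTAINER = "processed-uploads"  # container that receives cleaned files


def plan_output(source_container: str, blob_name: str) -> Optional[Tuple[str, str]]:
    """Return (target_container, blob_name), or None if the blob must be ignored.

    Ignoring anything outside the raw container means a cleaned file can
    never be picked up and processed a second time.
    """
    if source_container != RAW_CONTAINER:
        return None
    return PROCESSED_CONTAINER, blob_name
```

With this check in place, an accidental event for a file in processed-uploads is skipped instead of starting a loop.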

Step 3: Create App Insights

Create Application Insights to monitor the function. This lets us track each run, see row counts, and catch any errors during processing. With this, you can easily see if the pipeline is working and debug if something goes wrong.
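Messages written with Python's standard logging module inside an Azure Function are forwarded to Application Insights once it is enabled, so row counts can be recorded with a one-line summary per run. A minimal sketch, assuming a hypothetical helper name and message format of my own choosing:

```python
import logging

logger = logging.getLogger("csv_cleaner")  # logger name is an arbitrary choice


def log_run_summary(blob_name: str, rows_in: int, rows_out: int) -> str:
    """Log one summary line per processed file and return it."""
    dropped = rows_in - rows_out
    message = (
        f"Cleaned {blob_name}: {rows_in} rows read, "
        f"{rows_out} kept, {dropped} dropped"
    )
    logger.info(message)
    return message
```

Calling log_run_summary("employees.csv", 5, 3) would log "Cleaned employees.csv: 5 rows read, 3 kept, 2 dropped", which is then easy to search for in the logs.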

Step 4: Create Function App

Create an Azure Function App to process the CSV files. Its function will trigger automatically whenever a new file is uploaded to the raw CSV container, clean the data, and save the processed file. Select the Flex Consumption plan: the function runs only when triggered, which keeps costs very low.

For storage, I linked the same Storage Account that we created earlier. Azure Function Apps require a Storage Account to store the function code, manage triggers, and keep internal logs.

In the Monitoring tab of the Function App, enable Application Insights and select the existing instance. This ensures all function runs, logs, and errors are tracked for monitoring and debugging.

Enable Managed Identity so the function can securely access Key Vault to read secrets such as the storage connection string. No secrets should be stored in the code.

Step 5: Create Azure Key Vault

Create a Key Vault to keep secrets like the Storage Account key safe.

In the Access configuration tab, select Vault access policy. In the Access policy section, add a policy and give Get and List permissions for secrets. Under Principal, select the Function App. This lets the Function App access secrets safely without storing them in the code.

Step 6: Add Secret to Key Vault

Add the Storage connection string as a secret in Key Vault. Go to stcsvprocessor01 and open Access keys, then copy the Connection string (key1). After that, go to the Key Vault and open the Secrets section. Add a new secret with the name stcsvprocessor01-connection-string and paste the copied connection string as the value. This stores the Storage credential securely in Key Vault so the Function App can access it when processing files.

Step 7: Add Key Vault URL to Function App

Go to the Function App and open the Environment variables section. Add a new variable with the name KEY_VAULT_URL. The value should be the Key Vault URL, which can be found in the Overview section of the Key Vault. This tells the Function App which Key Vault to use when retrieving secrets.
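To show how Steps 4 to 7 fit together at runtime, here is a minimal sketch of reading the secret with the function's managed identity. It assumes the azure-identity and azure-keyvault-secrets packages are listed in requirements.txt; the helper names are hypothetical and the code in the repository may be organised differently.

```python
import os

# Secret name created in Step 6; it must match the Key Vault entry exactly.
SECRET_NAME = "stcsvprocessor01-connection-string"


def get_key_vault_url() -> str:
    """Read the vault URL from the KEY_VAULT_URL app setting added in Step 7."""
    url = os.environ.get("KEY_VAULT_URL", "").strip()
    if not url:
        raise RuntimeError("KEY_VAULT_URL is not set on the Function App")
    return url


def get_storage_connection_string() -> str:
    """Fetch the storage connection string from Key Vault."""
    # Imported here so the helper above works even without the Azure SDK.
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    # In Azure, DefaultAzureCredential picks up the managed identity;
    # locally it falls back to developer credentials such as the Azure CLI.
    client = SecretClient(
        vault_url=get_key_vault_url(),
        credential=DefaultAzureCredential(),
    )
    return client.get_secret(SECRET_NAME).value
```

Failing early with a clear message when KEY_VAULT_URL is missing makes a misconfigured app setting much easier to spot in the invocation logs.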

Step 8: Deploy the Python Function

Before deploying, let us understand the key files in your project:

function_app.py : This is the main Python code. It contains the function that triggers when a file is uploaded, cleans the data by removing blank rows, skipping rows with negative values, and stripping extra whitespace from every cell, then saves the cleaned file to the processed-uploads container with a timestamp in the filename.

requirements.txt : Lists all the Python libraries the function needs to run in Azure.

local.settings.json : This file stores local development settings like the storage connection string. It never gets deployed to Azure and is excluded from version control by .gitignore, so your credentials stay safe on your machine.

host.json : Azure Functions configuration file that controls runtime behaviour like logging and timeouts. You do not need to change this.
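The cleaning rules described for function_app.py above can be sketched with the standard library alone. This is not the repository's code, only an illustration under a few assumptions: the first non-blank row is a header, a "negative value" means any cell that parses as a number below zero, and the timestamp format is my own choice.

```python
import csv
import io
import os
from datetime import datetime, timezone
from typing import Optional


def is_negative_number(value: str) -> bool:
    """True if the cell parses as a number below zero."""
    try:
        return float(value) < 0
    except ValueError:
        return False


def clean_csv(raw_text: str) -> str:
    """Strip whitespace, drop blank rows, and skip rows with negative values."""
    reader = csv.reader(io.StringIO(raw_text))
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    header_seen = False
    for row in reader:
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # blank row, or a row of empty cells
        if header_seen and any(is_negative_number(c) for c in cells):
            continue  # data row containing a negative value
        header_seen = True
        writer.writerow(cells)
    return output.getvalue()


def timestamped_name(blob_name: str, now: Optional[datetime] = None) -> str:
    """Insert a UTC timestamp before the file extension."""
    now = now or datetime.now(timezone.utc)
    stem, ext = os.path.splitext(blob_name)
    return f"{stem}_{now:%Y%m%d_%H%M%S}{ext}"
```

Keeping the cleaning logic in plain functions like these means it can be tested locally without Azure, and the trigger function only has to download the blob, call clean_csv, and upload the result under the timestamped name.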

To deploy from VS Code:

  1. Press Ctrl + Shift + P

  2. Type Azure Functions: Deploy to Function App

  3. Select your subscription

  4. Select func-csv-processor

  5. Click Deploy when the confirmation dialog appears

VS Code will package your code and deploy it to Azure. Once the deployment completes, your function will appear in the portal under Functions in your Function App.

Step 9: Set Up Event Grid Trigger

On the Flex Consumption plan, the function must be triggered through Event Grid rather than a standard blob trigger. Event Grid watches your Storage Account and sends a notification to your Function App the moment a new file is uploaded. In stcsvprocessor01, open Events, select Event Subscription, and fill in the details below.

Then open the Filters tab and configure the settings below. Setting the subject filter to the raw-uploads container tells Event Grid to notify the function only when a file lands in that container and to ignore everything else.
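For context on what the subject filter matches: blob events from a Storage Account carry a subject of the form /blobServices/default/containers/<container>/blobs/<blob path>. The sketch below mimics a "subject begins with" check and pulls the container and blob name out of a subject. The function names are hypothetical, and the exact prefix you enter in the portal should be checked against your own subscription.

```python
from typing import Tuple

SUBJECT_PREFIX = "/blobServices/default/containers/"
RAW_SUBJECT_PREFIX = SUBJECT_PREFIX + "raw-uploads/"


def is_raw_upload(subject: str) -> bool:
    """Mimic a 'subject begins with' filter scoped to the raw-uploads container."""
    return subject.startswith(RAW_SUBJECT_PREFIX)


def parse_blob_subject(subject: str) -> Tuple[str, str]:
    """Split an event subject into (container, blob_path)."""
    if not subject.startswith(SUBJECT_PREFIX):
        raise ValueError(f"Unexpected subject: {subject}")
    remainder = subject[len(SUBJECT_PREFIX):]
    container, separator, blob_path = remainder.partition("/blobs/")
    if not separator or not container or not blob_path:
        raise ValueError(f"Unexpected subject: {subject}")
    return container, blob_path
```

For example, a file uploaded as employees.csv to raw-uploads produces the subject /blobServices/default/containers/raw-uploads/blobs/employees.csv, which passes the filter, while anything written to processed-uploads does not.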

Step 10: Test the Pipeline

Now that everything is set up, let us test it end to end to make sure the pipeline is working correctly. Create a test CSV file and deliberately add a blank row and one row with a negative salary value. Upload the file to the raw-uploads container, then check the processed-uploads container for the cleaned CSV file.
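If you would rather generate the test file than type it by hand, a short script along these lines works. The column names and values are made up for illustration; only the blank row, the negative salary, and the stray spaces matter.

```python
import csv
import io


def build_test_csv() -> str:
    """Return CSV text with stray spaces, one blank row, and one negative salary."""
    rows = [
        ["name", "department", "salary"],
        ["  Asha ", "Engineering", " 72000"],  # extra whitespace to strip
        [],                                    # blank row
        ["Ben", "Finance ", "-500"],           # negative salary, should be skipped
        ["Chen", "Support", "41000"],
    ]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()
```

Write the returned text to a file such as test_employees.csv (a hypothetical name) and upload it. After cleaning, only the header, Asha, and Chen rows should remain.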

Before:

After:

Check the function invocations: Go to your Function App → Functions → CsvCleanProcessor → Invocations to see a full history of every run with the date, status, and duration. The errors I ran into included the wrong Python version being used, the secret name in my code not matching the one I created in Key Vault, and Azure Flex Consumption requiring Event Grid instead of a standard blob trigger. Do not be discouraged if you see errors here. Click on any failed invocation to read the full error message, which makes it much easier to understand exactly what went wrong and fix it.

Thank you for taking time to read my article. If I have overlooked any steps or missed any details, please don't hesitate to get in touch.

Feel free to reach out to me anytime.

~ Palak Bhawsar ✨