Processing Structured Files

.csv, .tsv are simple text files in which columns are separated by special separators (e.g., comma or tab), and the end of a row is marked by an end-of-line character. Such files may, but don’t have to, have a header describing what each column is. The header is always the first row in the file. Below is a sample table generated that we will use to perform transformations and calculations for our queries.

Category Product Quantity UnitPrice
Home Appliances Vacuum Cleaner 82 163.77
Books Science 5 47.56
Computer Accessories Webcam 1 113.5
Fashion T-shirt 74 181.97
Fashion Jeans 8 159.87
Fashion Hat 53 136.61
Home Appliances Refrigerator 22 130.62
Computer Accessories Speaker 98 103.57
Home Appliances Refrigerator 17 121.75
Books Biography 23 10.07
Computer Accessories Speaker 6 109.77
Fashion Jeans 21 195.95
Fashion Sneakers 69 34.7
Electronics Smartwatch 86 151.0
Electronics Laptop 7 189.76
Electronics Tablet 20 115.77
Computer Accessories Mouse 44 178.51
Home Appliances Blender 62 85.69
Home Appliances Vacuum Cleaner 48 14.64
Electronics Headphones 49 185.01
Electronics Laptop 15 101.91
Fashion Jeans 90 67.94
Computer Accessories Keyboard 74 12.86
Fashion Hat 34 120.21
Books History 1 98.95
Computer Accessories Speaker 32 135.16
Electronics Tablet 32 86.54
Electronics Laptop 42 54.91
Electronics Headphones 12 192.94
Electronics Smartwatch 40 83.4
Home Appliances Blender 65 50.17
Fashion Hat 28 82.16
Home Appliances Washing Machine 50 77.21
Electronics Smartphone 47 185.91
Books Biography 42 24.57
Fashion Jeans 13 73.85
Computer Accessories Speaker 34 69.99
Electronics Laptop 66 198.1
Books Science 18 188.15
Fashion Sneakers 60 131.19
Books Mystery 81 73.11
Fashion Jeans 51 30.24
Home Appliances Refrigerator 14 161.9
Computer Accessories Keyboard 50 121.49
Fashion Jacket 54 146.29
Computer Accessories Webcam 12 133.3
Computer Accessories Monitor 30 5.32
Home Appliances Blender 43 157.01
Books Mystery 59 58.53
Computer Accessories Speaker 8 156.03
Computer Accessories Mouse 17 132.75
Fashion Jeans 33 43.96
Electronics Laptop 43 132.83
Computer Accessories Webcam 96 68.64
Electronics Laptop 30 125.86
Fashion T-shirt 67 198.49
Books Science 9 135.57
Books Science 31 8.89
Fashion T-shirt 89 197.07
Electronics Smartwatch 17 107.64
Computer Accessories Monitor 9 171.51
Fashion Sneakers 79 168.12
Fashion T-shirt 65 78.55
Fashion Hat 22 73.59
Computer Accessories Webcam 31 164.67
Fashion Jeans 83 20.63
Books History 50 186.23
Electronics Camera 18 183.47
Computer Accessories Speaker 44 136.75
Books Novel 23 183.86
Home Appliances Refrigerator 26 155.14
Books Science 87 120.53
Electronics Smartphone 26 91.9
Computer Accessories Speaker 12 27.0
Fashion Hat 48 112.67
Fashion Hat 78 23.1
Books History 10 91.04
Electronics Laptop 78 168.49
Books Mystery 58 65.09
Home Appliances Microwave 78 115.89
Books Biography 1 155.56
Books History 18 174.27
Computer Accessories Mouse 23 130.46
Home Appliances Blender 19 70.96
Books Novel 92 118.97
Home Appliances Washing Machine 17 147.19
Books History 32 194.87
Home Appliances Vacuum Cleaner 42 140.49
Fashion Jacket 34 134.91
Electronics Smartphone 82 167.1
Home Appliances Vacuum Cleaner 96 191.72
Fashion Hat 78 58.36
Books Mystery 27 96.87
Electronics Camera 5 158.05
Books Novel 74 170.65
Fashion Jacket 13 199.69
Computer Accessories Mouse 66 93.49
Home Appliances Vacuum Cleaner 88 155.43
Home Appliances Microwave 10 71.59
Computer Accessories Mouse 3 26.24

Enriching the Table with Calculations

First, we will enrich our calculations by determining the total price UnitPrice * Quantity.

select 
	*, 
	ToInt32(Quantity) * ToDecimal(UnitPrice) as TotalPrice 
from #separatedvalues.csv('@qfs/category_product_data.csv', true, 0)

We begin by reading the file from the query space named category_product_data.csv which has a header true and we are supposed to start reading from row zero 0. Because our data source reads all columns as string types, we have to convert them to numbers. With each row read, a transformation will be performed on the relevant columns followed by multiplication.

Table after our transformations:

Category Product Quantity UnitPrice TotalPrice
Home Appliances Vacuum Cleaner 82 163.77 13429.14
Books Science 5 47.56 237.8
Computer Accessories Webcam 1 113.5 113.5
….
Computer Accessories Mouse 3 26.24 78.72

When the File Has No Header

If our file doesn’t have a header or the names are very irregular, we can simply skip the header row, indicating that the file doesn’t have a header. This will result in automatically assigned column names Column1, Column2, … ColumnN.

select 
	*, 
	ToInt32(Column3) * ToDecimal(Column4) as Column5 
from #separatedvalues.csv('@qfs/category_product_data.csv', false, 1)

Table after our transformations:

Column1 Column2 Column3 Column4 Column5
Home Appliances Vacuum Cleaner 82 163.77 13429.14
Books Science 5 47.56 237.8
Computer Accessories Webcam 1 113.5 113.5
….
Computer Accessories Mouse 3 26.24 78.72

Let’s Now Calculate the Total Price for Products in a Given Category and the Total Price for a Product in a Given Category

To get the desired effect in one query, we will use grouping, aggregate methods in the parent group (TotalPriceForCategory), and the current group’s aggregate method (TotalPriceForProduct).

select 
    Category, 
    Product, 
    Sum(ToInt32(Quantity) * ToDecimal(UnitPrice), 1) as TotalPriceForCategory, 
    Sum(ToInt32(Quantity) * ToDecimal(UnitPrice)) as TotalPriceForProduct 
from #separatedvalues.csv('@qfs/category_product_data.csv', true, 0) 
group by Category, Product

This query works by creating subgroups of categories, then creating further subgroups of products within each of these. The expression Sum(ToInt32(Quantity) * ToDecimal(UnitPrice)) accesses the innermost subgroup (products), while Sum(ToInt32(Quantity) * ToDecimal(UnitPrice), 1) accesses the parent subgroup, the subgroup of the given category, and performs its calculations there. The results of this query are the aggregated calculations:

Category Product TotalPriceForCategory TotalPriceForProduct
Home Appliances Vacuum Cleaner 96150.58 52115.4
Books Science 80287 15606.33

These calculations can be verified using an auxiliary Python script:

df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

total_price_per_category = df.groupby('Category')['TotalPrice'].sum().reset_index()

total_price_per_product_in_category = df.groupby(['Category', 'Product'])['TotalPrice'].sum().reset_index()

total_price_per_category, total_price_per_product_in_category

Sentiment Analysis of Posts Using GPT

Suppose we received the following data:

PostId Comment Date
1 This product is amazing! Highly recommend. 2023-01-01
2 Absolutely terrible service, I’m very disappointed. 2023-01-02
3 The product is okay, nothing special. 2023-01-03
6 It’s a decent product, but I had higher expectations. 2023-01-06

We need to determine the sentiment with which the posts were written, and to this end, we can use a data source such as the GPT model from OpenAI. The query in a simple way will return whether a given post has a positive, negative, or neutral sentiment.

select 
    csv.PostId,
    csv.Comment,
    gpt.Sentiment(csv.Comment) as Sentiment,
    csv.Date
from #separatedvalues.csv('@qfs/comments_sample.csv', true, 0) csv inner join #openai.gpt('gpt-4-1106-preview') gpt on 1 = 1

In this query, we use an inner join because we want to use a method that belongs to the calculated gpt table. This table always returns a single row; we are not specifically interested in the row value but want it to be available for each message. When initializing, we use the exact model name which is supposed to respond to our sentiment query. The data source #openai.gpt has more interesting methods with which I encourage you to become familiar in the source documentation.