Victor Ferrer´s Java and SW blog: How to import CSV files into Parquet with Kite SDK

Summary

In order to perform some data studies, I need to import a CSV file, generated by Yahoo Finance and containing historical data from General Electric (GE) into Parquet files in HDFS so I can later run some Spark Jobs on it.

The data is a time-series and that should be partitioned per year and month (just for the sake of the example).

In order to have a quick solution (I do not want to write a Spark Job or Java application), I will use Kite SDK.

Steps to follow

Clean the data with a simple AWK script
Define the schema of the Parquet file (column types, etc.)
Define the partition strategy that is going to follow (for more optimal querying)
Execute the import

Clean the data

This is usually not necessary but the CSV file coming from Yahoo Finance needed a couple of tweaks before being imported:

Header should be removed
From the "date" field (yyyy-mm-dd) I needed two new columns with the year and month values (for data partitioning later on)
First field should be quoted in order to be treated as a string

This is the script, I will not comment on it for the sake of brevity:

	# Original CSV is generated as follows:

	# Date,Open,High,Low,Close,Adj Close,Volume
	# 1962-01-02,0.781250,0.794271,0.773438,0.778646,0.002326,2073600
	# 1962-01-03,0.774740,0.774740,0.768229,0.770833,0.002303,1420800

	# Ignore the header, extract the year and month from the first column,
	# create two additional columns at the end of the file and add quotes to the
	# first colum:

	# "1962-01-02",0.781250,0.794271,0.773438,0.778646,0.002326,2073600,1962,01
	# "1962-01-03",0.774740,0.774740,0.768229,0.770833,0.002303,1420800,1962,01

	# Command to be used:
	awk -F"," 'NR<2 {next} NR >=2 { OFS = "," } {$1="\""$1"\"";$8=substr($1,2,4);$9=substr($1,7,2);$10=substr($1,10,2); print}' \
	GE.csv > GE_out.csv

	# Parameters used:
	# F = "," -> "," is the field separator in the input file
	# NR>=2 -> First line is ignored
	# OFS="," -> Output file uses "," as delimiter

view raw script.sh hosted with ❤ by GitHub

Create the schema you want to use

Along with some metadata, we just need to indicate the column types and basic restrictions (like nullable fields). Just edit a file text like this one (called my_schema.txt):

	{
	"type" : "record",
	"name" : "ge_stock_data",
	"namespace" : "victor.ferrer.stock.data",
	"doc" : "GE",
	"fields" : [
	{ "name" : "date", "type" : "string" },
	{ "name" : "open", "type" : "float"},
	{ "name" : "high", "type" : "float" },
	{ "name" : "low", "type" : "float" },
	{ "name" : "close", "type" : "float"},
	{ "name" : "adj_close","type" : "float" },
	{ "name" : "volume", "type" : "double"},
	{ "name" : "year", "type": "int"},
	{ "name" : "month","type" : "int"}
	]
	}

view raw my_schema.txt hosted with ❤ by GitHub

You can also try to infer the schema from the CSV file (not my favorite choice, though).

Define the partitions

Partition columns will be translated into folders when writing the Parquet files. This will make any query over the tables very efficient, if those partition columns are present in the "where" clause, as only that sub-folders will be loaded from HDFS. Again, create a text file with the partitions (called partitions.txt):

	[ {
	"name" : "year_p",
	"source" : "year",
	"type" : "identity"
	}, {
	"name" : "month_p",
	"source" : "month",
	"type" : "identity"
	} ]

view raw partitions.txt hosted with ❤ by GitHub

One interesting point, is that the value for the partition columns can be derived from one of the "business" columns. For instance, if I have a time field (represented in a long value), it can be used to extract the year, month and day and use those values for partitioning the data (very common strategy for time series).

If you want, the library has a small tool where you can provide the partitions and it will generate the file for you:

	# Create the partitions from the "date" column
	kite-dataset partition-config date:year date:month date:day -s my_schema.txt \
	-o partitions.txt

view raw script.sh hosted with ❤ by GitHub

Execute the import

The first two lines of the script can be used for downloading the tool. If you have already installed it, just execute the rest of the steps:

A dataset is created in the given HDFS location (/victor/testimport) in this case using the provided schema and partition strategy.
Then the file is imported by giving the name of the file and the dataset path.

	# Installation
	curl http://central.maven.org/maven2/org/kitesdk/kite-tools/1.1.0/kite-tools-1.1.0-binary.jar \
	-o kite-dataset
	chmod +x kite-dataset

	# Create the dataset using the schema definition
	kite-dataset create dataset:hdfs:/victor/testimport --schema my_schema.txt \
	--format parquet -p partitions.txt

	# Import the CSV file called GE_out.csv
	kite-dataset csv-import GE_out.csv dataset:hdfs:/victor/testimport --use-hdfs

view raw script.sh hosted with ❤ by GitHub

Verify the import

Just load a Spark DataFrame and show some rows to check that the data was properly loaded:

	val df = sqlContext.read.parquet("/victor/testimport")
	.select($"date",$"close",$"volume",$"year_p")
	.filter("year_p = 1962")
	df.show()

	/*
	+----------+--------+---------+------+
	\| date\| close\| volume\|year_p\|
	+----------+--------+---------+------+
	\|1962-08-01\| 0.6875\|1372800.0\| 1962\|
	\|1962-08-02\|0.692708\|1132800.0\| 1962\|
	\|1962-08-03\|0.697917\|1478400.0\| 1962\|
	\| ... \|
	\|1962-08-24\|0.708333\|1968000.0\| 1962\|
	\|1962-08-27\|0.699219\|1209600.0\| 1962\|
	\|1962-08-28\|0.704427\|1574400.0\| 1962\|
	+----------+--------+---------+------+
	*/

view raw SparkJob.scala hosted with ❤ by GitHub

Resources

Code Notes on Github.io
Stokker
Portfolio Manager
Sparkker

Victor Ferrer´s Java and SW blog

Sunday, 5 August 2018

How to import CSV files into Parquet with Kite SDK

No comments:

Post a Comment