Sunday, 5 August 2018

How to import CSV files into Parquet with Kite SDK

Summary 

In order to perform some data studies, I need to import a CSV file generated by Yahoo Finance, containing historical data for General Electric (GE), into Parquet files in HDFS so I can later run some Spark jobs on it.

The data is a time series and should be partitioned by year and month (just for the sake of the example).

In order to have a quick solution (I do not want to write a Spark Job or Java application), I will use Kite SDK.

Steps to follow
  • Clean the data with a simple AWK script
  • Define the schema of the Parquet file (column types, etc.)
  • Define the partition strategy to follow (for more efficient querying)
  • Execute the import

Clean the data

This is usually not necessary but the CSV file coming from Yahoo Finance needed a couple of tweaks before being imported:

  • The header should be removed
  • From the "date" field (yyyy-mm-dd) I needed two new columns with the year and month values (for data partitioning later on)
  • The first field should be quoted in order to be treated as a string
This is the script; I will not comment on it for the sake of brevity:
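A minimal version could look like this (the file names GE.csv and GE_clean.csv are just examples, and the year and month columns are appended as the last two columns, which must match the schema defined below):

# Skip the header, quote the date field, and append year/month columns derived from it
awk -F',' 'NR > 1 {
  split($1, d, "-")                           # the date comes as yyyy-mm-dd
  printf "\"%s\"", $1                         # quote the date so it is treated as a string
  for (i = 2; i <= NF; i++) printf ",%s", $i  # copy the remaining columns unchanged
  printf ",%s,%s\n", d[1], d[2] + 0           # append year and month (month without a leading zero)
}' GE.csv > GE_clean.csv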


Create the schema you want to use

Along with some metadata, we just need to indicate the column types and basic restrictions (like nullable fields). Just edit a text file like this one (called my_schema.txt):
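For example, a schema along these lines matches the cleaned file (the record name, namespace and nullable price fields are assumptions; the field order follows the cleaned CSV, with year and month at the end):

# Write the Avro schema to my_schema.txt
cat > my_schema.txt <<'EOF'
{
  "type": "record",
  "name": "GEQuote",
  "namespace": "com.example.finance",
  "doc": "Daily GE quotes downloaded from Yahoo Finance",
  "fields": [
    {"name": "date",      "type": "string"},
    {"name": "open",      "type": ["null", "double"], "default": null},
    {"name": "high",      "type": ["null", "double"], "default": null},
    {"name": "low",       "type": ["null", "double"], "default": null},
    {"name": "close",     "type": ["null", "double"], "default": null},
    {"name": "adj_close", "type": ["null", "double"], "default": null},
    {"name": "volume",    "type": ["null", "long"],   "default": null},
    {"name": "year",      "type": "int"},
    {"name": "month",     "type": "int"}
  ]
}
EOF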


You can also try to infer the schema from the CSV file (not my favorite choice, though).
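If you do go that way, the CLI has a csv-schema command that infers the column types from a sample file; something like this (GE.csv and the record name are assumptions, and it relies on the header row to name the fields, so it is run against the original export):

# Infer a starting schema from the raw file
./kite-dataset csv-schema GE.csv --class GEQuote -o my_schema.txt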

Define the partitions

Partition columns are translated into folders when the Parquet files are written. This makes any query over the table much more efficient when those partition columns are present in the "where" clause, as only the matching sub-folders are read from HDFS. Again, create a text file with the partitions (called partitions.txt):
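Since the year and month columns are already materialised in the data, an identity partitioner on each of them is enough; for example (the JSON layout follows Kite's partition-strategy format, and the partition names are assumptions):

# Write the partition strategy to partitions.txt
cat > partitions.txt <<'EOF'
[
  {"type": "identity", "source": "year",  "name": "year"},
  {"type": "identity", "source": "month", "name": "month"}
]
EOF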


One interesting point is that the value for the partition columns can be derived from one of the "business" columns. For instance, if I have a time field (represented as a long value), it can be used to extract the year, month and day, and those values can then be used for partitioning the data (a very common strategy for time series).
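As a hypothetical example, if the schema had a single long field called trade_time holding a timestamp, the strategy could derive both partition values from it instead of storing extra columns:

# Hypothetical alternative: derive the partition values from a long timestamp column
cat > partitions_derived.txt <<'EOF'
[
  {"type": "year",  "source": "trade_time", "name": "year"},
  {"type": "month", "source": "trade_time", "name": "month"}
]
EOF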

If you want, the library has a small tool where you can provide the partitions and it will generate the file for you:
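Something along these lines should do it ("copy" is the CLI shorthand for an identity partition; the flag names follow the Kite 1.1.0 CLI):

# Generate partitions.txt from field:type pairs instead of writing the JSON by hand
./kite-dataset partition-config year:copy month:copy --schema my_schema.txt -o partitions.txt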


Execute the import

The first two lines of the script below can be used for downloading the tool. If you have already installed it, just execute the rest of the steps:

  • A dataset is created in the given HDFS location (/victor/testimport in this case) using the provided schema and partition strategy.
  • Then the file is imported by giving the name of the file and the dataset path.
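The whole thing could look roughly like this (the Kite version in the download URL, the --no-header flag and the file names are assumptions):

# Download the Kite CLI and make it executable (skip these two lines if already installed)
curl http://central.maven.org/maven2/org/kitesdk/kite-tools/1.1.0/kite-tools-1.1.0-binary.jar -o kite-dataset
chmod +x kite-dataset

# Create the dataset in HDFS using the schema and partition strategy defined above
./kite-dataset create dataset:hdfs:/victor/testimport \
  --schema my_schema.txt \
  --partition-by partitions.txt \
  --format parquet

# Import the cleaned CSV (it has no header after the cleanup step)
./kite-dataset csv-import GE_clean.csv dataset:hdfs:/victor/testimport --no-header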


Verify the import

Just load a Spark DataFrame and show some rows to check that the data was properly loaded:
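For example, from spark-shell (assuming Spark 2.x, where the spark session is predefined, and the same HDFS path as above):

# Read the Parquet dataset straight from HDFS and print a few rows
echo 'spark.read.parquet("hdfs:///victor/testimport").show(10)' | spark-shell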

