Foreign Tables with CSV in action
FOREIGN TABLES is a rather fresh feature in IRIS (2023.?)
So I was motivated to try something new by own hands.
Documentation of Foreign Table from File is a good starting point.
Also the related promotional video is fine to start with.
The advantages of the approach are pretty obvious to me
- differently to SQL LOAD DATA data remain outside IRIS
- you pick out what is useful to you and leave the remaining junk aside
- this seems to be especially attractive for UPDATE and ADD-ON data
- when you are done you have no pollution of your IRIS instance
On my way to create the example I met some facts that I want to share
- The requirement of a JAVA language server is only mentioned in LOAD DATA
- You fail at first access if it is missing. Docker Containers have it.
- My personal environment doesn't need it.
- Next surprise: What is this FOREIGN SERVER for?
- Just a placeholder to point to your directory of your CSV-file
- Creation of the TABLE is pretty straight forward and very flexible
- Skipping USING clause presented my headers as first ROW of the table
- And as it's no real table and had no ROW numbers I missed my numeric ID's
- No surprise: It is READ ONLY, No Updates
- Also: Finding the right mix of single and double quotes is some challenge
- Similar: Data presentation in CSV can be tricky
- Most simple approach: - thinking of date formats -
- Declare critical columns as VARCHAR
- Then handle useful transformations or content check locally
- Finally:
- the Table is visible for SQL but there is no related Class to it
- you have no chance to play with CALCULATED properties
Summary
- A great and highly useful feature
- You have to understand the limits
- Adjust your expectations to the available options
- Test also the "impossible" combinations - they will pop up
Enjoy the demo package
GitHub
Comments
thanks for sharing.
I'm testing the feature.
Generally, when we access a datalake, the source csv files are organized in the same folder with their respective timestamp in the name, example:
transactions20240126.csv
transactions20240127.csv
etc...
Is there any wildcard to load multiple files?
To loop/list all files within a directory, possibly with wildcard filter, you can use the FileSet class query in %Library.File class, see documentation on Query Directories and Drives, List the Contents of a Directory.