Dates can be represented in hundreds of different ways but SQLizer does a pretty good job of detecting them, and handling them properly.
Here are a few examples. Consider the following CSV file:
ID, Date Column
1, 01/01/2017
2, 02/01/2017
3, 12/01/2017
It’s clear to any human reading this that the second column should be a date column. But CSV data files don’t specify the types of data being used in each column, whereas databases do. So SQLizer needs to detect the columns that should be dates and sets up the right column type, which it does:
CREATE TABLE IF NOT EXISTS example_dates (
`ID` INT,
`Date_Column` DATETIME
);
INSERT INTO example_dates VALUES
(1,'2017-01-01 00:00:00'),
(2,'2017-01-02 00:00:00'),
(3,'2017-01-12 00:00:00');
IF NOT EXISTS (
select * from sysobjects where name='example_dates' and xtype='U'
) CREATE TABLE example_dates (
[ID] INT,
[Date_Column] DATETIME
);
INSERT INTO example_dates VALUES
(1,'2017-01-01 00:00:00'),
(2,'2017-01-02 00:00:00'),
(3,'2017-01-12 00:00:00');
CREATE TABLE IF NOT EXISTS "example_dates" (
"ID" INT,
"Date_Column" TIMESTAMP
);
INSERT INTO "example_dates" VALUES
(1,'2017-01-01 00:00:00'),
(2,'2017-01-02 00:00:00'),
(3,'2017-01-12 00:00:00');
CREATE TABLE IF NOT EXISTS example_dates (
ID INTEGER,
Date_Column TEXT
);
INSERT INTO example_dates VALUES
(1,'2017-01-01 00:00:00'),
(2,'2017-01-02 00:00:00'),
(3,'2017-01-12 00:00:00');
Notice how the dates in the original file could have been either US dates (mm/dd/yyyy) or the more internationally used dd/mm/yyyy format. But SQLizer assumed it was dd/mm/yyyy.
Now consider this CSV file:
ID, Date Column
1, 01/01/2017
2, 02/01/2017
3, 12/01/2017
4, 12/13/2017
We’ve added a fourth row, and when you look at that row, it’s clear that the dates are in US format: mm/dd/yyyy. If we feed that through to SQLizer, it works out that all the dates in that column should be US dates.
CREATE TABLE IF NOT EXISTS example_dates2 (
`ID` INT,
`Date_Column` DATETIME
);
INSERT INTO example_dates2 VALUES
(1,'2017-01-01 00:00:00'),
(2,'2017-02-01 00:00:00'),
(3,'2017-12-01 00:00:00'),
(4,'2017-12-13 00:00:00');
IF NOT EXISTS (
select * from sysobjects where name='example_dates2' and xtype='U'
) CREATE TABLE example_dates2 (
[ID] INT,
[Date_Column] DATETIME
);
INSERT INTO example_dates2 VALUES
(1,'2017-01-01 00:00:00'),
(2,'2017-02-01 00:00:00'),
(3,'2017-12-01 00:00:00'),
(4,'2017-12-13 00:00:00');
CREATE TABLE IF NOT EXISTS "example_dates2" (
"ID" INT,
"Date_Column" TIMESTAMP
);
INSERT INTO "example_dates2" VALUES
(1,'2017-01-01 00:00:00'),
(2,'2017-02-01 00:00:00'),
(3,'2017-12-01 00:00:00'),
(4,'2017-12-13 00:00:00');
CREATE TABLE IF NOT EXISTS example_dates2 (
ID INTEGER,
Date_Column TEXT
);
INSERT INTO example_dates2 VALUES
(1,'2017-01-01 00:00:00'),
(2,'2017-02-01 00:00:00'),
(3,'2017-12-01 00:00:00'),
(4,'2017-12-13 00:00:00');
A similar thing is happening for different time formats. Consider the following CSV file
ID, Date Column
1, 01/01/2017 11:30
2, 02/01/2017 10:20
3, 12/01/2017 12:12
SQLizer assumes the dates are in 24 hour format, so it’s 11 AM and 12 PM etc.
CREATE TABLE IF NOT EXISTS example_dates3 (
`ID` INT,
`Date_Column` DATETIME
);
INSERT INTO example_dates3 VALUES
(1,'2017-01-01 11:30:00'),
(2,'2017-01-02 10:20:00'),
(3,'2017-01-12 12:12:00');
IF NOT EXISTS (
select * from sysobjects where name='example_dates3' and xtype='U'
) CREATE TABLE example_dates3 (
[ID] INT,
[Date_Column] DATETIME
);
INSERT INTO example_dates3 VALUES
(1,'2017-01-01 11:30:00'),
(2,'2017-01-02 10:20:00'),
(3,'2017-01-12 12:12:00');
CREATE TABLE IF NOT EXISTS "example_dates3" (
"ID" INT,
"Date_Column" TIMESTAMP
);
INSERT INTO "example_dates3" VALUES
(1,'2017-01-01 11:30:00'),
(2,'2017-01-02 10:20:00'),
(3,'2017-01-12 12:12:00');
CREATE TABLE IF NOT EXISTS example_dates3 (
"ID" INT,
"Date_Column" TIMESTAMP
);
INSERT INTO example_dates3 VALUES
(1,'2017-01-01 11:30:00'),
(2,'2017-01-02 10:20:00'),
(3,'2017-01-12 12:12:00');
But if we add AM and PM specifiers to those dates...
ID, Date Column
1, 01/01/2017 11:30 PM
2, 02/01/2017 10:20 AM
3, 12/01/2017 12:12 PM
SQLizer picks those up and uses them:
CREATE TABLE IF NOT EXISTS example_dates4 (
`ID` INT,
`Date_Column` DATETIME
);
INSERT INTO example_dates4 VALUES
(1,'2017-01-01 23:30:00'),
(2,'2017-01-02 10:20:00'),
(3,'2017-01-12 12:12:00');
IF NOT EXISTS (
select * from sysobjects where name='example_dates4' and xtype='U'
) CREATE TABLE example_dates4 (
[ID] INT,
[Date_Column] DATETIME
);
INSERT INTO example_dates4 VALUES
(1,'2017-01-01 23:30:00'),
(2,'2017-01-02 10:20:00'),
(3,'2017-01-12 12:12:00');
CREATE TABLE IF NOT EXISTS "example_dates4" (
"ID" INT,
"Date_Column" TIMESTAMP
);
INSERT INTO "example_dates4" VALUES
(1,'2017-01-01 23:30:00'),
(2,'2017-01-02 10:20:00'),
(3,'2017-01-12 12:12:00');
CREATE TABLE IF NOT EXISTS example_dates4 (
ID INTEGER,
Date_Column TEXT
);
INSERT INTO example_dates4 VALUES
(1,'2017-01-01 23:30:00'),
(2,'2017-01-02 10:20:00'),
(3,'2017-01-12 12:12:00');
Finally, having interpreted the second column as dates, if SQLizer spots some invalid data, like this (The 32nd of Undecember):
ID, Date Column
1, 01/01/2017
2, 02/01/2017
3, 12/01/2017
4, 32/13/2017
It will simply assume that they weren’t supposed to be dates after all, and make the second column a text column:
CREATE TABLE IF NOT EXISTS example_dates5 (
`ID` INT,
`Date_Column` VARCHAR(10) CHARACTER SET utf8
);
INSERT INTO example_dates5 VALUES
(1,'01/01/2017'),
(2,'02/01/2017'),
(3,'12/01/2017'),
(4,'32/13/2017');
IF NOT EXISTS (
select * from sysobjects where name='example_dates5' and xtype='U'
) CREATE TABLE example_dates5 (
[ID] INT,
[Date_Column] NVARCHAR(10)
);
INSERT INTO example_dates5 VALUES
(1,N'01/01/2017'),
(2,N'02/01/2017'),
(3,N'12/01/2017'),
(4,N'32/13/2017');
CREATE TABLE IF NOT EXISTS "example_dates5" (
"ID" INT,
"Date_Column" TEXT
);
INSERT INTO "example_dates5" VALUES
(1,'01/01/2017'),
(2,'02/01/2017'),
(3,'12/01/2017'),
(4,'32/13/2017');
CREATE TABLE IF NOT EXISTS example_dates5 (
ID INTEGER,
Date_Column TEXT
);
INSERT INTO example_dates5 VALUES
(1,' 01/01/2017'),
(2,' 02/01/2017'),
(3,' 12/01/2017'),
(4,' 32/13/2017');
Doing this on a four line CSV file isn’t too hard. But SQLizer sometimes has to detect the correct date format on files containing billions of lines of data. We never take shortcuts on this, we don’t base type decisions on a sample of the data, or the first N rows. We check the whole file before deciding what types to use for each column.