I have a 'csv' file to parse (or actually a ton of them). The format is quite unorthodox:
# comment
unquoted text;timestamp1;k=v;k2=v2;....;k5=v5
unquoted text;timestamp2;k=v1;k2=v12;....;kn=vn
Output I'd like to achieve:
Group |col1 |col2 |k |k2 |k5 |...|kn
-------|-------------|---------|--|---|----|---|----
comment|unquoted text|timestamp1| v| v2| v5|...|null
comment|unquoted text|timestamp2|v1|v12|null|...|vn
While Group could also be the filename of the csv (even better now that I write it). The k=v part does not resolve to the same number of columns. Optimally I'd like to have those pairs translated to columns and each row prefixed with the identifier from the comment (or the complete comment).
What I want to achieve: a SELECT * FROM read_csv('*.txt') producing the desired output.
But reading with read_csv already fails.
3 Answers
You should be able to use read_csv with additional parameters: filename=true to include the filename, and null_padding=true to pad missing columns with nulls.
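For reference, a minimal sketch of what such a read_csv call returns on its own, before any reshaping (it assumes the sample file written in the full example below is present in the working directory):
import duckdb
# Read the semicolon-delimited file as headerless CSV: null_padding=true pads
# short rows with NULLs, filename=true adds the source file name as a column.
duckdb.sql("""
from read_csv('*.txt', delim=';', header=false, null_padding=true, filename=true)
""").show()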
import duckdb

# write a small sample file matching the format in the question
with open("79202754.txt", "w") as f:
    f.write("""\
# comment
unquoted text;timestamp1;k=v;k2=v2;...;k5=v5;k6=v6;k7=v7
unquoted text;timestamp2;k=v1;k2=v12;...;kn=vn;k5=v5row2
""")

duckdb.sql("""
pivot (
    from (
        unpivot (
            from read_csv('*.txt', delim=';', header=false, null_padding=true, filename=true)
            select
                *,
                row_number() over () as row_nr  -- remember which source row each value came from
        )
        on * exclude (row_nr, column0, column1, filename)  -- melt all k=v columns into rows
    )
    select
        *,
        split(value, '=')[1] as k,  -- key part before '='
        split(value, '=')[2] as v   -- value part after '='
)
on k
using first(v)
group by (row_nr, column0, column1, filename)
""")
┌────────┬───────────────┬────────────┬──────────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ row_nr │ column0 │ column1 │ filename │ ... │ k │ k2 │ k5 │ k6 │ k7 │ kn │
│ int64 │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├────────┼───────────────┼────────────┼──────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 3 │ unquoted text │ timestamp2 │ 79202754.txt │ NULL │ v1 │ v12 │ v5row2 │ NULL │ NULL │ vn │
│ 2 │ unquoted text │ timestamp1 │ 79202754.txt │ NULL │ v │ v2 │ v5 │ v6 │ v7 │ NULL │
└────────┴───────────────┴────────────┴──────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
UNPIVOT
The first step is to unpivot so you get a single column of k=v values which you then split into 2 columns.
duckdb.sql("""
unpivot (
from read_csv('*.txt', delim=';', header=false, null_padding=true, filename=true)
select
*,
row_number() over () as row_nr
)
on * exclude (row_nr, column0, column1, filename)
""")
┌───────────────┬────────────┬──────────────┬────────┬─────────┬───────────┐
│ column0 │ column1 │ filename │ row_nr │ name │ value │
│ varchar │ varchar │ varchar │ int64 │ varchar │ varchar │
├───────────────┼────────────┼──────────────┼────────┼─────────┼───────────┤
│ unquoted text │ timestamp1 │ 79202754.txt │ 2 │ column2 │ k=v │
│ unquoted text │ timestamp1 │ 79202754.txt │ 2 │ column3 │ k2=v2 │
│ unquoted text │ timestamp1 │ 79202754.txt │ 2 │ column4 │ ... │
│ unquoted text │ timestamp1 │ 79202754.txt │ 2 │ column5 │ k5=v5 │
│ unquoted text │ timestamp1 │ 79202754.txt │ 2 │ column6 │ k6=v6 │
│ unquoted text │ timestamp1 │ 79202754.txt │ 2 │ column7 │ k7=v7 │
│ unquoted text │ timestamp2 │ 79202754.txt │ 3 │ column2 │ k=v1 │
│ unquoted text │ timestamp2 │ 79202754.txt │ 3 │ column3 │ k2=v12 │
│ unquoted text │ timestamp2 │ 79202754.txt │ 3 │ column4 │ ... │
│ unquoted text │ timestamp2 │ 79202754.txt │ 3 │ column5 │ kn=vn │
│ unquoted text │ timestamp2 │ 79202754.txt │ 3 │ column6 │ k5=v5row2 │
├───────────────┴────────────┴──────────────┴────────┴─────────┴───────────┤
│ 11 rows 6 columns │
└──────────────────────────────────────────────────────────────────────────┘
You then PIVOT back into a wide format.
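As a stand-alone illustration of that split-and-pivot step, here is a minimal sketch; the inline VALUES relation is a stand-in for the unpivoted rows shown above:
import duckdb
# Stand-in for the unpivot output: one row per k=v pair, keyed by source row number.
duckdb.sql("""
pivot (
    from (values (2, 'k=v'), (2, 'k2=v2'), (3, 'k=v1'), (3, 'k2=v12')) as t(row_nr, value)
    select
        row_nr,
        split(value, '=')[1] as k,  -- key part before '='
        split(value, '=')[2] as v   -- value part after '='
)
on k
using first(v)
group by row_nr
""").show()
This yields one row per row_nr with one column per distinct key.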
You basically have two choices: ingest the data using read_csv() and do the post-processing in DuckDB, or pre-process the data so it can easily be read by DuckDB.
The former would be quite tricky, whereas the latter is very easy using a tool such as jq. In the following, I'll use jq as the preprocessing tool.
One advantage of the following program is that it makes relatively weak assumptions. In particular, it will produce correct results even if the keys are not in the same order on each line.
The program of course does make some assumptions, notably that the first line of data mentions all the keys of interest.
If this assumption cannot be made, then the following program could still be used with small modifications to allow the full roster of keys to be determined.
# Input: an array of key=value strings
# Output: an object
def toObject:
reduce (.[] | split("=")) as $kv ({};
.[$kv[0]] = $kv[1]);
# Input: a raw semicolon-delimited data line
# Output: an array of the key names found on that line
def keyNames:
split(";")
| reduce (.[2:][] | split("=")) as $kv ([];
. + [$kv[0]] );
def header($group; $keyNames):
["Group", "col1", "col2"] + $keyNames;
input | sub("^# *"; "") as $group
| input as $line
| ($line | keyNames) as $keyNames
| header($group; $keyNames) ,
( ($line, inputs)
| split(";") as $columns
| [$group] + $columns[0:2] + [ $columns[2:] | toObject | .[$keyNames[]]]
)
| @tsv
Using the sample input and an invocation of the form: jq -nrR ... the output would be TSV records as follows:
Group col1 col2 k k2 .... k5
comment unquoted text timestamp1 v v2 v5
comment unquoted text timestamp2 v1 v12
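For instance, an invocation along the lines of jq -nrR -f parse.jq input.txt would do, where parse.jq is a placeholder name for a file holding the program above and input.txt is the data file.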
Just to add how you can parse the # comment line and convert it to a group_col for each group:
import duckdb
from io import StringIO
data = """
# comment
unquoted text;timestamp1;k=v;k2=v2;...;k5=v5;k6=v6;k7=v7
unquoted text;timestamp2;k=v1;k2=v12;...;kn=vn;k5=v5row2
"""
rel = duckdb.read_csv(StringIO(data), delimiter=';', header=False, null_padding=True)
duckdb.sql("""
with c1 as (
from rel
select
*,
last_value(
case when column1 is null then column0 end ignore nulls
) over() as group_col
), c2 as (
unpivot (
from c1
select *, row_number() over() as rn
where column1 is not null
)
on * exclude (rn, group_col, column0, column1)
)
pivot (
from c2
select
rn, group_col, column0, column1,
split(value, '=')[1] as k,
split(value, '=')[2] as v
)
on k
using any_value(v)
""")
┌───────┬───────────┬───────────────┬────────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ rn │ group_col │ column0 │ column1 │ ... │ k │ k2 │ k5 │ k6 │ k7 │ kn │
│ int64 │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├───────┼───────────┼───────────────┼────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ # comment │ unquoted text │ timestamp1 │ NULL │ v │ v2 │ v5 │ v6 │ v7 │ NULL │
│ 2 │ # comment │ unquoted text │ timestamp2 │ NULL │ v1 │ v12 │ v5row2 │ NULL │ NULL │ vn │
└───────┴───────────┴───────────────┴────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
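As a side note, here is a minimal stand-alone sketch of the last_value(... ignore nulls) trick used above, with the raw rows inlined for illustration:
import duckdb
# The comment line only has column0 populated; the window call picks the last
# non-NULL comment value and attaches it to every row as group_col.
duckdb.sql("""
select
    *,
    last_value(
        case when column1 is null then column0 end ignore nulls
    ) over () as group_col
from (values
    ('# comment',     NULL),
    ('unquoted text', 'timestamp1'),
    ('unquoted text', 'timestamp2')
) as src(column0, column1)
""").show()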
Comment from jqurious (Nov 19, 2024): ... from read_csv('foo.csv', delim=';', header=false, null_padding=true, filename=true) and you can UNPIVOT, transform the k=v column and then PIVOT back to a wide format. Maybe you can ask on the DuckDB Discord.