I have a 'csv' file to parse (or actually a ton of them). The format is quite unorthodox:
# comment
unquoted text;timestamp1;k=v;k2=v2;....;k5=v5
unquoted text;timestamp2;k=v1;k2=v12;....;kn=vn
Output I'd like to achieve:
Group |col1 |col2 |k |k2 |k5 |...|kn
-------|-------------|---------|--|---|----|---|----
comment|unquoted text|timestamp1| v| v2| v5|...|null
comment|unquoted text|timestamp2|v1|v12|null|...|vn
While Group could also be the filename of the csv (even better now that I write it). The k=v part does not resolve to the same number of columns. Optimally I'd like to have those pairs translated to columns and each row prefixed with the identifier from the comment (or the complete comment).
What I want to achieve: a SELECT * FROM read_csv('*.txt') producing the desired output.
But reading with read_csv already fails.
3 Answers
You should be able to use read_csv with additional parameters: filename=true to include the filename, and null_padding=true to pad missing columns with nulls.
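For reference, a minimal sketch of what such a read_csv call returns on its own, before any reshaping (it assumes the sample file written in the full example below is present in the working directory):
import duckdb
# Read the semicolon-delimited file as headerless CSV: null_padding=true pads
# short rows with NULLs, filename=true adds the source file name as a column.
duckdb.sql("""
from read_csv('*.txt', delim=';', header=false, null_padding=true, filename=true)
""").show()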
import duckdb

# write a small sample file matching the format in the question
with open("79202754.txt", "w") as f:
    f.write("""\
# comment
unquoted text;timestamp1;k=v;k2=v2;...;k5=v5;k6=v6;k7=v7
unquoted text;timestamp2;k=v1;k2=v12;...;kn=vn;k5=v5row2
""")

duckdb.sql("""
pivot (
    from (
        unpivot (
            from read_csv('*.txt', delim=';', header=false, null_padding=true, filename=true)
            select
                *,
                row_number() over () as row_nr  -- remember which source row each value came from
        )
        on * exclude (row_nr, column0, column1, filename)  -- melt all k=v columns into rows
    )
    select
        *,
        split(value, '=')[1] as k,  -- key part before '='
        split(value, '=')[2] as v   -- value part after '='
)
on k
using first(v)
group by (row_nr, column0, column1, filename)
""")
┌────────┬───────────────┬────────────┬──────────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ row_nr │ column0 │ column1 │ filename │ ... │ k │ k2 │ k5 │ k6 │ k7 │ kn │
│ int64 │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├────────┼───────────────┼────────────┼──────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 3 │ unquoted text │ timestamp2 │ 79202754.txt │ NULL │ v1 │ v12 │ v5row2 │ NULL │ NULL │ vn │
│ 2 │ unquoted text │ timestamp1 │ 79202754.txt │ NULL │ v │ v2 │ v5 │ v6 │ v7 │ NULL │
└────────┴───────────────┴────────────┴──────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
UNPIVOT
The first step is to unpivot so you get a single column of k=v values which you then split into 2 columns.
duckdb.sql("""
unpivot (
from read_csv('*.txt', delim=';', header=false, null_padding=true, filename=true)
select
*,
row_number() over () as row_nr
)
on * exclude (row_nr, column0, column1, filename)
""")
┌───────────────┬────────────┬──────────────┬────────┬─────────┬───────────┐
│ column0 │ column1 │ filename │ row_nr │ name │ value │
│ varchar │ varchar │ varchar │ int64 │ varchar │ varchar │
├───────────────┼────────────┼──────────────┼────────┼─────────┼───────────┤
│ unquoted text │ timestamp1 │ 79202754.txt │ 2 │ column2 │ k=v │
│ unquoted text │ timestamp1 │ 79202754.txt │ 2 │ column3 │ k2=v2 │
│ unquoted text │ timestamp1 │ 79202754.txt │ 2 │ column4 │ ... │
│ unquoted text │ timestamp1 │ 79202754.txt │ 2 │ column5 │ k5=v5 │
│ unquoted text │ timestamp1 │ 79202754.txt │ 2 │ column6 │ k6=v6 │
│ unquoted text │ timestamp1 │ 79202754.txt │ 2 │ column7 │ k7=v7 │
│ unquoted text │ timestamp2 │ 79202754.txt │ 3 │ column2 │ k=v1 │
│ unquoted text │ timestamp2 │ 79202754.txt │ 3 │ column3 │ k2=v12 │
│ unquoted text │ timestamp2 │ 79202754.txt │ 3 │ column4 │ ... │
│ unquoted text │ timestamp2 │ 79202754.txt │ 3 │ column5 │ kn=vn │
│ unquoted text │ timestamp2 │ 79202754.txt │ 3 │ column6 │ k5=v5row2 │
├───────────────┴────────────┴──────────────┴────────┴─────────┴───────────┤
│ 11 rows 6 columns │
└──────────────────────────────────────────────────────────────────────────┘
You then PIVOT back into a wide format.
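As a stand-alone illustration of that split-and-pivot step, here is a minimal sketch; the inline VALUES relation is a stand-in for the unpivoted rows shown above:
import duckdb
# Stand-in for the unpivot output: one row per k=v pair, keyed by source row number.
duckdb.sql("""
pivot (
    from (values (2, 'k=v'), (2, 'k2=v2'), (3, 'k=v1'), (3, 'k2=v12')) as t(row_nr, value)
    select
        row_nr,
        split(value, '=')[1] as k,  -- key part before '='
        split(value, '=')[2] as v   -- value part after '='
)
on k
using first(v)
group by row_nr
""").show()
This yields one row per row_nr with one column per distinct key.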
You basically have two choices: ingest the data using read_csv() and do the post-processing in DuckDB, or pre-process the data so it can easily be read by DuckDB.
The former would be quite tricky, whereas the latter is very easy using a tool such as jq. In the following, I'll use jq as the preprocessing tool.
One advantage of the following program is that it makes relatively weak assumptions. In particular, it will produce correct results even if the keys are not in the same order on each line.
The program of course does make some assumptions, notably that the first line of data mentions all the keys of interest.
If this assumption cannot be made, then the following program could still be used with small modifications to allow the full roster of keys to be determined.
# Input: an array of key=value strings
# Output: an object
def toObject:
reduce (.[] | split("=")) as $kv ({};
.[$kv[0]] = $kv[1]);
# Input: a raw semicolon-delimited data line
# Output: an array of the key names found on that line
def keyNames:
split(";")
| reduce (.[2:][] | split("=")) as $kv ([];
. + [$kv[0]] );
def header($group; $keyNames):
["Group", "col1", "col2"] + $keyNames;
input | sub("^# *"; "") as $group
| input as $line
| ($line | keyNames) as $keyNames
| header($group; $keyNames) ,
( ($line, inputs)
| split(";") as $columns
| [$group] + $columns[0:2] + [ $columns[2:] | toObject | .[$keyNames[]]]
)
| @tsv
Using the sample input and an invocation of the form: jq -nrR ... the output would be TSV records as follows:
Group col1 col2 k k2 .... k5
comment unquoted text timestamp1 v v2 v5
comment unquoted text timestamp2 v1 v12
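For instance, an invocation along the lines of jq -nrR -f parse.jq input.txt would do, where parse.jq is a placeholder name for a file holding the program above and input.txt is the data file.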
Just to add how you can parse the # comment line and convert it to a group_col for each group:
import duckdb
from io import StringIO
data = """
# comment
unquoted text;timestamp1;k=v;k2=v2;...;k5=v5;k6=v6;k7=v7
unquoted text;timestamp2;k=v1;k2=v12;...;kn=vn;k5=v5row2
"""
rel = duckdb.read_csv(StringIO(data), delimiter=';', header=False, null_padding=True)
duckdb.sql("""
with c1 as (
from rel
select
*,
last_value(
case when column1 is null then column0 end ignore nulls
) over() as group_col
), c2 as (
unpivot (
from c1
select *, row_number() over() as rn
where column1 is not null
)
on * exclude (rn, group_col, column0, column1)
)
pivot (
from c2
select
rn, group_col, column0, column1,
split(value, '=')[1] as k,
split(value, '=')[2] as v
)
on k
using any_value(v)
""")
┌───────┬───────────┬───────────────┬────────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ rn │ group_col │ column0 │ column1 │ ... │ k │ k2 │ k5 │ k6 │ k7 │ kn │
│ int64 │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├───────┼───────────┼───────────────┼────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ # comment │ unquoted text │ timestamp1 │ NULL │ v │ v2 │ v5 │ v6 │ v7 │ NULL │
│ 2 │ # comment │ unquoted text │ timestamp2 │ NULL │ v1 │ v12 │ v5row2 │ NULL │ NULL │ vn │
└───────┴───────────┴───────────────┴────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
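As a side note, here is a minimal stand-alone sketch of the last_value(... ignore nulls) trick used above, with the raw rows inlined for illustration:
import duckdb
# The comment line only has column0 populated; the window call picks the last
# non-NULL comment value and attaches it to every row as group_col.
duckdb.sql("""
select
    *,
    last_value(
        case when column1 is null then column0 end ignore nulls
    ) over () as group_col
from (values
    ('# comment',     NULL),
    ('unquoted text', 'timestamp1'),
    ('unquoted text', 'timestamp2')
) as src(column0, column1)
""").show()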
Comment from jqurious (Nov 19, 2024): ... from read_csv('foo.csv', delim=';', header=false, null_padding=true, filename=true) and you can UNPIVOT, transform the k=v column and then PIVOT back to a wide format. Maybe you can ask on the DuckDB Discord.