admin管理员组文章数量:1433161
In R, I have the following dataframe with the column "overlap" listing rows that have overlapping values on some other column.
df <- data.frame(overlap = c("1,2,3", "1,2,3", "1,2,3,4", "3,4",
"5,6", "5,6,7", "6,7",
"8,9", "8,9,10", "9,10",
"11,12,13", "11,12,13",
"11,12,13,14", "13,14",
"15,16", "15,16,17", "16,17",
"18,19", "18,19,20", "19,20"))
df
overlap
1 1,2,3
2 1,2,3
3 1,2,3,4
4 3,4
5 5,6
6 5,6,7
7 6,7
8 8,9
9 8,9,10
10 9,10
11 11,12,13
12 11,12,13
13 11,12,13,14
14 13,14
15 15,16
16 15,16,17
17 16,17
18 18,19
19 18,19,20
20 19,20
I would like to identify rows with common values, even if those values are not in all rows, and then keep only 1 of the rows. For example, rows 1-4 contain the combined set 1,2,3,4 and I would like to keep only one of these rows. If we keep the first row, the resulting df would be:
1 1,2,3
5 5,6
8 8,9
11 11,12,13
15 15,16
18 18,19
I've searched many other solutions on here and none include uneven rows lengths, and which is vital as the full data can have rows with dozens of values.
In R, I have the following dataframe with the column "overlap" listing rows that have overlapping values on some other column.
df <- data.frame(overlap = c("1,2,3", "1,2,3", "1,2,3,4", "3,4",
"5,6", "5,6,7", "6,7",
"8,9", "8,9,10", "9,10",
"11,12,13", "11,12,13",
"11,12,13,14", "13,14",
"15,16", "15,16,17", "16,17",
"18,19", "18,19,20", "19,20"))
df
overlap
1 1,2,3
2 1,2,3
3 1,2,3,4
4 3,4
5 5,6
6 5,6,7
7 6,7
8 8,9
9 8,9,10
10 9,10
11 11,12,13
12 11,12,13
13 11,12,13,14
14 13,14
15 15,16
16 15,16,17
17 16,17
18 18,19
19 18,19,20
20 19,20
I would like to identify rows with common values, even if those values are not in all rows, and then keep only 1 of the rows. For example, rows 1-4 contain the combined set 1,2,3,4 and I would like to keep only one of these rows. If we keep the first row, the resulting df would be:
1 1,2,3
5 5,6
8 8,9
11 11,12,13
15 15,16
18 18,19
I've searched many other solutions on here and none include uneven rows lengths, and which is vital as the full data can have rows with dozens of values.
Share Improve this question edited Nov 18, 2024 at 21:25 ThomasIsCoding 104k9 gold badges37 silver badges103 bronze badges asked Nov 18, 2024 at 18:35 bcrewbcrew 1053 bronze badges 2 |3 Answers
Reset to default 6One option for this particular example data is to create an igraph
graph from row overlaps, detect connected components in resulting graph and use component's cluster id as a grouping variable. From there we can pick the first row from every group.
library(dplyr)
library(igraph)
df <- data.frame(overlap = c("1,2,3", "1,2,3", "1,2,3,4", "3,4",
"5,6", "5,6,7", "6,7",
"8,9", "8,9,10", "9,10",
"11,12,13", "11,12,13",
"11,12,13,14", "13,14",
"15,16", "15,16,17", "16,17",
"18,19", "18,19,20", "19,20"))
df |>
mutate(id = row_number(), .before = 1) |>
group_by(
g_clust =
strsplit(overlap, ",") |>
# either create a directed graph or set duplicate = FALSE for
# corner cases like `overlap = c("1", "1,2,3", ...)`
graph_from_adj_list(mode = "all", duplicate = FALSE) |>
components() |>
getElement("membership")
) |>
slice_head(n = 1)
#> # A tibble: 6 × 3
#> # Groups: g_clust [6]
#> id overlap g_clust
#> <int> <chr> <dbl>
#> 1 1 1,2,3 1
#> 2 5 5,6 2
#> 3 8 8,9 3
#> 4 11 11,12,13 4
#> 5 15 15,16 5
#> 6 18 18,19 6
Overlaps graph for reference:
strsplit(df$overlap, ",") |>
graph_from_adj_list(mode = "all", duplicate = FALSE) |>
plot()
We can try {ivs}
:
x = vapply(strsplit(unique(overlap), ","),
\(i) as.numeric(i[c(1, length(i))]), numeric(2))
library(ivs)
int = iv_groups(iv(x[1, ], x[2, ]))
giving
> as.data.frame(int)
y
1 [1, 4)
2 [5, 7)
3 [8, 10)
4 [11, 14)
5 [15, 17)
6 [18, 20)
The vapply
is a bit redundant as we call as.numeric
several times. Do you really want comma-separated integers stored as character?
transform(as.data.frame(int),
s = Vectorize(\(x, y) toString(x:y))(iv_start(int), iv_end(int) - 1))
giving
int s
1 [1, 4) 1, 2, 3
2 [5, 7) 5, 6
3 [8, 10) 8, 9
4 [11, 14) 11, 12, 13
5 [15, 17) 15, 16
6 [18, 20) 18, 19
Edit
@Chris is right in the comment below. I should add some explanation.
(1) Re-structure the data. Split the strings, find first and last value, coerce character
to numeric
.
x = # we assign the output of the pipe x |> ... |> ... to x
overlap |> # access the data
unique() |> # get rid of duplicates (not needed)
strsplit(",") |> # split on ",", we might want to add fixed=TRUE
# returns a list of character vectors, so we iterate over it with lapply
lapply(\(x) x[c(1, length(x))]) |> # get first and last element
# "1,2,3" ---> "1" "2" "3" has length 3 while "3" "4" has length 2
do.call(what="rbind") |> # list to 2-column matrix
type.convert(as.is=TRUE) # we coerce from character to numeric
gives
> x
[,1] [,2]
[1,] 1 3
[2,] 1 4
[3,] 3 4
[4,] 5 6
[5,] 5 7
[6,] 6 7
[7,] 8 9
[8,] 8 10
[9,] 9 10
[10,] 11 13
[11,] 11 14
[12,] 13 14
[13,] 15 16
[14,] 15 17
[15,] 16 17
[16,] 18 19
[17,] 18 20
[18,] 19 20
>
> # of
> class(x)
[1] "matrix" "array"
This obviously assumes that the lowest integer is in first position and the highest in last--a reasonable assumption? Otherwise we should coerce to numeric first and apply range
on each list element.
(2) To create interval vectors, we use iv()
. From it's documentation (cp. help(iv)
):
iv() creates an interval vector from start and end vectors. This is how you will typically create interval vectors, and is often used with columns in a data frame.
i.e.
> library(ivs)
> y = iv(x[, 1], x[, 2])
> y
<iv<integer>[18]>
[1] [1, 3) [1, 4) [3, 4) [5, 6) [5, 7) [6, 7) [8, 9) [8, 10)
[9] [9, 10) [11, 13) [11, 14) [13, 14) [15, 16) [15, 17) [16, 17) [18, 19)
[17] [18, 20) [19, 20)
Finally, we use iv_groups
. From help(iv_groups)
:
This family of functions revolves around grouping overlapping intervals within a single iv. When multiple overlapping intervals are grouped together they result in a wider interval containing the smallest iv_start() and the largest iv_end() of the overlaps.
> z = iv_groups(y)
> z
<iv<integer>[6]>
[1] [1, 4) [5, 7) [8, 10) [11, 14) [15, 17) [18, 20)
(3) It seems like your desired output is a character vector, where all integer sequences of the remaining intervals are pasted together.
To achieve this, we use ivs::iv_start()
and ivs::iv_end()
to access the boundaries of each interval. Now we would like to generate regular sequences. Unfortunately, :
is not vectorised, hence we introduce:
seq2str = Vectorize(\(from, to) toString(from:to))
toSpring()
is a wrapper for format()
, it's help page states
The default method first converts x to character and then concatenates the elements separated by ", ".
Applying our custom function to each start and end gives
> seq2str(from = iv_start(z), to = iv_end(z) - 1)
[1] "1, 2, 3" "5, 6" "8, 9" "11, 12, 13" "15, 16" "18, 19"
Note
You can also use this shorter version.
# input data
overlap = c("1,2,3", "1,2,3", "1,2,3,4", "3,4", "5,6", "5,6,7", "6,7", "8,9",
"8,9,10", "9,10", "11,12,13", "11,12,13", "11,12,13,14", "13,14",
"15,16", "15,16,17", "16,17", "18,19", "18,19,20", "19,20")
# piped version
library(ivs)
overlap |>
strsplit(",") |>
lapply(\(x) x[c(1, length(x))]) |>
do.call(what="rbind") |>
type.convert(as.is=TRUE) |>
list(. = _) |>
with(iv(.[, 1], .[, 2])) |>
iv_groups() |>
list(. = _) |>
with(Vectorize(\(from, to) toString(from:to))(iv_start(.), to = iv_end(.) - 1))
where we use a trick to be able to make use of the forward pipe operator from base.
Here is an igraph
option
df %>%
mutate(id = str_c("r", row_number()), .before = 1) %>%
separate_longer_delim(overlap, ",") %>%
graph_from_data_frame() %>%
components() %>%
membership() %>%
enframe() %>%
filter(startsWith(name, "r")) %>%
mutate(name = as.integer(sub("\\D", "", name))) %>%
arrange(name) %>%
slice_head(n = 1, by = value) %>%
pull(name) %>%
slice(.data = df)
If we run a dummy dataset like below (slightly different the example from OP's data, see the first overlap
value)
df <- data.frame(overlap = c(
"1", "1,2,3", "1,2,3,4", "3,4",
"5,6", "5,6,7", "6,7",
"8,9", "8,9,10", "9,10",
"11,12,13", "11,12,13",
"11,12,13,14", "13,14",
"15,16", "15,16,17", "16,17",
"18,19", "18,19,20", "19,20"
))
we will obtain
overlap
1 1
2 5,6
3 8,9
4 11,12,13
5 15,16
6 18,19
and the original data from OP's question (starting from "1,2,3"
as the first value) gives
overlap
1 1,2,3
2 5,6
3 8,9
4 11,12,13
5 15,16
6 18,19
本文标签: rFilter rows based on combined set of values in a stringStack Overflow
版权声明:本文标题:r - Filter rows based on combined set of values in a string - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1745605444a2665812.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
overlap
values. – ThomasIsCoding Commented Nov 19, 2024 at 10:24