admin管理员组

文章数量:1435505

I have a data table where one variable is messy and can contain different variants of the same value (e.g., team name Newcastle United or Newcastle). These variants occur alongside another grouping-like variable (e.g., both the Premier League and A-League have Newcastle clubs with different team name variants):

team = c('Newcastle United','Newcastle','Newcastle Utd','Newcastle United Jets','Newcastle','Newcastle Jets')
competition=c('Premier League','Premier League','Premier League','A-League','A-League','A-League')
df = tibble(team,competition)
# A tibble: 6 × 2
  team                  competition   
  <chr>                 <chr>         
1 Newcastle United      Premier League
2 Newcastle             Premier League
3 Newcastle Utd         Premier League
4 Newcastle United Jets A-League      
5 Newcastle             A-League      
6 Newcastle Jets        A-League     

I also have a lookup table that specifies the desired team name per competition as follows:

old_name=c('Newcastle','Newcastle Utd','Newcastle','Newcastle United Jets')
new_name=c('Newcastle United','Newcastle United','Newcastle Jets','Newcastle Jets')
competition=c('Premier League','Premier League','A-League','A-League')
lookup=tibble(old_name,new_name,competition)
# A tibble: 4 × 3
  old_name              new_name         competition   
  <chr>                 <chr>            <chr>         
1 Newcastle             Newcastle United Premier League
2 Newcastle Utd         Newcastle United Premier League
3 Newcastle             Newcastle Jets   A-League      
4 Newcastle United Jets Newcastle Jets   A-League    

How can I recode/relabel team such that only the relevant competition from the lookup table is used? I tried combining dplyr's group_by and recode in different ways but no luck so far.

(My real data and lookup tables are much bigger and the data table includes cases that don't have a match in the lookup table.)

Desired output:

# A tibble: 6 × 2
  team             competition   
  <chr>            <chr>         
1 Newcastle United Premier League
2 Newcastle United Premier League
3 Newcastle United Premier League
4 Newcastle Jets   A-League      
5 Newcastle Jets   A-League      
6 Newcastle Jets   A-League    

I have a data table where one variable is messy and can contain different variants of the same value (e.g., team name Newcastle United or Newcastle). These variants occur alongside another grouping-like variable (e.g., both the Premier League and A-League have Newcastle clubs with different team name variants):

team = c('Newcastle United','Newcastle','Newcastle Utd','Newcastle United Jets','Newcastle','Newcastle Jets')
competition=c('Premier League','Premier League','Premier League','A-League','A-League','A-League')
df = tibble(team,competition)
# A tibble: 6 × 2
  team                  competition   
  <chr>                 <chr>         
1 Newcastle United      Premier League
2 Newcastle             Premier League
3 Newcastle Utd         Premier League
4 Newcastle United Jets A-League      
5 Newcastle             A-League      
6 Newcastle Jets        A-League     

I also have a lookup table that specifies the desired team name per competition as follows:

old_name=c('Newcastle','Newcastle Utd','Newcastle','Newcastle United Jets')
new_name=c('Newcastle United','Newcastle United','Newcastle Jets','Newcastle Jets')
competition=c('Premier League','Premier League','A-League','A-League')
lookup=tibble(old_name,new_name,competition)
# A tibble: 4 × 3
  old_name              new_name         competition   
  <chr>                 <chr>            <chr>         
1 Newcastle             Newcastle United Premier League
2 Newcastle Utd         Newcastle United Premier League
3 Newcastle             Newcastle Jets   A-League      
4 Newcastle United Jets Newcastle Jets   A-League    

How can I recode/relabel team such that only the relevant competition from the lookup table is used? I tried combining dplyr's group_by and recode in different ways but no luck so far.

(My real data and lookup tables are much bigger and the data table includes cases that don't have a match in the lookup table.)

Desired output:

# A tibble: 6 × 2
  team             competition   
  <chr>            <chr>         
1 Newcastle United Premier League
2 Newcastle United Premier League
3 Newcastle United Premier League
4 Newcastle Jets   A-League      
5 Newcastle Jets   A-League      
6 Newcastle Jets   A-League    
Share Improve this question asked Nov 16, 2024 at 21:53 mrroymrroy 1257 bronze badges 1
  • 1 Perhaps try one of the methods from stackoverflow/questions/67081496/… ? – jared_mamrot Commented Nov 16, 2024 at 22:37
Add a comment  | 

1 Answer 1

Reset to default 1

An approach using full_join

library(dplyr)

full_join(df, lookup, by = join_by(team == old_name, competition)) %>% 
  mutate(team = coalesce(new_name, team), new_name = NULL)
# A tibble: 6 × 2
  team             competition
  <chr>            <chr>
1 Newcastle United Premier League
2 Newcastle United Premier League
3 Newcastle United Premier League
4 Newcastle Jets   A-League
5 Newcastle Jets   A-League
6 Newcastle Jets   A-League

本文标签: dplyrgrouped recoding with lookup table in RStack Overflow