This is my input:

"Once there     was a (so-called) rock. it.,was not! in fact, a big rock."

I need it to output an array that looks like this:

["Once", " ", "there", " ", "was", " ", "a", ",", "so", " ", "called", ",", "rock", ".", "it", ".", "was", " ", "not", ".", "in", " ", "fact", ",", "a", " ", "big", " ", "rock"]

The input has to go through a few rules to turn the punctuation into this. The rules are:

spaceDelimiters  = " -_" 
commaDelimiters  = ",():;\""
periodDelimiters = ".!?"

Any spaceDelimiter character should be replaced with a space, and likewise for the comma and period groups. Comma has priority over space, and period has priority over comma, so in the example above ".," becomes a single "." and " (" becomes a single ",".

I got to the point where I could remove all of the delimiter characters, but I need them as separate elements of the array, and I need the hierarchy: periods override commas, which override spaces.

Maybe my approach is just wrong? This is what I've got:

import re

def split(string, delimiters):
    regex_pattern = '|'.join(map(re.escape, delimiters))
    return re.split(regex_pattern, string)

This gets everything wrong. It's not even close.

asked Nov 19, 2024 at 11:51 by zealantanner
  • What is delimiters? – no comment Commented Nov 19, 2024 at 12:01
  • "If there's a spaceDelimiter character then it should replace it with a space." - you're not doing any replacing in your current code, you are splitting the input into parts. – C3roe Commented Nov 19, 2024 at 12:02
  • What are you actually trying to do here, in the grand scheme of things? I'm having a hard time coming up with a use case where you'd want to record spaces in an array like that. – CAustin Commented Nov 19, 2024 at 12:20
  • A weird project where I plan to make a speech synthesizer. I want the program to say each word and pause for the appropriate amount of time for spaces, commas, and periods. There are a few other punctuation marks as well, but those can use the same pause lengths as spaces, commas, or periods. Hope that makes sense. – zealantanner Commented Nov 19, 2024 at 12:25
2 Answers

Use the re library to split the text on word boundaries, then apply the replacements in order of precedence:

import re

s = "Once there     was a (so-called) rock. it.,was not! in fact, a big rock."

# split the string into tokens along word boundaries
# (splitting on a zero-width match needs Python 3.7+)
l = re.split(r"\b", s)

def replaceDelimeters(token: str) -> str:
    # each pattern matches a token that contains at least one delimiter of that class
    spaceDelimiters  = r"[^- _]*[- _]+[^- _]*"
    commaDelimiters  = r"[^,():;\"]*[,():;\"]+[^,():;\"]*"
    periodDelimiters = r"[^.!?]*[.!?]+[^.!?]*"

    # substitute highest precedence first, so "." wins over "," and "," over " "
    token = re.sub(periodDelimiters, ".", token)
    token = re.sub(commaDelimiters, ",", token)
    token = re.sub(spaceDelimiters, " ", token)
    return token

# apply to every non-empty token
print([replaceDelimeters(token) for token in l if token != ""])

This method returns "." as the last entry in the list. I don't know if that is what you want: your expected output omits it, but your rules imply it. If you don't want it, deleting the last entry when it is a period is easy enough.
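One caveat with splitting on \b: the regex engine treats _ as a word character, so an input such as so_called is never split, even though _ is listed in spaceDelimiters. Below is a minimal sketch of one way around that, assuming tokens should alternate between runs of letters/digits and runs of everything else. The name tokenize is hypothetical and not part of the answer above.

```python
import re

def tokenize(text):
    # Alternate between runs of word characters other than "_"
    # and runs of everything else (punctuation, spaces, underscores).
    tokens = re.findall(r"[^\W_]+|[\W_]+", text)
    result = []
    for token in tokens:
        # Check the delimiter classes from highest to lowest precedence.
        if re.search(r"[.!?]", token):
            result.append(".")
        elif re.search(r'[,():;"]', token):
            result.append(",")
        elif re.search(r"[ _-]", token):
            result.append(" ")
        else:
            # A word, or a run of characters that no rule covers.
            result.append(token)
    return result

print(tokenize("a (so_called) rock. it.,was not!"))
# ['a', ',', 'so', ' ', 'called', ',', 'rock', '.', 'it', '.', 'was', ' ', 'not', '.']
```

Checking the classes with if/elif in precedence order replaces the three chained re.sub calls; the result for each delimiter run should be the same.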

You can do it with a single regular expression.

Define your rules in precedence order (from lowest to highest) with the replacement character as the initial character:

rules = {
    "space": " _-",  # keep - last so it is a literal inside [...], not a range
    "comma": ",():;\"",
    "period": ".!?",
}

Then build a regular expression with one named group per rule, placing the highest-precedence rule earliest in the pattern. Each rule's group matches a run of delimiter characters that contains at least one character from that rule, together with any number of characters from that rule and from lower-precedence rules. A separate group, other, matches one or more characters that belong to no rule:

import re
from collections import deque

prev = ""
rule_patterns = deque()
for name, rule in rules.items():
    # prev accumulates this rule's characters plus all lower-precedence ones
    prev = rule + prev
    rule_patterns.appendleft(f"(?P<{name}>[{prev}]*?[{rule}][{prev}]*)")
rule_patterns.appendleft(f"(?P<other>[^{prev}]+)")

pattern = re.compile("|".join(rule_patterns))

Which generates the pattern:

(?P<other>[^.!?,():;" _-]+)|(?P<period>[.!?,():;" _-]*?[.!?][.!?,():;" _-]*)|(?P<comma>[,():;" _-]*?[,():;"][,():;" _-]*)|(?P<space>[ _-]*?[ _-][ _-]*)
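To see how the ordering of the alternatives enforces the precedence, the pattern can be checked against a few short delimiter runs. This is only an illustrative sketch: it rebuilds the same pattern and uses Match.lastgroup to report which named group matched. The helper name winning_rule is hypothetical.

```python
import re
from collections import deque

rules = {"space": " _-", "comma": ",():;\"", "period": ".!?"}

prev = ""
rule_patterns = deque()
for name, rule in rules.items():
    prev = rule + prev
    rule_patterns.appendleft(f"(?P<{name}>[{prev}]*?[{rule}][{prev}]*)")
rule_patterns.appendleft(f"(?P<other>[^{prev}]+)")
pattern = re.compile("|".join(rule_patterns))

def winning_rule(run):
    # lastgroup is the name of the named group that produced the match
    return pattern.fullmatch(run).lastgroup

for run in ["rock", "-", " (", ") ", ".,", ",. "]:
    print(repr(run), "->", winning_rule(run))
# 'rock' -> other
# '-' -> space
# ' (' -> comma
# ') ' -> comma
# '.,' -> period
# ',. ' -> period
```

The position of the highest-precedence character inside the run does not matter, because the lazy prefix in each group can skip over lower-precedence characters before it.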

Then given your value:

value = "Once there     was a (so-called) rock. it.,was not! in fact, a big rock."

You can then find all the matches and, where a rule is matched, output the first character of that rule instead of the matched text:

matches = [
    next(
        (rule[0] for name, rule in rules.items() if match.group(name)),
        match.group("other")
    )
    for match in pattern.finditer(value)
]

print(matches)

Outputs:

['Once', ' ', 'there', ' ', 'was', ' ', 'a', ',', 'so', ' ', 'called', ',', 'rock', '.', 'it', '.', 'was', ' ', 'not', '.', 'in', ' ', 'fact', ',', 'a', ' ', 'big', ' ', 'rock', '.']
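Note that this result ends with a "." token, which the expected output in the question does not have. If the trailing pause is unwanted, it can be trimmed afterwards. Here is a small sketch, where strip_trailing_delimiters is a hypothetical helper name and the three replacement characters are taken from the rules above:

```python
def strip_trailing_delimiters(tokens, delimiters=(" ", ",", ".")):
    # Work on a copy so the caller's list is left untouched.
    trimmed = list(tokens)
    # Drop delimiter tokens from the end until a word is reached.
    while trimmed and trimmed[-1] in delimiters:
        trimmed.pop()
    return trimmed

print(strip_trailing_delimiters(["a", " ", "big", " ", "rock", "."]))
# ['a', ' ', 'big', ' ', 'rock']
```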
