How do I parse the escape characters of the content of a string literal input with nom?

from erayerdin@programming.dev to rust@programming.dev on 29 Jul 2024 13:23
https://programming.dev/post/17473388

So, I’m basically trying to parse a string literal with nom. This is the code I’ve come up with:

use nom::{
    bytes::complete::{tag, take_until},
    sequence::delimited,
    IResult,
};

/// Parses string literals.
fn parse_literal<'a>(input: &'a str) -> IResult<&'a str, &'a str> {
    // escape tag identifier is the same as delimiter, obviously
    let escape_tag_identifier =
        input
            .chars()
            .nth(0)
            .ok_or(nom::Err::Error(nom::error::Error::new(
                input,
                nom::error::ErrorKind::Verify,
            )))?;

    let (remaining, value) = delimited(
        tag(escape_tag_identifier.to_string().as_str()),
        take_until(match escape_tag_identifier {
            '\'' => "'",
            '"' => "\"",
            _ => unreachable!("parse_literal>>take_until branched into unreachable."),
        }),
        tag(escape_tag_identifier.to_string().as_str()),
    )(input)?;

    Ok((remaining, value))
}

#[cfg(test)]
mod literal_tests {
    use super::*;
    // The #[rstest] and #[case] attributes come from the rstest crate.
    use rstest::rstest;

    #[rstest]
    #[case(r#""foo""#, "foo")]
    #[case(r#""foo bar""#, "foo bar")]
    #[case(r#""foo \" bar""#, r#"foo " bar"#)]
    fn test_dquotes(#[case] input: &str, #[case] expected_output: &str) {
        let result = parse_literal(input);
        assert_eq!(result, Ok(("", expected_output)));
    }

    #[rstest]
    #[case("'foo'", "foo")]
    #[case("'foo bar'", "foo bar")]
    #[case(r#"'foo \' bar'"#, "foo ' bar")]
    fn test_squotes(#[case] input: &str, #[case] expected_output: &str) {
        let result = parse_literal(input);
        assert_eq!(result, Ok(("", expected_output)));
    }

    #[rstest]
    #[case(r#""foo'"#, "foo'")]
    #[case(r#"'foo""#, r#"foo""#)]
    fn test_errs(#[case] input: &str, #[case] expected_err_input: &str) {
        let result = parse_literal(input);
        assert_eq!(
            result,
            Err(nom::Err::Error(nom::error::Error::new(
                expected_err_input,
                nom::error::ErrorKind::TakeUntil
            ))),
        );
    }
}

Note: The example uses rstest for tests.

Although it looks a little complex, it really is not. The parsing function is parse_literal, and the tests are split into double quotes, single quotes, and errors.

When you run the tests, you will see that the first and second cases for both single and double quotes pass. The problem is the third case of each: #[case(r#""foo \" bar""#, r#"foo " bar"#)] for test_dquotes and #[case(r#"'foo \' bar'"#, "foo ' bar")] for test_squotes.

Ideally, if a string literal is delimited with single quotes and has a single quote in its content, that quote can be escaped with a backslash. The same goes for double quotes. To demonstrate in pseudocode:

"foo ' bar" // is ok
"foo \" bar" // is ok
"foo " bar" // is err
'foo " bar' // is ok
'foo \' bar' // is ok
'foo ' bar' // is err

Currently, the code takes characters up to the delimiter with take_until. Let's say the input is guaranteed to contain nothing but the string literal, so the first delimiter found is normally the closing one at the very end of the input. That is why the first and second cases in the tests work.

But, of course, this fails in the third case of each test: the content contains an escaped delimiter character, so take_until stops at it, the parser finishes early, and the rest of the literal comes back as the remaining input.

This is only for research purposes, so you do not need to give a fully-featured answer. A pathway is, as well, appreciated.

Thanks in advance.

#rust


oliveoilcheff@programming.dev on 29 Jul 2024 13:34

Have you tried the escaped function? docs.rs/nom/latest/nom/bytes/…/fn.escaped.html

erayerdin@programming.dev on 29 Jul 2024 13:59

Hmm, didn’t see that. Lemme play with that a little. Maybe I can come up with something. Thank you.

calcopiritus@lemmy.world on 29 Jul 2024 19:08

What I do to parse strings (pseudo code since I’m on mobile, don’t copy-paste):

delimited(
    ",
    many0(alt(
        any_character_except_quote_or_slash,
        pair('\', escaped_char)
    )),
    "
)

Where any_character_except_quote_or_slash and escaped_char are defined somewhere else; the rest of the parsers come from nom.

You may want to wrap pair with a map and many0 with recognize.