Use Unicode properties for alnum, alpha, etc.#7626
Use Unicode properties for alnum, alpha, etc.#7626joshuamegnauth54 wants to merge 1 commit intoRustPython:mainfrom
Conversation
Rust and Python differ in which properties they use for alphanumeric, numeric, et cetera. Both languages list which properties are used which makes it easy to mimic Python's behavior in Rust. My previous patch was a bit shortsighted because I filtered out combining characters from is_alphanumeric. Using properties is exact and also much cleaner. It also covers edge cases that my initial approach missed. Besides isalnum, I also fixed isnumeric and isdigit in the same way by using properties.
📝 WalkthroughWalkthroughThis PR updates Unicode character classification methods across the codebase to use ICU (International Components for Unicode) general category and numeric type predicates instead of Rust/Python built-in helpers. Changes affect Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
crates/vm/src/builtins/str.rs (2)
959-979: Prefermatches!over array.contains(&..)for small enum sets.
[NumericType::Decimal, NumericType::Digit, NumericType::Numeric].contains(&NumericType::for_char(c))constructs a stack array and does a linear search on every character.matches!is more idiomatic, reads better, and the compiler lowers it to a single jump table / bitwise test. Same applies toisdigiton Line 977.Also: the lingering comment on Line 974 ("python's isdigit also checks if exponents are digits, these are the unicode codepoints for exponents") referred to the old hardcoded superscript list; with
NumericType::Digithandling those uniformly, the tail of that sentence is stale and could be trimmed.♻️ Proposed cleanup
#[pymethod] fn isnumeric(&self) -> bool { !self.data.is_empty() && self.char_all(|c| { - [ - NumericType::Decimal, - NumericType::Digit, - NumericType::Numeric, - ] - .contains(&NumericType::for_char(c)) + matches!( + NumericType::for_char(c), + NumericType::Decimal | NumericType::Digit | NumericType::Numeric + ) }) } #[pymethod] fn isdigit(&self) -> bool { - // python's isdigit also checks if exponents are digits, these are the unicode codepoints for exponents + // Matches CPython: Numeric_Type=Digit or Numeric_Type=Decimal (covers superscripts, etc.). !self.data.is_empty() && self.char_all(|c| { - [NumericType::Digit, NumericType::Decimal].contains(&NumericType::for_char(c)) + matches!( + NumericType::for_char(c), + NumericType::Digit | NumericType::Decimal + ) }) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/builtins/str.rs` around lines 959 - 979, Replace the small-array-based enum membership checks in the isnumeric and isdigit methods with matches! pattern matching to avoid allocating/linear-search semantics: in isnumeric (fn isnumeric) change the closure that currently uses [NumericType::Decimal, NumericType::Digit, NumericType::Numeric].contains(&NumericType::for_char(c)) to a matches! on NumericType::for_char(c) matching Decimal | Digit | Numeric; similarly in isdigit (fn isdigit) replace the [NumericType::Digit, NumericType::Decimal].contains(&NumericType::for_char(c)) check with matches!(NumericType::for_char(c), Digit | Decimal). Also trim the stale portion of the comment inside isdigit that mentions explicit superscript codepoints so the comment only documents that isdigit checks digit/exponent categories if kept.
948-957: Consider consolidatingisalnumto useGeneralCategoryGroup::Numberfor consistency.
GeneralCategoryGroup::Letter ∪ DecimalNumber+c.is_numeric()correctly coversL*,Nd,Nl, andNo(matching CPython's requirements). However, this method mixes ICU and Rust stdlib classification, while sibling methods (isalpha,isdecimal,isnumeric) use ICU exclusively.GeneralCategoryGroup::Number(available in icu_properties v2) expands to exactlyNd | Nl | No, making the refactor below functionally equivalent while consolidating to a single source of truth:♻️ Optional simplification
fn isalnum(&self) -> bool { !self.data.is_empty() && self.char_all(|c| { GeneralCategoryGroup::Letter - .union(GeneralCategoryGroup::DecimalNumber) + .union(GeneralCategoryGroup::Number) .contains(GeneralCategory::for_char(c)) - || c.is_numeric() }) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/builtins/str.rs` around lines 948 - 957, The isalnum implementation in the isalnum method should use ICU-only classification for consistency: replace the current check that unions GeneralCategoryGroup::Letter and GeneralCategoryGroup::DecimalNumber and also calls c.is_numeric() with a single GeneralCategoryGroup::Number.contains(GeneralCategory::for_char(c)) combined with the existing Letter check (i.e., Letter ∪ Number) so char_all only uses ICU properties; update the predicate inside str::isalnum (and keep the !self.data.is_empty() guard) to use GeneralCategoryGroup::Number and remove the c.is_numeric() call.crates/sre_engine/src/string.rs (1)
444-454: Consider unifying numeric classification to ICU exclusively.The
|| c.is_numeric()clause uses Rust's stdlib (which checks General_Category Nd|Nl|No), while the rest of the logic relies on ICU. Both approaches are functionally equivalent, but mixing classification sources makes invariants harder to reason about. For consistency with the ICU-based approach used elsewhere, replaceDecimalNumberwithNumberin the union and remove theis_numeric()fallback:Optional refactoring
- .map(|c| { - GeneralCategoryGroup::Letter - .union(GeneralCategoryGroup::DecimalNumber) - .contains(GeneralCategory::for_char(c)) - || c.is_numeric() - }) + .map(|c| { + GeneralCategoryGroup::Letter + .union(GeneralCategoryGroup::Number) + .contains(GeneralCategory::for_char(c)) + })🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/sre_engine/src/string.rs` around lines 444 - 454, The is_uni_alnum function mixes stdlib numeric classification (c.is_numeric()) with ICU-based checks; change the ICU union to use GeneralCategoryGroup::Number instead of DecimalNumber and remove the c.is_numeric() fallback so all numeric classification comes from ICU (update the union expression in is_uni_alnum and eliminate the || c.is_numeric() branch).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@crates/sre_engine/src/string.rs`:
- Around line 444-454: The is_uni_alnum function mixes stdlib numeric
classification (c.is_numeric()) with ICU-based checks; change the ICU union to
use GeneralCategoryGroup::Number instead of DecimalNumber and remove the
c.is_numeric() fallback so all numeric classification comes from ICU (update the
union expression in is_uni_alnum and eliminate the || c.is_numeric() branch).
In `@crates/vm/src/builtins/str.rs`:
- Around line 959-979: Replace the small-array-based enum membership checks in
the isnumeric and isdigit methods with matches! pattern matching to avoid
allocating/linear-search semantics: in isnumeric (fn isnumeric) change the
closure that currently uses [NumericType::Decimal, NumericType::Digit,
NumericType::Numeric].contains(&NumericType::for_char(c)) to a matches! on
NumericType::for_char(c) matching Decimal | Digit | Numeric; similarly in
isdigit (fn isdigit) replace the [NumericType::Digit,
NumericType::Decimal].contains(&NumericType::for_char(c)) check with
matches!(NumericType::for_char(c), Digit | Decimal). Also trim the stale portion
of the comment inside isdigit that mentions explicit superscript codepoints so
the comment only documents that isdigit checks digit/exponent categories if
kept.
- Around line 948-957: The isalnum implementation in the isalnum method should
use ICU-only classification for consistency: replace the current check that
unions GeneralCategoryGroup::Letter and GeneralCategoryGroup::DecimalNumber and
also calls c.is_numeric() with a single
GeneralCategoryGroup::Number.contains(GeneralCategory::for_char(c)) combined
with the existing Letter check (i.e., Letter ∪ Number) so char_all only uses ICU
properties; update the predicate inside str::isalnum (and keep the
!self.data.is_empty() guard) to use GeneralCategoryGroup::Number and remove the
c.is_numeric() call.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: bed164de-3685-4fff-aa41-dfbe24da72da
📒 Files selected for processing (3)
crates/sre_engine/src/string.rscrates/vm/src/builtins/str.rsextra_tests/snippets/builtin_str.py
🥳 |
Rust and Python differ in which properties they use for alphanumeric, numeric, et cetera. Both languages list which properties are used which makes it easy to mimic Python's behavior in Rust.
My previous patch was a bit shortsighted because I filtered out combining characters from is_alphanumeric. Using properties is exact and also much cleaner. It also covers edge cases that my initial approach missed.
Besides isalnum, I also fixed isnumeric and isdigit in the same way by using properties.
This contributes a bit to #7527. Unicode issues for isalnum etc. should be solved now (and permanently too!!). Title casing looks harder to fix, but the basic solution of using properties more should apply there too. Also, the RegEx engine can benefit from using properties more too, but I'm terrible at RegEx so I didn't really try to fix issues there. 😆
For extra reading, here is Python's docs which lists the categories used. Look at
isalpha(), for example.Summary by CodeRabbit
Release Notes
Bug Fixes
isalpha,isalnum,isdigit,isnumeric) for more accurate handling of international characters and special Unicode categories.Tests