Matching Patterns with dcbor CLI
The dcbor CLI tool includes powerful pattern matching capabilities that allow you to search for, extract, and validate specific structures within dCBOR data. This chapter introduces the dcbor match subcommand, which leverages the comprehensive pattern expression (AKA "patex") syntax of the dcbor-pattern crate to enable sophisticated data analysis and extraction workflows.
This chapter builds on the foundation established in The dcbor Command Line Tool chapter. If you haven't read that chapter yet, we recommend doing so first to familiarize yourself with the basic dcbor CLI operations.
What is Pattern Matching?
Pattern matching in the context of dCBOR allows you to:
- Find specific data structures within complex CBOR documents
- Extract values that match certain criteria
- Validate data conformance to expected patterns
- Find the paths that lead to matching values within nested structures
- Transform data by capturing and reformatting matches
The dcbor match Command
The basic syntax of the dcbor match command is:
dcbor match <PATTERN> [INPUT] [OPTIONS]
Where:
- <PATTERN> is a pattern expression (AKA "patex") written in the dcbor-pattern expression syntax we'll explore in detail
- [INPUT] is the dCBOR data to match against (or read from stdin)
- [OPTIONS] control input/output formats and matching behavior
Pattern Syntax Reference
You can find a complete reference for the patex syntax in the dCBOR Expression Syntax Appendix. This appendix provides a quick reference for the patex syntax, including value patterns, structure patterns, and meta patterns we'll cover later.
Value Patterns
Value patterns are the foundation of dCBOR pattern matching. They allow you to match specific data types and exact values. Let's start with the most basic patterns and build up your understanding progressively.
Numbers
Recall that if you simply type:
dcbor 42
You get back the hex representation of the CBOR number 42:
│ 182a
If you want the CBOR diagnostic notation, you can use the -o diag option:
dcbor -o diag 42
│ 42
In the examples in this chapter, the actual patex used is shown in its own block, and referred to in the command lines that follow it as $PATTERN. So when you see a block like this:
PATTERN=
number
What we're hiding is that we really wrote this:
PATTERN=$(cat <<'EOF'
number
EOF
)
This little bit of heredoc awkwardness is the most reliable way to make sure everything in a pattern is assigned to a shell variable verbatim. For many patterns you won't need to use it yourself. But if you do, now you know.
What if you have two pieces of CBOR data, and you want to check whether one of them is a number?
CBOR1=182a
CBOR2=6548656c6c6f
You can use the dcbor match command to check whether either of these is a number:
NUMBER=
number
dcbor match $NUMBER -i hex $CBOR1
│ 42
dcbor match $NUMBER -i hex $CBOR2
│ Error: No match
We can see that CBOR1 is the number 42, and CBOR2 is not a numeric value. So let's see whether it is a text string by using the text pattern:
TEXT=
text
dcbor match $TEXT -i hex $CBOR2
│ "Hello"
The pattern matches, and we can see it is the string "Hello".
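If you are curious why one hex string is a number and the other is text, the answer is in the first byte: in CBOR, its top three bits carry the "major type". The following Python sketch is only an illustration and is not part of the dcbor CLI; the helper name major_type is made up for this example.

```python
# Illustrative only: read the CBOR major type from the first encoded byte.
# major_type() is a hypothetical helper, not something the dcbor CLI provides.
MAJOR_TYPES = {
    0: "unsigned integer",
    1: "negative integer",
    2: "byte string",
    3: "text string",
    4: "array",
    5: "map",
    6: "tag",
    7: "simple value or float",
}


def major_type(hex_string: str) -> str:
    """Return the major type named by the top three bits of the first byte."""
    first_byte = bytes.fromhex(hex_string)[0]
    return MAJOR_TYPES[first_byte >> 5]


print(major_type("182a"))          # unsigned integer
print(major_type("6548656c6c6f"))  # text string
```

The remaining five bytes of the second value are simply the UTF-8 encoding of Hello.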
The number pattern matches any numeric value, whether it's an integer or floating-point number:
NUMBER=
number
dcbor match $NUMBER 42
│ 42
dcbor match $NUMBER 3.14
│ 3.14
To avoid confusion with command-line flags, you can use -- to separate the pattern from the input. -- signals that there are no command-line flags following it, allowing you to pass values that might otherwise be interpreted as flags. This is especially useful for negative numbers or special values like -Infinity.
NUMBER=
number
dcbor match $NUMBER -- -1
│ -1
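The -- convention is not specific to dcbor; most argument parsers honor it. As a rough illustration, here is the same behavior in Python's argparse. This is a hypothetical stand-in with made-up argument names, and it says nothing about how dcbor itself parses its arguments.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for a CLI that takes a pattern and an input value.
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("-i", "--in", dest="in_format", default="diag")
    parser.add_argument("pattern")
    parser.add_argument("input")
    return parser


def parse(argv):
    """Parse argv; argparse raises SystemExit when the arguments are invalid."""
    return build_parser().parse_args(argv)


# With "--", everything after it is treated as a positional value.
args = parse(["number", "--", "-Infinity"])
print(args.pattern, args.input)  # number -Infinity
```

Without the separator, the same parser treats -Infinity as an unknown flag and exits with a usage error.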
Text Strings
As we demonstrated above, the text pattern matches any text string:
TEXT=
text
dcbor match $TEXT '"hello"'
│ "hello"
dcbor match $TEXT '"🌎"'
│ "🌎"
Notice that when providing text strings as input to the CLI, you need to include the double-quotes as part of the dCBOR diagnostic notation. This is the same quoting consideration we discussed in the basic dcbor CLI chapter.
Byte Strings
The bstr pattern matches any byte string. Byte strings in CBOR are sequences of raw bytes, distinct from text strings, which have UTF-8 character encoding semantics:
BSTR=
bstr
dcbor match $BSTR "h'68656c6c6f'"
│ h'68656c6c6f'
The empty byte string is perfectly legal:
dcbor match $BSTR "h''"
│ h''
Booleans and Null
The bool pattern matches both boolean values:
BOOL=
bool
dcbor match $BOOL true
│ true
dcbor match $BOOL false
│ false
Don't confuse the response false here as meaning that the pattern didn't match; it means that the input value was false, which is a valid match for the bool pattern.
The null pattern matches CBOR's null value:
NULL=
null
dcbor match $NULL null
│ null
The Universal Pattern
The * ("any") pattern matches any CBOR value whatsoever.
ANY=
*
dcbor match $ANY 42
│ 42
dcbor match $ANY '"hello"'
│ "hello"
dcbor match $ANY "h'1234'"
│ h'1234'
* is useful when you want to match any value in a particular position within a larger structure.
Specific Value Matching
Beyond matching types, you can match exact values by providing the specific value as your pattern.
Specific Numbers
FORTY_TWO=
42
dcbor match $FORTY_TWO 42
│ 42
This won't match because 43 ≠ 42:
dcbor match $FORTY_TWO 43
│ Error: No match
Specific Text Strings
HELLO=
"hello"
dcbor match $HELLO '"hello"'
│ "hello"
This won't match because the strings are different:
dcbor match $HELLO '"world"'
│ Error: No match
Specific Byte Strings
TWO_BYTES=
h'1234'
dcbor match $TWO_BYTES "h'1234'"
│ h'1234'
Specific Boolean Values
BOOL_TRUE=
true
dcbor match $BOOL_TRUE true
│ true
This won't match because false ≠ true:
dcbor match $BOOL_TRUE false
│ Error: No match
Advanced Value Patterns
Beyond basic type and exact value matching, dCBOR patterns support sophisticated matching criteria including ranges for numbers and regular expressions for text and byte strings.
Number Ranges
Numbers can be matched using ranges and inequality operators, which is useful for validating data within acceptable bounds.
Range Matching
You can match numbers within a specific range using the ... syntax:
ONE_TO_TEN=
1...10
dcbor match $ONE_TO_TEN 5
│ 5
dcbor match $ONE_TO_TEN 15
│ Error: No match
The ... syntax is shorthand for an inclusive (closed) range, meaning that both the start and end values are part of the range.
The same range of numbers can also be specified with a more verbose syntax using the & operator, which we'll cover later.
ONE_TO_TEN=
>=1 & <=10
dcbor match $ONE_TO_TEN 5
│ 5
Inequality Operators
Numbers support various inequality operators. Quoting is important here to ensure the shell doesn't interpret characters such as > and < as redirection operators:
Greater than:
dcbor match ">5" 10
│ 10
Greater than or equal to:
dcbor match ">=5" 5
│ 5
Less than:
dcbor match "<10" 8
│ 8
Less than or equal to:
dcbor match "<=10" 10
│ 10
Half-Open Ranges
Using the & operator allows you to construct patterns that match half-open ranges (where one end is inclusive and the other is exclusive):
dcbor match ">1 & <=10" 10
│ 10
dcbor match ">1 & <=10" 1
│ Error: No match
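To make the relationship between these number patterns concrete, here is a toy model of them as Python predicates. It is a sketch of the semantics described above, not the dcbor-pattern implementation; every function name in it is invented for the illustration.

```python
from typing import Callable

# A "pattern" in this toy model is just a function from a number to True/False.
NumberPattern = Callable[[float], bool]


def closed_range(low: float, high: float) -> NumberPattern:
    """Model of `low...high`: both endpoints are included."""
    return lambda n: low <= n <= high


def gt(bound: float) -> NumberPattern:
    return lambda n: n > bound


def ge(bound: float) -> NumberPattern:
    return lambda n: n >= bound


def le(bound: float) -> NumberPattern:
    return lambda n: n <= bound


def both(left: NumberPattern, right: NumberPattern) -> NumberPattern:
    """Model of `left & right`: the value must satisfy both patterns."""
    return lambda n: left(n) and right(n)


one_to_ten = closed_range(1, 10)    # 1...10
same_range = both(ge(1), le(10))    # >=1 & <=10
half_open = both(gt(1), le(10))     # >1 & <=10

for n in (1, 5, 10, 15):
    print(n, one_to_ten(n), same_range(n), half_open(n))
```

The first two predicates agree on every input, while the half-open one differs only at the excluded endpoint 1.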
Special Number Values
You can also match three special floating-point values: NaN ("not a number"), Infinity, and -Infinity.
dcbor match "NaN" NaN
│ NaN
dcbor match "Infinity" Infinity
│ Infinity
dcbor match -- "-Infinity" -Infinity
│ -Infinity
Note the use of -- to signal the end of command-line options, allowing you to pass values that might otherwise be interpreted as flags.
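It is worth remembering that NaN cannot be treated as an ordinary numeric equality test: under IEEE 754 comparison rules, NaN is not equal to anything, including itself. The short Python illustration below shows that rule using Python floats; it makes no claim about how dcbor implements its NaN pattern.

```python
import math

nan = float("nan")

# NaN never compares equal, even to itself, so a dedicated check is needed.
print(nan == nan)        # False
print(math.isnan(nan))   # True

# The infinities, by contrast, compare equal to themselves.
print(float("inf") == math.inf)     # True
print(float("-inf") == -math.inf)   # True
```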
Text Regular Expressions
Regular expressions (or regexes) are powerful pattern matching tools for text, allowing you to search for specific patterns rather than exact text. They use special characters and syntax to define search patterns. For instance, \d+ matches one or more digits, [A-Z]+ matches one or more uppercase letters, and ^ and $ anchor patterns to the beginning and end of a string respectively. With regular expressions, you can validate formats, extract information, and perform sophisticated text processing operations.
The dCBOR patexes that this chapter describes are based on some of the same concepts as regexes, but they are not the same. The dCBOR pattern expression syntax is designed specifically for matching CBOR data structures and values, while regular expressions are designed for processing text. Nonetheless, some of the types you can match with dCBOR patterns, such as text strings and byte strings, can be matched using regular expressions.
Text strings can be matched using regular expressions by enclosing a regex in forward slashes: /regex/
Match strings starting with "temp":
STARTS_WITH_TEMP=
/^temp/
dcbor match $STARTS_WITH_TEMP '"temporary"'
│ "temporary"
This won't match because it doesn't start with "temp":
dcbor match $STARTS_WITH_TEMP '"permanent"'
│ Error: No match
Match any email-like pattern:
EMAIL_ADDRESS=
/^[^@]+@[^@]+\.[^@]+$/
dcbor match $EMAIL_ADDRESS '"user@example.com"'
│ "user@example.com"
Regular expressions use standard Rust regex syntax, which resembles Perl-style regular expressions but omits features such as look-around and backreferences. This allows for complex pattern matching including:
- Literal characters: /abc/, /123/
- Any character: /./
- Character classes: /[a-z]/, /[0-9]/, /\d/ (digit), /\w/ (word character)
- Quantifiers: /<pattern>*/ (zero or more), /<pattern>+/ (one or more), /<pattern>?/ (zero or one), /<pattern>{n,m}/ (between n and m times)
- Anchors: /^<pattern>/ (start), /<pattern>$/ (end)
- Groups: /(<pattern>)/
- Alternation: /<pattern1>|<pattern2>/
Explaining the full syntax of regular expressions is beyond the scope of this book, but you can find more information on the specific Rust implementation in the Rust regex documentation.
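If you want to experiment with a regex before wrapping it in slashes, any regex engine with compatible syntax will do for simple expressions. The sketch below uses Python's re module, whose syntax agrees with Rust's for the two expressions shown earlier. It assumes that matching is unanchored unless ^ or $ is used (which is why it calls search), and the helper is_match is a name made up for this example.

```python
import re

# The same expressions as the patex examples, without the surrounding slashes.
STARTS_WITH_TEMP = re.compile(r"^temp")
EMAIL_ADDRESS = re.compile(r"^[^@]+@[^@]+\.[^@]+$")


def is_match(regex: "re.Pattern[str]", text: str) -> bool:
    # Assumption: unanchored search semantics, so anchors must be explicit.
    return regex.search(text) is not None


print(is_match(STARTS_WITH_TEMP, "temporary"))      # True
print(is_match(STARTS_WITH_TEMP, "permanent"))      # False
print(is_match(EMAIL_ADDRESS, "user@example.com"))  # True
print(is_match(EMAIL_ADDRESS, "user@example"))      # False
```

For anything beyond simple expressions, check the Rust regex documentation, since the two engines differ in their more advanced features.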
Byte String Regular Expressions
Byte strings also support regular expression matching, useful for matching binary patterns or encoded data. Binary regexes operate on raw byte content, not on the hex string representation you see in diagnostic notation. The syntax is like h'<hex>' above, but for regexes it's: h'/<regex>/'.
Binary regexes must start with the (?s-u) flags to work correctly:
- (?s) enables "dot matches newline" mode, allowing . to match across newlines (like byte 0x0a)
- (?-u) disables Unicode mode, allowing . to match any byte value instead of just valid UTF-8 sequences
- Use \x notation for specific byte values (e.g., \xFF for byte 255)
Without these flags, patterns may fail on byte strings containing newlines or invalid UTF-8 sequences.
Match byte strings containing the byte 0xFF anywhere:
CONTAINS_FF=
h'/(?s-u).*\xFF.*/'
dcbor match $CONTAINS_FF "h'ff01020304'"
│ h'ff01020304'
Match byte strings starting with specific bytes 0102:
STARTS_WITH_0102=
h'/(?s-u)^\x01\x02/'
dcbor match $STARTS_WITH_0102 "h'01020304'"
│ h'01020304'
Match byte strings ending with specific bytes:
ENDS_WITH_0304=
h'/(?s-u)\x03\x04$/'
dcbor match $ENDS_WITH_0304 "h'01020304'"
│ h'01020304'
Match any 4-byte sequence:
ANY_FOUR_BYTES=
h'/(?s-u)^.{4}$/'
dcbor match $ANY_FOUR_BYTES "h'12345678'"
│ h'12345678'
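Two points from this section are easy to verify with any byte-oriented regex engine: the regex sees raw bytes rather than hex text, and . skips the newline byte unless "dot matches newline" is on. The sketch below uses Python's re module on bytes objects as a stand-in. Python byte patterns are never Unicode-aware, so only (?s) appears here; the (?s-u) spelling belongs to the Rust syntax that dcbor uses. Python's \Z is used for "end of input" because Python's $ also matches before a trailing newline.

```python
import re

# 1. The regex operates on raw bytes, not on the hex digits you type.
CONTAINS_FF = re.compile(rb"(?s).*\xFF.*")
raw = bytes.fromhex("ff01020304")   # five raw bytes, the first is 0xFF
hex_text = b"ff01020304"            # ten ASCII characters, none is 0xFF

print(CONTAINS_FF.search(raw) is not None)       # True
print(CONTAINS_FF.search(hex_text) is not None)  # False

# 2. Without "dot matches newline", "." refuses to match byte 0x0a.
with_newline = bytes.fromhex("ff0a0203")
WITHOUT_S = re.compile(rb"^.{4}\Z")
WITH_S = re.compile(rb"(?s)^.{4}\Z")

print(WITHOUT_S.search(with_newline) is not None)  # False
print(WITH_S.search(with_newline) is not None)     # True
```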
Practical Examples
These advanced patterns are particularly useful for data validation and extraction:
Validate that ages are reasonable (0-120):
dcbor match "0...120" 25
│ 25
Extract valid email addresses from text:
EMAIL_ADDRESS=
/^\w+@\w+\.\w+$/
dcbor match $EMAIL_ADDRESS '"john@example.com"'
│ "john@example.com"
Find numeric IDs above a threshold:
dcbor match ">1000" 1001
│ 1001
Match ISO-8601 date-like strings:
ISO_DATE=
/^\d{4}-\d{2}-\d{2}$/
dcbor match $ISO_DATE '"2023-12-25"'
│ "2023-12-25"
These advanced value patterns form the building blocks for more complex structure matching, which we'll explore in the next section.
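The word "date-like" in the last example matters: a regex of this kind checks the shape of a string, not whether it names a real calendar date. The Python sketch below shows the gap using the same expression; the helper names are invented for the example, and a real workflow might do the second check in whatever program consumes the matched values.

```python
import re
from datetime import date

# Same expression as the ISO_DATE patex, without the surrounding slashes.
ISO_DATE_SHAPE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def looks_like_iso_date(text: str) -> bool:
    """Shape check only: four digits, two digits, two digits."""
    return ISO_DATE_SHAPE.search(text) is not None


def is_real_date(text: str) -> bool:
    """Shape check followed by actual calendar validation."""
    if not looks_like_iso_date(text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


print(looks_like_iso_date("2023-12-25"), is_real_date("2023-12-25"))  # True True
print(looks_like_iso_date("2023-99-99"), is_real_date("2023-99-99"))  # True False
```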
Understanding Match Output
When a pattern matches, the default output shows the matched value. This seems simple now, but it becomes more meaningful when we start working with complex structures where patterns might match multiple values or nested elements.
dcbor match number 42
│ 42
The output 42 tells us that the pattern number matched the input value 42. When we move to structure patterns, you'll see how this output format shows the path through complex data structures.
Pattern Validation and Error Messages
When a pattern doesn't match, the CLI returns an error:
dcbor match text 42
│ Error: No match
This happens because the input 42 is a number, but the pattern text expects a string. Understanding these error messages helps you debug your patterns and understand why they might not be working as expected.
Finally, here are a couple of examples of patterns that fail to parse:
dcbor match tex '"Hello"'
│ Error: Failed to parse pattern at position 0..1: unrecognized token 't'
│ Pattern: tex
│          ^
dcbor match '"Hello' '"Hello"'
│ Error: Failed to parse pattern: Unterminated string literal at 0..1
Structure Patterns
Beyond matching individual values, dCBOR patterns support matching complex structures like arrays, maps, and tagged values. These patterns allow you to validate data schemas and extract elements from nested structures.
Array Patterns
Basic Array Matching
The array pattern matches any array structure:
ANY_ARRAY=
array
dcbor match $ANY_ARRAY '[1, 2, 3]'
│ [1, 2, 3]
dcbor match $ANY_ARRAY '["hello", "world"]'
│ ["hello", "world"]
dcbor match $ANY_ARRAY '[]'
│ []
If you want to match the empty array specifically, then the pattern is just the empty array: [].
Array Sequence Patterns
The array pattern can contain a comma-separated list of patterns that are matched against the array's elements in sequence. Each pattern matches exactly one element unless it is a repeating pattern such as (*)*, which can match zero or more elements.
[ <patex>, <patex>, ... ]
Match an array with a number followed by text:
NUMBER_THEN_TEXT=
[number, text]
dcbor match $NUMBER_THEN_TEXT '[42, "hello"]'
│ [42, "hello"]
[number, text] means the first element must be a number, followed by a text string, and that's it: these must be the only elements and they must appear in that order, so adding another element would not match:
dcbor match $NUMBER_THEN_TEXT '[42, "hello", 0]'
│ Error: No match
In this case the first element must be the exact number 42, but the second element can be any text string:
FORTY_TWO_THEN_TEXT=
[42, text]
dcbor match $FORTY_TWO_THEN_TEXT '[42, "hello"]'
│ [42, "hello"]
This won't match because the elements are in the wrong order:
dcbor match $FORTY_TWO_THEN_TEXT '["hello", 42]'
│ Error: No match
Match an array starting with a number, then text, then anything else:
NUMBER_THEN_TEXT_THEN_ANY=
[number, text, *]
dcbor match $NUMBER_THEN_TEXT_THEN_ANY '[42, "hello", true]'
│ [42, "hello", true]
In the example above, the * pattern by itself matches exactly one element. If you want to match zero or more of any elements from this point on, you can use the repeating pattern (*)*:
NUMBER_THEN_TEXT_THEN_REST=
[number, text, (*)*]
dcbor match $NUMBER_THEN_TEXT_THEN_REST '[42, "hello"]'
│ [42, "hello"]
dcbor match $NUMBER_THEN_TEXT_THEN_REST '[42, "hello", true]'
│ [42, "hello", true]
dcbor match $NUMBER_THEN_TEXT_THEN_REST '[42, "hello", true, false]'
│ [42, "hello", true, false]
We'll cover repeating patterns more thoroughly later.
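Until then, the following toy matcher may help build intuition for how a sequence of element patterns, including a repeat like (*)*, lines up against an array. It is a Python sketch of the behavior described above, not the dcbor-pattern algorithm; ZeroOrMore, match_sequence, and the predicate names are all invented for this illustration.

```python
from typing import Any, Callable, Sequence

ElementPattern = Callable[[Any], bool]


class ZeroOrMore:
    """Model of a repeat such as `(*)*`: zero or more elements matching `inner`."""

    def __init__(self, inner: ElementPattern):
        self.inner = inner


def is_number(value: Any) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def is_text(value: Any) -> bool:
    return isinstance(value, str)


def anything(value: Any) -> bool:
    return True


def match_sequence(patterns: Sequence, items: Sequence) -> bool:
    """True if the patterns account for every item, in order, with nothing left over."""
    if not patterns:
        return not items
    head, tail = patterns[0], patterns[1:]
    if isinstance(head, ZeroOrMore):
        # Try letting the repeat consume 0, 1, 2, ... leading items.
        for count in range(len(items) + 1):
            if count > 0 and not head.inner(items[count - 1]):
                break
            if match_sequence(tail, items[count:]):
                return True
        return False
    # An ordinary pattern consumes exactly one item.
    return bool(items) and head(items[0]) and match_sequence(tail, items[1:])


print(match_sequence([is_number, is_text], [42, "hello"]))     # True
print(match_sequence([is_number, is_text], [42, "hello", 0]))  # False
rest = [is_number, is_text, ZeroOrMore(anything)]
print(match_sequence(rest, [42, "hello"]))                     # True
print(match_sequence(rest, [42, "hello", True, False]))        # True
```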
Map Patterns
Basic Map Matching
The map pattern matches any map structure:
ANY_MAP=
map
dcbor match $ANY_MAP '{1: 2, 3: 4}'
│ {1: 2, 3: 4}
dcbor match $ANY_MAP '{"hello": "world"}'
│ {"hello": "world"}
dcbor match $ANY_MAP '{}'
│ {}
Key-Value Constraints
Maps can be matched by specifying key-value constraints using <key>: <value> notation. For each constraint, the target map must have at least one key-value pair that satisfies the constraint.
Match map with a specific key, and a text value:
HAS_KEY_NAME=
{"name": text}
dcbor match $HAS_KEY_NAME '{"name": "Alice", "age": 30}'
│ {"age": 30, "name": "Alice"}
Notice that it is not necessary to match every key-value pair in the map; you can match just the ones you care about. The output will show the entire map, with its keys in dCBOR's deterministic order, which is why "age" appears before "name" here.
Match map with number-valued key:
HAS_KEY_1=
{1: text}
dcbor match $HAS_KEY_1 '{1: "first", 2: "second"}'
│ {1: "first", 2: "second"}
If you want to match a map that only contains a specific key-value pair, you can specify the exact number of entries using the & operator and a map pattern containing a quantifier:
Match map with exactly one key-value pair, where key is 1 and value is any text:
HAS_SINGLE_ENTRY_WITH_KEY_1=
{ {1} } & {1: text}
This matches because the map has exactly one entry, whose key is 1 and whose value is text:
dcbor match $HAS_SINGLE_ENTRY_WITH_KEY_1 '{1: "first"}'
│ {1: "first"}
There are two entries, so no match:
dcbor match $HAS_SINGLE_ENTRY_WITH_KEY_1 '{1: "first", 2: "second"}'
│ Error: No match
Match map with multiple required entries:
HAS_ID_AND_NAME=
{"id": number, "name": text}
Both key-value pairs must exist, but other entries are allowed:
dcbor match $HAS_ID_AND_NAME '{"id": 1, "name": "Alice", "age": 30}'
│ {"id": 1, "age": 30, "name": "Alice"}
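The rule "each constraint must be satisfied by at least one entry" can be written down in a few lines. The Python sketch below models it with dictionaries and predicates. It is an illustration of the rule as stated in this section, not the dcbor-pattern implementation, and all of the names (map_with, map_of_size, both, and so on) are hypothetical.

```python
from typing import Any, Callable, Tuple

Pattern = Callable[[Any], bool]


def equals(expected: Any) -> Pattern:
    # Compare type as well as value so that True is not mistaken for 1.
    return lambda value: type(value) is type(expected) and value == expected


def is_text(value: Any) -> bool:
    return isinstance(value, str)


def is_number(value: Any) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def map_with(*constraints: Tuple[Pattern, Pattern]) -> Pattern:
    """Model of `{k1: v1, k2: v2}`: every constraint needs at least one matching entry."""
    def matches(target: Any) -> bool:
        if not isinstance(target, dict):
            return False
        return all(
            any(key_pat(k) and value_pat(v) for k, v in target.items())
            for key_pat, value_pat in constraints
        )
    return matches


def map_of_size(count: int) -> Pattern:
    """Model of `{ {count} }`: a map with exactly `count` entries."""
    return lambda target: isinstance(target, dict) and len(target) == count


def both(left: Pattern, right: Pattern) -> Pattern:
    """Model of `left & right`."""
    return lambda target: left(target) and right(target)


has_key_name = map_with((equals("name"), is_text))
single_entry_key_1 = both(map_of_size(1), map_with((equals(1), is_text)))
has_id_and_name = map_with((equals("id"), is_number), (equals("name"), is_text))

print(has_key_name({"name": "Alice", "age": 30}))              # True
print(single_entry_key_1({1: "first"}))                        # True
print(single_entry_key_1({1: "first", 2: "second"}))           # False
print(has_id_and_name({"id": 1, "name": "Alice", "age": 30}))  # True
```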
Tagged Value Patterns
CBOR tagged values apply semantic meaning to data. Patterns can match both the tag and the content.
Tag Number Matching
Match any value with tag 1234 containing a number:
NUMBER_TAGGED_1234=
tagged(1234, number)
dcbor match $NUMBER_TAGGED_1234 "1234(42)"
│ 1234(42)
Match tag 12345 with any content:
ANY_TAGGED_12345=
tagged(12345, *)
dcbor match $ANY_TAGGED_12345 '12345("tagged string")'
│ 12345("tagged string")
Content Pattern Matching
Tagged patterns specify both the tag value and required content patterns:
Match tag 2 (bignum) with byte string content:
BIGNUM=
tagged(2, bstr)
dcbor match $BIGNUM "2(h'0102')"
│ 2(h'0102')
Match tag with array content having specific structure:
NUMBER_TEXT_ARRAY_TAGGED_42=
tagged(42, [number, text])
dcbor match $NUMBER_TEXT_ARRAY_TAGGED_42 '42([1, "data"])'
│ 42([1, "data"])
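A tagged pattern is a conjunction of two checks: the tag number and the content. The small Python model below spells that out. It is only a sketch; the Tagged class and the tagged function are stand-ins invented for this example and are unrelated to the dcbor-pattern API.

```python
from dataclasses import dataclass
from typing import Any, Callable

Pattern = Callable[[Any], bool]


@dataclass(frozen=True)
class Tagged:
    """Stand-in for a CBOR tagged value such as 1234(42)."""
    tag: int
    content: Any


def is_number(value: Any) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def anything(value: Any) -> bool:
    return True


def tagged(tag: int, content_pattern: Pattern) -> Pattern:
    """Model of `tagged(tag, pattern)`: the tag number and the content must both match."""
    def matches(value: Any) -> bool:
        return (
            isinstance(value, Tagged)
            and value.tag == tag
            and content_pattern(value.content)
        )
    return matches


number_tagged_1234 = tagged(1234, is_number)
any_tagged_12345 = tagged(12345, anything)

print(number_tagged_1234(Tagged(1234, 42)))               # True
print(number_tagged_1234(Tagged(1234, "text")))           # False
print(number_tagged_1234(Tagged(99, 42)))                 # False
print(any_tagged_12345(Tagged(12345, "tagged string")))   # True
```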
Introducing Paths
Single Path, Single Element Output
When a pattern matches, the default output shows the matching value. For structures, this represents the entire matching structure:
dcbor match 'array' '[1, 2, 3]'
│ [1, 2, 3]
dcbor match '{"key": *}' '{"key": "value", "other": 42}'
│ {"key": "value", "other": 42}
The examples above only include one match, and one way to get there. But dCBOR items are actually trees, with arrays and maps representing possible branches. This becomes more meaningful when working with search patterns or captures that can match multiple items or nested elements. For example, later we'll discuss the search pattern, which visits all the elements in a dCBOR item. For a quick example, if you match a pattern that finds all numbers in an array, the output will show each number along with its context, or path from the root of the structure:
dcbor match 'search(number)' '[1, [2, 3]]'
The output shows three paths from the root item to numbers within it:
│ [1, [2, 3]]
│     1
│ [1, [2, 3]]
│     [2, 3]
│         2
│ [1, [2, 3]]
│     [2, 3]
│         3
You can choose to output only the last item of each path using the --last-only option, which shows just the final matched items:
dcbor match --last-only "search(number)" '[1, [2, 3]]'
│ 1
│ 2
│ 3
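To see where these paths come from, it can help to write the traversal out by hand. The Python sketch below walks nested lists depth-first and records the chain of containers leading to each number. It models only arrays, assumes a depth-first visiting order, and is not the dcbor-pattern search implementation; search_paths is a name made up for this example.

```python
from typing import Any, Callable, Iterator, Tuple

Path = Tuple[Any, ...]


def is_number(value: Any) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def search_paths(value: Any, predicate: Callable[[Any], bool],
                 prefix: Path = ()) -> Iterator[Path]:
    """Yield, depth-first, the path from the root to every element satisfying predicate."""
    here = prefix + (value,)
    if predicate(value):
        yield here
    if isinstance(value, list):
        for item in value:
            yield from search_paths(item, predicate, here)


paths = list(search_paths([1, [2, 3]], is_number))
for path in paths:
    print(" -> ".join(repr(element) for element in path))

# Keeping only the final element of each path mirrors what --last-only shows.
last_only = [path[-1] for path in paths]
print(last_only)  # [1, 2, 3]
```

Each path starts at the root item and ends at the matched value, which is why the root appears at the top of every group in the full output.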
Output Options Overview
The dcbor match command provides several options for controlling output format:
- --captures: Show named capture information (covered in the advanced chapter)
- --last-only: Show only the final matched items
- --in FORMAT / --out FORMAT: Control input/output formats (hex, diag, etc.)
The next chapter will cover advanced matching techniques.
The appendices include a dCBOR Patex Reference.