Torrents wrecked by inconsistent handling of "unsafe" characters #353

Open
opened 2025-04-21 15:07:49 -04:00 by idk · 2 comments
Owner

Opened 2018-08-28 07:38:51 UTC

Last modified 2018-09-02 20:26:31 UTC

#2304 new defect

Torrents wrecked by inconsistent handling of "unsafe" characters

Reported by: jogger
Owned by: zzz
Priority: minor
Milestone: undecided
Component: apps/i2psnark
Version: 0.9.36
Keywords:
Cc:
Parent Tickets:
Sensitive: no

Description

The bug surfaces with filenames written by some Mac applications that contain characters with the high bit set (non-ASCII bytes). Example: hexdump "EFBC8F", which is displayed as a slash-like character ("／"). Many such sequences exist.

The torrent is created correctly, with these characters inside the torrent file, on Linux and Mac with Java 9 and 10.

The torrent downloads unchanged to Linux with Java 9 and 10. The downloaded torrent checks clean when moved to another instance on Linux or after a crash. The same behaviour is observed on Mac with 0.9.35 and Java 9.

On Mac with 0.9.36 and Java 10, the above sequence is changed to a single underscore. Torrents do not check clean after a crash, or when moved in after being downloaded on Linux. As a consequence, one cannot be sure that it will be possible to seed a downloaded torrent at a later time or on a different machine.

Note about the standard for testing this kind of issue:

Kernighan & Pike describe it in The Practice of Programming, Chapter 6 (Testing), §6.5 Stress Tests:

When Steve Bourne was writing his Unix shell (which came to be known as the Bourne shell), he made a directory of 254 files with one-character names, one for each byte value except '\0' and slash, the two characters that cannot appear in Unix file names. He used that directory for all manner of tests of pattern-matching and tokenization. (The test directory was of course created by a program.) For years afterwards, that directory was the bane of file-tree-walking programs; it tested them to destruction.
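
A rough Java sketch of the input set described in that quote, for anyone who wants to build a similar stress test. The class and method names are made up for illustration. It only generates the 254 one-byte names; Java's file APIs take names as strings, so actually creating files with arbitrary raw byte names is left out, and some filesystems would refuse part of them anyway.

```java
import java.util.ArrayList;
import java.util.List;

public class StressNamesSketch {
    // Build every one-byte name except NUL (0x00) and slash (0x2F),
    // the two bytes that cannot appear in a Unix file name.
    static List<byte[]> oneByteNames() {
        List<byte[]> names = new ArrayList<>();
        for (int b = 1; b <= 0xFF; b++) {
            if (b == '/') {
                continue; // slash is the path separator
            }
            names.add(new byte[] {(byte) b});
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(oneByteNames().size() + " candidate names");
    }
}
```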


idk added this to the undecided milestone 2025-04-21 15:07:49 -04:00
idk added the #2304, apps, i2psnark, undecided labels 2025-04-21 15:07:49 -04:00

comment:2 Changed 2018-09-02 20:26:31 UTC by jogger

Basically you are saying that torrents downloaded (not created) to a Mac with 0.9.35 / Java 9 with "unsafe characters" intact are wrecked on the very same machine after upgrading to 0.9.36 / Java 10, because for some reason the filenames are no longer valid. Bad news if you cannot move them to Linux.

As a further consequence, I can no longer move torrents downloaded on a Mac to Linux, because on Linux the characters now considered unsafe on the Mac are still valid.

I suggest changing the policy and abandoning all character conversion except for null and slash whenever some Unix is detected as the underlying OS.
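
To make the suggestion concrete, here is a minimal sketch of what such a policy could look like. This is not i2psnark code: `UnixNamePolicySketch`, `looksLikeUnix` and `sanitizeComponent` are hypothetical names, and the `os.name` check is a crude assumption about how "some Unix" might be detected.

```java
import java.util.Locale;

public class UnixNamePolicySketch {
    // Crude, hypothetical OS check based on the value of the os.name property.
    static boolean looksLikeUnix(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        return os.contains("linux") || os.contains("mac") || os.contains("bsd")
            || os.contains("sunos") || os.contains("aix");
    }

    // Replace only NUL and '/' in a single path component; keep everything else.
    static String sanitizeComponent(String name) {
        StringBuilder out = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            out.append(c == '\0' || c == '/' ? '_' : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String os = System.getProperty("os.name");
        if (looksLikeUnix(os)) {
            // U+FF0F survives; only the ASCII slash is replaced.
            System.out.println(sanitizeComponent("a/b\uFF0Fc"));
        }
    }
}
```

Even with a policy like this, the filesystem itself may still reject or alter some names, so it would narrow the problem rather than remove it.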


comment:1 Changed 2018-08-31 16:46:11 UTC by zzz

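related: #571 (snark char mapping on torrent creation; closed: fixed), #771 (I2PSnark: Willful rewriting of special characters; closed: fixed), #1132 (New per-torrent config system for i2psnark; closed: fixed), #1415 (I2PSnark filename conversion to builtin charset in windows may cause ...; new)

https://www.fileformat.info/info/unicode/char/ff0f/index.htm

0xEFBC8F is valid UTF-8, U+FF0F FULLWIDTH SOLIDUS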
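
For anyone who wants to verify the decoding locally, a small sketch (the class and method names are just for illustration) that decodes the three reported bytes as UTF-8:

```java
import java.nio.charset.StandardCharsets;

public class FullwidthSolidusDemo {
    // The three bytes quoted in the report.
    static final byte[] REPORTED = {(byte) 0xEF, (byte) 0xBC, (byte) 0x8F};

    // Decode a byte sequence as UTF-8 and return the resulting string.
    static String decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = decode(REPORTED);
        int cp = s.codePointAt(0);
        // Print the code point in U+XXXX notation with its Unicode name.
        System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
    }
}
```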

We validate based on the default charset for the JVM, which comes from the OS. If a character is not available in the default charset, it can't be mapped to that charset, so we need to replace it with something else. We use '_'. Converting between charsets is lossy; there's no way to fix it. In addition, even within the same charset, different OSes have different rules on valid chars in file names, and things may happen to file names when you copy them between OSes. Again, that's not fixable by us.
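
A minimal sketch of the kind of check described above, assuming it amounts to asking the charset's encoder whether each character can be encoded. This is not the actual i2psnark code; `CharsetFilterSketch` and `filter` are hypothetical names. It shows how the same name can come out differently depending on the charset in use.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class CharsetFilterSketch {
    // Replace every code point the given charset cannot encode with '_'.
    static String filter(String name, Charset cs) {
        CharsetEncoder enc = cs.newEncoder();
        StringBuilder out = new StringBuilder(name.length());
        int i = 0;
        while (i < name.length()) {
            int cp = name.codePointAt(i);
            int len = Character.charCount(cp);
            String unit = name.substring(i, i + len);
            out.append(enc.canEncode(unit) ? unit : "_");
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String name = "a\uFF0Fb";
        // Result depends on whatever the JVM picked up from the OS.
        System.out.println(filter(name, Charset.defaultCharset()));
        // With US-ASCII, U+FF0F cannot be encoded and becomes '_'.
        System.out.println(filter(name, StandardCharsets.US_ASCII));
    }
}
```

Once the replacement has happened, the original character cannot be recovered from the '_', which is the lossy step referred to above.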

Reference: I2P_Developers/i2p.i2p#353