Q: How to stop any directory with "foo" in its path from showing up in search? #120

Closed
Gremious opened this issue Nov 24, 2024 · 16 comments

Comments

@Gremious

Gremious commented Nov 24, 2024

I have private folders/files that I don't want to ever show up in search
e.g. if I search for .txt, I don't want my-drive/super-private/passwords.txt to ever show up in the results list, so I want to block anything that has *super-private* in its path.
How would I go about achieving this?

I tried just doing

[global]
e2dsa
e2tsr
no-idx:foo

And it did seemingly re-index, but search entries still showed up.

@9001
Owner

9001 commented Nov 24, 2024

Hey! it's been a while :>

Just to clarify, you're alright with all of those folders and files being visible in the directory listing, so you can navigate into it just by clicking my-drive -> super-private -> passwords.txt, but you just don't want the files showing up in search?

in that case, your approach with setting no-idx: foo is fine; you could even be more specific with no-idx: /foo/ to only filter that specific foldername.
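A minimal sketch of the difference between the two patterns, assuming the value is applied as an unanchored regex search against each file's path. The `is_excluded` helper and the sample paths are hypothetical and only illustrate the matching; this is not copyparty's own code.

```python
import re


def is_excluded(pattern: str, path: str) -> bool:
    """Return True if the regex matches anywhere in the path (unanchored search)."""
    return re.search(pattern, path) is not None


# hypothetical sample paths
paths = [
    "/srv/my-drive/foo/passwords.txt",  # inside a folder named exactly "foo"
    "/srv/my-drive/foobar/notes.txt",   # folder name merely contains "foo"
    "/srv/my-drive/music/food.mp3",     # filename contains "foo"
    "/srv/my-drive/docs/readme.txt",    # no "foo" anywhere
]

broad = [p for p in paths if is_excluded("foo", p)]
narrow = [p for p in paths if is_excluded("/foo/", p)]

print("foo   matches:", broad)
print("/foo/ matches:", narrow)
```

Under that assumption, the bare `foo` also catches `foobar` and `food.mp3`, while `/foo/` only catches files below a folder named exactly `foo`.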

the reason the change isn't taking effect on your existing index is because of an optimization which skips most of the indexing code if the folders haven't changed at all (no files added, removed or modified).

if you add re-dhash as a global option, then this optimization gets disabled and it should rebuild the index correctly. You can remove re-dhash after starting copyparty with it just once, since that cache speeds up startups / rescans by a lot once you have enough files/folders.

this workaround shouldn't be necessary; it should be smart enough to realize that it needs to drop the cache and reindex whenever no-idx is changed, so I'll keep this issue open so i can fix that :>
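To make the idea concrete, here is a rough sketch of a folder-hash cache that skips unchanged folders. It is purely illustrative and not how copyparty implements dhash: the function names, the hashed fields, and the choice to mix the filter pattern into the hash are all assumptions.

```python
import hashlib


def dir_hash(entries, no_idx=""):
    """Hash a folder listing of (name, size, mtime) tuples plus the active filter."""
    h = hashlib.sha1()
    # assumption: including the filter means a changed filter changes the hash
    h.update(no_idx.encode("utf-8"))
    for name, size, mtime in sorted(entries):
        h.update(f"{name}\0{size}\0{mtime}\n".encode("utf-8"))
    return h.hexdigest()


def needs_rescan(cache, folder, entries, no_idx=""):
    """Return True (and update the cache) when the folder must be re-indexed."""
    new = dir_hash(entries, no_idx)
    if cache.get(folder) == new:
        return False  # nothing changed, skip the expensive indexing code
    cache[folder] = new
    return True


cache = {}
listing = [("passwords.txt", 120, 1732400000)]  # hypothetical folder contents

first = needs_rescan(cache, "/srv/foo", listing)         # nothing cached yet
second = needs_rescan(cache, "/srv/foo", listing)        # unchanged, so skipped
third = needs_rescan(cache, "/srv/foo", listing, "foo")  # filter changed

print(first, second, third)
```

If the filter were left out of the hash, the third call would also be skipped, which matches the symptom described above: a new no-idx value has no effect on folders that are already cached.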

@Gremious
Author

Aha yea, hello again ! c:

Just to clarify, you're alright with all of those folders and files being visible in the directory listing ...

Yep, all exactly right

I added re-dhash but am sorry to report that nothing changed. I could still find the files just fine, both by searching for the keyword and by filename. I also tried it as a command-line argument just in case (python3 copyparty-sfx.py -c /home/username/.config/copyparty/config.conf --re-dhash), thinking the config flag might not be getting picked up, but no luck.

Further advice appreciated

@9001
Owner

9001 commented Nov 24, 2024

Whoops, I forgot to mention that --re-dhash is meant to be combined with -e2dsa. But if you did specify both and it still didn't work, then that's odd... I'll take a look in case there's something else I've missed.

EDIT: oh and you don't need to -e2tsr for this, should save a lot of time heh

Btw, there's another way to do this -- if you rename super-private to .super-private, then it'll automatically get excluded from searches (unless --dotsrch). Just thought it'd be worth mentioning :>

@Gremious
Author

Gremious commented Nov 24, 2024

Yeah I do have both, and sadly does not work

If it helps, for reference, my config now looks like this:

[global]
name: Data
theme: 2
ftp: 1234
ftps: 1234
tftp4
ftp-pr: 1234-12345

# Use hashed passwords
ah-alg: argon2

# Thumbnail view on by default
grid

# Enable selection by default
gsel

# Enable general file indexing, index all files that don't have tags yet
# Maybe it's called "e2..." because it uses an 'up2k' tree for the database...?
e2dsa

# re-build index but like actually
re-dhash

# Enable metadata and tag indexing (ffprobe)
e2ts
# !! delete all media tags for re-indexing
# e2tsr

no-idx: foo

# Full-sized image thumbnails
th-crop=0

# Don't crawl my website please google
force-js
no-robots

# Don't create symlinks to deduplicate, make copies of every file
# no-dedup

# Allow cross-volume symlinks for deduplication
xlink

# Enable dotfiles
# ed
# urlform:get

# rejects all webdav connections unless they actually authenticate
dav-auth

# "opengraph", nice discord and social-media embeds
# (you now have to hotlink files by appending ?raw to the url)
og

# load all the config files in copyparty.d/
# Volume definitions
% copyparty.d

Btw, there's another way to do this -- if you rename super-private to .super-private , then it'll automatically get excluded from searches (unless --dotsrch).

Oh that's nice to know too, thank you 👍

@9001
Owner

9001 commented Nov 25, 2024

Okay, you got me stumped... Please help me grasp at straws for a bit :p

But first -- this is regarding files that already exist on disk, right? no-idx only affects the filesystem scanner/indexer, so if a file is uploaded into a path that matches no-idx, then it still gets indexed until the next restart/reindex, at which point the no-idx would kick in and forget them. Changing no-idx to also affect uploads has too many other implications, so it would be better to add a dedicated option for filtering search results in that case.

Assuming we're talking about on-disk files -- one quick way to tell if the no-idx is taking effect at all, is to remove the no-idx option, restart copyparty and let it finish indexing, then add it back in, and another restart. At that point, there should be a log message saying forgetting N shadowed autoindexed files in [...] which indicates that the files should be gone from the db, and not appear in searches. We know that last part isn't true currently, but are you at least getting that message?

...and please ignore the messages along the lines of commit 71 new files; those are due to foldersize calculation and in dire need of rephrasing. So thanks for the reminder :p

We should also confirm which database it's reading the search results from, in case this is somehow related to the structure of your volumes. I don't think this is the case, but still... If you add the global-option srch-dbg then it will add some new pink/purple-colored log messages from u2idx:

23:01:10.660 127.0.0.1 45024       qj: name like *xptxt* |125|
23:01:10.660 u2idx                 searching across all 2 volumes in which the user has 'r' (full read access):
  / = /home/ed/dev/copyparty/srv
  /media = /home/ed/dev/copyparty/srv/media
23:01:10.661 u2idx                 qs: "select up.* from up where up.fn like  '%'||?||'%' " ['xptxt']
23:01:10.661 u2idx                 searching in volume / (/home/ed/dev/copyparty/srv), excludelist ['media']
23:01:10.665 u2idx                 in volume '/': got 0 hits, 0 total so far
23:01:10.665 u2idx                 searching in volume /media (/home/ed/dev/copyparty/srv/media), excludelist []
23:01:10.665 u2idx                 in volume '/media': hit: media/rls/xptxt.txt
23:01:10.665 u2idx                 in volume '/media': got 1 hits, 1 total so far
23:01:10.665 127.0.0.1 45024       q#: 1 (0.00s)

This should make it obvious where it's finding the files, so we could take a closer look at that db in particular. If we've gotten this far, then it would be useful if you could post this part of the log. Please feel free to find-replace file/foldernames to something else, but just take care that the folder structure isn't affected.
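For reference, the `qs:` line in the log is a plain SQL substring match, which can be tried on a toy database. This is only a sketch: the table name `up` and the columns `rd` and `fn` appear in the logged queries, but the two-column schema and the sample rows here are made up, and the real database has more to it.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# hypothetical minimal schema: rd = folder, fn = filename
db.execute("create table up (rd text, fn text)")
db.executemany(
    "insert into up values (?, ?)",
    [("rls", "xptxt.txt"), ("rls", "other.bin"), ("docs", "readme.txt")],
)

# the query text as printed by srch-dbg; the ? is bound to the search term
query = "select up.* from up where up.fn like  '%'||?||'%' "
hits = db.execute(query, ["xptxt"]).fetchall()

print(hits)
db.close()
```

Since the search is just a `select` against that table, a file can only be returned while its row is still present in the database that u2idx is reading.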

And finally, one unrelated thing I noticed in your config is that you currently have deduplication disabled, but xlink enabled. Deduplication became default-disabled in v1.15.0 because many people found it surprising, so now you need to enable it with dedup. But if you're going to keep it disabled, then might as well remove xlink too -- I wouldn't be surprised if that feature turns out to have some funky edgecases 👀

9001 added a commit that referenced this issue Nov 26, 2024
dhash would prevent a new noidx value from taking effect
@Gremious
Author

But first -- this is regarding files that already exist on disk, right?

Yes, the files already exist, and I want them to stop showing up in search; I discovered that they do when I accidentally saw them during an unrelated search.

then it still gets indexed until the next restart/reindex, [...] Changing no-idx to also affect uploads has too many other implications,

That's very good to know, thank you.

If I truly needed it, I can just restart copyparty, it's pretty quick, so I don't mind.


At that point, there should be a log message saying forgetting N shadowed autoindexed files in [...] which indicates that the files should be gone from the db, and not appear in searches.

I can confirm I do get that message

I made a brand new copyparty drive/acc/directory just to test

[accounts]
  testman: ******

[/test]
  /home/gremious/data/user/test
  accs:
    rwmd: testman
  flags:

With copyparty running i made /test/foo/super-secret.txt
Comment out no-idx: foo, restart, put it back in, restart again

forgetting 1 shadowed autoindexed files in [/home/gremious/data/user/test] > [foo]

Still shows up:

image


u2idx

I just replaced usernames with e.g. "user1"
I think /test seems to be in an exclude list at the start because it's under / which itself is a copyparty volume, not sure what that's about, but it finds it the second time around when it actually checks /test.

copyparty[1015873]: 20:14:20.857 u2idx                 qs: "select up.* from up where trim(?||up.rd,'/') like  '%'||?||'%' and up.fn like  '%'||?||'%' " ['\nrd', 'foo', 'super-secret.txt']
copyparty[1015873]: 20:14:20.857 u2idx                 searching in volume / (/home/gremious/data/user), excludelist ['test', 'user2', 'shared', 'user3', 'user1', 'user1/public', 'user1/private', 'user1/protected',>
copyparty[1015873]: 20:14:20.857 u2idx                 in volume '/': got 0 hits, 0 total so far
copyparty[1015873]: 20:14:20.857 u2idx                 searching in volume /gremious (/home/gremious/data/user/gremious), excludelist ['public', 'private', 'protected']
copyparty[1015873]: 20:14:20.858 u2idx                 in volume '/gremious': got 0 hits, 0 total so far
copyparty[1015873]: 20:14:20.858 u2idx                 searching in volume /gremious/public (/home/gremious/data/user/gremious/public), excludelist []
copyparty[1015873]: 20:14:20.898 u2idx                 in volume '/gremious/public': got 0 hits, 0 total so far
copyparty[1015873]: 20:14:20.898 u2idx                 searching in volume /shared (/home/gremious/data/user/shared), excludelist []
copyparty[1015873]: 20:14:20.898 u2idx                 in volume '/shared': got 0 hits, 0 total so far
copyparty[1015873]: 20:14:20.898 u2idx                 searching in volume /test (/home/gremious/data/user/test), excludelist []
copyparty[1015873]: 20:14:20.898 u2idx                 in volume '/test': hit: test/foo/super-secret.txt
copyparty[1015873]: 20:14:20.898 u2idx                 in volume '/test': got 1 hits, 1 total so far
copyparty[1015873]: 20:14:20.898 u2idx                 searching in volume /user1 (/home/gremious/data/user/user1), excludelist ['public', 'private', 'protected']
copyparty[1015873]: 20:14:20.898 u2idx                 in volume '/user1': got 0 hits, 1 total so far
copyparty[1015873]: 20:14:20.898 u2idx                 searching in volume /user1/public (/home/gremious/data/user/user1/public), excludelist []
copyparty[1015873]: 20:14:20.898 u2idx                 in volume '/user1/public': got 0 hits, 1 total so far
copyparty[1015873]: 20:14:20.898 127.0.0.1 50316       q#: 1 (0.04s)

I noticed in your config is that you currently have deduplication disabled, but xlink enabled. Deduplication became default-disabled in v1.15.0 because many people found it surprising, so now you need to enable it with dedup.

Hey, thanks! Enabled it now :)

@9001
Owner

9001 commented Nov 26, 2024

Okayyy, I'm kinda starting to suspect this is filesystem-related now... Clearly the knowledge about that file is removed from the database, but then it suddenly appears in a search afterwards. This reminds me of #61 which also boiled down to some weird search-related issues with your setup, so this is getting interesting!

The way search works is that the indexer (up2k) and the searcher (u2idx) each have their own "SQLite connection", which just means that the DB-file is opened twice, once each by two different threads. This approach is recommended by the SQLite devs, and SQLite has a lot of safeguards to make this both safe and fast. But that's assuming that the filesystem doesn't do anything silly, which is starting to look plausible, as the changes made by up2k are not visible to u2idx.
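The expected behaviour can be checked in isolation with Python's sqlite3 module. This is a minimal sketch with a throwaway database in a temp folder; the table layout and the variable names `indexer` and `searcher` are stand-ins for illustration, not copyparty's actual code.

```python
import os
import sqlite3
import tempfile


def count_rows(conn):
    """Count rows in the toy table; fetchall() fully consumes the cursor."""
    return conn.execute("select count(*) from up").fetchall()[0][0]


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "up2k.db")

    # first connection plays the indexer: create the table and add one file
    indexer = sqlite3.connect(db_path)
    indexer.execute("create table up (rd text, fn text)")
    indexer.execute("insert into up values ('foo', 'super-secret.txt')")
    indexer.commit()

    # second connection plays the searcher: same db file, opened again
    searcher = sqlite3.connect(db_path)
    before = count_rows(searcher)

    # the indexer forgets the file and commits the change
    indexer.execute("delete from up where rd = 'foo'")
    indexer.commit()

    # on a well-behaved filesystem the searcher sees the committed delete
    after = count_rows(searcher)

    indexer.close()
    searcher.close()

print(before, after)
```

On a normal local filesystem the second count drops to zero right after the commit, so a searcher that keeps returning a forgotten file points at either a misbehaving filesystem or rows that were never deleted in the first place.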

Before we continue this train of thought, let's make sure that they're actually opening the same file like they should. As you restart copyparty, up2k will print the path to the db, and as you perform the first search after a restart, u2idx will do the same thing:

21:17:12.635 up2k                  /test/ all-default
21:17:12.636 up2k                    /dev/shm/v/home/gremious/data/user/test/.hist/up2k.db |1|

[...]

21:19:51.207 u2idx                 opened /dev/shm/v/home/gremious/data/user/test/.hist/up2k.db
21:19:51.207 u2idx                 searching in volume /test (/dev/shm/v/home/gremious/data/user/test), excludelist []
21:19:51.208 u2idx                 in volume '/test': hit: test/foo/super-secret.txt

those two paths should match exactly; /dev/shm/v/home/gremious/data/user/test/.hist/up2k.db in my case.

Assuming they match on your end as well, let's continue --

if i'm not mistaken, you're running copyparty using python3 copyparty-sfx.py on a linux machine, and it's not inside a container such as docker/podman/lxc. Nice, that eliminates a bunch of potential issues already. But let me know if any of that is incorrect.

Could you post the final three lines of output from python3 copyparty-sfx.py --version ? on my machine, it gives this:

copyparty v1.16.2 "COPYparty" (2024-11-23)
  CPython v3.13.0 on Linux64  [GCC 14.2.1 20240912 (Red Hat 14.2.1-3)]
   sqlite 3.46.1*1 | jinja 3.1.4 | pyftpd 2.0.0 | tftp 0.4.0

(this mentions the thread-safety properties of your linux-distro's sqlite library, which might be relevant)

And some other things I'd like to know --

  • what linux distro and version are you running? you can check this with cat /etc/os-release if you're unsure
  • what filesystem type are you storing the files (and database) on?
  • are you running any sort of raid or disk-stitching software? maybe something like mergerfs, or unraid's shfs?

a quick way to check the filesystem type (and which blockdevice it's on) is with df --output=source,fstype PATH-TO-THE-UP2K-DB, for example:

df --output=source,fstype /home/jaycore/hist/demo/up2k.db
Filesystem     Type
/dev/nvme0n1p2 btrfs

@9001
Owner

9001 commented Nov 27, 2024

meanwhile, I found some good reasons to add a proper option for filtering search results, so here's a beta -- there's global-option srch-excl and volflag srch_excl which takes a regex just like no-idx.

but I still want to figure out what's up with no-idx, since it could be a symptom of something worse... so I'm good to keep debugging if you are :>

copyparty-g697a4fa8.py.zip

9001 added a commit that referenced this issue Nov 27, 2024
a better alternative to using `--no-idx` for this purpose since
this also excludes recent uploads, not just during fs-indexing,
and it doesn't prevent deduplication

also speeds up searches by a tiny amount due to building the
sanchecks into the exclude-filter while parsing the config,
instead of during each search query
@Gremious
Author

Gremious commented Nov 27, 2024

those two paths should match exactly

We got:

systemd[1]: Started copyparty.service - copyparty file server.
copyparty[1752040]: 21:38:45.300 up2k                  / daw
copyparty[1752040]: 21:38:45.320 up2k                    /home/gremious/data/user/.hist/up2k.db |1|
copyparty[1752040]: 21:38:45.321 up2k                  /test/ all-default
copyparty[1752040]: 21:38:45.352 up2k                    /home/gremious/data/user/test/.hist/up2k.db |1|

and

copyparty[1752040]: 21:38:56.427 u2idx                 opened /home/gremious/data/user/test/.hist/up2k.db
copyparty[1752040]: 21:38:56.427 u2idx                 searching in volume /test (/home/gremious/data/user/test), excludelist []
copyparty[1752040]: 21:38:56.427 u2idx                 in volume '/test': hit: test/foo/super-secret.txt
copyparty[1752040]: 21:38:56.427 u2idx                 in volume '/test': got 1 hits, 1 total so far

So all seems ok?

also I was going to ask, subsequent searches go like:

copyparty[1762297]: 21:59:02.682 u2idx                 searching in volume /test (/home/gremious/data/user/test), excluding ^\.|/\.
copyparty[1762297]: 21:59:02.682 u2idx                 in volume '/test': hit: test/foo/super-secret.txt
copyparty[1762297]: 21:59:02.682 u2idx                 in volume '/test': got 1 hits, 1 total so far

So I thought, is that "excluding" supposed to have the foo regex perchance?

And then I tried the srch-excl version and a) it works and b) it added it there, so I guess I know what you did now ehe
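A quick way to see why the file still shows up with only the default filter is to run the logged regex against a few paths. This is a sketch under assumptions: the `^\.|/\.` pattern is copied from the log line above, while the sample paths, the extra `foo` pattern, and the way the two patterns are checked together are hypothetical.

```python
import re

# the default filter as printed in the "excluding" log line
dot_filter = re.compile(r"^\.|/\.")

# hypothetical user-supplied search filter
user_filter = re.compile(r"foo")


def hidden(path, filters):
    """Return True if any of the given regexes matches somewhere in the path."""
    return any(f.search(path) is not None for f in filters)


samples = [
    "foo/super-secret.txt",
    ".super-private/passwords.txt",
    "my-drive/.super-private/passwords.txt",
    "my-drive/notes.v2.txt",
]

dots_only = [hidden(p, [dot_filter]) for p in samples]
with_foo = [hidden(p, [dot_filter, user_filter]) for p in samples]

print(dots_only)
print(with_foo)
```

The dot filter only hides paths where a name starts with a dot, so `foo/super-secret.txt` passes through it untouched; it is hidden only once a pattern matching `foo` is part of the search filter.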


if i'm not mistaken, you're running copyparty using python3 copyparty-sfx.py on a linux machine, and it's not inside a container such as docker/podman/lxc

Yep, all correct.


Could you post the final three lines of output from python3 copyparty-sfx.py --version ?

copyparty v1.16.2 "COPYparty" (2024-11-23)
  CPython v3.11.2 on Linux64 6.1.85 [GCC 12.2.0]
   sqlite 3.40.1*1 | jinja 3.1.2 | pyftpd 1.5.10 | tftp 0.4.0

what linux distro and version are you running

"Debian GNU/Linux 12 (bookworm)"

what filesystem type are you storing the files (and database) on?

btrfs. (I should really just have this as an ext4 server, since I do not use anything cool btrfs has to offer, but I can't be bothered, so we're stuck with this for now.)

are you running any sort of raid or disk-stitching software? maybe something like mergerfs, or unraid's shfs?

No, as far as I remember I'm just running this all on the internal ssd of an intel nuc.


but I still want to figure out what's up with no-idx, since it could be a symptom of something worse... so I'm good to keep debugging if you are :>

I don't mind :)

@Gremious
Author

Gremious commented Nov 28, 2024

Update: I don't think it's btrfs related because I just re-ran copyparty on my home computer (EndeavourOS (arch-based) with ext4) and it has the same bug

copyparty v1.16.2-6-g697a4fa8 "COPYparty" (2024-11-26)
  CPython v3.12.7 on Linux64  [GCC 14.2.1 20240910]
   sqlite 3.46.1*1 | jinja 3.1.4 | pyftpd 1.5.10 | tftp 0.4.0

@9001
Owner

9001 commented Nov 28, 2024

nice! could you possibly upload the config and/or post the command you used to reproduce it at home? and include the exact steps you took to trigger it? cause this thing has been driving me insane lol

@Gremious
Author

Gremious commented Nov 28, 2024

wait wtf i re-did it for the sake of clean guide and this time it works
I had it
I'm gonna see if i can break it again, gimme a hot min

@9001
Owner

9001 commented Nov 28, 2024

in the meantime i'm happy to hear it wasn't btrfs; been using it on all my equipment for years and it's saved my ass more times than i'd like to admit... accidentally rm -rf an important folder? no sweat, just yank the powercable and undelete it! gotta start using snapshots one of these days 😅

and the handful of times it's bugged out with filesystem errors in dmesg has always been due to buggy hardware or dying HDDs, so the data checksums are truly a blessing -- not sure when I would have noticed with anything other than btrfs or zfs...

but way more importantly, thanks a lot for finally nailing this bug down 🙏 🙏 can't wait to see what it is :>

@Gremious
Author

Gremious commented Nov 28, 2024

cause this thing has been driving me insane lol

I'm starting to question whether I'm insane myself, I swear if I simply misread an option or typoed no-idx at this point somehow I'm going to lose my mind lmao


no sweat, just yank the powercable and undelete it!

Ok that's actually pretty cool

One day I will accept the blessings of btrfs

That day will come when I have the energy to be bothered to properly learn/use snapshots and whatnot lol


Ok I figured something out? Hopefully
If a file is created using my filesystem explorer, it works perfectly fine. I made /foo/secret.txt and no-idx works exactly how you'd expect. I toggle no-idx while re-dhash is on and it toggles on/off no problem. And even if I create that folder/file while copyparty is running with no-idx, it actually will filter out without even a restart.
However.
If a file is uploaded or created using the copyparty buttons - it doesn't work.

SO: we make a /home/username/Test and we put copyparty-sfx.py in it. (I used the current release version but the beta one you sent also does the same)

Nuke /home/username/.config/copyparty/ and put this as config.conf

[global]
name: Real Data
e2dsa
re-dhash
# no-idx: foo

[accounts]
  realman: realguy
  testman: testguy

[/]
  /home/gremious/Test/
  accs:
    rwmda: realman
    r: *

[/test]
  /home/gremious/Test/test
  accs:
    rwmd: testman

in /home/username/Test/ we run python copyparty-sfx.py -c ~/.config/copyparty/config.conf
(python version 3.12.7 here btw, don't think it matters)

open copyparty, login with passwd testguy

Go to /test, press the copyparty GUI button to make a folder and name it foo, then press the copyparty GUI button to make a new text file inside it, named secret.txt or so.

Search path: foo you'll see it - fair, you mentioned no-idx needs a restart. re-dhash is already on, so, restart copyparty, refresh the page - yet you should still see it.

Funnily enough, if I make a /foo/secret.txt using the file explorer and /secret/foo/other-secret.txt using copyparty, the file-explorer one hides as one would expect when I toggle, yet the copyparty one does not, so it knows somehow.


P.S.
Copyparty complains

19:38:06.452 root                  WARNING: found config files in [/home/username/.config/copyparty]: config.conf
  config files are not expected here, and will NOT be loaded (unless your setup is intentionally hella funky)

And it did not load the config from there by default. Did the expected config location change? I kind of expect it to be in ~/.config.

@9001 9001 closed this as completed in d168b2a Nov 28, 2024
@9001
Owner

9001 commented Nov 28, 2024

YES! That was it! Awesome work narrowing it down, thanks again 💯

I left all the details in the commit message, but the TL;DR is that the initial code for forgetting files was a tad too careful.

Here's a beta; probably won't be a new release until dec 7th or so: copyparty-gd168b2ac.py.zip


Did the expected config location change?

nah, that's fine -- that message is mainly to help configure things correctly when running in docker/podman. The docker container expects you to create two volumes, one to hold the files and one for the config. You're supposed to put your config files directly inside the config volume, but the confusing part is that this is also where copyparty creates the copyparty folder (the one that usually goes in /home/foo/.config), which, given the name, gives the impression that you're supposed to put the configs there instead.

that part is a bit of a mess, but the good news is I'm finally picking up the motivation to start working on the config gui, and that'll be a great opportunity to rethink some of this stuff :>

@Gremious
Author

Hellll yeah!!

Can confirm, works on the home PC and server both! 🎉

Thank you too, great job dude, copyparty Software Of The Year every year forever 💪
