When accessing a user whose origin server is down, the first attempt always fails. #174

Open
opened 2025-02-05 20:52:14 +09:00 by naskya · 8 comments
Owner

What type of issue is this?

  • label: Server
  • label: Bug

What happened?

When accessing a user whose origin server is down, the first attempt always fails.

What did you expect to happen?

Firefish should display the locally cached content instead of showing an error message.

Steps to reproduce the issue

  1. Find an account that has a locally cached user object but whose origin server is down, such as @WordlessEcho@lolic.at
  2. Open it on Firefish, for example https://dvd.chat/@WordlessEcho@lolic.at
  3. The first visit of the day shows an error; subsequent visits work normally.

Reproduces how often

Once per account per instance per day.

What did you try to solve the issue / Do you have any insights

I suspect that the code highlighted in the screenshot below is not fault-tolerant, but this needs further verification.

image

Version

v20240725

Instance

dvd.chat

What browser are you using? (client-side issues only)

What operating system are you using? (client-side issues only)

How do you deploy Firefish on your server? (server-side issues only)

What operating system are you using? (Server-side issues only)

Relevant log output

Contribution Guidelines

By submitting this issue, you agree to follow our Contribution Guidelines

  • I agree to follow this project's Contribution Guidelines
  • I have searched the issue tracker for similar issues, and this is not a duplicate.

Are you willing to fix this bug? (optional)

  • Yes, I will open a merge request that closes this ticket.
Author
Owner

Author: laozhoubuluo

  1. Thank you very much for explaining how to clear the cache. Strictly speaking, lastFetchedAt is a user attribute stored in Postgres, so it would have to be modified there, and I would rather not edit the database by hand.
  2. However, the code change I proposed earlier does solve this problem. I will submit an MR later.
  3. resolveUser can support a timeout, but Zotan only implemented the 1500 ms option. Do you think this value is appropriate, or do we need to implement a longer timeout?
Aug 01 00:11:32 FirefishDev firefish[8755]:  INFO 1        [remote resolve-user]        try resync: laozhoubuluo@firefish-pre.nglab.bid
Aug 01 00:11:32 FirefishDev firefish[8755]:  INFO 1        [remote resolve-user]        WebFinger for laozhoubuluo@firefish-pre.nglab.bid
Aug 01 00:11:33 FirefishDev firefish[8755]: ERROR 1        [remote resolve-user]        Failed to WebFinger for laozhoubuluo@firefish-pre.nglab.bid: 502
Aug 01 00:11:33 FirefishDev firefish[8755]: ERROR 1        [remote resolve-user]        error resolving remote user WebFinger: Error: Failed to WebFinger for laozhoubuluo@firefish-pre.nglab.bid: 502
Aug 01 00:11:33 FirefishDev firefish[8755]:  INFO 1        [remote resolve-user]        return existing remote user: laozhoubuluo@firefish-pre.nglab.bid
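
The log above shows the desired behavior: the resync fails, and the existing remote user is returned anyway. A minimal sketch of that fallback pattern follows. The names `CachedUser`, `isStale`, `refreshOrFallback`, and `fetchRemote` are hypothetical and not taken from the Firefish code base, and the one-day resync interval is only an assumption based on the "once per day" reproduction rate.

```typescript
// Hypothetical shape of a cached remote user; only lastFetchedAt is named in this thread.
interface CachedUser {
	username: string;
	host: string;
	lastFetchedAt: Date;
}

// Assumption: a remote user is re-fetched at most once per day.
const RESYNC_INTERVAL_MS = 24 * 60 * 60 * 1000;

// True when the cached copy is older than the resync interval.
function isStale(user: CachedUser, now: Date = new Date()): boolean {
	return now.getTime() - user.lastFetchedAt.getTime() > RESYNC_INTERVAL_MS;
}

// Tries to refresh a stale user, but returns the cached copy if the refresh fails.
async function refreshOrFallback(
	cached: CachedUser,
	fetchRemote: (user: CachedUser) => Promise<CachedUser>,
	now: Date = new Date(),
	log: (message: string) => void = console.error,
): Promise<CachedUser> {
	if (!isStale(cached, now)) return cached;
	try {
		return await fetchRemote(cached);
	} catch (err) {
		// The origin server may be down (e.g. a 502 or an aborted request),
		// so report the failure and keep serving the local data.
		log(`resync failed for ${cached.username}@${cached.host}: ${String(err)}`);
		return cached;
	}
}
```

With this shape, a failing `fetchRemote` never reaches the caller, so the first visit of the day would render the cached profile instead of an error.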
Author
Owner

Author: naskya

> wait for the cache to time out

You can manually clear caches by deleting Redis keys

# delete specific cache
redis-cli DEL cache_key_name

# delete all caches
redis-cli --scan | xargs -L 100 redis-cli DEL

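If you’re using the “db-container” setup (https://firefish.dev/firefish/firefish/-/blob/86d3f8f5b5f483d647f1f062d8363263ab47770f/dev/docs/db-container.md), you can run make redis-cli to enter the Redis CLI.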

> I wonder if anyone has encountered this.

Personally, I don’t think this is good, but the backend behavior changes depending on NODE_ENV, so it may be related. Slow responses should be timed out, because they can be exploited for a DoS attack.

$ grep -r 'production' packages/backend/src
packages/backend/src/services/logger.ts:                        process.env.NODE_ENV !== "production"
packages/backend/src/services/drive/upload-from-url.ts:         process.env.NODE_ENV === "production" &&
packages/backend/src/boot/master.ts:    if (env !== "production") {
packages/backend/src/boot/master.ts:            logger.warn("The environment is not in production mode.");
packages/backend/src/server/api/api-handler.ts:                                         ...(y!.info && process.env.NODE_ENV !== "production"
packages/backend/src/server/index.ts:if (!["production", "test"].includes(process.env.NODE_ENV || "")) {
packages/backend/src/db/postgre.ts:const log = process.env.NODE_ENV !== "production";
packages/backend/src/misc/download-url.ts:                              (process.env.NODE_ENV === "production" ||
Author
Owner

Author: laozhoubuluo

To simulate the server outage I need to wait for the cache to time out, so I will share the test results later.

By the way, api/users/show calls resolveUser without any timeout. In my local test environment, which has a slow network (traffic goes through a proxy), Nginx times out before the backend has finished the request.

In this example, the frontend gave up after 1.5 minutes, and the backend finished the request after 6.5 minutes. I feel that this should not be a problem in a production environment. I wonder if anyone has encountered this.
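
One way to bound a call such as resolveUser is to race it against a timer. The sketch below is only an illustration: `withTimeout` is a hypothetical helper, not an existing Firefish function, and the 1500 ms in the usage comment simply reuses the value mentioned earlier in this thread.

```typescript
// Rejects if `task` does not settle within `ms` milliseconds.
// Note: this only stops waiting; it does not cancel the underlying work.
function withTimeout<T>(task: Promise<T>, ms: number): Promise<T> {
	let timer: ReturnType<typeof setTimeout> | undefined;
	const timeout = new Promise<never>((_resolve, reject) => {
		timer = setTimeout(
			() => reject(new Error(`operation timed out after ${ms} ms`)),
			ms,
		);
	});
	// Whichever promise settles first wins; the timer is always cleaned up.
	return Promise.race([task, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical usage (the real resolveUser signature may differ):
//   const user = await withTimeout(resolveUser(username, host), 1500);
```

Because the race does not abort the outgoing request, the backend could still spend minutes on it in the background. Actually cancelling the request would require passing something like an AbortSignal down to the HTTP client.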

image

Author
Owner

Author: laozhoubuluo

Jul 30 22:19:45 Firefish firefish[709]:  INFO 1        [remote resolve-user]        try resync: wordlessecho@lolic.at
Jul 30 22:19:45 Firefish firefish[709]:  INFO 1        [remote resolve-user]        WebFinger for wordlessecho@lolic.at
Jul 30 22:20:45 Firefish firefish[709]: ERROR 1        [remote resolve-user]        Failed to WebFinger for wordlessecho@lolic.at: The operation was aborted.
Jul 30 22:21:10 Firefish firefish[709]:  INFO 1        [remote resolve-user]        return existing remote user: wordlessecho@lolic.at
Author
Owner

Author: laozhoubuluo

I reproduced the problem in my local test environment and found that the error message was still "Failed to WebFinger for wordlessecho@lolic.at: The operation was aborted." After carefully checking the screenshots and the code, I realized that my local Git checkout had not been updated to v20240725. After updating, I found that resolveUserWebFinger can also throw this error.

image

However, I still need to test whether adding a catch here fixes the problem.

image

Author
Owner

Author: naskya

Actually, we don’t need to use the resolveSelf function. packages/backend/src/remote/resolve-user.ts (https://firefish.dev/firefish/firefish/-/blob/develop/packages/backend/src/remote/resolve-user.ts) has been updated by the huge commit f282549900780a3413373dab444968d19db38102, and Iceshrimp’s code should handle it better (but we may be using it wrong).

Author
Owner

Author: naskya

Thanks for your insights! If you can fix the problem, please feel free to open a merge request.


side note:

I believe the function name resolveSelf is taken from the WebFinger spec (https://docs.joinmastodon.org/spec/webfinger/#example); see https://info.firefish.dev/.well-known/webfinger?resource=acct:firefish@info.firefish.dev for an example.

I personally don’t think we need to stick to the word self. Perhaps const webfingerLink = ... is a better variable name?
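
For context, a WebFinger response is a JSON document with a list of links, and the entry with rel "self" points at the actor. A rough sketch of picking that link is shown below. The types, the function name `findWebfingerLink`, and the example.com data are made up for illustration and are not the actual Firefish implementation.

```typescript
// Subset of the JSON Resource Descriptor (JRD) returned by WebFinger.
interface WebFingerLink {
	rel: string;
	type?: string;
	href?: string;
}

interface WebFingerResponse {
	subject?: string;
	links?: WebFingerLink[];
}

// Returns the first link with rel "self" that has an href, if any.
function findWebfingerLink(jrd: WebFingerResponse): WebFingerLink | undefined {
	return jrd.links?.find(
		(link) => link.rel.toLowerCase() === "self" && typeof link.href === "string",
	);
}

// Made-up response for a hypothetical account alice@example.com.
const exampleResponse: WebFingerResponse = {
	subject: "acct:alice@example.com",
	links: [
		{
			rel: "http://webfinger.net/rel/profile-page",
			type: "text/html",
			href: "https://example.com/@alice",
		},
		{
			rel: "self",
			type: "application/activity+json",
			href: "https://example.com/users/alice",
		},
	],
};

const webfingerLink = findWebfingerLink(exampleResponse);
```

Here `webfingerLink` names what the value is (a link taken from the WebFinger response) rather than repeating the spec's word self.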

Author
Owner

Author: laozhoubuluo

changed the description

Reference
naskya/test#174