#320 - Crawlers hitting Forgejo instances - global abuse trend - forgejo/discussions - Codeberg.org
#320 · Open · opened 2025-03-21 09:05:50 +01:00 by earl-warren · 63 comments
earl-warren commented 2025-03-21 09:05:50 +01:00
Detailed discussions on excessive crawling targeting code.forgejo.org:
- [February 2025](https://codeberg.org/forgejo/discussions/issues/297)
- [April 2025](https://codeberg.org/forgejo/discussions/issues/331)
---
Codeberg and the Forgejo infrastructure (which are entirely separate) were both targeted by [excessive crawling and/or DDoS](https://codeberg.org/forgejo/discussions/issues/297). The frequency of such problems seems to be increasing in 2025 and may be part of a global abuse trend.
Although there is a consensus and enough first-hand reports to conclude that a toxic trend emerged in 2025, in which code-related web services are overwhelmed by an excessive increase in the number and pattern of requests, there is no clarity on the cause or the sources.
Here is what is known:
- SourceHut, LWN and code.forgejo.org were hit by tens of thousands of IPs
- RTD was hit by AI bots identifying themselves as such
And open questions:
- Why are tens of thousands of IPs from residential areas used for crawling?
- Is there any evidence that this is related to AI related crawling (RTD has evidence because the AI bots identified themselves as such)?
- Is there anything ruling out the possibility that the crawls resulting in DDoS are not malicious (i.e., that the overload is accidental)?
Discussions and articles are being published on this topic and this discussion is meant to keep an inventory, collect advice and opinions.
## First-hand descriptions
### Code related
- [DDoS on code.forgejo.org](https://codeberg.org/forgejo/discussions/issues/297)
- [DoS on nytsoi.net](https://blog.nytsoi.net/2025/03/01/obliterated-by-ai)
- [DoS on donotsta](https://donotsta.re/notice/AreSNZlRlJv73AW7tI)
- [Videolan](https://news.ycombinator.com/item?id=43424367)
- [FreeBSD](https://blog.sysopscafe.com/posts/ai-crawlers-hammering-git-repos/)
- [LWN DDoS](https://social.kernel.org/notice/AqJkUigsjad3gQc664)
- [SourceHut blog - Please stop externalizing your costs directly into my face](https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html)
- [RTD blog - AI crawlers abuse](https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/)
- [Pagure](https://www.scrye.com/blogs/nirik/posts/2025/03/15/mid-march-infra-bits-2025/)
- [Fedora](https://fosstodon.org/@Conan_Kudo/114190797503456984)
- [Inkscape](https://floss.social/@doctormo/113907009941378917)
### Not code related
- [Diaspora](https://pod.geraspora.de/posts/17342163)
- [Wikimedia](https://www.heise.de/en/news/AI-scrapers-strain-Wikipedia-50-percent-more-bandwidth-for-multimedia-requests-10336904.html)
## Articles and resources
- Apr 2025 https://pages.madhouse-project.org/algernon/infrastructure.org/services_fronting_for_aman_git_madhouse-project_org
- Feb 2025 https://jan.wildeboer.net/2025/02/Blocking-Stealthy-Botnets/
- Apr 2025 https://jan.wildeboer.net/2025/04/Web-is-Broken-Botnet-Part-2/
- 2024 [The hidden world of residential proxies](https://www.orangecyberdefense.com/global/blog/research/residential-proxies)
- 2024 [Understanding and Classifying Network Traffic of Residential Proxies](https://chasesecurity.github.io/bandwidth_sharing/)
- 2023 https://www.trendmicro.com/vinfo/ae/security/news/vulnerabilities-and-exploits/a-closer-exploration-of-residential-proxies-and-captcha-breaking-services
- 2019 https://ieeexplore.ieee.org/document/8835239
- https://github.com/ai-robots-txt/ai.robots.txt
- https://github.com/anthmn/ai-bot-blocker
- https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/
- https://news.ycombinator.com/item?id=43422413
- https://radar.cloudflare.com/
## Mitigation techniques
- generating noise https://zadzmo.org/code/nepenthes/
- blocking IP ranges
- blocking User-Agents
- rate limiting
- rate limiting non-logged users and logged in users differently
- client side proof of work https://anubis.techaro.lol/, https://git.gammaspectra.live/git/go-away
- IP reputation scoring
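The client-side proof-of-work item is the approach taken by tools like Anubis and go-away: the server hands out a random challenge, the client burns CPU to find a nonce whose hash clears a difficulty target, and the server verifies with a single hash. A minimal sketch of the idea (the SHA-256 scheme and difficulty values are illustrative, not how either tool actually implements it):

```python
import hashlib
import os

def meets_difficulty(digest: bytes, bits: int) -> bool:
    """Check that the digest starts with `bits` zero bits."""
    value = int.from_bytes(digest, "big")
    return value >> (len(digest) * 8 - bits) == 0

def solve(challenge: bytes, bits: int) -> int:
    """Client side: brute-force a nonce (cheap once, costly at crawler scale)."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if meets_difficulty(digest, bits):
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int, bits: int) -> bool:
    """Server side: a single hash, so verification stays cheap."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return meets_difficulty(digest, bits)

challenge = os.urandom(16)
nonce = solve(challenge, 12)  # 12 bits of difficulty ~ 4096 hashes on average
assert verify(challenge, nonce, 12)
```

The asymmetry is the point: solving costs thousands of hashes per page, verifying costs one, so the economics shift against anyone requesting millions of pages.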
earl-warren (Author) commented 2025-03-21 10:08:33 +01:00
In a non-code context a list of [~500K IPs](https://seenthis.net/messages/1104052#message1104149) were blocked. There is however no clarity on how those were determined to be bad and blocked. See also https://framapiaf.org/@biggrizzly/114199414828865169
earl-warren (Author) commented 2025-03-21 10:38:42 +01:00
Tangent to the topic: AI powered bug reports https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-for-intelligence/
litchipi (Member) commented 2025-03-21 14:30:11 +01:00
Here's a sysadmin-noobie idea: honeypots
DDoS is meant to harm availability, but AI crawlers want to aggregate the maximum amount of data, while users only browse visible links; that's a way to differentiate the traffic.
Maybe we could set up honeypots for bots, the kind that would trap crawlers that don't respect `robots.txt`, and feed the corresponding IPs into a blocklist.
Imagine a link invisible to users that bots will crawl into; that would be a way to differentiate real users from crawlers.
EDIT: Seems like it's a real solution https://codeberg.org/forgejo/discussions/issues/319#issuecomment-3096841
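The honeypot idea could be sketched like this (the `/trap/` path, the in-memory blocklist and the simplified request handling are all hypothetical; a real setup would hand the IPs to fail2ban or the reverse proxy):

```python
# Sketch of the invisible-link honeypot: /trap/ is disallowed in robots.txt
# and only reachable via a link hidden from human visitors (e.g. CSS
# display:none). Any client that requests it anyway gets blocklisted.

ROBOTS_TXT = "User-agent: *\nDisallow: /trap/\n"

blocklist = set()

def handle_request(ip: str, path: str) -> int:
    """Return an HTTP status code for a (very) simplified request."""
    if ip in blocklist:
        return 403  # already caught
    if path.startswith("/trap/"):
        blocklist.add(ip)  # disrespected robots.txt: ban it
        return 403
    return 200

assert handle_request("203.0.113.7", "/repo/src") == 200
assert handle_request("203.0.113.7", "/trap/secret") == 403
# every later request from that IP is now refused
assert handle_request("203.0.113.7", "/repo/src") == 403
```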
earl-warren referenced this issue 2025-03-23 01:02:54 +01:00: Let's forge, summary of the day #317
earl-warren referenced this issue from forgejo/website 2025-03-28 13:10:36 +01:00: Monthly Update for March 2025 #569
earl-warren referenced this issue from forgejo/website 2025-03-28 13:10:59 +01:00: Monthly Update for March 2025 #569
bkil commented 2025-04-03 15:47:10 +02:00
Residential IPs appear within your log because if you start blocking data centers, they will switch a toggle and go through a cloud of millions of residential rotating proxy servers that anyone can hire. It costs them a bit more, but not that much more.
One open question from me is why don't the scrapers just clone the whole git repo and do whatever analysis they want on it locally? Until you see this, my guess is that code forges are not a primary target for them yet, and you are only collateral damage. As such, a simple workaround such as the following might work: #319 (comment)
bkil commented 2025-04-03 15:57:45 +02:00
Answering another open question: Drew mentioned in the post that the same bot often hits the exact same permalink multiple times a day. Git commits of a given hash, or files within them, are not expected to change after the first fetch, so ignoring the related caching headers can only be interpreted as a form of DDoS, i.e. it cannot be interpreted as a benign access pattern.
The same could be said of a client hitting an RSS endpoint without presenting ETag or If-Modified-Since headers, or without waiting an adaptive, exponential amount of time (or otherwise 1h) between polls.
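For reference, a polite feed poller does the opposite of what these bots do: it sends conditional headers and backs off exponentially toward a cap while nothing changes. A small sketch of the adaptive-interval part (the base interval and 1h cap follow the suggestion above; the exact policy is illustrative):

```python
def next_poll_interval(current: float, modified: bool,
                       base: float = 60.0, cap: float = 3600.0) -> float:
    """Adaptive polling: shorten the interval when content changed,
    double it (up to a 1h cap) when the server answered 304 Not Modified."""
    if modified:
        return max(base, current / 2)
    return min(cap, current * 2)

# A client that keeps getting 304s backs off toward the 1h cap:
interval = 60.0
for _ in range(10):
    interval = next_poll_interval(interval, modified=False)
assert interval == 3600.0
```

A conditional request itself is just an `If-None-Match: <etag>` (or `If-Modified-Since`) header; a server answering 304 transfers no body at all, which is exactly the saving the bots forgo.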
bkil commented 2025-04-03 17:07:34 +02:00
Instead of uselessly burning coal and spinning up the HVAC, how about putting the elevated viewership to actual good use? For example, each client could clone a single repository (or just slices of it) into its memory and serve its own requests from memory by doing the expensive processing client-side.
Clients could then also contribute to a torrent swarm over WebRTC to share the authentic content among themselves (i.e., commits, releases and signed cached files), not unlike how it is done in PeerTube with WebTorrent. As an added benefit, even fewer bots are expected to be signalling over WebRTC than would run Wasm.
In the end, it would not really matter whether a client is a bot or a human - it would be contributing either way.
Snoweuph commented 2025-04-03 17:35:33 +02:00
The problem with "constructive" as described by the link is that trolls (for example 4chan) exist.
Also, there are humans who only want to lurk or try something out, but don't have the knowledge to add anything constructive.
So the only way is to offload something that is safe to offload (in all 3 core principles) and at the same time heavy enough on the client to shift the economics of scraping, while not adding extra load through the system itself.
bkil commented 2025-04-03 18:20:17 +02:00
Not sure whether you read my comment, but the "constructive" reference was more of a joke.
Sharing storage and data signed by Codeberg over a swarm is a scalable and objective browser-based option for chipping in on the cost. The API endpoints that compute the expensive functionality on the server side would then stop working, and each visitor would only have access to MBs of contiguous storage underlying each git repo (or maybe the git dumb-HTTP protocol, FIXME).
Snoweuph commented 2025-04-03 18:56:43 +02:00
I don't understand jokes :3
kita (Member) commented 2025-04-04 04:30:31 +02:00
The fact that even residential IPs are involved makes me suspicious... Why residential? Isn't that a botnet at that point? That kind of behavior is very malicious, and it creates headaches for lots of administrators, forcing them to block even legitimate users from accessing their websites. They must be doing this intentionally to circumvent various protection measures (e.g. Cloudflare).
earl-warren (Author) commented 2025-04-04 05:53:43 +02:00
@bkil wrote in https://codeberg.org/forgejo/discussions/issues/320#issuecomment-3487644:
> Residential IPs appear within your log because if you start blocking data centers, they will switch a toggle and go through a cloud of millions of residential rotating proxy servers that anyone can hire. It costs them a bit more, but not that much more.
Could you provide references regarding this? It is news to me 🙏
Gusted (Owner) commented 2025-04-04 15:34:44 +02:00
For Codeberg access patterns I don't think we've seen clear evidence that residential IPs were used. But there have been instances where, after blocking new IP ranges from certain companies, the same patterns started moving to well-known VPS providers such as Hetzner (which also led to accidentally blocking them for a few hours).
litchipi (Member) commented 2025-04-05 10:40:23 +02:00
What I would like for my own Forgejo instance is an endpoint that only disrespectful bots would hit.
Then I can just fail2ban them to oblivion and report them to … automatically.
But I'm not an organisation with a complex network, so I wonder if this solution is scalable.
Also, because some bots may adapt, it would be nice to have a configuration option that lets me set a custom endpoint for this.
I'd love to implement this bit, if you think it's acceptable as a solution.
earl-warren (Author) commented 2025-04-05 10:57:28 +02:00
Added https://www.heise.de/en/news/AI-scrapers-strain-Wikipedia-50-percent-more-bandwidth-for-multimedia-requests-10336904.html to the description list.
Aminda commented 2025-04-08 07:58:21 +02:00
I happened to notice that Gitea recently made it possible to require login for expensive pages instead of all pages or only private ones; they also seem to have backported it. Would it be possible or worth considering for Forĝejo to do that too?
- https://github.com/go-gitea/gitea/pull/34024
When I recently noticed an instance I help maintain being hammered by crawlers, I just set it to require login for everything, since I have yet to find the energy to figure out how to enable Anubis there. I think requiring login only for expensive pages could be a satisfactory compromise between openly accessible information and not getting hammered down, although whether that works in practice remains to be seen.
sclu1034 (Member) commented 2025-04-08 11:32:48 +02:00
I don't feel like much thought went into Gitea's implementation.
I'm pretty sure `/{username}/{reponame}/src/` and `/{username}/{reponame}/commit/` together block all natural paths to browsing code, and combined with blocking issues, pulls and wikis, that takes away almost everything I would want to make an instance public for. Might as well use the existing "sign-in for everything" at that point. If all I need is showing the README and release downloads, a simple web host would be more flexible.
Additionally, despite them calling these "expensive" paths, it feels more like "the ones that showed up most frequently in access logs". Serving a file from disk in `raw/`, or even rendered in a pretty box in `src/` or `wiki/`, shouldn't be expensive.
`activity/` actually makes sense; the contributors page can take ages. `blame/` and `graph/` somewhat fit, too, since you probably only really need them when you're active in the repo anyway.
`archive/` does have the issue of filling up temporary storage, but I wouldn't block that completely. Instead, maybe keep downloads from the branch tips open (which should cover most legitimate use cases) and only block access to archives of non-tip commits.
Also, I believe blocking `media/` like that breaks images in READMEs.
But the rest shouldn't cost much per request. If the aggregate cost is too much, rate limiting would be more sensible, and a less final block for legitimate users.
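The rate-limiting alternative suggested here is commonly implemented as a per-IP token bucket: steady traffic is allowed through at a fixed rate, short bursts are tolerated, and sustained floods are rejected. A minimal sketch (the rate and burst values are illustrative, not a recommendation):

```python
import time

class TokenBucket:
    """Allow `rate` requests/second with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # refill tokens proportionally to the time elapsed since the last call
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# 1 request/second with bursts of 5: the burst passes, the flood does not.
bucket = TokenBucket(rate=1.0, capacity=5.0)
t0 = time.monotonic()
results = [bucket.allow(t0) for _ in range(10)]
assert results == [True] * 5 + [False] * 5
```

In practice one bucket per client IP (or per IP range) gives exactly the "less final" behavior described: legitimate users never notice, and a crawler slows to the configured rate instead of being walled out.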
bkil commented 2025-04-08 13:25:01 +02:00
The above blocking rules were probably designed with whitelisting and defense in depth in mind. Their aim was probably to only allow access via the dumb HTTP git transfer protocol, which just serves each raw file. No request can be cheaper than that. And it is still possible to build a web interface around that.
poVoq commented 2025-04-08 17:46:49 +02:00
> Might as well use the existing "sign-in for everything" at that point.

No, because the main point of semi-public access is having working links. You can't link a project elsewhere when guest viewers only see a nondescript login page. Even very basic access to the README alone is much better than that.
Related issue here: https://codeberg.org/forgejo/forgejo/issues/6924
Gusted referenced this issue from forgejo/forgejo 2025-04-08 18:21:31 +02:00: feat: Light view option with sign-in requirement? #6924
earl-warren (Author) commented 2025-04-19 11:18:35 +02:00
@jwildeboer's insightful series on this topic:
> So there is a (IMHO) shady market out there that gives app developers on iOS, Android, MacOS and Windows money for including a library that sells users network bandwidth. Infatica [1] is just one example, there are many more.
> I am 99% sure that these companies cause what effectively are DDoS attacks that many webmasters have to deal with since months. This business model should simply not exist. Apple, Microsoft and Google should act.
And a related article from 2023
Could be detected by tools like https://exodus-privacy.eu.org/en/ by the user.
earl-warren (Author) commented 2025-04-19 12:08:15 +02:00
Most of the providers selling residential proxy services are priced per GB. They appear to predominantly acquire residential bandwidth from people voluntarily running "proxyware" (such as https://www.honeygain.com, https://pawns.app/internet-sharing/, https://earnapp.com/) who are paid a fee (no less than 0.1 USD/GB) that depends on how much bandwidth they proxy.
# Priced per GB
In the range of 0.3 USD/GB to 5 USD/GB
- https://www.piaproxy.com/pay/?paynavVal=8
- https://www.abcproxy.com/
- https://www.922proxy.com/residential-proxies
- https://www.lunaproxy.com/pricing/residential-proxies/
- https://www.pyproxy.com/longtermisp/ (same company, identical)
- https://www.lumiproxy.com/pricing/residential/
- https://packetstream.io/pricing/
- https://smartproxy.com/deals/residential-offer
- https://oxylabs.io/products/residential-proxy-pool
- https://brightdata.com/proxy-types/residential-proxies
- https://soax.com/proxies/residential
- https://netnut.io/rotating-residential-proxies/ (identical)
- https://infatica.io/residential-proxies/
- https://netnut.io/rotating-residential-proxies/
- https://www.nimbleway.com/nimble-ip/residential-proxies
- https://dataimpulse.com/residential-proxies/#pricing
- https://rayobyte.com/products/residential-proxies#pricing
- https://iproyal.com/pricing/residential-proxies/
- https://shifter.io/pricing
# Priced per proxy
- https://www.webshare.io/pricing
# Unlimited bandwidth
- https://www.proxyrack.com/unmetered-residential-proxies/ (priced by the number of IP used simultaneously)
- https://stormproxies.com/residential_proxy.html (US / EU only, 5 IPs at a time only)
- https://www.abcproxy.com/pricing/unlimited-residential-proxies.html (inconsistent advertising claiming datacenter IP when buying residential proxies)
- https://oneproxy.pro/services/rotating-proxies/ (no residential proxy)
earl-warren referenced this issue 2025-04-20 08:12:35 +02:00: 17 April - Ongoing DDoS on code.forgejo.org (updated 2 May) #331
earl-warren referenced this issue 2025-04-21 09:22:25 +02:00: 17 April - Ongoing DDoS on code.forgejo.org (updated 2 May) #331
earl-warren (Author) commented 2025-04-21 16:25:23 +02:00
I'm more and more convinced that increasing the volume sent is a sound strategy to counter [the current DDoS](https://codeberg.org/forgejo/discussions/issues/331).
The entire system relies on individuals [selling their bandwidth / IP for money](https://www.orangecyberdefense.com/global/blog/research/residential-proxies) by installing software (they call that kind of software "proxyware"). The companies who resell this bandwidth are bound to have a price policy that is proportional to the volume of data being transferred. There is just [one exception](https://www.proxyrack.com/unmetered-residential-proxies/).
When a Forgejo instance is hosted on a host with unlimited bandwidth, it has the opportunity to strike back by sending large volumes of data to IPs that are suspected to participate in a DDoS. The person running the proxyware is happy: that's more money for them as it uses more bandwidth. The company selling residential proxy services is happy: they can bill more to their customers. The customer, however, is unhappy because they not only get garbage data, it costs them orders of magnitude more money to crawl the Forgejo instance.
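A back-of-the-envelope check of this asymmetry, using the roughly 0.3-5 USD/GB resale range and the ~10MB-per-download figure mentioned in this thread (purely illustrative arithmetic, not measured costs):

```python
# Illustrative numbers from the thread: residential proxy bandwidth resells
# for roughly 0.3-5 USD/GB, and proxyware operators are paid >= 0.1 USD/GB.
price_per_gb_low, price_per_gb_high = 0.3, 5.0

# If each suspected-DDoS request is answered with ~10 MB of garbage data,
# 1 million requests cost the crawler's operator:
garbage_gb = 1_000_000 * 10 / 1024
cost_low = garbage_gb * price_per_gb_low
cost_high = garbage_gb * price_per_gb_high
print(f"{garbage_gb:.0f} GB -> {cost_low:.0f} to {cost_high:.0f} USD")
```

Even at the low end of the price range, the per-GB billing means the cost of crawling scales with the volume the defender chooses to send back.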
Gusted (Owner) commented 2025-04-21 21:40:50 +02:00
Do these IPs follow redirects?
earl-warren (Author) commented 2025-04-21 23:37:27 +02:00
The client sends the request to the proxy so it is up to them to either follow the redirect or not: I don't think it depends on the proxy service.
Gusted (Owner) commented 2025-04-21 23:40:42 +02:00
I see, otherwise an interesting strategy might be to either (1) If they follow the redirect: when you detect these IPs show them the door to https://hil-speed.hetzner.com/10GB.bin (2) If they don't follow the redirect: add in the reverse proxy that all GET requests must have some parameter `i-am-not=a-bot` and those not having that parameter return a redirect to add that parameter.
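Option (2) could be sketched as a tiny check in front of the application (the `i-am-not=a-bot` parameter name is from the comment; everything else is a simplified stand-in for real reverse-proxy logic):

```python
from urllib.parse import urlsplit, parse_qs, urlencode

MARKER = ("i-am-not", "a-bot")

def handle_get(url: str):
    """Return (status, value): 302 plus the redirect target when the marker
    parameter is missing, or 200 plus the URL itself when it is present."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    if query.get(MARKER[0]) == [MARKER[1]]:
        return 200, url
    # missing marker: redirect to the same URL with the parameter appended
    query[MARKER[0]] = [MARKER[1]]
    redirect = parts._replace(query=urlencode(query, doseq=True)).geturl()
    return 302, redirect

status, target = handle_get("https://example.org/repo/src")
assert status == 302
status, _ = handle_get(target)
assert status == 200
```

A browser follows the 302 transparently and never notices; a naive bot that ignores redirects never receives content, which is precisely the signal being probed for here.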
Beowulf (Member) commented 2025-04-21 23:43:24 +02:00
@Gusted wrote in https://codeberg.org/forgejo/discussions/issues/320#issuecomment-3890066:
> If they follow the redirect: when you detect these IPs show them the door to https://hil-speed.hetzner.com/10GB.bin
I thought of that too after hearing your question about redirecting. And I think another benefit of this is that if a real user is accidentally affected by the redirect, the user would probably cancel the download 1) directly because of the file name or 2) because the download file is quite large.
sclu1034
commented
2025-04-22 00:26:36 +02:00
Member
Copy link
> the user would probably cancel the download
Isn't the whole point of those things that the traffic is transparent to the user? I doubt they get to watch it happen live in their browser, with the option to cancel operations any time.
If anything, I'd expect them to be scratching their head at why their mobile data is suddenly being drained.
earl-warren
commented
2025-04-22 10:18:36 +02:00
Author
Copy link
@Gusted wrote in https://codeberg.org/forgejo/discussions/issues/320#issuecomment-3890066:
> I see, otherwise an interesting strategy might be to either (1) If they follow the redirect: when you detect these IPs show them the door to https://hil-speed.hetzner.com/10GB.bin
It would work, but they won't download that huge file. They are currently provided with a 500MB random file which never gets downloaded in full. The download stops at some point and averages an estimated ~10MB per download.
> (2) If they don't follow the redirect: add in the reverse proxy that all GET requests must have some parameter `i-am-not=a-bot` and those not having that parameter return a redirect to add that parameter.
That's an interesting idea. It is worth trying to figure out if they follow redirects or not indeed.
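For reference, a random decoy file like the 500MB one described above can be generated with standard tools (the output path is an assumption, not from the thread):

```shell
# Generate a 500MB file of random bytes to serve to suspected crawler IPs.
dd if=/dev/urandom of=decoy.bin bs=1M count=500
```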
earl-warren
commented
2025-04-22 10:23:47 +02:00
Author
Copy link
@sclu1034 wrote in https://codeberg.org/forgejo/discussions/issues/320#issuecomment-3890300:
> > the user would probably cancel the download
> Isn't the whole point of those things that the traffic is transparent to the user?
It depends. 2024 research suggests the bulk of the exit nodes are provided voluntarily, in exchange for a monetary compensation (e.g. https://www.honeygain.com/). Mobile apps that include a proxy that the user is generally unaware of (https://infatica.io/ SDK), are less common and significantly more expensive to the customer.
> I doubt they get to watch it happen live in their browser, with the option to cancel operations any time. If anything, I'd expect them to be scratching their head at why their mobile data is suddenly being drained.
As mentioned above, the current code.forgejo.org mitigation shows the download gets interrupted (average of ~10MB).
sclu1034
commented
2025-04-22 11:19:45 +02:00
Member
Copy link
> the bulk of the exit nodes are provided voluntarily
Yes, but still as a background process, right? "the user would probably cancel the download [...] because of the file name" would require that the user is aware of every single request, or that the proxy behaves like an interactive browser that prompts about file downloads.
earl-warren
commented
2025-04-22 11:42:11 +02:00
Author
Copy link
I would be surprised if it was not in the background. But the point is moot since the download is interrupted spontaneously by the software anyway.
Gusted
commented
2025-04-23 15:34:35 +02:00
Owner
Copy link
@earl-warren wrote in https://codeberg.org/forgejo/discussions/issues/320#issuecomment-3893051:
> It would work, but they won't download that huge file. They are currently provided with a 500MB random file which never gets downloaded in full. The download stops at some point and averages an estimated ~10MB per download.
Yes, but it was mainly if they follow redirects - then you are not saturating your network link.
earl-warren
commented
2025-04-23 17:46:47 +02:00
Author
Copy link
A redirect experiment was conducted with the [ongoing DDoS](https://codeberg.org/forgejo/discussions/issues/331): it does not follow redirects.
bkil
commented
2025-04-23 18:04:14 +02:00
Copy link
You can't reliably detect abusers based on IP alone. The present form of specifying consumer ranges is banning real humans from the poorest countries where a larger fraction of population has such apps installed than in richer countries.
At the same time, if you have confirmed that bots do not follow redirects, you do have a very powerful and cheap way to discriminate between bots & humans: just redirect _every_ page requested to its real content (having a filename or query parameter protected by a cheap HMAC for example). Ensure that one hitting an entry page is not served any other entry page until the redirect is followed and the target itself is only served after a few seconds of delay (4.2s?). Hardcode this delay into the interface (via JS or refresh meta) and if the client is fetching this sooner, ban them.
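The "cheap HMAC" part of this idea could look like the following in shell, using `openssl` (the secret, the `t` parameter name and the token format are made up for illustration):

```shell
#!/bin/sh
# Sketch: derive a per-path token. The entry page would redirect to the
# URL carrying it; the server recomputes the HMAC on each request and
# only serves the real content when it matches.
# SECRET is a server-side value, illustrative here.
SECRET="change-me"
path="/forgejo/forgejo/releases"

# openssl prints "(stdin)= <hex>"; the second field is the digest.
token=$(printf '%s' "$path" | openssl dgst -sha256 -hmac "$SECRET" | awk '{print $2}')
echo "$path?t=$token"
```

Because the token is derived only from the path and a server-side secret, verification is stateless and cheap on the server side.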
bkil
commented
2025-04-23 18:05:57 +02:00
Copy link
Also, a smart bot author may actually follow one level of redirection, but as multiple-redirection is rare, they might have messed up something about _that_. So redirecting a random number of times between 1-3 might be even more robust.
bkil
commented
2025-04-23 18:17:48 +02:00
Copy link
By the way, aren't some existing proxy services priced by the number of page fetches? Then 3 redirects will actually 4x _their_ expenses for a negligible increase of real users over mobile data billed by bandwidth.
earl-warren
commented
2025-04-24 08:44:16 +02:00
Author
Copy link
> By the way, aren't some existing proxy services priced by the number of page fetches?
None that I could find. Proxy services universally recruit exit nodes with a per-GB monetary incentive.
> You can't reliably detect abusers based on IP alone. The present form of specifying consumer ranges is banning real humans from the poorest countries where a larger fraction of population has such apps installed than in richer countries.
As the criterion for running an exit node is having access to unmetered bandwidth, I'm not sure this is an accurate statement. But it is true that blocking large IP ranges can block legitimate users, so it is not a universal solution.
earl-warren
commented
2025-04-24 09:03:32 +02:00
Author
Copy link
@Gusted wrote in https://codeberg.org/forgejo/discussions/issues/320#issuecomment-3890066:
> (2) If they don't follow the redirect: add in the reverse proxy that all GET requests must have some parameter `i-am-not=a-bot` and those not having that parameter return a redirect to add that parameter.
Since the DDoS does not follow redirects, that's worth implementing.
sclu1034
commented
2025-04-24 09:29:57 +02:00
Member
Copy link
Maybe a more transparent solution to a query param would be a new cookie, or expanding one of the existing ones.
If the cookie doesn't verify, redirect to a specific URI that sets/changes the cookie and redirects back.
With the query param showing up in everyone's browser address bar, real users are bound to get confused, and the people running scrapers could easily see what's different to their non-interactive session.
But if one of the existing cookies was changed to add some extra data, this would be transparent to real users and hard(er) to track by scrapers.
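A sketch of this cookie variant, again assuming nginx (the cookie name and challenge path are invented, and the `back` parameter would need validation and escaping to avoid an open redirect):

```nginx
# Sketch: clients without the marker cookie are bounced through a
# challenge URI that sets it and sends them back.
location = /_challenge {
    add_header Set-Cookie "not_a_bot=1; Path=/; HttpOnly";
    return 302 $arg_back;
}
location / {
    if ($cookie_not_a_bot = "") {
        return 302 /_challenge?back=$request_uri;
    }
    proxy_pass http://forgejo;
}
```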
bkil
commented
2025-04-24 11:00:59 +02:00
Copy link
> None that I could find. Proxy recruit exit nodes with per GB money incentive, universally.
1. The first hit on web search
`oneproxy-pro/prices/`
> ROTATING DATACENTER PROXIES
> COUNTRY NUMBER OF REQUESTS TRAFFIC PRICE ORDER
> World Mix 3 million Unlimited $39/month
> We offer specialized proxy models like rotating proxies, which operate on a pay-per-request model. This ensures that you can efficiently bypass IP rate limiting issues, scrape data, and manage multiple accounts effortlessly.
2. It doesn't matter much what is reimbursed to exit providers. How to resell it further is a business decision, so it can be any combination, really. Hence, it is safest to make it harder both per-request and per-MB.
Again, as of 2025, I know of no consumer mobile broadband that would be charged per request instead of per MB. So if you could loosen up on the MB load, you could ease up on real humans while you could still ensure it won't get that much cheaper for abusers.
> As the criterion for running an exit node is having access to unmetered bandwidth, I'm not sure this is an accurate statement.
Most wired home ISPs are practically unmetered or have a large volume allowance. At the same time, due to the IP shortage, it is much rarer to assign a fixed static IP to home users, so it may itself rotate every day (WAN DHCP leases are usually assigned for 24h over here, although **some**, but not all, of the providers extend your existing lease as long as you are powered on).
> But it is true that blocking large IP ranges leads to potentially blocking legitimate users which is not a universal solution.
Being banned from random places from time to time due to closeness of any of my IPs or fingerprints is my true life story. This is one thing that Big Tech does better than small artisan hosts unfortunately, so making a step in the wrong direction by Codeberg would not be a great move.
bkil
commented
2025-04-24 11:14:23 +02:00
Copy link
Proxies served through mobile broadband also exist, which is quite the opposite of unlimited for the exit; hence such packages also cost more:
`research-aimultiple-com/proxy-pricing/`
Mitigation technique to add:
* https://en.wikipedia.org/wiki/List_poisoning
* https://techcrunch.com/2024/01/26/nightshade-the-tool-that-poisons-data-gives-artists-a-fighting-chance-against-ai/
Wigle starts to return more and more garbage through its API depending on threat detection, instead of outright banning or refusing requests, so it is more difficult (or even impossible) for an attacker to notice. Have you considered randomly generating fake code and text from a distribution which is difficult to automatically distinguish from real data when serving such requests? The abusers would be quick to learn how to put Gitea/Forgejo instances on their ignore list if it had out-of-the-box poisoning capabilities!
`zyte-com/pricing/`
> Automatic proxy rotation & retries: Replacement of blocked IPs or retries is automated to ensure the highest success rates - no more wasted time manually managing your IPs
> Smart ban detection: Built-in solution for an extensive, ever-growing database of known site bans for automatic ban detection
earl-warren
commented
2025-04-24 15:23:43 +02:00
Author
Copy link
@bkil wrote in https://codeberg.org/forgejo/discussions/issues/320#issuecomment-3932156:
> > None that I could find. Proxy recruit exit nodes with per GB money incentive, universally.
> 1. The first hit on web search
> `oneproxy-pro/prices/`
> > ROTATING DATACENTER PROXIES
> > COUNTRY NUMBER OF REQUESTS TRAFFIC PRICE ORDER
> > World Mix 3 million Unlimited $39/month
> > We offer specialized proxy models like rotating proxies, which operate on a pay-per-request model. This ensures that you can efficiently bypass IP rate limiting issues, scrape data, and manage multiple accounts effortlessly.
The two DDoS against code.forgejo.org for which data was collected amount to around 3 million unique IPs, and the majority of them are part of IP ranges apparently associated with residential areas. There are IPs originating from datacenters but they are not the majority.
In the [inventory of residential proxy providers](https://codeberg.org/forgejo/discussions/issues/320#issuecomment-3857304), there are a few that provide unlimited bandwidth. But none of those also offer exit nodes in residential areas.
> 2. It doesn't matter much what is reimbursed to exit providers. How to resell it further is a business decision, so it can be any combination, really.
I have yet to find a residential proxy provider that made a different business decision; if you find one I'd be very interested.
A [2024 study](https://www.orangecyberdefense.com/global/blog/research/residential-proxies) tends to confirm that pricing per GB is the dominant business model.
earl-warren
commented
2025-04-24 15:32:19 +02:00
Author
Copy link
> so making a step in the wrong direction by Codeberg would not be a great move.
There may be some confusion about the scope of my own comments. My first-hand experience, the data I observe and the measures that were taken are exclusively focused on the infrastructure that is independent of Codeberg and 100% dedicated to the Forgejo project. I do not have access to the Codeberg infrastructure and have no knowledge to share.
The scope of the discussion is broader than just the Forgejo infrastructure since it appears this is a problem that all instances are facing. By sharing the data we have and the strategies we try, I'm hoping solutions will emerge. I certainly learned a lot on that topic over the past few weeks, although this shady proxy business model started to grow almost a decade ago.
bziemons
referenced this issue
2025-04-25 11:34:11 +02:00
17 April - Ongoing DDoS on code.forgejo.org (updated 2 May)
#331
earl-warren
referenced this issue from forgejo/website
2025-04-27 11:04:39 +02:00
Monthly Update for April 2025
#571
earl-warren
referenced this issue
2025-04-30 15:11:28 +02:00
Effective countermeasure against excessive crawling of Forgejo instances
#339
earl-warren
commented
2025-04-30 15:13:16 +02:00
Author
Copy link
For the record, [a draft of a blog post](https://codeberg.org/forgejo/discussions/issues/339) explaining what was done for code.forgejo.org and why was posted just now for discussion.
frnmst
commented
2025-05-08 16:17:46 +02:00
Copy link
Hi. I want to update what's happening since I [commented the #297 issue](https://codeberg.org/forgejo/discussions/issues/297#issuecomment-3102148).
So, after ~1.5 months I removed the DNS black hole of the targeted subdomain. I also removed the per-country whitelists I implemented using IPFire. I still use Anubis and IP reputation lists.
It's been several days and I can see 5 or 6 daily 404 hits to the old links (most of them were Git mirrors which I put private some time ago). This is practically a total stop of the DDoS attack. Earlier on I retained those blocks for a week but the DDoS was still going on.
I'll keep an eye on the logs and will update if I see new anomalies.
strobeltobias
commented
2025-05-22 22:04:36 +02:00
Copy link
Hi, I recently noticed that RSS feeds for repositories hosted on code.forgejo.org I subscribed to do not get updated in my feed reader anymore. The feed https://code.forgejo.org/forgejo/runner/releases.rss is behind Anubis and therefore cannot be crawled. For feeds of repositories on codeberg.org this does not seem to be the case.
Is this by intention or can these get excluded?
I subscribe to release feeds to be notified of new software versions I should update to.

(screenshot attached: image.png, 46 KiB)
Beowulf
commented
2025-05-22 22:16:55 +02:00
Member
Copy link
@strobeltobias two different teams, two different infrastructures.
If you are able, try changing the user agent to something not including Mozilla.
earl-warren
commented
2025-05-23 08:00:19 +02:00
Author
Copy link
> The feed https://code.forgejo.org/forgejo/runner/releases.rss is behind Anubis and therefore it can not be crawled.
How do you crawl this URL? As @Beowulf wrote, only User-Agents that contain Mozilla or Opera are gated by Anubis. If there are RSS crawlers that do not have a distinctive User-Agent, maybe they can be allowed in a different way.
sclu1034
commented
2025-05-23 11:47:36 +02:00
Member
Copy link
@strobeltobias wrote in https://codeberg.org/forgejo/discussions/issues/320#issuecomment-4809003:
> Hi, I recently noticed that RSS feeds for repositories hosted on code.forgejo.org I subscribed to do not get updated in my feed reader anymore. The feed https://code.forgejo.org/forgejo/runner/releases.rss is behind Anubis and therefore it can not be crawled. For feeds of repositories on codeberg.org this does not seem to be the case. Is this by intention or can these get excluded? I subscribe to release feeds to be notified of new software versions I should update to. 
Miniflux advertises a browser-like user agent by default, which Anubis matches against, as mentioned above.
You'll have to set a custom user agent for that feed in its settings, e.g. just `Miniflux/2.2.8`, or change the global default via `HTTP_CLIENT_USER_AGENT`.
fnetX
commented
2025-05-23 11:56:49 +02:00
Owner
Copy link
Maybe the RSS routes can be excluded from Anubis? They can be very heavy, but I assume that a crawler that only sees Anubis will never actually reach the RSS feed URL.
At Codeberg, we only blocked issue filters and crawling repo by commits basically, and it solved the problem for us.
strobeltobias
commented
2025-05-23 13:28:12 +02:00
Copy link
@sclu1034 wrote in https://codeberg.org/forgejo/discussions/issues/320#issuecomment-4814151:
> Miniflux advertises a browser-like user agent by default, which Anubis matches against, as mentioned above.
> You'll have to set a custom user agent for that feed in its settings, e.g. just `Miniflux/2.2.8`, or change the global default via `HTTP_CLIENT_USER_AGENT`.
Exactly, I use Miniflux. I changed the user agent used for that feed, which is shielded by Anubis, to something not containing "Mozilla". After this, it works again.
Though as @fnetX wrote in https://codeberg.org/forgejo/discussions/issues/320#issuecomment-4814193:
> Maybe the RSS routes can be excluded from Anubis? They can be very heavy, but I assume that a crawler that only sees Anubis will never actually reach the RSS feed URL.
> At Codeberg, we only blocked issue filters and crawling repo by commits basically, and it solved the problem for us.
Maybe this would be a better solution, so as not to require users to change their feed reader configs? I am not aware that every feed reader allows changing the user agent used for fetching feeds.
frnmst
commented
2025-05-23 14:00:59 +02:00
Copy link
> Maybe this would be a better solution, since it would not require users to change their feed reader configs? Not every feed reader allows changing the user agent used for fetching feeds.

In my case, in fact, Nextcloud's feed reader does not have that option. I had to write a simple Python script that fetches the RSS feeds with a custom user agent for each feed, and then serve the XML feed files with Apache.
Essentially it's a bridge; I have been using this approach for years, for example to bypass DNS blocks.
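A minimal sketch of such a bridge, assuming Python's standard `urllib` and a directory served statically by Apache (the function name, feed list, and paths below are illustrative assumptions, not frnmst's actual script):

```python
# Hypothetical sketch of the "bridge" described above: fetch each feed
# with a custom, non-browser user agent and write the XML where a static
# web server (e.g. Apache) can serve it. Names, URLs, and paths are
# illustrative.
import urllib.request
from pathlib import Path


def mirror_feeds(feeds, out_dir, user_agent="feed-bridge/1.0"):
    """Fetch each (name, url) pair and store the body as <name>.xml in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, url in feeds:
        # The custom User-Agent avoids the "Mozilla" match in Anubis.
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(req) as resp:
            (out / f"{name}.xml").write_bytes(resp.read())


# Example (would fetch over the network):
# mirror_feeds([("forgejo-releases",
#                "https://code.forgejo.org/forgejo/forgejo/releases.rss")],
#              "/var/www/feeds")
```

Run from cron, the feed reader then points at the static copies instead of the gated instance.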
earl-warren
commented
2025-05-24 10:26:49 +02:00
Author
RSS feeds will bypass Anubis on https://code.forgejo.org. It needs a configuration change.
earl-warren
commented
2025-05-25 11:14:28 +02:00
Author
`.rss` files are no longer gated by Anubis on code.forgejo.org ([202 is the status set for challenges](https://codeberg.org/forgejo/k8s-cluster/src/commit/cca742b68a545bae290ebd8b8c1c106d5cf59a91/flux/apps/forgejo-code/anubisBotPolicy.json#L9-L12))
```sh
earl-warren:~$ curl -o /dev/null -w "%{http_code}\n" -sS -H "user-agent: Mozilla" https://code.forgejo.org/forgejo/forgejo/releases
202
earl-warren:~$ curl -o /dev/null -w "%{http_code}\n" -sS -H "user-agent: Mozilla" https://code.forgejo.org/forgejo/forgejo/releases.rss
200
```
Albirew
commented
2025-06-02 00:09:21 +02:00
you may want to change `\.rss$` into `\.(rss|atom)$` to also allow Atom feeds...
then again, anyone can get past Anubis on any page this way using `forgejo.tld/someone/someproject/?lol=lol.rss`.
Matching the whole path would be better: `^/[.A-Za-z0-9_-]{1,256}?[.\/A-Za-z0-9_-]*\.(rss|atom)$`
earl-warren
commented
2025-06-02 08:09:53 +02:00
Author
> you may want to change \.rss$ into \.(rss|atom)$ to also allow atom feeds...
Good catch, this will be done today.
> then again, anyone can get past anubis on any page ...
True. Let's keep it simple until the problem shows.
bkil
commented
2025-06-02 09:35:34 +02:00
I think the regexp you are looking for could be `^[^?]+\.(atom|rss)(\?.*)?$`
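A quick way to sanity-check that pattern is to run it against sample request paths (the snippet and the paths are illustrative):

```python
# Checking the proposed Anubis exclusion pattern against sample paths.
import re

feed_path = re.compile(r"^[^?]+\.(atom|rss)(\?.*)?$")

print(bool(feed_path.match("/forgejo/forgejo/releases.rss")))          # True
print(bool(feed_path.match("/forgejo/forgejo/releases.atom?page=2")))  # True

# The "?lol=lol.rss" trick mentioned above no longer slips through,
# because [^?]+ stops at the query string:
print(bool(feed_path.match("/someone/someproject/?lol=lol.rss")))      # False

# By contrast, the plain suffix check is exactly what that trick exploits:
print(bool(re.search(r"\.(rss|atom)$", "/someone/someproject/?lol=lol.rss")))  # True
```

The difference is that `[^?]+` forces the `.rss`/`.atom` suffix to be part of the actual path, not the query string.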
mahlzahn
commented
2025-06-02 14:52:47 +02:00
Member
@bkil wrote in https://codeberg.org/forgejo/discussions/issues/320#issuecomment-4967775:
> I think the regexp you are looking for could be `^[^?]+\.(atom|rss)(\?.*)?$`
I don't know how the requests are resolved, but `#` may also need to be excluded.
bkil
commented
2025-06-02 15:57:38 +02:00
@mahlzahn An HTTP client does not pass in the URI fragment in the path component of the request line.
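The standard URL parser makes this easy to see: the fragment is split off on the client side and never becomes part of the request path (the URL below is illustrative):

```python
# The fragment (#...) is kept by the client and never sent to the server,
# so a path-matching rule never needs to account for it.
from urllib.parse import urlsplit

parts = urlsplit("https://code.forgejo.org/forgejo/forgejo/releases.rss#latest")
print(parts.path)      # /forgejo/forgejo/releases.rss
print(parts.fragment)  # latest
```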
earl-warren
commented
2025-06-02 21:27:44 +02:00
Author
@earl-warren wrote in https://codeberg.org/forgejo/discussions/issues/320#issuecomment-4967277:
> > you may want to change .rss$ into .(rss|atom)$ to also allow atom feeds...
> Good catch, this will be done today.
Done in the most basic way. If crawlers get clever and abuse this, it will be time to counter with the proposed regexp. Until then, simplicity will rule.
```sh
$ curl -o /dev/null -w "%{http_code}\n" -sS -H "user-agent: Mozilla" https://code.forgejo.org/forgejo/forgejo/releases.atom
200
```
fnetX
commented
2025-06-05 19:37:12 +02:00
Owner
> then again, anyone can get past anubis on any page
Anubis cannot efficiently protect against targeted campaigns anyway. I have many ideas on how to work around it (e.g. by solving the challenge once on a dedicated node, and then replicating the cookie to the simple crawlers). Even if not all of my ideas work, one of them surely does.
viceice
commented
2025-06-05 19:43:43 +02:00
Owner
@fnetX wrote in https://codeberg.org/forgejo/discussions/issues/320#issuecomment-5011251:
> > then again, anyone can get past anubis on any page
> Anubis cannot efficiently protect against targeted campaigns anyway. I have many ideas on how to work around it (e.g. by solving the challenge once on a dedicated node, and then replicating the cookie to the simple crawlers). Even if not all of my ideas work, one of them surely does.
the cookie contains the source IP, so each crawler needs to solve the challenge once
LWFlouisa
commented
2025-07-13 07:18:31 +02:00
What I find strange is that even my own instance, which isn't public (mostly because I currently have no way of putting it online), slows down drastically and overheats.
Is it possible the scrapers are even reaching instances that only listen on localhost?
frnmst
commented
2025-07-14 12:16:16 +02:00
@LWFlouisa wrote in https://codeberg.org/forgejo/discussions/issues/320#issuecomment-5806853:
> What I find strange is that even my own instance, which isn't public (mostly because I currently have no way of putting it online), slows down drastically and overheats.
> Is it possible the scrapers are even reaching instances that only listen on localhost?
Definitely not if you are not proxying the requests, but you can check the logs.
earl-warren
referenced this issue from forgejo/website
2025-07-21 12:41:05 +02:00
Monthly Report for July 2025
#609
earl-warren
referenced this issue
2025-11-25 18:28:03 +01:00
Crawlers hitting Forgejo instances - global abuse trend returns
#421
earl-warren
referenced this issue
2025-11-27 17:16:06 +01:00
Crawlers hitting Forgejo instances - global abuse trend returns
#421
forgejo-actions
referenced this issue from forgejo/website
2025-11-27 18:11:37 +01:00
Dead links report
#529
earl-warren
referenced this issue from forgejo/website
2025-12-01 15:21:06 +01:00
Monthly report for November 2025
#664
forgejo-actions
referenced this issue from forgejo/website
2026-01-08 18:02:12 +01:00
Dead links report
#529