GFWatch: A Longitudinal Measurement Platform Built to Monitor China’s DNS Censorship at Scale

The following blog post is authored by Information Controls Fellowship Program (ICFP) fellow Hoàng Nguyên Phong. As an ICFP fellow, Phong's research examines China’s DNS filtering mechanism, one of the…

Thu, 2021-11-04 13:00

China’s sophisticated filtering system, known as the Great Firewall (GFW), is the region’s biggest impediment to the freedom of information. The GFW is built by the Chinese government and is continuously developed to serve their political interests. In this report, we introduce the design of GFWatch, a large-scale longitudinal measurement platform that informs the public about how GFW censorship changes over time and its negative impact on the free flow of information.

Key findings

We developed GFWatch, a large-scale, longitudinal measurement platform capable of testing hundreds of millions of domains daily, enabling continuous monitoring of the GFW’s DNS filtering behavior.

– From April to December 2020, GFWatch tested a total of 534M distinct domains (averaging 411M domains per day) and detected more than 300K censored domains. To the best of our knowledge, this is the largest number of domains tested and censored domains discovered in existing literature.
– We designed a probing method to reverse-engineer the actual blocklist used by the GFW’s DNS filter and identified 41K innocuous domains that are overblocked because they match the regular expressions used by the GFW.
– Through our measurements, we discovered more than 3.6K unique forged IPv4 and IPv6 addresses. All IPv6 addresses are bogus and belong to the same subnet of the predefined Teredo prefix, 2001::/32, whereas the vast majority of IPv4 addresses belong to U.S. companies, including Facebook, Dropbox, and Twitter.
– Using data from GFWatch, we also assessed the impact of GFW’s DNS censorship on the global DNS system. We found 77K censored domains whose forged DNS resource records have polluted many popular public DNS resolvers, including Google, Cloudflare, and OpenDNS.
– Finally, we propose strategies to detect poisoned responses that can sanitize polluted DNS records from the cache of public DNS resolvers, and assist in the development of circumvention tools to bypass the GFW’s DNS censorship.
– For data exploration, we built an interactive dashboard that can be used to search for censored domains and fake IP addresses injected by the GFW that GFWatch has discovered.

Summary

Among the censorship regimes on the Internet, China is one of the most notorious, having developed an advanced filtering system, known as the Great Firewall (GFW), to control the flow of online information. The GFW’s worldwide reputation and ability to be measured from outside the country has drawn the attention of researchers from various disciplines, ranging from political science to information and computer sciences.

Despite many previous studies that examine the technical strategies employed by the GFW, such as TCP/IP packet filtering and DNS poisoning, there has yet to be a large-scale, longitudinal examination of China’s DNS filtering mechanism. DNS filtering is a process of interfering with DNS resolutions, which are used to map human-memorable domain names (e.g., citizenlab.ca) to their hosting IP address(es), to prevent access to undesired content. The lack of visibility into how the GFW’s DNS filtering works is apparent in the differing numbers of censored domains and forged IP addresses reported by previous studies. In particular, the number of fake IPs observed in poisoned responses has increased from nine in 2010 to 28 in 2011, 174 in 2014, and more than 1.5K in 2019. A system for continuous, long-term monitoring of the GFW’s filtering policy is needed to provide timely insights about its blocking behavior and assist censorship detection and circumvention efforts. We developed GFWatch, a large-scale longitudinal measurement platform designed to shed light on DNS filtering by the GFW and assess its impact on the global Internet. By building GFWatch, our primary goal is to answer the following questions:

1. How many censored domains are there?
2. What are the forged IP addresses used in fake DNS responses?
3. What is the impact of the GFW’s DNS censorship on the global Internet?
4. What are potential strategies to effectively detect and circumvent the GFW’s DNS censorship?

Since the launch of GFWatch in 2020, we have detected numerous blocking events that coincide with Beijing’s information control policy on sensitive topics, including Hong Kong democracy movements, Uyghur human rights abuses, religious freedom, and COVID-19. In one instance, the GFW blocked access to a US-based online shopping platform (storenvy.com) because it hosted a store selling Uyghur cultural products, even though the platform is owned by Alibaba. This case illustrates the tension between censorship and expansion: the government supports Chinese companies growing globally, but such support is conditional on falling in line with restrictive content regulations.

GFWatch is able to capture data to tell these stories and at a wider level reveal how DNS censorship works on the GFW and its impact on the global Internet.

GFWatch Architecture

We designed GFWatch according to the following requirements:

– The platform should be able to discover as many censored domains and forged IPs as possible in a timely manner. More specifically, GFWatch should be able to obtain and test new domain names as they appear on the Internet.

– As a longitudinal measurement platform, once a domain is discovered to be censored, GFWatch should continuously keep track of its blocking status to determine whether the domain stays censored or becomes unblocked at some point in the future.

– By measuring many domains with sufficient frequency, GFWatch is expected to provide us with a good view into the pool of forged IPs used by the GFW.

Test Domains

We are interested in the timely discovery of as many censored domains as possible because we hypothesize that the GFW does not block just well-known domains (e.g., facebook.com, twitter.com, tumblr.com) but also less popular or even unranked ones that are of interest to smaller groups of at-risk people (e.g., political dissidents, minority ethnic groups), who are often suppressed by local authorities. Therefore, we opt to curate our test list from top-level domain (TLD) zone files obtained from various sources, including Verisign and the Centralized Zone Data Service operated by ICANN, which we refresh on a daily basis. Using zone files not only provides us with good coverage of domain names on the Internet, but also helps us to fulfill the first design goal of GFWatch, which is the capability to test new domains as they appear on the Internet. Since TLD zone files contain only second-level domains (SLDs), they do not allow us to observe cases in which the GFW censors subdomains of these SLDs. As we show later, many subdomains (e.g., scratch.mit.edu, nsarchive.gwu.edu, cs.colorado.edu) are censored but their SLDs (e.g., mit.edu, gwu.edu, colorado.edu) are not. We therefore complement our test list with domains from the Citizen Lab test lists (CLTL), the Tranco list, and the Common Crawl project. Between April and December 2020, we tested a total of 534M domains from 1.5K TLDs, with an average of 411M domains tested daily.

Measurement Methodology

When filtering DNS traffic, the GFW does not consider the direction of request packets. As a result, even DNS queries originating from outside the country can trigger the GFW if they contain a censored domain, making this behavior a popular topic for measurement studies. Based on the observation of this filtering policy, we design GFWatch to probe the GFW from outside of China to discover censored domains and verify their blockage again from our controlled machines located in China to validate our findings.

Because prior works have shown that the GFW does not filter DNS traffic on ports other than the standard port 53, we design our probe queries to use this standard destination port. We observe that for major UDP-based DNS query types (e.g., A, CNAME, MX, NS, TXT), the GFW injects forged responses that carry an IPv4 address for type A queries and a bogus IPv6 address for type AAAA queries.

For TCP-based queries that carry censored domains, RST packets are injected instead of DNS responses. Since UDP is the default protocol for DNS in most operating systems, we choose to probe the GFW with UDP-based queries. While using both TCP-based and UDP-based queries would still allow us to detect censored domains, we opt to use UDP-based queries because they also allow us to (1) collect the forged IPs used in the injected DNS responses and (2) conduct our measurement at scale, which would be otherwise more challenging to achieve because a TCP-based measurement at the same scale would require more computing and network resources to handle stateful network connections.

As shown in Figure 1, GFWatch’s main probe is a machine located in an academic network in the United States, where DNS censorship is not anticipated. A and AAAA DNS queries for the test domains are sent towards two hosts in China, which are under our control and do not have any DNS resolution capabilities. Therefore, any DNS responses returned to the main probe come from the GFW.

Figure 1: Probing the GFW’s DNS poisoning from outside.
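The probing step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GFWatch's actual implementation: the function names (build_query, is_dns_response, probe) and the sink_ip parameter are hypothetical, and the sketch sends a single A query rather than batching hundreds of millions of domains. The key idea it shows is that the destination host runs no DNS service, so any well-formed DNS response that comes back can only have been injected on the path.

```python
import random
import socket
import struct

QTYPE_A, QTYPE_AAAA = 1, 28


def build_query(domain: str, qtype: int = QTYPE_A, txid=None) -> bytes:
    """Encode a minimal DNS query (RFC 1035 wire format) for one domain."""
    if txid is None:
        txid = random.randint(0, 0xFFFF)
    # Header: ID, flags (only RD set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is length-prefixed; a zero byte terminates the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in domain.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # QCLASS=IN


def is_dns_response(packet: bytes, txid: int) -> bool:
    """True if the packet is a DNS response matching our transaction ID."""
    if len(packet) < 12:
        return False
    rid, flags = struct.unpack("!HH", packet[:4])
    return rid == txid and bool(flags & 0x8000)  # QR bit marks a response


def probe(domain: str, sink_ip: str, timeout: float = 2.0) -> bool:
    """Send one A query to a host with no DNS service (hypothetical sink_ip).

    Because the sink cannot answer, a matching response implies injection.
    """
    txid = random.randint(0, 0xFFFF)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(domain, QTYPE_A, txid), (sink_ip, 53))
        try:
            packet, _ = sock.recvfrom(4096)
        except OSError:  # timeout or ICMP error: nothing was injected
            return False
    return is_dns_response(packet, txid)
```

Since UDP packets can be lost, a single False from probe() is not conclusive on its own; a caller would repeat the query several times before treating a domain as uncensored.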

After the main probe completes each probing batch, detected censored domains are transferred to the Chinese hosts and probed again from inside China towards our control machine, as shown in Figure 2. This way, we can verify that censored domains discovered by our probe in the US are also censored inside China.

Figure 2: Verifying poisoned domains from inside the GFW.

Since GFWatch is designed to probe using UDP, which is a stateless and unreliable protocol, packets may get lost due to factors that are not under our control (e.g., network congestion). Moreover, previous studies have reported that the GFW sometimes fails to block access when it is under heavy load. Therefore, to minimize the impact of these factors on our data collection, GFWatch tests each domain at least three times a day. As of September 2021, GFWatch tests more than 600M fully qualified domains on a daily basis.

Censored Domains

Identifying Blocking Rules

Analyzing the nine-month data collected by GFWatch in 2020, we discovered more than 300K domains triggering the GFW’s DNS censoring capability. However, many fully-qualified domain names are censored due to the blocking of the same second-level domain. For instance, foo.googlevideo.com and bar.googlevideo.com are blocked because all domains under *.googlevideo.com (i.e., any subdomains of googlevideo.com) are blocked. Therefore, to estimate the number of censored domains more accurately, we designed a probing method by testing eight different combinations of each censored domain with random strings. These eight rules are shown in Figure 3.

Figure 3: Probing GFW’s DNS blocking rules.

Among these rules, only Rules 1 and 3 are correct forms of a domain with a different top-level domain (Rule 1) or subdomain (Rule 3). Rule 5 is a more general form of Rules 1 and 3 combined. In contrast, the rest represent unrelated (or non-existent) domains that happen to contain the censored domain string. We refer to censored domains that are grouped with a shorter domain string via rules other than Rules 1 or 3 as being overblocked, because they are not subdomains of the shorter domain, but are actually unrelated domains that are textually similar (e.g., the censored domains mentorproject.org and theventilatorproject.org contain the shorter domain string torproject.org, which is what actually triggers censorship). Using these rules to generate domains and testing them with GFWatch, we identify the most general form of each censored domain that triggers censorship. We refer to these shortest censored domains as the “base domains” from which the blocking rules are generated. As a result, we discovered a total of 138.7K base domains from the set of more than 300K censored domains identified earlier.
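One way to picture the eight combinations is as every pairing of an optional random prefix and an optional random suffix, each attached either with or without a separating dot. The sketch below generates such test names in Python. It is an assumption-laden illustration: the function names are hypothetical, and the order of the returned list is not meant to match the rule numbers in Figure 3, which is not reproduced here.

```python
import itertools
import random
import string


def random_label(length: int = 8) -> str:
    """Random lowercase string that is unlikely to be a real label."""
    return "".join(random.choices(string.ascii_lowercase, k=length))


def rule_probes(censored_domain: str, rnd=None) -> list:
    """Return eight test names built around one censored domain.

    Prefix and suffix can each be absent, joined with a dot (a proper
    subdomain or extra TLD), or glued on directly (an unrelated name that
    merely contains the censored string).
    """
    rnd = rnd or random_label()
    prefixes = ["", rnd + ".", rnd]
    suffixes = ["", "." + rnd, rnd]
    return [
        prefix + censored_domain + suffix
        for prefix, suffix in itertools.product(prefixes, suffixes)
        if prefix or suffix  # skip the unmodified domain itself
    ]
```

Probing each generated name and recording which ones still trigger an injected response indicates how broadly the underlying blocking rule matches, for example whether it covers only subdomains or any name ending in the censored string.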

Utilizing the base domains to identify cases of overblocking, we found 41K censored domains are overblocked. The top three base domains that cause the most overblocking are 919.com, jetos.com, and 33a.com. These three domains are responsible for a total of 15K unrelated domains being blocked because they end with one of these three base domains (and are not subdomains of them). Figure 4 provides more details on the base domains responsible for the most overblocking. Domain owners may consider refraining from registering domain names containing these base domains to avoid them being inadvertently blocked by the GFW.

Figure 4: Top base censored domains that cause most overblocking of innocuous domains.
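The distinction between a blocked subdomain and an overblocked look-alike can be expressed as a simple string check. The following Python sketch covers only the suffix case described above (a name that ends with a base domain without being a subdomain of it); the function name and the returned labels are illustrative assumptions rather than part of GFWatch.

```python
def classify(domain: str, base: str) -> str:
    """Relate a censored domain to a base domain by suffix matching."""
    domain = domain.lower().rstrip(".")
    base = base.lower().rstrip(".")
    if domain == base:
        return "base"
    if domain.endswith("." + base):
        return "subdomain"  # genuinely under the base domain
    if domain.endswith(base):
        return "overblocked"  # unrelated name that shares the suffix
    return "unrelated"
```

For instance, classify("mentorproject.org", "torproject.org") yields "overblocked", while classify("www.torproject.org", "torproject.org") yields "subdomain".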

Characterizing Censored Domains

We next characterize the 138.7K base domains identified above. We focus on these base domains to avoid the impact of domains with numerous blocked subdomains on our results. Focusing on base domains also allows us to avoid analyzing innocuous domains that are overblocked based on our previous analysis.

Popularity of censored domains. As hypothesized earlier, we find that most domains blocked by the GFW are unpopular and do not appear on lists of most popular websites. We use the rankings provided by the Tranco list, which combines four top lists (Alexa, Majestic, Umbrella, and Quantcast) in a way that makes it more stable and robust against malicious manipulations. Figure 5 shows the CDF of the popularity ranking for the 138.7K blocked base domains. Only 1.3% of them are among the top 100K most popular domains, which is the statistically significant threshold of the popularity ranking as suggested by both top-list providers and previous studies. Even when considering all domains ranked by the Tranco list, only 13.3% of the base censored domains fall within the list’s ranking range, while the remaining are unranked. This finding highlights the importance of GFWatch’s use of TLD zone files to enumerate the set of potentially censored domains.

Figure 5: CDF of the popularity ranking for base censored domains (in log scale).

Types of censored content. For domain categorization, we use a service provided by FortiGuard. Figure 6 shows the top-ten domain categories censored by the GFW. We find that nearly half of the domains we observe are not currently categorized by FortiGuard, with 40% categorized as “newly observed domain” and 5.5% categorized as “not rated.” This is a result of the large number of domains in our dataset, many of which may not be currently active. Apart from the “newly observed domain” and “not rated” categories, we find that “business,” “pornography,” and “information technology” are within the top-five dominant categories. This finding is different from previous anecdotes, in which “proxy avoidance” and “personal websites and blogs” were reported as the most blocked categories.

Figure 6: Top ten categories of domains censored by the GFW.

COVID-19 related domains. On December 19, 2020, the New York Times reported that the Chinese Government issued instructions for suppressing the free flow of information related to the COVID-19 pandemic. GFWatch has detected numerous domains related to COVID-19 being censored by the GFW through DNS tampering, including covid19classaction.it, covid19song.info, covidcon.org, ccpcoronavirus.com, covidhaber.net, and covid-19truth.info. While most censored domains are discovered to be blocked soon after they appear in our set of test domains, we found that there was some delay in blocking ccpcoronavirus.com, covidhaber.net, and covid-19truth.info. Specifically, ccpcoronavirus.com and covidhaber.net first appeared on our test lists in April 2020 but were not blocked until July and September, respectively. Similarly, covid-19truth.info appeared in our dataset in September 2020 but was not censored until October. The large difference in the time the GFW takes to censor different domains suggests that the blocklist is likely curated by both automated tools and manual efforts.

Educational domains. In 2002, Zittrain et al. reported DNS-based filtering in China of several institutions of higher education in the US, including mit.edu, umich.edu, and gwu.edu. While “education” is not one of the top censored categories, we find numerous blocked education-related domains, including armstrong.edu, brookings.edu, citizenlab.ca, feitian.edu, languagelog.ldc.upenn.edu, pori.hk, soas.ac.uk, scratch.mit.edu, and cs.colorado.edu. Although censorship against some of these domains is not surprising, since they belong to institutions well-known for conducting political science research and may host content deemed as unwanted, we are puzzled by the blocking of cs.colorado.edu. While the University of Colorado’s computer science department is not currently using this domain to host their homepage, the blocking of this domain and its entire namespace *.cs.colorado.edu would prevent students in China from accessing other department resources (e.g., moodle.cs.colorado.edu). This is another instance of the overblocking policy of the GFW, which can be harmful especially during the COVID-19 pandemic when most students need to take classes remotely.

Real-time blocking detection of other popular websites. In 2021, we detected the blocking of many popular websites. Most blocking events occurred after a website published content that could be deemed “sensitive” by the Chinese government. For instance, the blocking of csis.org, aei.org, icsin.org, fineartamerica.com, storenvy.com, and lazaron.es in April happened after Uyghur-related content was posted on these websites. Similarly, the domain festival-cannes.com has been blocked since August, after the Cannes Festival decided to screen a documentary about democracy protests in Hong Kong. In another event, shortly after the blanket ban on cryptocurrencies was issued in September, most popular Bitcoin exchanges became censored in China. These cases highlight the importance of GFWatch’s ability to operate in an automated and continuous fashion, which gives us a constantly updated view of the GFW and lets us inform the public about changes in its blocking policy in a timely manner.

Fake IP Addresses

The use of publicly routable IPs owned by foreign entities not only confuses the impacted users and misleads their interpretation of the GFW’s censorship, but also hinders straightforward detection and circumvention. Therefore, knowing the forged IPs and the pattern in which they are injected (if any) is essential. As of September 2021, GFWatch has discovered more than 3.6K unique forged IPv4 and IPv6 addresses:

– 1.8K forged IPv4 addresses are mapped to multiple ASes owned by numerous U.S. entities, including Facebook, WZ Communications Inc., Twitter, and Dropbox.

– 1.8K IPv6 addresses are bogus and belong to the same subnet of the predefined Teredo prefix, 2001::/32.

The discovery of these IP pools is useful for developing tools to bypass the GFW’s DNS censorship. Specifically, a client’s stub resolver can detect and ignore forged responses by comparing the returned IP addresses with those in the IP pools discovered by GFWatch while waiting for the legitimate response to arrive, since the GFW is designed as an on-path system and does not drop the legitimate DNS response packets. In some rare cases, injections of forged static CNAME records are also observed for a small number of censored domains. For more details about CNAME injection, please refer to §5.3 of our USENIX Security ’21 paper.
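The comparison a stub resolver would perform can be sketched with Python's standard ipaddress module. This is a hedged illustration: the function names are hypothetical, the pool passed in is assumed to be loaded from a list such as the one GFWatch publishes, and treating every AAAA answer inside 2001::/32 as forged is an assumption based on the observation above that all forged IPv6 addresses fall in that prefix.

```python
import ipaddress

# Teredo prefix in which all forged IPv6 addresses were observed.
TEREDO_PREFIX = ipaddress.ip_network("2001::/32")


def is_forged(answer_ip: str, forged_ipv4_pool: set) -> bool:
    """Check one answer against the known forged IPv4 pool or Teredo prefix.

    forged_ipv4_pool is assumed to be a set of ipaddress.IPv4Address objects.
    """
    ip = ipaddress.ip_address(answer_ip)
    if ip.version == 6:
        return ip in TEREDO_PREFIX
    return ip in forged_ipv4_pool


def first_legitimate(answers, forged_ipv4_pool):
    """Return the first answer, in arrival order, that is not known to be forged."""
    for answer in answers:
        if not is_forged(answer, forged_ipv4_pool):
            return answer
    return None  # keep waiting: only forged answers have arrived so far
```

The addresses used to exercise such a filter should come from the published pool; reserved documentation ranges such as 192.0.2.0/24 are convenient placeholders when testing the logic itself.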

Impact on Global Internet due to DNS Leakage

Previous studies have attributed leakage of DNS censorship to cases where a DNS resolver’s network path transits through China’s network. We discovered that geo-blocking and cases where censored domains have at least one authoritative name server located in China are also significant causes of pollution of external DNS resolvers and other Internet services outside China.

Specifically, we found evidence of geographic restrictions on Chinese domains, with the GFW injecting DNS replies for domains based in China. For instance, in August 2020, GFWatch detected a geo-blocking case of www.beian.gov.cn, which is managed by the Chinese Ministry of Industry and Information Technology. Two authoritative name servers (dns7.hichina.com and dns8.hichina.com) of the domain are hosted on 16 different IPs. Checking against the latest MaxMind dataset, we found that all of these IPs are located inside China. Consequently, the GFW’s DNS censorship against this domain will cause DNS queries issued from outside China to be poisoned, since all resolution paths from outside China have to cross the GFW.
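The reasoning in this example reduces to a small check: if every IP address behind a domain's authoritative name servers geolocates to China, every resolution from abroad has to pass the GFW. The Python sketch below illustrates that check under assumptions: the function name is hypothetical, and country_of stands in for whatever geolocation lookup is available (the study used MaxMind data; here it is simply a callable returning a country code).

```python
def resolution_must_cross_gfw(ns_ips, country_of) -> bool:
    """True if all authoritative name server IPs are located in China.

    ns_ips: iterable of IP address strings for the domain's name servers.
    country_of: callable mapping an IP string to an ISO country code.
    """
    countries = {country_of(ip) for ip in ns_ips}
    return countries == {"CN"}  # an empty input yields False
```

A domain with even one name server IP outside China can in principle be resolved without crossing the GFW, although the text above notes that such domains can still pollute resolvers whenever a query happens to reach a name server inside China.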

This DNS censorship leak, combined with publicly routable IP addresses being used in forged responses, impacts the global Internet in several ways. Past reports have shown that the abusive design of the GFW can lead to resource exhaustion attacks on specific IPs, making them inaccessible. In our study, we further discovered that poisoned DNS records have made their way beyond China’s borders into many critical services, ranging from open datasets to commercial historical DNS data. For example, many historical versions of the website www.beian.gov.cn archived by the Internet Archive show an error page served by Facebook.

Figure 7: A polluted version of a China-based domain geo-blocked by the GFW’s DNS filtering is stored by the Internet Archive.

Similarly, as can be seen in Figure 8, forged resource records of the same domain have also tainted the dataset of Security Trails, a popular commercial historical DNS service.

Figure 8: Poisoned DNS records by the GFW as shown in Security Trails’ historical DNS service.

Recommendations

Based on our findings, we propose the following recommendations for reducing harms caused by DNS censorship implemented by the GFW.

GFW operators. Although the widespread impact of the GFW’s DNS filtering policy is clear, as shown by our study, we are not entirely certain whether this censorship policy is intentional or accidental. While prior works have shown intermittent failures of the GFW, all geoblocking of China-based domains and overblocking of innocuous domains discovered by GFWatch have lasted for several months. This relatively long period leads us to believe that the GFW’s operators are likely aware of the global impact of their DNS filtering policy. By exposing these negative impacts on several parties outside China to the public, we hope to send a message to the GFW’s operators so that they can revise their DNS filtering policy to reduce its negative impacts beyond China’s borders.

Public DNS resolvers and other impacted DNS-based services. Poisoned DNS responses have widely polluted all popular public DNS resolvers outside China due to the geoblocking and overblocking of many domains based in China. DNSSEC was introduced more than two decades ago to assure the integrity and authenticity of DNS responses, which would address these problems. However, DNSSEC is not widely adopted because of compatibility problems and technical complications. Using the observed censored domains (see “Censored Domains”) and forged IP addresses (see “Fake IP Addresses”) collected by GFWatch, operators of impacted services can effectively detect poisoned DNS responses injected by the GFW and (retroactively) sanitize tainted records from their datasets.

Owners of forged IPs. Legitimate owners of forged IPs may want to avoid hosting critical services on these IPs, as their resources may be saturated by unsolicited TCP and HTTP(S) requests originating from clients whose DNS cache is poisoned. Currently, we do not find evidence that the GFW is actively using these forged IPs as a way to saturate computing resources of the infrastructure behind them, since there are more than 1.8K forged IPv4 addresses in the pool and most of them are dynamically injected. However, a previous report on the Great Cannon has shown that China is willing to weaponize the global Internet to mount resource exhaustion attacks on specific targets. With DNS censorship, the GFW can adjust its injection pattern to concentrate on a handful of forged IPs, resulting in a large number of requests towards these targeted IPs and thus saturating their computing resources.

Domain owners. Using our dataset of censored domains, domain owners can check whether their domain is censored and, if so, whether this is due to intended blocking or overblocking. Unless the GFW’s operators revise their blocking rules, future domain owners should try to refrain from registering domains that end with any overblocking patterns to avoid them being inadvertently blocked by the GFW.

End users. There are two potential approaches that can be used on the client side to bypass the GFW’s DNS censorship. In the first approach, a censorship-circumvention component of software can implement the hold-on strategy: wait for all DNS responses to arrive and ignore those that carry known fake IP addresses (see “Fake IP Addresses”). Another client-side strategy is to send two back-to-back DNS queries. Since the majority of censored domains are poisoned with dynamic IPs, the client can identify the legitimate responses, which typically point to the same IP (due to back-to-back queries) or the same Autonomous System (AS).
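The back-to-back strategy can be illustrated with a short Python sketch. It assumes the client has already collected every answer that arrived for each of the two queries; the function name and parameters are hypothetical, and the AS-level comparison mentioned above is left out for brevity. Answers that appear for both queries are kept as candidates for the legitimate record, since dynamically injected forged IPs usually differ from one query to the next.

```python
def legitimate_candidates(answers_q1, answers_q2, known_fake=frozenset()):
    """Return IPs seen in the responses to both back-to-back queries.

    answers_q1, answers_q2: iterables of IP strings collected per query.
    known_fake: optional set of IP strings already known to be forged.
    """
    first = set(answers_q1) - set(known_fake)
    second = set(answers_q2) - set(known_fake)
    return first & second
```

This heuristic does not help for the minority of domains poisoned with a static forged address, or when the same forged IP happens to be injected twice; passing a known_fake set derived from the forged IP pool narrows those cases.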

GFWatch Dashboard Exploration

To share our data with the public we have built a dashboard for data exploration. As shown in Figure 9, the dashboard’s frontpage highlights our main findings, including cumulative numbers of censored domains, forged IP addresses, and blocked domains’ categories, and notable censored domains detected recently.

Figure 9: GFWatch dashboard summary page.

The dashboard is also designed to be interactive so that non-technical audiences can quickly gain insights into the collected data. More specifically, the censored domains tab can be used to search for a censored domain detected by GFWatch. For instance, vox.com was detected to be blocked on October 15, 2021, a day after its coverage of Hong Kong University’s order to remove the Pillar of Shame.

Figure 10: Interactive search for censored domains detected by GFWatch.

In addition, poisoned IP addresses used by the GFW can also be searched from the Fake IP addresses tab as shown in Figure 11.

Figure 11: Interactive search for faked IP addresses used by the GFW.

Conclusion

In this project, we developed GFWatch, a large-scale longitudinal measurement platform, to provide a constantly updated view of the GFW’s DNS-based blocking behavior and its impact on the global Internet. Over a nine-month period in 2020, GFWatch tested 534M domains and discovered 311K censored domains, of which 41K are innocuous domains being overblocked. We show evidence that the GFW’s DNS censorship has a widespread negative impact on the global Internet, especially the domain name ecosystem and services that rely on DNS. GFWatch has detected more than 77K censored domains whose poisoned resource records have polluted many popular public DNS resolvers, including Google and Cloudflare. Based on insights gained from the data collected by GFWatch, we proposed strategies to effectively detect poisoned responses and evade the GFW’s DNS censorship. As GFWatch continues to operate, our data will not only cast new light on technical observations, but also keep the public informed in a timely manner about changes in the GFW’s blocking policy and assist other detection and circumvention efforts.

Acknowledgments

We are grateful to Ronald J. Deibert, Adam Senft, Lotus Ruan, Irene Poetranto, Hyungjoon Koo, Shachee Mishra, Tapti Palit, Seyedhamed Ghavamnia, Jarin Firose Moon, Md Mehedi Hasan, Thai Le, Eric Wustrow, Martin A. Brown, Siddharth Varadarajan, Ananth Krishnan, Peter Guest, and many others who preferred to remain anonymous for helpful discussions and suggestions.

We would like to thank all the anonymous reviewers for their thorough feedback on our USENIX Security 2021 research paper. We especially thank the team at GreatFire.org for helping to share our findings with related entities in a timely fashion.

This research was supported by the Open Technology Fund under an Information Controls Fellowship.

Availability

For more technical details regarding our study, the full paper and our presentation at the 30th USENIX Security Symposium are publicly available on the USENIX website. The dataset presented in our paper can be obtained from this Google Drive folder, and more up-to-date data can be found via the GFWatch Dashboard at https://gfwatch.org.

Media Coverage

“Exhaustive study puts China’s infamous Great Firewall under the microscope”, by John Leyden, The Daily Swig

“How the Great Firewall reflects Beijing’s policy”, by Katrina Northrop, The Wire China

“China’s Great Firewall is blocking around 311k domains, 41k by accident”, by Catalin Cimpanu, The Record by Recorded Future. This report has also been translated into numerous other languages.

“China suddenly blocked an Indonesian newspaper. No one knows why”, by Peter Guest, Rest of World