- How big is your index?
- Why does FAROO display only the top 100 results?
- Are there other P2P web search engines?
- How is FAROO different from other P2P web search engines?
- How about other "private" search engines?
- Why is FAROO's privacy protection much stronger than just not collecting log files?
- Does FAROO's anonymization and encryption endorse fraudulent use?
- Is FAROO's Attention Based Ranking vulnerable to manipulation?
- FAROO indexes the webpages I visit in my browser. Does this hurt my privacy?
- How can I prevent FAROO from indexing my website?
- Why is the crawler so slow?
- How much load does FAROO put on my computer?
- Why don't you publish your product as open source?
- How do democratizing search and making money go together?
- FAROO - What is in the name?
- See also our answers on Quora
Q: How big is your index?
The index sizes of search engines are shrouded in secrecy and confusion.
Google says they know 1000 billion URLs. Here we have to distinguish between "known" and "indexed". This is probably the number of indexed pages + the number of filtered pages + the number of URLs in the crawler queue (crawl frontier). Scientific research estimates the real index size of Google to be 40 billion pages.
Other search engines count crawled pages (which are not all indexed yet), or count identical links multiple times (in redundant copies, distributed crawlers, inverted index lists).
FAROO currently has 2 billion pages indexed.
Our goal is not the biggest index, hoarding all the spam and irrelevant pages. Our goal is to return the most relevant pages for every query from the most compact index possible. The more carefully you select at crawl and index time, the less you have to index and store, and the faster you are at search time.
We are crawling highly relevant pages first (focused crawling):
The web is huge, but most of it is spam and irrelevant content. Traditional search engines filter relevant content by ranking it at search time. We moved this step forward to crawl and index time. So although we index fewer pages, they are more relevant. This allows us to index more relevant results in a shorter time. This should be considered when comparing 2 billion pages (FAROO) to 40 billion pages (Google).
Q: Why does FAROO display only the top 100 results?
A: To keep the index compressed, only the top 100 results are displayed for each query.
Barely anyone looks beyond the top 100 results. Instead, the query is refined, and for the refined query FAROO again guarantees the top 100 results. In this way no relevant information gets lost despite heavy index compression.
Query example: apple iphone
- Top 100 results for apple
- Top 100 results for iphone
- Top 100 results for apple iphone
- Top 100 results for apple iphone event
We will display an estimated absolute result number in the future.
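The per-query cap can be illustrated with a minimal sketch. It assumes a hypothetical in-memory index keyed by the full query string; `TOP_K`, `add_result`, and `search` are illustrative names, not FAROO's actual data structures.

```python
import heapq
from collections import defaultdict

TOP_K = 100  # the result cap described above

# Hypothetical index: query string -> min-heap of (score, url), at most TOP_K entries.
index = defaultdict(list)

def add_result(query, url, score):
    """Keep only the TOP_K best-scoring URLs for each query key."""
    heap = index[query]
    if len(heap) < TOP_K:
        heapq.heappush(heap, (score, url))
    elif score > heap[0][0]:
        # The new result beats the weakest stored one, so it takes its place.
        heapq.heapreplace(heap, (score, url))

def search(query):
    """Return the stored URLs for a query, best score first."""
    return [url for _score, url in sorted(index.get(query, []), reverse=True)]
```

In this sketch a refined query such as "apple iphone" is simply a separate key with its own top 100 list, so refining the query never depends on results that were cut from the "apple" list.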
Q: Are there other P2P web search engines?
A: This is a question of definition, which a simple litmus test helps you to answer:
Does it really use P2P and DHT in practice?
- In a DHT (Distributed Hash Table) every peer (or group of peers for redundancy) is responsible for a different part of the index. The responses for different queries should come from different peers in a consistent way.
- If the same peers (IP addresses) are always queried regardless of the query term, then simple meta search or load balancing from a server farm is implemented instead of P2P.
Meta search of several independent search instances is NOT P2P search; it simply does not scale. While you can meta search 40 peers, you cannot meta search 1 million.
- Also, responses coming only from a small number of similar IP addresses indicate a traditional server farm rather than a P2P network.
- Do a cross-check: Is there incoming traffic to your p2p client, indicating that something gets indexed to your peer or that your peer is answering queries?
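The first point of the litmus test can be sketched as follows. This is a generic illustration of DHT-style key assignment, assuming SHA-1 IDs and XOR distance (as in Kademlia-like networks); it is not FAROO's protocol, and the peer addresses are made up.

```python
import hashlib

def key_id(text):
    """Map a string to a 160-bit integer ID via SHA-1."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")

def responsible_peers(term, peers, replicas=3):
    """Return the `replicas` peers whose IDs are closest (by XOR distance) to the term's ID."""
    target = key_id(term)
    return sorted(peers, key=lambda p: key_id(p) ^ target)[:replicas]

# Made-up peer addresses, for illustration only.
peers = [f"10.0.{i // 256}.{i % 256}:6000" for i in range(1000)]
```

Under these assumptions, repeating a query for the same term always reaches the same small group of peers, while different terms are spread across the network. A system that answers every term from the same few addresses does not behave this way.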
Does it have significantly more than 40 peers (active, concurrently online)?
If not, then all peers are contained in the bucket list. This means we have a simple one-hop network, similar to accessing ordinary servers.
The recursive or iterative multi-hop protocol, which is required for a web-scale P2P search engine, does not come into play for such a small number of peers. There is then no proof that the program would really scale to a large number of peers while maintaining short response times.
Does it have a sufficient number of peers?
We estimate that at least 1 million peers are required to index the whole web, to provide redundancy against churn, and to serve many users.
Is it built from ordinary PCs or from static 24/7 servers?
Using servers is more convenient, but is not feasible for a web scale p2p search. Getting enough PC based peers is a hard task, but finding enough volunteers with dedicated servers is impossible.
Dealing with churn (the sudden arrival and departure of peers) is one of the most challenging and most essential parts of a truly scalable P2P system. If mainly dedicated 24/7 servers are used, there is no proof that the program would really be able to work under churn.
Does it have an effective, zero configuration NAT traversal?
Today almost all PCs are connected to the internet via routers using NAT (Network Address Translation), which by default prevents incoming connections.
NAT traversal is essential for a peer to become an active part of the search engine (to host a part of the index). NAT traversal that requires user assistance does not scale, as most users are not willing or able to deal with it.
After installing your client, is there incoming traffic indicating that something gets indexed to your peer or that your peer is answering queries? Check for yourself using Microsoft Network Monitor, Wireshark or Fiddler (HTTP only).
Q: How is FAROO different from other P2P web search engines?
A: Speed, Scalability, Efficiency, Ranking and Simplicity.
FAROO is the fastest P2P Search Engine
- With a mean response time below one second, FAROO is the fastest fully distributed P2P search engine of this size. This speed is achieved even for queries with multiple keywords, without sacrificing completeness. That matters because only 15% of searches use a single keyword. The search response time and traffic are independent of the number of query terms. FAROO's index structure eliminates the need to intersect long posting lists for Boolean queries. Complete results are nevertheless guaranteed, even for a huge index.
- Distributed Crawling, Distributed Index, Distributed search, Distributed Bootstrap, Distributed Update. FAROO is the only fully distributed P2P Search engine without any centralized component.
- With 2 million peers, FAROO is the largest P2P web search engine worldwide.
- Most other peer to peer search engines store all results for a specific keyword at a single peer.
This architecture does not scale. One billion results for a frequent term do not fit on a single peer.
- And a search with multiple search terms is infeasible due to the amount of data to be transferred. To guarantee complete results for two search terms, each with 1 billion results stored at a separate peer, at least 10 GByte would have to be transferred; some naive implementations require twice that. Even compressed by a factor of 10, transferring 1 GByte for a single search is still infeasible.
- So those search engines are either really slow or return very incomplete results despite having a huge index.
- Index Efficiency = mean number of results / number of indexed pages.
- All crawled Pages are almost instantly available for search.
- There is no search horizon nor truncated result lists.
- Every page is counted uniquely across all peers, without repeatedly counting pages which are crawled by several peers, pages which are not yet distributed, redundantly stored pages or pages indexed across several peers.
- FAROO's attention based ranking leads to a more democratic, user-centric ranking while remaining resistant to rank manipulation. For the first time, the ranking of web pages is done automatically by the target audience itself.
- FAROO offers easy installation with zero configuration, a clean user interface, and seamless browser and OS integration. Native clients are available for iPad, iPhone and Windows, as well as installation-free browser-based access.
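The transfer estimate given earlier in this answer for two-term queries can be reproduced with a few lines of arithmetic. The figure of 10 bytes per posting is an assumption chosen to match the 10 GByte stated above; the source does not give the posting size.

```python
postings_per_term = 1_000_000_000  # from the text: 1 billion results per term
bytes_per_posting = 10             # ASSUMPTION: not stated in the source
compression_factor = 10            # from the text

# Shipping one complete posting list to the peer holding the other list.
one_list_bytes = postings_per_term * bytes_per_posting
# A naive implementation that ships both lists to a third party.
naive_bytes = 2 * one_list_bytes
# The same single-list transfer after compression.
compressed_bytes = one_list_bytes // compression_factor

def gbyte(n):
    """Convert bytes to GByte (decimal, 10**9 bytes)."""
    return n / 1e9

print(gbyte(one_list_bytes), gbyte(naive_bytes), gbyte(compressed_bytes))
```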
Q: How about other "private" search engines?
There are many words for the same desire: safe search, secure search, private search, unmonitored search, uncensored search, anonymous search.
HTTPS doesn’t help if the user data are accessed on the back end of the incumbents via PRISM.
It seems that US-based search engines are legally obligated to disclose their user data to the authorities and to stay mum about it. While non-US citizens are completely unprotected, US citizens are also affected by mass surveillance.
US companies are also forced to hand over HTTPS master keys to the authorities. If no PFS (Perfect Forward Secrecy) is used, this allows all HTTPS-encrypted communication of the past and the future to be decrypted, e.g. from the traffic collected at your ISP.
Proxy-like search engines
Proxy-like search engines and Metasearch engines don’t help either.
Some search engines promise privacy, and while they look like real search engines, they are just proxies. Their results don't come from their own index, but from the big incumbents (Google, Bing, Yahoo) instead (the query is forwarded to the incumbent, and the results from the incumbent are relayed back to the user).
Not collecting log files (of your IP address and query) and using HTTPS encryption at the proxy search engine doesn't help if the search is forwarded to the incumbent. As revealed by Edward Snowden, the NSA has access to the US-based incumbents via PRISM. If the search is routed over a proxy (aka "search engine"), the IP address logged at the incumbent is that of the proxy and not that of the user. So the incumbent doesn't have the user's IP address, and the search engine proxy promises not to log or reveal the user's IP, while HTTPS prevents eavesdropping on the way from the user to the search engine proxy.
Sounds good? By observing the traffic between the user and the search engine proxy (IP, time and size are not protected by HTTPS) via PRISM, Tempora (GCHQ taps the world's communications) et al., and combining that with the traffic between the search engine proxy and the incumbent (query, time and size are accessible via PRISM), all that seemingly private and protected information can be revealed. This is a common method known as traffic analysis.
The NSA system XKeyscore allows search engine keywords and other communication to be recovered just by observing connection data (metadata) and combining it with the back-end data sourced from the incumbents. The system is also used by the German intelligence services BND and BfV. Neither encryption with HTTPS, nor the use of proxies, nor restricting the observation to metadata protects your search queries or other communication content.
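The correlation step behind traffic analysis can be shown with a toy example. This is a sketch under simplifying assumptions (matching on timing alone, a fixed maximum delay); all class names, addresses, queries and timestamps are made up for illustration and do not describe any real system.

```python
from dataclasses import dataclass

@dataclass
class UserSideEvent:
    """Seen between user and proxy: HTTPS hides the query, not the IP or the time."""
    user_ip: str
    time: float

@dataclass
class BackendEvent:
    """Seen at the incumbent: the query is visible, but the IP is the proxy's."""
    query: str
    time: float

def correlate(user_events, backend_events, max_delay=0.5):
    """Pair each backend query with the latest user request within max_delay seconds before it."""
    matches = []
    for b in backend_events:
        candidates = [u for u in user_events if 0 <= b.time - u.time <= max_delay]
        if candidates:
            closest = max(candidates, key=lambda u: u.time)
            matches.append((closest.user_ip, b.query))
    return matches

user_side = [UserSideEvent("203.0.113.5", 10.00), UserSideEvent("198.51.100.7", 12.00)]
backend = [BackendEvent("query a", 10.05), BackendEvent("query b", 12.08)]
print(correlate(user_side, backend))
```

Neither observation alone links a user to a query; joining the two on time (and, in practice, on message size as well) is what does.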
Onion routing and Privacy proxies
For the same reason, onion routing such as Tor and privacy proxies are still vulnerable to surveillance.
Q: Why is FAROO's privacy protection much stronger than just not collecting log files?
A: Some search engines do not log search queries; others delete or anonymize them after a certain time or at the user's request.
The security gain from these measures is largely illusory if the authorities have real-time access to the back end via PRISM or if they can decrypt HTTPS.
Through the following measures FAROO can provide superior privacy protection:
No search log
- This is by architecture, not only by policy.
- As FAROO has a completely distributed architecture, no central instance for monitoring exists. Not collecting search logs is therefore not just a promise; collecting them is technically infeasible.
- Logging by the ISP (connection data retention law) or a system administrator does not hurt your privacy, as all queries are encrypted.
- Your search queries are immune to blocking or filtering by the ISP or a system administrator, as all queries are encrypted.
Q: Does FAROO's anonymization and encryption endorse fraudulent use?
A: No. Responsibility lies solely with the person who publishes or consumes the content. FAROO is not responsible for possible misuse of its technology, just as the developer of a web server, a browser, the HTTPS protocol or AES encryption, or the manufacturer of a monitor, a hard disk or a memory chip, is not responsible for illegal content being stored, transferred or displayed by these systems.
Every technology can be used for good or for ill. Legal views likewise vary between countries. Freedom of speech and privacy protection are guaranteed in many constitutions; in other countries they are punishable; in some they are guaranteed and prosecuted at the same time.
FAROO does not publish content to the Internet, make it available, or provide anonymous access to it. FAROO solely helps users assist each other in locating information that already exists on the Internet, while maintaining privacy.
Q: Is FAROO's Attention Based Ranking vulnerable to manipulation?
A: FAROO's attention based ranking is not so different from Google's PageRank. While in Google webmasters vote by linking to web pages, in FAROO users vote by visiting web pages.
The kinds of ranking attacks and countermeasures are therefore similar as well. There are many statistical measures by which a cheating peer can be identified. But of course, as with every anti-spam and anti-virus solution, it is a continually ongoing fight. FAROO can therefore instantly change the ranking algorithm and/or encryption through its auto-update feature once it becomes compromised.
Q: FAROO indexes the webpages I visit in my browser. Does this hurt my privacy?
A: No. FAROO indexes only pages that are located on the Internet, not Intranet pages or HTTPS-protected pages. No personal data leaves the user's computer through FAROO.
But it is important to be aware that there is no privacy while visiting internet pages. The ISP (according to the connection data retention law), many intermediate stations on the Internet, and the visited site itself know about your visit.
In contrast, FAROO has no central institution that would be able to collect the click streams. The (public anyway) web pages are hashed and encrypted on the user's computer and then stored in the distributed index. The index contains only encrypted information. It does not contain any information about who stored the information in this index.
Q: How can I prevent FAROO from indexing my website?
A: Because FAROO does not require a dedicated crawler, it does not access files on your web server, so there is no additional load for you. In any case, as FAROO is a well-behaved search engine, it respects the Robots Meta Tag according to www.robotstxt.org.
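As an illustration, here is a minimal sketch of how a client could honor the Robots Meta Tag before indexing a page, using Python's standard `html.parser`. The class and function names are hypothetical; FAROO's actual implementation is not shown in this FAQ.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            content = attributes.get("content", "")
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )

def may_index(html):
    """Return False if the page asks robots not to index it ("noindex" or "none")."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return not ({"noindex", "none"} & parser.directives)
```

With this convention, a page whose head contains `<meta name="robots" content="noindex">` would be skipped.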
Q: Why is the crawler so slow?
A: FAROO's collaborative crawler swarm is very fast. The number of crawled pages per client is relatively low for two reasons:
- It is designed as a low-impact crawler for a smooth user experience, without using too much bandwidth, IO or processor load from a single user. Its power comes instead from its massive scalability.
- Unlike in centralized systems, the crawler speed is determined by the distribution of the data to the decentralized index, not by fetching and parsing the web pages. FAROO crawls only as many pages as it can immediately distribute, with n-fold redundancy, to the index. This prevents the distribution of outdated content and the waste of bandwidth and hard disk space (for crawled but never distributed pages).
Q: How much load does FAROO put on my computer?
A: FAROO is designed not to affect the performance of the computer.
It is active only when no user activity and no processor load are detected.
The amount of hard disk storage to be donated can be specified in the options. If hard disk space becomes scarce, FAROO automatically releases the space it has used.
In this way the full capacity of the computer is available to the user as soon as they need it.
Q: Why don't you publish your product as open source?
A: The source code is not public for two reasons:
First: Open source is perfect when competing on cost with a commercial product on the same technological level (Linux, OpenOffice). But it is not a good idea to hand over your technological advantage to a monopoly when competing with its free service, which has enormous brand power.
Second: We, like others, don't believe in big development teams if you are aiming for radical change.
Nevertheless, we support the open source idea whenever possible: for our 1000x faster spelling correction we both published the algorithm and released the C# code as open source.
Q: How do democratizing search and making money go together?
Well, we think everybody needs to make some money for a living, as long as there is no open source housing, food, clothing and transportation ;-)
Isn't it better to make a living from an idea you believe in, than to waste your time in a boring job and dream in your spare time only?
Q: FAROO - What is in the name?
Pharos, the Lighthouse of Alexandria, one of the Seven Wonders of the Ancient World, was built in 300 BC on the island of Pharos. The tower was later called 'Pharos' after its location. The name was adopted in many languages as the term for lighthouse: lat. 'pharus', ital. and span. 'faro', fr. 'phare' and port. 'farol'.
FAROO helps you to navigate and find your destination in the endless sea of information.