CommunityData:ORES

What ORES is and does

ORES is a tool that uses machine learning to make predictions about the quality of Wikipedia edits and articles. It's useful for people making tools to improve maintenance processes on Wikipedia, but is also useful for doing research about the community -- the dynamics of how work gets done, and who does it.

The main site is here: https://ores.wikimedia.org/

Searching Google Scholar for 'ORES' will turn up papers on how ORES is used for research. This one is great: https://dl.acm.org/citation.cfm?id=3125475

Using ORES is not too hard -- see documentation here: https://www.mediawiki.org/wiki/ORES -- scroll to the bottom of the docs page for usage examples. You can see an ORES score just by hitting a custom URL, no coding required.
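
As a rough illustration of the custom-URL approach, the Python sketch below builds a scoring URL and fetches it. The /v3/scores/ path and the models and revids query parameters are assumptions that should be checked against the documentation linked above, and build_score_url and fetch_scores are names invented for this example.

```python
import json
import urllib.parse
import urllib.request

ORES_HOST = "https://ores.wikimedia.org"  # host named in the text above


def build_score_url(wiki, model, rev_ids, host=ORES_HOST):
    """Build a scoring URL. The /v3/scores/ layout is an assumption."""
    query = urllib.parse.urlencode({
        "models": model,
        # several revision IDs are joined with "|" (percent-encoded as %7C)
        "revids": "|".join(str(r) for r in rev_ids),
    })
    return f"{host}/v3/scores/{wiki}/?{query}"


def fetch_scores(wiki, model, rev_ids):
    """Fetch the URL and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(build_score_url(wiki, model, rev_ids)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Print the URL only; paste it into a browser to see the score.
    print(build_score_url("enwiki", "damaging", [456789, 3242342]))
```

Keeping the URL construction separate from the network call makes it easy to inspect the URL in a browser before scripting against it.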

ORES is available across multiple languages. You can see the range of linguistic support here: https://tools.wmflabs.org/ores-support-checklist/

Interpretive note: the first column uses abbreviations by language -- enwiki is English Wikipedia, dewiki is German Wikipedia. ORES can analyze both individual revisions and whole articles. For revisions, "Basic" support means you can find out whether ORES predicts that an edit will ultimately be reverted. Where "Advanced" support is available, it replaces "Basic": instead of a revert prediction, you can see whether an edit is predicted to have been made in "good faith" according to Wikipedia's definition, and whether it is "damaging". For the article measures wp10 and draftquality, ORES considers the page overall. The wp10 model uses the "WP 1.0" quality levels schema (Stub, Start, C, B, GA, FA), where GA is Good Article and FA is Featured Article.
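
For research use it can help to treat the wp10 levels as an ordered scale. The sketch below is one illustrative way to do that in Python, not something ORES itself provides: it ranks the six levels in the order listed above and computes a probability-weighted mean rank. The shape of the probabilities dictionary is a hypothetical stand-in for whatever per-class probabilities you have on hand.

```python
# Ordered from lowest to highest quality, as listed in the text above.
WP10_LEVELS = ["Stub", "Start", "C", "B", "GA", "FA"]


def wp10_rank(label):
    """Return the position of a quality level on the 0 (Stub) to 5 (FA) scale."""
    return WP10_LEVELS.index(label)


def expected_quality(probabilities):
    """Probability-weighted mean rank for a dict of {level: probability}.

    This is an illustrative summary statistic, not an ORES output.
    """
    return sum(wp10_rank(label) * p for label, p in probabilities.items())


if __name__ == "__main__":
    # Hypothetical probabilities for a single article.
    print(expected_quality({"Stub": 0.5, "Start": 0.5}))
```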

If you are more interested in articles than in individual revisions, see the dataset release here: https://figshare.com/articles/Monthly_Wikipedia_article_quality_predictions/3859800 -- it gives a monthly view into predicted quality per article.


Using ORES

I found it challenging to install ORES without root access because of the reliance on C libraries and system dictionaries, but it was trivial to do on a machine where I had root, so I installed it on Nada with pip.

On Nada, I use some scripts I wrote, which use the locally-installed ORES engine to make queries against the ORES environment run by the analytics team (i.e. you are not just hitting Nada when you run code on Nada -- the code is making calls to the foundation's servers). A more detailed code walkthrough of my damaging-edits script is below, but the basic situation is:

       * I have a list of revision IDs in a tab-delimited format, and I want a prediction of whether each revision is damaging.
       * ORES is expecting a command line invocation, but I have a dataset
       * ORES manages its own connection niceties, but I have to let it do so
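
The gap between "I have a dataset" and "ORES wants a command line invocation" comes down to two small steps: split the revision IDs into groups, and turn each group into one JSON object per line. A minimal Python sketch of those two steps follows; chunked and to_json_lines are hypothetical helper names, and the group size of 100 is taken from the script below.

```python
import json


def chunked(items, size=100):
    """Yield successive groups of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def to_json_lines(rev_ids):
    """Render revision IDs as one {"rev_id": N} JSON object per line."""
    return "\n".join(json.dumps({"rev_id": int(r)}) for r in rev_ids)


if __name__ == "__main__":
    rev_ids = ["456789", "3242342", "618882377"]
    for group in chunked(rev_ids, size=2):
        print(to_json_lines(group))
        print("--")
```

Using json.dumps rather than string concatenation avoids hand-escaping quotes, and converting each ID with int() fails early if the input file contains something that is not a revision ID.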

Code Example

#!/usr/bin/env python3

##########################################################################
## This script runs the ORES scorer against revision ids by assembling many examples of the following shell command:
## echo -e '{"rev_id": 456789}\n{"rev_id": 3242342}\n{"rev_id": 618882377}' | ores score_revisions https://ores.wikimedia.org enwiki wp10 > thatfile.txt
##
## Inspired by the documentation located here: https://www.mediawiki.org/wiki/ORES
##
## Assumptions:
##
## This script assumes a tab-delimited file with a header, and that one element of that header is 'revid' -- a valid Wikipedia revision id
##
##########################################################################
## Warnings:
##
## (A) You will need to edit the command line to reflect the wiki whose revisions you want to score -- enwiki, frwiki, etc.
##     See the comment marked (A) in the code.
##
## (B) The code is designed to allow ORES to load-balance your queries on your behalf. A group of 100 revids will likely result in two
##     parallel threads of 50 revids each, which is the current recommended load. Don't change the way you throttle load without guidance
##     from the development team.
## 
##########################################################################
## Components:
## (0) Modal Configs and Process Args
## (1) Read in Revision IDs
## (2) Assemble shell command and run repeatedly on groups of IDs.
##

## (0) Modal Configs and Process Args

#DEBUG=1
DEBUG=0

import argparse
import os
import csv


theList = []

parser = argparse.ArgumentParser(description='Generates a kajillion shell commands and runs them.')
parser.add_argument('-i', help="Infile containing revision IDs to look up.", required=True)
args = parser.parse_args()

## (1) Read in Revision IDs
givenInfile = args.i

with open(givenInfile, 'r') as infileHandle:
	theInfile = csv.DictReader(infileHandle, delimiter="\t", quotechar='"')
	for currentLine in theInfile:
		theList.append(currentLine["revid"])    # makes a list of all the revids in the file

## (2) Assemble shell command and run repeatedly on groups of IDs.
chunkSize = 100 # see note (B); it's not recommended to change this chunk size without guidance
for i in range(0, len(theList), chunkSize):             # iterates over theList in 100-revid chunks
	chunk = theList[i:i+chunkSize]
	if DEBUG:                                       # change the modal config to DEBUG=1 if you want to see these messages, leave it 0 if you don't
		print(chunk)
	uglyString = ""                                 # ORES is expecting one JSON object per line; we fake it here in a string I call uglyString
	for revid in chunk:
		uglyString = uglyString + "{\"rev_id\": " + revid
		uglyString = uglyString + "}\\n"        # a literal backslash-n, which echo turns into a line break
	if DEBUG:
		print(uglyString[-2])
	if uglyString[-2] == "\\": # we don't need the trailing linebreak
		uglyString = uglyString[:-2]
	if DEBUG:
		print(uglyString)
	# see note (A); this is where you can change the language
	# note: os.system runs the command under /bin/sh, and this relies on that shell's echo expanding \n (as dash does)
	#theCommand = '''echo '%s' | ores score_revisions https://ores.wikimedia.org enwiki damaging >> predictDamaging.txt''' % uglyString
	#theCommand = '''echo '%s' | ores score_revisions https://ores.wikimedia.org ruwiki damaging >> predictDamaging.txt''' % uglyString
	theCommand = '''echo '%s' | ores score_revisions https://ores.wikimedia.org frwiki damaging >> predictDamaging.txt''' % uglyString
	if DEBUG:
		print(theCommand)
	os.system(theCommand)
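
The script above hands each chunk to the ores command through echo and a shell pipe. One possible alternative, sketched below, passes the same text on standard input with subprocess.run, which sidesteps shell quoting and any dependence on how echo treats \n. This is a variation offered under assumptions, not the script's own method: run_scorer is a hypothetical helper, and ORES_COMMAND simply mirrors the argument order of the shell command above.

```python
import subprocess

# Mirrors the shell command in the script above; change frwiki to score another wiki.
ORES_COMMAND = ["ores", "score_revisions", "https://ores.wikimedia.org", "frwiki", "damaging"]


def run_scorer(payload, command, outfile):
    """Send `payload` to `command` on stdin and append its stdout to `outfile`."""
    with open(outfile, "a") as out:
        # check=True raises CalledProcessError if the command exits non-zero
        subprocess.run(command, input=payload, stdout=out, text=True, check=True)


# Example call (requires the ores command to be installed):
# run_scorer('{"rev_id": 456789}\n{"rev_id": 3242342}\n', ORES_COMMAND, "predictDamaging.txt")
```

Because the command is passed as a list, no shell is involved, so revision IDs containing unexpected characters cannot alter the command line.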