Inconsistent labelling in regional products. #148

Closed
antaldaniel opened this issue Sep 18, 2019 · 24 comments

@antaldaniel (Contributor)

I think that Eurostat changed its regional statistical products to NUTS2016 extremely carelessly, and my supposedly reproducible research code has not worked since last year, when the same products with the same IDs were still reported under the NUTS2013 region boundary definitions.

It took me quite some time to understand the full depth of the problem, and I wrote some functions to correct it, but they may be very error-prone. Nevertheless, I think this problem is worth at least a vignette, because working with the regional data is extremely difficult right now.

The following main problems persist, but there may be more; in fact, each regional statistical product is a case of its own.

Inconsistent use of NUTS2013 – NUTS2016 labels

  • In some cases this is simple mislabelling, and the data can be relabelled.
  • In other cases, boundaries changed, and Eurostat reports mixed NUTS2016 and NUTS2013 data under the wrong metadata. Even though the metadata says the data is NUTS2016, in some years it is in fact NUTS2013, which is only partly compatible. Whenever a correspondence table exists, this can be fixed.

Inconsistent use of NUTS levels

  • In some cases the data is available only at NUTS2 level, and in other cases only at NUTS1 level. I think this is the case with data that originates from the Eurobarometer, where Germany and the UK do not have a large enough sample to break the results down to NUTS2 level, while Estonia, Malta, Luxembourg and Cyprus would have enough data to report at NUTS3 level. This may also be mislabelled. It is the most problematic case, and I will return to it later. It is very confusing, because the product name and the metadata call this a NUTS2 product, but if you try to join it with NUTS2-level population, land area or GDP, you lose Germany, the UK, Poland and, in some cases, other areas too.
  • In some cases the product name and description refer to NUTS2 level, but the data contains all levels. This can be fixed with filtering.
  • Occasionally, when NUTS0 = NUTS1 = NUTS2, i.e. in the aforementioned small countries with only NUTS3-level subdivisions, the labels are not repeated at all levels, and although the data is there with NUTS0 codes, a NUTS1 or NUTS2 filter shows it as missing. I think this can sometimes be fixed with a 'fake' imputation, i.e. looking up the same statistic in the corresponding national product and 'imputing' the regions where the small country is itself a region, as in the sketch below.
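A minimal sketch of that 'fake' imputation, assuming a get_eurostat() result with a character 'geo' column; the derived NUTS2 codes (e.g. "EE00") follow the published NUTS2016 codelist:

library(dplyr)

impute_single_region_countries <- function(dat) {
  nuts0 <- dat %>%
    filter(geo %in% c("EE", "MT", "LU", "CY")) %>%  # countries where NUTS0 = NUTS2
    mutate(geo = paste0(geo, "00"))                 # e.g. "EE" becomes the NUTS2 code "EE00"
  bind_rows(dat, nuts0)
}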

The mixed NUTS2-NUTS1 case is the most annoying, although the choice of data presentation makes sense, and it would make even more sense if the small countries were reported at NUTS3 level. One logical solution is obviously to aggregate everything up to NUTS1 level, as in the sketch below, but that is a very inefficient use of the data. And when used together with other data, the mixed levels make joins very complicated.
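A minimal sketch of that aggregation; it assumes a 'values' column holding an additive quantity (a count or a total - rates and averages would need population weighting) and the 'time' column that get_eurostat() returns:

library(dplyr)

aggregate_to_nuts1 <- function(dat) {
  dat %>%
    mutate(geo = stringr::str_sub(geo, 1, 3)) %>%  # the NUTS1 code is the first 3 characters
    group_by(geo, time) %>%
    summarise(values = sum(values, na.rm = TRUE)) %>%
    ungroup()
}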

I am not sure that all problems need a fix, and I am not confident that I have found all the problems. I can imagine, for example, that changes to COFOG or NACE or other classifications are not well reflected in the regional statistics, which can cause seemingly or actually missing cases.
I could write a vignette-candidate blog post with examples of all the problems and my solutions to them.

I think the least problematic solution would be to include the NUTS2013 and NUTS2016 code definitions in the eurostat package and automatically add this information to each 'geo' variable, regardless of what Eurostat thinks about it. At least then the user could explicitly filter for NUTS1, NUTS2 or NUTS3 levels, which is currently only possible by filtering the codes directly. That is not hard, because NUTS2 codes always have nchar(code) == 4 and NUTS1 codes nchar(code) == 3 (see the sketch below), but these misleading 'geo' labels still make the work extremely difficult.
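A minimal sketch of that filter (country codes have two characters, so the NUTS level is nchar(code) - 2):

library(dplyr)

filter_nuts_level <- function(dat, level = 2) {
  dat %>%
    filter(nchar(as.character(geo)) == level + 2)
}

# for example, keep only genuine NUTS2 rows before joining:
# nuts2_dat <- filter_nuts_level(dat, level = 2)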

@antagomir (Member)

Great job. A well-thought-out blog post would be very welcome. It can be added to the rOpenGov blog or the eurostat R package website if you like.

The problem seems worth reporting to Eurostat. If they see that there is interest and clear inconsistencies (a blog post / vignette could greatly help here!), they might have an incentive to fix the problems.

@antaldaniel (Contributor, Author)

I started to work on the vignette candidate in a mini-repo: github.com/antaldaniel/eurostat_regional/. I will keep working on it heavily in the coming days, because I'd like to add mapping examples and many concrete data examples, so I did not want to work constantly in a branch here. Any comments are very welcome. If you want to comment or add examples, just make a pull request against the .Rmd file. Pandoc currently does not support certain YAML headers, so the html version has no table of contents.

@antagomir (Member)

Perfect!

@antaldaniel (Contributor, Author)

OK, I may add some nice maps, but otherwise I am finished. I have a solution that solves, I think, whatever can be solved; it is more than 400 lines of code and is included before the conclusions. If somebody knows whom to write to at Eurostat, or can send them the link, it would be much appreciated. Also, any comments on how to go further.

@antagomir (Member)

Seems excellent to me. Maps would be nice if you end up adding them.

I have no contact info other than what is available at the Eurostat data user support page:
https://ec.europa.eu/eurostat/help/support. I think this should be reported to them, and your detailed post will help greatly. Just let me know if I can help in any way.

How would you like to distribute this? It would be a welcome addition, for instance, on our package website under the "Articles" section and/or as a blog post on the rOpenGov website. We can see how to do that once the post is ready.

@antaldaniel (Contributor, Author)

I think that the post is ready, except for the map. Before proof-reading and adding a map, it would be nice if somebody could read it carefully and ask potential questions.

I did not add the map, because I do not know whether it is a good idea to add a live-evaluated download and mapping step to a blog post and a vignette. My hunch would be to add one example with eval=FALSE (as sketched below) that goes all the way from eurostat::get_eurostat() to the incorrect and the corrected map, since live evaluation is not a good idea in a CRAN vignette or on the website. If you know how to do this, and there are no further questions, I will add two chunks that show the problem and the solution on a real example.

So, let me know how to handle the example.
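For reference, a chunk like the following would display the code without evaluating it in either output format, assuming the standard knitr chunk syntax (the dataset id here is just a placeholder):

```{r nuts-map, eval=FALSE}
dat <- eurostat::get_eurostat("tgs00026", time_format = "num")
# harmonize the NUTS codes, then draw the incorrect and the corrected map
```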

@antagomir (Member)

Great.

I would propose waiting 1-3 days for proof-reading comments. This was fast, and others may jump in too, with some delay (@muuankarski?). I will also read and comment separately.

If we add the post to the package website, it is separate from the package and will not cause CRAN problems. If we add it to the rOpenGov website, it should be fine as well, as long as eval is set to FALSE.

If you write to Eurostat, it might help to also cite our formal publication in the R Journal (?)

@antagomir (Member)

I have read it through in its entirety, and it seems very feasible overall. It might be a good idea to add the extra columns for the 2013 and 2016 versions as you proposed (these need to be documented in the function's roxygen/manpage as well), plus the warning in get_eurostat. Pull requests are welcome; otherwise we can leave this issue open to wait for the next round of revisions.

@antaldaniel (Contributor, Author)

Thank you. Do you have a deadline for a new release? The warnings + extra columns would make sense if they included at least all the EU countries, but preferably all EEA/candidate countries that provide data for NUTS. I wrote to Eurostat, as suggested, with citations, etc.

What I hope to get from them is at least an authoritative list of the codes not included in their latest correspondence tables - so far I have detected that this affects Slovenia and Greece, and every single non-member state. The other issue bothering me is whether there is a further problem with the NUTS2010, 2006 and 2003 data. If the earlier transitions were well managed, then the 13/16 columns make sense, but if not, then there is a very serious problem here. The NUTS system started in 2003 and has already gone through five revisions, i.e. roughly one every third year; and because regional statistics always lag behind national ones, they are usually released on an n+2 or n+3 basis.

All in all, I really hope that they get back to me, because without understanding what they are doing, I am afraid that whatever warnings we add will just create a false sense of safety.

@antagomir (Member)

No deadline. I am planning to make the next CRAN release very soon, since a bug was just reported that has to be fixed asap. Other than that, new CRAN releases can be made roughly once a month if necessary.

Might be useful to see how Eurostat responds.

@antaldaniel (Contributor, Author)

I'll wait for their reply, because I found further problems with the data that we will not be able to patch here. I think we will only be able to recode, give a warning and link to a getting-started guide; there are so many problems that this has to be resolved by Eurostat.

@antagomir (Member)

check

@antaldaniel (Contributor, Author)

Hi, I am wondering what would be the best way to incorporate this into the package. I received not-very-helpful comments from Eurostat, and I want to tidy everything up now. I plan to write the article, add a simple function to the eurostat package that at least gives warnings, and put my correction functions in a tiny accompanying package, because nobody knows how long these problems will persist.

The following function returns either the problem regions in a data frame, or the input data frame with changed and unchanged region codes marked for further filtering.

It requires two input tables: the codes of the unchanged and of the changed NUTS regions. At first I thought I would simply list them in the function, but that is not practical, as we are talking about more than 1600 rows.

So these should either be downloaded from somewhere or added to the internal data of the eurostat package. In the latter case, whenever the list changes, the package data has to be updated. [Actually, this problem gave me the idea to wrap the metadata and some correction functions into a tiny package that could be released and maintained separately, like eurostat_regional.]

Another simple solution is to add a URL to the warning, perhaps pointing to the long article, which would contain references to all the data to be downloaded.

I think this very simple check_geo_nuts2013 should be included in the main eurostat package, with the warning message updated to whatever solution we go for. It only depends on the usual tidyverse functions, i.e. %>% and dplyr and stringr functions. If stringr is not yet a dependency, I can change stringr::str_sub to base R, but base R is, as usual compared with tidyverse functions, far less expressive.

dat: a data frame downloaded with get_eurostat()
unchanged_regions: a data frame that I created from the Eurostat correspondence table
changed_regions: another data frame created from the same correspondence table; both should be easy to deploy, possibly, but not necessarily, within the eurostat package itself

library(dplyr)   # %>%, mutate_if, left_join, select, rename, mutate, filter

check_geo_nuts2013 <- function(dat,
                               unchanged_regions,
                               changed_regions,
                               return_changed_regions = FALSE) {

  # Mark the regions whose code is unchanged between NUTS2013 and NUTS2016,
  # and flag the non-EU countries, which have no correspondence entries.
  tmp <- dat %>%
    mutate_if(is.factor, as.character) %>%
    left_join(unchanged_regions %>%
                select(code16) %>%
                rename(geo = code16) %>%
                mutate(change = 'unchanged'),
              by = 'geo') %>%
    mutate(change = ifelse(geo %in% c("CH", "TR", "NO",
                                      "MK", "IS", "LI",
                                      "AD", "SM", "VA",
                                      "XK", "RS", "ME",
                                      "BA", "AL", "JP",
                                      "US"),
                           'not_EU', change))

  # Drop rows with the artificial trailing 'ZZ' / 'XX' codes
  # (extra-regio and unknown regions).
  tmp <- tmp %>%
    filter(stringr::str_sub(geo, -3, -1) != "ZZZ",
           stringr::str_sub(geo, -2, -1) != "ZZ",
           stringr::str_sub(geo, -3, -1) != "XXX",
           stringr::str_sub(geo, -2, -1) != "XX")

  if (any(tmp$geo %in% changed_regions$code13)) {

    warning("Some of the data has obsolete NUTS2013 codes.")

    if (return_changed_regions) {
      message("Returning changed region codes:")
      return(changed_regions %>%
               filter(code13 %in% tmp$geo))
    }
  }

  tmp
}
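A hypothetical usage sketch; the dataset id and the two correspondence data frames are placeholders prepared as described above:

library(eurostat)

dat <- get_eurostat("tgs00026")   # any regional product will do
checked <- check_geo_nuts2013(dat, unchanged_regions, changed_regions)
table(checked$change, useNA = "ifany")   # 'unchanged', 'not_EU' and NA rows to review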

@antagomir (Member)

A PR is welcome, and stringr is already imported. I see no immediate reason why the package could not also host the smaller helper functions you mention. A simple solution that does not demand constant active updating would be preferred.

@antaldaniel (Contributor, Author)

I made the first batch of functions, with two supporting data frames to be imported (elements of the Eurostat correspondence table).

If these work well and can be imported, I will write the long-form article; however, I don't know where I should put it. Do you have a separate repo for articles?

There are two more helper functions, but I am not sure whether they should live only in the article or be internalized in the package. You will see them in the article, though.

@antaldaniel (Contributor, Author)

I made a new PR with the data files. devtools will not run the data-raw folder files by itself - the actual file that (re)creates the missing correspondence tables is there, but I made two copies of the file.

@antagomir (Member)

antagomir commented Jan 29, 2020

Thanks! I have now incorporated all PRs and run the checks. All seems to be OK but should be checked properly. Meanwhile I have also switched to API v2.1. Its structure is different; the code aims to keep this internal so that the outputs do not change, but the correctness of the output is not yet fully verified. If you notice anything suspicious, that would be good to know.

We have a place for "articles": these are Rmd files in the folder "vignettes/website/". The (automatically) rendered html versions of these articles appear on the R package homepage in the upper panel under "Articles" (not sure if we could/should highlight this better). Suggestions for new articles, or contributions to existing ones, are very welcome.

This issue is also related to #117 - need to check if we can close that as well.

@antagomir (Member)

antagomir commented Jan 29, 2020

Now all seems to go through without errors. There are still some issues in CRAN checks.

Rd files with duplicated alias 'regional_changes_2016':
‘eu_countries.Rd’ ‘regional_changes_2016.Rd’

Undocumented arguments in documentation object 'check_nuts2013'
‘return_changed_regions’

The following NOTEs can be circumvented by adding regional_changes_2016 <- NULL etc. as the first line within the function, if the variable is otherwise defined inside the function (see the sketch after the check output below).

check_nuts2013: no visible binding for global variable
‘regional_changes_2016’
check_nuts2013: no visible binding for global variable ‘change’
check_nuts2013: no visible binding for global variable ‘code16’
check_nuts2013: no visible binding for global variable ‘code13’
check_nuts2013: no visible binding for global variable ‘geo’
harmonize_geo_code: no visible global function definition for
‘check_dat_input’
harmonize_geo_code: no visible binding for global variable
‘regional_changes_2016’
harmonize_geo_code: no visible binding for global variable ‘change’
harmonize_geo_code: no visible binding for global variable ‘tmp’
harmonize_geo_code: no visible binding for global variable ‘geo’
harmonize_geo_code: no visible binding for global variable
‘nuts_correspondence’
harmonize_geo_code: no visible binding for global variable ‘nuts_level’
harmonize_geo_code: no visible binding for global variable ‘code13’
harmonize_geo_code: no visible binding for global variable ‘code16’
harmonize_geo_code: no visible binding for global variable ‘resolution’
harmonize_geo_code: no visible global function definition for
‘full_join’
harmonize_geo_code: no visible binding for global variable
‘remaining_eu_data’

This might be related to the above and solved by it:

Undefined global functions or variables:
change check_dat_input code13 code16 full_join geo
nuts_correspondence nuts_level regional_changes_2016
remaining_eu_data resolution tmp
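For the column names, a minimal sketch of that NULL-assignment workaround looks like this (declaring the names via utils::globalVariables() in a package-level .R file is a common alternative); note that the package data sets themselves still need to be genuinely available, which is discussed below:

check_nuts2013 <- function(dat, return_changed_regions = FALSE) {
  # dummy bindings for the non-standard-evaluation column names
  # that R CMD check cannot resolve inside dplyr verbs
  geo <- change <- code13 <- code16 <- NULL
  # ... the rest of the function body stays unchanged ...
}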

@antagomir (Member)

@antaldaniel, could you have a look at these and try to fix them? I can help where necessary.

@antagomir (Member)

The next deadline for the CRAN release is Feb 13. We need to update, since some tests now fail with the updates in the 2020 data.

@antaldaniel (Contributor, Author)

antaldaniel commented Feb 3, 2020

Hi @antagomir, I had a few issues running devtools::check() because an unrelated unit test seems to be failing, and there was an uncommitted file that referred outside the package. Right now I think all the documentation is fine except for one thing: the new data objects still raise a NOTE about a missing binding. However, if I load them directly with data(nuts_correspondence), for example, I get a different NOTE saying that it is not good practice to load data into the global environment.

One issue may be that the whole package has not been built and installed because of the failed unit test, so it does not recognize the new data objects nuts_correspondence and regional_changes_2016 as part of the package.

check_nuts2013: no visible binding for global variable
     'regional_changes_2016'
   harmonize_geo_code: no visible binding for global variable
     'regional_changes_2016'
   harmonize_geo_code: no visible binding for global variable
     'nuts_correspondence'
   Undefined global functions or variables:
     nuts_correspondence regional_changes_2016

These are of course not global variables but data in the package, and I think they are correctly documented now; the checks tick them off in the manuals.

@antagomir (Member)

OK, then how about providing them as function arguments, and also showing in the @examples section of the function's manpage how the data is loaded before calling the function? I think it is problematic if these are used within a function unless they are explicitly set by the user and then passed on to the function.

Internal data could also be loaded within functions imo but this is less transparent.

@antaldaniel (Contributor, Author)

@antagomir I am not sure. Would it be necessary to first load a constant into the global environment and then add it as a parameter to the function, like check_nuts2013(dat = mydat, regional_changes_2016)?

Instead, I apply this solution:
https://support.bioconductor.org/p/24756/

The necessary data is loaded into the new, local environment created by the function, instead of the global one; a minimal sketch follows below. The vignette and the documentation clearly explain that the user can load this information into her global environment with the call data(nuts_correspondence).
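A minimal sketch of that pattern, using the function and data set names from this thread:

harmonize_geo_code <- function(dat) {
  # load the packaged correspondence table into the function's own
  # environment instead of the user's global environment
  data("nuts_correspondence", package = "eurostat",
       envir = environment())
  # ... nuts_correspondence is now available for the joins below ...
}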

@antagomir (Member)

Ok, I think this is fine. You can consider mentioning in the @details section that the function takes advantage of precalculated data.
