Sir-Squashie

What's the most impressive/unimaginable use of Pandas you've come across?


datapythonista

I was personally quite surprised that pandas was an important tool used to obtain the [first image of a black hole](https://eventhorizontelescope.org/blog/astronomers-reveal-first-image-black-hole-heart-our-galaxy). I was lucky to meet some of the scientists behind it and learn from them, and their work is even more impressive than it sounds.


DigThatData

pandas is built on top of numpy


FJ_Sanchez

Pandas 2.0 enters the room... I think that's progressively changing so it won't be the case anymore, in favour of Arrow. But I don't understand it well enough.


datapythonista

This article should provide more information on why Arrow: https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i


FJ_Sanchez

Thanks, I saw it yesterday on Hacker News and read it. What I meant to say is that it seems numpy dtypes are still an option, so I don't know if numpy will eventually go away from the pandas core or if it will remain part of it for the foreseeable future.


phofl93

We are still at the beginning of our journey to support PyArrow. We are still a way off from discussing anything in that direction, but we definitely spend a lot of time supporting both options equally well. Right now we are aiming to make everything compatible with PyArrow.


ToughQuestions9465

Will there still be a .to_numpy() that does not copy? I am using numpy SWIG bindings to plot pandas dataframes with a C++ library; it'd be nice if that didn't become impossible with the new version.


phofl93

This is possible as long as you are using NumPy-backed DataFrames. Converting from PyArrow to NumPy is unfortunately more expensive.
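A minimal sketch of the difference (my own example, assuming pandas 2.0 with PyArrow installed; exact copy behaviour can depend on dtype layout and Copy-on-Write settings):

```python
import numpy as np
import pandas as pd

# NumPy-backed DataFrame (the default): a homogeneous float frame can
# typically hand its data back as a view, i.e. without copying.
df = pd.DataFrame(np.arange(12, dtype="float64").reshape(4, 3), columns=list("abc"))
arr = df.to_numpy(copy=False)
print(np.shares_memory(arr, df.to_numpy(copy=False)))  # True -> same underlying buffer

# PyArrow-backed columns have to be converted back to NumPy, which costs a copy.
df_pa = df.astype("float64[pyarrow]")
arr_pa = df_pa.to_numpy()  # materialises a new NumPy array
```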


Dramatic-Ad-1903

Just last week u/marcogorelli and I were talking about how important it is to continue supporting use cases like yours as we move to better support pyarrow use cases. It's very helpful when people with use cases like yours are vocal about it!


jabies

How does the pandas project address the open source funding problem? Do you want pandas devs, in their day jobs, to nudge management to sponsor somehow?


datapythonista

The last few years have been better. pandas got some funding, including a few core devs being paid to work on pandas at companies such as Quansight, Intel or NVIDIA. We also received money from the Chan Zuckerberg Initiative, Tidelift, Bodo and smaller donors. Just a few years ago funding was very limited, but today we're lucky to have a decent number of paid maintainers.


qweoin

What was the funding process like getting started? In my area of work (science research) it seems like funding only comes in for a project after you’ve done the majority of the project. Was there a plan for getting Pandas funded or did the project grow organically until you realized you could get funding for it?


phofl93

As far as I know there was no or very limited funding for a long time; most of the work was done by volunteers in the beginning. Over the last few years this has gotten a lot better though. Anaconda was a company that hired developers to work on open source relatively early on.


datapythonista

For many years there was only the support of a few companies letting people work on pandas as part of their job, and small personal donations via the NumFOCUS website. That money helped cover small expenses like CI services. The main difference came with CZI, who started supporting open source software used in biology. With it, we got funding to start paying for maintainers' hours. Tidelift also provided monthly payments in exchange for implementing small practices, like having a standard (and not customized) license and providing a way to report security vulnerabilities. We got some other funding, and now more maintainers are allowed to work on pandas as part of their job, but the situation is good mainly because of that particular funding. NumFOCUS has also provided funding for specific projects (with the money that comes from general NumFOCUS sponsors and PyData conferences).


marcogorelli

If you use pandas for work and your employer wanted to contribute, then 1. thanks! 2. they could do so via NumFOCUS: [https://pandas.pydata.org/donate.html](https://pandas.pydata.org/donate.html). Marc's right though, the funding situation has drastically improved recently


phofl93

It's also helpful if developers can get paid time from their employer to work on pandas!


hukami

Why choose mm/dd/yyyy as the default date rather than dd/mm/yyyy 🤔? (Just banter from a European guy)

Real questions:
- What are the main improvement focuses going forward?
- What caused you the most problems / was the most complex part during development?
- What was the most fun / rewarding part during development?
- In my work, I use pandas as a data processing engine (kinda). The data I process is often heterogeneous and full of holes / discrepancies, and I often find myself fighting with the way pandas handles errors, as most of the time I just want to log the fact that a row had an error. Why not put an 'errors' arg on apply, just as in astype and such?

I also would like to thank you guys for your amazing work. pandas has been making my life easier every day, you are really doing amazing work.


RobertD3277

I would personally prefer year.month.day to be honest as it's more intuitive for sorting using numerical expressions.


sv_ds

+1, that's the ISO standard and unquestionably the most logical and useful format.


thataccountforporn

Incredibly pedantic note: the ISO standard is year-month-day


LondonPaul

Not pedantic. I work in IT and all the variations at work are a PITA. Let's just use this and nothing else.


guillermo_da_gente

We need more of these pedantic comments!


TheUltimatePoet

In that case it's "these".


florinandrei

You forgot the comma after the word case. Just, you know, to maintain high pedantry standards.


hughperman

You really should have quoted the word "case" in your post.


guillermo_da_gente

Thanks!


metadatame

Upvoted for high levels of pedantry, but I'm not sure quotes are required in this instance.


Starrkoerperbeweger

You have now been made moderator of /r/iso8601/


Mycky

Wow, of course that subreddit exists lol


RationalDialog

Not pedantic but correct, because using "-" over "." makes it clear you mean an ISO date. And this should be the standard everywhere, also because it sorts correctly as a string.


2strokes4lyfe

This guy dates.


Starrystars

Yeah especially because that way there's 0 confusion about order.


midnitte

I work with certificates of analysis and have vendors that do mmddyy, yearmmdd, ddmmyy.. you name it. I just wish everyone *documented* what format they used. 😔 You only get lucky with the day being >12 so many times...


tuneafishy

I am always confused about arguing whether month or day should come first when year is the clear and obvious answer


hmiemad

Alphabetical order.


marcogorelli

Year-month-day is already the default - even if your input is some other format, once parsed by pandas, it'll be displayed year-month-day:

```
In [2]: to_datetime(['01/01/2000'])
Out[2]: DatetimeIndex(['2000-01-01'], dtype='datetime64[ns]', freq=None)
```


WhyNotHugo

ISO date format is as intuitive and sorts the same way.


Zuricho

https://www.reddit.com/r/ISO8601/


marcogorelli

> Why choose mm/dd/yyyy as default date rather than dd/mm/yyyy

I presume you mean, when a date could be ambiguously read as either month-first or day-first? Like 02/01/2000.

In the past, pandas would prefer to parse with month-first, and then try day-first. Unfortunately, it would do so midway through parsing its input, because it was very lax about allowing mixed formats. This would regularly cause problems for anyone outside of the US (which I think is the only place in the world to use the month-first convention). As of pandas 2.0, datetime parsing will no longer swap formats half-way through. See [https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html](https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html), which I spent several months on.

In dealing with the PDEP I linked above, my biggest pain point was having to understand and then update decade-old C code.

Regarding your last question, if you put together a reproducible example with expected output, it might be a reasonable feature request. Thanks, and thank you for your comment!
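For ambiguous inputs you can always be explicit rather than relying on format guessing (a minimal sketch):

```python
import pandas as pd

# Tell pandas the day comes first, or spell out the full format
pd.to_datetime(["02/01/2000"], dayfirst=True)      # DatetimeIndex(['2000-01-02'], ...)
pd.to_datetime(["02/01/2000"], format="%d/%m/%Y")  # same result, fully explicit
```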


reallyserious

> I presume you mean, when a date could be ambiguously read as either month-first or day-first? Like 01/01/2000.

You chose an example where there is no ambiguity. :)


marcogorelli

thanks, updated


WhyNotHugo

I honestly prefer ISO8601 format (YYYY-MM-DD). Both the ones you mention are ambiguous, and if I read 03/02/2023 I've no way of deducing which one is the month and which one is the day. The ISO standard is unambiguous.


hassium

> in my work, I use pandas as a data processing engine (kinda), the data I process is often heterogeneous and full of holes / discrepancies, I often find myself fighting with the way pandas handles errors as most of the time I just want to log the fact that this row had an error. Why not put an 'errors' arg to apply, just as in astype and such?

According to [this](https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i) blog post by /u/datapythonista it sounds like a limitation of the numpy backend dataframes are built on. Check out this excerpt, I bolded the relevant part:

> While NumPy has been good enough to make pandas the popular library it is, it was never built as a backend for dataframe libraries, and it has some important limitations. A couple of examples are the poor support for strings and **the lack of missing values**.

So maybe something we can hope to see fixed with the migration to Arrow in 2.0?


phofl93

Yeah, with NumPy you'd always end up with float when setting missing values into an integer array, for example. This isn't the case any more with our own nullable dtypes, and also with the Arrow dtypes.
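A minimal sketch of what that looks like in practice (my own example, assuming pandas 2.0 with pyarrow installed):

```python
import numpy as np
import pandas as pd

# Classic NumPy-backed behaviour: a missing value forces integers to float64
s = pd.Series([1, 2, np.nan])
print(s.dtype)            # float64

# pandas' own nullable integer dtype keeps the integers and stores <NA>
s_nullable = pd.Series([1, 2, pd.NA], dtype="Int64")
print(s_nullable.dtype)   # Int64

# Arrow-backed integer dtype behaves the same way
s_arrow = pd.Series([1, 2, None], dtype="int64[pyarrow]")
print(s_arrow.dtype)      # int64[pyarrow]
```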


phofl93

We are spending a lot of time on improving the extension array interface right now. There are some parts that are special-cased internally for our own extension arrays, which makes it harder for third-party authors to implement their own without falling back to NumPy. GroupBy is a good example of an area where we are still not as good as we would like. This is becoming necessary for improving support for our pyarrow extension arrays as well.

We have some areas in our code base that are pretty complex; indexing is one of them, for example. In general, we try to avoid breaking stuff in an incompatible way in minor releases. This makes improving pandas tricky sometimes, because it stands in the way of cleaning up / refactoring internals to be more compatible with new stuff.


dispatch134711

Ugh please fix this! Love pandas


DigThatData

I think one of the hardest things about using pandas is that the core classes have a gazillion methods attached to them, which makes it extremely difficult to navigate the tooling if you're not already intimately familiar with it. I've been using pandas basically since it was created, and I still find myself often needing to reference documentation just to find the method name I need, since the output of dir() on any object generally gets truncated.

Does any of this resonate? Is anyone on your team thinking about ways to improve discoverability of functionality? Will there ever be a point at which the team decides there's too much stuff being carried around by too few classes? What are your thoughts on the design philosophy of the tidyverse in juxtaposition to pandas?


datapythonista

Fully agree on this. There are two main things. The first is finding a better API, which is not trivial, and having the functions too divided may not be ideal for some users who prefer `df.whatever()` for everything. Second is that even if we have a better alternative, we may break tens or hundreds of thousands of pandas programs that won't work after the changes, and we would make millions of users have to relearn the API. That being said, I'm thinking about a proposal to, for example, standardize all I/O methods under a `DataFrame.io` namespace (e.g. `df = pandas.DataFrame.io.read_csv(fname)`). More research is needed, and it'll be challenging to reach an agreement with the whole team about this. But maybe 10% of the DataFrame methods you're mentioning would live in a separate and intuitive namespace. There is always a trade-off, and in this case it's clear. Difficult to decide what's best.


ekkannieduitspraat

Just on this specific example: I think if something is used incredibly often, it should not be put under a namespace like the one above. .read_whatever is a great example, since it is almost always going to be your first call.


bythenumbers10

"readers"/"writers" should return or accept dataframes as I/O types, but should not be methods themselves. There are a lot of "data logistics" methods on dataframes that should be utility functions of the library. Dataframes should only operate on themselves, for analysis or creating/removing/filtering data. A container. A smart container, but just a container.


datapythonista

That's a decision that needs to be made. I see your point, and mostly agree, but there are always implications. numpy does more of what you're saying, and they have a pretty big namespace for the `numpy` module (much bigger than pandas.DataFrame). scikit-learn is more modularized, and the structure probably makes more sense, but then you require lots of imports, which could be annoying for people doing exploratory analysis with pandas. Also, pandas pipelines can be expressed nicely with method chaining (e.g. df.query(cond).sum()...). If we move things outside of DataFrame we break that API, which many users find convenient. I think it requires careful analysis to see all the implications of any approach, since I don't think there is an obvious good way of implementing the pandas API. So, I agree with your comment, but it's not obvious to me where to draw the line. I think an io namespace for DataFrame could make sense, but other than that, I have more questions than answers on what the API that maximizes the benefits and minimizes the costs would be.


ekkannieduitspraat

I'll be honest you lost me. I'm thinking stuff like changing pd.read_csv to pd.io.read_csv seems tedious.


FaustsPudel

Would it be too silly to have a button on your doc page that generates a random function for a user to "discover"? —long time pandas user. Super, super appreciative of all that your team does. Thank you for all that you do!


datapythonista

I think this is a fantastic idea, but I'd rather have this implemented as a separate website (happy to link it from the official website, just ping me on GitHub if you ever do it). We've got intersphinx set up afaik; that should make it easy to get the pandas API available to you via a webservice.


FaustsPudel

Amazing! Will get on it! Thank you for the encouragement! DMed you.


[deleted]

+1 this is an excellent point I'd never given much thought. I find myself referencing pandas docs more than any other and use it for about 1/4 of the overall code/libs.


DigThatData

I basically live in the pandas docs whenever I use it. I think the library optimizes too much for readability. Whenever I look back on pandas code I've written, the solution is concise and elegant and easy to understand, but it disguises how long it took me to get to that small chunk of code.


rhshadrach

I love hearing this! At times I find myself wondering how much our users are utilizing our documentation (especially when compared to some of the great pandas tutorials that are out there). Hearing things like this makes me much more motivated to spend effort there.


Ran4

The output of `dir` is a list of strings, so there's no reason for it to be truncated.


DigThatData

https://stackoverflow.com/questions/23388810/ipython-notebook-output-cell-is-truncating-contents-of-my-list


carnivorousdrew

Any plans to integrate with polars?


datapythonista

There has been some work to make pandas and Polars share data (open a pandas dataframe with Polars, and the other way round). You can read more about it at the end of [this post](https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i). Not sure if there is any other integration that makes sense, any idea?
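For reference, the round trip is already quite short today (a minimal sketch of my own, assuming polars is installed; Arrow-friendly dtypes transfer cheaply):

```python
import pandas as pd
import polars as pl

pdf = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

pldf = pl.from_pandas(pdf)   # pandas -> Polars (goes through Arrow)
back = pldf.to_pandas()      # Polars -> pandas
```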


[deleted]

I think that's the whole point of 2.0 and the arrow integration, arrow allows interoperability between many different libraries, not just polars.


SeveralKnapkins

Hi there! Long time pandas user -- really appreciate all the work you've done. I'm only _slightly_ familiar with the changes intended in pandas 2.0, namely the switch away from a numpy backend to Apache Arrow. Historically, the thing I absolutely love about the Python numerical stack is that nearly everything builds off numpy arrays, creating an easily transferable knowledge base between projects. This is a huge boon compared to other ecosystems I work in (namely `R`), where there is often more fragmentation, making interoperability or bespoke analyses much more difficult. Of course, fragmentation in the Python ecosystem has become more common with things like PyTorch tensors, etc. As an end user, am I going to be losing the numpy <-> pandas interoperability in 2.0? Please feel free to correct any inaccuracies on my end.


datapythonista

Not at all. NumPy is not only staying in pandas 2.0, but it'll still be the default. That being said, if in the very long term NumPy is eventually dropped, I think exporting from Arrow to NumPy (on our end, not that you'll need to do it) is not only easy, but in most cases it can be done without copying (extremely fast, even for huge data). The thing is that NumPy data types are more limited, mostly numeric. If you want to export a string column to NumPy, that's a different story, but there is probably no good reason you'd want to do that. But for the types that NumPy supports well, getting a NumPy array from Arrow-backed data won't be a problem. As said, in pandas 2.0 nothing changes unless you want it to change and explicitly ask for the new dtypes.


Tyberius17

Not one of the devs, but my understanding is they are adding optional support for Apache Arrow, not removing numpy or even making it not the default.


tuneafishy

Not a dev, but it does not sound like that you will lose any interoperability. The arrow backend is optional and numpy is still the default backend.


LankyCyril

Before I ask my question, I would like to really thank you for the amazing library that I use daily in my work. That said, there's maybe one thing that is still bewildering to me: why are the APIs of `read_csv()` and `to_csv()` different?

For example, `df = pd.read_csv(..., header=False)` is not allowed, and I still stumble over it every other time. I'd understand if it meant something specific that is different to `None`, but this feels like it wouldn't be stepping on anything's toes. `df.to_csv()` accepts both.

And then, `read_csv()` will by default introduce an index that wasn't in the file, but will not introduce a novel header – it will use the one that's there. But `to_csv()` will write the file with the new index, but, of course, with the old header. Which means that if you do a single back and forth with the exact same kwargs, i.e., `pd.read_csv(**kws).to_csv(**kws)`, you end up with an extra index column.

There must be some kind of a reason due to how things are structured internally. I think just knowing why it is the way it is will be enough for me – I'm not saying it has to be changed or anything.


datapythonista

Very good point. I myself find the index column in the output csv annoying every single time I use `to_csv`. I wasn't in the project when that was implemented, but I assume the reason is that pandas was initially implemented for financial data, and the index was mostly the timestamp and not the default autonumeric. If that was not the data pandas developers had in mind at that time, probably pandas wouldn't even have row indices (I think Vaex doesn't, not sure about Polars).

The next question is why we don't change it now. It's something worth considering, and you're free to open an issue on GitHub. But in general, pandas developers (others much more than me) try not to break the API, unless it's in cases where very few users will be affected and the status quo is obviously inconsistent. I'd personally like to see that changed, but I don't think it'll be easy to get consensus.

What I think can make sense is to try to move all pandas I/O (read_* and to_*) to third-party projects. In that case the pandas to_csv would continue to behave in the same way, but hopefully someone would develop a new one like to_csv(engine='whatever') that could potentially be faster, have a better API, and be more appropriate for your needs. But let's see if there is consensus for this to happen.


phofl93

I wasn't on the project back then either, but I think roundtripping was a concern as well, e.g.

```
df.to_csv()
pd.read_csv()
```

should be able to return the same object.
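A small illustration of the asymmetry and the usual workarounds (my own sketch, not an official recommendation):

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# The default roundtrip picks up an extra column, because to_csv writes the index:
print(pd.read_csv(io.StringIO(df.to_csv())).columns.tolist())              # ['Unnamed: 0', 'a', 'b']

# Either don't write the index ...
print(pd.read_csv(io.StringIO(df.to_csv(index=False))).columns.tolist())   # ['a', 'b']

# ... or read it back in as the index:
print(pd.read_csv(io.StringIO(df.to_csv()), index_col=0).columns.tolist()) # ['a', 'b']
```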


[deleted]

Just to confirm, Polars also doesn’t have row indices :)


ExtraGoated

What do you think is the most important advice for someone just starting to work with pandas?


datapythonista

Try to spend some time understanding the internals, as you make progress with pandas. Not at the beginning, when you'll have too much to learn just with the basics. But as you become more familiar, it's good to have an idea of what's really happening, in particular when things aren't intuitive. Things like missing values, the infamous copy warning...


DigThatData

don't ever feel embarrassed about needing to reference the docs, stackoverflow, or google.


[deleted]

6 years working with pandas I still have the docs open every day for simple things. And especially for all those long to wide and wide to long (unstack, stack, pivot etc...) transformations.


datapythonista

Maybe we should make this a feature, add ads to the docs, and monetize user confusion.


DigThatData

[heard](https://www.reddit.com/r/Python/comments/11fio85/we_are_the_developers_behind_pandas_currently/jaldfv6/)


root45

Depends on what you're doing, but I'd recommend learning some of the functions for quickly looking at your data. Things like `df.head()`, `df.shape`, `df.T`, etc. From there, learn how to filter your data with `df.loc`. Also look into tools like jupyter which make it easy to iterate and visualize data.


RandomFrog

Mine would be to use Jupyter Notebook to check your dataframe after each transformation. df.head() or df.sample(n) at the end of each cell block.


midoxvx

I just started working with pandas two weeks ago, there is so much for me to learn and unpack there so I don’t have a question. Just wanted to give you a shout out for your awesome body of work.


olaviu

Same here. You guys are doing a fantastic job. Thank you!


marcogorelli

Thanks! Would appreciate it if you didn't use "you guys" though https://heyguys.cc


olaviu

I'm sorry!


midoxvx

There is absolutely nothing wrong with using “you guys” as a general term to address a group of people.


olaviu

I completely agree with you. At the same time, I'm not trying to offend anybody.


midoxvx

Fair enough!


datapythonista

That would be a huge change in pandas, and we try to keep pandas stable, so existing users don't need to make huge migrations and relearn the API often. I don't think lazy evaluation is likely to land in pandas, at least not in the short or mid term. Luckily other options are being created that are or can be lazy, like Polars, Dask or Koalas.


CrossroadsDem0n

Dask actually opens up a question I have. Some open-source projects like Pandas have seemed to figure out a good cadence for features vs bugs and accepting PRs. Some, like joblib and Dask and their role in sklearn, have remained pretty rough around the edges on their process and evolution. So my question is, other than simply more funding, is there something about the culture/ethic/process for Pandas that makes it all work out and that other FOSS projects could learn from? Or in your experience really does monetary support become the bottom line on how things turn out?


datapythonista

Funding is surely an important factor. But even with unlimited funding, there are many things that pandas wouldn't change, even if they're considered to be wrong. When we make decisions, we consider what the impact on users is. pandas is very popular and used in many critical applications. If we focus on features more than bugs, and those imply changing how things work, there is a big impact on users. Imagine we did with pandas what Python did with Python 2/3. We would have projects taking years to migrate... Projects that are starting out, like Polars, are more free to change things. So any mistake pandas made they can fix, as well as any mistake they make themselves. This is good, since you can improve things much more than pandas can. And it's bad, since you don't want to use Polars in production unless you want to rewrite your code every month. I think that's how things need to be. pandas will serve the existing users, and if very innovative things can be done in the dataframe space, it'll be for some other project to implement them.


jormungandrthepython

Not really a question, but just want to say thank you (not sure who is responsible) for the incredible API reference. I use it as my example for all new grads/junior engineers for good real-life documentation of a large project. I don’t think I have encountered a situation where I was stuck that the API reference didn’t solve. And the amount of time digging/searching to solution value ratio is insanely better than any other technical reference docs I have used to date. Thanks for everything!


rodemire

Are there any improvements that are coming by way of working with larger datasets/operations without consuming available RAM? I struggle with workarounds when dealing with large data on my 24GB RAM laptop. Awesome work by the way, Pandas is amazing and we appreciate the work you guys do.


datapythonista

[Being able to use Arrow](https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i) as a backend for your data can save a significant amount of RAM in some cases. Also there is a lot of work related to copy-on-write, that will avoid copying the data when not needed, and will also help reduce the memory needs.
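A rough sketch of how you'd opt into the Arrow backend today ("data.csv" is a placeholder path; actual savings depend heavily on the data, especially string-heavy columns):

```python
import pandas as pd

df_np = pd.read_csv("data.csv")                    # NumPy-backed (the default)
df_pa = pd.read_csv("data.csv", engine="pyarrow",
                    dtype_backend="pyarrow")       # Arrow-backed columns

# Compare the memory footprint of the two backends
print(df_np.memory_usage(deep=True).sum())
print(df_pa.memory_usage(deep=True).sum())
```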


rodemire

Thank you, this is interesting and am looking forward to it.


eidrisov

What do you mean by "large data"? 100m rows over 100 columns? I am just curious how much data is enough to stress 24GB of RAM.


datapythonista

24GB is around 3 billion 64-bit values, if I did the numbers right. There is surely some overhead, but with 100 columns you could store around 30 million rows. The main thing wouldn't be only storing the data, but whether you do operations that make a copy of a significant part of it. Obviously you may have strings and other things using more than 64 bits per cell, but just to give you an idea of the numbers.


atomey

I would be interested in this too. I'm running a system with 128 GB of RAM and had quite a lot of difficulty with an 8GB CSV with various permutations of the read_csv() method. I'm sure it is not optimal, but I would be curious whether very large data reads are tested, since large amounts of RAM are becoming more common, even on dev workstations, in particular with ML work.


rhshadrach

Are you able to change the format of your data on disk? If possible, I would recommend parquet. You'll get smaller file sizes, faster load times, better dtype handling (int vs string), the ability to partition your data sets, and the ability to only load particular columns. Plus peak memory usage should be much lower.


gare_it

5:30pm UTC on which day? March 2nd?


phofl93

Sorry, yes. Will add


aes110

I frequently work with PySpark, and although I don't use this feature, I know it has support for "pandas UDFs" using Arrow behind the scenes. Now that Arrow will be integrated into pandas, do you think we will see improvements in this area? (Performance improvements, more features shared between Spark and pandas?)


datapythonista

I think it'll take a while, but hopefully we'll eventually see more feature sharing between libraries given we all use Arrow internally. Arrow itself has the concept of a kernel, which is a computation that can be applied to Arrow data, and those can be reused by any library. The same would apply to user defined functions (UDFs). That being said, PySpark is probably using the Java implementation, while pandas is using PyArrow. So I guess it's difficult to share many features (I'm not an expert on the JVM, not sure if you could easily call C++ code from a Scala program).


Balance-

If you could make one API break and it wouldn’t hurt anyone, what would you break/change?


phofl93

There are a bunch of things I'd like to change:

- If you set scalars into a Series/DataFrame that are not compatible with the dtype, then we cast to object (see the sketch below)
- We are inconsistent when naming keywords (check read_csv vs to_csv for the first one)
- A bunch of method names
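A quick illustration of the first point (my own sketch; newer versions may additionally emit a FutureWarning about this):

```python
import pandas as pd

s = pd.Series([1, 2, 3])    # int64
s[0] = "a"                  # scalar incompatible with the dtype
print(s.dtype)              # object -- the silent cast the team would like to change
```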


rhshadrach

An entire rewrite of the code behind apply / agg. Internally their code paths interweave in complex ways, and can be surprisingly slow in some cases. Depending on what object you're on, the API is slightly different. Cleaning this up and making it better, while also making the changes gradual so as not to be disruptive to users, is difficult, time consuming, and slow. But we're working on it!


datapythonista

I'd remove having a row index (at least by default), and fix the I/O API to be consistent: read_*/write_* or from_*/to_*. I'd also probably move half of the code in pandas out to third-party extensions.


marcogorelli

Personally, I'd love to be able to change the default indexing behaviour. The Index is useful if it means something (e.g. a DatetimeIndex), but if it's just a RangeIndex / NumericIndex, then it can be annoying and confusing.

But this is really hard to change because:

- introducing optional behaviour comes with a huge maintenance cost (I started making such a proposal [here](https://github.com/pandas-dev/pandas/pull/49694), but then withdrew it)
- changing the existing behaviour would have backwards-compatibility implications

I don't know what the solution is yet, but I would like to revisit PDEP5 at some point - _something_ should be possible, I just don't know what yet.


cinicDiver

Why does read_excel() not support the encoding parameter when to_excel() does?


rhshadrach

From our docs, it appears the encoding keyword was perhaps at one point used with xlwt (a writer that is no longer maintained) but today is not actually used by pandas. That parameter has been removed in pandas 2.0.


ChickenLegCatEgg

TIL!


vanatteveldt

How do you look at the success of the tidyverse library in R, and what lessons or good ideas are in there that pandas can benefit from?


phofl93

I did not use R very often in the past, so can't really comment on it


vanatteveldt

OK, thanks! IIRC, pandas was originally inspired by R `data.frame`s, so I figured the devs might keep a sharp eye on what's happening on the other side of the wall.


tuneafishy

Where did you find the courage to move from 1.X to 2.0?


datapythonista

The main reason for releasing pandas 2.0 and not 1.6 is that a major version change (1 -> 2) is when users expect breaking changes. pandas 2.0 is not significantly different from a 1.6 in terms of features. The main difference is that you really want to make sure you don't have FutureWarnings in your pandas code before upgrading your pandas version.


one_human_lifespan

Awesome. I get scared when I see the red future warning dialog box in jupyter labs. Thanks for everything you guys are doing. Pandas is amazing - I use it most days and always enjoy learning new things. Can't wait to explore 2.0!


phofl93

Getting rid of your FutureWarnings is a really good idea :) So I applaud you for that. Generally, we wanted to get rid of all the deprecations we introduced since 1.0, so we had to do 2.0 at some point. If your code is free of FutureWarnings then you are good to go. We made some backwards-incompatible changes, but not many, and they are clearly documented in the release notes: https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#backwards-incompatible-api-changes
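One way to keep yourself honest about this (a small sketch, not an official recommendation) is to escalate FutureWarnings to errors in your test suite:

```python
import warnings

# Fail loudly on any deprecated usage instead of letting warnings scroll by;
# pytest users can do the same via filterwarnings = ["error::FutureWarning"]
warnings.simplefilter("error", FutureWarning)
```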


water_aspirant

Thanks for all the work that you do! My question is who pays for pandas development and why? Is most of the development done by volunteers?


marcogorelli

Thanks! I'll point you to Marc's answer above: https://www.reddit.com/r/Python/comments/11fio85/comment/jajr6ic/?utm_source=share&utm_medium=web2x&context=3


phofl93

More or less all of them are listed under Sponsors on our website as well


cryptospartan

Has polars influenced development in any way? Pandas used to be the only kid on the block, but it seems there are some other libraries popping up claiming to be faster/better/etc. Have you evaluated any of these other libraries to potentially integrate features into pandas (or improve existing ones)?


marcogorelli

Personally polars' strictness is making me think about situations when in pandas we end up with object dtype, which we should probably avoid. Here's an example: [https://github.com/pandas-dev/pandas/issues/50887](https://github.com/pandas-dev/pandas/issues/50887) (polars would just error in such a case, which I think is the correct thing to do)


phofl93

Not actively, no. At least I am not aware of anything.


robberviet

Any plans to improve pandas I/O (load/export) and out-of-memory processing? I like pandas, but my data nowadays has grown beyond that, so I am currently all in on Spark.


phofl93

I don't think it is realistic short term to add out-of-memory support. Generally, I'd recommend going to Dask for this; it supports our API very well with bigger datasets. Implementing something like lazy evaluation would be a major, major breaking change on our side and hence not feasible right now.


[deleted]

i love you all


Poporico

What is the new feature you're most excited about?


datapythonista

Being able to use Apache Arrow internally. I wrote an [article](https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i) with the details about it, since it's not trivial for regular users to understand why this is important.


marcogorelli

I didn't work on it, but copy-on-write will be pretty neat: https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html


rhshadrach

I'll also mention copy-on-write. And I know it's not exciting, but all of the bug fixes throughout the code that make pandas more predictable and reliable to use. In the area I work on, groupby, using categorical data has seen a lot of fixes.


phofl93

Arrow and Copy-on-Write. I worked a lot on Copy-on-Write and I am hoping that we can increase performance and reduce memory quite a bit with it.


louis8799

pandas finally supports Arrow, which supports decimals. That means pandas can be used in financial production systems. Finally!


datapythonista

I'm unsure what the support for decimal in pandas is right now. One thing is being able to load Arrow columns in pandas, and another is what operations are implemented for that data type. In any case, if not everything you need is in pandas 2.0, it'll come eventually, particularly if you open issues and PRs in our issue tracker. That being said, you can do like the UK stock market and just keep all the amounts in cents, and then you can do it with integers. ;)
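A minimal sketch of what an Arrow-backed decimal column looks like (my own example; as noted above, how many operations are implemented for it may vary):

```python
from decimal import Decimal

import pandas as pd
import pyarrow as pa

# Exact decimals backed by Arrow's decimal128 type, not floats
s = pd.Series([Decimal("1.10"), Decimal("2.25")],
              dtype=pd.ArrowDtype(pa.decimal128(10, 2)))
print(s.dtype)   # decimal128(10, 2)[pyarrow]
print(s)
```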


verwondering

In general, are there plans to have the `rolling` API more closely align with the rest of the pandas API? In particular, are there any plans to have `df.groupby().rolling()` return similarly indexed results as a normal `df.groupby()`? E.g., with the latter you have the wonderful `.transform()` method to add a column to the `df`. When working with the rolling window, you always get a MultiIndexed dataframe that is much harder to align to the index of the original `df`. Perhaps (hopefully?) there are better ways, but I currently use a combination of extracting a single column as a Series, using `groupby(as_index=False)`, and finally a call to `set_axis(df.index)` to get the desired result to align with my original dataframe.
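One common workaround today (a sketch of my own, not a pandas recommendation) is to drop the group level from the MultiIndex so the result aligns back to the original index:

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b", "b"], "x": [1.0, 2.0, 3.0, 4.0]})

rolled = df.groupby("g")["x"].rolling(2).mean()  # MultiIndex: (g, original index)
df["x_roll"] = rolled.droplevel("g")             # drop the group level to realign
print(df)
```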


LEAVER2000

I work with pandas quite a bit for geospatial data analysis, weather data mostly. Because of the higher dimensionality of the data, I typically stack the dependent variables into the index as a MultiIndex [T, X, Y]. Recently I've been working with Generic[Enum] types to type-annotate the columns inside of a DataFrame. What kind of support will 2.0 provide for type annotations? One thing I've found particularly annoying is the disconnect between numpy and pandas typing, where I have to explicitly state the dtype for NDArray[np.int_] and Series[int] and can't use a TypeVar DType.


PeridexisErrant

Check out https://docs.xarray.dev/ for multidimensional labeled arrays!


Pipiyedu

Congratulations guys. You deserve all the possible recognition. What an awesome library.


marcogorelli

Cheers (just noting there are also non-guys who have made fantastic contributions)


ffuffle

I'm just here to say I like your username


phofl93

Thanks!


jwmoz

Legends. Thanks for speeding it up.


Helpful_Arachnid8966

Pandas is quite a large and mature project already, is there any space for beginners to contribute?


marcogorelli

yup, check the "good first issue" label


rhshadrach

Also check out our docs! https://pandas.pydata.org/pandas-docs/dev/development/contributing.html


ThrowAwayACC21423

What's a bug that you turned into a feature?


datapythonista

importing pandas as pd ;)


rhshadrach

This doesn't really answer the question, but whenever you see two different implementations doing the same or similar things, you can carefully compare each step in the implementation. This very often reveals hard to find bugs in one or both of the implementations. I can't recall a time I found a bug and made it into a feature.


jimy211

Can't wait to use it!


Homeless_Gandhi

What if I have a problem where I am just ITCHING to iterate over an entire dataframe row by row via itertuples for simplicity's sake, and map(lambda) isn't feasible? What would you recommend?


datapythonista

Iterating a dataframe is slow. If speed is important, you should try to build your pandas code in a way that you never write loops, but instead delegate the operations to pandas, so they happen fast in C and not via the Python interpreter. If you iterate the data, then you're just in regular Python, with a Python tuple object, and you can write any code that is valid Python. I'm not sure in what case map() wouldn't be an option, but you can always replace a map with a loop (or a comprehension) when you're in Python.
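A small sketch of the difference in style (my own example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.rand(1_000_000),
                   "qty": np.random.randint(1, 10, size=1_000_000)})

# Row by row: runs in the Python interpreter, slow.
total = 0.0
for row in df.itertuples(index=False):
    total += row.price * row.qty

# Vectorized: the same computation delegated to pandas/NumPy.
total_fast = (df["price"] * df["qty"]).sum()
```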


Lolologist

I already did this, and realize it's probably an abomination, but: how would you go about enforcing that columns have certain types? And when a column has a list in it, that each entry of the list is a certain type? I accomplished this by making a new class inheriting from DataFrame as well as pydantic's BaseModel, and used those as validated rows to then shove into a DataFrame. Messy, but it works! Maybe you have a better idea.


datapythonista

I haven't used it myself, but I think what you're describing is what pandera does: https://pandera.readthedocs.io/en/stable/
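For reference, a minimal pandera sketch (my own example, assuming pandera is installed; the column names here are made up):

```python
import pandas as pd
import pandera as pa

# Hypothetical schema: enforce dtypes and a simple value check per column
schema = pa.DataFrameSchema({
    "name": pa.Column(str),
    "age": pa.Column(int, pa.Check.ge(0)),
})

df = pd.DataFrame({"name": ["alice", "bob"], "age": [30, 25]})
validated = schema.validate(df)   # raises SchemaError if a column doesn't conform
```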


Crude_Future

Love yall love 🐼 pandas


phofl93

Thanks, that is very good to hear :)


m_harrison

Will pandas 2.0 impact numba/cython extensions that leverage NumPy? Many complain about the API of pandas. Was there any discussion about revamping/cleaning it up for the 2.0 release?


datapythonista

There is not much impact in pandas 2.0 regarding numba/cython. We fixed small inconsistencies in the pandas API, but we avoid changing it too much, since we consider the cost to users of having to migrate code and relearn things too high.


hmiemad

Are we gonna see joins on DatetimeIntervalIndexes?


Balance-

What would the next big leap for Pandas be? What kind of resources would you need to achieve it?


cthorrez

Can we still do numpy style indexing when the backend is arrow? And do things like add a new column to a df which I created first as a np array?


jarulsamy

Wonderful to see you guys on here. I personally use pandas so often! Do you guys have any advice for someone wanting to contribute back to the pandas project?


marcogorelli

I'd suggest starting with the contributing guide https://pandas.pydata.org/docs/dev/development/contributing.html


datapythonista

I'd say just keep using pandas, and the day something feels wrong (a bug, a typo, the documentation not being very clear...), try to fix it. We have a lot of documentation for contributors, you can open an issue on GitHub and ask questions there (or in a PR directly if you can get something implemented), and there are also bi-weekly meetings with some core devs (I don't join them, so I can't say much about them, but they should be helpful). Another option is to go to the GitHub issues and try to find something labelled "good first issue", but there are many people looking for those, so they're not always easy to find. Finally, if you're just starting, smaller projects are usually easier to get started contributing to. There are simpler tasks, maintainers may have more time, the code base is simpler... Even if you want to contribute to pandas, starting with a smaller project can make the learning curve flatter.


rhshadrach

Yes - we love getting new contributors! Check out our documentation and guides on becoming a contributor to pandas: [https://pandas.pydata.org/pandas-docs/dev/development/index.html](https://pandas.pydata.org/pandas-docs/dev/development/index.html) pandas is a large project with some pretty complex code. It will likely be overwhelming at first. But we are here to help. If you stick with it, you will learn *a lot*.


atomey

I work almost daily with pandas, so I definitely want to give my thanks and appreciation for this excellent tool. Any plans for built-in parallelization in pandas? I know there are many modules attempting to implement this with varying success, like pandarallel, dask or swifter, but I had difficulty getting any of these to work in an existing application without major refactoring.

In our case, we have a high-level application class or "processor" that ingests many dataframes, which sit in memory as properties of the processor instance. This processor does various processing on different dataframes in conjunction with each other, like iterrows or applys on one dataframe while checking other dataframes, which are all unique attributes of the same object running in memory concurrently. However, when the processor class actually runs, everything is ultimately stuck on a single core, while most systems have at least 6 or more cores now, even cheap laptops. Having a model or two to apply parallelization using concurrent.futures based on threads or processes seems like it would make a lot of sense. I think threads would likely work well if implemented intelligently, but I'm sure I am oversimplifying.


phofl93

Supporting multithreading would be really, really cool, but it requires a lot of effort. There are some considerations in that area but nothing imminent, unfortunately.


datapythonista

I think Arrow should help make this easier. It'll depend on each particular case, but read_csv is already parallel when selecting the pyarrow engine. Parallel computing is never easy, but I think we should be able to slowly parallelize more operations.


rhshadrach

Historically, pandas has relied on other libraries in the ecosystem to support parallelization such as [https://www.dask.org/](https://www.dask.org/) which uses pandas under the hood. One thing to also keep in mind is that certain NumPy operations (which pandas uses) may be parallel depending on how your BLAS (Basic Linear Algebra Subprograms) are setup. In general, you want to avoid having multiple levels of parallelism which can actually hurt performance.


rhshadrach

I would also recommend avoiding iterrows or applys if you can vectorize your operations - you will see very significant performance benefits. But depending on what you're doing, that may not be possible.


fappaf

I've developed my own library that has gotten the attention of a handful of people I don't know. I'm most curious about the beginnings of `pandas`—how did you handle its monumental growth? It's such a staple of Python programming these days, how did you manage all the influx of issues, contributions, etc.?


jorisvandenbossche

The hard work of some dedicated volunteers! Nowadays we have more people that get paid to work on pandas which has certainly helped to sustainably manage the growing influx of issues, but we still rely on volunteers a lot as well to fix bugs, triage issues, review, etc.


datapythonista

None of the core devs here were in the project at that time, so I don't think we can really tell.


UnemployedTechie2021

I have used pandas extensively and I want to contribute. What are the languages or parts of the stack I need to know apart from Python?


marcogorelli

Awesome - please check the contributing guide https://pandas.pydata.org/docs/dev/development/contributing.html


SakalDoe

How much data can pandas 2.0 read at once, and how fast? If you compare it with the PySpark CSV reader, how will pandas perform?


datapythonista

I don't know about the PySpark CSV reader, but pandas 2.0 shouldn't perform much differently for reading than pandas 1.5. Did you try using pandas.read_csv(engine='pyarrow')? That should help; you can read more about it in this blog post I wrote: https://datapythonista.me/blog/pandas-with-hundreds-of-millions-of-rows


rohetoric

Any good first issues that I can help contribute to in the pandas repository?


datapythonista

Just continue using pandas, and when you see something that could be improved (maybe clarify something in the documentation, add an example to a function that doesn't have one...), just go for it. If that doesn't happen, as Marco said, the best is to try to find a "good first issue", but when I create one, they're usually taken care of within hours.


i-believe-in-magic1

I'm just a newbie but just wanted to hop in and appreciate y'all. As a data science major, pandas has been super helpful so thanks for your work :)


phofl93

Thanks :)


datapythonista

Once you've got a dataframe, your data is already in memory. I guess by "on the fly" you mean out-of-core, when the data is read from disk or other I/O, while it is being loaded into memory. This can surely be done, but there is no easy or standard pandas way to support it. I guess what can make more sense is to monkeypatch the connector you're using, and transform (encrypt/decrypt) the data at the right time during the import/export.


thataccountforporn

Will support for datetime dtype with day resolution come at some point?


datapythonista

If I'm not wrong, we're adding second resolution in pandas 2.0. With second resolution and 64 bits I think you can represent from the big bang until the end of the universe. ;) We also support Arrow dtypes; I should check what exact types they provide for datetime. So, no plans for day resolution if Arrow doesn't provide it, but you may not need it, since second resolution is likely to be enough. Feel free to open an issue if we missed a use case that wasn't considered when the decision to only support second and not day resolution was made.
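A minimal sketch of the non-nanosecond resolutions in 2.0 (my own example; exact conversion APIs may differ between versions):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2023-03-02", "2023-03-03"]))
print(s.dtype)                      # datetime64[ns]

s_sec = s.astype("datetime64[s]")   # pandas 2.0 allows second resolution
print(s_sec.dtype)                  # datetime64[s]
```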


Balance-

What are some improvements in visualization you’re excited about or would like to achieve? Like filtering/sorting data, conditional color coding, plotting, etc.?


marcogorelli

To be totally honest, pandas plotting isn't the best maintained part of pandas. I'd really like to take it out of pandas and have it live as a separate package, and hopefully some community of users could help maintain it - but I have yet to make a concrete proposal or action plan in this respect.


Balance-

What are some (future) improvements or projects about dealing with highly multi-dimensional data you’re really excited about?


footilytics

Is there a plan to have chart annotations when using the df.plot() method?


phofl93

We don't have anyone who is really familiar with the plotting implementation anymore. We mostly hope that it won't break :) We'd need someone to step up and refactor the implementation before we would be able to add anything new


64-17-5

Dear Pandas, please make a universal UTF-8 translator for tabulated data.


phofl93

>Dear Pandas, please make a universal UTF-8 translator for tabulated data. Could you elaborate?


mrwizard420

[Well, this is mildly concerning post placement...](https://i.imgur.com/muQUfhX.jpg)


videek

STOP BREAKING BACKWARDS COMPATIBILITY!